I think the files may have been corrupted when I initially shut down the node that was still in decommissioning mode.
Unfortunately I hadn't run dfsadmin -report at any point recently before the incident, so I can't be sure that the corrupt replicas weren't already there for a while. I had always assumed that the fsck command would tell me if there were issues.

So, will running hadoop fsck -move just move the corrupted replicas and leave the good ones? And will this work even though fsck does not report any corruption?
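In case it helps clarify what I'm asking, this is roughly the sequence I'm thinking of running. The grep filtering is just for readability, and I'm assuming -move only relocates files that fsck itself flags as corrupt, which is exactly the part I'm unsure about:

    # Full check, listing any files whose blocks fsck considers corrupt or missing
    hadoop fsck / -files -blocks -locations | grep -iE 'corrupt|missing'

    # Compare with the namenode's own view of corrupt replicas
    hadoop dfsadmin -report | grep -i corrupt

    # Only if fsck actually flags corrupt files: move them aside to /lost+found
    hadoop fsck / -move

If -move works from the same data that the fsck report is based on, I'd expect it to be a no-op here, since fsck currently says HEALTHY.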
On Jun 9, 2011, at 3:20 PM, Joey Echeverria wrote:

> hadoop fsck -move will move the corrupt files to /lost+found, which
> will "fix" the report.
>
> Do you know what created the corrupt files?
>
> -Joey
>
> On Thu, Jun 9, 2011 at 3:04 PM, Robert J Berger <rber...@runa.com> wrote:
>> I'm still having this problem and am kind of paralyzed until I figure out
>> how to eliminate these blocks with corrupt replicas.
>>
>> Here is the output of dfsadmin -report and fsck:
>>
>> dfsadmin -report
>> Configured Capacity: 13723995700736 (12.48 TB)
>> Present Capacity: 13731775356416 (12.49 TB)
>> DFS Remaining: 4079794918277 (3.71 TB)
>> DFS Used: 9651980438139 (8.78 TB)
>> DFS Used%: 70.29%
>> Under replicated blocks: 18
>> Blocks with corrupt replicas: 34
>> Missing blocks: 0
>>
>> -------------------------------------------------
>> Datanodes available: 9 (9 total, 0 dead)
>> (Not showing the nodes other than the one with Decommission in progress)
>> ...
>> Name: 10.195.10.175:50010
>> Decommission Status : Decommission in progress
>> Configured Capacity: 1731946381312 (1.58 TB)
>> DFS Used: 1083853885440 (1009.42 GB)
>> Non DFS Used: 0 (0 KB)
>> DFS Remaining: 651169222656 (606.45 GB)
>> DFS Used%: 62.58%
>> DFS Remaining%: 37.6%
>> Last contact: Wed Jun 08 18:56:54 UTC 2011
>> ...
>>
>> And the good bits from fsck:
>>
>> Status: HEALTHY
>> Total size: 2832555958232 B (Total open files size: 134217728 B)
>> Total dirs: 72151
>> Total files: 65449 (Files currently being written: 9)
>> Total blocks (validated): 95076 (avg. block size 29792544 B) (Total open file blocks (not validated): 10)
>> Minimally replicated blocks: 95076 (100.0 %)
>> Over-replicated blocks: 35667 (37.5142 %)
>> Under-replicated blocks: 18 (0.018932223 %)
>> Mis-replicated blocks: 0 (0.0 %)
>> Default replication factor: 3
>> Average block replication: 3.376278
>> Corrupt blocks: 0
>> Missing replicas: 18 (0.0056074243 %)
>> Number of data-nodes: 9
>> Number of racks: 1
>>
>> The filesystem under path '/' is HEALTHY
>>
>> On Jun 8, 2011, at 10:38 AM, Robert J Berger wrote:
>>
>>> Synopsis:
>>> * After shutting down a datanode in a cluster, fsck declares the filesystem CORRUPT with missing blocks.
>>> * I restore/restart the datanode, and fsck soon declares things healthy again.
>>> * But dfsadmin -report says a small number of blocks have corrupt replicas, and an even smaller number are under-replicated.
>>> * After a couple of days, the number of corrupt replicas and under-replicated blocks stays the same.
>>>
>>> Full Story:
>>> My goal is to rebalance blocks across the 3 drives within each of 2 datanodes in a 9-datanode (replication = 3) cluster running Hadoop 0.20.1.
>>> (EBS volumes were added to the datanodes over time, so one disk had 95% usage and the others had significantly less.)
>>>
>>> The plan was to decommission the nodes, wipe the disks, and then add them back into the cluster.
>>>
>>> Before I started I ran fsck and all was healthy.
>>> (Unfortunately I did not really look at the dfsadmin -report at that time, so I can't be sure that there were no blocks with corrupt replicas at that point.)
>>>
>>> I put two nodes into the decommission process, and after waiting about 36 hours it hadn't finished decommissioning either. So I decided to throw caution to the wind and shut down one of them. (I had taken the node I was shutting down out of the dfs.exclude.file, also removed the 2nd node from the dfs.exclude.file, and ran dfsadmin -refreshNodes, but kept the 2nd node live.)
>>>
>>> After shutting down one node, running fsck showed about 400 blocks as missing.
>>>
>>> So I brought the shut-down node back up (it took a while, as I had to restore it from an EBS snapshot) and fsck quickly went back to healthy, but with a significant number of over-replicated blocks.
>>>
>>> I put that node back into the decommissioning state (put just that one node back in the dfs.exclude.file and ran dfsadmin -refreshNodes).
>>>
>>> After another day or so, it's still in decommissioning mode. Fsck says the cluster is healthy, but still shows 37% over-replicated blocks.
>>>
>>> But the thing that concerns me is that dfsadmin -report says:
>>>
>>> Under replicated blocks: 18
>>> Blocks with corrupt replicas: 34
>>>
>>> So really two questions:
>>>
>>> * Is there a way to force these corrupt replicas and under-replicated blocks to get fixed?
>>> * Is there a way to speed up the decommissioning process (without restarting the cluster)?
>>>
>>> I presume that it's not safe for me to take down this node until the decommissioning completes and/or the corrupt replicas are fixed.
>>>
>>> And finally, is there a better way to accomplish the original task of rebalancing disks on a datanode?
>>>
>>> Thanks!
>>> Rob
>>> __________________
>>> Robert J Berger - CTO
>>> Runa Inc.
>>> +1 408-838-8896
>>> http://blog.ibd.com
>>
>> __________________
>> Robert J Berger - CTO
>> Runa Inc.
>> +1 408-838-8896
>> http://blog.ibd.com
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434

__________________
Robert J Berger - CTO
Runa Inc.
+1 408-838-8896
http://blog.ibd.com
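For reference, the decommission workflow described in this thread boils down to roughly the following sketch. The exclude-file path and hostname here are placeholders, and it assumes the cluster's hdfs-site.xml (or hadoop-site.xml on 0.20.x) already points dfs.hosts.exclude at the exclude file:

    # Assumed, not shown in the thread: dfs.hosts.exclude is configured, e.g.
    #   <property>
    #     <name>dfs.hosts.exclude</name>
    #     <value>/etc/hadoop/conf/dfs.exclude</value>
    #   </property>

    # Add the datanode's hostname to the exclude file to start decommissioning it
    echo "datanode-to-retire.example.com" >> /etc/hadoop/conf/dfs.exclude

    # Tell the namenode to re-read its include/exclude lists
    hadoop dfsadmin -refreshNodes

    # Watch progress; wait for "Decommission Status : Decommissioned" on that node
    hadoop dfsadmin -report

    # Only after the node reports Decommissioned is it safe to stop the datanode
    # and wipe or rebuild its disks.

Removing the hostname from the exclude file and running dfsadmin -refreshNodes again returns the node to normal service, which matches what is described above for the node that was kept live.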