hadoop fsck -move will move the corrupt files to /lost+found, which will "fix" the report.
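For anyone following along, the inspect-then-repair sequence looks roughly like this (a sketch; the root path `/` is illustrative, and on 0.20.x the command is `hadoop fsck` rather than the later `hdfs fsck`):

```shell
# Read-only check first: report block health without changing anything
hadoop fsck / -files -blocks -locations

# Move files containing corrupt/missing blocks into /lost+found on HDFS;
# this clears them from the report, but the files' paths change
hadoop fsck / -move

# Or, if the affected files are expendable, remove them entirely
hadoop fsck / -delete
```

Note that -move salvages whatever healthy blocks a file still has; it does not recover the corrupt ones.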
Do you know what created the corrupt files?

-Joey

On Thu, Jun 9, 2011 at 3:04 PM, Robert J Berger <rber...@runa.com> wrote:
> I'm still having this problem and am kind of paralyzed until I figure
> out how to eliminate these blocks with corrupt replicas.
>
> Here is the output of dfsadmin -report and fsck:
>
> dfsadmin -report
> Configured Capacity: 13723995700736 (12.48 TB)
> Present Capacity: 13731775356416 (12.49 TB)
> DFS Remaining: 4079794918277 (3.71 TB)
> DFS Used: 9651980438139 (8.78 TB)
> DFS Used%: 70.29%
> Under replicated blocks: 18
> Blocks with corrupt replicas: 34
> Missing blocks: 0
>
> -------------------------------------------------
> Datanodes available: 9 (9 total, 0 dead)
> (Not showing the nodes other than the one with Decommission in progress)
> ...
> Name: 10.195.10.175:50010
> Decommission Status : Decommission in progress
> Configured Capacity: 1731946381312 (1.58 TB)
> DFS Used: 1083853885440 (1009.42 GB)
> Non DFS Used: 0 (0 KB)
> DFS Remaining: 651169222656 (606.45 GB)
> DFS Used%: 62.58%
> DFS Remaining%: 37.6%
> Last contact: Wed Jun 08 18:56:54 UTC 2011
> ...
>
> And the good bits from fsck:
>
> Status: HEALTHY
>  Total size: 2832555958232 B (Total open files size: 134217728 B)
>  Total dirs: 72151
>  Total files: 65449 (Files currently being written: 9)
>  Total blocks (validated): 95076 (avg. block size 29792544 B)
>   (Total open file blocks (not validated): 10)
>  Minimally replicated blocks: 95076 (100.0 %)
>  Over-replicated blocks: 35667 (37.5142 %)
>  Under-replicated blocks: 18 (0.018932223 %)
>  Mis-replicated blocks: 0 (0.0 %)
>  Default replication factor: 3
>  Average block replication: 3.376278
>  Corrupt blocks: 0
>  Missing replicas: 18 (0.0056074243 %)
>  Number of data-nodes: 9
>  Number of racks: 1
>
> The filesystem under path '/' is HEALTHY
>
> On Jun 8, 2011, at 10:38 AM, Robert J Berger wrote:
>
>> Synopsis:
>> * After shutting down a datanode in a cluster, fsck declares CORRUPT
>>   with missing blocks.
>> * I restore/restart the datanode and fsck soon declares things healthy.
>> * But dfsadmin -report says a small number of blocks have corrupt
>>   replicas, and an even smaller number of under-replicated blocks.
>> * After a couple of days, that number of corrupt replicas and
>>   under-replicated blocks stays the same.
>>
>> Full Story:
>> My goal is to rebalance blocks across the 3 drives within each of 2
>> datanodes in a 9-datanode (replication=3) cluster running Hadoop 0.20.1.
>> (EBS volumes were added to the datanodes over time, so one disk had 95%
>> usage and the others had significantly less.)
>>
>> The plan was to decommission the nodes, wipe the disks, and then add
>> them back into the cluster.
>>
>> Before I started, I ran fsck and all was healthy. (Unfortunately I did
>> not really look at the dfsadmin -report at that time, so I can't be
>> sure there were no blocks with corrupt replicas at that point.)
>>
>> I put two nodes into the decommission process, and after waiting about
>> 36 hours neither had finished decommissioning. So I decided to throw
>> caution to the wind and shut down one of them. (I had taken the node I
>> was shutting down out of the dfs.exclude file, and also removed the 2nd
>> node from the dfs.exclude file and ran dfsadmin -refreshNodes, but kept
>> the 2nd node live.)
>>
>> After shutting down one node, running fsck showed about 400 blocks as
>> missing.
>>
>> So I brought the shut-down node back up (it took a while, as I had to
>> restore it from an EBS snapshot) and fsck quickly went back to healthy,
>> but with a significant number of over-replicated blocks.
>>
>> I put that node back into the decommissioning state (put just that one
>> node back in the dfs.exclude file and ran dfsadmin -refreshNodes).
>>
>> After another day or so, it's still in decommissioning mode. Fsck says
>> the cluster is healthy, but still 37% over-replicated blocks.
>>
>> But the thing that concerns me is that dfsadmin -report says:
>>
>> Under replicated blocks: 18
>> Blocks with corrupt replicas: 34
>>
>> So really two questions:
>>
>> * Is there a way to force these corrupt replicas and under-replicated
>>   blocks to get fixed?
>> * Is there a way to speed up the decommissioning process (without
>>   restarting the cluster)?
>>
>> I presume it's not safe for me to take down this node until the
>> decommissioning completes and/or the corrupt replicas are fixed.
>>
>> And finally, is there a better way to accomplish the original task of
>> rebalancing disks on a datanode?
>>
>> Thanks!
>> Rob
>> __________________
>> Robert J Berger - CTO
>> Runa Inc.
>> +1 408-838-8896
>> http://blog.ibd.com
>
> __________________
> Robert J Berger - CTO
> Runa Inc.
> +1 408-838-8896
> http://blog.ibd.com

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
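To summarize the mechanics discussed in this thread: decommissioning is driven by the exclude file named in dfs.hosts.exclude, and rebalancing across datanodes is done with the balancer. A rough sketch, assuming a 0.20-era install with the exclude file at /etc/hadoop/conf/dfs.exclude (that path, and the node address, are illustrative):

```shell
# 1. Add the datanode to the exclude file that dfs.hosts.exclude in
#    hdfs-site.xml points at (the file path here is an assumption)
echo "10.195.10.175" >> /etc/hadoop/conf/dfs.exclude

# 2. Tell the namenode to re-read its include/exclude lists; the node then
#    shows "Decommission Status : Decommission in progress" until all of
#    its blocks have been re-replicated elsewhere
hadoop dfsadmin -refreshNodes

# 3. Watch progress
hadoop dfsadmin -report

# 4. Even out usage across datanodes; -threshold is the allowed deviation,
#    in percent, from the cluster's mean utilization
hadoop balancer -threshold 5
```

Note that the 0.20.x balancer only moves blocks between datanodes, not between the disks of a single datanode, so it does not directly solve the original per-disk imbalance. The workaround commonly reported at the time was to stop the datanode and manually move block files (each blk_* together with its blk_*.meta companion) between dfs.data.dir directories, preserving the subdirectory layout, then restart the datanode so it rescans its volumes.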