Good question. I didn't pick up on the fact that fsck disagrees with dfsadmin. Have you tried a full restart? Maybe somebody's information is out of date?
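For reference, here is one way to cross-check the two views of block health from the command line. This is a sketch, assuming a Hadoop 0.20-era `hadoop` client on the PATH and a running cluster; note that `fsck -move` only relocates files that have a block with *no* healthy replica, so it will not touch blocks that merely have one corrupt replica among several good ones.

```shell
# Sketch, assuming a Hadoop 0.20.x client on PATH and a running HDFS.

# Cluster-wide summary from the NameNode, including the
# "Blocks with corrupt replicas" counter
hadoop dfsadmin -report | grep -E 'Under replicated|corrupt replicas|Missing blocks'

# Full namespace walk; prints per-file block IDs and replica locations,
# useful for finding exactly which files own the suspect blocks
hadoop fsck / -files -blocks -locations > /tmp/fsck-full.txt

# Move files that have lost ALL replicas of some block to /lost+found;
# files whose blocks still have at least one good replica are left alone
hadoop fsck / -move
```

Because `dfsadmin -report` counts replicas and `fsck` counts blocks, a block with one bad copy and two good copies shows up in the first report but not the second, which is consistent with the disagreement below.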
-Joey

On Fri, Jun 10, 2011 at 6:22 PM, Robert J Berger <rber...@runa.com> wrote:
> I think the files may have been corrupted when I initially shut down the
> node that was still in decommissioning mode.
>
> Unfortunately I hadn't run dfsadmin -report any time soon before the
> incident, so I can't be sure the corrupt replicas haven't been there for a
> while. I always assumed that the fsck command would tell me if there were
> issues.
>
> So will running hadoop fsck -move just move the corrupted replicas and
> leave the good ones? Will this work even though fsck does not report any
> corruption?
>
> On Jun 9, 2011, at 3:20 PM, Joey Echeverria wrote:
>
>> hadoop fsck -move will move the corrupt files to /lost+found, which
>> will "fix" the report.
>>
>> Do you know what created the corrupt files?
>>
>> -Joey
>>
>> On Thu, Jun 9, 2011 at 3:04 PM, Robert J Berger <rber...@runa.com> wrote:
>>> I'm still having this problem and am kind of paralyzed until I figure out
>>> how to eliminate these blocks with corrupt replicas.
>>>
>>> Here is the output of dfsadmin -report and fsck:
>>>
>>> dfsadmin -report
>>> Configured Capacity: 13723995700736 (12.48 TB)
>>> Present Capacity: 13731775356416 (12.49 TB)
>>> DFS Remaining: 4079794918277 (3.71 TB)
>>> DFS Used: 9651980438139 (8.78 TB)
>>> DFS Used%: 70.29%
>>> Under replicated blocks: 18
>>> Blocks with corrupt replicas: 34
>>> Missing blocks: 0
>>>
>>> -------------------------------------------------
>>> Datanodes available: 9 (9 total, 0 dead)
>>> (Not showing the nodes other than the one with Decommission in progress)
>>> ...
>>> Name: 10.195.10.175:50010
>>> Decommission Status : Decommission in progress
>>> Configured Capacity: 1731946381312 (1.58 TB)
>>> DFS Used: 1083853885440 (1009.42 GB)
>>> Non DFS Used: 0 (0 KB)
>>> DFS Remaining: 651169222656 (606.45 GB)
>>> DFS Used%: 62.58%
>>> DFS Remaining%: 37.6%
>>> Last contact: Wed Jun 08 18:56:54 UTC 2011
>>> ...
>>>
>>> And the good bits from fsck:
>>>
>>> Status: HEALTHY
>>>  Total size: 2832555958232 B (Total open files size: 134217728 B)
>>>  Total dirs: 72151
>>>  Total files: 65449 (Files currently being written: 9)
>>>  Total blocks (validated): 95076 (avg. block size 29792544 B) (Total
>>>  open file blocks (not validated): 10)
>>>  Minimally replicated blocks: 95076 (100.0 %)
>>>  Over-replicated blocks: 35667 (37.5142 %)
>>>  Under-replicated blocks: 18 (0.018932223 %)
>>>  Mis-replicated blocks: 0 (0.0 %)
>>>  Default replication factor: 3
>>>  Average block replication: 3.376278
>>>  Corrupt blocks: 0
>>>  Missing replicas: 18 (0.0056074243 %)
>>>  Number of data-nodes: 9
>>>  Number of racks: 1
>>>
>>> The filesystem under path '/' is HEALTHY
>>>
>>> On Jun 8, 2011, at 10:38 AM, Robert J Berger wrote:
>>>
>>>> Synopsis:
>>>> * After shutting down a datanode in the cluster, fsck declared CORRUPT
>>>> with missing blocks.
>>>> * I restored/restarted the datanode, and fsck soon declared things
>>>> healthy.
>>>> * But dfsadmin -report says a small number of blocks have corrupt
>>>> replicas, and an even smaller number are under-replicated.
>>>> * After a couple of days, the number of corrupt replicas and
>>>> under-replicated blocks stays the same.
>>>>
>>>> Full Story:
>>>> My goal is to rebalance blocks across the 3 drives within each of 2
>>>> datanodes, in a 9-datanode (replication = 3) cluster running Hadoop
>>>> 0.20.1. (EBS volumes were added to the datanodes over time, so one disk
>>>> had 95% usage and the others had significantly less.)
>>>>
>>>> The plan was to decommission the nodes, wipe the disks, and then add
>>>> them back into the cluster.
>>>>
>>>> Before I started I ran fsck and all was healthy.
>>>> (Unfortunately I did not really look at the dfsadmin -report at that
>>>> time, so I can't be sure there were no blocks with corrupt replicas at
>>>> that point.)
>>>>
>>>> I put two nodes into the decommission process, and after waiting about
>>>> 36 hours it hadn't finished decommissioning either. So I decided to
>>>> throw caution to the wind and shut down one of them. (I had taken the
>>>> node I was shutting down out of the dfs.exclude.file, also removed the
>>>> 2nd node from the dfs.exclude.file and ran dfsadmin -refreshNodes, but
>>>> kept the 2nd node live.)
>>>>
>>>> After shutting down one node, running fsck showed about 400 blocks as
>>>> missing.
>>>>
>>>> So I brought the shut-down node back up (it took a while, as I had to
>>>> restore it from an EBS snapshot) and fsck quickly went back to healthy,
>>>> but with a significant number of over-replicated blocks.
>>>>
>>>> I put that node back into the decommissioning state (put just that one
>>>> node back in the dfs.exclude.file and ran dfsadmin -refreshNodes).
>>>>
>>>> After another day or so, it's still in decommissioning mode. fsck says
>>>> the cluster is healthy, but still shows 37% over-replicated blocks.
>>>>
>>>> But the thing that concerns me is that dfsadmin -report says:
>>>>
>>>> Under replicated blocks: 18
>>>> Blocks with corrupt replicas: 34
>>>>
>>>> So really two questions:
>>>>
>>>> * Is there a way to force these corrupt replicas and under-replicated
>>>> blocks to get fixed?
>>>> * Is there a way to speed up the decommissioning process (without
>>>> restarting the cluster)?
>>>>
>>>> I presume it's not safe for me to take down this node until the
>>>> decommissioning completes and/or the corrupt replicas are fixed.
>>>>
>>>> And finally, is there a better way to accomplish the original task of
>>>> rebalancing disks on a datanode?
>>>>
>>>> Thanks!
>>>> Rob
>>>> __________________
>>>> Robert J Berger - CTO
>>>> Runa Inc.
>>>> +1 408-838-8896
>>>> http://blog.ibd.com
>>>
>>> __________________
>>> Robert J Berger - CTO
>>> Runa Inc.
>>> +1 408-838-8896
>>> http://blog.ibd.com
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>
> __________________
> Robert J Berger - CTO
> Runa Inc.
> +1 408-838-8896
> http://blog.ibd.com

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
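For the record, the decommission workflow discussed in this thread can be sketched as follows. This is a hedged sketch for a Hadoop 0.20.x cluster: the exclude-file path and hostname are illustrative placeholders (not values from the thread), and it assumes `dfs.hosts.exclude` in the NameNode's configuration already points at that file.

```shell
# Sketch of the HDFS datanode decommission workflow (Hadoop 0.20.x).
# The file path and hostname below are illustrative, not from the thread.

# 1. Add the datanode to the exclude file named by dfs.hosts.exclude
echo "datanode5.example.com" >> /etc/hadoop/conf/dfs.exclude

# 2. Tell the NameNode to re-read its include/exclude lists
hadoop dfsadmin -refreshNodes

# 3. Watch progress: the node reports "Decommission in progress" until
#    every block it holds is sufficiently replicated elsewhere, then
#    flips to "Decommissioned" and is safe to shut down
hadoop dfsadmin -report | grep -A 1 'Decommission Status'

# After the node is wiped and re-added, the balancer can even out block
# counts ACROSS datanodes (it does not rebalance disks within one node);
# -threshold is the allowed deviation from average utilization, in percent
hadoop balancer -threshold 10
```

One relevant caveat: re-replication speed during decommissioning is throttled by settings such as `dfs.balance.bandwidthPerSec`, and on 0.20.1 changing them generally requires restarting daemons, which matches the observation above that decommissioning a ~1 TB node takes days.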