It should be safe to run fsck -move. Worst case, corrupt files end up in /lost+found. The job files are probably related to the under-replicated blocks: the default replication factor for job files is 10, and I noticed you have 9 datanodes.
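If you want a record of what gets moved, something like this should work (off the top of my head, so treat it as a sketch; "/" is just the path to check):

  # see which files and blocks are affected before moving anything
  hadoop fsck / -files -blocks -locations > fsck-before.txt

  # move files with corrupt blocks into /lost+found
  hadoop fsck / -move

  # check that the corrupt replica count went down
  hadoop dfsadmin -report

And if any leftover job files are still asking for 10 replicas, you could drop them to something your 9 datanodes can actually satisfy, e.g.:

  hadoop fs -setrep -w 9 /path/to/leftover/job/files

(/path/to/leftover/job/files is just a placeholder for wherever those files live.)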
The under-replication would probably also have prevented the node from decommissioning. If you run fsck -move, I'd be interested to know whether that fixes the corrupt replicas.

-Joey

On Jun 10, 2011, at 21:09, Robert J Berger <rber...@runa.com> wrote:

> I can't really do a full restart unless it's the only option.
>
> I did find some old temporary mapred job files that were considered
> under-replicated, so I deleted them, and the node that was taking forever to
> decommission finished decommissioning (not sure if there was really a causal
> connection).
>
> But the corrupted replicas were still there.
>
> Would there be any negative consequences to running the fsck -move just to
> try it?
>
> On Jun 10, 2011, at 3:33 PM, Joey Echeverria wrote:
>
>> Good question. I didn't pick up on the fact that fsck disagrees with
>> dfsadmin. Have you tried a full restart? Maybe somebody's information
>> is out of date?
>>
>> -Joey
>>
>> On Fri, Jun 10, 2011 at 6:22 PM, Robert J Berger <rber...@runa.com> wrote:
>>> I think the files may have been corrupted when I initially shut down
>>> the node that was still in decommissioning mode.
>>>
>>> Unfortunately I hadn't run dfsadmin -report any time soon before the
>>> incident, so I can't be sure that they haven't been there for a while. I
>>> always assumed that the fsck command would tell me if there were issues.
>>>
>>> So will running hadoop fsck -move just move the corrupted replicas and
>>> leave the good ones? Will this work even though fsck does not report any
>>> corruption?
>>>
>>> On Jun 9, 2011, at 3:20 PM, Joey Echeverria wrote:
>>>
>>>> hadoop fsck -move will move the corrupt files to /lost+found, which
>>>> will "fix" the report.
>>>>
>>>> Do you know what created the corrupt files?
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Jun 9, 2011 at 3:04 PM, Robert J Berger <rber...@runa.com> wrote:
>>>>> I'm still having this problem and am kind of paralyzed until I figure out
>>>>> how to eliminate these blocks with corrupt replicas.
>>>>>
>>>>> Here is the output of dfsadmin -report and fsck:
>>>>>
>>>>> dfsadmin -report
>>>>> Configured Capacity: 13723995700736 (12.48 TB)
>>>>> Present Capacity: 13731775356416 (12.49 TB)
>>>>> DFS Remaining: 4079794918277 (3.71 TB)
>>>>> DFS Used: 9651980438139 (8.78 TB)
>>>>> DFS Used%: 70.29%
>>>>> Under replicated blocks: 18
>>>>> Blocks with corrupt replicas: 34
>>>>> Missing blocks: 0
>>>>>
>>>>> -------------------------------------------------
>>>>> Datanodes available: 9 (9 total, 0 dead)
>>>>> (Not showing the nodes other than the one with Decommission in progress)
>>>>> ...
>>>>> Name: 10.195.10.175:50010
>>>>> Decommission Status : Decommission in progress
>>>>> Configured Capacity: 1731946381312 (1.58 TB)
>>>>> DFS Used: 1083853885440 (1009.42 GB)
>>>>> Non DFS Used: 0 (0 KB)
>>>>> DFS Remaining: 651169222656 (606.45 GB)
>>>>> DFS Used%: 62.58%
>>>>> DFS Remaining%: 37.6%
>>>>> Last contact: Wed Jun 08 18:56:54 UTC 2011
>>>>> ...
>>>>>
>>>>> And the good bits from fsck:
>>>>>
>>>>> Status: HEALTHY
>>>>>  Total size: 2832555958232 B (Total open files size: 134217728 B)
>>>>>  Total dirs: 72151
>>>>>  Total files: 65449 (Files currently being written: 9)
>>>>>  Total blocks (validated): 95076 (avg. block size 29792544 B) (Total open file blocks (not validated): 10)
>>>>>  Minimally replicated blocks: 95076 (100.0 %)
>>>>>  Over-replicated blocks: 35667 (37.5142 %)
>>>>>  Under-replicated blocks: 18 (0.018932223 %)
>>>>>  Mis-replicated blocks: 0 (0.0 %)
>>>>>  Default replication factor: 3
>>>>>  Average block replication: 3.376278
>>>>>  Corrupt blocks: 0
>>>>>  Missing replicas: 18 (0.0056074243 %)
>>>>>  Number of data-nodes: 9
>>>>>  Number of racks: 1
>>>>>
>>>>> The filesystem under path '/' is HEALTHY
>>>>>
>>>>> On Jun 8, 2011, at 10:38 AM, Robert J Berger wrote:
>>>>>
>>>>>> Synopsis:
>>>>>> * After shutting down a datanode in the cluster, fsck declares CORRUPT
>>>>>> with missing blocks.
>>>>>> * I restore/restart the datanode and fsck soon declares things healthy.
>>>>>> * But dfsadmin -report says a small number of blocks have corrupt
>>>>>> replicas, plus an even smaller number of under-replicated blocks.
>>>>>> * After a couple of days, the number of corrupt replicas and
>>>>>> under-replicated blocks stays the same.
>>>>>>
>>>>>> Full story:
>>>>>> My goal is to rebalance blocks across the 3 drives within each of 2
>>>>>> datanodes in a 9-datanode (replication = 3) cluster running Hadoop 0.20.1.
>>>>>> (EBS volumes were added to the datanodes over time, so one disk had 95%
>>>>>> usage and the others had significantly less.)
>>>>>>
>>>>>> The plan was to decommission the nodes, wipe the disks, and then add
>>>>>> them back into the cluster.
>>>>>>
>>>>>> Before I started I ran fsck and all was healthy. (Unfortunately I did
>>>>>> not really look at the dfsadmin -report at that time, so I can't be sure
>>>>>> there were no blocks with corrupt replicas at that point.)
>>>>>>
>>>>>> I put two nodes into the decommission process, and after waiting about 36
>>>>>> hours neither had finished decommissioning. So I decided to throw
>>>>>> caution to the wind and shut down one of them. (I had taken the node I
>>>>>> was shutting down out of dfs.exclude.file, also removed the 2nd node
>>>>>> from dfs.exclude.file, and ran dfsadmin -refreshNodes, but kept the 2nd
>>>>>> node live.)
>>>>>>
>>>>>> After shutting down one node, running fsck showed about 400 blocks as
>>>>>> missing.
>>>>>>
>>>>>> So I brought the shutdown node back up (it took a while, as I had to
>>>>>> restore it from an EBS snapshot) and fsck quickly went back to healthy,
>>>>>> but with a significant number of over-replicated blocks.
>>>>>>
>>>>>> I put that node back into the decommissioning state (put just that one
>>>>>> node back in dfs.exclude.file and ran dfsadmin -refreshNodes).
>>>>>>
>>>>>> After another day or so, it's still in decommissioning mode. Fsck says
>>>>>> the cluster is healthy, but still shows 37% over-replicated blocks.
>>>>>>
>>>>>> But the thing that concerns me is that dfsadmin -report says:
>>>>>>
>>>>>> Under replicated blocks: 18
>>>>>> Blocks with corrupt replicas: 34
>>>>>>
>>>>>> So really two questions:
>>>>>>
>>>>>> * Is there a way to force these corrupt replicas and under-replicated
>>>>>> blocks to get fixed?
>>>>>> * Is there a way to speed up the decommissioning process (without
>>>>>> restarting the cluster)?
>>>>>>
>>>>>> I presume that it's not safe for me to take down this node until the
>>>>>> decommissioning completes and/or the corrupt replicas are fixed.
>>>>>>
>>>>>> And finally, is there a better way to accomplish the original task of
>>>>>> rebalancing disks on a datanode?
>>>>>>
>>>>>> Thanks!
>>>>>> Rob
>>>>>> __________________
>>>>>> Robert J Berger - CTO
>>>>>> Runa Inc.
>>>>>> +1 408-838-8896
>>>>>> http://blog.ibd.com