hadoop fsck -move will move the corrupt files to /lost+found, which
will "fix" the report.

Do you know what created the corrupt files?

-Joey

On Thu, Jun 9, 2011 at 3:04 PM, Robert J Berger <rber...@runa.com> wrote:
> I'm still having this problem and am kind of paralyzed until I figure out how 
> to eliminate these Blocks with corrupt replicas.
>
> Here is the output of dfsadmin -report and fsck:
>
> dfsadmin -report
> Configured Capacity: 13723995700736 (12.48 TB)
> Present Capacity: 13731775356416 (12.49 TB)
> DFS Remaining: 4079794918277 (3.71 TB)
> DFS Used: 9651980438139 (8.78 TB)
> DFS Used%: 70.29%
> Under replicated blocks: 18
> Blocks with corrupt replicas: 34
> Missing blocks: 0
>
> -------------------------------------------------
> Datanodes available: 9 (9 total, 0 dead)
> (Not showing the nodes other than the one with Decommission in progress)
> ...
> Name: 10.195.10.175:50010
> Decommission Status : Decommission in progress
> Configured Capacity: 1731946381312 (1.58 TB)
> DFS Used: 1083853885440 (1009.42 GB)
> Non DFS Used: 0 (0 KB)
> DFS Remaining: 651169222656(606.45 GB)
> DFS Used%: 62.58%
> DFS Remaining%: 37.6%
> Last contact: Wed Jun 08 18:56:54 UTC 2011
> ...
>
> And the good bits from fsck:
>
> Status: HEALTHY
> Total size:     2832555958232 B (Total open files size: 134217728 B)
> Total dirs:     72151
> Total files:    65449 (Files currently being written: 9)
> Total blocks (validated):       95076 (avg. block size 29792544 B) (Total 
> open file blocks (not validated): 10)
> Minimally replicated blocks:    95076 (100.0 %)
> Over-replicated blocks: 35667 (37.5142 %)
> Under-replicated blocks:        18 (0.018932223 %)
> Mis-replicated blocks:          0 (0.0 %)
> Default replication factor:     3
> Average block replication:      3.376278
> Corrupt blocks:         0
> Missing replicas:               18 (0.0056074243 %)
> Number of data-nodes:           9
> Number of racks:                1
>
>
> The filesystem under path '/' is HEALTHY
>
>
>
> On Jun 8, 2011, at 10:38 AM, Robert J Berger wrote:
>
>> Synopsis:
>> * After shutting down a datanode in a cluster, fsck declares CORRUPT with 
>> missing blocks,
>> * I restore/restart the datanode and fsck soon declares things healthy
>> * But dfsadmin -report says a small number of blocks have corrupt replicas 
>> and an even smaller number of under-replicated blocks
>> * After a couple of days, the number of corrupt replicas and under-replicated 
>> blocks stays the same
>>
>> Full Story:
>> My goal is to rebalance blocks across the 3 drives in each of 2 datanodes in a 
>> 9-datanode (replication=3) cluster running Hadoop 0.20.1
>> (EBS Volumes were added to the datanodes over time so one disk had 95% usage 
>> and the others had significantly less)
>>
>> The plan was to decommission the nodes, wipe the disks, and then add them 
>> back into the cluster.
>>
>> Before I started, I ran fsck and all was healthy. (Unfortunately I did not 
>> really look at dfsadmin -report at that time, so I can't be sure whether 
>> there were any blocks with corrupt replicas at that point.)
>>
>> I put two nodes into the decommission process, and after waiting about 36 
>> hours it hadn't finished decommissioning either. So I decided to throw 
>> caution to the wind and shut down one of them. (I had taken the node I was 
>> shutting down out of the dfs.exclude file, also removed the 2nd node from 
>> the dfs.exclude file, and ran dfsadmin -refreshNodes, but kept the 2nd node 
>> live.)
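For anyone following along, the usual decommission sequence looks roughly like this (the exclude-file path is illustrative; hdfs-site.xml must already point dfs.hosts.exclude at it):

```shell
# Add the datanode to the exclude file your NameNode is configured to read.
echo "10.195.10.175" >> /etc/hadoop/conf/dfs.exclude   # path illustrative

# Tell the NameNode to re-read the include/exclude lists.
hadoop dfsadmin -refreshNodes

# Watch progress: the node shows "Decommission in progress" until all of its
# blocks have been re-replicated elsewhere, then "Decommissioned".
hadoop dfsadmin -report | grep -A 2 "10.195.10.175"
```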
>>
>> After shutting down one node, running fsck showed about 400 blocks as 
>> missing.
>>
>> So I brought the shut-down node back up (it took a while, as I had to restore 
>> it from an EBS snapshot), and fsck quickly went back to healthy, but with a 
>> significant number of over-replicated blocks.
>>
>> I put that node back into the decommissioning state (put just that one node 
>> back in the dfs.exclude file and ran dfsadmin -refreshNodes).
>>
>> After another day or so, it's still in decommissioning mode. fsck says the 
>> cluster is healthy, but still reports 37% over-replicated blocks.
>>
>> But the thing that concerns me is that dfsadmin -report says:
>>
>> Under replicated blocks: 18
>> Blocks with corrupt replicas: 34
>>
>> So really two questions:
>>
>> * Is there a way to force these corrupt replicas and under-replicated blocks 
>> to get fixed?
>> * Is there a way to speed up the decommissioning process (without restarting 
>> the cluster)?
>>
>> I presume that it's not safe for me to take down this node until the 
>> decommissioning completes and/or the corrupt replicas are fixed.
>>
>> And finally, is there a better way to accomplish the original task of 
>> rebalancing disks on a datanode?
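On that last question: as far as I know, the stock 0.20 balancer only evens usage *between* datanodes, not between the disks inside one node, so on its own it won't fix a single 95%-full volume. A commonly cited workaround (sketch only; paths are illustrative, and the datanode must be stopped first) is to move block subdirectories between the dfs.data.dir volumes by hand:

```shell
# Cluster-wide balancer: moves blocks between datanodes until each node is
# within 10% of the cluster-average DFS usage.
hadoop balancer -threshold 10

# Intra-node workaround (run only with the datanode stopped): move whole
# subdir trees between the configured dfs.data.dir volumes; the datanode
# rescans all its volumes on restart. Paths are illustrative.
mv /mnt/disk1/dfs/data/current/subdir12 /mnt/disk3/dfs/data/current/
```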
>>
>> Thanks!
>> Rob
>> __________________
>> Robert J Berger - CTO
>> Runa Inc.
>> +1 408-838-8896
>> http://blog.ibd.com
>>
>>
>>
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
