[ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-3392:
-----------------------------------

    Description: 
On one of our clusters we ended up with 11 singly-replicated corrupted blocks 
(checksum errors), and jobs were failing because no live replicas were 
available.

fsck reports the system as healthy, although it is not.

I argue that fsck should have an option to check whether under-replicated 
blocks are okay.

Even better, the namenode should automatically check under-replicated blocks 
with repeated replication failures for corruption and list them somewhere on 
the GUI. And for checksum errors, there should be an option to undo the 
corruption and recompute the checksums.

Question: Is it at all probable that two or more replicas of a block have 
checksum errors? If not, then we could restrict the checking to 
singly-replicated blocks.

  was:
On one of our clusters we ended up with 11 singly-replicated corrupted blocks 
(checksum errors), and jobs were failing because no live replicas were 
available.

fsck reports the system as healthy, although it is not.

I argue that fsck should have an option to check whether under-replicated 
blocks are okay.

Even better, the namenode should automatically check under-replicated blocks 
with repeated replication failures for corruption and list them somewhere on 
the GUI. And there should be an option to undo the corruption and recompute the 
checksums.

Question: Is it at all probable that two or more replicas of a block have 
checksum errors? If not, then we could restrict the checking to 
singly-replicated blocks.


> Corrupted blocks leading to job failures
> ----------------------------------------
>
>                 Key: HADOOP-3392
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3392
>             Project: Hadoop Core
>          Issue Type: Improvement
>    Affects Versions: 0.16.0
>            Reporter: Christian Kunz
>
> On one of our clusters we ended up with 11 singly-replicated corrupted blocks 
> (checksum errors), and jobs were failing because no live replicas were 
> available.
> fsck reports the system as healthy, although it is not.
> I argue that fsck should have an option to check whether under-replicated 
> blocks are okay.
> Even better, the namenode should automatically check under-replicated blocks 
> with repeated replication failures for corruption and list them somewhere on 
> the GUI. And for checksum errors, there should be an option to undo the 
> corruption and recompute the checksums.
> Question: Is it at all probable that two or more replicas of a block have 
> checksum errors? If not, then we could restrict the checking to 
> singly-replicated blocks.
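
One rough way to reason about the closing question is sketched below. It is 
only a back-of-the-envelope model, not anything taken from Hadoop: it assumes 
each replica is corrupted independently with the same probability p, which 
may not hold in practice (for example if a corrupt replica is itself copied, 
or if a bad disk or node affects many replicas at once). The function names 
and the parameter p are hypothetical.

```python
def prob_all_replicas_corrupt(p: float, replicas: int) -> float:
    """Probability that every replica of one block is corrupt.

    Assumption (not from the issue): each replica is corrupted
    independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("a block needs at least one replica")
    # Independent events: multiply the per-replica probabilities.
    return p ** replicas


def expected_unreadable_blocks(total_blocks: int, p: float, replicas: int) -> float:
    """Expected number of blocks with no intact replica under the same model."""
    return total_blocks * prob_all_replicas_corrupt(p, replicas)
```

Under this model the chance of losing every replica shrinks geometrically 
with the replica count, which would support checking singly-replicated 
blocks first. If corruption is correlated across replicas, the model 
underestimates the risk for blocks with two or more replicas.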
