Corrupted blocks leading to job failures
----------------------------------------

                 Key: HADOOP-3392
                 URL: https://issues.apache.org/jira/browse/HADOOP-3392
             Project: Hadoop Core
          Issue Type: Improvement
    Affects Versions: 0.16.0
            Reporter: Christian Kunz


On one of our clusters we ended up with 11 singly-replicated blocks corrupted 
by checksum errors, so jobs were failing because no live replicas of those 
blocks were available.

fsck reports the file system as healthy, although it is not.

I argue that fsck should have an option to verify whether under-replicated 
blocks are still readable, i.e. whether their data passes checksum verification.
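
For reference, this is roughly what such a check amounts to from the client 
side today; a minimal sketch only, written against the public FileSystem API 
(the VerifyFiles class is hypothetical, not an existing tool, and the list of 
suspect paths would have to be supplied by the operator, e.g. taken from 
"fsck -files -blocks" output). Reading a file end to end through the normal DFS 
client path forces checksum verification, so a corrupt block surfaces as a 
ChecksumException.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyFiles {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[64 * 1024];
    for (String name : args) {           // paths of under-replicated files
      Path path = new Path(name);
      FSDataInputStream in = fs.open(path);
      try {
        while (in.read(buf) > 0) {
          // keep reading; a bad chunk makes read() throw ChecksumException
        }
        System.out.println("OK      " + path);
      } catch (ChecksumException e) {
        System.out.println("CORRUPT " + path + " (around byte " + e.getPos() + ")");
      } finally {
        in.close();
      }
    }
  }
}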

Even better, the namenode should automatically check under-replicated blocks 
that suffer repeated replication failures for corruption and list them somewhere 
on the web UI. There should also be an option to clear the corrupt state, i.e. 
accept the block's current data as valid and recompute its checksums.
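
As a stop-gap, the "accept the data and recompute the checksums" step can be 
approximated from the client side; the sketch below is hypothetical 
(AcceptCorruptFile is not an existing tool, some of these FileSystem calls may 
differ in 0.16, and whether the surviving bytes are actually usable is for the 
owner of the data to judge). It reads the file with client-side checksum 
verification disabled and rewrites it, which produces fresh checksums for 
whatever data is there.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class AcceptCorruptFile {
  public static void main(String[] args) throws IOException {
    Path src = new Path(args[0]);                  // the corrupt file
    Path tmp = new Path(args[0] + ".recovered");   // hypothetical temp name

    FileSystem fs = FileSystem.get(new Configuration());
    fs.setVerifyChecksum(false);   // read the bytes back despite checksum mismatches

    FSDataInputStream in = fs.open(src);
    FSDataOutputStream out = fs.create(tmp, true);
    IOUtils.copyBytes(in, out, 64 * 1024, true);   // copies and closes both streams

    // the rewritten copy gets brand-new checksums; swap it in place of the original
    fs.delete(src, false);
    fs.rename(tmp, src);
  }
}

Note the copy is written with the default replication and block size unless set 
explicitly, and there is a window where the path does not exist, so an in-place 
option in the namenode would be much cleaner.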

Question: how likely is it that two or more replicas of a block have checksum 
errors? If that is very unlikely, we could limit the check to singly-replicated 
blocks.
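
A back-of-envelope answer (the per-replica corruption rate here is purely 
illustrative, not a measured number): if replicas go bad independently with 
probability p, the chance that all r replicas of a block are corrupt is about 
p^r. With p = 1e-6 that is roughly 1e-18 for a 3x-replicated block and 1e-12 
for a 2x-replicated block, but the full 1e-6 for a singly-replicated block, so 
restricting the check to singly-replicated blocks would cover by far the most 
likely case. The weak point is the independence assumption: a flaky disk or NIC 
in the write pipeline can, in principle, corrupt more than one replica of the 
same block, so correlated failures are not impossible.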
