Corrupted blocks leading to job failures
----------------------------------------
Key: HADOOP-3392
URL: https://issues.apache.org/jira/browse/HADOOP-3392
Project: Hadoop Core
Issue Type: Improvement
Affects Versions: 0.16.0
Reporter: Christian Kunz
On one of our clusters we ended up with 11 singly-replicated corrupted blocks
(checksum errors), such that jobs were failing because no live block was
available.
fsck reports the system as healthy, although it is not.
I argue that fsck should have an option to verify that under-replicated
blocks are free of corruption.
Even better, the namenode should automatically check under-replicated blocks
with repeated replication failures for corruption and list them on the web UI.
There should also be an option to accept a block's current data as-is and
recompute its checksums, so that replication can proceed.
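For concreteness, here is a minimal sketch of the per-chunk check such an
option implies, assuming the CRC32-per-chunk scheme HDFS uses for block data
(io.bytes.per.checksum, default 512 bytes). The class and method names are
hypothetical, not actual datanode code: firstCorruptChunk flags a replica
whose stored checksums no longer match its data, and recomputeCrcs regenerates
the checksums from the data as-is, which is what "accepting" a corrupt block
would amount to.

{code}
import java.util.zip.CRC32;

// Hypothetical sketch of per-chunk CRC32 verification and recomputation.
// Assumes one stored CRC per BYTES_PER_CHECKSUM bytes of block data.
public class ChunkChecksumCheck {
    static final int BYTES_PER_CHECKSUM = 512; // default io.bytes.per.checksum

    // Returns the index of the first chunk whose CRC mismatches, or -1 if clean.
    static int firstCorruptChunk(byte[] blockData, long[] storedCrcs) {
        CRC32 crc = new CRC32();
        int nChunks = (blockData.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        for (int chunk = 0; chunk < nChunks; chunk++) {
            int off = chunk * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, blockData.length - off);
            crc.reset();
            crc.update(blockData, off, len);
            if (crc.getValue() != storedCrcs[chunk]) {
                return chunk; // checksum mismatch: replica is corrupt here
            }
        }
        return -1;
    }

    // Recomputes all chunk CRCs from the current data, i.e. accepts it as-is.
    static long[] recomputeCrcs(byte[] blockData) {
        int nChunks = (blockData.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] crcs = new long[nChunks];
        CRC32 crc = new CRC32();
        for (int chunk = 0; chunk < nChunks; chunk++) {
            int off = chunk * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, blockData.length - off);
            crc.reset();
            crc.update(blockData, off, len);
            crcs[chunk] = crc.getValue();
        }
        return crcs;
    }
}
{code}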
Question: Is it at all probable that two or more replicas of a block have
checksum errors? If not, we could restrict the checking to singly-replicated
blocks.
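For intuition, here is a back-of-the-envelope calculation under the (strong)
assumption that replicas are corrupted independently, each with some small
probability p; the value of p below is purely illustrative. Under independence,
losing two or more of three replicas is roughly a factor of p rarer than losing
one, which would justify restricting the check to singly-replicated blocks. If
corruption is correlated across replicas (e.g., a fault near the write path
affecting the whole pipeline), the independence assumption fails and the
conclusion weakens.

{code}
// Back-of-the-envelope odds of multi-replica checksum corruption,
// assuming independent corruption with an illustrative probability p.
public class ReplicaCorruptionOdds {
    public static void main(String[] args) {
        double p = 1e-6; // assumed per-replica corruption probability (illustrative only)
        // With the default replication factor of 3:
        double exactlyOne = 3 * p * (1 - p) * (1 - p);      // one replica corrupt
        double twoOrMore = 3 * p * p * (1 - p) + p * p * p; // two or three corrupt
        System.out.printf("P(exactly 1 of 3 corrupt) = %.3e%n", exactlyOne);
        System.out.printf("P(2 or more of 3 corrupt) = %.3e%n", twoOrMore);
        // For small p, twoOrMore ~ 3*p^2, about a factor of p rarer than one failure.
    }
}
{code}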