Daeseong Kim wrote:
To solve the checksum errors on non-ECC memory machines, I
modified some code in DFSClient.java and DataNode.java.

The idea is very simple.
The original CHUNK structure is
{chunk size}{chunk data}{chunk size}{chunk data}...

The modified CHUNK structure is
{chunk size}{chunk data}{chunk crc}{chunk size}{chunk data}{chunk crc}...
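
For illustration only, here is a minimal Java sketch of that interleaved
framing using java.util.zip.CRC32. The class and method names are
hypothetical, not the actual DFSClient/DataNode changes:

  import java.io.DataInputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.util.zip.CRC32;

  // Hypothetical sketch of {chunk size}{chunk data}{chunk crc} framing.
  public class InlineCrcFraming {

    // Write one chunk followed by its CRC32.
    public static void writeChunk(DataOutputStream out, byte[] data, int len)
        throws IOException {
      CRC32 crc = new CRC32();
      crc.update(data, 0, len);
      out.writeInt(len);             // {chunk size}
      out.write(data, 0, len);       // {chunk data}
      out.writeLong(crc.getValue()); // {chunk crc}
    }

    // Read one chunk and verify it against the trailing CRC32.
    public static byte[] readChunk(DataInputStream in) throws IOException {
      int len = in.readInt();
      byte[] data = new byte[len];
      in.readFully(data);
      long expected = in.readLong();
      CRC32 crc = new CRC32();
      crc.update(data, 0, len);
      if (crc.getValue() != expected) {
        throw new IOException("Checksum error in chunk of length " + len);
      }
      return data;
    }
  }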

This is very similar to the approach taken in HADOOP-1134:

  https://issues.apache.org/jira/browse/HADOOP-1134

This will be included in the upcoming 0.14 release. HDFS checksums are no longer stored in parallel HDFS files, but directly by the filesystem with each block. I do not know whether this will make Hadoop usable on non-ECC hosts, but it might help.

It primarily improves things in the following ways:

1. Removing checksum files from HDFS frees a lot of memory in the namenode.

2. Data corruption can be detected before data is read. Since corruptions are caught at write time, instead of a job failing because its input is corrupt, the tasks from the prior job that generated the now-corrupt data can be failed and retried (sketched below). However, if a task repeatedly fails due to corruptions, the job will still fail, so this may not remedy things entirely for non-ECC hosts.
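
To make the write-time point concrete, here is a small, hypothetical Java
sketch (again using java.util.zip.CRC32, not Hadoop's actual API) of a
receiver verifying the sender's checksum before persisting anything, so
that the writing task is the one that fails and gets retried:

  import java.io.IOException;
  import java.util.zip.CRC32;

  // Hypothetical receiver-side check: reject a corrupt chunk at write
  // time so the writer fails and can be retried, rather than a later
  // reader discovering the corruption.
  public class WriteTimeVerifier {

    public static void verifyBeforePersist(byte[] chunk, int len, long sentCrc)
        throws IOException {
      CRC32 crc = new CRC32();
      crc.update(chunk, 0, len);
      if (crc.getValue() != sentCrc) {
        // Failing here fails the writing task; the framework can retry it.
        throw new IOException("Write-time checksum mismatch; rejecting chunk");
      }
      // persist(chunk, len);  // hypothetical: only reached if the CRC matches
    }
  }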

Doug
