Daeseong Kim wrote:
To solve the checksum errors on machines with non-ECC memory, I
modified some code in DFSClient.java and DataNode.java.
The idea is very simple.
The original CHUNK structure is
{chunk size}{chunk data}{chunk size}{chunk data}...
The modified CHUNK structure is
{chunk size}{chunk data}{chunk crc}{chunk size}{chunk data}{chunk crc}...
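For illustration only (this is not the actual DFSClient.java/DataNode.java
patch), here is a minimal Java sketch of writing and reading that
interleaved layout, assuming a 4-byte length, the data, then a CRC32 of
the data for each chunk:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

public class ChunkedCrcStream {

  /** Write one chunk: {chunk size}{chunk data}{chunk crc}. */
  public static void writeChunk(DataOutputStream out, byte[] buf, int off, int len)
      throws IOException {
    CRC32 crc = new CRC32();
    crc.update(buf, off, len);
    out.writeInt(len);                    // {chunk size}
    out.write(buf, off, len);             // {chunk data}
    out.writeInt((int) crc.getValue());   // {chunk crc}
  }

  /** Read one chunk, verify its CRC, and return the data. */
  public static byte[] readChunk(DataInputStream in) throws IOException {
    int len = in.readInt();               // {chunk size}
    byte[] buf = new byte[len];
    in.readFully(buf);                    // {chunk data}
    int expected = in.readInt();          // {chunk crc}
    CRC32 crc = new CRC32();
    crc.update(buf, 0, len);
    if ((int) crc.getValue() != expected) {
      throw new IOException("CRC mismatch on chunk of length " + len);
    }
    return buf;
  }
}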
This is very similar to the approach taken in HADOOP-1134:
https://issues.apache.org/jira/browse/HADOOP-1134
This will be included in the upcoming 0.14 release. HDFS checksums are
no longer stored in parallel HDFS files, but are kept by the filesystem
directly with each block (roughly as sketched below). I do not know
whether this will make Hadoop usable on non-ECC hosts, but it might help.
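A rough sketch of the idea, not the HADOOP-1134 implementation: compute a
CRC32 for each fixed-size chunk of a block and store the checksums in a
small metadata file kept next to the block file on the datanode's local
disk. The chunk size and file naming here are assumptions for
illustration.

import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

public class BlockChecksumWriter {
  static final int BYTES_PER_CHECKSUM = 512;   // chunk size (assumed)

  /** Write one CRC32 per chunk of the block into a sidecar metadata file. */
  public static void writeChecksums(String blockFile) throws IOException {
    try (FileInputStream in = new FileInputStream(blockFile);
         DataOutputStream meta =
             new DataOutputStream(new FileOutputStream(blockFile + ".meta"))) {
      byte[] buf = new byte[BYTES_PER_CHECKSUM];
      int n;
      while ((n = in.read(buf)) > 0) {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, n);
        meta.writeInt((int) crc.getValue()); // checksum stored with the block
      }
    }
  }
}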
It primarily improves things in the following ways:
1. Removing checksum files from HDFS frees a lot of memory in the namenode.
2. Data corruption can be detected before the data is read. Since
corruptions are now caught at write time, the tasks from the prior job
that generated the now-corrupt data can be failed and retried, instead
of a later job failing because its input is corrupt. However, if a task
repeatedly fails due to corruptions, jobs will still fail, so this may
not remedy things entirely for non-ECC hosts.
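To make point 2 concrete, here is a small sketch of a write-time check
(illustrative names, not the actual DataNode code): the datanode
recomputes the CRC over each chunk it receives and compares it with the
CRC the client sent, so a mismatch fails the write and the task that
produced the data is retried.

import java.io.IOException;
import java.util.zip.CRC32;

public class WriteTimeVerifier {

  /** Verify a received chunk against the client-supplied CRC32 value. */
  public static void verifyChunk(byte[] data, int off, int len, int clientCrc)
      throws IOException {
    CRC32 crc = new CRC32();
    crc.update(data, off, len);
    if ((int) crc.getValue() != clientCrc) {
      // Failing here aborts the write; the framework re-runs the writing task.
      throw new IOException("Checksum error: chunk corrupted before reaching the datanode");
    }
  }
}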
Doug