Daeseong Kim wrote:
To solve the checksum errors on non-ECC memory machines, I
modified some code in DFSClient.java and DataNode.java.

The idea is very simple.
The original CHUNK structure is
{chunk size}{chunk data}{chunk size}{chunk data}...

The modified CHUNK structure is
{chunk size}{chunk data}{chunk crc}{chunk size}{chunk data}{chunk crc}...
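
For illustration only, here is a minimal Java sketch of that interleaved
framing using java.util.zip.CRC32. The class and method names are
hypothetical, not the actual DFSClient/DataNode changes:

  import java.io.DataInputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.util.zip.CRC32;

  // Hypothetical sketch of {chunk size}{chunk data}{chunk crc} framing.
  public class InlineCrcFraming {

    // Write one chunk followed by its CRC32.
    public static void writeChunk(DataOutputStream out, byte[] data, int len)
        throws IOException {
      CRC32 crc = new CRC32();
      crc.update(data, 0, len);
      out.writeInt(len);             // {chunk size}
      out.write(data, 0, len);       // {chunk data}
      out.writeLong(crc.getValue()); // {chunk crc}
    }

    // Read one chunk and verify it against the trailing CRC32.
    public static byte[] readChunk(DataInputStream in) throws IOException {
      int len = in.readInt();
      byte[] data = new byte[len];
      in.readFully(data);
      long expected = in.readLong();
      CRC32 crc = new CRC32();
      crc.update(data, 0, len);
      if (crc.getValue() != expected) {
        throw new IOException("Checksum error in chunk of length " + len);
      }
      return data;
    }
  }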

This is very similar to the approach taken in HADOOP-1134:

  https://issues.apache.org/jira/browse/HADOOP-1134

This will be included in the upcoming 0.14 release. HDFS checksums are no longer stored in parallel HDFS files, but directly by the filesystem with each block. I do not know whether this will make Hadoop usable on non-ECC hosts, but it might help.

It primarily improves things in the following ways:

1. Removing checksum files from HDFS frees a lot of memory in the namenode.

2. Data corruption can be detected before data is read. Since corruptions are caught at write time, instead of a job failing because its input is corrupt, the tasks from the prior job that generated the now-corrupt data can be failed and retried (sketched below). However, if a task repeatedly fails due to corruptions, the job will still fail, so this may not remedy things entirely for non-ECC hosts.
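
To make the write-time point concrete, here is a small, hypothetical Java
sketch (again using java.util.zip.CRC32, not Hadoop's actual API) of a
receiver verifying the sender's checksum before persisting anything, so
that the writing task is the one that fails and gets retried:

  import java.io.IOException;
  import java.util.zip.CRC32;

  // Hypothetical receiver-side check: reject a corrupt chunk at write
  // time so the writer fails and can be retried, rather than a later
  // reader discovering the corruption.
  public class WriteTimeVerifier {

    public static void verifyBeforePersist(byte[] chunk, int len, long sentCrc)
        throws IOException {
      CRC32 crc = new CRC32();
      crc.update(chunk, 0, len);
      if (crc.getValue() != sentCrc) {
        // Failing here fails the writing task; the framework can retry it.
        throw new IOException("Write-time checksum mismatch; rejecting chunk");
      }
      // persist(chunk, len);  // hypothetical: only reached if the CRC matches
    }
  }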

Doug
