Doug Cutting wrote:
[ Moving discussion to hadoop-dev. -drc ]
Raghu Angadi wrote:
This is good validation of how important ECC memory is. Currently the
HDFS client deletes a block when it notices a checksum error. After
moving to block-level CRCs soon, we should make the Datanode re-validate
the block before deciding to delete it.
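For illustration, a minimal sketch of that kind of re-validation,
assuming the block data and a side file of one 4-byte CRC per
io.bytes.per.checksum chunk are both on disk; the class and method
names are made up and this is not the actual Datanode code:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

class BlockRevalidator {
    // Returns true only if the on-disk data really fails its stored checksums.
    static boolean isCorrupt(String blockFile, String checksumFile,
                             int bytesPerChecksum) throws IOException {
        try (FileInputStream data = new FileInputStream(blockFile);
             DataInputStream sums =
                 new DataInputStream(new FileInputStream(checksumFile))) {
            byte[] chunk = new byte[bytesPerChecksum];
            int read;
            while ((read = data.readNBytes(chunk, 0, chunk.length)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(chunk, 0, read);
                int stored = sums.readInt();   // one 4-byte CRC per chunk (assumed format)
                if (stored != (int) crc.getValue()) {
                    return true;               // genuine corruption on disk
                }
            }
        }
        return false;                          // block still verifies; don't delete it
    }
}

A Datanode could run something like this when a client reports a bad
block and keep the replica if it still verifies cleanly, e.g. when the
client's copy was corrupted by a memory error.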
It also emphasizes how important end-to-end checksums are. Data should
also be checksummed as soon as possible after it is generated, before it
has a chance to be corrupted.
Ideally, the initial buffer that stores the data should be small, and
data should be checksummed as this initial buffer is flushed.
In my implementation of block-level CRCs (which does not affect
ChecksumFileSystem in HADOOP-928), we don't buffer checksum data at all.
As soon as io.bytes.per.checksum bytes are written, the checksum is
written directly to the backupstream. I have removed stream buffering in
multiple places in DFSClient. But it is still affected by the buffering
issue you mentioned below.
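Roughly, that write path could look like the sketch below (hypothetical
names; separate data and checksum streams assumed purely for
illustration). A CRC is emitted as soon as each io.bytes.per.checksum
sized chunk fills, with no checksum buffering in between:

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

class ChunkedChecksumOutputStream extends OutputStream {
    private final OutputStream dataOut;  // downstream data path
    private final OutputStream sumOut;   // checksum path (stand-in for the backupstream above)
    private final byte[] chunk;          // small buffer of io.bytes.per.checksum bytes
    private int count = 0;

    ChunkedChecksumOutputStream(OutputStream dataOut, OutputStream sumOut,
                                int bytesPerChecksum) {
        this.dataOut = dataOut;
        this.sumOut = sumOut;
        this.chunk = new byte[bytesPerChecksum];
    }

    @Override
    public void write(int b) throws IOException {
        chunk[count++] = (byte) b;
        if (count == chunk.length) {
            flushChunk();                // checksum as soon as the chunk is full
        }
    }

    private void flushChunk() throws IOException {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, count);
        int sum = (int) crc.getValue();
        // Write the CRC immediately; nothing accumulates in a checksum buffer.
        sumOut.write((sum >>> 24) & 0xff);
        sumOut.write((sum >>> 16) & 0xff);
        sumOut.write((sum >>> 8) & 0xff);
        sumOut.write(sum & 0xff);
        dataOut.write(chunk, 0, count);
        count = 0;
    }

    @Override
    public void close() throws IOException {
        if (count > 0) flushChunk();     // checksum the final partial chunk
        sumOut.close();
        dataOut.close();
    }
}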
In the current implementation, the small checksum buffer is the second
buffer; the initial buffer is the larger, io.buffer.size buffer. To
provide maximum protection against memory errors, this situation should
be reversed.
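The difference is just the wrapping order of the streams. A small,
self-contained sketch, using the JDK's CheckedOutputStream as a
stand-in for the per-chunk checksum code and ByteArrayOutputStream as
the destination, purely to show the two compositions:

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

class BufferOrderDemo {
    public static void main(String[] args) throws IOException {
        int ioBufferSize = 64 * 1024;  // stand-in for io.buffer.size

        // Current order: bytes sit in the large, long-lived buffer *before*
        // they are checksummed, so a bit flip there is folded into the
        // checksum and never detected.
        OutputStream current = new BufferedOutputStream(
            new CheckedOutputStream(new ByteArrayOutputStream(), new CRC32()),
            ioBufferSize);

        // Reversed order: bytes are checksummed as soon as the application
        // writes them and only then enter the large buffer, so a flip there
        // fails verification later instead of going unnoticed.
        OutputStream reversed = new CheckedOutputStream(
            new BufferedOutputStream(new ByteArrayOutputStream(), ioBufferSize),
            new CRC32());

        current.write("example data".getBytes());
        reversed.write("example data".getBytes());
        current.close();
        reversed.close();
    }
}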
This is discussed in https://issues.apache.org/jira/browse/HADOOP-928.
Perhaps a new issue should be filed to reverse the order of these
buffers, so that data is checksummed before entering the larger,
longer-lived buffer?
This reversal still does not help block-level CRCs. We could remove
buffering altogether at the FileSystem level and let the FS
implementations decide how to buffer.
Raghu.
Doug