[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857190#action_12857190 ]
sam rash commented on HDFS-1057:
--------------------------------

update: problem refinement/issues:

First, we can't recompute all partial-chunk CRCs without a regression: blocks on disk that are not being written to may have errors that we would then miss. This means we have to be more careful and only recompute a checksum that we can guarantee is mismatched, or is very very likely to be wrong.

After trying out various ways of detecting whether a block changed and recomputing the CRC for partial chunks, I generalized the problem to this: there is the data length, which is the size of the data on disk, and the metadata length, which is the amount of data for which we have metadata (poor name -- need a better one). The problem with concurrent reads/writes is that these get out of sync. Problems that could occur, or have actually occurred:

1. CRC error 1: BlockSender is created when data length > metadata length. This produces a CRC error because the checksum stored for the last chunk covers less data than is sent.
2. CRC error 2: when reading from disk, more metadata is available than blockLength (the data size is in fact greater than it was when the BlockSender was created).
3. EOF error: similar to 1, but in theory an EOF could be encountered when reading checksumIn (haven't seen this yet).

Solution:

1. We need to guarantee that when we create a BlockSender (or anything that will direct reading of data), blockLength <= metadata length <= data length (blockLength being what we consider available to read).
2. We detect when the block changes after we get the blockLength (i.e., after creating the BlockSender), so we know whether we should recompute partial-chunk CRCs.

In this way, if we guarantee #1 and implement #2, we know when the CRC will be invalid and can recompute it without any regression.

For #1, we already have the ongoing-creates concept, which keeps some in-memory information about blocks being written to (new or append). We add a 'visible length' attribute, which we also expose as FSDatasetInterface.getVisibleLength(Block b). This is an in-memory length of the block that we update only after flushing both data and metadata to disk. For blocks not being appended to, this function delegates to the original FSDatasetInterface.getLength(Block b), which is unchanged. If BlockSender uses this for the blockLength, #1 above holds (see the first sketch below).

For #2, we 'memoize' some info about the block when we create the BlockSender. In particular, we get the actual length of the block on disk and store it. On each packet send, we compare the expected length of our input stream to what is actually there. If there is more data, the block has changed and the CRC data for the last partial chunk can't be trusted (second sketch below).

Test cases: inspired by Todd's (I could not apply the patch to 0.20), I create two threads and have one write very small data sizes and call sync. Another thread opens the file, reads to the end, remembers that position, and resumes from there on the next pass. This reliably reproduces the errors within a few seconds, and runs 10-15s when there is no error (tunable). In order to detect data corruption (which I saw with some particular bugs), I made the data written such that the byte at position X is X % 127. This helps catch issues that CRC recomputation might otherwise hide (a standalone version of this pattern is sketched at the end).
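A minimal sketch of the 'visible length' idea for #1, assuming hypothetical names (VisibleLengthTracker, writePacket); only FSDatasetInterface.getVisibleLength/getLength come from the comment above, and this is not the actual patch -- it just shows the ordering: flush data and checksum first, then advance the length readers may see.

{code:java}
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: readers are only told about bytes that have already
// been flushed to both the data file and the checksum (metadata) file.
public class VisibleLengthTracker {

  // visible length per block id, updated only after a successful flush
  private final Map<Long, Long> visibleLength = new ConcurrentHashMap<>();

  /**
   * Writer path: append a packet, flush data and checksum streams,
   * and only then advance the length readers are allowed to see.
   */
  public void writePacket(long blockId,
                          FileOutputStream dataOut,
                          FileOutputStream checksumOut,
                          byte[] data,
                          byte[] checksums,
                          long newLength) throws IOException {
    dataOut.write(data);
    checksumOut.write(checksums);
    // flush before exposing the new length to concurrent readers
    dataOut.flush();
    checksumOut.flush();
    visibleLength.put(blockId, newLength);  // readers may now see newLength
  }

  /**
   * Reader path: length a concurrent reader is allowed to read up to.
   * For blocks that are not under construction this simply delegates to
   * the on-disk length (the unchanged getLength in the comment above).
   */
  public long getVisibleLength(long blockId, long finalizedOnDiskLength) {
    Long len = visibleLength.get(blockId);
    return len != null ? len : finalizedOnDiskLength;
  }
}
{code}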
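And a sketch of the #2 check: memoize the on-disk length when the sender is created and, before sending, recompute the trailing partial-chunk checksum if the file has grown. Class and field names here are illustrative only; the real change lives inside BlockSender.

{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Hypothetical sketch: remember how long the block file was when the sender
// was created; if it has grown since, the stored checksum for the trailing
// partial chunk may cover fewer bytes than we are about to send, so
// recompute it from the data instead of trusting the meta file.
public class PartialChunkCheck {

  private final File blockFile;
  private final long initialOnDiskLength;  // memoized at "BlockSender" creation
  private final int chunkSize;             // bytes covered by one checksum

  public PartialChunkCheck(File blockFile, int chunkSize) {
    this.blockFile = blockFile;
    this.chunkSize = chunkSize;
    this.initialOnDiskLength = blockFile.length();
  }

  /** True if the block grew since we captured its length. */
  public boolean blockChanged() {
    return blockFile.length() > initialOnDiskLength;
  }

  /**
   * Checksum to send for the trailing partial chunk ending at sendUpTo.
   * If the block changed underneath us, recompute from the data; otherwise
   * the stored checksum (read from the meta file) can be trusted.
   */
  public long lastChunkChecksum(long sendUpTo, long storedChecksum)
      throws IOException {
    if (!blockChanged()) {
      return storedChecksum;
    }
    long chunkStart = (sendUpTo / chunkSize) * chunkSize;
    int partialLen = (int) (sendUpTo - chunkStart);
    byte[] buf = new byte[partialLen];
    try (RandomAccessFile raf = new RandomAccessFile(blockFile, "r")) {
      raf.seek(chunkStart);
      raf.readFully(buf);
    }
    CRC32 crc = new CRC32();
    crc.update(buf, 0, partialLen);
    return crc.getValue();
  }
}
{code}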
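Finally, a self-contained approximation of the test pattern described above, written against plain local files instead of DFS streams (flush() plus FileDescriptor.sync() stands in for the HDFS sync call); it only demonstrates the write/read loop and the X % 127 data check, not HDFS itself.

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ConcurrentReadWriteSketch {

  static final File FILE = new File("/tmp/hdfs-1057-sketch.dat");
  static volatile boolean done = false;

  public static void main(String[] args) throws Exception {
    FILE.delete();
    new FileOutputStream(FILE).close();  // make sure the reader can open it

    Thread writer = new Thread(() -> {
      try (FileOutputStream out = new FileOutputStream(FILE, true)) {
        long pos = 0;
        long end = System.currentTimeMillis() + 10_000;  // run ~10s (tunable)
        while (System.currentTimeMillis() < end) {
          int len = 1 + (int) (Math.random() * 4);       // very small writes
          for (int i = 0; i < len; i++) {
            out.write((int) ((pos + i) % 127));          // byte at X is X % 127
          }
          pos += len;
          out.flush();
          out.getFD().sync();                            // stands in for sync()
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      } finally {
        done = true;
      }
    });

    Thread reader = new Thread(() -> {
      long pos = 0;                                      // resume point
      while (!done) {
        try (RandomAccessFile in = new RandomAccessFile(FILE, "r")) {
          long visible = in.length();                    // read to current end
          in.seek(pos);
          while (pos < visible) {
            int b = in.read();
            if (b != (int) (pos % 127)) {
              throw new AssertionError("corrupt byte at position " + pos);
            }
            pos++;
          }
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    });

    writer.start();
    reader.start();
    writer.join();
    reader.join();
    System.out.println("verified " + FILE.length() + " bytes without corruption");
  }
}
{code}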
> Concurrent readers hit ChecksumExceptions if following a writer to very end of file
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-1057
>                 URL: https://issues.apache.org/jira/browse/HDFS-1057
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable.