[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857190#action_12857190 ]

sam rash commented on HDFS-1057:
--------------------------------

update:

problem refinement/issues:

First, we can't simply recompute all partial-chunk CRCs without a regression: 
blocks on disk that are not being written to may have real errors that we 
would then miss.  This means we have to be more careful and effectively only 
recompute a checksum that we can guarantee is mismatched, or is very likely 
to be wrong.

After trying out various ways of detecting whether a block changed and 
recomputing the CRC for partial chunks, I generalized the problem to this: 
there is the data length, which is the size of the data on disk, and the 
metadata length, which is the amount of data for which we have metadata 
(poor name--need a better one).  The problem with concurrent reads/writes is 
that these two get out of sync.
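
To make these two lengths concrete, here is a small illustration (a sketch 
only: it assumes the usual 512-byte checksum chunks and 4-byte CRC32 entries, 
and the numbers and names are mine, not from the patch):

    // Illustrative only: how data length and metadata length diverge.
    // Assumes 512-byte chunks and a 4-byte CRC32 entry per chunk.
    public class LengthDivergence {
      public static void main(String[] args) {
        long dataLength = 1300;   // block file: 2 full chunks + 276 bytes
        long crcsFlushed = 2;     // only the full chunks' CRCs flushed so far
        long metadataLength = Math.min(dataLength, crcsFlushed * 512); // 1024
        // dataLength (1300) > metadataLength (1024): a reader that trusts
        // dataLength can hit the partial-chunk problems listed below.
        System.out.println(dataLength > metadataLength); // true
      }
    }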

Problems that could occur or have actually occurred:

1. CRC error 1: a BlockSender is created when data length > metadata length.  
This produces a CRC error because the CRC stored for the last chunk covers 
less data than the reader sees.
2. CRC error 2: when reading from disk, more metadata is available than 
blockLength covers (the data size is in fact greater than it was when the 
BlockSender was created).
3. EOF error: similar to 1, but in theory an EOF could be encountered when 
reading checksumIn (haven't seen this yet).

Solution:
1. we need to guarantee that when we create a BlockSender (or anything else 
that will direct the reading of data), blockLength <= metadata length <= data 
length (blockLength being what we consider available to read)
2. we detect when the block changes after we get the blockLength (i.e., after 
creating the BlockSender), so we know whether we should recompute partial 
chunk CRCs

In this way, if we guarantee #1 and implement #2, we know when the CRC will 
be invalid and can recompute it without any regression.
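
As a minimal sketch of the #1 invariant (the class and method names here are 
illustrative, not from the patch):

    // Sketch: the invariant that must hold whenever a BlockSender (or
    // anything else directing reads) is created. Names are illustrative.
    class ReadInvariant {
      static void check(long blockLength, long metadataLength, long dataLength) {
        // blockLength is what we consider available to read
        if (blockLength > metadataLength || metadataLength > dataLength) {
          throw new IllegalStateException("blockLength=" + blockLength
              + " metadataLength=" + metadataLength
              + " dataLength=" + dataLength);
        }
      }
    }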

For #1, we already have the ongoing-creates concept, which keeps some 
in-memory information about blocks being written to (new or append).  We add 
a 'visible length' attribute, which we also expose via 
FSDatasetInterface.getVisibleLength(Block b).  This is an in-memory length of 
the block that we update only after flushing both data and metadata to disk. 
For blocks not being appended to, this function delegates to the original 
FSDatasetInterface.getLength(Block b), which is unchanged.  In this way, if 
BlockSender uses this for the blockLength, #1 above holds.
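
A rough sketch of that delegation (only getVisibleLength/getLength come from 
the description above; ongoingCreates, ActiveFile, and the rest are 
illustrative):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of the visible-length idea; all names other than
    // getVisibleLength/getLength are illustrative.
    class VisibleLengthSketch {
      static class Block { /* block id, generation stamp, ... */ }

      // In-memory info about a block being written to (new or append).
      static class ActiveFile {
        private volatile long visibleLength;

        // called only after both data and metadata are flushed to disk
        void setVisibleLength(long len) { visibleLength = len; }
        long getVisibleLength() { return visibleLength; }
      }

      private final Map<Block, ActiveFile> ongoingCreates =
          new ConcurrentHashMap<Block, ActiveFile>();

      // Original on-disk length for finalized blocks (unchanged).
      long getLength(Block b) { return 0L; /* stat the block file */ }

      long getVisibleLength(Block b) {
        ActiveFile af = ongoingCreates.get(b);
        if (af != null) {
          // under write: use the in-memory length, which only advances
          // after the data+metadata flush, so #1 holds for readers
          return af.getVisibleLength();
        }
        return getLength(b); // not under write: delegate to on-disk length
      }
    }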

For #2, we 'memoize' some info about the block when we create the 
BlockSender.  In particular, we get the actual length of the block on disk 
and store it.  On each packet send, we compare the expected length of our 
input stream to what is actually there.  If there is more data, the block has 
changed, and the CRC data for the last partial chunk can't be trusted.
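
A sketch of that change detection (all names here are illustrative):

    import java.io.File;

    // Sketch of the #2 change detection for the last partial chunk.
    class BlockChangeDetector {
      private final File blockFile;
      private final long lengthAtCreation; // memoized at BlockSender creation

      BlockChangeDetector(File blockFile) {
        this.blockFile = blockFile;
        this.lengthAtCreation = blockFile.length();
      }

      // Called on each packet send: if more data is on disk than we
      // memoized, the block has changed and the on-disk CRC for the last
      // partial chunk can't be trusted (recompute it instead).
      boolean blockChanged() {
        return blockFile.length() > lengthAtCreation;
      }
    }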

Test cases: inspired by Todd's (I could not apply his patch to 0.20), I 
create two threads: one writes very small amounts of data and calls sync; the 
other opens the file, reads to the end, remembers that position, then reopens 
and resumes from there.  This reliably produces the errors within a few 
seconds, and runs for a tunable 10-15s when there is no error.  In order to 
detect data corruption (which I saw with some particular bugs), I made the 
data written such that the byte at offset X has value X % 127.  This helps 
catch issues that CRC recomputation might hide.
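
A sketch of that test (it assumes a running HDFS, e.g. a MiniDFSCluster, and 
uses sync() as in 0.20-append -- hflush() in later versions; error handling 
and thread setup are omitted, and the names are mine):

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the two-thread tail-reading stress test.
    class TailStressSketch {
      static final Path PATH = new Path("/tailStressTest");
      static volatile boolean writerDone = false;

      // Writer: very small writes, sync() after each batch.
      static void writer(FileSystem fs) throws Exception {
        FSDataOutputStream out = fs.create(PATH);
        long deadline = System.currentTimeMillis() + 15000; // tunable 10-15s
        long pos = 0;
        while (System.currentTimeMillis() < deadline) {
          int n = 1 + (int) (Math.random() * 10);
          for (int i = 0; i < n; i++) {
            out.write((int) ((pos + i) % 127)); // byte at offset X is X % 127
          }
          pos += n;
          out.sync();
        }
        out.close();
        writerDone = true;
      }

      // Reader: read to EOF, remember the position, reopen and resume.
      static void reader(FileSystem fs) throws Exception {
        long pos = 0;
        while (!writerDone) {
          FSDataInputStream in = fs.open(PATH);
          in.seek(pos);
          for (int b; (b = in.read()) != -1; pos++) {
            if (b != (int) (pos % 127)) { // detect silent corruption
              throw new AssertionError("corrupt byte at offset " + pos);
            }
          }
          in.close();
        }
      }
    }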

> Concurrent readers hit ChecksumExceptions if following a writer to very end 
> of file
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-1057
>                 URL: https://issues.apache.org/jira/browse/HDFS-1057
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
> calling flush(). Therefore, if there is a concurrent reader, it's possible to 
> race here - the reader will see the new length while those bytes are still in 
> the buffers of BlockReceiver. Thus the client will potentially see checksum 
> errors or EOFs. Additionally, the last checksum chunk of the file is made 
> accessible to readers even though it is not stable.
