[
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609814#action_12609814
]
Jothi Padmanabhan commented on HADOOP-3514:
-------------------------------------------
bq. You may also need to compute two checksums per map output segment: one for
compressed data and one for uncompressed data.
Having only one checksum (or two, for the compressed case) for the whole file
would reduce the checksum overhead (as measured with data), but it might not be
the most efficient approach. Reducers would identify checksum problems only
after reading and processing all of the map output data. If the corruption
happened at the beginning of the output, this would mean doing reduce
computations that get discarded anyway. If we had a checksum for every chunk
(512 bytes), we could detect checksum problems early, at 512-byte granularity,
and fail fast.
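To make the trade-off concrete, here is a minimal sketch of the per-chunk idea
(hypothetical helper class, not the actual IFile code): each 512-byte chunk is
written followed by its CRC32, so the reducer can verify as it reads and abort
on the first corrupt chunk instead of after consuming the whole segment.
{code:java}
import java.io.*;
import java.util.zip.CRC32;

// Sketch only: interleave a CRC32 with every 512-byte chunk so corruption
// is detected at chunk granularity rather than at end of file.
public class ChunkedChecksumSketch {
  private static final int CHUNK = 512;

  // Writer: emit [chunkLength][chunkBytes][crc32] records.
  public static void write(InputStream data, DataOutputStream out) throws IOException {
    byte[] buf = new byte[CHUNK];
    CRC32 crc = new CRC32();
    int n;
    while ((n = data.read(buf, 0, CHUNK)) > 0) {
      crc.reset();
      crc.update(buf, 0, n);
      out.writeInt(n);               // length of this chunk
      out.write(buf, 0, n);          // chunk payload
      out.writeLong(crc.getValue()); // inline checksum for just this chunk
    }
    out.writeInt(0);                 // end-of-stream marker
  }

  // Reader: verify each chunk as it arrives and fail fast on the first
  // mismatch, instead of discovering corruption only after the whole
  // segment has been consumed.
  public static void readAndVerify(DataInputStream in, OutputStream sink) throws IOException {
    byte[] buf = new byte[CHUNK];
    CRC32 crc = new CRC32();
    int n;
    while ((n = in.readInt()) > 0) {
      in.readFully(buf, 0, n);
      long expected = in.readLong();
      crc.reset();
      crc.update(buf, 0, n);
      if (crc.getValue() != expected) {
        throw new IOException("Checksum error in chunk; aborting fetch early");
      }
      sink.write(buf, 0, n);
    }
  }
}
{code}
In this sketch the cost is 12 extra bytes per 512-byte chunk (a 4-byte length
plus an 8-byte CRC), which is exactly the overhead the whole-file checksum
avoids; the question is whether early failure is worth it.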
bq. You can use the uncompressed size as a kind of validation
While the uncompressed size could certainly serve as an extra validity check, I
think it cannot be used in isolation (without the CRC checks). A mismatch
between the actual and expected uncompressed sizes implies failure, but
identical values do not guarantee the absence of errors. I do not know whether
it is a foolproof method for detecting corruption.
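A toy illustration of why matching sizes are not sufficient (illustration only,
not from the patch): a single flipped bit leaves the length unchanged, so a
size comparison passes while a CRC32 over the same bytes catches it.
{code:java}
import java.util.zip.CRC32;

// Toy example: a bit flip preserves length, so a size check passes
// while a CRC32 comparison detects the corruption.
public class SizeVsCrcSketch {
  static long crcOf(byte[] b) {
    CRC32 c = new CRC32();
    c.update(b, 0, b.length);
    return c.getValue();
  }

  public static void main(String[] args) {
    byte[] original = "some map output bytes".getBytes();
    byte[] corrupted = original.clone();
    corrupted[3] ^= 0x01;  // flip one bit in transit

    System.out.println("sizes equal: " + (original.length == corrupted.length)); // true
    System.out.println("crcs equal:  " + (crcOf(original) == crcOf(corrupted))); // false
  }
}
{code}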
> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
> Key: HADOOP-3514
> URL: https://issues.apache.org/jira/browse/HADOOP-3514
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.18.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc
> into the iFile rather than having a separate file.