[
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609814#action_12609814
]
Jothi Padmanabhan commented on HADOOP-3514:
-------------------------------------------
bq. You may also need to compute two checksums per map output segment: one for
compressed data and one for uncompressed data.
Having only one checksum (or two, for the compressed case) for the whole file
would reduce the checksum overhead (as measured with data), but it might not be
the most efficient approach. Reducers would identify checksum problems only
after reading and processing all of the map output data. If the corruption
happened at the beginning of the output, this would mean doing reduce
computations that get discarded anyway. If we had a checksum for every chunk
(512 bytes), we could detect checksum problems early, at 512-byte granularity,
and fail fast.
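To make the trade-off concrete, here is a minimal sketch of the per-chunk idea
(hypothetical helper class, not the actual IFile code): each 512-byte chunk is
written followed by its CRC32, so the reducer can verify as it reads and abort
on the first corrupt chunk instead of after consuming the whole segment.
{code:java}
import java.io.*;
import java.util.zip.CRC32;

// Sketch only: interleave a CRC32 with every 512-byte chunk so corruption
// is detected at chunk granularity rather than at end of file.
public class ChunkedChecksumSketch {
  private static final int CHUNK = 512;

  // Writer: emit [chunkLength][chunkBytes][crc32] records.
  public static void write(InputStream data, DataOutputStream out) throws IOException {
    byte[] buf = new byte[CHUNK];
    CRC32 crc = new CRC32();
    int n;
    while ((n = data.read(buf, 0, CHUNK)) > 0) {
      crc.reset();
      crc.update(buf, 0, n);
      out.writeInt(n);               // length of this chunk
      out.write(buf, 0, n);          // chunk payload
      out.writeLong(crc.getValue()); // inline checksum for just this chunk
    }
    out.writeInt(0);                 // end-of-stream marker
  }

  // Reader: verify each chunk as it arrives and fail fast on the first
  // mismatch, instead of discovering corruption only after the whole
  // segment has been consumed.
  public static void readAndVerify(DataInputStream in, OutputStream sink) throws IOException {
    byte[] buf = new byte[CHUNK];
    CRC32 crc = new CRC32();
    int n;
    while ((n = in.readInt()) > 0) {
      in.readFully(buf, 0, n);
      long expected = in.readLong();
      crc.reset();
      crc.update(buf, 0, n);
      if (crc.getValue() != expected) {
        throw new IOException("Checksum error in chunk; aborting fetch early");
      }
      sink.write(buf, 0, n);
    }
  }
}
{code}
In this sketch the cost is 12 extra bytes per 512-byte chunk (a 4-byte length
plus an 8-byte CRC), which is exactly the overhead the whole-file checksum
avoids; the question is whether early failure is worth it.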
bq. You can use the uncompressed size as a kind of validation
While the uncompressed size could certainly serve as an extra validity check, I
think it cannot be used in isolation (without the CRC checks). A mismatch
between the actual and expected uncompressed sizes implies failure, but
identical values do not guarantee the absence of errors. I do not know whether
it is a foolproof method for detecting corruption.
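A toy illustration of why matching sizes are not sufficient (illustration only,
not from the patch): a single flipped bit leaves the length unchanged, so a
size comparison passes while a CRC32 over the same bytes catches it.
{code:java}
import java.util.zip.CRC32;

// Toy example: a bit flip preserves length, so a size check passes
// while a CRC32 comparison detects the corruption.
public class SizeVsCrcSketch {
  static long crcOf(byte[] b) {
    CRC32 c = new CRC32();
    c.update(b, 0, b.length);
    return c.getValue();
  }

  public static void main(String[] args) {
    byte[] original = "some map output bytes".getBytes();
    byte[] corrupted = original.clone();
    corrupted[3] ^= 0x01;  // flip one bit in transit

    System.out.println("sizes equal: " + (original.length == corrupted.length)); // true
    System.out.println("crcs equal:  " + (crcOf(original) == crcOf(corrupted))); // false
  }
}
{code}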
> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
> Key: HADOOP-3514
> URL: https://issues.apache.org/jira/browse/HADOOP-3514
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.18.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc
> into the iFile rather than having a separate file.