[
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610038#action_12610038
]
Devaraj Das commented on HADOOP-3514:
-------------------------------------
OK, just to be sure everyone is on the same page, here is what happens today:
1) The MapTask computes the checksum of the map outputs and writes it to a
separate file, one checksum per io.bytes.per.checksum bytes of data (a sketch
of this follows the list).
2) The TaskTracker validates the checksum of a map output when it serves it to
a ReduceTask. If validation fails, the corresponding map task is declared
killed.
3) The ReduceTask gets the raw bytes and writes them either to the in-memory
buffer or to disk. In the latter case (and only in the latter case), the data
is checksummed again. Later, during the merge and/or the reducer invocation,
the checksum is validated and the task fails if validation fails at any point.
4) In all of the above, data corruption on the wire goes undetected.
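To make the per-chunk scheme in (1) concrete, here is a minimal sketch of
chunked CRC32 generation; the class and method names below are made up for
illustration, and this is not the actual Hadoop checksum/IFile code:

import java.io.*;
import java.util.zip.CRC32;

// Illustrative only: one CRC32 per bytesPerChecksum-sized chunk of the map
// output, written to a side file (mirrors the separate checksum file scheme).
public class ChunkedChecksumWriter {

  public static void writeChecksums(File data, File crcFile,
                                    int bytesPerChecksum) throws IOException {
    byte[] chunk = new byte[bytesPerChecksum];
    CRC32 crc = new CRC32();
    try (InputStream in = new BufferedInputStream(new FileInputStream(data));
         DataOutputStream out = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(crcFile)))) {
      int n;
      while ((n = readChunk(in, chunk)) > 0) {
        crc.reset();
        crc.update(chunk, 0, n);             // checksum covers this chunk only
        out.writeInt((int) crc.getValue());  // 4 bytes of checksum per chunk
      }
    }
  }

  // Fills the buffer as far as possible so chunk boundaries stay stable.
  private static int readChunk(InputStream in, byte[] buf) throws IOException {
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0) break;
      off += n;
    }
    return off;
  }
}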
We do see (2) happening, and occasionally we see (3). But we have no way to
determine whether (4) happens; if the sort validator is any indicator, it is
really rare.
But since we are changing the way checksums are handled for intermediate data,
I think we should take care of (4) in the framework as well. Thoughts?
I'd propose the following:
1) Checksum only the data that hits the disk on the map side. That is, if the
map output is compressed, we checksum the compressed data; if map output
compression is off, we checksum the uncompressed data.
2) Upon a ReduceTask's request, the TaskTracker reads the map output data from
disk and validates the checksum as it does now. If there is a checksum per
io.bytes.per.checksum bytes, the validation happens inline; if there is one
checksum for the entire map output, the TaskTracker validates the checksum it
computes and updates internally against the one it sees at the end of the map
output.
3) The TaskTracker kills the map if validation fails. If validation succeeds,
the map output, along with the checksum, is sent to the ReduceTask.
4) The ReduceTask then validates the checksum of the raw data (compressed or
uncompressed) as it reads the data off the socket (a sketch of this check
follows the list). If the data is destined for the in-memory buffer, it is
decompressed and put in RAM without the checksum. If it is destined for disk,
the raw data, along with the checksum, is written to disk as is.
5) When the final merge happens and/or the reducer invocation starts, the data
read from disk is again subject to checksum validation.
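A rough sketch of the reduce-side check in (4), assuming one trailing CRC32
per map output; the class and method below are made up for illustration and
this is not the actual shuffle code:

import java.io.*;
import java.util.zip.CRC32;

// Illustrative only: copies the raw map output bytes off the socket while
// verifying a single trailing CRC32.
public class TrailingCrcReader {

  // Copies exactly payloadLength raw bytes from "in" to "out" (the in-memory
  // buffer or the on-disk file), then reads the 4-byte CRC32 that follows and
  // checks it against the CRC computed over the payload.
  public static void copyAndVerify(InputStream in, OutputStream out,
                                   long payloadLength) throws IOException {
    CRC32 crc = new CRC32();
    byte[] buf = new byte[64 * 1024];
    long remaining = payloadLength;
    while (remaining > 0) {
      int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
      if (n < 0) {
        throw new EOFException("map output truncated");
      }
      crc.update(buf, 0, n);
      out.write(buf, 0, n);                 // raw bytes pass through unchanged
      remaining -= n;
    }
    int expected = new DataInputStream(in).readInt();   // trailing checksum
    if (expected != (int) crc.getValue()) {
      throw new IOException("checksum mismatch: map output corrupted on the wire");
    }
  }
}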
In (5) above, the question of fail-fast vs. fail-late comes into the picture:
imagine the case where the intermediate files are large, the reducer is slow in
processing the records, and the data is corrupt at the very beginning of the
file; with a single checksum per map output, the corruption is detected only
after the whole output has been read. But I too think we can accept this cost,
since the number of such failures will typically be very small. So if things
are kept simple by having one checksum for the entire map output everywhere, we
should go for that. What do others think?
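For concreteness, here is a rough sketch of what a single-trailing-checksum
writer could look like on the map side; the class below is made up for
illustration and is not the existing IFile code:

import java.io.*;
import java.util.zip.CRC32;

// Illustrative only: computes one CRC32 over everything written through it
// and appends the 4-byte checksum on close, so the map output and its
// checksum share a single file.
public class SingleCrcOutputStream extends FilterOutputStream {

  private final CRC32 crc = new CRC32();

  public SingleCrcOutputStream(OutputStream out) {
    super(out);
  }

  @Override
  public void write(int b) throws IOException {
    crc.update(b);
    out.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    crc.update(b, off, len);
    out.write(b, off, len);   // avoid FilterOutputStream's byte-at-a-time copy
  }

  @Override
  public void close() throws IOException {
    new DataOutputStream(out).writeInt((int) crc.getValue());  // trailing checksum
    super.close();
  }
}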
> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
> Key: HADOOP-3514
> URL: https://issues.apache.org/jira/browse/HADOOP-3514
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.18.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc
> into the iFile rather than having a separate file.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.