[ https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610038#action_12610038 ]

Devaraj Das commented on HADOOP-3514:
-------------------------------------

Ok, just to be sure everyone is on the same page, here is what happens today:
1) The MapTask computes the checksum of the map outputs and writes it to a 
separate file - one chunk of checksum bytes per io.bytes.per.checksum bytes of 
data (see the sketch after this list).
2) The TaskTracker validates the checksum of a map output when it serves it to 
a ReduceTask. If validation fails, the corresponding map task is declared 
killed.
3) The ReduceTask gets the raw bytes and writes them either to the in-memory 
buffer or to disk. In the latter case (and only in the latter case), the data 
is checksummed again. Later, during the merge and/or the reducer invocation, 
the checksum is validated and the task fails if validation fails at any point.
4) In the above, data corruption on the wire goes undetected.
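
To make (1) concrete, here is a minimal sketch of the chunked-checksum scheme, 
assuming CRC32 and the default io.bytes.per.checksum of 512. The class and 
method names are illustrative only, not the actual MapTask/IFile code.

{code:java}
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

// Illustrative sketch only -- not the actual MapTask/IFile code.
// Writes one CRC32 value per io.bytes.per.checksum bytes of output into a
// separate checksum file, mirroring the scheme described in (1).
public class ChunkedChecksumWriter {
  private static final int BYTES_PER_CHECKSUM = 512; // io.bytes.per.checksum default

  public static void write(byte[] mapOutput, File dataFile, File checksumFile)
      throws IOException {
    try (OutputStream data = new FileOutputStream(dataFile);
         DataOutputStream sums =
             new DataOutputStream(new FileOutputStream(checksumFile))) {
      CRC32 crc = new CRC32();
      for (int off = 0; off < mapOutput.length; off += BYTES_PER_CHECKSUM) {
        int len = Math.min(BYTES_PER_CHECKSUM, mapOutput.length - off);
        data.write(mapOutput, off, len);
        crc.reset();
        crc.update(mapOutput, off, len);
        sums.writeInt((int) crc.getValue()); // one checksum per data chunk
      }
    }
  }
}
{code}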

We do see (2) happening, and occasionally we see (3). We have no way to 
determine whether (4) happens, but if the sort validator is any indicator, 
it is really rare. 

But since we are changing the way checksums are handled for intermediate data, 
I think we should take care of (4) in the framework as well. Thoughts?

I'd propose the following:

1) Checksum only the data that hits the disk on the map side. That is, if the 
output is compressed we checksum the compressed data, and if map output 
compression is off we checksum the uncompressed data.
2) Upon a ReduceTask's request, the TaskTracker reads the map output data from 
disk and validates the checksum as it does now. If there is one checksum per 
io.bytes.per.checksum, the validation happens inline; if there is one checksum 
for the entire map output, the TaskTracker validates the checksum it computes 
internally against the one it sees at the end of the map output.
3) The TaskTracker kills the map if the validation fails. If the validation 
succeeds, the map output along with the checksum is sent to the ReduceTask.
4) The ReduceTask then validates the checksum of the raw data (compressed or 
uncompressed) as it reads the data off the socket (see the sketch after this 
list). Where the data is destined for the in-memory buffer, it is decompressed 
and put in RAM (without the checksum). Where the data is destined for disk, 
the raw data (along with the checksum) is written to disk as is.
5) When the final merge happens and/or the reducer invocation starts, the data 
is read from disk and is again validated for checksum errors.
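
As a rough sketch of the reduce side of (4), assuming the simple variant where 
a single CRC32 checksum for the entire map output is appended at the end of 
the stream. The class, method, and stream layout here are assumptions for 
illustration, not the actual shuffle code.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.CRC32;

// Illustrative sketch only -- assumes the sender appends a single 4-byte CRC32
// of the raw (possibly compressed) map output at the end of the stream.
public class ShuffleChecksumReader {

  /** Reads 'length' raw bytes off the socket stream and verifies the trailing checksum. */
  public static byte[] readAndVerify(DataInputStream socketIn, long length)
      throws IOException {
    CRC32 crc = new CRC32();
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    byte[] buf = new byte[64 * 1024];
    long remaining = length;
    while (remaining > 0) {
      int n = socketIn.read(buf, 0, (int) Math.min(buf.length, remaining));
      if (n < 0) {
        throw new EOFException("unexpected end of map output stream");
      }
      crc.update(buf, 0, n);  // running checksum over the raw bytes
      raw.write(buf, 0, n);   // these bytes go to the in-memory buffer or to disk
      remaining -= n;
    }
    int expected = socketIn.readInt(); // trailing checksum written by the TaskTracker
    if ((int) crc.getValue() != expected) {
      throw new IOException("map output checksum mismatch: corruption on the wire");
    }
    return raw.toByteArray();
  }
}
{code}

Note that with a single trailing checksum the reducer only discovers 
corruption after the whole segment has been read, which is the fail-late cost 
discussed in the next paragraph.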

In (5) above, the trade-off between failing fast and failing late comes into 
the picture: imagine the case where we have large intermediate files, the 
reducer is slow in processing the records, and the data is corrupt right at 
the beginning of the file. With a single checksum for the whole output, that 
corruption is detected only after the entire file has been read. But I too 
think that we could take this cost, since the number of such failures will 
typically be very small. So, if things are kept simple by having one checksum 
for the entire map output everywhere, then we should go for that. 

What do others think?

> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
>                 Key: HADOOP-3514
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3514
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>             Fix For: 0.19.0
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc 
> into the iFile rather than having a separate file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
