[ 
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623931#action_12623931
 ] 

Devaraj Das commented on HADOOP-3514:
-------------------------------------

Arun, storing the checksum in the index file would work for the map side of 
things, though I should say that I don't clearly see the benefits of that 
approach over what is done in the patch. Reading the checksum from an HTTP 
header or as the last 4 bytes of the data that we are transferring seems to 
have the same amount of complexity, no? Since the intended use case for the 
checksum input/output streams is IFile, I think it is OK to assume that the 
corresponding server/client (MapOutputServlet and the shuffle framework code) 
are well behaved and are built to handle this kind of data streaming.
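
To make the "last 4 bytes" idea concrete, here is a minimal sketch of a 
payload framed with a trailing CRC32. It is only an illustration: the class 
TrailingCrc and its methods are hypothetical names and are not the stream 
classes from the patch, and the big-endian 4-byte trailer is an assumption.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper for illustration; not the classes from the patch.
class TrailingCrc {
    // Returns the payload followed by its CRC32 as a 4-byte big-endian trailer.
    static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 4)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    // Splits off the trailer, recomputes the CRC32 over the payload, and
    // throws if the two values differ.
    static byte[] decode(byte[] framed) {
        if (framed.length < 4) {
            throw new IllegalArgumentException("too short for a checksum trailer");
        }
        ByteBuffer buf = ByteBuffer.wrap(framed);
        byte[] payload = new byte[framed.length - 4];
        buf.get(payload);
        int stored = buf.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != stored) {
            throw new IllegalStateException("checksum mismatch");
        }
        return payload;
    }
}
```

The reader needs nothing besides the bytes themselves, which is why the same 
framing can serve both the HTTP transfer and a file written to disk.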

On the reduce side, we do write the map outputs to disk (either directly or as 
a result of an in-memory merge). When we read this data back later, we need to 
validate the checksum. Storing the checksum in the metadata file won't work 
here, since we don't have a metadata file on the reduce side. One could argue 
that we could build this metadata in memory (a mapping from disk files to 
their checksums) and do validations by looking up this mapping. But I can't 
clearly see that being a less complex system.
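
For comparison, the in-memory alternative described above could look roughly 
like the sketch below. InMemoryChecksumIndex and its methods are hypothetical 
names, and keying the map by a path string is an assumption made for the 
example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch of a reduce-side mapping from disk files to checksums.
class InMemoryChecksumIndex {
    private final Map<String, Long> checksums = new HashMap<>();

    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Called when map output is written to disk on the reduce side.
    void recordWrite(String path, byte[] data) {
        checksums.put(path, crcOf(data));
    }

    // Called when the file is read back; an unknown path fails validation.
    boolean verifyRead(String path, byte[] data) {
        Long expected = checksums.get(path);
        return expected != null && expected == crcOf(data);
    }
}
```

Every write path and every read path has to keep this map in sync with the 
files on disk, which is the extra bookkeeping the inline checksum avoids.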

> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
>                 Key: HADOOP-3514
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3514
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3514-v1.patch, hadoop-3514-v2.patch, 
> hadoop-3514-v3.patch, hadoop-3514-v4.patch, hadoop-3514-v5.patch, 
> hadoop-3514-v6.patch, hadoop-3514-v7.patch, hadoop-3514-v8.patch, 
> hadoop-3514.patch
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc 
> into the iFile rather than having a separate file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
