[ 
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623921#action_12623921
 ] 

Arun C Murthy commented on HADOOP-3514:
---------------------------------------

Let me lay out the reasons why I prefer to keep the checksum separate:

1. Greatly simplifies ChecksumInputStream: you don't have to track how much 
data you've read from the underlying stream at all; you just keep a running 
checksum. In fact, ChecksumInputStream.read would not need an 
IOUtils.fullyRead at all. I think that's worth considering. The validation can 
be done when ChecksumInputStream.close is called. The 'known' checksum is 
given to the ChecksumInputStream via its constructor.
2. We already have metadata about the IFile stored separately: the compressed 
and decompressed lengths - the checksum is a logical extension.
3. The fact that you have to 'explicitly' send the checksum from 
MapOutputServlet, _after_ sending out the IFile, to fake that the checksum is 
part of the data is indicative of its brittle nature: it could lead to all 
kinds of obscure, hard-to-fix bugs.
4. W.r.t. using it for the index: I'd rather see us go in the direction where 
we keep the index in memory, which means we don't need checksums for it at 
all. So, overall I still think this is the only use of this particular 
checksumming input/output stream. If necessary we can revisit this later.
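
To make point 1 concrete, here is a rough sketch of the kind of stream I have 
in mind. It is not the code from the attached patches: the class name 
RunningChecksumInputStream is made up, CRC32 is assumed as the checksum, and 
the sketch assumes the caller reads the stream to the end before calling 
close (skip and mark/reset are not handled).

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical class for illustration only; not part of Hadoop.
class RunningChecksumInputStream extends FilterInputStream {
  // Running checksum over every byte handed back to the caller.
  private final Checksum sum = new CRC32();
  // The 'known' checksum, carried as separate metadata and passed in here.
  private final long expected;

  RunningChecksumInputStream(InputStream in, long expected) {
    super(in);
    this.expected = expected;
  }

  @Override
  public int read() throws IOException {
    int b = in.read();
    if (b != -1) {
      sum.update(b);
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    // No need to know how much data remains: just fold in what was read.
    int n = in.read(buf, off, len);
    if (n > 0) {
      sum.update(buf, off, n);
    }
    return n;
  }

  @Override
  public void close() throws IOException {
    try {
      // Validation happens once, at close, against the known checksum.
      if (sum.getValue() != expected) {
        throw new IOException("Checksum mismatch: expected " + expected
            + ", computed " + sum.getValue());
      }
    } finally {
      in.close();
    }
  }
}
```

Note that read never has to reason about where the data ends and a trailing 
checksum begins, which is the simplification I am arguing for.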

Thoughts?

> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
>                 Key: HADOOP-3514
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3514
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3514-v1.patch, hadoop-3514-v2.patch, 
> hadoop-3514-v3.patch, hadoop-3514-v4.patch, hadoop-3514-v5.patch, 
> hadoop-3514-v6.patch, hadoop-3514-v7.patch, hadoop-3514-v8.patch, 
> hadoop-3514.patch
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc 
> into the iFile rather than having a separate file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
