[
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623921#action_12623921
]
Arun C Murthy commented on HADOOP-3514:
---------------------------------------
Let me try to express the reasons for which I prefer to keep the checksum
separate:
1. Greatly simplifies ChecksumInputStream: you don't have to worry about how
much data you've read from the underlying stream at all... just keep a running
checksum - in fact you wouldn't need an IOUtils.fullyRead at all in
ChecksumInputStream.read. I think that's worth considering. The validation can
be done when ChecksumInputStream.close is called. The 'known' checksum is
given to the ChecksumInputStream via its constructor.
2. We already have metadata about the IFile stored separately: the compressed
and decompressed lengths - the checksum is a logical extension.
3. The fact that you have to 'explicitly' send the checksum from
MapOutputServlet, _after_ sending out the IFile, to fake that the checksum is
part of the data is indicative of its brittle nature - it could lead to all
kinds of obscure, hard-to-fix bugs.
4. W.r.t. using it for the index: I'd rather see us go in the direction where
we keep the index in-memory, which means we don't need checksums for it at all.
So, overall I still think this is the only use of this particular checksumming
input/output stream. If necessary we can revisit this later.
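To make point 1 concrete, here is a rough sketch of what I have in mind. It is
not taken from any of the attached patches: the class body, the choice of
java.util.zip.CRC32, and the error message are all assumptions for
illustration. The stream just keeps a running checksum as bytes pass through
and compares it against the 'known' value on close, assuming the caller has
consumed the whole stream by then.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical sketch only; not the ChecksumInputStream from the patches.
class ChecksumInputStream extends FilterInputStream {
    private final CRC32 crc = new CRC32();  // running checksum (CRC32 is an assumption)
    private final long expected;            // 'known' checksum, kept outside the data

    ChecksumInputStream(InputStream in, long expected) {
        super(in);
        this.expected = expected;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            crc.update(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // No need to track how much has been read so far: just fold
        // whatever the underlying stream returned into the checksum.
        int n = in.read(buf, off, len);
        if (n > 0) {
            crc.update(buf, off, n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Skip by reading, so skipped bytes still count toward the checksum.
        byte[] tmp = new byte[4096];
        long total = 0;
        while (total < n) {
            int r = read(tmp, 0, (int) Math.min(tmp.length, n - total));
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }

    @Override
    public boolean markSupported() {
        return false;  // rewinding would corrupt the running checksum
    }

    @Override
    public void close() throws IOException {
        in.close();
        // Validation happens once, here; this is only meaningful if the
        // caller has read the stream to the end.
        if (crc.getValue() != expected) {
            throw new IOException("Checksum mismatch: expected " + expected
                    + ", computed " + crc.getValue());
        }
    }
}
```

The caller would pass in the checksum it got from the separately stored
metadata, read the data as usual, and find out about corruption from the
IOException thrown by close.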
Thoughts?
> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
> Key: HADOOP-3514
> URL: https://issues.apache.org/jira/browse/HADOOP-3514
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.18.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
> Attachments: hadoop-3514-v1.patch, hadoop-3514-v2.patch,
> hadoop-3514-v3.patch, hadoop-3514-v4.patch, hadoop-3514-v5.patch,
> hadoop-3514-v6.patch, hadoop-3514-v7.patch, hadoop-3514-v8.patch,
> hadoop-3514.patch
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc
> into the iFile rather than having a separate file.