[
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609533#action_12609533
]
Jothi Padmanabhan commented on HADOOP-3514:
-------------------------------------------
Here is one possible approach to solving this issue.
1. Create all the IFiles on a RawLocalFileSystem instead of the LocalFileSystem
(LocalFileSystem extends ChecksumFileSystem, which we do not want here; see the
snippet after this list).
2. Modify all the writes to these files to go through an intermediate layer
that calculates and writes a checksum for every 512 bytes of data. On close of
the file, create and append a checksum for the data from the previous checksum
boundary to the end of the file.
3. Modify all the IFile reads to go through the intermediate layer as well,
which will do the checksum verification transparently to the calling methods.
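For step 1, assuming LocalFileSystem#getRaw() (which returns the wrapped
RawLocalFileSystem), the IFile code could obtain the raw file system along
these lines -- a minimal sketch, not the final patch:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class RawFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem.getLocal() hands back the checksumming LocalFileSystem...
    LocalFileSystem localFs = FileSystem.getLocal(conf);
    // ...while getRaw() exposes the wrapped RawLocalFileSystem, which
    // creates no .crc side files -- what we want for IFiles here.
    FileSystem rfs = localFs.getRaw();
    System.out.println(rfs.getClass().getName());
  }
}
{code}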
Modifications will be done only for the files that are written to disk; the
in-memory buffer reads/writes will not be affected. (A sketch of such an
intermediate layer follows.)
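To make steps 2 and 3 concrete, here is a rough sketch of what the
intermediate layer could look like. The class names are hypothetical, and it
uses java.util.zip.CRC32 directly (a real implementation would presumably
reuse Hadoop's checksum utilities). The writer appends a 4-byte CRC32 after
every 512-byte chunk and after the final partial chunk; the reader verifies
each chunk transparently:
{code}
import java.io.EOFException;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;

/**
 * Sketch of the write side (hypothetical class): data is written through
 * to the underlying raw stream, and a 4-byte CRC32 is appended after every
 * 512-byte chunk. close() emits the checksum for the final partial chunk.
 */
class InlineChecksumOutputStream extends FilterOutputStream {
  private static final int CHUNK = 512;
  private final CRC32 crc = new CRC32();
  private int inChunk = 0; // data bytes accumulated in the current chunk

  InlineChecksumOutputStream(OutputStream out) { super(out); }

  @Override public void write(int b) throws IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }

  @Override public void write(byte[] b, int off, int len) throws IOException {
    while (len > 0) {
      int n = Math.min(len, CHUNK - inChunk);
      out.write(b, off, n);
      crc.update(b, off, n);
      inChunk += n; off += n; len -= n;
      if (inChunk == CHUNK) writeChecksum(); // chunk boundary: emit CRC
    }
  }

  @Override public void close() throws IOException {
    if (inChunk > 0) writeChecksum(); // CRC for the trailing partial chunk
    super.close();
  }

  private void writeChecksum() throws IOException {
    int v = (int) crc.getValue(); // low 32 bits of the CRC
    out.write(v >>> 24); out.write(v >>> 16); out.write(v >>> 8); out.write(v);
    crc.reset();
    inChunk = 0;
  }
}

/**
 * Sketch of the read side (hypothetical class): reads frames of up to 512
 * data bytes plus a 4-byte CRC, verifies each frame, and serves only the
 * data bytes, so callers never see the checksums. A real version would
 * also implement the bulk read(byte[], int, int) for efficiency.
 */
class InlineChecksumInputStream extends InputStream {
  private static final int CHUNK = 512;
  private final InputStream in;
  private final byte[] frame = new byte[CHUNK + 4];
  private final CRC32 crc = new CRC32();
  private int pos = 0, dataLen = 0;

  InlineChecksumInputStream(InputStream in) { this.in = in; }

  @Override public int read() throws IOException {
    if (pos == dataLen && !fill()) return -1;
    return frame[pos++] & 0xff;
  }

  /** Reads and verifies the next frame; returns false on a clean EOF. */
  private boolean fill() throws IOException {
    int n = 0, r;
    while (n < frame.length && (r = in.read(frame, n, frame.length - n)) != -1) {
      n += r;
    }
    if (n == 0) return false;                   // clean end of file
    if (n < 5) throw new EOFException("truncated checksum frame");
    dataLen = n - 4;                            // last 4 bytes are the CRC
    crc.reset();
    crc.update(frame, 0, dataLen);
    int stored = ((frame[dataLen] & 0xff) << 24) | ((frame[dataLen + 1] & 0xff) << 16)
               | ((frame[dataLen + 2] & 0xff) << 8) |  (frame[dataLen + 3] & 0xff);
    if ((int) crc.getValue() != stored) {
      throw new IOException("checksum error in IFile chunk");
    }
    pos = 0;
    return true;
  }
}
{code}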
This approach has the same checksum overhead as the existing scheme; the only
difference is that the checksums are stored inline, in the same file as the data.
An alternative approach would be record-level checksums (a checksum for every
key/value pair). That could turn out to be costlier for small records, where
the checksum becomes a sizeable overhead relative to the record length.
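To put rough numbers on that comparison, assuming 4-byte CRC32 checksums: with
512-byte chunks the space overhead is a constant 4/512, i.e. under 1%,
independent of record size. With record-level checksums, a 10-byte key/value
pair would pay 4/14, roughly 29% overhead, while a 1 KB record would pay only
about 0.4%.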
Comments?
> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
> Key: HADOOP-3514
> URL: https://issues.apache.org/jira/browse/HADOOP-3514
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.18.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
>
> The number of seeks can be reduced by half in the IFile if we move the CRC
> into the IFile rather than having a separate file.