[
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609533#action_12609533
]
Jothi Padmanabhan commented on HADOOP-3514:
-------------------------------------------
Here is one possible approach to solving this issue.
1. Create all the IFiles on a RawLocalFileSystem instead of the LocalFileSystem
(LocalFileSystem extends ChecksumFileSystem, which we do not want here; see the
snippet after this list).
2. Modify all the writes to these files to go through an intermediate layer
that calculates and writes a checksum for every 512 bytes of data. On close of
the file, create and append a checksum for the data from the previous checksum
boundary to the end of the file.
3. Modify all the IFile reads to go through the intermediate layer as well,
which will do the checksum verification transparently to the calling methods.
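For step 1, assuming LocalFileSystem#getRaw() (which returns the wrapped
RawLocalFileSystem), the IFile code could obtain the raw file system along
these lines -- a minimal sketch, not the final patch:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class RawFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem.getLocal() hands back the checksumming LocalFileSystem...
    LocalFileSystem localFs = FileSystem.getLocal(conf);
    // ...while getRaw() exposes the wrapped RawLocalFileSystem, which
    // creates no .crc side files -- what we want for IFiles here.
    FileSystem rfs = localFs.getRaw();
    System.out.println(rfs.getClass().getName());
  }
}
{code}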
Modifications will be done only for the files that are written to disk; the
in-memory buffer reads/writes will not be affected. (A sketch of such an
intermediate layer follows.)
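To make steps 2 and 3 concrete, here is a rough sketch of what the
intermediate layer could look like. The class names are hypothetical, and it
uses java.util.zip.CRC32 directly (a real implementation would presumably
reuse Hadoop's checksum utilities). The writer appends a 4-byte CRC32 after
every 512-byte chunk and after the final partial chunk; the reader verifies
each chunk transparently:
{code}
import java.io.EOFException;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;

/**
 * Sketch of the write side (hypothetical class): data is written through
 * to the underlying raw stream, and a 4-byte CRC32 is appended after every
 * 512-byte chunk. close() emits the checksum for the final partial chunk.
 */
class InlineChecksumOutputStream extends FilterOutputStream {
  private static final int CHUNK = 512;
  private final CRC32 crc = new CRC32();
  private int inChunk = 0; // data bytes accumulated in the current chunk

  InlineChecksumOutputStream(OutputStream out) { super(out); }

  @Override public void write(int b) throws IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }

  @Override public void write(byte[] b, int off, int len) throws IOException {
    while (len > 0) {
      int n = Math.min(len, CHUNK - inChunk);
      out.write(b, off, n);
      crc.update(b, off, n);
      inChunk += n; off += n; len -= n;
      if (inChunk == CHUNK) writeChecksum(); // chunk boundary: emit CRC
    }
  }

  @Override public void close() throws IOException {
    if (inChunk > 0) writeChecksum(); // CRC for the trailing partial chunk
    super.close();
  }

  private void writeChecksum() throws IOException {
    int v = (int) crc.getValue(); // low 32 bits of the CRC
    out.write(v >>> 24); out.write(v >>> 16); out.write(v >>> 8); out.write(v);
    crc.reset();
    inChunk = 0;
  }
}

/**
 * Sketch of the read side (hypothetical class): reads frames of up to 512
 * data bytes plus a 4-byte CRC, verifies each frame, and serves only the
 * data bytes, so callers never see the checksums. A real version would
 * also implement the bulk read(byte[], int, int) for efficiency.
 */
class InlineChecksumInputStream extends InputStream {
  private static final int CHUNK = 512;
  private final InputStream in;
  private final byte[] frame = new byte[CHUNK + 4];
  private final CRC32 crc = new CRC32();
  private int pos = 0, dataLen = 0;

  InlineChecksumInputStream(InputStream in) { this.in = in; }

  @Override public int read() throws IOException {
    if (pos == dataLen && !fill()) return -1;
    return frame[pos++] & 0xff;
  }

  /** Reads and verifies the next frame; returns false on a clean EOF. */
  private boolean fill() throws IOException {
    int n = 0, r;
    while (n < frame.length && (r = in.read(frame, n, frame.length - n)) != -1) {
      n += r;
    }
    if (n == 0) return false;                   // clean end of file
    if (n < 5) throw new EOFException("truncated checksum frame");
    dataLen = n - 4;                            // last 4 bytes are the CRC
    crc.reset();
    crc.update(frame, 0, dataLen);
    int stored = ((frame[dataLen] & 0xff) << 24) | ((frame[dataLen + 1] & 0xff) << 16)
               | ((frame[dataLen + 2] & 0xff) << 8) |  (frame[dataLen + 3] & 0xff);
    if ((int) crc.getValue() != stored) {
      throw new IOException("checksum error in IFile chunk");
    }
    pos = 0;
    return true;
  }
}
{code}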
This approach has the same checksum overhead as the existing scheme; the only
difference is that the checksums are stored inline, in the same file as the data.
An alternative approach would be record-level checksums (a checksum for every
key/value pair). That could turn out to be costlier for small records, where
the checksum becomes a sizeable overhead relative to the record length.
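To put rough numbers on that comparison, assuming 4-byte CRC32 checksums: with
512-byte chunks the space overhead is a constant 4/512, i.e. under 1%,
independent of record size. With record-level checksums, a 10-byte key/value
pair would pay 4/14, roughly 29% overhead, while a 1 KB record would pay only
about 0.4%.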
Comments?
> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
> Key: HADOOP-3514
> URL: https://issues.apache.org/jira/browse/HADOOP-3514
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.18.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
>
> The number of seeks can be reduced by half in the IFile if we move the CRC
> into the IFile rather than having a separate file.