[ 
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623918#action_12623918
 ] 

Jothi Padmanabhan commented on HADOOP-3514:
-------------------------------------------

bq. There is a class DataChecksum in org.apache.hadoop.util. We probably should 
use it here.

DataChecksum is intended for the typical use case of having a checksum per 
chunk (bytesPerSum). In IFile, the intent is to have one checksum per file.  
The variable 'bytesPerSum' in DataChecksum, as the name indicates, is the number 
of bytes over which each checksum is calculated. However, it is primarily up to 
the user of the DataChecksum class to use this appropriately. Inside 
DataChecksum.java, bytesPerSum is used only in the constructor and while 
generating the header. Since IFile does not use DataChecksum.header, we could 
still use DataChecksum from inside IFile by passing an arbitrary value for 
bytesPerSum in the constructor. Note that we do not know the length of the file 
a priori, so we are constrained to pass a dummy value. One modification is 
needed in DataChecksum.java, though: we need to remove the following assert in 
the update function. There is already a comment saying that it can be removed. 
Is it OK to remove this assert?

{code}
    // Can be removed.
    assert inSum <= bytesPerChecksum : "DataChecksum.update() : inSum " + 
                inSum + " > " + " bytesPerChecksum " + bytesPerChecksum ; 
{code}
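
To illustrate why a chunk size is not needed for a per-file checksum, here is a 
minimal sketch. It uses java.util.zip.CRC32 rather than Hadoop's DataChecksum, 
and the class and method names are hypothetical, not taken from the patch. The 
running checksum is simply updated with every buffer as it is written, so the 
total length never has to be known up front.

```java
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical illustration only; this is not Hadoop's DataChecksum or IFile code.
class WholeFileChecksumSketch {

    /**
     * Feeds each written buffer into a single running checksum.
     * No bytes-per-checksum value is involved, and the total length
     * does not need to be known in advance.
     */
    static long checksumOf(byte[][] writes) {
        Checksum sum = new CRC32();
        for (byte[] buf : writes) {
            sum.update(buf, 0, buf.length);
        }
        return sum.getValue();
    }
}
```

The result does not depend on how the data is split across writes, which is the 
property a per-file checksum relies on.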

bq. It may be better to have ChecksumOutputStream extending FilterOutputStream, 
instead of OutputStream

Yes, I will make this change.
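
For reference, a rough sketch of what a FilterOutputStream-based checksum stream 
could look like. This is not the code in the patch: the class name, the finish() 
and frame() methods, the use of java.util.zip.CRC32, and the 4-byte big-endian 
trailer are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical sketch; names and trailer layout are assumptions, not the patch.
class ChecksumOutputStreamSketch extends FilterOutputStream {
    private final Checksum sum = new CRC32();

    ChecksumOutputStreamSketch(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        sum.update(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Pass the whole range through in one call instead of byte by byte.
        out.write(b, off, len);
        sum.update(b, off, len);
    }

    /** Appends the checksum after the data as 4 big-endian bytes (assumed layout). */
    void finish() throws IOException {
        long v = sum.getValue();
        out.write((int) (v >>> 24));
        out.write((int) (v >>> 16));
        out.write((int) (v >>> 8));
        out.write((int) v);
        out.flush();
    }

    /** Usage example: returns the payload followed by its inline checksum. */
    static byte[] frame(byte[] payload) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            ChecksumOutputStreamSketch cos = new ChecksumOutputStreamSketch(sink);
            cos.write(payload);
            cos.finish();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

Extending FilterOutputStream gives us the wrapped stream ('out') plus flush() 
and close() delegation for free, so only the write methods need overriding. The 
JDK's java.util.zip.CheckedOutputStream follows the same pattern.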

> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
>                 Key: HADOOP-3514
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3514
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3514-v1.patch, hadoop-3514-v2.patch, 
> hadoop-3514-v3.patch, hadoop-3514-v4.patch, hadoop-3514-v5.patch, 
> hadoop-3514-v6.patch, hadoop-3514-v7.patch, hadoop-3514-v8.patch, 
> hadoop-3514.patch
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc 
> into the iFile rather than having a separate file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
