[
https://issues.apache.org/jira/browse/TEZ-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283355#comment-14283355
]
Siddharth Seth commented on TEZ-1937:
-------------------------------------
The counter should account for compression, since it measures bytes read from
disk. It would be better to update it in the IFile.appendIFile method, so that
whenever this is changed to fix compression, the counter change is an obvious
part of that fix.
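To illustrate the distinction, here is a minimal sketch of counting bytes as they are read from the underlying source rather than after decompression. It uses java.util.zip as a stand-in for the Hadoop codecs, and CountingInputStream and CompressedReadCounterDemo are hypothetical names for this example, not Tez classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: counts bytes pulled from the wrapped (on-disk) stream. */
final class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    long getCount() {
        return count;
    }
}

final class CompressedReadCounterDemo {
    /** Returns {bytes read from the source, bytes seen after decompression}. */
    static long[] measure(byte[] compressed) throws IOException {
        // The counter sits below the decompressor, so it sees compressed bytes.
        CountingInputStream raw =
            new CountingInputStream(new ByteArrayInputStream(compressed));
        long decompressed = 0;
        try (InputStream in = new GZIPInputStream(raw)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                decompressed += n;
            }
        }
        return new long[] {raw.getCount(), decompressed};
    }

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[10000];
        Arrays.fill(payload, (byte) 'a');
        long[] r = measure(gzip(payload));
        System.out.println("read from source: " + r[0] + ", decompressed: " + r[1]);
    }
}
```

A counter updated after decompression would report the second number, which overstates disk reads for compressible data.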
{code}
+ } else {
+ LOG.warn("Could not obtain decompressor from CodecPool");
+ in = checksumIn;
+ }
{code}
This should throw an exception instead of logging a warning and falling back to
checksumIn.
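A rough sketch of the fail-fast alternative, under stated assumptions: DecompressorGuardDemo and its open method are hypothetical, and a null Function models the pool returning no decompressor. This is not the patch's code, only the shape of the suggested behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Function;

final class DecompressorGuardDemo {
    /**
     * Wraps checksumIn with a decompressing stream.
     * The decompressor argument is a stand-in for whatever the pool hands back;
     * null models "could not obtain a decompressor".
     */
    static InputStream open(InputStream checksumIn,
                            Function<InputStream, InputStream> decompressor)
            throws IOException {
        if (decompressor == null) {
            // Fail fast: reading compressed bytes as raw data would corrupt records.
            throw new IOException("Could not obtain decompressor from CodecPool");
        }
        return decompressor.apply(checksumIn);
    }

    public static void main(String[] args) {
        InputStream raw = new ByteArrayInputStream(new byte[] {1, 2, 3});
        try {
            open(raw, null);
        } catch (IOException e) {
            System.out.println("failed as expected: " + e.getMessage());
        }
    }
}
```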
{code}
+ prevKey = null;
+ previous.reset();
{code}
Why is this required?
Doesn't each IFile stream (per partition in each spill file) also have a
checksum associated with it? I believe using partLength will not copy the
checksum, but is a new checksum being computed for the entire partition stream
in the writer?
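For reference, a small sketch of what computing one checksum over the merged stream could look like. It uses java.util.zip.CRC32 purely as an illustration; the actual IFile checksum format is not reproduced here, and MergedChecksumDemo is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class MergedChecksumDemo {
    /** Feeds each copied payload into a single running checksum. */
    static long crcOf(byte[]... parts) {
        CRC32 crc = new CRC32();
        for (byte[] p : parts) {
            crc.update(p, 0, p.length);
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Payloads of the same partition from two spill files (checksums stripped).
        byte[] a = "spill-0 partition bytes".getBytes(StandardCharsets.UTF_8);
        byte[] b = "spill-1 partition bytes".getBytes(StandardCharsets.UTF_8);

        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);

        // The running checksum over the copied parts matches the checksum of the
        // merged stream, so the writer can compute it while copying.
        System.out.println(crcOf(a, b) == crcOf(joined));
    }
}
```

The point is that the per-spill checksums cannot simply be dropped; the merged partition stream needs a checksum of its own, updated as the raw bytes are copied.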
Are there any corner cases, e.g. the same record existing across two files,
where RLE would break in any way? I don't think it should.
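A toy model of the RLE concern, assuming a simplified encoding in which a null key slot means "same key as the previous record". This is not IFile's wire format, and RleConcatDemo is a hypothetical name. It shows why independently encoded segments should concatenate safely: each segment starts with a full key.

```java
import java.util.ArrayList;
import java.util.List;

final class RleConcatDemo {
    /** Encodes one segment; a null key slot is the toy "same key" marker. */
    static List<String[]> encode(String[][] records) {
        List<String[]> out = new ArrayList<>();
        String prevKey = null; // fresh state for every segment
        for (String[] r : records) {
            boolean same = r[0].equals(prevKey);
            out.add(new String[] {same ? null : r[0], r[1]});
            prevKey = r[0];
        }
        return out;
    }

    /** Decodes a stream, resolving each marker against the previous key. */
    static List<String> decode(List<String[]> encoded) {
        List<String> out = new ArrayList<>();
        String prevKey = null;
        for (String[] e : encoded) {
            String key = e[0] != null ? e[0] : prevKey;
            if (key == null) {
                throw new IllegalStateException("RLE marker with no previous key");
            }
            out.add(key + "=" + e[1]);
            prevKey = key;
        }
        return out;
    }

    public static void main(String[] args) {
        // Key k1 ends the first spill and starts the second one.
        List<String[]> merged = new ArrayList<>();
        merged.addAll(encode(new String[][] {{"k1", "a"}, {"k1", "b"}}));
        merged.addAll(encode(new String[][] {{"k1", "c"}, {"k2", "d"}}));
        System.out.println(decode(merged)); // [k1=a, k1=b, k1=c, k2=d]
    }
}
```

In this model the second segment repeats k1 in full even though the first segment ended on k1, so a reader never resolves a marker across the file boundary.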
> Reduce cost of merging ifiles in UnorderedPartitionedWriter
> -----------------------------------------------------------
>
> Key: TEZ-1937
> URL: https://issues.apache.org/jira/browse/TEZ-1937
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1937.1.patch, TEZ-1937.2.patch, TEZ-1937.WIP.patch
>
>
> Currently we iterate through all spilled files for merging. This incurs
> additional deserialization cost.