[
https://issues.apache.org/jira/browse/TEZ-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290330#comment-16290330
]
Rohini Palaniswamy commented on TEZ-2950:
-----------------------------------------
Here is a simpler suggestion to try to speed it up a bit. It can probably be
addressed in a separate jira as a short-term solution, leaving this one for
the long-term solution.
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/writers/UnorderedPartitionedKVWriter.java#L1010-L1022
- For each partition, each spill file is opened once. For a parallelism of 1000
and 8500 spills, that makes 8,500,000 file open calls. This can be cut down by
batching spill file reads and partition writes. Say, for a batch size of 10,
10 writers (partitions) and 10 spill file readers are kept open in parallel
while merging is done. Each spill file is then opened once per batch of 10
partitions instead of once per partition, which cuts file opens by 10x to
850,000.
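
To make the arithmetic concrete, here is a rough sketch that only counts
spill file opens under the batching idea. It is not the Tez code and does no
real I/O. The class name, method name and parameters are hypothetical, and it
assumes each spill file is opened once per pass over a batch of partitions.

```java
// Hypothetical model, not Tez code: counts spill file opens during a merge.
final class BatchedMergeOpenCount {

    /**
     * Counts how many times spill files are opened when the merge walks the
     * partitions in batches of partitionBatch and the spills in groups of
     * spillBatch. A batch size of 1 for both models one open per
     * (partition, spill) pair.
     */
    static long countSpillOpens(int numPartitions, int numSpills,
                                int partitionBatch, int spillBatch) {
        if (numPartitions < 0 || numSpills < 0
                || partitionBatch < 1 || spillBatch < 1) {
            throw new IllegalArgumentException("invalid sizes");
        }
        long opens = 0;
        // One pass per batch of partitions; the writers for that batch are
        // assumed to stay open for the whole pass.
        for (int p = 0; p < numPartitions; p += partitionBatch) {
            // Within a pass, spill readers are opened a group at a time.
            for (int s = 0; s < numSpills; s += spillBatch) {
                int end = Math.min(s + spillBatch, numSpills);
                // Each spill in the group is opened once and serves every
                // partition in the current batch before it is closed.
                opens += end - s;
            }
        }
        return opens;
    }

    public static void main(String[] args) {
        // Numbers from the comment: parallelism 1000, 8500 spills.
        System.out.println(countSpillOpens(1000, 8500, 1, 1));   // 8500000
        System.out.println(countSpillOpens(1000, 8500, 10, 10)); // 850000
    }
}
```

In this model the saving comes from the partition batch size. The spill batch
size only bounds how many readers are open at the same time.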
> Poor performance of UnorderedPartitionedKVWriter
> ------------------------------------------------
>
> Key: TEZ-2950
> URL: https://issues.apache.org/jira/browse/TEZ-2950
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Kuhu Shukla
> Attachments: TEZ-2950.001_prelim.patch
>
>
> Came across a job which was taking a long time in
> UnorderedPartitionedKVWriter.mergeAll. It was decompressing and reading data
> from spill files (8500 spills) and then writing the final compressed merge
> file. Why do we need spill files for UnorderedPartitionedKVWriter? Why not
> just buffer and keep writing directly to the final file, which would save a
> lot of time?