[ https://issues.apache.org/jira/browse/TEZ-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290330#comment-16290330 ]
Rohini Palaniswamy edited comment on TEZ-2950 at 12/14/17 4:59 AM:
-------------------------------------------------------------------
Here is a simpler suggestion to try to speed it up a bit. It can probably be
addressed in a separate JIRA as a short-term solution, leaving this one for
the long-term solution.
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/writers/UnorderedPartitionedKVWriter.java#L1010-L1022
- For each partition, each spill file is opened once. With a parallelism of
1000 and 8500 spills, that is 8,500,000 file open calls. We can try always
keeping the first N file handles open (this will need a new IFile.Reader
method that does not close the underlying input stream but does the rest of
close(), such as freeing up the decompressor and buffers). If we keep the
first 100 spill files always open, the number of file open calls drops to
8,400,100. For a parallelism of 1000 and 100 spills, it cuts file open calls
from 100,000 to 100.
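
The counts above follow from a simple formula. Here is a rough sketch of it
in Python. This is not Tez code: the function name file_opens and its
parameters are made up for illustration. It assumes that each spill file kept
open is opened exactly once and that every other spill file is reopened once
per partition.

```python
def file_opens(partitions: int, spills: int, pinned: int = 0) -> int:
    """Estimate file open calls made while merging spill files.

    Assumed model (hypothetical, not measured from Tez): the first
    `pinned` spill files are opened once and kept open; every remaining
    spill file is opened once for each partition.
    """
    pinned = max(0, min(pinned, spills))  # cannot pin more files than exist
    return pinned + partitions * (spills - pinned)


if __name__ == "__main__":
    print(file_opens(1000, 8500))       # no handles kept open
    print(file_opens(1000, 8500, 100))  # first 100 handles kept open
    print(file_opens(1000, 100, 100))   # all 100 spill files kept open
```

The saving is large only when N is close to the total number of spills, which
is why the 100-spill case benefits far more than the 8500-spill case.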
was (Author: rohini):
Here is a simpler suggestion to try to speed it up a bit. It can probably be
addressed in a separate JIRA as a short-term solution, leaving this one for
the long-term solution.
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/writers/UnorderedPartitionedKVWriter.java#L1010-L1022
- For each partition, each spill file is opened once. With a parallelism of
1000 and 8500 spills, that is 8,500,000 file open calls. This can be cut
down by batching spill file reads and partition writes. Say, for a batch
size of 10, 10 writers (partitions) and 10 spill file readers are kept open
in parallel while merging is done. That cuts file opens by 10x, to 850,000.
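
The 10x figure for batching can be checked the same way. Here is a rough
sketch in Python. This is not Tez code: file_opens_batched and its parameters
are hypothetical names. It assumes that each batch of partitions makes one
pass over all spill files, so a spill file is opened once per batch rather
than once per partition.

```python
def file_opens_batched(partitions: int, spills: int, batch_size: int = 1) -> int:
    """Estimate file open calls when partitions are merged in batches.

    Assumed model (hypothetical): every batch of `batch_size` partition
    writers shares one pass over the spill files, so each spill file is
    opened once per batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = (partitions + batch_size - 1) // batch_size  # ceiling division
    return batches * spills


if __name__ == "__main__":
    print(file_opens_batched(1000, 8500))      # one partition at a time
    print(file_opens_batched(1000, 8500, 10))  # 10 partitions per batch
```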
> Poor performance of UnorderedPartitionedKVWriter
> ------------------------------------------------
>
> Key: TEZ-2950
> URL: https://issues.apache.org/jira/browse/TEZ-2950
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Kuhu Shukla
> Attachments: TEZ-2950.001_prelim.patch
>
>
> Came across a job which was taking a long time in
> UnorderedPartitionedKVWriter.mergeAll. It was decompressing and reading data
> from spill files (8500 spills) and then writing the final compressed merge
> file. Why do we need spill files for UnorderedPartitionedKVWriter? Why not
> just buffer and keep writing directly to the final file, which would save a
> lot of time?