[
https://issues.apache.org/jira/browse/TEZ-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kuhu Shukla updated TEZ-2950:
-----------------------------
Attachment: TEZ-2950.001_prelim.patch
Attaching a preliminary patch to get some comments on the approach mentioned
earlier.
* I tested this with a Pig script that uses this writer. The drawback of the
new approach is that when native LZO is used for intermediate outputs, the
non-heap usage of 999 open codecs grows, causing containers to exceed their
memory limits, whereas the original implementation is not prone to this
problem (for the same memory.mb setting).
* Using the X_1_11 approach for a better memory footprint in LZO exposed a bug
in the gpl-compression module: the translation of the compression strategy is
incorrect.
* I tried reducing the LZO buffer size to 32kB (it defaults to 64kB in the
Hadoop config), and that seems to mitigate the problem for this use case.
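To see why a smaller buffer helps, the non-heap growth can be ballparked. The assumption that each open codec pins roughly two direct buffers (input and output) of the configured size is mine for illustration; the real native footprint depends on the LZO implementation:

```java
// Rough off-heap estimate for N simultaneously open compression codecs.
// ASSUMPTION: each codec pins ~2 direct buffers (input + output) of
// bufferSize bytes; actual native usage depends on the LZO library.
public class CodecMemoryEstimate {
    static long offHeapBytes(int numCodecs, int bufferSize) {
        return (long) numCodecs * 2 * bufferSize;
    }

    public static void main(String[] args) {
        // 999 open codecs: default 64kB buffers vs. reduced 32kB buffers
        System.out.println(offHeapBytes(999, 64 * 1024) / (1024.0 * 1024.0)); // prints 124.875 (MiB)
        System.out.println(offHeapBytes(999, 32 * 1024) / (1024.0 * 1024.0)); // prints 62.4375 (MiB)
    }
}
```

Under this assumption, halving the buffer roughly halves the off-heap footprint, which matches the observation that 32kB buffers keep this job under its container limit.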
Turning the codecs off altogether causes the size of file.out to bloat by up to 3x.
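For reference, the buffer-size override would look like the following in the job/site configuration. The key name is the one I believe the gpl-compression/hadoop-lzo LzoCodec reads; verify it against the codec version actually deployed:

```xml
<!-- Reduce LZO codec buffers from the 64kB default to 32kB.
     Key name assumes the hadoop-lzo / gpl-compression LzoCodec;
     confirm against the deployed version. -->
<property>
  <name>io.compression.codec.lzo.buffersize</name>
  <value>32768</value>
</property>
```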
We need a better design alternative and would appreciate ideas from the
community on reducing the memory footprint of opening {{numpartitions}} files
at once, independent of the compression codec in use.
> Poor performance of UnorderedPartitionedKVWriter
> ------------------------------------------------
>
> Key: TEZ-2950
> URL: https://issues.apache.org/jira/browse/TEZ-2950
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Kuhu Shukla
> Attachments: TEZ-2950.001_prelim.patch
>
>
> Came across a job which was taking a long time in
> UnorderedPartitionedKVWriter.mergeAll. It was decompressing and reading data
> from spill files (8500 spills) and then writing the final compressed merge
> file. Why do we need spill files for UnorderedPartitionedKVWriter? Why not
> just buffer and write directly to the final file, which would save a lot of
> time?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)