[
https://issues.apache.org/jira/browse/TEZ-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kuhu Shukla updated TEZ-2950:
-----------------------------
Attachment: TEZ-2950.001_prelim.patch
Attaching a preliminary patch to get some comments on the approach mentioned
earlier.
* I tested this with a Pig script that uses this writer. The drawback of the
new approach is that when native LZO is used for intermediate outputs, the
non-heap usage of 999 open codecs grows, causing containers to exceed their
memory limits, whereas the original implementation is not prone to this
problem (for the same memory.mb setting).
* Using the X_1_11 approach for a better memory footprint in LZO exposed a bug
in the gpl-compression module: the translation of the compression strategy is
incorrect.
* I tried reducing the LZO buffer size to 32kB (it defaults to 64kB in the
Hadoop config), and that seems to mitigate the problem for this use case.
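To see why a smaller buffer helps, the non-heap growth can be ballparked. The assumption that each open codec pins roughly two direct buffers (input and output) of the configured size is mine for illustration; the real native footprint depends on the LZO implementation:

```java
// Rough off-heap estimate for N simultaneously open compression codecs.
// ASSUMPTION: each codec pins ~2 direct buffers (input + output) of
// bufferSize bytes; actual native usage depends on the LZO library.
public class CodecMemoryEstimate {
    static long offHeapBytes(int numCodecs, int bufferSize) {
        return (long) numCodecs * 2 * bufferSize;
    }

    public static void main(String[] args) {
        // 999 open codecs: default 64kB buffers vs. reduced 32kB buffers
        System.out.println(offHeapBytes(999, 64 * 1024) / (1024.0 * 1024.0)); // prints 124.875 (MiB)
        System.out.println(offHeapBytes(999, 32 * 1024) / (1024.0 * 1024.0)); // prints 62.4375 (MiB)
    }
}
```

Under this assumption, halving the buffer roughly halves the off-heap footprint, which matches the observation that 32kB buffers keep this job under its container limit.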
Turning the codecs off altogether causes the size of file.out to bloat by up to 3x.
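For reference, the buffer-size override would look like the following in the job/site configuration. The key name is the one I believe the gpl-compression/hadoop-lzo LzoCodec reads; verify it against the codec version actually deployed:

```xml
<!-- Reduce LZO codec buffers from the 64kB default to 32kB.
     Key name assumes the hadoop-lzo / gpl-compression LzoCodec;
     confirm against the deployed version. -->
<property>
  <name>io.compression.codec.lzo.buffersize</name>
  <value>32768</value>
</property>
```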
We need a better design alternative and would appreciate ideas from the
community on reducing the memory footprint of opening {{numpartitions}} files
at once, independent of the compression codec in use.
> Poor performance of UnorderedPartitionedKVWriter
> ------------------------------------------------
>
> Key: TEZ-2950
> URL: https://issues.apache.org/jira/browse/TEZ-2950
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Kuhu Shukla
> Attachments: TEZ-2950.001_prelim.patch
>
>
> Came across a job which was taking a long time in
> UnorderedPartitionedKVWriter.mergeAll. It was decompressing and reading data
> from spill files (8500 spills) and then writing the final compressed merge
> file. Why do we need spill files for UnorderedPartitionedKVWriter? Why not
> just buffer and write directly to the final file, which would save a lot of
> time?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)