[
https://issues.apache.org/jira/browse/TEZ-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated TEZ-3159:
------------------------------------
Description:
Currently DataOutputBuffer is used for serializing. The underlying buffer
doubles in size whenever it reaches capacity. In some of the Pig scripts
which serialize big bags, we end up with OOM in Tez as there is no space to
double the array size. Mapreduce mode runs fine in those cases with a 1G heap.
The scenarios are:
- When the combiner runs in the reducer and some of the fields after combining
are still big bags (e.g. distinct). Currently with mapreduce the combiner does
not run in the reducer (MAPREDUCE-5221). Since the input sort buffers hold a
good amount of memory at that time, it can easily go OOM.
- While serializing output with bags when there are multiple inputs and
outputs and the sort buffers for those take up space.
It is a pain especially after the buffer size hits 128MB. Doubling at 128MB
requires 128MB (existing array) + 256MB (new array) to be live at once, and
any doubling after that requires even more space. Most of the time the data is
probably not going to fill up that 256MB, leading to wastage.
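The cost of that last doubling can be sketched with a few lines of arithmetic. This is a hedged illustration of the doubling pattern, not the actual DataOutputBuffer source; the 200MB payload size is an assumed example.

```java
// Sketch of array-doubling growth (the pattern DataOutputBuffer follows):
// during each resize, the old array and the new doubled array are both
// live until the copy finishes, so the transient peak is 3x the old size.
public class DoublingCost {
    public static void main(String[] args) {
        long capacity = 1;            // current buffer size in bytes
        long payload = 200L << 20;    // assumed example: 200MB of serialized data
        while (capacity < payload) {
            // old array + newly allocated doubled array, live at the same time
            long transientPeak = capacity + capacity * 2;
            capacity *= 2;
            if (capacity >= 128L << 20) {
                System.out.println("doubling to " + (capacity >> 20)
                        + "MB needs a transient peak of "
                        + (transientPeak >> 20) + "MB");
            }
        }
        System.out.println("final buffer: " + (capacity >> 20)
                + "MB, unused: " + ((capacity - payload) >> 20) + "MB");
    }
}
```

With the assumed 200MB payload, the buffer ends up at 256MB, so roughly 56MB of the final array goes unused, on top of the 384MB transient peak (128MB existing + 256MB new) during the last copy.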
was:
Currently DataOutputBuffer is used for serializing. The underlying buffer
keeps doubling in size when it reaches capacity. In some of the Pig scripts
which serialize big bags, we end up with OOM for some situations where
mapreduce runs fine.
- When combiner runs in reducer and some of the fields after combining are
still big bags (For eg: distinct). Currently with mapreduce combiner does not
run in reducer - MAPREDUCE-5221. Since input sort buffers hold good amount of
memory at that time it can easily go OOM
- While serializing output when there are multiple inputs and outputs and
the sort buffers for those take up space.
It is a pain especially after buffer size hits 128MB. Doubling at 128MB will
require 128MB (existing array) +256MB (new array). Any doubling after that
requires even more space. But most of the time the data is probably not going
to fill up that 256MB leading to wastage.
> Reduce memory utilization while serializing keys and values
> -----------------------------------------------------------
>
> Key: TEZ-3159
> URL: https://issues.apache.org/jira/browse/TEZ-3159
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)