[ 
https://issues.apache.org/jira/browse/TEZ-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187831#comment-15187831
 ] 

Rohini Palaniswamy commented on TEZ-3159:
-----------------------------------------

Currently IFile uses one DataOutputBuffer for key and value and tracks the 
length. We can still use DataOutputBuffer for the key. Keys are usually small, 
and we require the whole key to stay in one contiguous buffer so that the 
WritableComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int 
l2) API can be used. 
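
The contiguity requirement can be illustrated with plain JDK code (no Hadoop
dependency; the class name below is hypothetical): a raw comparator of this
shape works on (buffer, offset, length) triples, so each serialized key must
sit unbroken in a single byte[].

```java
import java.util.Arrays;

// Hypothetical illustration, not Tez/Hadoop code: mirrors the shape of
// WritableComparator.compare(b1, s1, l1, b2, s2, l2), which is why each
// serialized key must stay contiguous in one byte[].
public class RawKeyCompare {
    public static int compare(byte[] b1, int s1, int l1,
                              byte[] b2, int s2, int l2) {
        // Lexicographic comparison of unsigned bytes (Java 9+), the same
        // ordering WritableComparator.compareBytes provides.
        return Arrays.compareUnsigned(b1, s1, s1 + l1, b2, s2, s2 + l2);
    }
}
```

Two keys serialized back-to-back into one buffer can then be compared in place
by passing their offsets and lengths, with no deserialization.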

We can use a different DataOutputStream implementation for values that chains 
byte[] arrays. The initial buffer can start at 32 bytes and keep growing until 
it hits a fixed threshold, say 64MB. Once the data to be written crosses that, 
we can allocate a new 64MB byte[], write the new data to it, and repeat until 
all data is written. This will avoid the costly System.arraycopy calls as well. 
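
A minimal sketch of such a chained value buffer (hypothetical class name, not
actual Tez code; the 64MB threshold is a constructor parameter here so the
demo can use a small value):

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed chained value buffer. Below the threshold it
// doubles like DataOutputBuffer; at the threshold it chains a fresh chunk
// instead of copying the existing data.
public class ChainedByteArrayOutputStream extends OutputStream {
    private static final int INITIAL_SIZE = 32;
    private final int chunkLimit;              // 64MB in the proposal
    private final List<byte[]> fullChunks = new ArrayList<>();
    private byte[] current = new byte[INITIAL_SIZE];
    private int pos = 0;                       // write position in 'current'
    private long total = 0;

    public ChainedByteArrayOutputStream(int chunkLimit) {
        this.chunkLimit = chunkLimit;
    }

    @Override
    public void write(int b) {
        if (pos == current.length) {
            grow();
        }
        current[pos++] = (byte) b;
        total++;
    }

    private void grow() {
        if (current.length < chunkLimit) {
            // Still small: double in place, one cheap copy.
            byte[] bigger = new byte[Math.min(current.length * 2, chunkLimit)];
            System.arraycopy(current, 0, bigger, 0, pos);
            current = bigger;
        } else {
            // At the threshold: chain a new chunk, no copy of existing data.
            fullChunks.add(current);
            current = new byte[chunkLimit];
            pos = 0;
        }
    }

    public long size() { return total; }
    public int chunkCount() { return fullChunks.size() + 1; }
}
```

Values would be written through
new DataOutputStream(new ChainedByteArrayOutputStream(64 << 20)).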

The same can be done with the DataInputBuffer for the value in TezMerger as well.

> Reduce memory utilization while serializing keys and values
> -----------------------------------------------------------
>
>                 Key: TEZ-3159
>                 URL: https://issues.apache.org/jira/browse/TEZ-3159
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>
>   Currently DataOutputBuffer is used for serializing. The underlying buffer 
> keeps doubling in size when it reaches capacity. In some of the Pig scripts 
> which serialize big bags, we end up with OOM in situations where mapreduce 
> runs fine. 
>     - When the combiner runs in the reducer and some of the fields after 
> combining are still big bags (e.g. distinct). Currently with mapreduce the 
> combiner does not run in the reducer (MAPREDUCE-5221). Since the input sort 
> buffers hold a good amount of memory at that time, it can easily go OOM.
>     - While serializing output when there are multiple inputs and outputs 
> and the sort buffers for those take up space.
> It is especially painful after the buffer size hits 128MB. Doubling at 128MB 
> requires 128MB (existing array) + 256MB (new array) to be held at once, and 
> any doubling after that requires even more space. Most of the time the data 
> is probably not going to fill up that 256MB, leading to wastage.
>  
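
The 128MB arithmetic in the description can be made concrete (a
back-of-envelope sketch; the method names are illustrative only):

```java
// Peak transient memory when growing past a buffer of the given size:
// doubling keeps the old array live while allocating a twice-as-large one,
// while chaining only allocates one more fixed-size chunk.
public class GrowthCost {
    public static long doublingPeakMB(long existingMB) {
        return existingMB + 2 * existingMB;   // old array + doubled array
    }

    public static long chainingPeakMB(long existingMB, long chunkMB) {
        return existingMB + chunkMB;          // old chunks + one new chunk
    }

    public static void main(String[] args) {
        System.out.println(doublingPeakMB(128));      // 384 (MB)
        System.out.println(chainingPeakMB(128, 64));  // 192 (MB)
    }
}
```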



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
