Rohini Palaniswamy created TEZ-3159:
---------------------------------------
Summary: Reduce memory utilization while serializing keys and
values
Key: TEZ-3159
URL: https://issues.apache.org/jira/browse/TEZ-3159
Project: Apache Tez
Issue Type: Improvement
Reporter: Rohini Palaniswamy
Currently DataOutputBuffer is used for serializing. The underlying buffer
keeps doubling in size when it reaches capacity. In some of the Pig scripts
which serialize big bags, we end up with OOM for some situations where
mapreduce runs fine.
- When combiner runs in reducer and some of the fields after combining are
still big bags (For eg: distinct). Currently with mapreduce combiner does not
run in reducer - MAPREDUCE-5221. Since input sort buffers hold good amount of
memory at that time it can easily go OOM
- While serializing output when there are multiple inputs and outputs and
the sort buffers for those take up space.
It is a pain especially after buffer size hits 128MB. Doubling at 128MB will
require 128MB (existing array) +256MB (new array). Any doubling after that
requires even more space. But most of the time the data is probably not going
to fill up that 256MB leading to wastage.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)