Remi Dettai created PARQUET-1188:
------------------------------------
Summary: Off heap memory leaks with large binary fields using
Snappy
Key: PARQUET-1188
URL: https://issues.apache.org/jira/browse/PARQUET-1188
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Remi Dettai
When I write a large page (~100MB) that contains large binary fields (~1MB each),
the Java application uses an unexpectedly large amount of off-heap memory (1.2GB).
This problem was identified when using the {{AvroParquetWriter}} but its source
lies in the parquet-hadoop submodule.
Diving a little bit deeper shows the following:
- writing fields into the {{ParquetWriter}} creates a {{SequenceBytesIn}}, which
is actually just a list of {{BytesInput}} objects, one per field. When
{{bytes.writeAllTo(cos)}} is called in the {{CodecFactory}}, it writes one
{{BytesInput}} (which contains a single field) at a time.
- the {{SnappyCompressor}} receives the data in {{setInput}} one large field at
a time. Each call triggers {{ByteBuffer.allocateDirect}} with an ever-growing size.
But as the memory is actually allocated off-heap, this does not pressure the
garbage collector, which only sees small wrapper objects on the heap. The off-heap
memory backing each buffer is the size of all the fields added to the page
so far, so the total off-heap allocation grows quadratically.
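To make the quadratic growth concrete, here is a minimal model of the allocation pattern described above (the class and method names are mine, not parquet-mr's): if every {{setInput}} call allocates a fresh direct buffer sized to the cumulative input, the buffers allocated over the life of the page total s*n*(n+1)/2 bytes for n fields of size s.

```java
// Model of the allocation pattern described above (hypothetical names):
// for n fields of size s, the i-th direct buffer has size i*s, so the
// direct buffers allocated over the life of one page sum to
// s*n*(n+1)/2 bytes, i.e. quadratic in the number of fields.
public class QuadraticAllocation {
    static long totalDirectBytesAllocated(int numFields, long fieldSize) {
        long total = 0;
        long bufferSize = 0;
        for (int i = 0; i < numFields; i++) {
            bufferSize += fieldSize; // buffer grows to hold all input so far
            total += bufferSize;     // a new direct buffer of that size is allocated
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 fields of 1MB each (a ~100MB page) allocate ~5GB of direct
        // buffers in total; only the GC ever reclaims the stale ones.
        System.out.println(totalDirectBytesAllocated(100, 1L << 20));
    }
}
```

Only garbage collection of the small heap-side wrappers frees the stale direct buffers, which is why the observed off-heap footprint balloons far beyond the page size.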
I did not attach a pull request to this issue because I see multiple mitigations
for the issue, but I'm not really delighted by any of them:
- merge all the fields into one byte array before pushing them down to the
{{SnappyCompressor}}. For instance, we could replace the statement above in
the {{CodecFactory}} with
{{BytesInput.from(bytes.toByteArray()).writeAllTo(cos)}}. But this generates an
extra on-heap allocation the size of the whole page.
- force the {{DirectBuffer}} to be cleaned up with something like
{{((DirectBuffer) inputBuffer).cleaner().clean()}} after it has been copied into
the new, bigger buffer. The issue here is that {{DirectBuffer}} is part of the
internal API and is likely to be moved. Using reflection could make the solution
more resilient, but that is even "hackier" IMHO.
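The first mitigation boils down to the following (a minimal self-contained sketch with hypothetical names, not the actual {{CodecFactory}} change): concatenate all field chunks into a single on-heap array so the compressor's {{setInput}} sees one input and allocates one direct buffer of the final size, instead of n growing ones.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the merge-before-compress mitigation (hypothetical names).
public class MergeBeforeCompress {
    static byte[] merge(byte[][] chunks) {
        // This is the extra on-heap copy mentioned above: it is as large
        // as the whole page, but it is allocated exactly once and is
        // visible to the garbage collector.
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            merged.write(chunk, 0, chunk.length);
        }
        return merged.toByteArray();
    }
}
```

The trade-off is exactly the one stated above: one page-sized heap allocation in exchange for a single, right-sized direct buffer on the compressor side.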
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)