Remi Dettai created PARQUET-1188:
------------------------------------

             Summary: Off heap memory leaks with large binary fields using Snappy
                 Key: PARQUET-1188
                 URL: https://issues.apache.org/jira/browse/PARQUET-1188
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.1
            Reporter: Remi Dettai


When I write large pages (~100MB) that contain large binary fields (~1MB), 
the Java application uses an unexpected amount of off-heap memory (1.2GB).

This problem was identified when using the {{AvroParquetWriter}}, but its source 
lies in the parquet-hadoop submodule.

Diving a little bit deeper shows the following:
- writing fields into the ParquetWriter creates a {{SequenceBytesIn}}, which is 
actually just a list of {{BytesInput}}, one for each field. When calling 
{{bytes.writeAllTo(cos)}} in the {{CodecFactory}}, it actually writes one 
{{BytesInput}} (which contains a single field) at a time.
- the {{SnappyCompressor}} receives the data in {{setInput}} one large field at 
a time. Each call invokes {{ByteBuffer.allocateDirect}} with a growing size. 
But as the memory is allocated off-heap, this does not trigger the 
garbage collector, which only sees small objects on the heap. The memory 
behind each new buffer is the size of all the fields added to the page 
so far, so the off-heap memory grows quadratically with the number of fields.
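
The growth pattern described above can be reproduced with a small stand-alone 
model. This is only a sketch: {{GrowingDirectBufferDemo}} and its methods are 
hypothetical names, and the reallocation logic is a simplified imitation of the 
behaviour described in this report, not the parquet-mr source. It counts how 
many direct bytes are requested when fields arrive one at a time.

```java
import java.nio.ByteBuffer;

public class GrowingDirectBufferDemo {
    // Hypothetical stand-in for the compressor's direct input buffer.
    private ByteBuffer inputBuffer = ByteBuffer.allocateDirect(0);
    // Running total of bytes passed to allocateDirect.
    private long directBytesRequested = 0;

    // Assumed behaviour: when the field does not fit, allocate a new direct
    // buffer sized exactly (bytes already held + new field) and copy into it.
    public void setInput(byte[] field) {
        if (inputBuffer.remaining() < field.length) {
            int newSize = inputBuffer.position() + field.length;
            ByteBuffer bigger = ByteBuffer.allocateDirect(newSize);
            directBytesRequested += newSize;
            inputBuffer.flip();          // prepare the old content for reading
            bigger.put(inputBuffer);     // copy old content into the new buffer
            inputBuffer = bigger;        // old buffer is now unreachable garbage
        }
        inputBuffer.put(field);
    }

    // Feeds fieldCount fields of fieldSize bytes and returns the bytes requested.
    public static long simulate(int fieldCount, int fieldSize) {
        GrowingDirectBufferDemo demo = new GrowingDirectBufferDemo();
        byte[] field = new byte[fieldSize];
        for (int i = 0; i < fieldCount; i++) {
            demo.setInput(field);
        }
        return demo.directBytesRequested;
    }

    public static void main(String[] args) {
        int n = 100;
        int size = 1024; // kept small so the demo runs with default JVM limits
        long requested = simulate(n, size);
        long closedForm = (long) size * n * (n + 1) / 2; // size * (1 + 2 + ... + n)
        System.out.println(requested + " bytes requested, closed form " + closedForm);
    }
}
```

Under this model the total requested is fieldSize * n * (n + 1) / 2. With the 
figures from this report (about 100 fields of 1MB), that sum is 5050MB 
requested over the life of one page; how much of it is resident at any moment 
depends on when the old buffers are actually released.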

I did not attach a pull request to this issue because I see multiple mitigations 
for it, but I'm not really delighted by any of them:
- merge all the fields into one byte array before pushing them down to the 
{{SnappyCompressor}}. For instance we could replace the previous statement in 
the {{CodecFactory}} with 
{{BytesInput.from(bytes.toByteArray()).writeAllTo(cos)}}. But this generates an 
extra on-heap allocation the size of the whole page.
- force the {{DirectBuffer}} to be cleaned up with something like 
{{((DirectBuffer)inputBuffer).cleaner().clean()}} after having copied it to the 
new bigger buffer. The issue here would be that {{DirectBuffer}} is part of the 
internal API and is likely to be moved. Using reflection could make the solution 
more resilient but is even "hackier" IMHO.
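
To illustrate the trade-off of the first mitigation, here is a rough sketch 
that compares feeding fields one by one with feeding a single merged array. 
All names ({{CoalesceBeforeCompress}}, {{ModelInput}}, {{coalesce}}) are 
hypothetical, and {{ModelInput}} is an assumed simplification of the 
compressor's buffer handling, not the parquet-mr code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class CoalesceBeforeCompress {
    // Hypothetical model: reallocates a direct buffer to fit exactly on each call.
    static final class ModelInput {
        private ByteBuffer buffer = ByteBuffer.allocateDirect(0);
        long directBytesRequested = 0;

        void setInput(byte[] data) {
            if (buffer.remaining() < data.length) {
                int newSize = buffer.position() + data.length;
                ByteBuffer bigger = ByteBuffer.allocateDirect(newSize);
                directBytesRequested += newSize;
                buffer.flip();
                bigger.put(buffer);
                buffer = bigger;
            }
            buffer.put(data);
        }
    }

    // Merges all fields into one on-heap array; this is the extra
    // page-sized heap allocation mentioned in the mitigation.
    public static byte[] coalesce(List<byte[]> fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] f : fields) {
            out.write(f, 0, f.length);
        }
        return out.toByteArray();
    }

    // Direct bytes requested when each field is pushed separately.
    public static long requestedFieldByField(List<byte[]> fields) {
        ModelInput input = new ModelInput();
        for (byte[] f : fields) {
            input.setInput(f);
        }
        return input.directBytesRequested;
    }

    // Direct bytes requested when the merged page is pushed in one call.
    public static long requestedCoalesced(List<byte[]> fields) {
        ModelInput input = new ModelInput();
        input.setInput(coalesce(fields));
        return input.directBytesRequested;
    }

    public static void main(String[] args) {
        List<byte[]> fields = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            fields.add(new byte[1024]);
        }
        System.out.println("field by field: " + requestedFieldByField(fields));
        System.out.println("coalesced:      " + requestedCoalesced(fields));
    }
}
```

In this model the merged path requests direct memory once, for the size of the 
page, at the cost of one temporary on-heap copy of the same size.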



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
