Yujiang Zhong created PARQUET-2160:
--------------------------------------

             Summary: Close decompression stream to free off-heap memory in time
                 Key: PARQUET-2160
                 URL: https://issues.apache.org/jira/browse/PARQUET-2160
             Project: Parquet
          Issue Type: Improvement
         Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
1.4.9.1 + glibc
            Reporter: Yujiang Zhong


The decompression stream currently relies on JVM GC to be closed. When reading 
zstd-compressed parquet, I sometimes ran into OOMs caused by high off-heap 
usage. I think the reason is that GC is not timely, which leads to off-heap 
memory fragmentation. I had to set a lower MALLOC_TRIM_THRESHOLD_ to make 
glibc return memory to the system quickly. There is a 
[thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
 about this zstd parquet issue in the Iceberg community Slack: some people had 
the same problem.

I think we can close the decompression stream manually, as soon as the bytes 
have been read, to solve this problem:

 

InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
decompressed = BytesInput.from(is, uncompressedSize);

->

InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
is.close();

 
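For illustration, here is a minimal sketch of the pattern, assuming the change
lands in a decompress method like the one in CodecFactory. The helper below is
hypothetical framing; codec, decompressor, and the BytesInput/CompressionCodec
calls are the real Parquet/Hadoop APIs. Using try/finally ensures the stream
(and the native zstd buffers behind it) is released even if the copy throws:

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.parquet.bytes.BytesInput;

// Hypothetical helper showing the proposed pattern.
static BytesInput decompress(CompressionCodec codec, Decompressor decompressor,
                             BytesInput bytes, int uncompressedSize)
    throws IOException {
  InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
  try {
    // BytesInput.from(is, n) is lazy; copy(...) materializes the bytes onto
    // the heap now, so the stream is fully consumed before we close it.
    return BytesInput.copy(BytesInput.from(is, uncompressedSize));
  } finally {
    is.close(); // frees the codec's off-heap memory without waiting for GC
  }
}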

After I made this change to the decompress path, I found that off-heap memory 
usage was significantly reduced (for the same query on the same data).


