Yujiang Zhong created PARQUET-2160:
--------------------------------------

             Summary: Close decompression stream to free off-heap memory in time
                 Key: PARQUET-2160
                 URL: https://issues.apache.org/jira/browse/PARQUET-2160
             Project: Parquet
          Issue Type: Improvement
         Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 1.4.9.1 + glibc
            Reporter: Yujiang Zhong
The decompression stream currently relies on JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes ran into OOMs caused by high off-heap usage. I think the reason is that GC is not timely, which causes off-heap memory fragmentation. I had to set a lower MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. There is a [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4] about this zstd parquet issue in the Iceberg community Slack: some people have hit the same problem. I think we can close the decompression stream manually, in time, to solve this problem:

InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
decompressed = BytesInput.from(is, uncompressedSize);

->

InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
is.close();

After I made this change to the decompress path, I found that off-heap memory usage was significantly reduced (with the same query on the same data).

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
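A minimal self-contained sketch of the pattern proposed above: fully drain the codec's decompression stream into a heap buffer and close the stream immediately, rather than leaving the close to GC/finalization. GZIP from java.util.zip stands in for the zstd codec here so the example runs anywhere; the class name `EagerCloseDemo` and the helper `decompressEagerly` are hypothetical, not Parquet API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EagerCloseDemo {

    // Copy all decompressed bytes into a heap buffer, then close the
    // codec stream right away. Any resources the stream holds (for a
    // native codec such as zstd-jni, off-heap buffers) are released
    // deterministically at the end of the try block, not whenever GC
    // happens to finalize the stream.
    static byte[] decompressEagerly(byte[] compressed) throws IOException {
        try (InputStream is = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = is.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } // stream closed here, eagerly
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello parquet".getBytes(StandardCharsets.UTF_8);

        // Round-trip: compress, then decompress with the eager-close helper.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(original);
        }
        byte[] roundTrip = decompressEagerly(bos.toByteArray());
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```

The trade-off is an extra copy of the uncompressed page into heap memory (what `BytesInput.copy` does in the proposed change), in exchange for deterministic release of the codec's native resources.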