asfimport opened a new issue, #398: URL: https://github.com/apache/parquet-format/issues/398
The decompressed stream in `HeapBytesDecompressor$decompress` currently relies on the JVM GC to be closed. When reading Parquet files compressed with zstd, I sometimes ran into OOM errors caused by high off-heap usage. I think the reason is that the GC does not run in a timely manner, which leads to off-heap memory fragmentation. I had to set a lower `MALLOC_TRIM_THRESHOLD_` to make glibc return memory to the system quickly.

There is a [thread](https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4) about this zstd Parquet issue in the Iceberg community Slack: some people had the same problem.

I think we could use `ByteArrayBytesInput` as the decompressed bytes input and close the decompressed stream promptly to solve this problem:

```java
InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
decompressed = BytesInput.from(is, uncompressedSize);
```

->

```java
InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
is.close();
```

After I made this change to `decompress`, I found that off-heap memory usage was significantly reduced (with the same query on the same data).

**Environment**: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 1.4.9.1 + glibc

**Reporter**: [Yujiang Zhong](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zhongyuj) / @zhongyujiang

<sub>**Note**: *This issue was originally created as [PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160). Please see the [migration documentation](https://issues.apache.org/jira/browse/PARQUET-2502) for further details.*</sub>

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
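The idea in the report (copy the decompressed bytes onto the heap, then close the stream immediately instead of leaving it to the GC) can be sketched without Parquet. This is only an illustration under stated assumptions: `GZIPInputStream` from the JDK stands in for the codec's zstd stream, and the class `EagerDecompress` and its methods `gzip` and `decompressEagerly` are hypothetical names, not part of Parquet or zstd-jni.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EagerDecompress {

    // Hypothetical helper: produce some compressed input for the example.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed (and therefore finished) before we read the bytes.
        return sink.toByteArray();
    }

    // Hypothetical stand-in for the proposed decompress(): read everything
    // into a heap byte[] and close the stream before returning, so any
    // native resources it holds are released right away rather than at GC time.
    static byte[] decompressEagerly(InputStream compressed, int uncompressedSize) {
        try (InputStream is = new GZIPInputStream(compressed)) {
            byte[] out = is.readNBytes(uncompressedSize); // Java 11+
            if (out.length != uncompressedSize) {
                throw new IllegalStateException(
                    "expected " + uncompressedSize + " bytes, got " + out.length);
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "some page data".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompressEagerly(
            new ByteArrayInputStream(gzip(original)), original.length);
        System.out.println(Arrays.equals(original, restored));
    }
}
```

The trade-off is one extra heap copy of the decompressed data in exchange for a deterministic release point. Whether that is acceptable for large pages is a design question this sketch does not answer.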
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
