bryanck opened a new pull request, #5681: URL: https://github.com/apache/iceberg/pull/5681
This PR adds a workaround for memory issues encountered when reading Parquet files compressed with zstd. During load testing on Spark, we encountered various OOM kills when reading from zstd-compressed tables. One suggested mitigation was to set the environment variable `MALLOC_TRIM_THRESHOLD_` to something lower than the default, e.g. 8192. This helped in some cases but not all. Upon further investigation, it appeared that buffers were accumulating...  Disabling the buffer pool resulted in finalizers accumulating instead...

The solution is the same as the one being [proposed](https://github.com/apache/parquet-mr/pull/982) in parquet-mr. The current version of Parquet leaves the decompression stream [open](https://github.com/apache/parquet-mr/blob/6add62754b3d53e30376360f8d215da004fa8096/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L111). Instead of leaving it open, this PR changes the behavior to read the stream fully into a buffer and then close the stream, allowing native resources to be [freed](https://github.com/luben/zstd-jni/blob/a3b5c4c1a02ddee56f6e2b019d6c8b52f8e63411/src/main/java/com/github/luben/zstd/ZstdInputStreamNoFinalizer.java#L254) immediately rather than waiting for garbage collection. Anecdotally, this resulted in better performance, but more testing would be needed to validate that.

Alternatively, we could wait for the Parquet PR to be merged, but this is a more targeted fix. We could also add a flag of some sort if desired. Here's a viz of the heap dump with this change...

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
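The read-fully-then-close pattern described above can be sketched roughly as below. This is a simplified illustration, not the actual PR code: the helper name `readFullyAndClose` is hypothetical, and `GZIPInputStream` from the JDK stands in for zstd-jni's `ZstdInputStreamNoFinalizer`, since the point is the eager `close()` rather than the codec itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EagerDecompress {

  // Drain the decompression stream fully into a heap buffer, then close it
  // immediately (via try-with-resources) so any native resources held by the
  // stream are released right away instead of waiting for GC/finalizers.
  static byte[] readFullyAndClose(InputStream decompressed, int sizeHint) throws IOException {
    try (InputStream in = decompressed) {
      ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(sizeHint, 32));
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) != -1) {
        out.write(chunk, 0, n);
      }
      return out.toByteArray();
    } // stream closed here, before the caller ever sees it
  }

  public static void main(String[] args) throws IOException {
    // Round-trip a small payload to show the helper returns the full
    // decompressed contents even though the stream is already closed.
    byte[] original = "hello zstd workaround".getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
      gz.write(original);
    }
    byte[] roundTrip = readFullyAndClose(
        new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray())),
        original.length);
    System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
  }
}
```

The caller then reads from a plain in-memory buffer, so no open stream (and no finalizer) outlives the decompression step.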
