bryanck opened a new pull request, #5681:
URL: https://github.com/apache/iceberg/pull/5681

   This PR adds a workaround for memory issues encountered when reading Parquet 
files compressed with zstd. During load testing on Spark, we encountered 
various OOM kills when reading from zstd-compressed tables. One suggested 
solution was to set the environment variable `MALLOC_TRIM_THRESHOLD_` to 
something lower than the default, such as 8192. This helped in some cases but 
not all.
   
   Upon further investigation, it appeared that buffers were accumulating...
   ![Screen Shot 2022-08-30 at 6 59 47 
PM](https://user-images.githubusercontent.com/5475421/187738629-a4bcb8e4-80fa-4ea5-84e9-29d471e7ad1f.png)
   
   Disabling the buffer pool resulted in finalizers accumulating instead...
   ![Screen Shot 2022-08-30 at 8 11 48 
PM](https://user-images.githubusercontent.com/5475421/187738803-39ff7c8f-237d-4cb1-b08d-d6bb5044b8fd.png)
   
   The solution is the same as the one being 
[proposed](https://github.com/apache/parquet-mr/pull/982) in parquet-mr. The 
current version of Parquet leaves the decompression stream 
[open](https://github.com/apache/parquet-mr/blob/6add62754b3d53e30376360f8d215da004fa8096/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L111).
   
   Instead of leaving it open, this PR changes the behavior to read the stream 
fully into a buffer and then close the stream, allowing native resources to be 
[freed](https://github.com/luben/zstd-jni/blob/a3b5c4c1a02ddee56f6e2b019d6c8b52f8e63411/src/main/java/com/github/luben/zstd/ZstdInputStreamNoFinalizer.java#L254)
 immediately rather than waiting for garbage collection.
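
   As a rough illustration of the idea (not the actual change in this PR), the 
sketch below drains a decompressing stream into a heap buffer and closes it 
right away. `EagerDecompress`, `readFullyAndClose`, and `gzip` are hypothetical 
names, and `GZIPInputStream` from the JDK stands in for the zstd stream so the 
example runs without extra dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EagerDecompress {

  // Hypothetical helper: read exactly uncompressedSize bytes from the
  // decompressing stream, then close it (via try-with-resources) so the codec
  // can release its resources immediately instead of waiting for GC.
  public static byte[] readFullyAndClose(InputStream decompressed, int uncompressedSize)
      throws IOException {
    try (InputStream in = decompressed) {
      byte[] buffer = new byte[uncompressedSize];
      int offset = 0;
      while (offset < uncompressedSize) {
        int n = in.read(buffer, offset, uncompressedSize - offset);
        if (n < 0) {
          throw new IOException(
              "Stream ended after " + offset + " of " + uncompressedSize + " bytes");
        }
        offset += n;
      }
      return buffer;
    }
  }

  // Test fixture only: compress bytes with gzip (a stand-in for zstd).
  public static byte[] gzip(byte[] raw) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] page = "example page bytes".getBytes(StandardCharsets.UTF_8);
    InputStream stream = new GZIPInputStream(new ByteArrayInputStream(gzip(page)));
    byte[] restored = readFullyAndClose(stream, page.length);
    System.out.println(new String(restored, StandardCharsets.UTF_8));
  }
}
```

   The trade-off in this sketch is that the whole uncompressed payload is held 
in memory at once, in exchange for not keeping the decompressor open for the 
lifetime of the returned data.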
   
   Anecdotally, this resulted in better performance, but more testing would be 
needed to validate that.
   
   Alternatively, we could wait for the Parquet PR to be merged, but this is a 
more targeted fix. We could also add a flag of some sort if desired.
   
   Here's a viz of the heap dump with this change...
   ![Screen Shot 2022-08-31 at 5 55 41 
AM](https://user-images.githubusercontent.com/5475421/187740382-ac8cca06-7530-467f-a824-952c378df97a.png)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

