asfimport opened a new issue, #398:
URL: https://github.com/apache/parquet-format/issues/398

   The decompressed stream in `HeapBytesDecompressor$decompress` currently 
relies on the JVM GC to be closed. When reading Parquet files compressed with 
zstd, I sometimes ran into OOM caused by high off-heap memory usage. I think 
the reason is that GC does not run in a timely manner, which leads to off-heap 
memory fragmentation. I had to set a lower `MALLOC_TRIM_THRESHOLD_` to make 
glibc return memory to the system quickly. There is a 
[thread](https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4)
 about this zstd Parquet issue in the Iceberg community Slack; some people 
there had the same problem. 
   
   I think we could use `ByteArrayBytesInput` for the decompressed bytes and 
close the decompressed stream promptly to solve this problem:
   ```java
   InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
   decompressed = BytesInput.from(is, uncompressedSize);
   ```
   ->
   ```java
   InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
   decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
   is.close();
   ```
   After making this change to `decompress`, I found that off-heap memory 
usage was significantly reduced (same query on the same data).
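
   The same idea can be illustrated outside Parquet. The following is only a 
minimal sketch that uses the JDK's `java.util.zip` streams as a stand-in for 
the Parquet codec and zstd-jni; the class `EagerDecompress` and its methods are 
hypothetical names, not part of Parquet. It copies the decompressed bytes into 
a heap array right away and closes the stream with try-with-resources, so the 
native resources held by the inflater are released deterministically instead 
of waiting for GC:

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.EOFException;
   import java.io.IOException;
   import java.io.InputStream;
   import java.util.zip.DeflaterOutputStream;
   import java.util.zip.InflaterInputStream;

   // Hypothetical illustration class; java.util.zip stands in for the Parquet codec.
   final class EagerDecompress {

       // Helper that produces compressed input for the example.
       static byte[] compress(byte[] raw) throws IOException {
           ByteArrayOutputStream sink = new ByteArrayOutputStream();
           try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
               out.write(raw);
           }
           return sink.toByteArray();
       }

       // Reads exactly uncompressedSize bytes into a heap array, then closes the
       // stream immediately so its native inflater state is freed without GC.
       static byte[] decompressEagerly(byte[] compressed, int uncompressedSize)
               throws IOException {
           byte[] result = new byte[uncompressedSize];
           try (InputStream is =
                    new InflaterInputStream(new ByteArrayInputStream(compressed))) {
               int offset = 0;
               while (offset < uncompressedSize) {
                   int n = is.read(result, offset, uncompressedSize - offset);
                   if (n < 0) {
                       throw new EOFException("stream ended after " + offset + " bytes");
                   }
                   offset += n;
               }
           } // is.close() runs here, even if reading throws
           return result;
       }
   }
   ```

   Compared with calling `is.close()` after the copy, try-with-resources also 
closes the stream when the copy itself throws; whether that matters for the 
actual Parquet code path is an assumption I have not verified.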
   
   **Environment**: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
1.4.9.1 + glibc
   **Reporter**: [Yujiang 
Zhong](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zhongyuj) / 
@zhongyujiang
   
   <sub>**Note**: *This issue was originally created as 
[PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160). Please see 
the [migration 
documentation](https://issues.apache.org/jira/browse/PARQUET-2502) for further 
details.*</sub>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

