Gian Merlino created PARQUET-2429:
-------------------------------------
Summary: Direct buffer churn in NonBlockedDecompressor
Key: PARQUET-2429
URL: https://issues.apache.org/jira/browse/PARQUET-2429
Project: Parquet
Issue Type: Bug
Reporter: Gian Merlino
NonBlockedDecompressor (and NonBlockedCompressor) grow their direct buffers one
chunk at a time as they receive successive setInput calls. When decompressing a 64MB
block using a 4KB chunk size, this leads to thousands of allocations and
deallocations totaling GBs of memory. This can be avoided by doubling the
buffer each time rather than adding on a minimal amount of new space.
In a practical scenario I ran into, doubling the buffer reduced the time taken
to read a 140MB Parquet file from 35s to <2s.
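
To make the difference between the two growth strategies concrete, here is a
minimal sketch. It is not the actual Parquet code or patch: the class name
BufferGrowthSketch and the methods simulate and ensureCapacity are hypothetical
and exist only for illustration. It assumes every setInput call delivers one
chunk and that the buffer is reallocated whenever it is too small.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the NonBlockedDecompressor implementation.
final class BufferGrowthSketch {

    /** Number of reallocations and total bytes allocated over one simulated run. */
    static final class Stats {
        final long allocations;
        final long bytesAllocated;

        Stats(long allocations, long bytesAllocated) {
            this.allocations = allocations;
            this.bytesAllocated = bytesAllocated;
        }
    }

    /**
     * Simulates feeding totalBytes of input in chunkSize pieces.
     * With doubling == false the buffer grows to exactly the size needed
     * (one chunk at a time); with doubling == true it at least doubles.
     */
    static Stats simulate(long totalBytes, int chunkSize, boolean doubling) {
        long capacity = 0, used = 0, allocations = 0, bytesAllocated = 0;
        while (used < totalBytes) {
            long needed = used + Math.min(chunkSize, totalBytes - used);
            if (needed > capacity) {
                capacity = doubling ? Math.max(needed, capacity * 2) : needed;
                allocations++;
                bytesAllocated += capacity;
            }
            used = needed;
        }
        return new Stats(allocations, bytesAllocated);
    }

    /**
     * Returns a buffer with room for at least `extra` more bytes, at least
     * doubling the capacity when a reallocation is required. The existing
     * contents (bytes before the current position) are copied over.
     */
    static ByteBuffer ensureCapacity(ByteBuffer buf, int extra) {
        if (buf.remaining() >= extra) {
            return buf;
        }
        long needed = (long) buf.position() + extra;
        if (needed > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("requested size exceeds ByteBuffer limit");
        }
        // Compute in long so that doubling a large capacity cannot overflow int.
        int newCapacity = (int) Math.min(Integer.MAX_VALUE, Math.max(needed, 2L * buf.capacity()));
        ByteBuffer bigger = ByteBuffer.allocateDirect(newCapacity);
        buf.flip();          // switch the old buffer to read mode
        bigger.put(buf);     // copy existing contents; position now equals bytes copied
        return bigger;
    }

    public static void main(String[] args) {
        long total = 64L * 1024 * 1024; // 64MB block
        int chunk = 4 * 1024;           // 4KB chunks
        Stats linear = simulate(total, chunk, false);
        Stats doubled = simulate(total, chunk, true);
        System.out.println("chunk-at-a-time: " + linear.allocations
                + " allocations, " + linear.bytesAllocated + " bytes");
        System.out.println("doubling:        " + doubled.allocations
                + " allocations, " + doubled.bytesAllocated + " bytes");
    }
}
```

Under these assumptions, the 64MB / 4KB case needs 16384 reallocations when
growing one chunk at a time, versus 15 when doubling. The exact figures in the
real classes depend on details this sketch does not model.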