Gian Merlino updated PARQUET-2429:
----------------------------------
Description:
The direct buffers in NonBlockedDecompressor (and NonBlockedCompressor) are
grown one chunk at a time as the class receives successive setInput calls. When
decompressing a 64MB block using a 4KB chunk size, this leads to thousands of
allocations and deallocations totaling GBs of memory. This can be avoided by
doubling the buffer each time it needs to grow, rather than adding only the
minimal amount of new space.
In a practical scenario I ran into, this change reduced the time taken to read
a 140MB Parquet file from 35s to <2s.
PR: https://github.com/apache/parquet-mr/pull/1270
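To make the difference between the two growth policies concrete, here is a
minimal sketch in Java. It is not the code from the PR: the class name
BufferGrowthSketch, the Stats holder, and the simulate method are all
hypothetical, and the model is simplified. It assumes every setInput call
delivers one full chunk, that nothing is consumed between calls, and it only
counts reallocations rather than allocating real direct buffers.

```java
public class BufferGrowthSketch {

    /** Hypothetical holder: how many buffers were allocated and their combined size. */
    static final class Stats {
        final long allocations;
        final long bytesAllocated;

        Stats(long allocations, long bytesAllocated) {
            this.allocations = allocations;
            this.bytesAllocated = bytesAllocated;
        }
    }

    /**
     * Simulates filling a buffer with totalBytes, delivered chunkBytes at a time.
     * With doubling == false the buffer grows to exactly the size needed (the
     * pattern described in the issue); with doubling == true the capacity at
     * least doubles whenever it has to grow.
     */
    static Stats simulate(long totalBytes, long chunkBytes, boolean doubling) {
        long capacity = 0;
        long used = 0;
        long allocations = 0;
        long bytesAllocated = 0;

        while (used < totalBytes) {
            long len = Math.min(chunkBytes, totalBytes - used);
            long needed = used + len;
            if (needed > capacity) {
                // A real implementation would allocate a new buffer here and
                // copy the old contents across; this sketch only counts it.
                capacity = doubling ? Math.max(needed, capacity * 2) : needed;
                allocations++;
                bytesAllocated += capacity;
            }
            used = needed;
        }
        return new Stats(allocations, bytesAllocated);
    }

    public static void main(String[] args) {
        long block = 64L << 20; // 64MB block, as in the issue description
        long chunk = 4L << 10;  // 4KB chunk size, as in the issue description

        Stats exact = simulate(block, chunk, false);
        Stats doubled = simulate(block, chunk, true);

        System.out.println("grow-by-chunk: " + exact.allocations
                + " allocations, " + exact.bytesAllocated + " bytes");
        System.out.println("doubling:      " + doubled.allocations
                + " allocations, " + doubled.bytesAllocated + " bytes");
    }
}
```

Under these assumptions, growing by one chunk at a time reallocates on every
call, so the total allocated size grows quadratically with the block size,
while doubling needs only a logarithmic number of reallocations and keeps the
total within a small constant factor of the final buffer size.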
> Direct buffer churn in NonBlockedDecompressor
> ---------------------------------------------
>
> Key: PARQUET-2429
> URL: https://issues.apache.org/jira/browse/PARQUET-2429
> Project: Parquet
> Issue Type: Bug
> Reporter: Gian Merlino
> Priority: Major