[ 
https://issues.apache.org/jira/browse/PARQUET-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gian Merlino updated PARQUET-2429:
----------------------------------
    Description: 
Input buffers for NonBlockedDecompressor (and NonBlockedCompressor) grow 
one chunk at a time as these classes receive successive setInput calls. When 
decompressing a 64MB block with a 4KB chunk size, this leads to thousands of 
allocations and deallocations totaling gigabytes of memory. This can be avoided 
by doubling the buffer each time it fills rather than growing it by the minimal 
amount of new space.

In one practical scenario, this change reduced the time to read a 140MB Parquet 
file from 35s to under 2s.

PR: https://github.com/apache/parquet-mr/pull/1270
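
As a minimal sketch of the growth strategy described above (the class and 
method names here are illustrative only, not parquet-mr's actual API): when 
setInput needs more room, grow the buffer to at least double its current 
capacity instead of by the exact shortfall, so total copying stays amortized 
O(n) and the number of reallocations is O(log n).

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; not parquet-mr's actual implementation.
public class GrowableInputBuffer {
    private ByteBuffer buffer = ByteBuffer.allocate(0);
    private int allocations = 0;

    public void setInput(byte[] chunk, int offset, int length) {
        if (buffer.remaining() < length) {
            int needed = buffer.position() + length;
            // Doubling (rather than growing to exactly 'needed') bounds
            // reallocations at O(log n) instead of one per chunk.
            int newCapacity = Math.max(buffer.capacity() * 2, needed);
            ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
            allocations++;
            buffer.flip();       // switch old buffer to read mode
            bigger.put(buffer);  // copy the input accumulated so far
            buffer = bigger;
        }
        buffer.put(chunk, offset, length);
    }

    public int size() { return buffer.position(); }
    public int allocations() { return allocations; }
}
```

With minimal growth, a 64MB input fed in 4KB chunks triggers one reallocation 
per chunk (16,384 total); with doubling, about 15 suffice.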

  was:
NonBlockedDecompressor (and NonBlockedCompressor) are grown one chunk at a time 
as the class receives successive setInput calls. When decompressing a 64MB 
block using a 4KB chunk size, this leads to thousands of allocations and 
deallocations totaling GBs of memory. This can be avoided by doubling the 
buffer each time rather than adding on a minimal amount of new space.

In a practical scenario I ran into, the time taken to read a 140MB Parquet file 
was reduced from 35s to <2s.

PR: https://github.com/apache/parquet-mr/pull/1270


> Direct buffer churn in NonBlockedDecompressor
> ---------------------------------------------
>
>                 Key: PARQUET-2429
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2429
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Gian Merlino
>            Priority: Major
>
> Input buffers for NonBlockedDecompressor (and NonBlockedCompressor) grow 
> one chunk at a time as these classes receive successive setInput calls. When 
> decompressing a 64MB block with a 4KB chunk size, this leads to thousands of 
> allocations and deallocations totaling gigabytes of memory. This can be 
> avoided by doubling the buffer each time it fills rather than growing it by 
> the minimal amount of new space.
> In one practical scenario, this change reduced the time to read a 140MB 
> Parquet file from 35s to under 2s.
> PR: https://github.com/apache/parquet-mr/pull/1270



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
