gszadovszky commented on PR #1270:
URL: https://github.com/apache/parquet-mr/pull/1270#issuecomment-1975989704

   @gianm, I agree with @wgtmac's concern about the expected size. For 
compression/decompression we are targeting the page size. The page size is 
bounded by two configs, `parquet.page.size` and `parquet.page.row.count.limit`. 
(See details 
[here](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop).) One 
may configure both to higher values, but it does not really make sense to have 
64M pages.
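   For reference, a sketch of what setting those two configs might look like on a Hadoop `Configuration` (the values shown are the documented defaults, i.e. roughly the sizes compression buffers would be expected to handle; this is illustrative, not a recommendation to change them):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Target (uncompressed) page size in bytes; 1 MiB is the documented default.
conf.setInt("parquet.page.size", 1024 * 1024);
// Upper bound on rows per page; 20000 is the documented default.
conf.setInt("parquet.page.row.count.limit", 20_000);
```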
   I would not use a Hadoop config for the default size of compression buffers, 
since Hadoop typically compresses whole files. The default page size would 
probably be a better choice here.
   I like the idea of keeping the last size in the codec so that the next time 
you don't need the multiple re-allocations. The catch here might be the case of 
writing Parquet files with different page size configurations, where we might 
allocate more than actually required. But I don't think this would be a 
real-life scenario.
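   To make the idea concrete, here is a minimal sketch of a codec-side buffer that remembers its last size so subsequent pages of the same (or smaller) size reuse the allocation. The class and method names are illustrative only, not the actual parquet-mr API; the `allocations` counter exists purely to show the behavior:

```java
/**
 * Illustrative sketch (not parquet-mr code): a reusable buffer that
 * grows on demand and keeps its last size across calls, so repeated
 * (de)compression of similar-sized pages avoids re-allocation.
 */
class ReusableCodecBuffer {
    private byte[] buffer = new byte[0];

    /** Allocation count, exposed only to demonstrate reuse. */
    int allocations = 0;

    /** Returns a buffer of at least {@code requiredSize} bytes, reusing the last one when possible. */
    byte[] get(int requiredSize) {
        if (buffer.length < requiredSize) {
            // Grow once; the larger buffer is kept for subsequent pages.
            buffer = new byte[requiredSize];
            allocations++;
        }
        // May be larger than requested when a previous page was bigger --
        // the over-allocation case mentioned above.
        return buffer;
    }
}
```

With this shape, a writer producing pages of a consistent size pays for exactly one allocation; only a larger page forces a new one, and a smaller page simply reuses the (oversized) buffer.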


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
