gianm commented on PR #1270: URL: https://github.com/apache/parquet-mr/pull/1270#issuecomment-1977765562
> I agree with @wgtmac's concern about the expected size. For compression/decompression we are targeting the page size. The page size is limited by two configs, `parquet.page.size` and `parquet.page.row.count.limit`. (See details [here](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop).) One may configure both to higher values but it does not really make sense to have 64M pages.

I did encounter these in the real world, although it's always possible that they were built with some abnormally large values for some reason.

> I would not use a hadoop config for the default size of compression buffers. Hadoop typically compresses whole files. Probably the default page size would be a better choice here.

I'm ok with doing whichever. FWIW, the setting `io.file.buffer.size` I used in the most recent patch (which was recommended here: https://github.com/apache/parquet-mr/pull/1270#discussion_r1493591742) defaults to 4096 bytes.

I am not really a Parquet expert, so I am willing to use whatever y'all recommend. Is there another property that would be better?
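For reference, a sketch of the three properties under discussion as a Hadoop configuration fragment. The values shown are illustrative: `io.file.buffer.size` defaulting to 4096 bytes is stated above, and the 1 MB / 20,000-row defaults are the usual parquet-hadoop defaults, which may differ by version.

```xml
<!-- Sketch only: example values, not a recommendation. -->
<configuration>
  <!-- Upper bound on page size in bytes (parquet-hadoop; default is typically 1 MB). -->
  <property>
    <name>parquet.page.size</name>
    <value>1048576</value>
  </property>
  <!-- Upper bound on rows per page (parquet-hadoop; default is typically 20000). -->
  <property>
    <name>parquet.page.row.count.limit</name>
    <value>20000</value>
  </property>
  <!-- Generic Hadoop I/O buffer size used by the current patch; defaults to 4096 bytes. -->
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
</configuration>
```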
