Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Thu, 16 May 2024 16:56:34 -0700


tustvold commented on PR #242:
URL: https://github.com/apache/parquet-format/pull/242#issuecomment-2116394832


   Perhaps we could articulate the concrete use-cases we want to support with 
this? I understand that there is a desire to support extremely wide schemas of, 
say, 10,000 columns, but the precise nature of these columns eludes me.
   
   The reason I ask is that if we stick with a standard page size of 1MB, then 
a 10,000-column table with an even distribution across the columns is unlikely 
to ever need multiple row groups: at one page per column chunk, a single row 
group is already 10GB. This seems at odds with the stated motivation of this PR 
to avoid scaling per row group, which makes me think I am missing something.
   
   This makes me wonder if the use-case involves much smaller column chunks 
than normal, which would imply small pages, which might require changes beyond 
metadata if we want to support them effectively?
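
   As a rough sketch of the sizing arithmetic above, assuming one page per 
column chunk, evenly sized columns, and decimal units. The function names and 
the 128MB row group target are illustrative assumptions, not something taken 
from the PR:

```python
MB = 10**6
GB = 10**9


def row_group_bytes(num_columns: int, page_bytes: int, pages_per_chunk: int = 1) -> int:
    """Size of one row group when every column chunk holds the same number of pages."""
    return num_columns * pages_per_chunk * page_bytes


def chunk_bytes_for_target(target_row_group_bytes: int, num_columns: int) -> int:
    """Per-column chunk size needed to hit a target row group size (integer division)."""
    return target_row_group_bytes // num_columns


# 10,000 columns with a single 1MB page each
wide = row_group_bytes(10_000, 1 * MB)
print(wide / GB, "GB per row group")

# Hypothetical 128MB row group target spread over the same 10,000 columns
small = chunk_bytes_for_target(128 * MB, 10_000)
print(small, "bytes per column chunk")
```

   Under these assumptions the first case gives a 10GB row group, while the 
hypothetical 128MB target leaves only about 12.8KB per column chunk, which is 
far below a 1MB page.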
   
   _As an aside, I did some toy benchmarking of parquet-rs and confirmed that 
using thrift is perfectly fine and can perform on par with flatbuffers - 
https://github.com/apache/arrow-rs/issues/5770#issuecomment-2116370344_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


