tustvold commented on PR #242: URL: https://github.com/apache/parquet-format/pull/242#issuecomment-2116394832
Perhaps we could articulate the concrete use-cases we want to support with this? I understand that there is a desire to support extremely wide schemas of, say, 10,000 columns, but the precise nature of these columns eludes me.

The reason I ask is that if we stick with a standard page size of 1MB, then a 10,000-column table with an even distribution of data across the columns is unlikely to ever need multiple row groups - it will be 10GB with just a single row group, assuming at least one full page per column chunk. This seems at odds with the stated motivation of this PR to avoid scaling per row group, which makes me think I am missing something.

This makes me wonder if the use-case involves much smaller column chunks than normal, which would imply small pages, which might require changes beyond metadata if we want to support them effectively.

_As an aside I did some toy benchmarking of parquet-rs, and confirmed that using thrift is perfectly fine, and can perform on par with flatbuffers - https://github.com/apache/arrow-rs/issues/5770#issuecomment-2116370344_
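For reference, the back-of-the-envelope estimate in this comment can be reproduced with a small sketch. The function names are hypothetical, the units are decimal (1MB = 1,000,000 bytes), and the 128MB row group used in the second calculation is purely an illustrative assumption, not a figure from the PR or the Parquet spec.

```python
# Rough sizing sketch. Assumptions (not from the PR): decimal units,
# evenly distributed data, and at least one full page per column chunk.

PAGE_SIZE_BYTES = 1_000_000  # the "standard page size of 1MB" from the comment
NUM_COLUMNS = 10_000         # the wide-schema example from the comment


def min_row_group_bytes(num_columns: int, page_size_bytes: int) -> int:
    """Smallest row group in which every column chunk holds one full page."""
    return num_columns * page_size_bytes


def column_chunk_bytes(row_group_bytes: int, num_columns: int) -> int:
    """Bytes available per column chunk for a given row group size."""
    return row_group_bytes // num_columns


if __name__ == "__main__":
    size = min_row_group_bytes(NUM_COLUMNS, PAGE_SIZE_BYTES)
    print(f"single row group: {size / 1e9:.0f} GB")

    # Hypothetical 128MB row group, chosen only for illustration.
    chunk = column_chunk_bytes(128_000_000, NUM_COLUMNS)
    print(f"column chunk in a 128MB row group: {chunk} bytes")
```

Under these assumptions a much smaller row group leaves only a few kilobytes per column chunk, which is why small column chunks would imply pages far below 1MB.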
