Re: [I] Report / blog on parquet metadata sizes for "large" (1000+) numbers of columns [arrow-rs]

via GitHub Fri, 17 May 2024 08:21:50 -0700


thinkharderdev commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117835663


   > > that we built an entire separate system similar
   > 
   > My reading of https://github.com/G-Research/PalletJack is it is filling a
   > similar role to a catalog like Hive, Deltalake or iceberg. It makes sense to me
   > that applications would want to build additional metadata structures over the
   > top of collections of parquet files that are optimised for their particular
   > read/write workloads, and that by design these would not be a part of the
   > storage format itself?
   
   I'm not sure that's right. A catalog can do many things, and certainly some
of those things don't belong in the storage format (e.g. grouping individual
parquet files together into some logical group relevant to the query or
something), but if the catalog is just directly duplicating data from the
parquet footer because reading the footer is too expensive, that seems like
something that should be addressed in the storage format itself.


