thinkharderdev commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117835663
> > that we built an entire separate system similar
>
> My reading of https://github.com/G-Research/PalletJack is it is filling a similar role to a catalog like Hive, Deltalake or iceberg. It makes sense to me that applications would want to build additional metadata structures over the top of collections of parquet files that are optimised for their particular read/write workloads, and that by design these would not be a part of the storage format itself?

I'm not sure that's right. A catalog can do many things, and certainly some of those things don't belong in the storage format (e.g. grouping individual parquet files together into some logical group relevant to the query). But if the catalog is just directly duplicating data from the parquet footer because reading the footer is too expensive, that seems like something that should be addressed in the storage format itself.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
