thinkharderdev commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117912724
> > Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯\\_(ツ)_/¯
>
> This 1000%. Even with the new SSD-backed stores, latencies are still on the order of milliseconds.

> > So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file just to get the 5 column chunks is going to really add up
>
> Yeah, at that point your only option really is to alter the write path to distribute the data in such a way that queries can eliminate files; no metadata chicanery is going to save you if you have to open 100K files 😅

Nimble (the new Meta storage format) has an interesting approach to this. From what I understand, they store most of the data that would go in the Parquet footer inline in the column chunk itself. The footer then only needs to describe how the column chunks are laid out (along with the schema), and since you have to fetch a column chunk anyway, you avoid reading the chunk-specific metadata for columns you don't care about. But they also don't support predicate pushdown, so once you add that in you end up with more IOPS to prune data pages (I think).
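To make the layout idea concrete, here is a minimal Rust sketch. These are hypothetical types, not Nimble's or Parquet's actual structures: the footer carries only the schema and chunk locations, while per-chunk metadata (encodings, stats) travels inline with the chunk bytes, so a projected read never pays for the metadata of columns it skips.

```rust
// Hypothetical file layout, sketching the "metadata inline with the chunk"
// idea. In Parquet, ChunkHeader-like information for *every* column lives in
// the footer, so a 5-of-10k projection still deserializes all of it.

/// Footer: just enough to locate chunks; its size per projected column is
/// small and independent of per-chunk metadata.
struct Footer {
    schema: Vec<Field>,
    /// (column index, byte offset, byte length) for each column chunk
    chunk_locations: Vec<(usize, u64, u64)>,
}

struct Field {
    name: String,
    // data type, nullability, etc. elided
}

/// Stored inline at the start of each column chunk, so it arrives in the
/// same fetch as the data itself.
struct ChunkHeader {
    encoding: u8,
    num_values: u64,
    /// Optional (min, max) statistics as raw bytes
    min_max_stats: Option<(Vec<u8>, Vec<u8>)>,
}

/// A projected read resolves only the byte ranges for the requested columns;
/// their headers come along "for free" inside those ranges.
fn read_projected(footer: &Footer, projection: &[usize]) -> Vec<(u64, u64)> {
    footer
        .chunk_locations
        .iter()
        .filter(|(col, _, _)| projection.contains(col))
        .map(|&(_, offset, len)| (offset, len))
        .collect()
}
```

Under this layout, scanning 100k files with a 5-column projection costs one small footer read plus 5 chunk reads per file. The flip side, as noted above, is that page-level pruning stats would also live inline, so predicate pushdown would need extra fetches before it could skip any data pages.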
