tustvold commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117899838

> Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯\\\_(ツ)\_/¯

This, 1000%. Even with the new SSD-backed stores, latencies are still on the order of milliseconds.

> So if after the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 column chunks is going to really add up

Yeah, at that point you need to alter the write path to distribute the data in such a way that queries can eliminate files; no metadata chicanery is going to save you if you have to open 100K files :sweat_smile:

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
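To put a rough number on why one footer read per file "really adds up", here is a back-of-envelope sketch. The 5 ms per-request figure is an assumed illustrative latency (the thread only says "on the order of milliseconds"), and `footer_scan_seconds` is a hypothetical helper, not an arrow-rs API:

```rust
// Back-of-envelope cost of issuing one footer GET per file against
// object storage, before reading a single data byte.
// Assumptions (not from the thread): a fixed per-request latency and
// perfectly even concurrency with no throttling.
fn footer_scan_seconds(files: u64, ms_per_get: u64, concurrency: u64) -> f64 {
    (files * ms_per_get) as f64 / 1000.0 / concurrency as f64
}

fn main() {
    // Serial: 100_000 files * 5 ms = 500 seconds of pure request latency.
    println!("serial:   {} s", footer_scan_seconds(100_000, 5, 1));
    // Even with 100-way concurrency, ~5 seconds of latency remain,
    // which is why pruning files via the write path beats metadata tricks.
    println!("100-way:  {} s", footer_scan_seconds(100_000, 5, 100));
}
```

The point of the arithmetic is that the cost scales with the file count, so only eliminating files (better data distribution at write time) changes the asymptotics; shrinking or caching footers only shaves the constant.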
