tustvold commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117899838

> Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯\\\_(ツ)\_/¯

This, 1000%. Even with the new SSD-backed stores, latencies are still on the order of milliseconds.

> So if after the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 column chunks is going to really add up

Yeah, at that point you need to alter the write path to distribute the data in such a way that queries can eliminate files; no metadata chicanery is going to save you if you have to open 100K files :sweat_smile:

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
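To put a rough number on why one footer read per file "really adds up", here is a back-of-envelope sketch. The 5 ms per-request figure is an assumed illustrative latency (the thread only says "on the order of milliseconds"), and `footer_scan_seconds` is a hypothetical helper, not an arrow-rs API:

```rust
// Back-of-envelope cost of issuing one footer GET per file against
// object storage, before reading a single data byte.
// Assumptions (not from the thread): a fixed per-request latency and
// perfectly even concurrency with no throttling.
fn footer_scan_seconds(files: u64, ms_per_get: u64, concurrency: u64) -> f64 {
    (files * ms_per_get) as f64 / 1000.0 / concurrency as f64
}

fn main() {
    // Serial: 100_000 files * 5 ms = 500 seconds of pure request latency.
    println!("serial:   {} s", footer_scan_seconds(100_000, 5, 1));
    // Even with 100-way concurrency, ~5 seconds of latency remain,
    // which is why pruning files via the write path beats metadata tricks.
    println!("100-way:  {} s", footer_scan_seconds(100_000, 5, 100));
}
```

The point of the arithmetic is that the cost scales with the file count, so only eliminating files (better data distribution at write time) changes the asymptotics; shrinking or caching footers only shaves the constant.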
