thinkharderdev commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117912724
> > Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯\\_(ツ)_/¯
>
> This 1000%. Even with the new SSD-backed stores, latencies are still on the order of milliseconds.

> > So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file just to get the 5 column chunks is going to really add up
>
> Yeah, at that point your only option really is to alter the write path to distribute the data in such a way that queries can eliminate files; no metadata chicanery is going to save you if you have to open 100K files 😅

Nimble (the new Meta storage format) has an interesting approach to this. From what I understand, they store most of the data that would go in the Parquet footer inline in the column chunk itself. The footer then only needs to describe how the column chunks are laid out (along with the schema), and since you have to fetch a column chunk anyway, you avoid reading the chunk-specific metadata for columns you don't care about. But they also don't support predicate pushdown, so once you add that in you end up with more IOPS to prune data pages (I think).
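To make the layout idea concrete, here is a minimal Rust sketch. These are hypothetical types, not Nimble's or Parquet's actual structures: the footer carries only the schema and chunk locations, while per-chunk metadata (encodings, stats) travels inline with the chunk bytes, so a projected read never pays for the metadata of columns it skips.

```rust
// Hypothetical file layout, sketching the "metadata inline with the chunk"
// idea. In Parquet, ChunkHeader-like information for *every* column lives in
// the footer, so a 5-of-10k projection still deserializes all of it.

/// Footer: just enough to locate chunks; its size per projected column is
/// small and independent of per-chunk metadata.
struct Footer {
    schema: Vec<Field>,
    /// (column index, byte offset, byte length) for each column chunk
    chunk_locations: Vec<(usize, u64, u64)>,
}

struct Field {
    name: String,
    // data type, nullability, etc. elided
}

/// Stored inline at the start of each column chunk, so it arrives in the
/// same fetch as the data itself.
struct ChunkHeader {
    encoding: u8,
    num_values: u64,
    /// Optional (min, max) statistics as raw bytes
    min_max_stats: Option<(Vec<u8>, Vec<u8>)>,
}

/// A projected read resolves only the byte ranges for the requested columns;
/// their headers come along "for free" inside those ranges.
fn read_projected(footer: &Footer, projection: &[usize]) -> Vec<(u64, u64)> {
    footer
        .chunk_locations
        .iter()
        .filter(|(col, _, _)| projection.contains(col))
        .map(|&(_, offset, len)| (offset, len))
        .collect()
}
```

Under this layout, scanning 100k files with a 5-column projection costs one small footer read plus 5 chunk reads per file. The flip side, as noted above, is that page-level pruning stats would also live inline, so predicate pushdown would need extra fetches before it could skip any data pages.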
