Re: [I] Report / blog on parquet metadata sizes for "large" (1000+) numbers of columns [arrow-rs]

via GitHub Fri, 17 May 2024 08:25:31 -0700


tustvold commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117842742


   > duplicating directly data from the parquet footer because reading the 
footer is too expensive
   
   But data locality is extremely important. If you have to scan a load of 
files only to ascertain that they're not of interest, that will be wasteful 
regardless of how optimal the storage format is. Most catalogs collocate 
aggregate statistics from across multiple files so that the number of files can 
be quickly and cheaply whittled down. Only then do they consult the files 
that haven't been eliminated and perform more granular pushdown to the page 
level.
   
   Or at least that's the theory...
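
The first, catalog-level step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `FileStats`, `may_contain`, `prune_files`, and the file names are hypothetical and are not part of arrow-rs or any real catalog API. The catalog is assumed to keep one min/max pair per file for a single integer column.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical catalog entry: aggregate min/max of one column for one file."""
    path: str
    min_value: int
    max_value: int


def may_contain(stats: FileStats, lo: int, hi: int) -> bool:
    """Return True if the file's [min, max] range overlaps the query range [lo, hi]."""
    return stats.min_value <= hi and stats.max_value >= lo


def prune_files(catalog: list[FileStats], lo: int, hi: int) -> list[str]:
    """Keep only the files whose statistics cannot rule them out."""
    return [s.path for s in catalog if may_contain(s, lo, hi)]


# Hypothetical catalog contents; no file is opened at this stage.
catalog = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# Query: 150 <= value <= 220. Only the surviving files would then have their
# footers read for row-group and page-level pruning.
candidates = prune_files(catalog, 150, 220)
print(candidates)
```

The point of the sketch is that the elimination step touches only the co-located statistics, so the cost of reading each footer is paid only for files that survive it.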


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
