tustvold commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117842742
> duplicating directly data from the parquet footer because reading the footer is too expensive

But data locality is extremely important. If you have to scan a load of files only to ascertain that they're not of interest, that will be wasteful regardless of how optimal the storage format is? Most catalogs collocate aggregate statistics from across multiple files so that the number of files can be quickly and cheaply whittled down. Only then do they consult the files that haven't been eliminated and perform more granular pushdown to the page level.

Or at least that's the theory...
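To make the two-stage idea concrete, here is a minimal sketch of catalog-level pruning followed by page-level pruning. Everything in it is hypothetical and for illustration only: `MinMax`, `CatalogEntry`, `FilePages`, `prune_files` and `prune_pages` are made-up names, not arrow-rs or parquet APIs, and it assumes a single `i64` column with a closed-range predicate `[lo, hi]`.

```rust
// Hypothetical types for illustration; these are NOT arrow-rs or parquet APIs.

/// Min/max statistics for one column over some unit of data (a file or a page).
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i64,
    max: i64,
}

impl MinMax {
    /// True if the range [min, max] intersects the predicate range [lo, hi],
    /// i.e. the unit cannot be ruled out from statistics alone.
    fn may_overlap(&self, lo: i64, hi: i64) -> bool {
        self.min <= hi && self.max >= lo
    }
}

/// One catalog entry: file-level statistics stored together with those of
/// other files, so they can be checked without opening the file itself.
struct CatalogEntry {
    path: &'static str,
    stats: MinMax,
}

/// Page-level statistics, which would normally live inside each file.
struct FilePages {
    pages: Vec<MinMax>,
}

/// Stage 1: consult only the catalog and keep files that may contain matches.
fn prune_files<'a>(catalog: &'a [CatalogEntry], lo: i64, hi: i64) -> Vec<&'a CatalogEntry> {
    catalog
        .iter()
        .filter(|e| e.stats.may_overlap(lo, hi))
        .collect()
}

/// Stage 2: for a file that survived stage 1, return the indices of the pages
/// that may contain matches.
fn prune_pages(file: &FilePages, lo: i64, hi: i64) -> Vec<usize> {
    file.pages
        .iter()
        .enumerate()
        .filter(|(_, p)| p.may_overlap(lo, hi))
        .map(|(i, _)| i)
        .collect()
}
```

The point of the split is that `prune_files` touches only the collocated statistics, so files that cannot match are never opened. Only the survivors pay the cost of reading their own metadata for `prune_pages`.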
