thinkharderdev commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117884517
> > duplicating directly data from the parquet footer because reading the footer is too expensive
>
> But data locality is extremely important? If you have to scan a load of files only to ascertain they're not of interest, that will be wasteful regardless of how optimal the storage format is? Most catalogs collocate file statistics from across multiple files so that the number of files can be quickly and cheaply whittled down. Only then does it consult those files that haven't been eliminated and perform more granular push down to the row group and page level using the statistics embedded in those files.
>
> Or at least that's the theory...

Right, but there are two different levels here. You can store statistics that let you prune based only on metadata, but whatever is left you still have to go and scan, and the scanning requires reading the column chunk metadata. So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file just to get those 5 column chunks really adds up.

How much of that cost is related to thrift decoding specifically? I think not much; the real issue is just IO. But if you could read a file-specific metadata header that lets you prune IO by reading only the column chunk metadata you actually need (whether it is encoded using thrift or flatbuffers or protobuf or XYZ), then I think that could definitely help a lot. Then again, when dealing with object storage, doing a bunch of random access reads can be counter-productive, so ¯\_(ツ)_/¯
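
To make that concrete, here is a rough sketch of what the read path looks like today with the parquet crate's footer API (the file path and projected column names below are made up, and the projection check on dotted column paths is simplified for illustration):

```rust
use std::fs::File;

use parquet::file::footer::parse_metadata;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file and projection: in our case this would be one of
    // ~100k files with ~10k columns, of which the query touches ~5.
    let file = File::open("data/part-00000.parquet")?;
    let projected = ["a", "b", "c"];

    // This thrift-decodes the *entire* footer: ColumnChunkMetaData for every
    // column in every row group is materialized, even though we only need a few.
    let metadata = parse_metadata(&file)?;

    for (rg_idx, rg) in metadata.row_groups().iter().enumerate() {
        for cc in rg.columns() {
            // Dotted path of the leaf column this chunk belongs to.
            let name = cc.column_path().string();
            if projected.contains(&name.as_str()) {
                // byte_range() is the (offset, length) of the column chunk,
                // i.e. the IO we would actually issue for this column.
                let (offset, len) = cc.byte_range();
                println!("row group {rg_idx}: {name} @ {offset} (+{len} bytes)");
            }
        }
    }
    Ok(())
}
```

So all of the footer gets fetched and decoded just to pull out a handful of byte ranges; a metadata layout that let you seek straight to the projected columns' chunk metadata would skip most of that work, at the cost of extra random reads.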
