thinkharderdev commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117884517

   > > duplicating directly data from the parquet footer because reading the 
footer is too expensive
   > 
   > But data locality is extremely important? If you have to scan a load of 
files only to ascertain they're not of interest, that will be wasteful 
regardless of how optimal the storage format is? Most catalogs collocate file 
statistics from across multiple files so that the number of files can be 
quickly and cheaply whittled down. Only then does it consult those files that 
haven't been eliminated and perform more granular push down to the row group 
and page level using the statistics embedded in those files.
   > 
   > Or at least that's the theory...
   
   Right, but there are two different levels here. You can store statistics 
that let you prune on metadata alone, but whatever survives that pruning you 
still have to scan, and scanning requires reading the column chunk metadata. So 
if the catalog tells you that you need to scan 100k files with 10k columns 
each, and you are only projecting 5 columns (which describes pretty much every 
query we run), then reading the entire parquet footer for each file just to 
get those 5 column chunks really adds up. 
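   A back-of-envelope sketch of why it adds up (the ~64 bytes of encoded 
metadata per column chunk is an assumption for illustration; real Thrift 
`ColumnChunk` structs vary with statistics, encodings, and path lengths):

```rust
// Assumed average size of one column chunk's encoded metadata (illustrative).
const BYTES_PER_CHUNK_META: u64 = 64;

// Total metadata bytes that must be decoded for a scan.
fn footer_bytes(files: u64, cols_per_file: u64) -> u64 {
    files * cols_per_file * BYTES_PER_CHUNK_META
}

fn main() {
    // Decoding every chunk's metadata in every footer...
    let full = footer_bytes(100_000, 10_000);
    // ...versus only the 5 projected columns per file.
    let pruned = footer_bytes(100_000, 5);
    println!("full: {} MiB, pruned: {} MiB", full >> 20, pruned >> 20);
}
```

Under that assumption the full decode is on the order of tens of GiB of 
metadata versus tens of MiB for the projected columns, a 2000x difference.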
   
   How much of that cost is down to thrift decoding specifically? Not much, I 
think; the real issue is just IO. But if you could read a file-specific 
metadata header that lets you prune IO by reading only the column chunk 
metadata you actually need (whether it is encoded using thrift or flatbuffers 
or protobuf or XYZ), then I think that could definitely help a lot. 
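   Something like a per-file index mapping each column to the byte range of 
its chunk metadata, so a reader fetches only those slices instead of the whole 
footer. This is purely a sketch of the idea, not an actual parquet structure; 
the names and layout are made up:

```rust
use std::collections::HashMap;

// Hypothetical per-file index: column name -> (offset, length) of that
// column's chunk metadata within the footer region.
struct MetadataIndex {
    ranges: HashMap<String, (u64, u64)>,
}

impl MetadataIndex {
    // Return only the byte ranges needed for the projected columns, so the
    // reader can issue targeted reads rather than decoding the full footer.
    fn ranges_for(&self, projection: &[&str]) -> Vec<(u64, u64)> {
        projection
            .iter()
            .filter_map(|col| self.ranges.get(*col).copied())
            .collect()
    }
}

fn main() {
    let mut ranges = HashMap::new();
    ranges.insert("user_id".to_string(), (8, 96));
    ranges.insert("ts".to_string(), (104, 80));
    ranges.insert("payload".to_string(), (184, 120));
    let index = MetadataIndex { ranges };

    // Projecting 2 of the 3 columns: only their metadata ranges get read.
    let needed = index.ranges_for(&["user_id", "ts"]);
    assert_eq!(needed.len(), 2);
}
```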
   
   Then again, when dealing with object storage, doing a bunch of 
random-access reads can be counter-productive, so ¯\_(ツ)_/¯
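   The usual mitigation for that is to coalesce nearby byte ranges into fewer 
GETs, trading some over-read for fewer round trips. A minimal sketch (the gap 
threshold is an assumption; real readers tune it per store):

```rust
// Merge (start, end) byte ranges whose gap is at most `max_gap`, so several
// small random reads become one larger sequential read.
fn coalesce(mut ranges: Vec<(u64, u64)>, max_gap: u64) -> Vec<(u64, u64)> {
    ranges.sort_by_key(|r| r.0);
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (start, end) in ranges {
        match out.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing another request.
            Some(last) if start <= last.1 + max_gap => last.1 = last.1.max(end),
            _ => out.push((start, end)),
        }
    }
    out
}

fn main() {
    let reads = vec![(0, 10), (12, 20), (4_000, 4_100)];
    // With a 64-byte gap tolerance the first two reads merge into one.
    assert_eq!(coalesce(reads, 64), vec![(0, 20), (4_000, 4_100)]);
}
```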

