Re: [PR] Add ability to skip or transform page encoding statistics in Parquet metadata [arrow-rs]

via GitHub Mon, 17 Nov 2025 07:35:32 -0800


etseidl commented on PR #8797:
URL: https://github.com/apache/arrow-rs/pull/8797#issuecomment-3542499595


   > Given this observation, what do you think about removing the 
`skip_encoding_stats` option? If it makes the API more complicated, and doesn't 
make decoding faster, why are we adding it?
   
   Well, there are a couple of things coming up. I'm working on speeding up the 
skipping code, and once we have the skip index this will be even faster. There 
are also other stats in there that could be skipped (chunk `Statistics`, size 
stats, geo stats, bloom filter pointers). We could have a single option to skip 
all of them, but I can see wanting to enable some and not others depending on 
the use case. Filtering on a sorted column would want the chunk stats, while 
filtering on an unsorted column might want the bloom filter but not the other 
stats. If I want a size estimate for planning purposes but don't plan on 
filtering, I would want size stats and nothing else. I think we should support 
all of these. I can also see wanting different options for different columns in 
a single query.
   
   But I can also see the argument that if the stats are enabling page pruning, 
the savings from that pruning should outweigh the cost of decoding the stats 
that enable it, so we shouldn't worry so much and should keep this simple for 
now. We can make it more fine-grained later if need be.
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
