Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [datafusion]

via GitHub Wed, 08 Oct 2025 12:18:34 -0700


adriangb commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3382885667


   I gave it a shot in https://github.com/apache/datafusion/pull/17980.
   I decided to create a new struct to replace `ColumnStatistics` and a new 
struct to replace `Statistics` and added them as fields to `PartitionedFile`, 
making it backwards compatible to at least experiment with the new thing.
   
   A couple interesting things I found:
   1. Have we considered `Precision<Distribution>`? It seems to nicely 
encapsulate what I think we want.
   2. I included and think we want several other public fields, e.g. I think 
`total_byte_size: Precision<usize>` is sensible and we should keep it. The main 
thing to replace was `min: Precision<ScalarValue>` and `max: 
Precision<ScalarValue>`. I kept `sum: Precision<ScalarValue>` but feel that 
maybe that could just be made to be a field of the `Distribution` variants? 
@berkaysynnada any thoughts?
   3. I couldn't put the new structs alongside the old ones because 
`datafusion/common` does not depend on `datafusion/expr-common`. Seems like we 
might need to do a bit of juggling if we want to access `Distribution` from the 
current location.
   4. I don't love the current reperesentation `Vec<ColumnStatistics>` for 
several reasons: (1) for wide tables / scans having empty `ColumnStatistics` 
takes up a lot of memory. We had a situation where we essentially initialized 
every file in the table (say 10k) and initialized empty stats for each column 
(say 100) so we ended up with a couple hunded MB of empty stats; (2) it depends 
on the projection so it's somewhat hard to tie statistics to the column by 
name; (3) when you apply a projection you have to depp clone all of the 
statistics. I tried to address (1) and (3) by changing the representation to 
`Vec<Option<Arc<ColumnDistributionStatistics>>` which makes it cheaper to 
initailize as empty and cheaper to project. But that doesn't address (2). 
Should we also include the table schema for each file? It's just one more 
`Arc'd` field. Then we could have helpers to get projected statistics for 
certain columns or certain indices, etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [datafusion]

Reply via email to