Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [datafusion]

via GitHub Sat, 18 Oct 2025 00:24:40 -0700


alamb commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3387109890


   > I gave it a shot in https://github.com/apache/datafusion/pull/17980.
   
   I think you mean
   - https://github.com/apache/datafusion/pull/17978
   
   > Have we considered Precision<Distribution>? It seems to nicely encapsulate 
what I think we want.
   >  Seems like we might need to do a bit of juggling if we want to access 
Distribution from the current location.
   
   I can't help but feel the current `Distribution` doesn't have many practical 
benefits -- specifically the idea of having mathemetical descriptions of value 
distributions is intellectually appealing, but I have never see actual query 
engines use it (because real data is never completely described by those 
theoretical distributions). Maybe I am missing something
   
   > I don't love the current reperesentation Vec<ColumnStatistics> for several 
reasons: 
   
   I broadly agree with this concern
   
   > But that doesn't address (2).
   
   Maybe the statistics could include a `SchemaRef` pointer so projections 
could be converted back into names when needed
   
   >  I tried to address (1) and (3) by changing the representation to 
Vec<Option<Arc<ColumnDistributionStatistics>> which makes it cheaper to 
initailize as empty and cheaper to project.
   
   I think that is basically the right internal representation, but in terms of 
software engineering, I personally recommend:
   
   Updating statistics initially to be a wrapper:
   
   ```rust
   /// Not sure of this name. I am open to new ones
   ///  StatisticsV2 was proposed at some point
   /// but that seems to offer no additional context.
   pub struct RelationStatistics {
     inner: Vec<Option<Arc<ColumnStatistics>>
   }
   ```
   
   And then there should be an easy way to convert between 
`Vec<ColumnStatistics>` to a RelationStatistics to ease migration
   
   Then we could change the signatgure of ExecutionPlan to retun an arc'd 
version
   
   ```rust
   trait ExecutionPlan {
     /// retun the statistics for this plan
     pub fn relation_statistics(&self) -> &Arc<RelationStatistics>;
   ...
   }
   ```
   
   Then copying entire nodes would be fast (copy an Arc) as would projecting 
statistics 🤔 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [datafusion]

Reply via email to