Re: [PR] Impl a general get results from stats [datafusion]

via GitHub Fri, 05 Jul 2024 20:10:26 -0700


Rachelint commented on PR #11261:
URL: https://github.com/apache/datafusion/pull/11261#issuecomment-2211608009


   > Thank you @Rachelint ❤️
   > 
   > > It is mainly due to AggregateUDFImpl is defined in expr crate, and the 
needed PhysicalExpr is defined in datafusion-physical-expr-common crate which 
will refer exps a dependency. And if we import PhysicalExpr in expr, that will 
lead to cyclic dependecies...
   > 
   > This makes sense -- I agree the ` AggregateUDFImpl` API should not be in 
terms of `PhysicalExpr` (for the dependency reasons you describe)
   > 
   > One way I could think to get around this would be to figure out the column 
statistics in the physical planner and pass it directly (rather than making the 
UDF figure them out). Something like
   > 
   > ```rust
   > pub trait AggregateUDFImpl {
   > ...
   >    /// If the output of this aggregate function can be determined 
   >    /// only from input statistics (e.g. COUNT(..) with 0 rows, is always 0)
   >    /// return that value
   >    ///
   >    /// # Arguments
   >    /// stats: Overall statistics for the input (including row count)
   >    /// arg_stats: Statistics, if known, for each input argument
   >     fn output_from_stats(
   >         &self,
   >         stats: &Statistics,
   >         arg_stats: &[Option<&ColumnStatistics>]
   >     ) -> Option<ScalarValue> {
   >         None
   >     }
   > ...
   > }
   > ```
   > 
   > ANother option might be to use the narrower API described on #11153
   > 
   > ```rust
   > impl AggregateExpr {
   > 
   >  /// Return the value of the aggregate function, if known, given the 
number of input rows.
   >  /// 
   >  /// Return None if the value can not be determined solely from the input.
   >  /// 
   >  /// # Examples
   >  /// * The `COUNT` aggregate would return `Some(11)` given `num_rows = 11`
   >  /// * The `MIN` aggregate would return `Some(Null)` given `num_rows = 0
   >  /// * The `MIN` aggregate would return `None` given num_rows = 11
   >  fn output_from_rows(&self, num_rows: usize) -> Option<ScalarValue> { None 
}
   > ...
   > }
   > ```
   > 
   > Though I think your idea is better (pass in statistics in generla)
   
   
   
   > Thank you @Rachelint ❤️
   > 
   > > It is mainly due to AggregateUDFImpl is defined in expr crate, and the 
needed PhysicalExpr is defined in datafusion-physical-expr-common crate which 
will refer exps a dependency. And if we import PhysicalExpr in expr, that will 
lead to cyclic dependecies...
   > 
   > This makes sense -- I agree the ` AggregateUDFImpl` API should not be in 
terms of `PhysicalExpr` (for the dependency reasons you describe)
   > 
   > One way I could think to get around this would be to figure out the column 
statistics in the physical planner and pass it directly (rather than making the 
UDF figure them out). Something like
   > 
   > ```rust
   > pub trait AggregateUDFImpl {
   > ...
   >    /// If the output of this aggregate function can be determined 
   >    /// only from input statistics (e.g. COUNT(..) with 0 rows, is always 0)
   >    /// return that value
   >    ///
   >    /// # Arguments
   >    /// stats: Overall statistics for the input (including row count)
   >    /// arg_stats: Statistics, if known, for each input argument
   >     fn output_from_stats(
   >         &self,
   >         stats: &Statistics,
   >         arg_stats: &[Option<&ColumnStatistics>]
   >     ) -> Option<ScalarValue> {
   >         None
   >     }
   > ...
   > }
   > ```
   > 
   > ANother option might be to use the narrower API described on #11153
   > 
   > ```rust
   > impl AggregateExpr {
   > 
   >  /// Return the value of the aggregate function, if known, given the 
number of input rows.
   >  /// 
   >  /// Return None if the value can not be determined solely from the input.
   >  /// 
   >  /// # Examples
   >  /// * The `COUNT` aggregate would return `Some(11)` given `num_rows = 11`
   >  /// * The `MIN` aggregate would return `Some(Null)` given `num_rows = 0
   >  /// * The `MIN` aggregate would return `None` given num_rows = 11
   >  fn output_from_rows(&self, num_rows: usize) -> Option<ScalarValue> { None 
}
   > ...
   > }
   > ```
   > 
   > Though I think your idea is better (pass in statistics in generla)
   
   One thing I worry about the narrow api is that, it seems can't be used to 
support the origin optimization of min/max?
   
https://github.com/apache/datafusion/blob/7df000a333a7d4018e1446ef900f652288b1f104/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L195-L221
   
   Maybe I misunderstand about it?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Impl a general get results from stats [datafusion]

Reply via email to