Dandandan commented on issue #27:
URL: https://github.com/apache/arrow-datafusion/issues/27#issuecomment-825140989


   > > We probably need some fast way to have a rough estimate on the number of 
distinct values in the aggregate keys, maybe dynamically based on the first 
batch(es).
   > 
   > Does your `TableProvider` column stats work provide any useful base for 
this in situations where we're running aggregations on original table columns 
(as opposed to computed exprs) or is that too coarse?
   
   I think yes, for cases where the table is queried directly and we have 
statistics for distinct values available, those could be using heuristics for 
`group by` expressions based on cardinality statistics (say if you use column a 
and b maybe distinct(a)*distinct(b) would be an OK heuristic). 
   We also need the same support of distinct values per column for generalizing 
the join order optimization rule to more complicated joins and with more 
expressions than those directly on tables.
   Support for collecting  those statistics (i.e. `ANALYZE TABLE`) would need 
to be added too.
   
   To support the more general case we probably need a way to estimate the 
cardinality of the intermediate results, based on sampling one or a couple of 
batches with the particular group by expression.
   
   I added this requirement in this design doc:
   
https://docs.google.com/document/d/17DCBe_HygkjsoMzC4Znw-t8i1okLGlBkf0kp8DfLBhk/edit?usp=drivesdk
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to