asolimando commented on PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#issuecomment-4491522282

   > Thank you so much for pushing the staistics forward @asolimando -- this is 
an area that has been long overdue for improvement. Akso thank you for helping 
@xudong963
   > 
   > I am sorry for the delay in reviewing this PR as well
   
   Thank you for finding the time and to the many people who helped so far with 
discussions and reviews, I really appreciate it and, as a maintainer myself, I 
know how much time and effort it takes.
   
   > I think the core abstraction for `ExpressionAnalyzerRegistry` makes sense 
here as a way to extend analysis, but I think the plumbing is a bit unwieldy (I 
left a more detailed comment inline0
   
   I have replied inline, the status quo was meant to be temporary, after 
https://github.com/apache/datafusion/pull/21815 the integration will be easier 
and this will allow me to simplify the design along the lines you have 
suggested.
   
   > Other thoughts:
   > 
   > ## Statistics V2
   > 1. This PR seems like it is not using the new statistics framework 
introduced in   #14699 (which is fine, but I think I may make a PR proposing to 
deprecate that code to make it clear)
   
   I saw the PR for deprecation, makes sense to me.
   
   > ## "Motivating Example" / Usecase
   > It would help me evaluate these APIs to see a "motivating example" in the 
examples crate. Examples make sure we are building an API that is useable for 
the desired usecase (and it has the bonus of providing documentation for others)
   > 
   > Specifically in this case I think that example would be a program written 
from an user point of view where the system wants to plug in its own statistics 
framework. For example, it has some way to know what the cardinality of 
`join_key=12345` is based on its own statistics) and maybe wants to customize 
the JoinNode or something
   > 
   > This approach really helped me with the API for custom indexes. Here is 
what I did for that: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/data_io/parquet_embedded_index.rs
   > 
   > With a motivating example like this
   
   @alamb, that's a great suggestion. I think the right motivating example for 
`ExpressionAnalyzer` specifically could involve an expression like `a + 1`, 
injective, for which today we return `Absent`, or a custom UDF scenario, 
similarly unsupported today.
   
   The "external cardinality for join keys" use case you mentioned (knowing 
`join_key=12345` matches N rows) is more naturally a `StatisticsRegistry` 
concern (operator-level, PR #21483) than an `ExpressionAnalyzer` one 
(expression-level NDV/selectivity).
   
   The operator and expression level statistics are ultimately one and the same 
concept, so it would probably make sense to build an example that shows both 
working together, but I can keep them separate if you prefer.
   
   When https://github.com/apache/datafusion/pull/21815 gets finalized I will 
rebase this PR, rework the points discussed in the inlined comments, and 
propose an end-to-end example as you suggested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to