asolimando commented on PR #21122: URL: https://github.com/apache/datafusion/pull/21122#issuecomment-4491522282
> Thank you so much for pushing the staistics forward @asolimando -- this is an area that has been long overdue for improvement. Akso thank you for helping @xudong963 > > I am sorry for the delay in reviewing this PR as well Thank you for finding the time and to the many people who helped so far with discussions and reviews, I really appreciate it and, as a maintainer myself, I know how much time and effort it takes. > I think the core abstraction for `ExpressionAnalyzerRegistry` makes sense here as a way to extend analysis, but I think the plumbing is a bit unwieldy (I left a more detailed comment inline0 I have replied inline, the status quo was meant to be temporary, after https://github.com/apache/datafusion/pull/21815 the integration will be easier and this will allow me to simplify the design along the lines you have suggested. > Other thoughts: > > ## Statistics V2 > 1. This PR seems like it is not using the new statistics framework introduced in #14699 (which is fine, but I think I may make a PR proposing to deprecate that code to make it clear) I saw the PR for deprecation, makes sense to me. > ## "Motivating Example" / Usecase > It would help me evaluate these APIs to see a "motivating example" in the examples crate. Examples make sure we are building an API that is useable for the desired usecase (and it has the bonus of providing documentation for others) > > Specifically in this case I think that example would be a program written from an user point of view where the system wants to plug in its own statistics framework. For example, it has some way to know what the cardinality of `join_key=12345` is based on its own statistics) and maybe wants to customize the JoinNode or something > > This approach really helped me with the API for custom indexes. Here is what I did for that: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/data_io/parquet_embedded_index.rs > > With a motivating example like this @alamb, that's a great suggestion. I think the right motivating example for `ExpressionAnalyzer` specifically could involve an expression like `a + 1`, injective, for which today we return `Absent`, or a custom UDF scenario, similarly unsupported today. The "external cardinality for join keys" use case you mentioned (knowing `join_key=12345` matches N rows) is more naturally a `StatisticsRegistry` concern (operator-level, PR #21483) than an `ExpressionAnalyzer` one (expression-level NDV/selectivity). The operator and expression level statistics are ultimately one and the same concept, so it would probably make sense to build an example that shows both working together, but I can keep them separate if you prefer. When https://github.com/apache/datafusion/pull/21815 gets finalized I will rebase this PR, rework the points discussed in the inlined comments, and propose an end-to-end example as you suggested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
