2010YOUY01 commented on PR #20926: URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4173976995
> @2010YOUY01, apologies for the direct ping, would you be interested in taking a look?
>
> Re. the discussion in [#21120 (comment)](https://github.com/apache/datafusion/issues/21120#issuecomment-4131266771), we have taken Trino and Spark as reference; for grouping sets we went a bit further than those systems, but I believe enough details are provided for the proposed formula (see [#20926 (comment)](https://github.com/apache/datafusion/pull/20926#discussion_r2942442151) for the discussion, which @buraksenn captured in code comments).
>
> Is this matching your suggestion for the CBO improvements you shared in [#21120 (comment)](https://github.com/apache/datafusion/issues/21120#issuecomment-4131266771)?
>
> This is an interesting example as it covers the "porting" of existing statistics-propagation code from battle-tested systems, plus a reasonable extension, so I am particularly interested in your feedback to calibrate future PRs and reviews.

Thanks for sharing the context! Happy to take a look. I don't have prior experience with CBO, and I'm still working through the relevant material. A few questions:

- Why do we assume independence between grouping keys? Is this just a simplifying heuristic, or do the reference systems do it intentionally because overestimating NDV leads to better end-to-end performance (e.g., for join reordering)?
- Why do we explicitly account for nulls? This adds implementation complexity but seems to have only a minor impact.

### Regarding the reference-system approach

If we follow a reference-system approach, I suggest going a bit further: identify the best-performing system and port its related components more holistically. It's possible that one system works significantly better than another, and that its cardinality estimation is co-designed with other components like the cost model; taking ideas from multiple reference systems may not yield good results.

IMO the criteria for choosing a reference system are:

1. Simple to implement
2. Comprehensively documented and explainable
3. Overall improvement on benchmarks like TPC-DS and the Join Order Benchmark

DuckDB seems to perform well: they have a thesis (https://blobs.duckdb.org/papers/tom-ebergen-msc-thesis-join-order-optimization-with-almost-no-statistics.pdf) describing their approach, and they've reported benchmark results. Perhaps we can compare it with other reference systems like Spark/Trino similarly, and then pick one reference system and stick to it.
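For concreteness, the kind of estimate being questioned above (independence between grouping keys, NULL counted as an extra distinct value, capped by input row count) can be sketched roughly as follows. This is a hypothetical illustration in the style of the Spark/Trino formulas discussed in the PR, not DataFusion's actual code; the function name and parameters are made up for this example.

```python
# Hypothetical sketch of an independence-based GROUP BY cardinality
# estimate -- illustrative only, not DataFusion's implementation.

def estimate_group_by_ndv(column_ndvs, column_has_nulls, input_rows):
    """Estimate the number of distinct groups, assuming the grouping
    keys are statistically independent.

    column_ndvs:      per-key distinct-value counts
    column_has_nulls: whether each key column contains NULLs
    input_rows:       input row count (an upper bound on the output)
    """
    estimate = 1
    for ndv, has_nulls in zip(column_ndvs, column_has_nulls):
        # NULL forms its own group in GROUP BY, so treat it as
        # one extra distinct value for that key.
        estimate *= ndv + (1 if has_nulls else 0)
    # Correlated keys make the product overshoot; the number of
    # output groups can never exceed the number of input rows.
    return min(estimate, input_rows)

print(estimate_group_by_ndv([3, 4], [False, False], 1000))  # 3 * 4 = 12
print(estimate_group_by_ndv([10, 20], [False, True], 150))  # min(10 * 21, 150) = 150
```

The second call shows both questions at once: the `+1` for nulls changes the product only marginally, while the row-count cap is what keeps the independence assumption's overestimate bounded.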
