asolimando opened a new issue, #20966: URL: https://github.com/apache/datafusion/issues/20966
### Is your feature request related to a problem or challenge? The overlap-based NDV merge formula (`estimate_ndv_with_overlap`) introduced in #19957 (and shared with Union from #20846) is not associative: merging `(A+B)+C` produces a different result than `A+(B+C)` because the intermediate merge updates min/max/NDV, which "smears" the column statistics before the next pairwise merge. Example: ``` A = [0,100], NDV=80 B = [50,150], NDV=60 C = [100,200], NDV=50 (A+B)+C = 135 A+(B+C) = 137 ``` Current usage in `try_merge`, as addressed in #19957, folds left-to-right over row groups, so the result depends on row group ordering. Practically, row group order within a Parquet file is stable, so results are deterministic for a given file. The difference is typically small (bounded by the uniform-distribution assumption). ### Describe the solution you'd like A multi-way merge that computes the overlap formula over all inputs at once rather than pairwise, using the original (unsmeared) min/max/NDV from each input. ### Describe alternatives you've considered Store HLL sketches per row group/column in Parquet and merge them for exact NDV computation, bypassing the scalar heuristic entirely. Parquet does not natively support this, but [apache/arrow-rs#8608 (comment)](https://github.com/apache/arrow-rs/issues/8608#issuecomment-4041324952) proposes encoding distinct-count estimates in user-defined key=value metadata, which could serve as a vehicle for HLL sketches. Citing as an alternative for completeness, not currently viable as not implemented yet. ### Additional context - Review comment: https://github.com/apache/datafusion/pull/19957#discussion_r2936704265 (@gene-bordegaray) - Related PR: #20846 (Union NDV overlap formula) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
