kosiew opened a new pull request, #21313:
URL: https://github.com/apache/datafusion/pull/21313

   ## Which issue does this PR close?
   
   * Closes #20966.
   
   ---
   
   ## Rationale for this change
   
   The existing NDV merge implementation (`estimate_ndv_with_overlap`) is not 
associative, meaning that merging statistics in different orders can produce 
different results. This happens because intermediate merges "smear" min/max/NDV 
values, which then influence subsequent computations.
   
   This lack of order invariance leads to inconsistent NDV estimates depending 
on merge order, especially when combining more than two inputs.
   
   To address this, this PR introduces a multi-input NDV estimation approach 
that computes overlap using the original (unsmeared) statistics across all 
inputs simultaneously. This ensures stable and deterministic results regardless 
of merge order.
   
   ---
   
   ## What changes are included in this PR?
   
   * Refactored `Statistics::try_merge_iter` to:
   
     * Perform a single pass for mergeable statistics (null counts, min/max, 
sum).
     * Defer NDV computation until after all inputs are collected.
   
   * Replaced pairwise NDV merging with a new multi-input algorithm:
   
     * Introduced `estimate_ndv_with_overlap_many` to compute NDV across all 
inputs simultaneously.
     * Uses segment-based partitioning of value ranges and selects the maximum 
density contribution per segment.
     * Handles constant (point) values separately to avoid over-counting.
   
   * Ensured order invariance:
   
     * NDV is now computed from original inputs instead of intermediate merged 
states.
   
   * Added helper:
   
     * `max_input_ndv` as a fallback when estimation is not possible.
   
   * Code cleanup:
   
     * Simplified iteration using `zip` instead of index-based access.
     * Improved clarity and separation of concerns in merge logic.
   
   ---
   
   ## Are these changes tested?
   
   Yes.
   
   New and updated tests include:
   
   * Reusable helpers for test setup (`int32_test_schema`, `int32_ndv_stats`).
   * Validation of NDV behavior for:
   
     * Identical ranges (full overlap).
     * Partial overlaps.
     * Constant columns (same and different values).
   * **New test ensuring order invariance across permutations**, verifying that 
all permutations of inputs produce identical NDV results.
   
   These tests ensure correctness, stability, and regression protection for the 
new algorithm.
   
   ---
   
   ## Are there any user-facing changes?
   
   No direct user-facing API changes.
   
   However, NDV estimates are now:
   
   * More stable
   * Deterministic across merge orders
   * Potentially slightly different (more accurate) compared to previous 
behavior
   
   This improves query planning consistency without requiring user intervention.
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to