[PR] fix: approx_distinct over-counts for utf8view [datafusion]

via GitHub Sun, 07 Jun 2026 20:38:52 -0700


haohuaijin opened a new pull request, #22815:
URL: https://github.com/apache/datafusion/pull/22815


   ## Which issue does this PR close?
   
   - Closes https://github.com/apache/datafusion/issues/22796
   
   ## Rationale for this change
   
   `approx_distinct` over-counted distinct values for `Utf8View` columns when 
the same short string appeared across batches with different layouts.
   
   Arrow stores strings of at most 12 bytes inline in the 128-bit view. For 
batches with no data buffers, the fast path hashed these views as raw `u128` 
values. But when a batch also contained a long string, the accumulator took a 
different branch that hashed **all** strings as `&str`, including the short 
inline ones. The same short string therefore hashed differently depending on 
the batch it arrived in, so HyperLogLog counted it twice.
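
The mismatch can be reproduced outside DataFusion with a minimal sketch. It assumes the inline view layout from the Arrow format (4-byte little-endian length followed by up to 12 zero-padded data bytes) and uses the standard library's `DefaultHasher` as a stand-in for whatever hasher the accumulator really uses. The function names are hypothetical and are not taken from the DataFusion code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pack a short string into a 128-bit view: 4-byte little-endian length,
/// then the string bytes, zero-padded. Returns None for strings > 12 bytes,
/// which Arrow stores in a separate data buffer instead.
fn inline_view(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > 12 {
        return None;
    }
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    raw[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(raw))
}

/// Hash any hashable value with a fresh DefaultHasher (stand-in hasher).
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Pre-fix behaviour for a batch without data buffers: hash the raw view.
fn hash_fast_path(s: &str) -> u64 {
    hash_one(&inline_view(s).expect("fast path only sees short strings"))
}

/// Pre-fix behaviour for a batch with data buffers: hash every value as &str.
fn hash_mixed_path_before_fix(s: &str) -> u64 {
    hash_one(s)
}
```

Calling `hash_fast_path("ab")` and `hash_mixed_path_before_fix("ab")` feeds different byte sequences to the hasher (16 view bytes versus the 2 string bytes), so the two results differ and a sketch keyed on them sees two distinct values.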
   
   ## What changes are included in this PR?
   
   - **`StringViewHLLAccumulator::update_batch`** and **`Utf8ViewHasher`**: in 
mixed batches (data buffers present), short strings (≤ 12 bytes) are still 
hashed as the raw `u128` view; only long strings hash as `&str`. This keeps 
hashing consistent regardless of batch layout.
   
   - **Two regression tests**:
     - `utf8view_acc_split_batches_match_single_mixed_batch`: scalar accumulator
     - `utf8view_groups_short_string_hashed_consistently_across_batches`: group accumulator
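
The effect of the fix on split batches can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: an exact `HashSet` of hashes stands in for the HyperLogLog sketch, `DefaultHasher` stands in for the real hasher, and `hash_value`, `distinct_hashes`, and the `fixed` flag are hypothetical names introduced only to compare the old and new behaviour.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

const MAX_INLINE_LEN: usize = 12;

/// Pack a short string into a 128-bit view (length, then zero-padded bytes).
fn inline_view(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > MAX_INLINE_LEN {
        return None;
    }
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    raw[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(raw))
}

/// Hash any hashable value with a fresh DefaultHasher (stand-in hasher).
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// With `fixed == true`, short strings always hash as their raw view.
/// With `fixed == false`, they hash as the view only when the batch has
/// no data buffers, which is the inconsistent behaviour described above.
fn hash_value(s: &str, batch_has_buffers: bool, fixed: bool) -> u64 {
    match inline_view(s) {
        Some(view) if fixed || !batch_has_buffers => hash_one(&view),
        _ => hash_one(s),
    }
}

/// Count distinct hashes across batches (exact set as a stand-in for HLL).
fn distinct_hashes(batches: &[&[&str]], fixed: bool) -> usize {
    let mut seen = HashSet::new();
    for batch in batches {
        // A batch needs data buffers as soon as it holds one long string.
        let has_buffers = batch.iter().any(|s| s.len() > MAX_INLINE_LEN);
        for s in batch.iter() {
            seen.insert(hash_value(s, has_buffers, fixed));
        }
    }
    seen.len()
}
```

Given one batch holding only `"ab"` and a second batch holding `"ab"` plus a string longer than 12 bytes, `distinct_hashes(..., false)` reports three values while `distinct_hashes(..., true)` reports two, which is the property the scalar regression test name describes.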
   
   ## Are these changes tested?
   
   Yes, two new regression tests cover the exact failure mode.
   
   ## Are there any user-facing changes?
   
   Yes. `approx_distinct` on `Utf8View` / `VARCHAR VIEW` columns no longer 
over-counts short strings that appear in batches with different layouts, so 
its estimates may be lower than the previously inflated values.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

