haohuaijin opened a new pull request, #22815: URL: https://github.com/apache/datafusion/pull/22815
## Which issue does this PR close?

- Closes https://github.com/apache/datafusion/issues/22796

## Rationale for this change

`approx_distinct` over-counted distinct values for `Utf8View` columns when the same short string appeared across batches with different layouts.

Arrow stores strings ≤ 12 bytes inline in the 128-bit view integer. The fast path (no data buffers) hashed these as raw `u128`. But when a batch also had a long string, it fell into a different branch that hashed **all** strings as `&str`, including the short inline ones. The same string hashed differently in different batches, so HyperLogLog counted it twice.

## What changes are included in this PR?

- **`StringViewHLLAccumulator::update_batch`** and **`Utf8ViewHasher`**: in mixed batches (data buffers present), short strings (≤ 12 bytes) are still hashed as the raw `u128` view; only long strings hash as `&str`. This keeps hashing consistent regardless of batch layout.
- **Two regression tests**:
  - `utf8view_acc_split_batches_match_single_mixed_batch`: scalar accumulator
  - `utf8view_groups_short_string_hashed_consistently_across_batches`: group accumulator

## Are these changes tested?

Yes, two new regression tests cover the exact failure mode.

## Are there any user-facing changes?

Yes. `approx_distinct` on `Utf8View` / `VARCHAR VIEW` columns now returns correct (lower) counts. Results may differ from the previously incorrect values.
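The hashing rule described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: `inline_view`, `hash_old`, `hash_fixed`, and `count_distinct` are hypothetical names, `DefaultHasher` stands in for whatever hasher DataFusion actually uses, and an exact `HashSet` stands in for HyperLogLog. The only detail taken from Arrow is the inline view layout (length in the low 4 bytes, followed by up to 12 zero-padded data bytes).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Longest string that Arrow stores inline in a 128-bit view.
const MAX_INLINE_LEN: usize = 12;

/// Build the 128-bit view of a short string: little-endian length in the
/// first 4 bytes, then the string bytes, zero-padded to 16 bytes.
fn inline_view(s: &str) -> u128 {
    assert!(s.len() <= MAX_INLINE_LEN, "only short strings are inlined");
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s.as_bytes());
    u128::from_le_bytes(bytes)
}

/// Hash any value with a freshly created (fixed-key) DefaultHasher.
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Sketch of the buggy rule: the hash input depends on the batch layout.
/// Without data buffers every string is short, so the view is hashed;
/// with data buffers every string, short or long, is hashed as `&str`.
fn hash_old(s: &str, batch_has_buffers: bool) -> u64 {
    if batch_has_buffers {
        hash_one(s)
    } else {
        hash_one(&inline_view(s))
    }
}

/// Sketch of the fixed rule: the hash input depends only on the string's
/// own length, so a short string hashes the same way in every batch.
fn hash_fixed(s: &str) -> u64 {
    if s.len() <= MAX_INLINE_LEN {
        hash_one(&inline_view(s))
    } else {
        hash_one(s)
    }
}

/// Count distinct hashes across batches (exact stand-in for HyperLogLog).
fn count_distinct(batches: &[Vec<&str>], fixed: bool) -> usize {
    let mut seen = HashSet::new();
    for batch in batches {
        // A batch needs data buffers as soon as it holds one long string.
        let has_buffers = batch.iter().any(|s| s.len() > MAX_INLINE_LEN);
        for s in batch {
            let h = if fixed { hash_fixed(s) } else { hash_old(s, has_buffers) };
            seen.insert(h);
        }
    }
    seen.len()
}
```

Feeding this sketch one batch containing only `"abc"` and a second batch containing `"abc"` plus a string longer than 12 bytes reproduces the failure mode in miniature: under `hash_old` the short string contributes two different hashes, while under `hash_fixed` it contributes one.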
