[PR] improve approx_distinct for small value [datafusion]

via GitHub Thu, 04 Jun 2026 19:12:47 -0700


haohuaijin opened a new pull request, #22768:
URL: https://github.com/apache/datafusion/pull/22768


   ## Which issue does this PR close?
   
   - Closes #22767
   
   ## Rationale for this change
   
   `approx_distinct` is very slow with `GROUP BY` on high-cardinality keys.
   
   On a real dataset (~3.9M rows, ~512K groups):
   
   ```sql
   SELECT client_ip, approx_distinct(trace_id) AS cnt
   FROM '*.parquet'
   GROUP BY client_ip
   ORDER BY cnt DESC LIMIT 10;
   ```
   
   - DataFusion: **~32.6s**
   - DuckDB (`approx_count_distinct`): **~0.1s**
   
   The reason is that `approx_distinct` only implemented `Accumulator`, not 
`GroupsAccumulator`. So grouped queries fell back to 
`GroupsAccumulatorAdapter`, which allocates a full 16 KiB HyperLogLog per group 
(~8 GB for 512K groups) and re-slices the input per group on every batch — even 
though most groups only see a few distinct values.
   
   ## What changes are included in this PR?
   
   - Add a dedicated `GroupsAccumulator` for `approx_distinct` that processes 
each batch in a single pass (no per-group slicing or dynamic dispatch).
   - Use an adaptive per-group sketch: keep a small list of hashes (sparse) and 
only switch to a dense 16 KiB HyperLogLog after 256 distinct values. This cuts 
memory and keeps the partial state small. The dense format stays compatible 
with the existing scalar accumulator.
   - Add `count_from_hashes` so small groups are estimated directly from their 
stored hashes, avoiding a 16 KiB alloc + scan per group at output time.
   - Hashing matches the existing per-type scalar accumulators, so results are 
unchanged. Boolean / small-int / `Null` keep using the old path.
   
   Result on the query above: **~32.6s → ~0.12s** (~270x, on par with DuckDB), 
with identical output.
   
   ## Are these changes tested?
   
   Yes.
   
   - New unit tests for the per-group sketch (sparse/dense, promotion, 
serialize/merge round-trip, merging groups, empty groups), checked against a 
dense-fold reference.
   - New `aggregate.slt` cases: grouped `approx_distinct` over `Utf8`, 
`Utf8View`, and `Int32` (small groups are exact), null-only groups (= 0), and a 
sparse→dense case (2000 distinct/group, within HyperLogLog error).
   - Existing `aggregate.slt` and `aggregate_skip_partial.slt` still pass; 
clippy and fmt are clean.
   
   ## Are there any user-facing changes?
   
   No API or result changes — only a large speedup for `approx_distinct` with 
`GROUP BY` on high-cardinality keys.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] improve approx_distinct for small value [datafusion]

Reply via email to