Re: [PR] Fix TopK aggregation for UTF-8/Utf8View group keys and add safe fallback for unsupported string aggregates [datafusion]

via GitHub Sat, 03 Jan 2026 19:58:46 -0800


kosiew commented on code in PR #19285:
URL: https://github.com/apache/datafusion/pull/19285#discussion_r2659306373



##########
datafusion/physical-plan/src/aggregates/topk/heap.rs:
##########
@@ -161,6 +225,166 @@ where
     }
 }
 
+/// An implementation of `ArrowHeap` that deals with string values.
+///
+/// Supports all three UTF-8 string types: `Utf8`, `LargeUtf8`, and `Utf8View`.
+/// String values are compared lexicographically. Null values are not explicitly handled
+/// and should not appear in the input; the aggregation layer ensures nulls are managed
+/// appropriately before calling this heap.
+///
+/// Uses string interning to avoid repeated allocations for duplicate strings within a batch.
+/// The `string_cache` maps string hashes to `Arc<str>` values, amortizing allocation costs
+/// when the same string appears multiple times (common in trace IDs, user IDs, etc.).
+pub struct StringHeap {
+    batch: ArrayRef,
+    heap: TopKHeap<Arc<str>>,
+    desc: bool,
+    data_type: DataType,
+    /// Cache of interned strings for the current batch, mapping hash to `Arc<str>`.
+    /// Cleared on each `set_batch` call to prevent memory leaks from old batches.
+    string_cache: HashMap<u64, Arc<str>>,

Review Comment:
   Fair point: for high-cardinality string columns like trace or user IDs, duplicate values within a single batch are rare.

   However, the cache provides a cheap safety net for lower-cardinality workloads (e.g., country codes, status strings) where duplicates are common. The overhead is minimal (one hash lookup per new value).
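
To make the trade-off concrete, here is a minimal sketch of hash-keyed, per-batch interning along the lines the doc comment describes. It is not the PR's code: `BatchInterner`, `intern`, and `clear` are hypothetical names, `DefaultHasher` is an assumed choice of hasher, and the equality check on a cache hit is an added assumption to guard against two different strings sharing a hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical per-batch interner (illustrative only, not the PR's type).
struct BatchInterner {
    /// Maps the hash of a string to one shared allocation of that string.
    cache: HashMap<u64, Arc<str>>,
}

impl BatchInterner {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    /// Drop all cached strings, as one might at the start of each batch.
    fn clear(&mut self) {
        self.cache.clear();
    }

    /// Return a shared `Arc<str>` for `value`, allocating only on a miss.
    fn intern(&mut self, value: &str) -> Arc<str> {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let key = hasher.finish();

        if let Some(existing) = self.cache.get(&key) {
            if &**existing == value {
                // Hit: reuse the existing allocation (refcount bump only).
                return Arc::clone(existing);
            }
            // Hash collision with a different string (assumed handling):
            // allocate a fresh value and leave the cached entry untouched.
            return Arc::from(value);
        }

        // Miss: allocate once and remember it for the rest of the batch.
        let interned: Arc<str> = Arc::from(value);
        self.cache.insert(key, Arc::clone(&interned));
        interned
    }
}
```

With low-cardinality input such as country codes, repeated calls like `intern("US")` return clones of the same `Arc<str>`, so only the first occurrence allocates. With mostly unique values, every call pays for the hash, the lookup, and an insert, and the map grows until `clear` is called.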
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

