haohuaijin commented on code in PR #19285:
URL: https://github.com/apache/datafusion/pull/19285#discussion_r2658787645
##########
datafusion/physical-plan/src/aggregates/topk/heap.rs:
##########
@@ -161,6 +225,166 @@ where
}
}
+/// An implementation of `ArrowHeap` that deals with string values.
+///
+/// Supports all three UTF-8 string types: `Utf8`, `LargeUtf8`, and `Utf8View`.
+/// String values are compared lexicographically. Null values are not
explicitly handled
+/// and should not appear in the input; the aggregation layer ensures nulls
are managed
+/// appropriately before calling this heap.
+///
+/// Uses string interning to avoid repeated allocations for duplicate strings
within a batch.
+/// The `string_cache` maps string hashes to `Arc<str>` values, amortizing
allocation costs
+/// when the same string appears multiple times (common in trace IDs, user
IDs, etc.).
+pub struct StringHeap {
+ batch: ArrayRef,
+ heap: TopKHeap<Arc<str>>,
+ desc: bool,
+ data_type: DataType,
+ /// Cache of interned strings for the current batch, mapping hash to
`Arc<str>`.
+ /// Cleared on each `set_batch` call to prevent memory leaks from old
batches.
+ string_cache: HashMap<u64, Arc<str>>,
+}
+
+impl StringHeap {
+ pub fn new(limit: usize, desc: bool, data_type: DataType) -> Self {
+ let batch: ArrayRef = Arc::new(StringArray::from(Vec::<&str>::new()));
+ Self {
+ batch,
+ heap: TopKHeap::new(limit, desc),
+ desc,
+ data_type,
+ string_cache: HashMap::new(),
+ }
+ }
+
+ /// Extracts a string value from the current batch at the given row index.
+ ///
+ /// Panics if the row index is out of bounds or if the data type is not
one of
+ /// the supported UTF-8 string types.
+ ///
+ /// Note: Null values should not appear in the input; the aggregation layer
+ /// ensures nulls are filtered before reaching this code.
+ fn value(&self, row_idx: usize) -> &str {
+ extract_string_value(&self.batch, &self.data_type, row_idx)
+ }
+
+ /// Interns a string value, returning a cached `Arc<str>` if available.
+ ///
+ /// This method implements string interning to reduce allocations for
duplicate strings.
+ /// It computes a hash of the input string and checks the cache. If found,
returns the
+ /// cached `Arc<str>`. Otherwise, allocates a new `Arc<str>`, caches it,
and returns it.
+ ///
+ /// This is particularly effective for workloads with repeated string
values like trace IDs.
+ fn intern_string(&mut self, s: &str) -> Arc<str> {
+ // Compute hash of the string
+ let mut hasher = std::collections::hash_map::DefaultHasher::new();
+ s.hash(&mut hasher);
+ let hash = hasher.finish();
+
+ // Check cache and return if found
+ if let Some(cached) = self.string_cache.get(&hash) {
Review Comment:
maybe we can use `entry` to avoid two time lookup for `hash`
##########
datafusion/physical-plan/src/aggregates/topk/heap.rs:
##########
@@ -161,6 +225,166 @@ where
}
}
+/// An implementation of `ArrowHeap` that deals with string values.
+///
+/// Supports all three UTF-8 string types: `Utf8`, `LargeUtf8`, and `Utf8View`.
+/// String values are compared lexicographically. Null values are not
explicitly handled
+/// and should not appear in the input; the aggregation layer ensures nulls
are managed
+/// appropriately before calling this heap.
+///
+/// Uses string interning to avoid repeated allocations for duplicate strings
within a batch.
+/// The `string_cache` maps string hashes to `Arc<str>` values, amortizing
allocation costs
+/// when the same string appears multiple times (common in trace IDs, user
IDs, etc.).
+pub struct StringHeap {
+ batch: ArrayRef,
+ heap: TopKHeap<Arc<str>>,
+ desc: bool,
+ data_type: DataType,
+ /// Cache of interned strings for the current batch, mapping hash to
`Arc<str>`.
+ /// Cleared on each `set_batch` call to prevent memory leaks from old
batches.
+ string_cache: HashMap<u64, Arc<str>>,
Review Comment:
i'm not sure if we really need this, the heap mantains the values of
aggregate result, if the column in stringHeap is the column like trace ids or
user ids, for each batch almost impossible to have duplicate values.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]