kosiew commented on code in PR #19526:
URL: https://github.com/apache/datafusion/pull/19526#discussion_r2700921710


##########
datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs:
##########
@@ -46,11 +46,45 @@ use datafusion_execution::memory_pool::proxy::{HashTableAllocExt, VecAllocExt};
 use datafusion_expr::EmitTo;
 use datafusion_physical_expr::binary_map::OutputType;
 
+use datafusion_execution::metrics::{
+    ExecutionPlanMetricsSet, Gauge, MetricBuilder, Time,
+};
 use hashbrown::hash_table::HashTable;
 
 const NON_INLINED_FLAG: u64 = 0x8000000000000000;
 const VALUE_MASK: u64 = 0x7FFFFFFFFFFFFFFF;
 
+pub(crate) struct MultiColumnGroupByMetrics {
+    /// Time spent hashing the grouping columns
+    pub(crate) time_hashing_grouping_columns: Time,
+
+    /// Time spent building and appending the hash table for grouping columns
+    pub(crate) build_hash_table_time: Time,
+
+    /// Tracks the maximum number of entries the map held
+    ///
+    /// A very large value likely indicates problems fetching from the hash table
+    pub(crate) maximum_number_of_entries_in_map: Gauge,
+
+    /// Maximum hash map capacity
+    pub(crate) maximum_hash_map_capacity: Gauge,

Review Comment:
   Is this tracking the HashTable's initial capacity, or how it grows during execution? When is this useful vs. `maximum_number_of_entries_in_map`?



##########
datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs:
##########
@@ -1280,8 +1356,12 @@ mod tests {
     #[test]
     fn test_emit_first_n_for_vectorized_group_values() {

Review Comment:
   Besides these tests, I think it would be good to add `.slt` tests demonstrating that the new metrics appear in `EXPLAIN ANALYZE` output.
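   As a rough, hedged sketch (the metric name is taken from this diff; the exact `EXPLAIN ANALYZE` rendering and any output normalization the test harness needs are assumptions), such a test might look like:

   ```
   # Hypothetical .slt test: check the new metric shows up in EXPLAIN ANALYZE
   statement ok
   CREATE TABLE t(a INT, b INT) AS VALUES (1, 2), (1, 3), (2, 4);

   # The AggregateExec line of the analyzed plan would be expected to
   # include the new timer, e.g. `time_hashing_grouping_columns=...`
   query T
   EXPLAIN ANALYZE SELECT a, b, COUNT(*) FROM t GROUP BY a, b;
   ```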



##########
datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs:
##########
@@ -46,11 +46,45 @@ use datafusion_execution::memory_pool::proxy::{HashTableAllocExt, VecAllocExt};
 use datafusion_expr::EmitTo;
 use datafusion_physical_expr::binary_map::OutputType;
 
+use datafusion_execution::metrics::{
+    ExecutionPlanMetricsSet, Gauge, MetricBuilder, Time,
+};
 use hashbrown::hash_table::HashTable;
 
 const NON_INLINED_FLAG: u64 = 0x8000000000000000;
 const VALUE_MASK: u64 = 0x7FFFFFFFFFFFFFFF;
 
+pub(crate) struct MultiColumnGroupByMetrics {
+    /// Time spent hashing the grouping columns
+    pub(crate) time_hashing_grouping_columns: Time,
+
+    /// Time spent building and appending the hash table for grouping columns
+    pub(crate) build_hash_table_time: Time,
+
+    /// Tracks the maximum number of entries the map held
+    ///
+    /// A very large value likely indicates problems fetching from the hash table
+    pub(crate) maximum_number_of_entries_in_map: Gauge,
+
+    /// Maximum hash map capacity
+    pub(crate) maximum_hash_map_capacity: Gauge,
+}
+
+impl MultiColumnGroupByMetrics {
+    pub(crate) fn new(metrics: &ExecutionPlanMetricsSet, partition: usize) -> Self {
+        Self {
+            time_hashing_grouping_columns: MetricBuilder::new(metrics)
+                .subset_time("time_hashing_grouping_columns", partition),

Review Comment:
   Is aggregation of per-partition metrics handled elsewhere, or will users need to manually sum across partitions to see the total time?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

