westonpace commented on a change in pull request #12702:
URL: https://github.com/apache/arrow/pull/12702#discussion_r834734593



##########
File path: cpp/src/arrow/compute/exec/aggregate_node.cc
##########
@@ -168,17 +168,26 @@ class ScalarAggregateNode : public ExecNode {
 
   Status DoConsume(const ExecBatch& batch, size_t thread_index) {
     util::tracing::Span span;
-    START_SPAN(span, "Consume",
-               {{"aggregate", ToStringExtra()},
-                {"node.label", label()},
-                {"batch.length", batch.length}});
+    START_SPAN(
+        span, "Consume",
+        {{"aggregate", ToStringExtra()},
+         {"node.label", label()},
+         {"batch.length", batch.length},
+         {"memory_pool_bytes",
+          plan_->exec_context()->memory_pool()->bytes_allocated()},
+         {"memory_used", arrow::internal::tracing::GetMemoryUsed()},
+         {"memory_used_process", 
arrow::internal::tracing::GetMemoryUsedByProcess()}});

Review comment:
       Background thread on a fixed schedule was the original ask but I think 
there was some concern around having two streams of telemetry information.  As 
a start I think only using `memory_pool_bytes` would be sufficient.  Longer 
term it could be handy to have some idea of the RSS of the process just to have 
some sense for how we are doing on fragmentation and non-pool allocations.  
Since a system call is expensive I wonder if there would be a way to debounce 
this information and only report it at most every 10ms or on some configurable 
threshold.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to