lidavidm commented on a change in pull request #12702:
URL: https://github.com/apache/arrow/pull/12702#discussion_r834246278



##########
File path: cpp/src/arrow/compute/exec/aggregate_node.cc
##########
@@ -168,17 +168,26 @@ class ScalarAggregateNode : public ExecNode {
 
   Status DoConsume(const ExecBatch& batch, size_t thread_index) {
     util::tracing::Span span;
-    START_SPAN(span, "Consume",
-               {{"aggregate", ToStringExtra()},
-                {"node.label", label()},
-                {"batch.length", batch.length}});
+    START_SPAN(
+        span, "Consume",
+        {{"aggregate", ToStringExtra()},
+         {"node.label", label()},
+         {"batch.length", batch.length},
+         {"memory_pool_bytes",
+          plan_->exec_context()->memory_pool()->bytes_allocated()},
+         {"memory_used", arrow::internal::tracing::GetMemoryUsed()},
+         {"memory_used_process", 
arrow::internal::tracing::GetMemoryUsedByProcess()}});

Review comment:
       Yes, this seems it will be quite expensive. Do we need information 
that's this fine-grained?
   
   We could cache the values, and only update them if they're out of date. Or 
we could actually record the memory pool's statistics instead. Also, this 
information isn't really thread- or span- specific. Maybe it could be done by a 
background thread on a fixed schedule, especially if it's process memory and 
not memory pool allocations? 

##########
File path: cpp/src/arrow/compute/exec/aggregate_node.cc
##########
@@ -168,17 +168,26 @@ class ScalarAggregateNode : public ExecNode {
 
   Status DoConsume(const ExecBatch& batch, size_t thread_index) {
     util::tracing::Span span;
-    START_SPAN(span, "Consume",
-               {{"aggregate", ToStringExtra()},
-                {"node.label", label()},
-                {"batch.length", batch.length}});
+    START_SPAN(
+        span, "Consume",
+        {{"aggregate", ToStringExtra()},
+         {"node.label", label()},
+         {"batch.length", batch.length},
+         {"memory_pool_bytes",
+          plan_->exec_context()->memory_pool()->bytes_allocated()},
+         {"memory_used", arrow::internal::tracing::GetMemoryUsed()},
+         {"memory_used_process", 
arrow::internal::tracing::GetMemoryUsedByProcess()}});
     for (size_t i = 0; i < kernels_.size(); ++i) {
       util::tracing::Span span;
-      START_SPAN(span, aggs_[i].function,
-                 {{"function.name", aggs_[i].function},
-                  {"function.options",
-                   aggs_[i].options ? aggs_[i].options->ToString() : 
"<NULLPTR>"},
-                  {"function.kind", std::string(kind_name()) + "::Consume"}});
+      START_SPAN(
+          span, aggs_[i].function,
+          {{"function.name", aggs_[i].function},
+           {"function.options",
+            aggs_[i].options ? aggs_[i].options->ToString() : "<NULLPTR>"},
+           {"function.kind", std::string(kind_name()) + "::Consume"},
+           {"memory_pool_bytes", 
plan_->exec_context()->memory_pool()->bytes_allocated()},
+           {"memory_used", arrow::internal::tracing::GetMemoryUsed()},
+           {"memory_used_process", 
arrow::internal::tracing::GetMemoryUsedByProcess()}});

Review comment:
       This could be done by a span processor, though that has a couple caveats:
   - Processors are configured as part of the tracing setup, by the end user 
application (generally not by a library),
   - IIRC, processors may not run immediately (one of the processors batches 
spans and processes them on a background thread, I think)
   
   I think some span recorders can also be configured to attach this sort of 
information as well? (Though again, the timestamps won't line up anymore.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to