[GitHub] [spark] cloud-fan commented on a change in pull request #32786: [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions.

GitBox Mon, 07 Jun 2021 08:36:55 -0700


cloud-fan commented on a change in pull request #32786:
URL: https://github.com/apache/spark/pull/32786#discussion_r646710820




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CollectMetricsExec.scala
##########
@@ -69,7 +69,7 @@ case class CollectMetricsExec(
       // - Performance issues due to excessive serialization.

Review comment:
   Reading the comments above, our goal is to send the accumulator updates 
back to the driver only when the task completes. The approach we took is to 
create a temporary task-local accumulator and set its value on the real 
accumulator when the task ends.
   
   I think there are three options:
   1. Allow merging accumulators on the executor side. The consequences of 
doing so are not very clear, though.
   2. Find a way to share this temporary task-local accumulator between the 
different partitions that run in the same task.
   3. Improve the accumulator framework to natively support a "no update via 
heartbeat" feature, so that we don't need the temporary task-local accumulator 
at all.
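
   To make option 2 more concrete, here is a rough, self-contained sketch. It 
deliberately does not use Spark's classes: `SumAccumulator`, `ToyTaskContext`, 
`locals`, and `processPartition` are hypothetical stand-ins invented for 
illustration, and merging into the real accumulator at task end is an 
assumption about how the hand-off could work, not what `CollectMetricsExec` 
does today.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an accumulator; NOT Spark's AccumulatorV2.
final class SumAccumulator {
  private var total: Long = 0L
  def add(v: Long): Unit = { total += v }
  def merge(other: SumAccumulator): Unit = { total += other.value }
  def value: Long = total
}

// Hypothetical stand-in for a task context with completion callbacks
// and a map of task-scoped state.
final class ToyTaskContext {
  private val listeners = mutable.ArrayBuffer.empty[() => Unit]
  val locals: mutable.Map[String, SumAccumulator] = mutable.Map.empty
  def onCompletion(f: () => Unit): Unit = { listeners += f }
  def complete(): Unit = listeners.foreach(f => f())
}

object SharedTaskLocalSketch {
  // Processes one partition. The first partition handled by a task creates
  // the task-local accumulator and registers a single completion callback;
  // later partitions in the same task reuse that accumulator.
  def processPartition(
      ctx: ToyTaskContext,
      key: String,
      real: SumAccumulator,
      rows: Iterator[Long]): Unit = {
    val local = ctx.locals.getOrElseUpdate(key, {
      val acc = new SumAccumulator
      // The real accumulator is only touched when the task completes.
      ctx.onCompletion(() => real.merge(acc))
      acc
    })
    rows.foreach(local.add)
  }

  def main(args: Array[String]): Unit = {
    val real = new SumAccumulator
    val ctx = new ToyTaskContext
    // Two partitions handled by the same task share one local accumulator.
    processPartition(ctx, "metrics", real, Iterator(1L, 2L, 3L))
    processPartition(ctx, "metrics", real, Iterator(10L, 20L))
    assert(real.value == 0L) // nothing is visible before the task ends
    ctx.complete()
    assert(real.value == 36L) // 1 + 2 + 3 + 10 + 20
  }
}
```

   The point of the sketch is that the completion callback is registered once 
per task rather than once per partition, so rows from every partition in the 
task reach the real accumulator in a single update at task end.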




