heyihong commented on code in PR #53801:
URL: https://github.com/apache/spark/pull/53801#discussion_r2694402835


##########
sql/api/src/main/scala/org/apache/spark/sql/Observation.scala:
##########
@@ -58,12 +57,15 @@ class Observation(val name: String) {
 
   private val isRegistered = new AtomicBoolean()
 
-  private val promise = Promise[Row]()
+  private val promise = Promise[() => Row]()
+
+  private val lazyMetricsFuture: Future[() => Row] = promise.future
 
   /**
-   * Future holding the (yet to be completed) observation.
+   * Future holding the (yet to be completed) observation. Lazy to avoid collecting the metrics
+   * until it is needed.
    */
-  val future: Future[Row] = promise.future
+  lazy val future: Future[Row] = lazyMetricsFuture.map(_())(ExecutionContext.global)

Review Comment:
   IMHO, metrics collection should be CPU-bound and should not involve I/O, so
`ExecutionContext.global` should be sufficient: by default it sets its
parallelism to the number of available processors
(`Runtime.availableProcessors`), according to
https://docs.scala-lang.org/overviews/core/futures.html#the-global-execution-context.
   
   A dedicated thread pool for `Observation` would not provide any additional
parallelism. That said, we may need to be careful about which tasks are
submitted to `ExecutionContext.global`, since the pool is shared.
   
   Also, `Observation.future` is used only in very rare cases, if I understand
correctly. I am not sure there is a need to expose it, and we could consider
deprecating it in the future.
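   
   For reference, a minimal sketch of the lazy pattern discussed above. It is
not the PR's code: `LazyObservation`, `complete`, and the `collections` counter
are hypothetical names, and `String` stands in for `Row` so the snippet stays
self-contained. It only illustrates that the metrics thunk is not run until
`future` is first accessed, and that it then runs on `ExecutionContext.global`.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for Observation; String replaces Row for illustration.
class LazyObservation {
  // Counts how many times the metrics thunk has actually been evaluated.
  val collections = new AtomicInteger(0)

  // The promise holds a thunk, so completing it does not collect anything yet.
  private val promise = Promise[() => String]()
  private val lazyMetricsFuture: Future[() => String] = promise.future

  // The thunk is scheduled on the global pool only once `future` is first accessed.
  lazy val future: Future[String] =
    lazyMetricsFuture.map(thunk => thunk())(ExecutionContext.global)

  // Hypothetical completion hook; returns false if already completed.
  def complete(metrics: () => String): Boolean = promise.trySuccess(metrics)
}

object LazyObservationDemo {
  def main(args: Array[String]): Unit = {
    val obs = new LazyObservation
    obs.complete(() => {
      obs.collections.incrementAndGet()
      "rows=42"
    })

    println(obs.collections.get())                // 0: nothing collected yet
    println(Await.result(obs.future, 5.seconds))  // rows=42
    println(obs.collections.get())                // 1: collected exactly once
  }
}
```

   Because `future` is a `lazy val`, repeated accesses reuse the same mapped
future, so the thunk is evaluated at most once in this sketch.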



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

