EnricoMi opened a new pull request #33422:
URL: https://github.com/apache/spark/pull/33422


   ### What changes were proposed in this pull request?
   This pull request introduces a helper class that simplifies usage of 
`Dataset.observe()` for batch datasets:
   
       val observation = Observation("name")
       val observed = ds.observe(observation, max($"id").as("max_id"))
       observed.count()
       val metrics = observation.get
   
   ### Why are the changes needed?
   Currently, users must implement the `QueryExecutionListener` 
interface to retrieve the metrics, and must handle threading and locking 
themselves to hand the metrics over to the main thread. With the helper class, 
metrics can be retrieved from batch dataset processing in three lines of code 
(not counting the action on the observed dataset).
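
   For comparison, the listener-based approach described above might look 
roughly like the following. This is a minimal sketch, not code from this PR: 
the object name `ListenerMetricsSketch`, the function `maxIdViaListener`, the 
`spark.range(100)` input, and the 10 second timeout are all hypothetical. It 
uses the existing `Dataset.observe(name, expr)` overload and assumes the 
listener may be invoked on a thread other than the caller's.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical helper illustrating the manual, listener-based approach.
object ListenerMetricsSketch {

  def maxIdViaListener(spark: SparkSession): Option[Long] = {
    import spark.implicits._

    // Shared state used to hand the metrics over to the calling thread.
    val metricsRef = new AtomicReference[Option[Row]](None)
    val done = new CountDownLatch(1)

    val listener = new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        // observedMetrics maps the observation name to a Row of metric values.
        qe.observedMetrics.get("name").foreach { row =>
          metricsRef.set(Some(row))
          done.countDown()
        }
      }

      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    }

    spark.listenerManager.register(listener)
    try {
      val ds = spark.range(100) // assumed example input, ids 0 to 99
      val observed = ds.observe("name", max($"id").as("max_id"))
      observed.count() // the action that triggers metric collection

      // Wait for the listener, since it may run on another thread.
      done.await(10, TimeUnit.SECONDS)
      metricsRef.get.map(_.getLong(0))
    } finally {
      spark.listenerManager.unregister(listener)
    }
  }
}
```

   The latch and the atomic reference are the threading and locking 
boilerplate that the proposed helper class is meant to hide from the user.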
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, one new class and one new `Dataset` method.
   
   ### How was this patch tested?
   Adds a unit test to `DataFrameSuite`, similar to `"get observable metrics by 
callback"` in `DataFrameCallbackSuite`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


