cloud-fan commented on pull request #31398: URL: https://github.com/apache/spark/pull/31398#issuecomment-772246886
Yea, creating a separate PR SGTM. Let's have a high-level discussion first (I haven't read this PR yet).

From my understanding, metrics in batch execution can be done as:
1. The data source returns custom metrics in each read/write task (via a task completion event or heartbeat event).
2. The data source aggregates the custom metrics from all tasks.

For microbatch execution, we just repeat the batch execution steps for each microbatch, and we update the metrics in the UI after every microbatch.

For continuous execution, it's like an endless batch execution, so we can only use heartbeat events to collect metrics, and we update the metrics in the UI after every epoch.

The problem here is how to integrate this with Spark SQL. One idea is to use `AccumulatorV2`, which is already a public API and is very flexible, but we need to figure out how to make it work with the SQL UI. The other idea is to use `SQLMetrics`, which is private, so we'd need some API design to map a public API to `SQLMetrics`; it also limits the ways metrics can be aggregated.
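To make the two-step flow concrete, here is a minimal sketch of the aggregation pattern being discussed. This is hypothetical illustration code, not Spark's actual API: the `MetricsAccumulator` class and the metric names are invented, but its `add`/`merge`/`value` shape mirrors the `AccumulatorV2` contract mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

public class MetricsDemo {
    // Hypothetical accumulator-style holder for named counters,
    // shaped like AccumulatorV2's add/merge/value contract.
    static final class MetricsAccumulator {
        private final Map<String, Long> counters = new HashMap<>();

        // Executor side: a task records a metric increment.
        void add(String name, long delta) {
            counters.merge(name, delta, Long::sum);
        }

        // Driver side: fold in the metrics of a finished task
        // (delivered via a task completion or heartbeat event).
        void merge(MetricsAccumulator other) {
            other.counters.forEach(this::add);
        }

        Map<String, Long> value() {
            return counters;
        }
    }

    public static void main(String[] args) {
        // Step 1: each read/write task accumulates its own custom metrics.
        MetricsAccumulator task1 = new MetricsAccumulator();
        task1.add("bytesRead", 100L);
        task1.add("rowsRead", 10L);

        MetricsAccumulator task2 = new MetricsAccumulator();
        task2.add("bytesRead", 250L);
        task2.add("rowsRead", 25L);

        // Step 2: the driver aggregates metrics from all tasks.
        MetricsAccumulator driver = new MetricsAccumulator();
        driver.merge(task1);
        driver.merge(task2);

        System.out.println(driver.value().get("bytesRead")); // 350
        System.out.println(driver.value().get("rowsRead"));  // 35
    }
}
```

For microbatch execution, this aggregation would be rerun per microbatch; for continuous execution, the driver-side merge would be driven by heartbeat events and surfaced per epoch.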
