viirya edited a comment on pull request #31398:
URL: https://github.com/apache/spark/pull/31398#issuecomment-772132389


   One public interface `CustomMetric` is added. The interface simply can 
return metric name, description and value.
   
   Currently there is only metric implementation `CustomSumMetric`. Normally we 
will have other metrics like `CustomSizeMetric` and `CustomTimingMetric`, 
corresponding to SQL size and timing metrics. But for the SS metrics I added, I 
only use sum metric, so I don't add them.
   
   There are two public methods added to existing public interfaces.
   
   * PartitionReader.getCustomMetrics(): returns an array of `CustomMetric`. 
Here is where the actual metrics values are collected. Empty array by default.
   * MicroBatchStream.supportedCustomMetrics(): returns an array of supported 
custom metrics with name and description. Basically `MicroBatchStream` is for 
micro-batch and `ContinuousStream` is for continuous streaming. I only test 
micro-batch for now. Empty array by default.
   
   The metric collection happens as following.
   
   1. `MicroBatchScanExec` calls its stream's `supportedCustomMetrics` method 
to know what custom SQL metrics should be added.
   2. `MicroBatchScanExec` passes a callback function for updating metrics to 
`DataSourceRDD`. When `DataSourceRDD` completes data consumption, it calls the 
callback function.
   
   Note that currently some changes are specific for SS micro-batch. If there 
is a consensus we should extend to general DS v2, I need to make corresponding 
change. For example, moving `supportedCustomMetrics` from `MicroBatchStream` to 
general `Scan`. So both batch, streaming (micro-bath, continuous streaming) can 
use it.
   
   So for general metrics report, I think it works like:
   
   1. `BatchScanExec`, `MicroBatchScanExec`, `ContinuousScanExec` call its 
scan's `supportedCustomMetrics` method to know what custom SQL metrics should 
be added.
   2. `BatchScanExec`, `MicroBatchScanExec` pass a callback function for 
updating metrics to `DataSourceRDD`. When `DataSourceRDD` completes data 
consumption, it calls the callback function. For `ContinuousScanExec`, it 
passes similar callback function to `ContinuousDataSourceRDD` for updating 
metrics.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to