viirya edited a comment on pull request #31398: URL: https://github.com/apache/spark/pull/31398#issuecomment-772132389
One public interface `CustomMetric` is added. The interface simply can return metric name, description and value. Currently there is only metric implementation `CustomSumMetric`. Normally we will have other metrics like `CustomSizeMetric` and `CustomTimingMetric`, corresponding to SQL size and timing metrics. But for the SS metrics I added, I only use sum metric, so I don't add them. There are two public methods added to existing public interfaces. * PartitionReader.getCustomMetrics(): returns an array of `CustomMetric`. Here is where the actual metrics values are collected. Empty array by default. * MicroBatchStream.supportedCustomMetrics(): returns an array of supported custom metrics with name and description. Basically `MicroBatchStream` is for micro-batch and `ContinuousStream` is for continuous streaming. I only test micro-batch for now. Empty array by default. The metric collection happens as following. 1. `MicroBatchScanExec` calls its stream's `supportedCustomMetrics` method to know what custom SQL metrics should be added. 2. `MicroBatchScanExec` passes a callback function for updating metrics to `DataSourceRDD`. When `DataSourceRDD` completes data consumption, it calls the callback function. Note that currently some changes are specific for SS micro-batch. If there is a consensus we should extend to general DS v2, I need to make corresponding change. For example, moving `supportedCustomMetrics` from `MicroBatchStream` to general `Scan`. So both batch, streaming (micro-bath, continuous streaming) can use it. So for general metrics report, I think it works like: 1. `BatchScanExec`, `MicroBatchScanExec`, `ContinuousScanExec` call its scan's `supportedCustomMetrics` method to know what custom SQL metrics should be added. 2. `BatchScanExec`, `MicroBatchScanExec` pass a callback function for updating metrics to `DataSourceRDD`. When `DataSourceRDD` completes data consumption, it calls the callback function. For `ContinuousScanExec`, it passes similar callback function to `ContinuousDataSourceRDD` for updating metrics. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
