cloud-fan commented on pull request #31476:
URL: https://github.com/apache/spark/pull/31476#issuecomment-785170297
I think it's too much work to support arbitrary accumulators in
`SQLMetrics`. During these years, it seems sufficient to support various
metrics (size, time, average, etc.) with simple long values. So I agree with
the design here that the v2 metrics API should leverage `SQLMetrics` under the
hood.
However, I share the same concern from @zsxwing that using enum to indicate
the metrics merging logic is too limited. I don't think it's hard to design an
API to merge a bunch of long values.
First, let's look at the task side. The read/write task needs to update the
value of executor-side `SQLMetrics`, and Spark will send back the metrics
update to the driver via heartbeat event or task complete event. The current PR
adds a new method to `PartitionReader` to report the current metrics values,
which looks good. Spark needs to call the new method at the end of task (at the
end of each epoch for continuous mode), get the metrics value, find the
corresponding `SQLMetrics` and update its value. We can polish the API from the
current PR a little bit:
```
interface CustomTaskMetrics {
def name: String
def value: Long
}
interface PartitionReader {
...
def currentMetrics: Array[CustomTaskMetrics] = Nil
}
```
Then, let's look at the driver side. The driver side needs to aggregate the
task metrics and produce a string that will be displayed in the UI. The current
PR adds a new API `CustomMetrics` which is returned by `Scan`. This looks good
and we just need to replace the metrics type with a real merge method
```
interface CustomMetrics {
def name: String
...
def aggregateTaskMetrics(taskMetrics: Array[Long]): String
}
```
Internally, we can add a new `V2_CUSTOM` metrics type in `SQLMetrics`. For
this metrics type, we delegate the aggregating logic to the corresponding
`CustomMetrics`.
What do you think?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]