zhengruifeng commented on issue #24648: [SPARK-27777][ML] Eliminate uncessary sliding job in AreaUnderCurve
URL: https://github.com/apache/spark/pull/24648#issuecomment-495463566

@srowen I recently did a detailed review of `ML.XXXEvaluator` and `MLLIB.XXXMetrics` and found several other places that seem to need improvement. For example:

1, All metrics in `MultilabelMetrics` and `MulticlassMetrics` can be computed in a single pass; however, in the current impl each metric needs its own pass.

2, `ML.XXXEvaluator` supports only one metric at a time, which means at least one pass is needed per metric. I think we can cache the `MLLIB.XXXMetrics` in the impl, and in subsequent calls, if the input dataset does not change, we can get the metric directly from the cached `MLLIB.XXXMetrics` without another accumulation over the input dataset.

3, `MultiLabelClassificationEvaluator` is currently missing.

4, In `BinaryClassificationMetrics`, to control the number of bins, directly setting the number of partitions in the sort stage seems more reasonable than the current impl.

Would you mind if I open an umbrella ticket "Evaluator & Metrics improvements" to track the above points and the already opened tickets on `sliding job` and `SSreg`?
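
To make points 1 and 2 concrete, here is a minimal sketch in plain Python rather than Spark code. It assumes the input is an in-memory list of (prediction, label) pairs, and every name in it (`confusion_counts`, `metrics_from_counts`, `CachedEvaluator`) is hypothetical and not part of any Spark API. It only illustrates the idea: accumulate the confusion counts in one pass, derive every metric from those counts, and reuse the cached result while the same dataset object is passed in.

```python
from collections import Counter


def confusion_counts(pairs):
    """Single pass over (prediction, label) pairs.

    Returns a Counter keyed by (label, prediction).
    """
    counts = Counter()
    for prediction, label in pairs:
        counts[(label, prediction)] += 1
    return counts


def metrics_from_counts(counts):
    """Derive accuracy and per-class precision/recall from the counts alone.

    No further pass over the input data is needed here.
    """
    total = sum(counts.values())
    correct = sum(n for (label, pred), n in counts.items() if label == pred)
    classes = sorted({l for l, _ in counts} | {p for _, p in counts})
    precision, recall = {}, {}
    for c in classes:
        tp = counts.get((c, c), 0)
        predicted = sum(n for (_, p), n in counts.items() if p == c)
        actual = sum(n for (l, _), n in counts.items() if l == c)
        precision[c] = tp / predicted if predicted else 0.0
        recall[c] = tp / actual if actual else 0.0
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": precision,
        "recall": recall,
    }


class CachedEvaluator:
    """Hypothetical evaluator that recomputes only when the dataset changes.

    "Changes" is approximated here by object identity, an assumption made
    purely for this sketch.
    """

    def __init__(self):
        self._dataset = None
        self._metrics = None
        self.passes = 0  # number of passes made over input data

    def evaluate(self, dataset, metric_name):
        if self._metrics is None or dataset is not self._dataset:
            self._metrics = metrics_from_counts(confusion_counts(dataset))
            self._dataset = dataset
            self.passes += 1
        return self._metrics[metric_name]
```

With this shape, asking a `CachedEvaluator` for "accuracy" and then "recall" on the same list costs one pass over the data instead of two.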
