Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/20028
  
    Thanks for the comments @zhengruifeng @felixcheung 
    
    It's been nearly 8 months and it took me a while to recall what this PR 
does. While the PR does provide some improvement to the current API, I'm not 
sure it lays a good foundation for an extensible and flexible `Evaluator` 
framework for Spark ML.
    
    The current design is not very user-friendly: it asks users to understand 
the concept of `Metrics` (`BinaryClassificationMetrics`, `MulticlassMetrics`, 
`RegressionMetrics`), which are primarily internal calculation classes. It also 
implies that all the indicators in a `Metrics` object can be calculated in one 
pass over the DataFrame, which makes it hard to add an extra indicator to a 
`Metrics` that cannot be calculated together with the others.
    
    IMO, API-wise, we should ideally allow users to specify any combination of 
metrics they want from the `Evaluator`, and the `Evaluator` should then figure 
out how to calculate those metrics efficiently. Here are the concrete 
suggestions:
    
    1. Evaluator API:
    ```
    ClassificationEvaluator {
    
      def setPredictionCol(value: String): this.type
    
      def setLabelCol(value: String): this.type
    
      // kept for backward compatibility and cross-validation
      def setMetricName(value: String): this.type
    
      // kept for backward compatibility and cross-validation
      override def evaluate(dataset: Dataset[_]): Double
    
      // calculates multiple metrics; will try to optimize the calculation internally
      override def getMetrics(dataset: Dataset[_], metrics: Array[String]): Map[String, Any] // or wrap it in a custom class
    
    }
    
    val ce = new ClassificationEvaluator().setLabelCol("x").setPredictionCol("y")
    val metrics = ce.getMetrics(dataframe,
      Array(Classification.truePositiveRateByLabel, BinaryClassification.areaUnderROC))
    println(metrics)
    ```
    This would basically let us merge `BinaryClassificationEvaluator` and 
`MulticlassClassificationEvaluator`.
    
    Similarly we can have `RegressionEvaluator` and `ClusteringEvaluator`; 
those stay separate because each may need different setters.
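    To illustrate the "figure out the best way to calculate" part, here is a 
rough, self-contained sketch of how `getMetrics` could group requested metric 
names by the underlying `mllib` `Metrics` class that computes them, so each 
group costs only one pass over the data. The `passOf` mapping and metric names 
here are illustrative assumptions, not the real registry:
    
    ```scala
    // Hypothetical sketch: group requested metrics by the Metrics object
    // that can compute them; each resulting group needs one DataFrame pass.
    object MetricDispatch {
      // assumed mapping; the real evaluator would own this registry
      private val passOf: Map[String, String] = Map(
        "truePositiveRateByLabel" -> "MulticlassMetrics",
        "weightedPrecision"       -> "MulticlassMetrics",
        "areaUnderROC"            -> "BinaryClassificationMetrics"
      )
    
      // e.g. plan(Seq("areaUnderROC", "weightedPrecision")) yields two
      // groups, so two passes instead of one per requested metric
      def plan(requested: Seq[String]): Map[String, Seq[String]] =
        requested.groupBy(m => passOf.getOrElse(m, "unknown"))
    }
    ```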
    
    2. Summary classes may invoke the `Evaluator` internally.
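    For 2, a minimal sketch of what I mean, assuming the `ClassificationEvaluator` 
API proposed above; the class and metric names are hypothetical:
    
    ```scala
    // Hypothetical sketch: a summary class delegating to the shared
    // evaluator, so metric logic lives in one place rather than being
    // duplicated in each training summary.
    class ClassificationSummary(
        evaluator: ClassificationEvaluator,
        predictions: DataFrame) {
    
      // computed lazily through the evaluator on first access
      lazy val accuracy: Double =
        evaluator.getMetrics(predictions, Array("accuracy"))("accuracy")
          .asInstanceOf[Double]
    }
    ```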
    
    @felixcheung, I'm not sure whether this can get a shepherd and review 
bandwidth for the next release. I don't want to just update version numbers 
every few months.
    


