[jira] [Created] (SPARK-27867) RegressionEvaluator cache latest RegressionMetrics to avoid duplicated computation

zhengruifeng (JIRA) Tue, 28 May 2019 02:46:31 -0700

zhengruifeng created SPARK-27867:
------------------------------------

             Summary: RegressionEvaluator cache latest RegressionMetrics to 
avoid duplicated computation
                 Key: SPARK-27867
                 URL: https://issues.apache.org/jira/browse/SPARK-27867
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.0.0
            Reporter: zhengruifeng



In most cases, we need to compute multiple metrics for a given model.

For example, for a regression model we may need R2, MAE and MSE.

However, the current design of `Evaluator` does not support computing multiple 
metrics at once.

In practice, we usually use RegressionEvaluator like this:
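{code:java}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// df: a DataFrame with the default "prediction" and "label" columns
val evaluator = new RegressionEvaluator()

// each evaluate call below scans df again
val r2 = evaluator.setMetricName("r2").evaluate(df)
val mae = evaluator.setMetricName("mae").evaluate(df)
val mse = evaluator.setMetricName("mse").evaluate(df)
{code}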
 

However, the current implementation of RegressionEvaluator needs one pass over 
the whole input dataset to compute each metric, so the example above needs 
three passes.

This can be optimized, since `RegressionMetrics` can compute all metrics at 
once.
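
As a rough illustration of computing several metrics from one `RegressionMetrics` instance, here is a minimal sketch. It assumes the default column names `prediction` and `label` and no null values in those columns. `MetricsOnce` and `allMetrics` are hypothetical names used only for this example; they are not part of Spark.

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper, for illustration only (not part of Spark).
object MetricsOnce {
  def allMetrics(
      df: DataFrame,
      predictionCol: String = "prediction",
      labelCol: String = "label"): Map[String, Double] = {
    // Build the (prediction, label) pairs that RegressionMetrics expects.
    // Assumption: neither column contains nulls.
    val predictionAndLabels = df
      .select(col(predictionCol).cast(DoubleType), col(labelCol).cast(DoubleType))
      .rdd
      .map { case Row(prediction: Double, label: Double) => (prediction, label) }

    // One metrics object, queried for several metrics.
    val metrics = new RegressionMetrics(predictionAndLabels)
    Map(
      "r2" -> metrics.r2,
      "mae" -> metrics.meanAbsoluteError,
      "mse" -> metrics.meanSquaredError
    )
  }
}
```

With this, `MetricsOnce.allMetrics(df)` returns the three values from the example above without creating a separate evaluator call per metric.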

If we cache the latest inputs, and the next evaluate call uses the same 
inputs (everything except metricName), then we can obtain the metric directly 
from the cached intermediate summary.
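
The caching idea could look roughly like the sketch below. This is only an illustration under stated assumptions, not the proposed patch: `CachingRegressionEvaluator` is a hypothetical wrapper class, "same inputs" is approximated by reference equality on the DataFrame, and the columns are assumed to contain no nulls.

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical wrapper, for illustration only (not the Spark implementation).
class CachingRegressionEvaluator(
    predictionCol: String = "prediction",
    labelCol: String = "label") {

  // The last input (compared by reference) and the metrics object built from it.
  private var cached: Option[(DataFrame, RegressionMetrics)] = None

  private def metricsFor(df: DataFrame): RegressionMetrics = cached match {
    // Same DataFrame object as last time: reuse the existing metrics object.
    case Some((lastDf, metrics)) if lastDf eq df => metrics
    case _ =>
      val predictionAndLabels = df
        .select(col(predictionCol).cast(DoubleType), col(labelCol).cast(DoubleType))
        .rdd
        .map { case Row(prediction: Double, label: Double) => (prediction, label) }
      val metrics = new RegressionMetrics(predictionAndLabels)
      cached = Some((df, metrics))
      metrics
  }

  def evaluate(df: DataFrame, metricName: String): Double = {
    val metrics = metricsFor(df)
    metricName match {
      case "rmse" => metrics.rootMeanSquaredError
      case "mse"  => metrics.meanSquaredError
      case "mae"  => metrics.meanAbsoluteError
      case "r2"   => metrics.r2
      case other  => throw new IllegalArgumentException(s"Unsupported metric: $other")
    }
  }
}
```

Two caveats with this sketch: a logically identical DataFrame held in a different object will miss the cache, and the wrapper keeps a reference to the last DataFrame until another one is evaluated.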

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

