Github user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/19638#discussion_r148664242
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala
---
@@ -125,4 +125,14 @@ class RegressionMetrics @Since("2.0.0") (
1 - SSerr / SStot
}
}
+
+ /**
+ * Returns adjusted R^2^, the adjusted coefficient of determination.
+ * @see <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2">
+ * Coefficient of determination (Wikipedia)</a>
+ */
+ @Since("2.3.0")
+ def r2adj: Double = {
+ 1 - (SSerr / (summary.count - summary.numParam - 1)) / (SStot / (summary.count - 1))
--- End diff --
This isn't correct when the model is fit without an intercept. This [previous
PR](https://github.com/apache/spark/pull/10384/) is relevant. Actually, there's
a bigger problem: `RegressionMetrics` is only passed predictions and
observations, and knows nothing about the regression model that produced them
(for example, whether an intercept was fit). Adjusted r2 doesn't make sense
here. In fact, r2 shouldn't be here either, since it's only valid for linear
regression models.
The solution I propose: add a `val r2adj` in the linear regression summary,
but simply define it in terms of the r2 value and don't add it to regression
metrics or regression evaluator.
```scala
val r2adj: Double = {
  val interceptDOF = if (privateModel.getFitIntercept) 1 else 0
  1 - (1 - r2) * (numInstances - interceptDOF) /
    (numInstances - privateModel.coefficients.size - interceptDOF)
}
```
OK, but then you can't use it when doing cross validation, right? I'm not
sure there's a solution there; maybe add a `LinearRegressionEvaluator`?
Note that `r2` and `r2adj` are not valid for non-linear regression:
http://statisticsbyjim.com/regression/r-squared-invalid-nonlinear-regression/
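
To make the intercept handling in the `r2adj` definition proposed above easier
to check, here is a minimal standalone sketch of the same formula. The object
`AdjustedR2Sketch`, the method `adjustedR2`, and its parameter names are
hypothetical and are not part of Spark; they stand in for `r2`, `numInstances`,
`privateModel.coefficients.size`, and `privateModel.getFitIntercept`.

```scala
// Standalone sketch only: the names below are hypothetical, not Spark API.
object AdjustedR2Sketch {

  /**
   * Adjusted R^2 expressed in terms of the unadjusted R^2.
   *
   * @param r2              unadjusted coefficient of determination
   * @param numInstances    number of observations (n)
   * @param numCoefficients number of fitted coefficients, excluding the intercept (p)
   * @param fitIntercept    whether the model was fit with an intercept
   */
  def adjustedR2(
      r2: Double,
      numInstances: Long,
      numCoefficients: Int,
      fitIntercept: Boolean): Double = {
    // The intercept consumes one degree of freedom only when it is fit.
    val interceptDOF = if (fitIntercept) 1 else 0
    1 - (1 - r2) * (numInstances - interceptDOF) /
      (numInstances - numCoefficients - interceptDOF)
  }
}
```

With an intercept this reduces to `1 - (1 - r2) * (n - 1) / (n - p - 1)`, which
matches the expression in the diff; without one it becomes
`1 - (1 - r2) * n / (n - p)`, which is the case the diff gets wrong. For
example, `AdjustedR2Sketch.adjustedR2(0.9, 101L, 4, fitIntercept = true)` is
roughly 0.8958.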