Sean Owen created SPARK-14533:
---------------------------------
Summary: RowMatrix.computeCovariance inaccurate when values are very large
Key: SPARK-14533
URL: https://issues.apache.org/jira/browse/SPARK-14533
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.6.1, 2.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
The following code computes the Pearson correlation of two independent samples,
so the result should be near 0; instead it produces a value that's quite
different from 0, sometimes outside [-1,1] or even NaN:
{code}
val a = RandomRDDs.normalRDD(sc, 100000, 10).map(_ + 1000000000.0)
val b = RandomRDDs.normalRDD(sc, 100000, 10).map(_ + 1000000000.0)
val p = Statistics.corr(a, b, method = "pearson")
{code}
This is a "known issue" to some degree, given how Cov(X,Y) is calculated in
{{RowMatrix.getCovariance}}: as Cov(X,Y) = E[XY] - E[X]E[Y]. When the column
means are large relative to the covariance, the two terms nearly cancel and
most significant digits are lost. The simpler and more accurate approach is to
center the input before computing the Gramian, but this would be inefficient
for sparse data.
However, for dense data -- which includes the code paths that compute
correlations -- this approach is quite sensible. This would improve accuracy
for the dense row case, at least.
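The cancellation can be reproduced without Spark. A minimal sketch in plain Scala, comparing the current E[XY] - E[X]E[Y] formula against the centered two-pass formula; the sample size, seed, and 1e9 shift here are illustrative assumptions, not values from the issue:

```scala
// Minimal sketch (no Spark) of the cancellation problem. The sample
// size, seed, and 1e9 shift are illustrative assumptions.
val rng   = new scala.util.Random(42)
val n     = 100000
val shift = 1e9
val x = Array.fill(n)(rng.nextGaussian() + shift)
val y = Array.fill(n)(rng.nextGaussian() + shift)

def mean(a: Array[Double]): Double = a.sum / a.length

// Formula currently used, Cov(X,Y) = E[XY] - E[X]E[Y]: both terms are
// ~1e18 while the true covariance is near 0, so nearly all of the
// significant digits of a Double cancel.
val exy   = x.indices.map(i => x(i) * y(i)).sum / n
val naive = exy - mean(x) * mean(y)

// Centered two-pass formula, E[(X - mx)(Y - my)]: the summands stay at
// the natural scale of the data's variance and remain accurate.
val mx = mean(x)
val my = mean(y)
val centered = x.indices.map(i => (x(i) - mx) * (y(i) - my)).sum / n
```

With the shift applied, the naive value is typically off by several orders of magnitude relative to the true covariance of the independent columns, while the centered value stays small.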
Also, the mean column values computed in this method can be obtained more
simply and accurately from {{computeColumnSummaryStatistics()}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]