Sean Owen created SPARK-14533:
---------------------------------

             Summary: RowMatrix.computeCovariance inaccurate when values are very large
                 Key: SPARK-14533
                 URL: https://issues.apache.org/jira/browse/SPARK-14533
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.6.1, 2.0.0
            Reporter: Sean Owen
            Assignee: Sean Owen
            Priority: Minor


Because a and b below are independent, their Pearson correlation should be 
close to 0. Instead, the following code produces a value that's quite different 
from 0, sometimes outside [-1,1] or even NaN:

{code}
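    import org.apache.spark.mllib.random.RandomRDDs
    import org.apache.spark.mllib.stat.Statistics
    // sc is an existing SparkContext; a and b are independent standard normals shifted by 1e9
    val a = RandomRDDs.normalRDD(sc, 100000, 10).map(_ + 1000000000.0)
    val b = RandomRDDs.normalRDD(sc, 100000, 10).map(_ + 1000000000.0)
    val p = Statistics.corr(a, b, method = "pearson")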
{code}

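This is a "known issue" to some degree, given how Cov(X,Y) is calculated in 
{{RowMatrix.computeCovariance}}, as Cov(X,Y) = E[XY] - E[X]E[Y]. When the 
values are large relative to their spread, E[XY] and E[X]E[Y] are nearly equal, 
and the subtraction cancels most of the significant digits. The simpler and 
more accurate approach is to center the input before computing the Gramian, 
but this would be inefficient for sparse data, since centering makes it dense.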
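
A tiny worked case makes the cancellation concrete. The sketch below is plain 
Scala, not the MLlib implementation; the helper names naiveCov and centeredCov 
are made up for illustration, and it uses the population (divide-by-n) 
covariance for brevity, whereas MLlib computes the sample covariance:

```scala
// Hypothetical helpers for illustration only; not part of MLlib.
def mean(v: Seq[Double]): Double = v.sum / v.length

// Cov(X,Y) = E[XY] - E[X]E[Y], the formula discussed above.
def naiveCov(xs: Seq[Double], ys: Seq[Double]): Double =
  mean(xs.zip(ys).map { case (x, y) => x * y }) - mean(xs) * mean(ys)

// Center first, then average the products of the deviations.
def centeredCov(xs: Seq[Double], ys: Seq[Double]): Double = {
  val mx = mean(xs)
  val my = mean(ys)
  mean(xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) })
}

// Two points one unit apart, shifted by 1e9. Passing xs twice gives the
// variance of xs, whose true (population) value is 0.25.
val xs = Seq(1e9, 1e9 + 1)
println(naiveCov(xs, xs))    // 0.0: doubles near 1e18 are spaced 128 apart
println(centeredCov(xs, xs)) // 0.25
```

A variance estimate of 0 like this leaves the correlation undefined, which is 
one way the NaN above can arise.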

However, for dense data -- which includes the code paths that compute 
correlations -- this approach is quite sensible. This would improve accuracy 
for the dense row case, at least.

Also, the mean column values computed in this method can be computed more 
simply and accurately from {{computeColumnSummaryStatistics()}}.
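
For the dense case, the idea could be sketched roughly as follows. This is 
only an illustration of centering before computing the Gramian, not the actual 
change for this issue; the function name centeredCovariance is hypothetical, 
and it assumes dense rows and more than one row:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Hypothetical helper: sample covariance of dense rows via centering.
def centeredCovariance(rows: RDD[Vector]): Matrix = {
  val mat = new RowMatrix(rows)
  val n = mat.numRows()
  require(n > 1, s"Need more than one row to compute covariance, got $n")

  // Column means taken from the summary statistics.
  val means = mat.computeColumnSummaryStatistics().mean.toArray

  // Subtract the column means from every row. A new array is built for each
  // row so the input vectors are not modified. This densifies sparse rows.
  val centered = new RowMatrix(rows.map { row =>
    val values = row.toArray
    Vectors.dense(Array.tabulate(values.length)(i => values(i) - means(i)))
  })

  // The Gramian of the centered rows is the sum of outer products of the
  // deviations; dividing by (n - 1) gives the sample covariance.
  val gram = centered.computeGramianMatrix()
  Matrices.dense(gram.numRows, gram.numCols, gram.toArray.map(_ / (n - 1)))
}
```

This costs one extra pass over the data for the means and gives up sparsity, 
which is why it is only attractive when the rows are already dense.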


