GitHub user NarineK opened a pull request:
https://github.com/apache/spark/pull/9366
[SPARK-11057] [SQL] Add correlation and covariance matrices
Hi there,
R can compute the correlation and covariance for all columns of a data
frame, or between the columns of two data frames.
Apache Commons Math offers a comparable computation over the columns of a matrix:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
When the input is a single DataFrame:
------------------------------------------------------
For correlation, the matrix is symmetric:
cor[i,j] = cor[j,i]
so only one triangle needs to be computed, and the main diagonal can be set to 1.
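
To illustrate the symmetry argument for correlation, here is a minimal sketch in plain Python. It is not the code in this pull request. The helper names `pearson` and `correlation_matrix` are hypothetical, and the columns are assumed to be equal-length, non-constant lists of numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant; a constant column would make
    the denominator zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Build the full matrix from the upper triangle only (hypothetical helper)."""
    k = len(columns)
    # Start with 1.0 everywhere so the main diagonal is already correct.
    cor = [[1.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            # Compute each pair once and mirror it: cor[i][j] == cor[j][i].
            cor[i][j] = cor[j][i] = pearson(columns[i], columns[j])
    return cor


# Illustrative data only: three columns with four rows each.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
cor = correlation_matrix(columns)
```

With k columns this evaluates k*(k-1)/2 pairs instead of k*k.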
---------------------
For covariance, the matrix is also symmetric:
cov[i,j] = cov[j,i]
and each main-diagonal entry is the variance of the corresponding column.
See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
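
The covariance case can be sketched the same way. This is a plain-Python illustration, not the code in this pull request. The helper names `covariance` and `covariance_matrix` are hypothetical, and the sketch assumes the bias-corrected sample covariance (dividing by n - 1), which may differ from what the patch uses.

```python
def covariance(x, y):
    """Sample covariance of two equal-length sequences.

    Assumption: bias-corrected form, dividing by n - 1 (needs n >= 2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def covariance_matrix(columns):
    """Fill the diagonal with variances and mirror the upper triangle."""
    k = len(columns)
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        # The covariance of a column with itself is its variance.
        cov[i][i] = covariance(columns[i], columns[i])
        for j in range(i + 1, k):
            # Compute each pair once and mirror it: cov[i][j] == cov[j][i].
            cov[i][j] = cov[j][i] = covariance(columns[i], columns[j])
    return cov


# Illustrative data only: two columns with four rows each.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
]
cov = covariance_matrix(columns)
```

Unlike the correlation matrix, the diagonal here cannot be filled with a constant, so each column needs one extra variance computation.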
Thanks,
Narine
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/NarineK/spark sparksqlcorcov
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9366.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9366
----
commit 74bdf5451dd92de10acfea3e1db9cd3325bf6dd7
Author: Narine Kokhlikyan <[email protected]>
Date: 2015-10-29T14:49:55Z
Initial commit for correelation and covariance matrices
----