[jira] [Created] (SPARK-7151) Correlation methods for DataFrame

Joseph K. Bradley (JIRA) Sun, 26 Apr 2015 00:03:07 -0700

Joseph K. Bradley created SPARK-7151:
----------------------------------------


             Summary: Correlation methods for DataFrame
                 Key: SPARK-7151
                 URL: https://issues.apache.org/jira/browse/SPARK-7151
             Project: Spark
          Issue Type: New Feature
          Components: ML, SQL
            Reporter: Joseph K. Bradley
            Priority: Minor


We should support computing correlations between columns in DataFrames with a 
simple API.

This could be a DataFrame feature:
{code}
myDataFrame.corr("col1", "col2")
// or
myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
{code}
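
For reference, the "pearson" type above would be the Pearson correlation coefficient. Below is a minimal plain-Scala sketch of that computation over two in-memory columns, with no Spark dependency. The object and method names (PearsonSketch, pearson) are hypothetical and only illustrate the arithmetic; this is not a proposed implementation.

```scala
object PearsonSketch {
  /** Pearson correlation of two equal-length, non-empty columns of doubles.
    * Hypothetical helper for illustration; not part of Spark.
    */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length, "columns must have the same length")
    require(xs.nonEmpty, "columns must be non-empty")

    val n = xs.length.toDouble
    val meanX = xs.sum / n
    val meanY = ys.sum / n

    // Sum of products of deviations from the means (unnormalized covariance).
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations (unnormalized variances).
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum

    // The 1/n factors cancel; the result is NaN if either column is constant.
    cov / math.sqrt(varX * varY)
  }
}
```

A distributed version would presumably compute the same sums in a single pass over the rows rather than materializing the columns.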

Or it could be an MLlib feature:
{code}
Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
// or
Statistics.corr(myDataFrame, "col1", "col2")
{code}
(The first Statistics.corr option is more flexible, but it could cause trouble 
if a user tries to pass in 2 unzippable DataFrame columns.)
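
One reading of "unzippable" is that the two columns cannot be paired row by row. Spark's RDD zip, for example, assumes both sides have the same number of partitions and the same number of elements in each partition. The toy model below is plain Scala, not Spark code, and sketches that constraint. PartitionedColumn and zipColumns are hypothetical names used only for illustration.

```scala
object ZipSketch {
  // Toy stand-in for a distributed column: values grouped by partition.
  type PartitionedColumn = Seq[Seq[Double]]

  /** Pair two columns element by element, or explain why they cannot be paired.
    * Mirrors the assumption that both sides share the same partition layout.
    */
  def zipColumns(
      x: PartitionedColumn,
      y: PartitionedColumn): Either[String, Seq[(Double, Double)]] = {
    if (x.length != y.length) {
      Left(s"partition counts differ: ${x.length} vs ${y.length}")
    } else if (x.zip(y).exists { case (px, py) => px.length != py.length }) {
      Left("at least one pair of partitions differs in size")
    } else {
      // Layouts match, so zipping partition by partition keeps rows aligned.
      Right(x.zip(y).flatMap { case (px, py) => px.zip(py) })
    }
  }
}
```

Taking a single DataFrame plus two column names, as in the second Statistics.corr option, sidesteps this check because both columns come from the same rows.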

Note: R follows the latter setup.  I'm OK with either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
