Joseph K. Bradley created SPARK-7151:
----------------------------------------
Summary: Correlation methods for DataFrame
Key: SPARK-7151
URL: https://issues.apache.org/jira/browse/SPARK-7151
Project: Spark
Issue Type: New Feature
Components: ML, SQL
Reporter: Joseph K. Bradley
Priority: Minor
We should support computing correlations between columns in DataFrames with a
simple API.
This could be a DataFrame feature:
{code}
myDataFrame.corr("col1", "col2")
// or
myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
{code}
Or it could be an MLlib feature:
{code}
Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
// or
Statistics.corr(myDataFrame, "col1", "col2")
{code}
(The first Statistics.corr option is more flexible, but it could cause trouble
if a user passes in two columns that cannot be zipped together, for example
columns taken from different DataFrames.)
Note: R follows the latter setup. I'm OK with either.
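
For reference, here is a rough sketch of what the "pearson" option would compute for two columns, written against plain Scala collections rather than DataFrames. The names PearsonSketch and pearson are hypothetical and only for illustration; they are not an existing or proposed Spark API. The length check stands in for the "unzippable columns" concern: the two inputs have to pair up row by row.

```scala
// Hypothetical illustration only: plain Scala, no Spark dependency.
object PearsonSketch {
  // Pearson correlation of two equally long sequences of doubles.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    // Columns that do not line up row by row cannot be zipped.
    require(xs.length == ys.length, "columns must have the same number of rows")
    require(xs.nonEmpty, "columns must not be empty")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sum of products of deviations from the means (unnormalized covariance).
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations (unnormalized variances).
    val varX = xs.map(x => math.pow(x - meanX, 2)).sum
    val varY = ys.map(y => math.pow(y - meanY, 2)).sum
    // Yields NaN if either column is constant (zero variance).
    cov / math.sqrt(varX * varY)
  }

  def main(args: Array[String]): Unit = {
    val col1 = Seq(1.0, 2.0, 3.0, 4.0)
    val col2 = Seq(2.0, 4.0, 6.0, 8.0)
    // col2 is an exact positive multiple of col1, so the correlation is 1.0.
    println(pearson(col1, col2))
  }
}
```

A real implementation would need to compute these sums in a distributed pass over the data and decide how to treat nulls; the sketch ignores both.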