[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng resolved SPARK-31301.
----------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed
Issue resolved by pull request 28176
[https://github.com/apache/spark/pull/28176]
> flatten the result dataframe of tests in stat
> ---------------------------------------------
>
> Key: SPARK-31301
> URL: https://issues.apache.org/jira/browse/SPARK-31301
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.1.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
> Fix For: 3.1.0
>
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
>
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
>
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
>
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array<int> ... 1 more field]
>
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>
> The current implementations of {{ChiSquareTest}}, {{ANOVATest}},
> {{FValueTest}}, and {{Correlation}} all return a DataFrame containing only
> one row.
> I think this is quite hard to use. Suppose we have a dataset with dim=1000:
> the only thing we can do with the test result is collect it with {{head()}}
> or {{first()}} and then use it on the driver.
> What I really want to do is filter the DataFrame, e.g. {{pValue>0.1}} or
> {{corr<0.5}}. *So I suggest flattening the output DataFrame in those tests.*
>
> Note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but
> {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
>
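For illustration, here is a minimal sketch of the driver-side workaround the description complains about: collect the single result row with head(), then rebuild a DataFrame with one row per feature so it can be filtered. The object name FlattenChiSquare, the helper flatten, and the output column names (featureIndex, pValue, degreesOfFreedom, statistic) are assumptions made for this example and are not taken from the issue. It only relies on the three-argument ChiSquareTest.test shown in the transcript above.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.{DataFrame, SparkSession}

object FlattenChiSquare {

  // Hypothetical helper: turns the one-row result of ChiSquareTest.test
  // into one row per feature. Column names are illustrative only.
  def flatten(spark: SparkSession, result: DataFrame): DataFrame = {
    import spark.implicits._

    // The result has exactly one row, so it has to be collected to the driver.
    val row = result.head()
    val pValues = row.getAs[Vector]("pValues").toArray
    val dofs = row.getAs[scala.collection.Seq[Int]]("degreesOfFreedom")
    val stats = row.getAs[Vector]("statistics").toArray

    pValues.indices
      .map(i => (i, pValues(i), dofs(i), stats(i)))
      .toSeq
      .toDF("featureIndex", "pValue", "degreesOfFreedom", "statistic")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-chi-square")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the issue description.
    val df = Seq(
      (0.0, Vectors.dense(0.5, 10.0)),
      (0.0, Vectors.dense(1.5, 20.0)),
      (1.0, Vectors.dense(1.5, 30.0)),
      (0.0, Vectors.dense(3.5, 30.0)),
      (0.0, Vectors.dense(3.5, 40.0)),
      (1.0, Vectors.dense(3.5, 40.0))
    ).toDF("label", "features")

    val flat = flatten(spark, ChiSquareTest.test(df, "features", "label"))

    // With one row per feature, the filter from the description is a plain
    // DataFrame operation.
    flat.filter($"pValue" > 0.1).show()

    spark.stop()
  }
}
```

The collect-and-rebuild step is exactly the inconvenience the issue describes. If the tests returned the flattened layout directly, the head() call and the driver-side loop would not be needed.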