[
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peng Meng updated SPARK-17870:
------------------------------
Summary: ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is
wrong (was: ML/MLLIB: Statistics.chiSqTest(RDD) is wrong )
> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
> ------------------------------------------------------------------------
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Reporter: Peng Meng
> Priority: Critical
>
> The way the ChiSqTestResult is computed in mllib/feature/ChiSqSelector.scala
> (line 233) is wrong.
> The feature selection method ChiSquareSelector selects features based on
> ChiSqTestResult.statistic (the chi-square value): it picks the features with
> the largest chi-square values. But the degrees of freedom (df) of the
> chi-square values returned by Statistics.chiSqTest(RDD) differ from feature
> to feature, and chi-square values with different df cannot be compared
> directly to select features.
> Because of this incorrect use of the chi-square value, the feature selection
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> With selectKBest, feature 3 is selected.
> With selectFpr, features 1 and 2 are selected.
> This is strange: the two methods select disjoint sets of features.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is
> the same.
> I plan to submit a PR for this problem.
>
>
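The df argument in the description can be illustrated with a small sketch. It is not the Spark or scikit-learn implementation: the feature names, statistics, and df values below are hypothetical, chosen only to show that the feature with the largest chi-square statistic need not have the smallest p-value once df differs. The helper `chi2_sf` is a hand-written survival function for integer df, using only the Python standard library.

```python
import math


def chi2_sf(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form base cases: df = 2 (even) or df = 1 (odd).
    if df % 2 == 0:
        k, q = 2, math.exp(-half)
    else:
        k, q = 1, math.erfc(math.sqrt(half))
    # Recurrence on the regularized upper incomplete gamma function:
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1), with a = k / 2.
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


# Hypothetical (statistic, df) pairs, not taken from ChiSqSelectorSuite.
features = {"A": (6.0, 1), "B": (8.0, 4)}


def top_k_by_statistic(feats, k):
    """Rank by raw chi-square statistic, ignoring df."""
    ranked = sorted(feats, key=lambda name: feats[name][0], reverse=True)
    return sorted(ranked[:k])


def top_k_by_p_value(feats, k):
    """Rank by p-value, which accounts for df."""
    ranked = sorted(feats, key=lambda name: chi2_sf(*feats[name]))
    return sorted(ranked[:k])


def select_fpr(feats, alpha):
    """Keep features whose p-value is below alpha."""
    return sorted(n for n, (s, d) in feats.items() if chi2_sf(s, d) < alpha)


print(top_k_by_statistic(features, 1))
print(top_k_by_p_value(features, 1))
print(select_fpr(features, 0.05))
```

With these assumed values, ranking by statistic prefers B, while B's p-value (about 0.092) is larger than A's (about 0.014), so a p-value threshold of 0.05 keeps only A. This is the same kind of disagreement between selectKBest and selectFpr that the issue reports.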
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)