[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-17870.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 15444
[https://github.com/apache/spark/pull/15444]

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
> ------------------------------------------------------------------------
>
>                 Key: SPARK-17870
>                 URL: https://issues.apache.org/jira/browse/SPARK-17870
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>            Reporter: Peng Meng
>            Priority: Critical
>             Fix For: 2.1.0
>
>
> The way the chi-square test result (ChiSqTestResult) is computed in
> mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The ChiSqSelector feature selection method selects features based on
> ChiSqTestResult.statistic (the chi-square value): it selects the features
> with the largest chi-square values. But the degrees of freedom (df) of the
> chi-square values returned by Statistics.chiSqTest(RDD) differ from feature
> to feature, and chi-square values with different df cannot be compared
> directly to select features.
> Because the chi-square value is computed the wrong way, the feature
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> With selectKBest: feature 3 is selected.
> With selectFpr: features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is
> the same.
> I plan to submit a PR for this problem.
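
The point about degrees of freedom in the report can be illustrated with a small sketch. It is not taken from the Spark code or from pull request 15444; the feature names, statistics, and df values are made up for illustration, and chi2_sf is a hypothetical helper written with the Python standard library only. It shows that the feature with the larger chi-square value can be the less significant one once df is taken into account, which is why ranking by p-value (one possible remedy, assumed here) can give a different answer than ranking by the raw statistic.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Hypothetical helper for this sketch; df must be a positive integer.
    Uses the closed forms of the regularized upper incomplete gamma
    function for integer and half-integer shape parameters.
    """
    if x <= 0:
        return 1.0
    h = x / 2.0
    m = df // 2
    if df % 2 == 0:
        # Even df = 2m: exp(-h) * sum_{k=0}^{m-1} h^k / k!
        return math.exp(-h) * sum(h ** k / math.factorial(k) for k in range(m))
    # Odd df = 2m + 1: erfc(sqrt(h)) + exp(-h) * sum_{k=1}^{m} h^(k-1/2) / Gamma(k+1/2)
    tail = math.erfc(math.sqrt(h))
    tail += math.exp(-h) * sum(
        h ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, m + 1)
    )
    return tail


# Made-up (statistic, df) pairs, NOT the data from ChiSqSelectorSuite.
features = {
    "A": (6.0, 1),  # smaller statistic, 1 degree of freedom
    "B": (8.0, 4),  # larger statistic, 4 degrees of freedom
}

# Ranking by raw statistic ignores df and prefers B.
best_by_statistic = max(features, key=lambda name: features[name][0])

# Ranking by p-value accounts for df and prefers A
# (p is about 0.014 for A versus about 0.092 for B).
best_by_p_value = min(features, key=lambda name: chi2_sf(*features[name]))

print(best_by_statistic, best_by_p_value)
```

When every feature has the same df, as the report says is the case in scikit-learn, the two rankings agree, because the p-value decreases monotonically with the statistic for a fixed df.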