Peng Meng commented on SPARK-17870:

Hi [~srowen], thanks very much for your quick reply.
Yes, the p-value is better than the raw statistic in this case, because the
p-value is computed from both the DoF and the raw statistic.
The raw statistic is also popular for feature selection: SelectKBest and
SelectPercentile in scikit-learn are based on the raw statistic.
The question here is whether we should use the same DoF as scikit-learn when
computing the ChiSquare value.
For this JIRA, I propose changing the computation of the ChiSquare value to
match what scikit-learn does (that is, changing Statistics.chiSqTest(RDD)).
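
To make the proposal concrete, here is a minimal pure-Python sketch of a chi-square statistic computed in the style I understand scikit-learn's chi2 to use: each feature's values are summed per class, compared with the sums expected under independence, and every feature ends up with the same DoF (number of classes minus one). The function name chi2_fixed_dof and the toy data are hypothetical, for illustration only; this is neither scikit-learn's nor Spark's actual code.

```python
def chi2_fixed_dof(X, y):
    """Per-feature chi-square statistic from class-by-feature sums.

    X: list of rows of non-negative feature values.
    y: list of class labels, one per row.
    Returns (stats, dof); dof is n_classes - 1 for every feature.
    """
    classes = sorted(set(y))
    n_samples = len(y)
    n_features = len(X[0])

    # observed[c][j] = sum of feature j over the samples of class c
    observed = {c: [0.0] * n_features for c in classes}
    class_count = {c: 0 for c in classes}
    for row, label in zip(X, y):
        class_count[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value

    stats = []
    for j in range(n_features):
        feature_total = sum(observed[c][j] for c in classes)
        stat = 0.0
        for c in classes:
            # Expected sum if feature j were independent of the class.
            expected = class_count[c] / n_samples * feature_total
            if expected > 0:
                stat += (observed[c][j] - expected) ** 2 / expected
        stats.append(stat)
    return stats, len(classes) - 1


# Toy data (hypothetical): feature 0 tracks the label, feature 1 is constant.
X = [[1, 1], [1, 1], [0, 1], [0, 1]]
y = [0, 0, 1, 1]
stats, dof = chi2_fixed_dof(X, y)
print(stats, dof)  # prints: [2.0, 0.0] 1
```

Because the DoF is identical for all features here, ranking features by the raw statistic gives the same order as ranking them by p-value.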

Thanks very much.  

> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> ---------------------------------------------
>                 Key: SPARK-17870
>                 URL: https://issues.apache.org/jira/browse/SPARK-17870
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>            Reporter: Peng Meng
>            Priority: Critical
> The method that computes the ChiSqTestResult in mllib/feature/ChiSqSelector.scala
> (line 233) is wrong.
> The ChiSqSelector feature selection method selects features based on
> ChiSqTestResult.statistic (the ChiSquare value): it picks the features with
> the largest ChiSquare values. But the degrees of freedom (df) behind each
> ChiSquare value differ in Statistics.chiSqTest(RDD), and ChiSquare values
> computed with different df cannot be compared directly to select features.
> Because the ChiSquare value is computed the wrong way, the feature selection
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> With selectKBest, feature 3 is selected.
> With selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is
> the same.
> I plan to submit a PR for this problem.
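
The description's point that ChiSquare values with different df cannot be compared can be illustrated with a small sketch. It computes the upper-tail p-value of a chi-square statistic for an integer df using only the Python standard library, via the standard recurrence for the regularized incomplete gamma function. The helper name chi2_sf is hypothetical; in practice a library routine such as scipy.stats.chi2.sf would normally be used instead.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for chi-square with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if stat <= 0:
        return 1.0
    y = stat / 2.0
    if df % 2 == 0:
        a = 1.0
        q = math.exp(-y)               # Q(1, y): closed form for df = 2
    else:
        a = 0.5
        q = math.erfc(math.sqrt(y))    # Q(1/2, y): closed form for df = 1
    target = df / 2.0
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < target:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


# The same raw statistic is far less significant as df grows.
for df in (1, 3, 5):
    print(df, chi2_sf(6.0, df))
```

A feature whose statistic is 6.0 with df = 1 is significant at the 0.05 level, while the same statistic with df = 5 is not, so selecting the top-k features by raw statistic is only meaningful when all features share the same df.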

