huaxingao commented on issue #27527: [SPARK-30776][ML] Support 
FValueRegressionSelector for continuous features and continuous labels
URL: https://github.com/apache/spark/pull/27527#issuecomment-584820773
 
 
   ### Outline of the changes:
   1. Add new abstract classes Selector and SelectorModel. All the code shared 
by ChiSqSelector and the newly added FValueRegressionSelector is moved into 
these abstract classes. Selector has two abstract methods:
   ```
   getSelectionTestResult(dataset: Dataset[_]): Array[SelectionTestResult]
   
   createSelectorModel: T
   ```
   2. Make ChiSqSelector extend Selector.  
   It implements ```getSelectionTestResult``` to return an array of ```ChiSqTestResult(pValue, degreeOfFreedom, statistics)```.  
   pValue is used to rank the features and make the selection.  
   It implements ```createSelectorModel``` to return a ```ChiSqSelectorModel```.
   
   3. Make FValueRegressionSelector extend Selector.  
   It implements ```getSelectionTestResult``` to return an array of ```FValueRegressionTestResult(pValue, degreeOfFreedom, statistics)```, where statistics is the fValue.  
   pValue is used to rank the features and make the selection.  
   It implements ```createSelectorModel``` to return a ```FValueRegressionSelectorModel```.
   
   ```
   fValue calculation:    X: feature    Y: label    N: number of samples
   degreeOfFreedom = N - 2
   covariance = sum((Xi - avg(X)) * (Yi - avg(Y))) / (N - 1)
   correlation = covariance / (Xstd * Ystd)
   fValue = correlation * correlation / (1 - correlation * correlation) * degreeOfFreedom
   ```
   
   
   4. The ChiSqSelectorModel constructor changes because two more 
parameters, statistics and pValue, were added. I think every SelectorModel 
(ChiSqSelectorModel and FValueRegressionSelectorModel) should return the 
statistics (chi-square statistics or F-values) and the p-values. This addresses 
the following comment in https://github.com/apache/spark/pull/27322:  
   > I found that f_regression in scikit-learn will return both arrays of 
F-values and P-values, can we also add them to FRegressionSelectorModel? 
   
   
   5. Because the ChiSqSelectorModel constructor gained the two parameters 
statistics and pValue, I added ml-models/chisq-3.0.0 and modified 
ChiSqSelectorSuite to make sure that a model saved before 3.1.0 can still be 
loaded in 3.1.0. 
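
To make the class layout in items 1 to 4 easier to picture, here is a minimal, Spark-free sketch of the idea. It is not the code in this PR: `SelectorSketch`, `SketchSelectorModel`, `PrecomputedSelector`, the `fit` signature, the arguments of `createSelectorModel`, and the "keep the N smallest p-values" rule are all simplified assumptions for illustration; the real classes work on `Dataset[_]` and ML params.

```scala
// Hypothetical, simplified stand-ins for the classes named in the outline.
trait SelectionTestResult {
  def pValue: Double
  def degreeOfFreedom: Long
  def statistics: Double
}

final case class FValueRegressionTestResult(
    pValue: Double,
    degreeOfFreedom: Long,
    statistics: Double) extends SelectionTestResult

// Model that also exposes the per-feature statistics and p-values (item 4).
final case class SketchSelectorModel(
    selectedFeatures: Array[Int],
    statistics: Array[Double],
    pValues: Array[Double])

abstract class SelectorSketch[D, T] {
  // Subclass hook: one test result per feature, in feature order.
  protected def getSelectionTestResult(dataset: D): Array[SelectionTestResult]

  // Subclass hook: build the concrete model type (arguments are an assumption).
  protected def createSelectorModel(
      selected: Array[Int],
      results: Array[SelectionTestResult]): T

  // Shared code: rank features by ascending pValue and keep the best ones.
  def fit(dataset: D, numTopFeatures: Int): T = {
    val results = getSelectionTestResult(dataset)
    val selected = results.zipWithIndex
      .sortBy { case (r, _) => r.pValue }
      .take(numTopFeatures)
      .map { case (_, i) => i }
      .sorted
    createSelectorModel(selected, results)
  }
}

// Toy subclass whose "dataset" is already an array of precomputed results.
object PrecomputedSelector
    extends SelectorSketch[Array[SelectionTestResult], SketchSelectorModel] {
  protected def getSelectionTestResult(
      dataset: Array[SelectionTestResult]): Array[SelectionTestResult] = dataset

  protected def createSelectorModel(
      selected: Array[Int],
      results: Array[SelectionTestResult]): SketchSelectorModel =
    SketchSelectorModel(selected, results.map(_.statistics), results.map(_.pValue))
}
```

The point of the sketch is only the split of responsibilities: the ranking and selection logic lives once in the abstract class, while each concrete selector supplies its own test and its own model type.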
   
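
The fValue calculation in item 3 can be sketched on plain arrays as follows. This is an illustrative sketch, not the distributed implementation in the PR: the object name `FValueSketch` is made up, and it assumes that Xstd and Ystd are sample standard deviations with the same N - 1 denominator as the covariance, so that correlation is the usual Pearson coefficient. Turning the fValue into a pValue needs the F distribution and is left out.

```scala
object FValueSketch {
  /** Returns (fValue, degreeOfFreedom) for one continuous feature x and label y. */
  def fValue(x: Array[Double], y: Array[Double]): (Double, Long) = {
    require(x.length == y.length, "feature and label must have the same length")
    require(x.length > 2, "need more than two samples so that N - 2 > 0")
    val n = x.length
    val degreeOfFreedom = n - 2L
    val xAvg = x.sum / n
    val yAvg = y.sum / n
    // Sample covariance; the N - 1 denominator follows the formula in item 3.
    val covariance =
      x.zip(y).map { case (xi, yi) => (xi - xAvg) * (yi - yAvg) }.sum / (n - 1)
    // Assumption: standard deviations use the same N - 1 denominator.
    val xStd = math.sqrt(x.map(xi => (xi - xAvg) * (xi - xAvg)).sum / (n - 1))
    val yStd = math.sqrt(y.map(yi => (yi - yAvg) * (yi - yAvg)).sum / (n - 1))
    // A constant feature or label gives a zero standard deviation and hence NaN.
    val correlation = covariance / (xStd * yStd)
    val f = correlation * correlation / (1 - correlation * correlation) * degreeOfFreedom
    (f, degreeOfFreedom)
  }
}
```

For example, with X = (1, 2, 3, 4) and Y = (2, 1, 4, 3) the covariance is 1, both variances are 5/3, and the correlation is 0.6, so `FValueSketch.fValue` gives fValue = 0.36 / 0.64 * 2 = 1.125 with degreeOfFreedom = 2.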
   
       
