huaxingao commented on issue #27527: [SPARK-30776][ML] Support FValueRegressionSelector for continuous features and continuous labels
URL: https://github.com/apache/spark/pull/27527#issuecomment-584820773

### Outline of the changes:

1. Add new abstract classes Selector and SelectorModel. All the code shared between ChiSqSelector and the newly added FValueRegressionSelector is put in these abstract classes. Selector has two abstract methods:
```
getSelectionTestResult(dataset: Dataset[_]): Array[SelectionTestResult]
createSelectorModel: T
```

2. Make ChiSqSelector extend Selector.
   - Implement ```getSelectionTestResult``` to return an array of
     ```
     ChiSqTestResult(pValue, degreeOfFreedom, statistics)
     ```
     pValue is used to rank the features and make the selection.
   - Implement ```createSelectorModel``` to return a ```ChiSqSelectorModel```.

3. FValueRegressionSelector extends Selector.
   - Implement ```getSelectionTestResult``` to return an array of
     ```
     FValueRegressionTestResult(pValue, degreeOfFreedom, statistics) // statistics is fValue
     ```
     pValue is used to rank the features and make the selection.
   - Implement ```createSelectorModel``` to return a ```FValueRegressionSelectorModel```.
```
fValue calculation:
X: feature
Y: label
N: numOfSample
degreeOfFreedom = N - 2
covariance = sum((Xi - avg(X)) * (Yi - avg(Y))) / (N - 1)
correlation = covariance / (Xstd * Ystd)
fValue = correlation * correlation / (1 - correlation * correlation) * degreeOfFreedom
```

4. The ChiSqSelectorModel constructor changes because two more parameters, statistics and pValue, were added. I think every SelectorModel (ChiSqSelectorModel and FRegressionSelectorModel) should return its statistics (chi-square statistics or F-values) and p-values. This addresses the comment in https://github.com/apache/spark/pull/27322:
> I found that f_regression in scikit-learn will return both arrays of F-values and P-values, can we also add them to FRegressionSelectorModel?

5. Because the ChiSqSelectorModel constructor now takes the two additional parameters statistics and pValue, I added ml-models/chisq-3.0.0 and modified ChiSqSelectorSuite to make sure that models saved before 3.1.0 can still be loaded in 3.1.0.
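
For illustration, the fValue calculation outlined in item 3 can be sketched in plain Scala without any Spark dependency. This is not the code from the PR: the object name `FValueSketch` and the method `fValue` are hypothetical, and the sketch assumes that Xstd and Ystd are sample standard deviations (divided by N - 1), matching the covariance formula. It computes only the F statistic and the degrees of freedom; deriving the p-value from the F distribution is left out.

```scala
// Illustrative sketch only: plain Scala, no Spark dependency.
// FValueSketch and fValue are hypothetical names, not taken from the PR.
object FValueSketch {

  /** Returns (fValue, degreeOfFreedom) for one feature column x against label y. */
  def fValue(x: Array[Double], y: Array[Double]): (Double, Long) = {
    require(x.length == y.length, "feature and label must have the same length")
    require(x.length > 2, "need at least 3 samples so that N - 2 > 0")

    val n = x.length
    val xMean = x.sum / n
    val yMean = y.sum / n

    // Sample covariance, divided by N - 1 as in the outline above.
    val covariance =
      x.zip(y).map { case (xi, yi) => (xi - xMean) * (yi - yMean) }.sum / (n - 1)

    // Assumption: sample standard deviations, also divided by N - 1.
    val xStd = math.sqrt(x.map(xi => (xi - xMean) * (xi - xMean)).sum / (n - 1))
    val yStd = math.sqrt(y.map(yi => (yi - yMean) * (yi - yMean)).sum / (n - 1))

    val correlation = covariance / (xStd * yStd)
    val degreeOfFreedom = n - 2L

    // A constant column yields NaN and a perfect correlation yields Infinity;
    // this sketch does not handle those cases specially.
    val f = correlation * correlation / (1 - correlation * correlation) * degreeOfFreedom
    (f, degreeOfFreedom)
  }
}
```

As a check by hand, for X = (1, 2, 3, 4, 5) and Y = (2, 4, 5, 4, 5) the squared correlation is 0.6, so `FValueSketch.fValue` should return an F-value of approximately 4.5 with 3 degrees of freedom.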
