huaxingao commented on issue #27322: [SPARK-26111][ML][WIP] Support F-value 
between label/feature for continuous distribution feature selection
URL: https://github.com/apache/spark/pull/27322#issuecomment-577334796
 
 
   @srowen 
   Hi Sean, I am thinking of adding selectors for continuous-distribution features, and I want to ask your opinion before I go any further. I will also ask Ruifeng after the Chinese New Year holiday; I expect he is on vacation, so I don't want to ping him now.
   
   Currently, Spark only supports selection of categorical features (```ChiSqSelector```). I am thinking of adding two new selectors for continuous-distribution features:
   1. ```FValueRegressionSelector``` for continuous features and continuous labels.
   2. ```FValueClassificationSelector``` for continuous features and categorical labels.
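   For the regression case, the per-feature score is the squared Pearson correlation between feature and label converted into an F statistic with degrees of freedom (1, n - 2), which is how sklearn's ```f_regression``` scores a single feature. A minimal Python sketch (illustration only; the Spark implementation would be Scala, and the example data is made up):
   
   ```python
   def f_regression_stat(x, y):
       """F-value for one continuous feature vs. a continuous label.

       Computes the squared Pearson correlation r^2, then converts it to
       an F statistic with degrees of freedom (1, n - 2).
       """
       n = len(x)
       mx = sum(x) / n
       my = sum(y) / n
       cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
       var_x = sum((a - mx) ** 2 for a in x)
       var_y = sum((b - my) ** 2 for b in y)
       r2 = cov * cov / (var_x * var_y)          # squared Pearson correlation
       f = r2 / (1.0 - r2) * (n - 2)             # convert r^2 to an F statistic
       return f, 1, n - 2

   # made-up example: a nearly linear relationship gives a large F
   f, dfn, dfd = f_regression_stat([1.0, 2.0, 3.0, 4.0, 5.0],
                                   [1.1, 1.9, 3.2, 4.0, 5.1])
   ```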
   
   Currently, this WIP PR only has ```FValueRegressionSelector``` implemented. ```FValueClassificationSelector``` is very similar, but the calculation of the classification F-value is a little more complicated. I wrote the pseudocode here along with an example:
   
   ```
     // pseudo code:
     // for each feature:
     //     separate feature values into array of arrays by label (call it arr)
     //       e.g. if feature is [3.3, 2.5, 1.0, 3.0, 2.0] and labels are [1, 
2, 1, 3, 3]
     //       then output should be arr = [[3.3, 1.0], [2.5], [3.0, 2.0]]
     //     n_classes = len(arr) (num. of distinct label categories)
     //     n_samples_per_class = [len(a) for a in arr]
     //     n_samples = sum(n_samples_per_class)    (= num. of rows in feature 
column)
     //       e.g. in above example, n_classes = 3, n_samples_per_class = [2, 
1, 2], n_samples = 5
     //     ss_all = sum of squares of all in feature (e.g. 
3.3^2+2.5^2+1.0^2+3.0^2+2.0^2)
     //     sq_sum_all = square of sum of all data (e.g. 
(3.3+2.5+1.0+3.0+2.0)^2)
     //     sq_sum_classes = [sum(a) ** 2 for a in arr]  (e.g. [(3.3+1.0)^2, 
2.5^2, (3.0+2.0)^2])
     //     sstot = ss_all - (sq_sum_all / n_samples)
     //     ssbn = sum( sq_sum_classes[k] / n_samples_per_class[k] for k in 
range(n_classes)) - (sq_sum_all / n_samples)
     //       e.g. ((3.3+1.0)^2 / 2 + 2.5^2 / 1 + (3.0+2.0)^2 / 2) - sq_sum_all 
/ 5
     //     sswn = sstot - ssbn
     //     dfbn = n_classes - 1
     //     dfwn = n_samples - n_classes
     //     msb = ssbn / dfbn
     //     msw = sswn / dfwn
     //     f = msb / msw
     //     pvalue = 1 - FDistribution(dfbn, dfwn).cdf(f)
   ```
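   The pseudocode above translates almost line-for-line into Python; here is a runnable sketch of the statistic (illustration only, not the Scala code I'd put in the PR):
   
   ```python
   from collections import defaultdict

   def f_classif_stat(feature, labels):
       """One-way ANOVA F-value for one continuous feature vs. a
       categorical label, following the pseudocode above."""
       # separate feature values into groups by label
       groups = defaultdict(list)
       for x, y in zip(feature, labels):
           groups[y].append(x)
       arr = list(groups.values())

       n_classes = len(arr)
       n_samples_per_class = [len(a) for a in arr]
       n_samples = sum(n_samples_per_class)

       ss_all = sum(x * x for x in feature)          # sum of squares
       sq_sum_all = sum(feature) ** 2                # square of sum
       sq_sum_classes = [sum(a) ** 2 for a in arr]

       sstot = ss_all - sq_sum_all / n_samples
       ssbn = sum(s / n for s, n in
                  zip(sq_sum_classes, n_samples_per_class)) - sq_sum_all / n_samples
       sswn = sstot - ssbn

       dfbn = n_classes - 1
       dfwn = n_samples - n_classes
       f = (ssbn / dfbn) / (sswn / dfwn)
       return f, dfbn, dfwn

   # the example from the pseudocode
   f, dfbn, dfwn = f_classif_stat([3.3, 2.5, 1.0, 3.0, 2.0], [1, 2, 1, 3, 3])
   ```
   The p-value is then the upper tail of the F-distribution, i.e. ```1 - FDistribution(dfbn, dfwn).cdf(f)``` (in Python, ```scipy.stats.f.sf(f, dfbn, dfwn)```; in Spark, presumably commons-math's ```FDistribution```).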
   sklearn has both ```f_regression``` and ```f_classif```. Here are the links:
   
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression
   
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif
   
