huaxingao commented on issue #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection URL: https://github.com/apache/spark/pull/27322#issuecomment-577334796

@srowen Hi Sean, I am thinking of adding selectors for continuous distribution features, and I want to ask your opinion before I go any further. I will also ask Ruifeng after the Chinese New Year holiday; I expect he is on vacation, so I don't want to ping him now.

Currently, Spark only supports selection of categorical features (```ChiSqSelector```). I am thinking of adding two new selectors for continuous distribution features:
1. ```FValueRegressionSelector``` for continuous features and continuous labels.
2. ```FValueClassificationSelector``` for continuous features and categorical labels.

Currently, this WIP PR only has ```FValueRegressionSelector``` implemented. ```FValueClassificationSelector``` is very similar, but the calculation of the classification F-value is a little more complicated. I wrote the pseudo code here along with an example:

```
// pseudo code:
// for each feature:
//   separate the feature values into an array of arrays by label (call it arr)
//   e.g. if the feature is [3.3, 2.5, 1.0, 3.0, 2.0] and the labels are [1, 2, 1, 3, 3],
//   then arr = [[3.3, 1.0], [2.5], [3.0, 2.0]]
//
//   n_classes = len(arr)                    (num. of distinct label categories)
//   n_samples_per_class = [len(a) for a in arr]
//   n_samples = sum(n_samples_per_class)    (= num. of rows in the feature column)
//   e.g. in the above example, n_classes = 3, n_samples_per_class = [2, 1, 2], n_samples = 5
//
//   ss_all = sum of squares of all feature values      (e.g. 3.3^2 + 2.5^2 + 1.0^2 + 3.0^2 + 2.0^2)
//   sq_sum_all = square of the sum of all feature values    (e.g. (3.3+2.5+1.0+3.0+2.0)^2)
//   sq_sum_classes = [sum(a) ** 2 for a in arr]        (e.g. [(3.3+1.0)^2, 2.5^2, (3.0+2.0)^2])
//
//   sstot = ss_all - (sq_sum_all / n_samples)
//   ssbn = sum(sq_sum_classes[k] / n_samples_per_class[k] for k in range(n_classes)) - (sq_sum_all / n_samples)
//   e.g. ((3.3+1.0)^2 / 2 + 2.5^2 / 1 + (3.0+2.0)^2 / 2) - sq_sum_all / 5
//   sswn = sstot - ssbn
//
//   dfbn = n_classes - 1
//   dfwn = n_samples - n_classes
//   msb = ssbn / dfbn
//   msw = sswn / dfwn
//   f = msb / msw
//   pvalue = 1 - FDistribution(dfbn, dfwn).cdf(f)
```

sklearn has both f_regression and f_classif. Here are the links:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif
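For reference, the classification F-value pseudo code above can be sketched as a self-contained Python function for a single feature column. This is just a minimal illustration of the math, not the Spark implementation; the function name `f_classif_single` is made up, and the p-value step is left as a comment because it needs an F-distribution CDF (e.g. `scipy.stats.f`):

```python
from collections import defaultdict

def f_classif_single(feature, labels):
    """One-way ANOVA F-value for one continuous feature vs. categorical labels.

    A minimal sketch of the pseudo code above, not the actual Spark code.
    """
    # separate the feature values into groups by label
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[y].append(x)
    arr = list(groups.values())

    n_classes = len(arr)                          # num. of distinct label categories
    n_samples_per_class = [len(a) for a in arr]
    n_samples = sum(n_samples_per_class)          # num. of rows in the feature column

    ss_all = sum(x * x for x in feature)          # sum of squares of all values
    sq_sum_all = sum(feature) ** 2                # square of the sum of all values
    sq_sum_classes = [sum(a) ** 2 for a in arr]

    sstot = ss_all - sq_sum_all / n_samples       # total sum of squares
    ssbn = sum(sq_sum_classes[k] / n_samples_per_class[k]
               for k in range(n_classes)) - sq_sum_all / n_samples  # between classes
    sswn = sstot - ssbn                           # within classes

    dfbn = n_classes - 1
    dfwn = n_samples - n_classes
    msb = ssbn / dfbn
    msw = sswn / dfwn
    f = msb / msw
    # pvalue = 1 - FDistribution(dfbn, dfwn).cdf(f),
    # i.e. scipy.stats.f.sf(f, dfbn, dfwn)
    return f

# the example from the pseudo code: f ≈ 0.0467
print(f_classif_single([3.3, 2.5, 1.0, 3.0, 2.0], [1, 2, 1, 3, 3]))
```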
