FYI, this is my first take on feature selection, filtering, and chi-squared: https://github.com/apache/spark/pull/1484

-----Original Message-----
From: Ulanov, Alexander
Sent: Thursday, July 10, 2014 9:39 PM
To: dev@spark.apache.org
Subject: Feature selection interface

Hi,

I've implemented a class that does Chi-squared feature selection for RDD[LabeledPoint]. It also computes basic class/feature occurrence statistics, and other methods like mutual information or information gain can be easily implemented. I would like to make a pull request. However, the MLlib master branch doesn't have any feature selection methods implemented, so I need to create a proper interface that my class will extend or mix in. It should be easy to use from both the developer's and the user's perspective.

I was thinking that there should be a FeatureEvaluator that, for each feature from RDD[LabeledPoint], returns RDD[((featureIndex: Int, label: Double), value: Double)]. Then there should be a FeatureSelector that selects the top N features, or the top N features grouped by class, etc. And the simplest one, a FeatureFilter that filters the data based on a set of feature indices.

Additionally, there should be an interface for FeatureEvaluators that don't use class labels, i.e. for RDD[Vector].

I am concerned that such a design looks rather "disconnected" because there are 3 disconnected objects. In use, I would like to see something like "val filteredData = Filter(data, ChiSquared(data).selectTop(100))".

Any ideas or suggestions?

Best regards,
Alexander
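
To make the proposal above concrete, here is a rough sketch of how the three pieces (FeatureEvaluator, FeatureSelector, FeatureFilter) might fit together. It is not the code from the pull request and not an existing MLlib API. To keep it runnable without Spark, a plain Seq stands in for RDD and a local LabeledPoint case class stands in for MLlib's; the method names evaluate and selectTop, the Example object, and the choice to score each (feature, label) pair with a 2x2 chi-squared statistic on feature presence are all assumptions made for illustration.

```scala
// Stand-ins so the sketch runs without Spark (assumption): Seq plays the
// role of RDD, and this LabeledPoint only mimics the shape of MLlib's class.
case class LabeledPoint(label: Double, features: Array[Double])

// Scores every (featureIndex, label) pair, as described in the message.
trait FeatureEvaluator {
  def evaluate(data: Seq[LabeledPoint]): Seq[((Int, Double), Double)]
}

// Hypothetical evaluator: chi-squared on the 2x2 contingency table of
// (feature present / absent) x (label == c / label != c).
object ChiSquared extends FeatureEvaluator {
  def evaluate(data: Seq[LabeledPoint]): Seq[((Int, Double), Double)] = {
    val n = data.size.toDouble
    val numFeatures = if (data.isEmpty) 0 else data.head.features.length
    val labels = data.map(_.label).distinct
    for {
      j <- 0 until numFeatures
      c <- labels
    } yield {
      val a = data.count(p => p.features(j) != 0.0 && p.label == c).toDouble
      val b = data.count(p => p.features(j) != 0.0 && p.label != c).toDouble
      val cc = data.count(p => p.features(j) == 0.0 && p.label == c).toDouble
      val d = data.count(p => p.features(j) == 0.0 && p.label != c).toDouble
      val denom = (a + b) * (cc + d) * (a + cc) * (b + d)
      // A degenerate table (an empty row or column) carries no signal.
      val chi2 = if (denom == 0.0) 0.0 else n * math.pow(a * d - b * cc, 2) / denom
      ((j, c), chi2)
    }
  }
}

// Selects the top-n feature indices, ranking each feature by its best
// score over all labels.
object FeatureSelector {
  def selectTop(scores: Seq[((Int, Double), Double)], n: Int): Set[Int] =
    scores
      .groupBy { case ((feature, _), _) => feature }
      .map { case (feature, group) => (feature, group.map(_._2).max) }
      .toSeq
      .sortBy { case (_, score) => -score }
      .take(n)
      .map(_._1)
      .toSet
}

// Keeps only the given feature indices, in ascending index order.
object FeatureFilter {
  def apply(data: Seq[LabeledPoint], keep: Set[Int]): Seq[LabeledPoint] = {
    val indices = keep.toArray.sorted
    data.map(p => LabeledPoint(p.label, indices.map(i => p.features(i))))
  }
}

// Toy data: feature 0 is present exactly when the label is 1.0.
object Example {
  val data = Seq(
    LabeledPoint(1.0, Array(1.0, 0.0, 1.0)),
    LabeledPoint(1.0, Array(1.0, 1.0, 0.0)),
    LabeledPoint(0.0, Array(0.0, 1.0, 1.0)),
    LabeledPoint(0.0, Array(0.0, 0.0, 0.0))
  )
  val selected = FeatureSelector.selectTop(ChiSquared.evaluate(data), 1)
  val filteredData = FeatureFilter(data, selected)
}
```

With these pieces, the one-liner from the message, Filter(data, ChiSquared(data).selectTop(100)), could be a thin wrapper over the evaluator and selector; whether the selector should live on the evaluator or stay a separate object is exactly the design question raised above.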