Hi again,

As a next step, I'd like to make a more substantive contribution and propose some initial work on feature selection, primarily as it relates to text classification.
Specifically, I'd like to contribute some very straightforward code to perform information gain feature evaluation. The paper below is a good primer; it shows that information gain is a very good option in many cases. If this goes well, BNS (Bi-Normal Separation, introduced in the paper) would be another approach worth looking into, as it actually improves the F-score with a smaller feature space.

http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf

And here's my first cut:

https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8

I don't like that I make two passes over the data to compute the class priors and the joint distributions, so I'll look into using combineByKey as in the NaiveBayes implementation. This code is still untested, but it gets my ideas out there, and I think it'd be best to define a FeatureEval trait or similar that helps with ranking and selecting features. I also realize these methods are probably more suitable for MLI than MLlib, but there doesn't seem to be much activity on the former.

Second, is there a plan to support sparse vector representations for NaiveBayes? This would probably be more efficient in, for example, text classification tasks with lots of features (consider the case where n-grams with n > 1 are used).

On a related note, MLUtils.loadLabeledData doesn't support loading sparse data. Are there any plans to add that? There also doesn't seem to be a defined file format for MLlib. Has there been any consideration of supporting multiple standard formats (e.g., CSV, TSV, Weka's ARFF) rather than defining a new one?

Thanks for your time,
Ignacio
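
P.S. To make the idea concrete, here is a minimal standalone sketch of information gain for a single binary (present/absent) feature, computed from per-class document counts. It is plain Scala with no Spark dependency, and it is not the code in the commit linked above. The names InfoGainSketch, entropy, and infoGain are purely illustrative, and it assumes the counts have already been aggregated.

```scala
object InfoGainSketch {
  // Shannon entropy (in bits) of a distribution given as raw counts.
  // An empty distribution (all counts zero) is treated as having entropy 0.
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else counts.filter(_ > 0.0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2.0)
    }.sum
  }

  // presentByClass(i): number of documents of class i that contain the feature.
  // totalByClass(i):   number of documents of class i.
  // Returns H(C) - [P(t) * H(C | t) + P(not t) * H(C | not t)].
  def infoGain(presentByClass: Seq[Double], totalByClass: Seq[Double]): Double = {
    require(presentByClass.length == totalByClass.length,
      "both count vectors must have one entry per class")
    val absentByClass = totalByClass.zip(presentByClass).map { case (t, p) => t - p }
    val n = totalByClass.sum
    val nPresent = presentByClass.sum
    val nAbsent = n - nPresent
    if (n == 0.0) 0.0
    else entropy(totalByClass) -
      (nPresent / n) * entropy(presentByClass) -
      (nAbsent / n) * entropy(absentByClass)
  }
}
```

With two balanced classes of 10 documents each, a feature that appears in every document of one class and in none of the other, infoGain(Seq(10.0, 0.0), Seq(10.0, 10.0)), scores 1 bit, while one that appears in half of each class, infoGain(Seq(5.0, 5.0), Seq(10.0, 10.0)), scores 0. Ranking features would then just be a matter of sorting by this score.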