Hi again -

As part of the next step, I'd like to make a more substantive contribution
and propose some initial work on feature selection, primarily as it relates
to text classification.

Specifically, I'd like to contribute very straightforward code to perform
information gain feature evaluation. Below is a good primer showing that
information gain is a strong option in many cases. If that goes well, BNS
(introduced in the same paper) would be another approach worth looking into,
as it actually improves the F-measure while using a smaller feature space.

http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
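
For anyone not familiar with it, IG here is just the drop in class entropy
once you condition on the feature being present or absent. A toy version for
a single binary feature over two classes (names are just for illustration,
not from my commit):

  // 2x2 counts for one feature: n11 = present & class 1, n10 = present & class 0,
  // n01 = absent & class 1, n00 = absent & class 0.
  def infoGain(n11: Double, n10: Double, n01: Double, n00: Double): Double = {
    val n = n11 + n10 + n01 + n00
    def entropy(ps: Seq[Double]): Double =
      -ps.filter(_ > 0).map(p => p * math.log(p)).sum
    // Weighted class entropy within one branch (feature present or absent).
    def branch(a: Double, b: Double): Double =
      if (a + b == 0) 0.0
      else ((a + b) / n) * entropy(Seq(a / (a + b), b / (a + b)))
    // IG = H(C) - H(C | feature)
    entropy(Seq((n11 + n01) / n, (n10 + n00) / n)) - (branch(n11, n10) + branch(n01, n00))
  }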

And here's my first cut:
https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8

I don't like that I make two passes to compute the class priors and joint
distributions, so I'll look into using combineByKey as in the NaiveBayes
implementation. Also, this is still untested code, but it gets my ideas out
there, and I think it'd be best to define a FeatureEval trait or similar that
helps with ranking and selecting features (rough sketch below).
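
Roughly what I have in mind for the single pass and the trait, assuming the
data comes in as an RDD[(Double, Array[Double])] of (label, features) -- all
names and signatures here are just a sketch, nothing that exists in MLlib:

  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  // One pass with combineByKey, keyed by label: per class, count documents
  // and sum per-feature occurrence counts. Class priors fall out of the
  // document counts, and the summed feature counts give the joint distribution.
  def countsPerClass(data: RDD[(Double, Array[Double])]): RDD[(Double, (Long, Array[Double]))] =
    data.combineByKey(
      (v: Array[Double]) => (1L, v.clone()),
      (acc: (Long, Array[Double]), v: Array[Double]) => {
        var i = 0
        while (i < v.length) { acc._2(i) += v(i); i += 1 }
        (acc._1 + 1L, acc._2)
      },
      (a: (Long, Array[Double]), b: (Long, Array[Double])) => {
        var i = 0
        while (i < a._2.length) { a._2(i) += b._2(i); i += 1 }
        (a._1 + b._1, a._2)
      })

  trait FeatureEval {
    // One score per feature index; higher = more informative.
    def score(data: RDD[(Double, Array[Double])]): Array[Double]
    // Indices of the top-k features by score.
    def selectTop(data: RDD[(Double, Array[Double])], k: Int): Array[Int] =
      score(data).zipWithIndex.sortBy(-_._1).take(k).map(_._2)
  }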

I also realize the above methods are probably more suitable for MLI than
MLlib, but there doesn't seem to be much activity on the former.

Second, is there a plan to support sparse vector representations for
NaiveBayes? That would probably be more efficient in, for example, text
classification tasks with lots of features (consider the case where n-grams
with n > 1 are used).
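
To be concrete about what I mean by sparse -- just a hypothetical
(indices, values) pair of arrays, not anything MLlib has today:

  // Hypothetical sparse representation: parallel arrays of indices and values.
  case class SparseVector(size: Int, indices: Array[Int], values: Array[Double])

  // In the per-class log-likelihood sum, only the non-zero terms contribute,
  // so iterating indices/values skips the (mostly zero) n-gram dimensions.
  def logLikelihood(theta: Array[Double], v: SparseVector): Double = {
    var sum = 0.0
    var i = 0
    while (i < v.indices.length) {
      sum += theta(v.indices(i)) * v.values(i)
      i += 1
    }
    sum
  }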

And on a related note, MLUtils.loadLabeledData doesn't support loading
sparse data. Are there any plans to add that? There also doesn't seem to be a
defined file format for MLlib. Has there been any consideration of supporting
multiple standard formats rather than defining a new one, e.g. CSV, TSV,
Weka's ARFF, etc.?
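
As one concrete option, a sparse loader could accept something like the
LIBSVM/SVMlight text format ("label index:value index:value ..."). This
parser is only a sketch of the idea, not a proposed MLUtils API:

  // Parse one line of "label index:value index:value ..." into a label and
  // a sparse (index, value) list.
  def parseSparseLine(line: String): (Double, Array[(Int, Double)]) = {
    val parts = line.trim.split("\\s+")
    val label = parts.head.toDouble
    val features = parts.tail.map { t =>
      val Array(i, v) = t.split(":")
      (i.toInt, v.toDouble)
    }
    (label, features)
  }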

Thanks for your time,
Ignacio
