Hi,

Regarding feature selection techniques, I'm implementing some iterative
algorithms based on a paper by Gavin Brown et al. [1]. In that paper, the
authors propose a common framework for many information-theoretic criteria,
namely those built from relevance (the mutual information between a feature
and the label, i.e., Information Gain), redundancy, and conditional
redundancy. The latter two are interpreted differently depending on the
criterion, but all of them combine the mutual information between the
candidate feature and the already selected features with that same mutual
information conditioned on the label.
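
To make that concrete, the unified criterion in [1] scores a candidate
feature Xk against the selected set S as
J(Xk) = I(Xk; Y) - beta * sum_{j in S} I(Xk; Xj) + gamma * sum_{j in S} I(Xk; Xj | Y),
and particular choices of beta and gamma recover criteria such as MIM, mRMR
and JMI. A minimal sketch of that scoring step (mi, miWithLabel and cmi are
hypothetical helpers for the estimated mutual information, not existing
Spark APIs):

    def score(k: Int, selected: Seq[Int], beta: Double, gamma: Double,
              miWithLabel: Int => Double,   // I(Xk; Y): relevance
              mi: (Int, Int) => Double,     // I(Xk; Xj): redundancy
              cmi: (Int, Int) => Double     // I(Xk; Xj | Y): conditional redundancy
             ): Double = {
      val relevance = miWithLabel(k)
      val redundancy = selected.map(j => mi(k, j)).sum
      val condRedundancy = selected.map(j => cmi(k, j)).sum
      relevance - beta * redundancy + gamma * condRedundancy
    }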

I think we should have a common interface for plugging in different Feature
Selection techniques. I already have the algorithm implemented, but I still
have to test it. Right now I'm working on the design. Next week I can share
a proposal with you, so we can work together to bring Feature Selection to
Spark.
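
As a very rough starting point for that interface (just a sketch; the names
are placeholders, not a concrete proposal yet):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Hypothetical common interface: each criterion only implements rankFeatures.
    trait FeatureSelector {
      /** Rank feature indices by their criterion score, best first. */
      def rankFeatures(data: RDD[LabeledPoint]): Seq[(Int, Double)]

      /** Keep the indices of the k best-ranked features. */
      def selectFeatures(data: RDD[LabeledPoint], k: Int): Array[Int] =
        rankFeatures(data).take(k).map(_._1).toArray
    }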

[1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
likelihood maximisation: a unifying framework for information theoretic
feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.

---
Héctor


On Fri, Apr 11, 2014 at 5:20 AM, Xiangrui Meng <men...@gmail.com> wrote:

> Hi Ignacio,
>
> Please create a JIRA and send a PR for the information gain
> computation, so it is easy to track the progress.
>
> The sparse vector support for NaiveBayes is already implemented in
> branch-1.0 and master. You only need to provide an RDD of sparse
> vectors (created from Vectors.sparse).
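>
> For example (just a sketch, assuming an existing SparkContext sc):
>
>     import org.apache.spark.mllib.linalg.Vectors
>     import org.apache.spark.mllib.regression.LabeledPoint
>
>     // a 10-dimensional point with nonzeros at indices 1 and 4
>     val p = LabeledPoint(1.0, Vectors.sparse(10, Array(1, 4), Array(3.0, 7.0)))
>     val training = sc.parallelize(Seq(p))  // RDD[LabeledPoint] for NaiveBayes.train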
>
> MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
>
> Best,
> Xiangrui
>
> On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
> <ignacio.zendejas...@gmail.com> wrote:
> > Hi, again -
> >
> > As part of the next step, I'd like to make a more substantive
> > contribution and propose some initial work on feature selection,
> > primarily as it relates to text classification.
> >
> > Specifically, I'd like to contribute very straightforward code to perform
> > information gain feature evaluation. Below is a good primer that shows
> > that Information Gain is a very good option in many cases. If successful,
> > BNS (introduced in the paper) would be another approach worth looking
> > into, as it actually improves the F-score with a smaller feature space.
> >
> > http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
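> >
> > For reference, IG here is just the mutual information between a
> > term-presence indicator X and the label Y, I(X; Y) = H(Y) - H(Y | X).
> > A minimal standalone sketch of that computation from a joint
> > probability table (a hypothetical helper, independent of the commit below):
> >
> >     def informationGain(joint: Map[(Int, Int), Double]): Double = {
> >       def log2(d: Double) = math.log(d) / math.log(2)
> >       val pX = joint.groupBy(_._1._1).mapValues(_.values.sum)  // p(x)
> >       val pY = joint.groupBy(_._1._2).mapValues(_.values.sum)  // p(y)
> >       joint.collect { case ((x, y), pxy) if pxy > 0 =>
> >         pxy * log2(pxy / (pX(x) * pY(y)))
> >       }.sum
> >     }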
> >
> > And here's my first cut:
> >
> > https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
> >
> > I don't like that I do two passes to compute the class priors and joint
> > distributions, so I'll look into using combineByKey as in the NaiveBayes
> > implementation. Also, this is still untested code, but it gets my ideas
> > out there, and I think it'd be best to define a FeatureEval trait or
> > whatnot that helps with ranking and selecting.
> >
> > I also realize the above methods are probably more suitable for MLI than
> > MLlib, but there doesn't seem to be much activity on the former.
> >
> > Second, is there a plan to support sparse vector representations for
> > NaiveBayes? This will probably be more efficient in, for example, text
> > classification tasks with lots of features (consider the case where
> > n-grams with n > 1 are used).
> >
> > And on a related note, MLUtils.loadLabeledData doesn't support loading
> > sparse data. Any plans here to do so? There also doesn't seem to be a
> > defined file format for MLlib. Has there been any consideration to
> > support multiple standard formats rather than defining one, e.g., CSV,
> > TSV, Weka's ARFF, etc.?
> >
> > Thanks for your time,
> > Ignacio
>
