Hi Christian,

Some time ago I had a similar problem: I wanted to add extra features to 
my lexical features, and simple concatenation didn't work that well for 
me, even though both feature sets performed pretty well on their own.

You can follow the discussion about my problem here [1] if you scroll 
down (ignore the start of the thread). The best solution I ended up with 
was the one suggested by Olivier. You basically train a linear classifier 
on your lexical features and then use its predict_proba output together 
with your additional categorical features to train a second classifier, 
for example a random forest. It also helped to compute those 
probabilities with leave-one-out (if you have few samples), so that each 
sample's probability comes from a model that was not trained on it.
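
In case it helps, here is a rough sketch of how I would write that setup 
with a current scikit-learn. It is not the exact code from the thread: the 
toy texts, labels, the integer-coded categorical columns and the choice of 
LogisticRegression as the linear classifier are all made up for 
illustration, and I use cross_val_predict with LeaveOneOut to get the 
out-of-sample probabilities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy data, invented purely for illustration.
texts = [
    "random forest feature selection",
    "gradient boosting decision trees",
    "support vector machine kernels",
    "linear model regularization",
    "protein folding in yeast cells",
    "gene expression in mouse tissue",
    "cell membrane transport proteins",
    "dna sequencing of plant genomes",
]
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Two hypothetical categorical features per sample, integer-coded.
# (The forest treats these codes as ordered values; one-hot encoding
# is an alternative.)
X_cat = np.array(
    [[0, 1], [1, 1], [0, 2], [1, 0], [2, 0], [2, 1], [0, 0], [2, 2]]
)

# Stage 1: linear classifier on the lexical (tf-idf) features.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Leave-one-out: each row of `proba` is predicted by a model that
# never saw that sample during fitting.
proba = cross_val_predict(
    text_clf, texts, y, cv=LeaveOneOut(), method="predict_proba"
)

# Stage 2: categorical features plus the probability of class 1.
X_stacked = np.hstack([X_cat, proba[:, [1]]])
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_stacked, y)

# For new data: refit stage 1 on all training texts, then reuse it.
text_clf.fit(texts, y)
new_texts = ["decision trees and kernels"]
new_cat = np.array([[1, 2]])
new_stacked = np.hstack([new_cat, text_clf.predict_proba(new_texts)[:, [1]]])
prediction = forest.predict(new_stacked)
```

With more than a handful of samples, leave-one-out gets slow, and a 
k-fold splitter (e.g. cv=5) should serve the same purpose.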

[1] 
http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5F2BJ_ms51a-61HwmNrAyRTb1W0KK7ziBPzGAcdiBRqQ%40mail.gmail.com&forum_name=scikit-learn-general

If you find out anything else, let us know ;)

Regards,
Philipp

On 01.06.2013 20:30, Christian Jauvin wrote:
> Hi,
>
> I asked a (perhaps too vague?) question about the use of Random
> Forests with a mix of categorical and lexical features on two ML
> forums (stats.SE and MetaOp), but since it has received no attention,
> I figured that it might work better on this list (I'm using sklearn's
> RF of course):
>
> "I'm working on a binary classification problem for which the dataset
> is mostly composed of categorical features, but also a few lexical
> ones (i.e. article titles and abstracts). I'm experimenting with
> Random Forests, and my current strategy is to build the training set
> by appending the k best lexical features (chosen with univariate
> feature selection, and weighted with tf-idf) to the full set of
> categorical features. This works reasonably well, but as I cannot find
> explicit references to such a strategy of using hybrid features for
> RF, I have doubts about my approach: does it make sense? Am I
> "diluting" the power of the RF by doing so, and should I rather try to
> combine two classifiers specializing on both types of features?"
>
> http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-and-lexical-features
>
> Thanks,
>
> Christian
>
> ------------------------------------------------------------------------------
> Get 100% visibility into Java/.NET code with AppDynamics Lite
> It's a free troubleshooting tool designed for production
> Get down to code-level detail for bottlenecks, with <2% overhead.
> Download for free and get started troubleshooting in minutes.
> http://p.sf.net/sfu/appdyn_d2d_ap2
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

