Hi Christian,

Some time ago I had a similar problem: I wanted to add further features to my lexical features, and simple concatenation didn't work that well for me, even though both feature sets performed pretty well on their own.
You can follow the discussion about my problem here [1] if you scroll down; ignore the discussion at the start. The best solution I ended up with was the one suggested by Olivier. You basically train a linear classifier on your lexical features, then use its predict_proba output together with your additional categorical features to train a second classifier, for example a random forest. It also helped to use leave-one-out when computing the probabilities (if you have few samples), so that each training sample's probability comes from a model that was not fitted on that sample.

[1] http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5F2BJ_ms51a-61HwmNrAyRTb1W0KK7ziBPzGAcdiBRqQ%40mail.gmail.com&forum_name=scikit-learn-general

If you find out anything else, let us know ;)

Regards,
Philipp

On 01.06.2013 20:30, Christian Jauvin wrote:
> Hi,
>
> I asked a (perhaps too vague?) question about the use of Random
> Forests with a mix of categorical and lexical features on two ML
> forums (stats.SE and MetaOp), but since it has received no attention,
> I figured that it might work better on this list (I'm using sklearn's
> RF of course):
>
> "I'm working on a binary classification problem for which the dataset
> is mostly composed of categorical features, but also a few lexical
> ones (i.e. article titles and abstracts). I'm experimenting with
> Random Forests, and my current strategy is to build the training set
> by appending the k best lexical features (chosen with univariate
> feature selection, and weighted with tf-idf) to the full set of
> categorical features. This works reasonably well, but as I cannot find
> explicit references to such a strategy of using hybrid features for
> RF, I have doubts about my approach: does it make sense? Am I
> "diluting" the power of the RF by doing so, and should I rather try to
> combine two classifiers specializing on both types of features?"
>
> http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-and-lexical-features
>
> Thanks,
>
> Christian
>
> ------------------------------------------------------------------------------
> Get 100% visibility into Java/.NET code with AppDynamics Lite
> It's a free troubleshooting tool designed for production
> Get down to code-level detail for bottlenecks, with <2% overhead.
> Download for free and get started troubleshooting in minutes.
> http://p.sf.net/sfu/appdyn_d2d_ap2
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
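
P.S. A minimal sketch of the two-stage setup described in the reply above (linear classifier on the lexical features, then its predict_proba output plus the categorical features fed to a random forest). It assumes a recent scikit-learn where cross_val_predict and LeaveOneOut live in sklearn.model_selection. The toy texts, the integer-coded X_cat columns, and all variable names are made up for illustration, and LogisticRegression is just one possible choice of linear classifier, not necessarily the one used in the linked thread.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: one short text, two integer-coded categorical
# columns, and a binary label per sample.
texts = [
    "random forest feature selection",
    "forest ensemble of decision trees",
    "tree based ensemble classifier",
    "decision trees and random forest",
    "cooking pasta with tomato sauce",
    "tomato soup recipe with basil",
    "baking bread recipe at home",
    "pasta recipe with fresh basil",
]
X_cat = np.array([[0, 1], [1, 1], [0, 2], [1, 0],
                  [2, 0], [2, 1], [0, 0], [2, 2]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Stage 1: linear classifier on tf-idf weighted lexical features.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Leave-one-out: each sample's probability comes from a model that was
# fitted on all the other samples, never on the sample itself.
proba = cross_val_predict(
    text_clf, texts, y, cv=LeaveOneOut(), method="predict_proba"
)

# Stage 2: positive-class probability + categorical features -> random forest.
X_stage2 = np.hstack([proba[:, [1]], X_cat])
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_stage2, y)

# Prediction on new data: refit the text model on all training texts,
# then stack its probability with the new categorical features.
text_clf.fit(texts, y)
new_texts = ["ensemble of random trees", "basil pasta sauce"]
new_cat = np.array([[1, 1], [2, 0]])
new_proba = text_clf.predict_proba(new_texts)[:, [1]]
pred = rf.predict(np.hstack([new_proba, new_cat]))
print(pred)
```

With many samples, a k-fold split (e.g. cv=5) would be a cheaper substitute for leave-one-out. Whether the categorical columns should be one-hot encoded rather than integer-coded depends on the data.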