I got very good results on text century dating using random forests on very few (20-ish) bag-of-words tf-idf features selected by chi2. It depends on the problem.
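A minimal sketch of that kind of setup, assuming a current scikit-learn: tf-idf bag-of-words, chi2 selection down to 20 features, then a random forest. The toy corpus, the labels and the name `K` are made up for illustration and are not the century-dating data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration only (0 = "old" text, 1 = "modern" text).
texts = [
    "thou art a knave and a villain",
    "hark the king doth ride at dawn",
    "thou hast spoken well my lord",
    "the train was late again today",
    "please email me the report today",
    "the phone battery died on the train",
] * 5
labels = [0, 0, 0, 1, 1, 1] * 5

K = 20  # "20-ish" features; an assumption, tune it for your own problem

# chi2 needs non-negative inputs, which tf-idf weights satisfy.
model = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=K),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

# Number of tf-idf columns that actually reach the forest.
n_selected = int(model.named_steps["selectkbest"].get_support().sum())
predictions = model.predict(texts)
```

Keeping the selector inside the pipeline means it is refit on each training fold if you cross-validate, so the selection does not see the held-out labels.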
Cheers,
Vlad

On Sat, Jun 1, 2013 at 9:01 PM, Andreas Mueller <amuel...@ais.uni-bonn.de> wrote:
> On 06/01/2013 08:30 PM, Christian Jauvin wrote:
>> Hi,
>>
>> I asked a (perhaps too vague?) question about the use of Random
>> Forests with a mix of categorical and lexical features on two ML
>> forums (stats.SE and MetaOp), but since it has received no attention,
>> I figured that it might work better on this list (I'm using sklearn's
>> RF of course):
>>
>> "I'm working on a binary classification problem for which the dataset
>> is mostly composed of categorical features, but also a few lexical
>> ones (i.e. article titles and abstracts). I'm experimenting with
>> Random Forests, and my current strategy is to build the training set
>> by appending the k best lexical features (chosen with univariate
>> feature selection, and weighted with tf-idf) to the full set of
>> categorical features. This works reasonably well, but as I cannot find
>> explicit references to such a strategy of using hybrid features for
>> RF, I have doubts about my approach: does it make sense? Am I
>> "diluting" the power of the RF by doing so, and should I rather try to
>> combine two classifiers specializing on both types of features?"
>>
> I think it is ok, though I think people rarely use RF on bag-of-word
> features.
> Btw, you do encode the categorical variables using one-hot, right?
> The sklearn trees don't really support categorical variables.
> An alternative approach would be to run a linear classifier on all tfidf
> features and feed the output together with the other variables to the RF.
>
> Hth,
> Andy
>
> ps: try stackoverflow with scikit-learn tag next time.
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
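
For reference, the two strategies discussed in the thread can be sketched roughly as below, assuming a current scikit-learn and SciPy. Option A appends the k best tf-idf features to one-hot encoded categorical features; option B runs a linear classifier on all tf-idf features and feeds its output to the forest together with the other variables. The toy data, the column meanings, `k = 10` and the choice of `LogisticRegression` as the linear classifier are all assumptions made for illustration, not something stated in the messages.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration: two categorical columns plus a title.
categorical = np.array([
    ["journal", "en"], ["journal", "fr"], ["conference", "en"],
    ["conference", "en"], ["journal", "en"], ["conference", "fr"],
] * 5, dtype=object)
titles = [
    "random forests for text classification",
    "feature selection with sparse text data",
    "a survey of medieval trade routes",
    "trade and taxation in medieval towns",
    "kernel methods for text categorization",
    "harvest records of medieval farms",
] * 5
y = np.array([1, 1, 0, 0, 1, 0] * 5)

# One-hot encode the categorical columns, since the trees expect numeric input.
X_cat = OneHotEncoder(handle_unknown="ignore").fit_transform(categorical)

# Full tf-idf matrix for the lexical column.
X_text = TfidfVectorizer().fit_transform(titles)

# Option A: append the k best tf-idf features (chi2) to the one-hot block.
# For an honest evaluation, fit the selector on training folds only.
k = 10
X_best = SelectKBest(chi2, k=k).fit_transform(X_text, y)
X_a = hstack([X_cat, X_best]).tocsr()
rf_a = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_a, y)

# Option B: a linear classifier sees all tf-idf features; its out-of-fold
# decision values become a single extra column for the forest.
text_scores = cross_val_predict(
    LogisticRegression(), X_text, y, cv=5, method="decision_function"
)
X_b = hstack([X_cat, csr_matrix(text_scores.reshape(-1, 1))]).tocsr()
rf_b = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_b, y)
```

Using out-of-fold scores in option B keeps the forest from training on text scores that were fit to the same labels; at prediction time the linear classifier would be refit on all training data and applied to the new titles.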