I got very good results dating texts by century using random forests on
very few (20-ish) bag-of-words tf-idf features selected by chi2. So RF on
bag-of-words features can work well; it depends on the problem.
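
A minimal sketch of that kind of setup, assuming scikit-learn's
TfidfVectorizer, SelectKBest with chi2, and RandomForestClassifier chained in
a pipeline. The toy corpus, the labels, and k=5 are invented for illustration
(the post mentions roughly 20 features); this is not the original experiment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical): three texts per "century".
docs = [
    "thou art a knave and a villain",
    "hark what light doth shine",
    "thou dost speak in riddles",
    "the telegraph wire hums tonight",
    "the steam engine left the station",
    "a telegraph office near the railway",
]
labels = [17, 17, 17, 19, 19, 19]

model = make_pipeline(
    TfidfVectorizer(),        # bag-of-words counts weighted by tf-idf
    SelectKBest(chi2, k=5),   # keep only a handful of terms (about 20 in the post)
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(docs, labels)
predictions = model.predict(docs)
```

Keeping the selector inside the pipeline means that, under cross-validation,
the chi2 selection is refit on each training fold only.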

Cheers,
Vlad

On Sat, Jun 1, 2013 at 9:01 PM, Andreas Mueller
<amuel...@ais.uni-bonn.de> wrote:
> On 06/01/2013 08:30 PM, Christian Jauvin wrote:
>> Hi,
>>
>> I asked a (perhaps too vague?) question about the use of Random
>> Forests with a mix of categorical and lexical features on two ML
>> forums (stats.SE and MetaOp), but since it has received no attention,
>> I figured that it might work better on this list (I'm using sklearn's
>> RF of course):
>>
>> "I'm working on a binary classification problem for which the dataset
>> is mostly composed of categorical features, but also a few lexical
>> ones (i.e. article titles and abstracts). I'm experimenting with
>> Random Forests, and my current strategy is to build the training set
>> by appending the k best lexical features (chosen with univariate
>> feature selection, and weighted with tf-idf) to the full set of
>> categorical features. This works reasonably well, but as I cannot find
>> explicit references to such a strategy of using hybrid features for
>> RF, I have doubts about my approach: does it make sense? Am I
>> "diluting" the power of the RF by doing so, and should I rather try to
>> combine two classifiers specializing on both types of features?"
>>
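
The strategy described in the question could be sketched roughly as follows,
assuming the categorical columns are one-hot encoded and the k best tf-idf
terms are appended as extra columns before fitting the forest. The data, the
variable names (categorical, titles, lexical), and k=4 are hypothetical.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns plus one title per row.
categorical = np.array([
    ["nature", "review"],
    ["science", "letter"],
    ["nature", "letter"],
    ["science", "review"],
    ["nature", "review"],
    ["science", "letter"],
], dtype=object)
titles = [
    "random forest ensembles for text",
    "kernel methods for graphs",
    "forest based feature ranking",
    "support vector kernel tricks",
    "growing a random forest quickly",
    "string kernel classifiers",
]
y = np.array([1, 0, 1, 0, 1, 0])

# One-hot encode the full set of categorical features.
onehot = OneHotEncoder(handle_unknown="ignore")
X_cat = onehot.fit_transform(categorical)

# tf-idf weighting followed by univariate selection of the k best terms.
lexical = make_pipeline(TfidfVectorizer(), SelectKBest(chi2, k=4))
X_text = lexical.fit_transform(titles, y)

# Append the selected lexical columns to the categorical columns.
X = hstack([X_cat, X_text]).tocsr()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
```

New rows would go through onehot.transform and lexical.transform before being
stacked the same way.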
> I think it is OK, though I think people rarely use RF on bag-of-words
> features.
> Btw, you do encode the categorical variables using one-hot, right?
> The sklearn trees don't really support categorical variables.
> An alternative approach would be to run a linear classifier on all tf-idf
> features and feed its output, together with the other variables, to the RF.
>
> Hth,
> Andy
>
> PS: try Stack Overflow with the scikit-learn tag next time.
>
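
The alternative suggested in the reply above (a linear classifier on all
tf-idf features, whose output is fed to the RF together with the other
variables) might look like the sketch below. Using LogisticRegression as the
linear model and out-of-fold scores from cross_val_predict are assumptions
made here, not details from the thread; the data is a hypothetical toy set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns plus one abstract per row.
categorical = np.array([
    ["nature", "review"],
    ["science", "letter"],
    ["nature", "letter"],
    ["science", "review"],
    ["nature", "review"],
    ["science", "letter"],
], dtype=object)
texts = [
    "random forest ensembles for text",
    "kernel methods for graphs",
    "forest based feature ranking",
    "support vector kernel tricks",
    "growing a random forest quickly",
    "string kernel classifiers",
]
y = np.array([1, 0, 1, 0, 1, 0])

# Linear classifier on ALL tf-idf features (no feature selection).
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Out-of-fold decision scores, so the forest is not trained on scores
# produced by a model that already saw the same rows.
text_score = cross_val_predict(
    text_clf, texts, y, cv=3, method="decision_function"
)

# One-hot categorical columns plus a single column of text scores.
X_cat = OneHotEncoder(handle_unknown="ignore").fit_transform(categorical)
X = np.column_stack([X_cat.toarray(), text_score])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
```

At prediction time the text classifier would be refit on all training texts
and its decision_function output used as the extra column.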
> ------------------------------------------------------------------------------
> Get 100% visibility into Java/.NET code with AppDynamics Lite
> It's a free troubleshooting tool designed for production
> Get down to code-level detail for bottlenecks, with <2% overhead.
> Download for free and get started troubleshooting in minutes.
> http://p.sf.net/sfu/appdyn_d2d_ap2
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
