Liaw, Andy wrote:
From: Uwe Ligges
WeiWei Shi wrote:
Hi, Andy:
Thanks. It works after I removed the variable. I think I ran into a similar problem when I used randomForest, and I am not sure whether both were due to the same cause.
Unfortunately, in practice that variable is very important to the accuracy. I am wondering if there is another way besides collapsing its levels. BTW, I remember you mentioned an alternative implementation of randomForest (provided by the author) that avoids the upper limit (32, if I am correct) on the number of factor levels that the R version of randomForest can handle.
Thanks for further assistance!
So you *really* want it to be factor?! Thought it was a mistake not to have it numerical....
Amazing! Maybe computers are sometimes even too fast these days.
Uwe
[Uwe: Not sure if you meant to keep this off-list. If so, my most sincere apologies.]
Andy, *you* do not need to apologize (yes, I meant to keep it off list, but WeiWei Shi posted it anyway).
Er... not really. Currently, (classification) randomForest encodes a split on a categorical variable as the binary expansion of the levels that go to the left. Such a split is stored in a (4-byte) integer, hence the 32-level restriction. In the newer version of Breiman & Cutler's Fortran code, that restriction is removed by storing the entire indicator matrix (number of nodes by maximum number of levels, then by number of trees in the forest). The stand-alone Fortran writes each tree to file as soon as it's grown, so it doesn't need to hold the entire forest in memory. The R version has no such luxury (if you can call it that).
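To make the 32-level limit concrete, here is a minimal Python sketch of the encoding idea described above. It is not the actual randomForest source; the function names `encode_split` and `goes_left` are hypothetical and chosen only for illustration. Each level gets one bit, and a set bit means "this level goes to the left child", so a 32-bit word can describe at most 32 levels.

```python
WORD_BITS = 32  # a 4-byte integer has 32 bits, one per factor level


def encode_split(levels_left):
    """Pack the 0-based level indices that go left into one integer bitmask."""
    mask = 0
    for level in levels_left:
        if not 0 <= level < WORD_BITS:
            raise ValueError("level %d does not fit in a 32-bit word" % level)
        mask |= 1 << level  # set the bit belonging to this level
    return mask


def goes_left(mask, level):
    """Return True if the given level is sent to the left child."""
    return (mask >> level) & 1 == 1


# Levels 0, 3 and 5 go left: bits 0, 3 and 5 are set (1 + 8 + 32).
mask = encode_split([0, 3, 5])
print(mask, goes_left(mask, 3), goes_left(mask, 1))
```

A factor with 88 levels would need 88 bits per split, which is why such a variable cannot be represented under this scheme.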
The way the new RF Fortran code deals with categorical variables with more than 10 categories is by randomly sampling some number (say 512) of splits and picking the best among them. That's probably a good strategy for random forests, but may not be what one would do to grow a single tree.
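The random-sampling strategy can be sketched roughly as follows. This is an illustration under stated assumptions, not the Fortran implementation: the Gini criterion, the way each candidate subset is drawn (every level goes left with probability 1/2), and the names `gini` and `best_random_split` are all assumptions made for the example.

```python
import random


def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def best_random_split(x, y, n_levels, n_classes, n_candidates=512, seed=0):
    """Try n_candidates random level subsets; keep the one with lowest impurity.

    x holds 0-based level indices, y holds 0-based class labels.
    Returns (levels going left, weighted child impurity).
    """
    rng = random.Random(seed)
    best_left, best_score = None, float("inf")
    for _ in range(n_candidates):
        # Assumed sampling scheme: each level goes left with probability 1/2.
        left = frozenset(l for l in range(n_levels) if rng.random() < 0.5)
        if not left or len(left) == n_levels:
            continue  # skip degenerate splits that leave one child empty
        left_counts = [0] * n_classes
        right_counts = [0] * n_classes
        for xi, yi in zip(x, y):
            (left_counts if xi in left else right_counts)[yi] += 1
        score = (sum(left_counts) * gini(left_counts)
                 + sum(right_counts) * gini(right_counts)) / len(x)
        if score < best_score:
            best_left, best_score = left, score
    return best_left, best_score


# Toy data: levels 0 and 1 always belong to class 0, levels 2 and 3 to class 1.
x = [0, 1, 2, 3] * 5
y = [0, 0, 1, 1] * 5
left, score = best_random_split(x, y, n_levels=4, n_classes=2)
print(sorted(left), score)
```

The cost is fixed at `n_candidates` evaluations per variable regardless of how many levels the factor has, at the price of possibly missing the best split.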
When growing a single tree on data containing categorical variables with a large number of categories, one should also be mindful that, because of the greedy nature of the algorithm, it will tend to split on variables with larger numbers of possible splits, even if those variables are less `informative'.
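A quick count shows how lopsided the comparison can be. The sketch below assumes binary splits, an unordered factor, and a numeric variable split at thresholds between distinct values; the function names are hypothetical. An unordered factor with k levels can be divided into two non-empty groups in 2^(k-1) - 1 ways, while a numeric variable with m distinct values offers only m - 1 thresholds.

```python
def n_categorical_splits(k):
    """Ways to divide k unordered levels into two non-empty groups."""
    return 2 ** (k - 1) - 1


def n_numeric_splits(m):
    """Thresholds available between m distinct sorted numeric values."""
    return m - 1


# Compare a factor and a numeric variable with the same number of values.
for k in (4, 10, 32, 88):
    print(k, n_categorical_splits(k), n_numeric_splits(k))
```

With that many candidate splits to search over, a greedy search is far more likely to find one that fits the training data well by chance alone.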
Andy
Certainly you are right - I don't know all those details about RandomForests, but the point I tried to make is different:
Be careful not to be called a professional overfitter: a variable named "V141", and at least one of those variables is a factor with 88 levels...!!!
Uwe
______________________________________________
R-help mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
