Hi Joseph,

I thought I understood why there is a limit of 30 levels for a decision tree, but now I'm not so sure. I thought it was because the decision tree is stored in an array whose length is an int, and therefore cannot exceed 2^31 - 1.
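To spell out that assumption, here is a minimal sketch (the object and helper names are my own, for illustration only, not the actual MLlib internals) of the heap-style numbering I had in mind: the root is position 1, and node i has children 2i and 2i+1, so IDs grow with position in the tree rather than with the number of nodes that actually exist:

object NodeIdLimit {
  // 1-based heap-style positions in a binary tree:
  // the root is 1, and node i has children 2*i and 2*i + 1.
  def leftChildIndex(i: Long): Long  = 2 * i
  def rightChildIndex(i: Long): Long = 2 * i + 1

  def main(args: Array[String]): Unit = {
    for (d <- Seq(20, 29, 30, 31)) {
      // Walk the right-most path d levels down from the root:
      // the position after d steps is 2^(d + 1) - 1, balanced tree or not.
      var id = 1L
      for (_ <- 0 until d) id = rightChildIndex(id)
      println(f"depth $d%2d: max node ID = $id%,d, fits in Int: ${id <= Int.MaxValue}")
    }
  }
}

Depth 30 needs IDs up to 2^31 - 1 = Int.MaxValue, so it is the last depth whose positions still fit in a signed Int; depth 31 would already overflow.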
But here are my new discoveries. I've trained two random forest models of 50 trees each, with different maxDepth (20 and 30) and node size = 5. Here are a couple of those trees:

Model with maxDepth = 20:
  depth=20, numNodes=471
  depth=19, numNodes=497

Model with maxDepth = 30:
  depth=30, numNodes=11347
  depth=30, numNodes=10963

It looks like the trees are not well balanced, and I understand why that happens, but I'm surprised that the actual number of nodes is far less than 2^31 - 1, and now I'm not sure why the limitation actually exists. A tree of 2^31 nodes would require 8 GB of memory just to store the indexes (2^31 four-byte ints), so I'd say that depth isn't the biggest issue in such a case. Is it possible to work around or simply ignore the maxDepth limitation (without modifying the codebase) and train the tree until I hit the maximum number of nodes? I'd assume that in most cases I simply won't hit it, but the depth of the tree would be much greater than 30.

--
Be well!
Jean Morozov

On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com> wrote:

> Hi Eugene,
>
> The maxDepth parameter exists because the implementation uses Integer node
> IDs which correspond to positions in the binary tree. This simplified the
> implementation. I'd like to eventually modify it to avoid depending on
> tree node IDs, but that is not yet on the roadmap.
>
> There is not an analogous limit for the GLMs you listed, but I'm not very
> familiar with the perceptron implementation.
>
> Joseph
>
> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>
>> Hello!
>>
>> I'm currently working on a POC and trying to use Random Forest
>> (classification and regression). I also have to check SVM and multiclass
>> perceptron (other algos are less important at the moment). So far I've
>> discovered that Random Forest has a maxDepth limitation for its trees, and
>> just out of curiosity I wonder why such a limitation was introduced.
>>
>> The actual question is that I'm going to use Spark ML in production next
>> year and would like to know whether there are other limitations like
>> maxDepth in RF for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>>
>> Thanks in advance for your time.
>> --
>> Be well!
>> Jean Morozov