Hi, Joseph,

I thought I understood why there is a limit of 30 levels for a decision
tree, but now I'm not so sure. I assumed it was because the decision tree
is stored in an array whose length is an int, which cannot exceed 2^31 - 1.
But here is what I've found since. I trained two random forest models,
each with 50 trees but a different maxDepth (20 and 30), and specified a
node size of 5. Here are a couple of trees from each:

Model with maxDepth = 20:
    depth=20, numNodes=471
    depth=19, numNodes=497

Model with maxDepth = 30:
    depth=30, numNodes=11347
    depth=30, numNodes=10963

It looks like the trees are far from balanced, and I understand why that
happens, but I'm surprised that the actual number of nodes is so much
smaller than 2^31 - 1, and now I'm not sure why the limitation actually
exists. A tree of 2^31 nodes would need about 8 GB of memory just to store
the node indexes (2^31 IDs at 4 bytes each), so I'd say that depth isn't
the biggest issue in such a case.
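
For reference, here's the positional-ID arithmetic as I understand it from
your explanation below. This is only a sketch of the encoding (assuming
1-based IDs where the root is 1 and node id has children 2*id and
2*id + 1), not the actual MLlib code:

    object NodeIdLimit {
      // In a positional binary-tree encoding, node `id` has children
      // 2*id and 2*id + 1, so the IDs at depth d span [2^d, 2^(d+1) - 1].
      def main(args: Array[String]): Unit = {
        // Deepest level whose largest ID still fits in a signed 32-bit Int:
        val deepest = (0 to 62)
          .takeWhile(d => (BigInt(1) << (d + 1)) - 1 <= Int.MaxValue)
          .last
        println(s"Deepest level representable with Int IDs: $deepest") // 30
      }
    }

So the 30-level cap falls out of the ID encoding itself, not out of any
realistic node count.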

Is it possible to work around or bypass the maxDepth limitation (without
modifying the codebase) and train a tree until it hits the maximum number
of nodes? I'd assume that in most cases I simply wouldn't hit it, even
though the depth of the tree would be much more than 30.
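
In case it helps, this is roughly the kind of call I'm making (ml API in
the spark-shell, which predefines sqlContext; the libsvm reader and the
sample path are just for illustration and depend on the Spark version).
As far as I can tell, setting maxDepth above 30 is rejected by parameter
validation at training time:

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Any DataFrame with "label" and "features" columns would do here.
    val data = sqlContext.read.format("libsvm")
      .load("data/mllib/sample_libsvm_data.txt")

    val rf = new RandomForestClassifier()
      .setNumTrees(50)
      .setMaxDepth(30)            // 31+ fails validation, as far as I can tell
      .setMinInstancesPerNode(5)  // the "node size = 5" mentioned above
    val model = rf.fit(data)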


--
Be well!
Jean Morozov

On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com>
wrote:

> Hi Eugene,
>
> The maxDepth parameter exists because the implementation uses Integer node
> IDs which correspond to positions in the binary tree.  This simplified the
> implementation.  I'd like to eventually modify it to avoid depending on
> tree node IDs, but that is not yet on the roadmap.
>
> There is not an analogous limit for the GLMs you listed, but I'm not very
> familiar with the perceptron implementation.
>
> Joseph
>
> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> Hello!
>>
>> I'm currently working on a POC and trying to use Random Forest
>> (classification and regression). I also have to check SVM and Multiclass
>> perceptron (other algos are less important at the moment). So far I've
>> discovered that Random Forest has a maxDepth limitation for its trees,
>> and just out of curiosity I wonder why such a limitation was introduced.
>>
>> The actual question: I'm going to use Spark ML in production next year
>> and would like to know whether there are other limitations, like maxDepth
>> in RF, for the other algorithms: Logistic Regression, Perceptron, SVM, etc.
>>
>> Thanks in advance for your time.
>> --
>> Be well!
>> Jean Morozov
>>
>
>
