[ https://issues.apache.org/jira/browse/SPARK-14043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206841#comment-15206841 ]

Eugene Morozov edited comment on SPARK-14043 at 3/23/16 7:28 AM:
-----------------------------------------------------------------

I have a couple of ideas to mitigate the issue:
- introduce Array64 (an int[][] that allows arrays longer than Int.MaxValue) 
or a List, though the downside is that it would require a lot of memory just 
to store those indices; a rough sketch follows this bullet.
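A minimal sketch of the chunked-array idea (the name Array64 and its API here 
are hypothetical, not existing Spark code): each chunk stays within the JVM's 
Int-indexed array limit, while the wrapper exposes Long indexes.

{code:scala}
// Hypothetical Array64: a Long-indexed array of Int backed by Array[Array[Int]].
class Array64(numElements: Long, chunkSize: Int = 1 << 20) {
  private val numChunks = ((numElements - 1) / chunkSize + 1).toInt
  private val chunks: Array[Array[Int]] = Array.tabulate(numChunks) { c =>
    val remaining = numElements - c.toLong * chunkSize
    new Array[Int](math.min(chunkSize.toLong, remaining).toInt)
  }
  // Pick the chunk, then the offset within it.
  def apply(i: Long): Int = chunks((i / chunkSize).toInt)((i % chunkSize).toInt)
  def update(i: Long, v: Int): Unit =
    chunks((i / chunkSize).toInt)((i % chunkSize).toInt) = v
}
{code}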

It looks like the issue with using an array is that most of the indexes are 
"wasted": even when nodes are not split, the array still reserves elements as 
if they were. This greatly reduces the number of nodes that can actually be 
used in the decision tree. For example, I've trained a couple of models with 
50 trees each and different maxDepth. All of the decision trees in both models 
looked like the following pairs:
Model with maxDepth = 20:
- depth=20, numNodes=471
- depth=19, numNodes=497

Model with maxDepth = 30:
- depth=30, numNodes=11347
- depth=30, numNodes=10963

Even though the decision trees grow up to the limit of 30 levels, they contain 
far fewer nodes than they could. 
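To quantify the waste (my own back-of-the-envelope arithmetic, assuming the 
1-based indexing described in the issue): an array-based layout must be able 
to address 2^(d+1) - 1 index slots for a tree of depth d, which at d = 30 is 
exactly Int.MaxValue, while the trees above occupy only about 11k of those 
slots.

{code:scala}
// Index slots an array-based layout must be able to address at depth d.
def slotsForDepth(d: Int): Long = (1L << (d + 1)) - 1

slotsForDepth(30)            // 2147483647 == Int.MaxValue -- hence maxDepth <= 30
11347.0 / slotsForDepth(30)  // ~5.3e-6: fraction of index slots actually used
{code}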

- Another way to solve this is to represent the decision tree as an actual 
tree: it would still allow us to use node indexes, but it wouldn't "waste" 
them. So, if the node indexes are integers, the limit becomes 2^31 - 1 actual 
nodes. I'm not sure whether that limit is feasible to reach, but I'd say it's 
better to use longs just in case. A rough sketch of this representation 
follows.
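For illustration, a pointer-based representation might look like the sketch 
below (names are hypothetical, not Spark's actual classes); memory grows with 
numNodes rather than 2^depth, and depth is no longer capped by index width.

{code:scala}
// Hypothetical pointer-based tree: children are referenced directly,
// so no index space is reserved for branches that were never split.
case class Split(feature: Int, threshold: Double)

sealed trait TreeNode
case class LeafNode(prediction: Double) extends TreeNode
case class InternalNode(split: Split, left: TreeNode, right: TreeNode)
  extends TreeNode

// Prediction simply walks from the root; no node indexes involved.
def predict(node: TreeNode, features: Array[Double]): Double = node match {
  case LeafNode(p) => p
  case InternalNode(s, l, r) =>
    if (features(s.feature) <= s.threshold) predict(l, features)
    else predict(r, features)
}
{code}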


was (Author: jean):
I looked at the Spark code regarding the issue, and I have a couple of ideas 
for how this can be fixed:
- introduce Array64 (an int[][] that allows arrays longer than Int.MaxValue) 
or a List, though the downside is that it would require a lot of memory just 
to store those indices,
- represent the decision tree as a tree without nodeIds at all.

> Remove restriction on maxDepth for decision trees
> -------------------------------------------------
>
>                 Key: SPARK-14043
>                 URL: https://issues.apache.org/jira/browse/SPARK-14043
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> We currently restrict decision trees (DecisionTree, GBT, RandomForest) to be 
> of maxDepth <= 30.  We should remove this restriction to support deep 
> (imbalanced) trees.
> Trees store an index for each node, where each index corresponds to a unique 
> position in a binary tree.  (I.e., the first index of row 0 is 1, the first 
> of row 1 is 2, the first of row 2 is 4, etc., IIRC)
> With some careful thought, we could probably avoid using indices altogether.
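For reference, here is the indexing scheme quoted above, expressed in code (a 
sketch, not Spark's actual implementation):

{code:scala}
// 1-based positional indexing: node i's children sit at 2*i and 2*i + 1,
// so row r starts at index 2^r (row 0 -> 1, row 1 -> 2, row 2 -> 4, ...).
def leftChildIndex(i: Int): Int = 2 * i
def rightChildIndex(i: Int): Int = 2 * i + 1
def firstIndexOfRow(r: Int): Int = 1 << r
{code}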


