Frank McQuillan created MADLIB-1035:
---------------------------------------

             Summary: Reduce memory usage for DT and RF
                 Key: MADLIB-1035
                 URL: https://issues.apache.org/jira/browse/MADLIB-1035
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Module: Decision Tree
            Reporter: Frank McQuillan


DT train requires collecting stats at each leaf node to decide which 
feature/threshold to split on. This involves building a data structure that 
keeps a count for each feature-value combination. This is done for each node 
in the bottom layer, and there are 2^k nodes at depth k, so the memory 
requirement grows exponentially with the depth. The memory needed at each node 
depends on the number of features and the values those features take; for 
continuous features, the value of `n_bins` determines the memory footprint. 
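
To make the growth concrete, here is a back-of-the-envelope sketch (Python; 
the function name, the stat count per value, and the byte sizes are 
illustrative assumptions, not MADlib internals) of the stats memory needed 
for one layer of the tree:

```python
# Hypothetical estimate of the stats memory needed for one layer of DT
# training. Names (feature_bins, n_stats_per_value) are illustrative,
# not MADlib internals.

def stats_memory_bytes(depth, feature_bins, n_stats_per_value=4,
                       bytes_per_stat=8):
    """Rough size of the per-node stats structure, summed over one layer.

    depth             -- tree depth k; the layer has up to 2**k leaf nodes
    feature_bins      -- number of distinct values per feature (n_bins for
                         continuous features)
    n_stats_per_value -- counters kept per feature-value combination
    """
    per_node = sum(feature_bins) * n_stats_per_value * bytes_per_stat
    return (2 ** depth) * per_node

# 100 continuous features at n_bins=100, depth 10:
# 2^10 nodes * 10000 values * 4 stats * 8 bytes ~= 328 MB for one layer.
print(stats_memory_bytes(10, [100] * 100))  # 327680000
```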

The tree itself also adds to the memory needs, but that is small compared to 
the stats we need to collect.

We can think about doing approximations or reducing the number of candidate 
splits as we go deeper into the tree, since the memory required is 
exponential in the depth.
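
One way to cap this, sketched below purely as an illustration (the halving 
schedule and `min_bins` floor are assumptions, not an existing MADlib 
option): coarsen the binning of continuous features as depth increases, so 
the per-node stats shrink roughly as fast as the node count grows.

```python
# Illustrative sketch: halve the number of bins per level so the total
# stats memory per layer stays roughly flat instead of doubling.
# min_bins and the halving schedule are assumptions for illustration.

def bins_at_depth(n_bins, depth, min_bins=8):
    """Coarsen continuous-feature binning as the tree gets deeper."""
    return max(min_bins, n_bins >> depth)

for k in range(6):
    per_layer = (2 ** k) * bins_at_depth(100, k)
    print(f"depth {k}: {bins_at_depth(100, k)} bins/node, "
          f"{per_layer} bins total in layer")
```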

There is an obvious optimization: we currently keep the stats data structure 
for the whole feature space at each leaf node, but once we get deeper into 
the tree, some parts of the feature space can never reach a particular leaf 
node, so we don't need to allocate memory for that space. This requires some 
code-fu and refactoring.
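
A minimal sketch of that idea for continuous features, assuming we track the 
reachable bin range per feature along the path from the root (all names are 
hypothetical; the real change would need refactoring in the accumulator 
code):

```python
# Minimal sketch of allocating stats only for the bins a node can actually
# see. We track, per continuous feature, the [lo, hi) bin range implied by
# the splits on the path from the root. All names are hypothetical.

def child_ranges(ranges, feature, split_bin):
    """Narrow the reachable bin range of `feature` after splitting there."""
    lo, hi = ranges[feature]
    left, right = dict(ranges), dict(ranges)
    left[feature] = (lo, split_bin)    # rows with value < threshold
    right[feature] = (split_bin, hi)   # rows with value >= threshold
    return left, right

def stats_size(ranges, n_stats_per_value=4):
    """Counters needed at a node: only reachable bins, not the full space."""
    return sum((hi - lo) for lo, hi in ranges.values()) * n_stats_per_value

root = {"f0": (0, 100), "f1": (0, 100)}     # two features, n_bins = 100
left, right = child_ranges(root, "f0", 30)  # split f0 at bin 30
print(stats_size(root), stats_size(left), stats_size(right))
# 800 520 680 -- each child needs fewer counters than the full 800
```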



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
