[
https://issues.apache.org/jira/browse/MADLIB-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828623#comment-15828623
]
Frank McQuillan commented on MADLIB-1035:
-----------------------------------------
Thanks Orhan.
I created a new JIRA with the details:
https://issues.apache.org/jira/browse/MADLIB-1057
Closing this JIRA.
> Spike: Investigate reducing memory usage for DT and RF
> ------------------------------------------------------
>
> Key: MADLIB-1035
> URL: https://issues.apache.org/jira/browse/MADLIB-1035
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Decision Tree
> Reporter: Frank McQuillan
> Fix For: v1.10
>
> Attachments: DTNotes.pdf
>
>
> DT train requires collecting stats at each leaf node to decide which
> feature/threshold to split on. This involves building a data structure that
> keeps a count for each feature-value combination. This is done for each node
> at the bottom layer, and there are 2^k nodes at depth k, so the memory
> requirement grows as the depth increases. The memory needed in each node
> depends on the number of features and the values those features take. For
> continuous features, the value of `n_bins` determines the memory footprint.
> The tree itself also adds to the memory needs, but that is small compared to
> the stats we need to collect.
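> As a rough illustration (a minimal sketch, not MADlib code; the counter
> size and the example numbers are assumptions), the footprint of the
> bottom-layer stats can be estimated as:
>
>     def stats_memory_bytes(depth, n_features, n_bins, counter_bytes=8):
>         # one counter per (feature, bin) pair in every bottom-layer node
>         nodes = 2 ** depth  # 2^k nodes at depth k
>         return nodes * n_features * n_bins * counter_bytes
>
>     # e.g. depth 12, 50 features, 32 bins:
>     # 2**12 * 50 * 32 * 8 bytes ~= 52 MB, doubling with each extra level
>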
> Since the memory required is exponential in the depth, we can consider
> approximations, such as reducing the number of candidate splits as we go
> deeper into the tree.
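> For example (a hypothetical schedule, not something implemented here), the
> number of bins per continuous feature could be halved at each level, with a
> floor so candidate splits never disappear entirely:
>
>     def bins_at_depth(n_bins, depth, min_bins=4):
>         # halve the bin count per level; keep at least min_bins
>         return max(min_bins, n_bins >> depth)
>
>     # bins_at_depth(32, 0) == 32; bins_at_depth(32, 3) == 4
>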
> There is also an obvious optimization: we currently keep the stats data
> structure for the whole feature space at each leaf node, but deeper in the
> tree there are regions of the feature space whose rows can never reach a
> particular leaf node, so we do not need to allocate memory for them. This
> requires some code-fu and refactoring.
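> A minimal sketch of the idea (a hypothetical helper, not the actual MADlib
> data structures): walk the splits on the path from the root to a leaf and
> narrow the per-feature bin range, so the leaf only needs counters for the
> bins that can still occur there:
>
>     def reachable_bins(path, n_bins):
>         # path: list of (feature, split_bin, went_left) from root to leaf
>         ranges = {}
>         for feature, split_bin, went_left in path:
>             lo, hi = ranges.get(feature, (0, n_bins))
>             if went_left:    # value <= threshold of split_bin
>                 hi = min(hi, split_bin + 1)
>             else:            # value > threshold of split_bin
>                 lo = max(lo, split_bin + 1)
>             ranges[feature] = (lo, hi)
>         return ranges        # allocate hi - lo bins per feature
>
>     # e.g. reachable_bins([("age", 10, True), ("age", 4, False)], 32)
>     # -> {"age": (5, 11)}: 6 counters instead of 32 for this leaf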