[ 
https://issues.apache.org/jira/browse/MADLIB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973811#comment-15973811
 ] 

ASF GitHub Bot commented on MADLIB-1057:
----------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/120

    DT: Assign memory only for reachable nodes

    JIRA: MADLIB-1057
    
    TreeAccumulator assigns a matrix to track the statistics of rows
    reaching the last layer of nodes. This matrix assumes a complete 
    tree and assigns memory for all nodes. As the tree gets deeper, 
    most of the nodes are unreachable, resulting in excessive wasted
    memory. This commit reduces that waste by only assigning memory
    for nodes that are reachable and accessing them through a lookup 
    table.
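
The lookup-table idea described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from the pull request: the names `SparseStats`, `build_lookup`, `n_leaf_slots`, and `n_stats` are hypothetical, and using -1 to mark an unreachable node is an assumption made for the example.

```python
def build_lookup(reachable, n_leaf_slots):
    """Map each position in the complete last layer to a compact row index.

    Unreachable positions are marked with -1 (an assumption of this sketch).
    """
    lookup = [-1] * n_leaf_slots
    for row, pos in enumerate(sorted(set(reachable))):
        lookup[pos] = row
    return lookup


class SparseStats:
    """Statistics matrix that stores rows only for reachable nodes."""

    def __init__(self, reachable, n_leaf_slots, n_stats):
        self.n_stats = n_stats
        self.lookup = build_lookup(reachable, n_leaf_slots)
        n_rows = len(set(reachable))
        # One row per reachable node instead of one per complete-tree slot.
        self.stats = [[0.0] * n_stats for _ in range(n_rows)]

    def add(self, pos, col, value):
        """Accumulate a value for the node at complete-tree position pos."""
        row = self.lookup[pos]
        if row < 0:
            raise KeyError("node %d is not reachable" % pos)
        self.stats[row][col] += value

    def get_row(self, pos):
        """Return the stats row for pos; zeros if the node is unreachable."""
        row = self.lookup[pos]
        if row < 0:
            return [0.0] * self.n_stats
        return list(self.stats[row])


acc = SparseStats(reachable=[1, 6], n_leaf_slots=8, n_stats=3)
acc.add(6, 2, 1.5)
print(acc.lookup)      # one entry per complete-tree slot
print(len(acc.stats))  # one row per reachable node
```

Callers keep addressing nodes by their complete-tree position, so only the read/write path through the lookup changes.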

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/dt_reduce_memory

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/120.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #120
    
----
commit b1cea55925ee1e3f6569d2d7aafac16e608c43b3
Author: Rahul Iyer <[email protected]>
Date:   2017-04-15T00:54:31Z

    Initial commit for sparser stats matrices

commit a0875f23ff69f22462a227b500612965976e0358
Author: Rahul Iyer <[email protected]>
Date:   2017-04-18T20:38:04Z

    Build lookup index vector

commit 67cb1b121a4829f4840f33f7cdc7eabe839ec343
Author: Rahul Iyer <[email protected]>
Date:   2017-04-19T00:39:24Z

    Remove warnings

----


> Reduce memory footprint for DT
> ------------------------------
>
>                 Key: MADLIB-1057
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1057
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Decision Tree
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>             Fix For: v1.11
>
>
> Follow on from spike 
> https://issues.apache.org/jira/browse/MADLIB-1035
> Step 1
> As a madlib developer I want to recreate the RF memory issue (reported in 
> https://issues.apache.org/jira/browse/MADLIB-1035). 
> The current datasets we have are 
> dt_adult : 32K rows 14 columns
> ecommerce : 1M rows 4 columns (ecommerce isn’t actually suitable for DT/RF)
> We need a table with ~2.2M rows and ~130 features (the actual target table
> has ~1300 features). Filling it with random values might help diagnose the
> issue, but ideally we would want a somewhat sensible dataset. The problem
> seems to involve relatively short trees (depth 5), which means a random
> dataset will probably fill the whole tree, which might not be true for a
> structured dataset.
> Step 2
> Refactoring DT for a smaller memory footprint.
> Tree Accumulator has 2 matrices for continuous and categorical variables. 
> The whole structure is recreated at every level. 
> Every matrix has 2^i rows (i is the level)
> The categorical matrix size depends on the total number of categories 
> (weather : {sunny, cloudy, rainy}, isWeekend : {true, false} means this total 
> is 3+2=5) 
> The continuous matrix size depends on the number of cont. features * the 
> number of bins.
> Tree accumulator works like an array, not a linked list. Even if the output is
> not a complete tree, the tree accumulator creates rows for nonexistent 
> branches in proper order and fills them with 0 values. 
> The refactored version would create a small index table that has the same 
> number of rows as the old tree accumulator (a complete tree) but only a 
> single index column that points to the new tree accumulator row. 
> This will allow us to keep most of the internal function interfaces the same,
> but the code to access (read/write) the tree accumulator will have to change.
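
To make the sizes in the description above concrete, here is a rough back-of-the-envelope sketch in Python. It counts one cell per category and one cell per bin, which is a simplifying assumption: the real accumulator may store several statistics per cell. The function name `accumulator_cells` and its parameters are hypothetical, as are the level, bin count, and reachable-node count in the example.

```python
def accumulator_cells(level, categories_per_feature, n_cont_features,
                      n_bins, n_reachable=None):
    """Estimate the number of cells held by the accumulator at one level.

    With n_reachable=None, every node of a complete tree gets a row
    (2**level rows). Otherwise only reachable nodes get a row, plus a
    single-column index table with one entry per complete-tree node.
    """
    full_rows = 2 ** level
    cat_width = sum(categories_per_feature)  # e.g. 3 + 2 = 5
    cont_width = n_cont_features * n_bins
    width = cat_width + cont_width
    if n_reachable is None:
        return full_rows * width
    return n_reachable * width + full_rows


# weather has 3 categories and isWeekend has 2, as in the description;
# the remaining numbers are made up for illustration.
dense = accumulator_cells(10, [3, 2], n_cont_features=4, n_bins=10)
sparse = accumulator_cells(10, [3, 2], n_cont_features=4, n_bins=10,
                           n_reachable=50)
print(dense, sparse)
```

The index table still grows as 2^i, but it is a single column, so it stays small relative to the wide statistics rows it replaces.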



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
