Frank McQuillan created MADLIB-1057:
---------------------------------------

             Summary: Reduce memory footprint for DT
                 Key: MADLIB-1057
                 URL: https://issues.apache.org/jira/browse/MADLIB-1057
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Module: Decision Tree
            Reporter: Frank McQuillan
             Fix For: v2.0


Follow on from spike 
https://issues.apache.org/jira/browse/MADLIB-1035

Step 1

As a MADlib developer, I want to reproduce the RF memory issue reported in 
https://issues.apache.org/jira/browse/MADLIB-1035.

The datasets we currently have are:
dt_adult: 32K rows, 14 columns
ecommerce: 1M rows, 4 columns (ecommerce isn’t actually suitable for DT/RF)

We need a table with ~2.2M rows and ~130 features (the actual target table has 
~1300 features). Filling it with random values might help diagnose the issue, 
but ideally we want a reasonably realistic dataset. The problem seems to 
involve relatively short trees (depth 5). A random dataset will probably fill 
the whole tree, which might not be true for a structured dataset.
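
A minimal sketch of how a structured synthetic table could be generated for 
the repro, assuming plain Python and CSV output. The names make_rows and 
write_csv, the column names, and the labeling rule are hypothetical and not 
part of MADlib. The label depends on only a few features, so a depth-5 tree 
trained on it would not necessarily fill every branch the way it might on 
purely random labels.

```python
import csv
import io
import random


def make_rows(n_rows, n_features, n_informative=3, seed=0):
    """Yield rows of n_features floats followed by a 0/1 label.

    Hypothetical helper: only the first n_informative features influence
    the label, so the data has structure a tree can exploit.
    """
    rng = random.Random(seed)
    for _ in range(n_rows):
        features = [rng.random() for _ in range(n_features)]
        # Label is 1 when the informative features are large on average.
        score = sum(features[:n_informative]) / n_informative
        label = 1 if score > 0.5 else 0
        yield features + [label]


def write_csv(out, n_rows, n_features, seed=0):
    """Write a header plus n_rows data rows to a file-like object."""
    writer = csv.writer(out)
    writer.writerow([f"f{i}" for i in range(n_features)] + ["label"])
    for row in make_rows(n_rows, n_features, seed=seed):
        writer.writerow(row)


# Small demo; the repro would use roughly n_rows=2_200_000, n_features=130.
buf = io.StringIO()
write_csv(buf, n_rows=5, n_features=4)
demo_lines = buf.getvalue().splitlines()
```

Because rows are produced by a generator, the full-size file can be streamed 
to disk without holding 2.2M rows in memory, and then loaded into the 
database with COPY.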


Step 2

Refactor DT for a smaller memory footprint.

The tree accumulator has 2 matrices, one for continuous and one for 
categorical variables. 
The whole structure is recreated at every level. 
Every matrix has 2^i rows (i is the level).
The categorical matrix size depends on the total number of categories (e.g. 
weather : {sunny, cloudy, rainy} and isWeekend : {true, false} give a total of 
3+2=5). 
The continuous matrix size depends on the number of continuous features * the 
number of bins.
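
The sizing rules above can be turned into a rough cell count. This is only a 
sketch: it reads "size" as the number of columns, the function name 
accumulator_cells is hypothetical, and the bin count of 100 is an assumed 
value for illustration. The real accumulator may store more than one 
statistic per cell, which this does not model.

```python
def accumulator_cells(level, categories_per_feature, n_continuous, n_bins):
    """Return (rows, categorical_cells, continuous_cells) for one tree level.

    Follows the description in this issue: 2^level rows per matrix, one
    categorical column per category, one continuous column per
    (feature, bin) pair.
    """
    rows = 2 ** level
    cat_cols = sum(categories_per_feature)
    con_cols = n_continuous * n_bins
    return rows, rows * cat_cols, rows * con_cols


# weather has 3 categories and isWeekend has 2, so 5 categorical columns.
# 130 continuous features and 100 bins are assumed values.
rows, cat_cells, con_cells = accumulator_cells(
    level=5, categories_per_feature=[3, 2], n_continuous=130, n_bins=100
)
print(rows, cat_cells, con_cells)  # 32 160 416000
```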

The tree accumulator works like an array, not a linked list. Even if the 
output is not a complete tree, the tree accumulator creates rows for 
nonexistent branches in proper order and fills them with 0 values. 

The refactored version would create a small index table that has the same 
number of rows as the old tree accumulator (a complete tree) but only a single 
index column that points to the new tree accumulator row. 

This will allow us to keep most of the internal function interfaces the same, 
but the code to access (read/write) the tree accumulator will have to change.
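
A minimal sketch of the index-table idea, assuming nodes are addressed by 
their position within a level of the complete tree. The class 
SparseTreeAccumulator, its methods, and the -1 sentinel are hypothetical 
illustrations, not the MADlib implementation.

```python
class SparseTreeAccumulator:
    """Compact accumulator reached through a dense single-column index."""

    MISSING = -1  # sentinel for positions with no node in the actual tree

    def __init__(self, level, n_cols, existing_nodes):
        # The index has one entry per position in the complete tree level.
        self.index = [self.MISSING] * (2 ** level)
        # Statistics rows are allocated only for nodes that exist.
        self.rows = []
        for pos in sorted(existing_nodes):
            self.index[pos] = len(self.rows)
            self.rows.append([0] * n_cols)

    def add(self, pos, col, value):
        """Accumulate value into the cell for complete-tree position pos."""
        row = self.index[pos]
        if row == self.MISSING:
            raise KeyError(f"no node at position {pos}")
        self.rows[row][col] += value

    def get(self, pos, col):
        """Read a cell; nonexistent branches read as 0, as in the old layout."""
        row = self.index[pos]
        if row == self.MISSING:
            return 0
        return self.rows[row][col]


# Level 3 has 8 positions in a complete tree; only 2 nodes exist here.
acc = SparseTreeAccumulator(level=3, n_cols=4, existing_nodes={0, 5})
acc.add(5, 2, 7)
```

Callers still address nodes by their complete-tree position, so only the 
read/write helpers change, while memory for statistics scales with the number 
of nodes that actually exist.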








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)