[ 
https://issues.apache.org/jira/browse/MADLIB-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664459#comment-15664459
 ] 

Frank McQuillan commented on MADLIB-1035:
-----------------------------------------

Here is some input from a MADlib user:

This query did not work for a table with 134 cols and 2.2M rows; it hit a VMEM 
(virtual memory) limit:

{code}
SELECT madlib.forest_train('data_table'
    , 'output_table'
    , 'id'
    , 'target'
    , '*'
    , 'partition_i'
    , NULL::text
    , 20::int         -- num_trees
    , sqrt(134)::int  -- num_random_features
    , true::boolean   -- importance
    , 1::int          -- num_permutations
    , 15::int         -- max_depth
    , 200::int        -- min_split
    , 10::int         -- min_bucket
    , 10::int         -- num_splits
);
{code}

This query worked OK for the same table with 134 cols and 2.2M rows:

{code}
SELECT madlib.forest_train('data_table'
    , 'output_table'
    , 'id'
    , 'target'
    , '*'
    , 'partition_i'
    , NULL::text
    , 10::int         -- num_trees
    , sqrt(134)::int  -- num_random_features
    , true::boolean   -- importance
    , 1::int          -- num_permutations
    , 5::int          -- max_depth
    , 200::int        -- min_split
    , 5::int          -- min_bucket
    , 10::int         -- num_splits
);
{code}

Out of the 134 features, 32 are categorical with 2-5 values each.

Trying the same parameters on the complete data set (2.2M rows and 1350 
columns) also hit the VMEM limit.

In general, the number of features, the tree depth, and the number of bins 
(for continuous variables) drive the memory footprint.
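To make that interaction concrete, here is a rough back-of-envelope model of the per-layer stats memory. This is a sketch only: the function name, the 8-byte accumulator size, and the binary target are assumptions for illustration, not the actual MADlib data structure.

{code}
# Hypothetical estimate of the stats memory DT training keeps for the
# bottom layer of the tree: 2^depth nodes, each holding one counter per
# feature/bin/class combination. Counter size and class count are assumed.
def stats_memory_bytes(n_features, n_bins, depth,
                       n_classes=2, bytes_per_count=8):
    n_leaf_nodes = 2 ** depth  # nodes at the bottom layer
    per_node = n_features * n_bins * n_classes * bytes_per_count
    return n_leaf_nodes * per_node

# 134 features, 10 bins: dropping max_depth from 15 to 5
# cuts this estimate by a factor of 2^10 = 1024.
deep = stats_memory_bytes(134, 10, 15)
shallow = stats_memory_bytes(134, 10, 5)
print(deep // shallow)  # 1024
{code}

Under this model the depth term dominates, which is consistent with the two queries above: only the deeper tree hit the VMEM limit.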

I would suggest we have a look at the RF code to determine:

1) Are there bugs, memory leaks, or similar issues causing the existing 
algorithm to use too much memory? Are there any obvious improvements to make?
2) What refinements to the algorithm or implementation can be made to reduce 
the memory footprint?




> Reduce memory usage for DT and RF
> ---------------------------------
>
>                 Key: MADLIB-1035
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1035
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Decision Tree
>            Reporter: Frank McQuillan
>             Fix For: v1.10
>
>
> DT train requires collecting stats at each leaf node to decide which 
> feature/threshold to split on. This involves building a data structure that 
> keeps a count for each feature-value combination. This is done for each node 
> at the bottom layer - there are 2^k nodes at depth k. Hence as the depth 
> increases, the memory requirement increases. The memory needed in each node 
> depends on number of features and the values those features take. For 
> continuous features, the value of `n_bins` determines the memory footprint. 
> The tree itself also adds to the memory needs, but that is small compared to 
> the stats we need to collect.
> We can think about doing approximations or reducing the possible number of 
> splits as we go deeper into the tree.  The memory required is exponential in 
> the depth.
> There is the obvious optimization: we keep the stats data structure for the 
> whole feature space at each leaf node - clearly once we get deeper into the 
> tree, there are some parts of the feature space that will never reach a 
> particular leaf node. Hence we don’t need to assign memory for that space. 
> This requires some code-fu and refactoring.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)