[ 
https://issues.apache.org/jira/browse/SPARK-30381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30381.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27103
[https://github.com/apache/spark/pull/27103]

> GBT reuse treePoints for all trees
> ----------------------------------
>
>                 Key: SPARK-30381
>                 URL: https://issues.apache.org/jira/browse/SPARK-30381
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In the existing GBT implementation, each tree first computes the available
> splits of each feature (via RandomForest.findSplits), based on the dataset
> sampled at that iteration. It then uses these splits to discretize the vectors
> into BaggedPoint[TreePoint]s. The BaggedPoints (one per input vector) are then
> cached and used only at that iteration. Note that the splits used for
> discretization differ from tree to tree (when subsamplingRate < 1) solely
> because the sampled vectors differ.
> However, the splits at different iterations should be similar if the sampled
> dataset is big enough, and exactly the same if subsamplingRate = 1.
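>
> (To illustrate why the splits are stable: equi-frequency split candidates
> computed on two different 50% subsamples of a large feature column land very
> close to the candidates computed on the full column. The sketch below is plain
> Scala with no Spark dependency; quantileSplits is only a simplified stand-in
> for what RandomForest.findSplits does, not the actual implementation.)
> {code:scala}
> import scala.util.Random
>
> object SplitStability {
>   // Equi-frequency split candidates: a simplified stand-in for findSplits.
>   def quantileSplits(values: Seq[Double], numBins: Int): Seq[Double] = {
>     val sorted = values.sorted
>     (1 until numBins).map(i => sorted((i.toLong * sorted.size / numBins).toInt))
>   }
>
>   def main(args: Array[String]): Unit = {
>     val rng  = new Random(0)
>     val data = Seq.fill(100000)(rng.nextGaussian())
>
>     val full = quantileSplits(data, numBins = 4)
>     val sub1 = quantileSplits(data.filter(_ => rng.nextDouble() < 0.5), numBins = 4)
>     val sub2 = quantileSplits(data.filter(_ => rng.nextDouble() < 0.5), numBins = 4)
>
>     // The three sets of thresholds come out nearly identical.
>     println(full.map(v => f"$v%.3f").mkString("full : ", ", ", ""))
>     println(sub1.map(v => f"$v%.3f").mkString("sub 1: ", ", ", ""))
>     println(sub2.map(v => f"$v%.3f").mkString("sub 2: ", ", ", ""))
>   }
> }
> {code}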
>  
> Moreover, in other well-known GBT implementations with binned features (such
> as XGBoost and LightGBM), the splits used for discretization are the same
> across iterations:
> {code:python}
> import xgboost as xgb
> from sklearn.datasets import load_svmlight_file
> X, y = load_svmlight_file('/data0/Dev/Opensource/spark/data/mllib/sample_linear_regression_data.txt')
> dtrain = xgb.DMatrix(X[:, :2], label=y)
> num_round = 3
> # Histogram-based trees, 2 bins per feature, and 50% row subsampling per tree.
> param = {'max_depth': 2, 'objective': 'reg:squarederror',
>          'tree_method': 'hist', 'max_bin': 2, 'eta': 0.01, 'subsample': 0.5}
> bst = xgb.train(param, dtrain, num_round)
> bst.trees_to_dataframe()
> Out[61]: 
>     Tree  Node   ID Feature     Split  Yes   No Missing        Gain  Cover
> 0      0     0  0-0      f1  0.000408  0-1  0-2     0-1  170.337143  256.0
> 1      0     1  0-1      f0  0.003531  0-3  0-4     0-3   44.865482  121.0
> 2      0     2  0-2      f0  0.003531  0-5  0-6     0-5  125.615570  135.0
> 3      0     3  0-3    Leaf       NaN  NaN  NaN     NaN   -0.010050   67.0
> 4      0     4  0-4    Leaf       NaN  NaN  NaN     NaN    0.002126   54.0
> 5      0     5  0-5    Leaf       NaN  NaN  NaN     NaN    0.020972   69.0
> 6      0     6  0-6    Leaf       NaN  NaN  NaN     NaN    0.001714   66.0
> 7      1     0  1-0      f0  0.003531  1-1  1-2     1-1   50.417793  263.0
> 8      1     1  1-1      f1  0.000408  1-3  1-4     1-3   48.732742  124.0
> 9      1     2  1-2      f1  0.000408  1-5  1-6     1-5   52.832161  139.0
> 10     1     3  1-3    Leaf       NaN  NaN  NaN     NaN   -0.012784   63.0
> 11     1     4  1-4    Leaf       NaN  NaN  NaN     NaN   -0.000287   61.0
> 12     1     5  1-5    Leaf       NaN  NaN  NaN     NaN    0.008661   64.0
> 13     1     6  1-6    Leaf       NaN  NaN  NaN     NaN   -0.003624   75.0
> 14     2     0  2-0      f1  0.000408  2-1  2-2     2-1   62.136013  242.0
> 15     2     1  2-1      f0  0.003531  2-3  2-4     2-3  150.537781  118.0
> 16     2     2  2-2      f0  0.003531  2-5  2-6     2-5    3.829046  124.0
> 17     2     3  2-3    Leaf       NaN  NaN  NaN     NaN   -0.016737   65.0
> 18     2     4  2-4    Leaf       NaN  NaN  NaN     NaN    0.005809   53.0
> 19     2     5  2-5    Leaf       NaN  NaN  NaN     NaN    0.005251   60.0
> 20     2     6  2-6    Leaf       NaN  NaN  NaN     NaN    0.001709   64.0
>  {code}
>  
> We can see that even though we set subsample=0.5, the three trees share the
> same splits.
>  
> So I think we could reuse the splits and treePoints across all iterations:
> at iteration 0, compute the splits on the whole training dataset and use them
> to generate the treePoints; at each subsequent iteration, generate the
> baggedPoints directly from those treePoints. This way we no longer need to
> persist/unpersist the internal training dataset for each tree.
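>
> A self-contained sketch of the proposed control flow (plain Scala with
> hypothetical stand-in types and helpers, not the actual ml.tree classes or
> signatures):
> {code:scala}
> import scala.util.Random
>
> object ReuseTreePointsSketch {
>   // Hypothetical stand-ins for the real TreePoint / BaggedPoint classes.
>   final case class TreePoint(label: Double, binned: Array[Int])
>   final case class BaggedPoint[T](datum: T, weight: Double)
>
>   // Equi-frequency split candidates per feature, computed once on the full data.
>   def findSplits(data: Seq[(Double, Array[Double])], numBins: Int): Array[Array[Double]] = {
>     val numFeatures = data.head._2.length
>     Array.tabulate(numFeatures) { j =>
>       val sorted = data.map(_._2(j)).sorted
>       (1 until numBins).map(i => sorted((i.toLong * sorted.size / numBins).toInt)).toArray
>     }
>   }
>
>   def binIndex(value: Double, thresholds: Array[Double]): Int = thresholds.count(value > _)
>
>   def main(args: Array[String]): Unit = {
>     val rng  = new Random(0)
>     val data = Seq.fill(1000)((rng.nextGaussian(), Array(rng.nextGaussian(), rng.nextGaussian())))
>
>     // Iteration 0 only: splits and TreePoints from the whole training set.
>     val splits     = findSplits(data, numBins = 4)
>     val treePoints = data.map { case (y, x) =>
>       TreePoint(y, x.indices.map(j => binIndex(x(j), splits(j))).toArray)
>     }
>     // (In Spark, the TreePoint RDD would be persisted once here, not once per tree.)
>
>     // Every boosting iteration: only draw a new bag over the already-binned points.
>     (0 until 3).foreach { iter =>
>       val bagged = treePoints.map(tp =>
>         BaggedPoint(tp, if (rng.nextDouble() < 0.5) 1.0 else 0.0))
>       println(s"iter $iter: ${bagged.count(_.weight > 0)} sampled points")
>       // ... fit the next tree on `bagged` using the shared `splits` ...
>     }
>   }
> }
> {code}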
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to