[
https://issues.apache.org/jira/browse/SPARK-30381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhengruifeng resolved SPARK-30381.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 27103
[https://github.com/apache/spark/pull/27103]
> GBT reuse treePoints for all trees
> ----------------------------------
>
> Key: SPARK-30381
> URL: https://issues.apache.org/jira/browse/SPARK-30381
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
> Fix For: 3.0.0
>
>
> In the existing GBT, each tree first computes the available splits of each
> feature (via RandomForest.findSplits) based on the dataset sampled at that
> iteration. It then uses these splits to discretize the vectors into
> BaggedPoint[TreePoint]s. The BaggedPoints (one per input vector) are then
> cached and used for that iteration. Note that the splits used for
> discretization differ from tree to tree (if subsamplingRate < 1) only
> because the sampled vectors differ.
> However, the splits at different iterations should be similar if the sampled
> dataset is big enough, and even identical if subsamplingRate = 1.
>
> However, other well-known GBT implementations with binned features (like
> XGBoost/LightGBM) use the same discretization splits across iterations:
> {code:python}
> import xgboost as xgb
> from sklearn.datasets import load_svmlight_file
> X, y = load_svmlight_file('/data0/Dev/Opensource/spark/data/mllib/sample_linear_regression_data.txt')
> dtrain = xgb.DMatrix(X[:, :2], label=y)
> num_round = 3
> param = {'max_depth': 2, 'objective': 'reg:squarederror',
>          'tree_method': 'hist', 'max_bin': 2, 'eta': 0.01, 'subsample': 0.5}
> bst = xgb.train(param, dtrain, num_round)
> bst.trees_to_dataframe()
> Out[61]:
> Tree Node ID Feature Split Yes No Missing Gain Cover
> 0 0 0 0-0 f1 0.000408 0-1 0-2 0-1 170.337143 256.0
> 1 0 1 0-1 f0 0.003531 0-3 0-4 0-3 44.865482 121.0
> 2 0 2 0-2 f0 0.003531 0-5 0-6 0-5 125.615570 135.0
> 3 0 3 0-3 Leaf NaN NaN NaN NaN -0.010050 67.0
> 4 0 4 0-4 Leaf NaN NaN NaN NaN 0.002126 54.0
> 5 0 5 0-5 Leaf NaN NaN NaN NaN 0.020972 69.0
> 6 0 6 0-6 Leaf NaN NaN NaN NaN 0.001714 66.0
> 7 1 0 1-0 f0 0.003531 1-1 1-2 1-1 50.417793 263.0
> 8 1 1 1-1 f1 0.000408 1-3 1-4 1-3 48.732742 124.0
> 9 1 2 1-2 f1 0.000408 1-5 1-6 1-5 52.832161 139.0
> 10 1 3 1-3 Leaf NaN NaN NaN NaN -0.012784 63.0
> 11 1 4 1-4 Leaf NaN NaN NaN NaN -0.000287 61.0
> 12 1 5 1-5 Leaf NaN NaN NaN NaN 0.008661 64.0
> 13 1 6 1-6 Leaf NaN NaN NaN NaN -0.003624 75.0
> 14 2 0 2-0 f1 0.000408 2-1 2-2 2-1 62.136013 242.0
> 15 2 1 2-1 f0 0.003531 2-3 2-4 2-3 150.537781 118.0
> 16 2 2 2-2 f0 0.003531 2-5 2-6 2-5 3.829046 124.0
> 17 2 3 2-3 Leaf NaN NaN NaN NaN -0.016737 65.0
> 18 2 4 2-4 Leaf NaN NaN NaN NaN 0.005809 53.0
> 19 2 5 2-5 Leaf NaN NaN NaN NaN 0.005251 60.0
> 20 2 6 2-6 Leaf NaN NaN NaN NaN 0.001709 64.0
> {code}
>
> We can see that even if we set subsample=0.5, the three trees share the same
> splits.
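>
> A minimal NumPy sketch of why this holds (not Spark code; the data, bin count
> and sampling scheme are made up for illustration): equal-frequency split
> candidates computed from two different 50% subsamples of a reasonably large
> column come out nearly identical.
> {code:python}
> import numpy as np
>
> rng = np.random.default_rng(0)
> col = rng.normal(size=100_000)  # one continuous feature column
>
> def find_splits(values, num_bins=32):
>     # equal-frequency split candidates, in the spirit of quantile binning
>     qs = np.linspace(0, 1, num_bins + 1)[1:-1]
>     return np.quantile(values, qs)
>
> # two independent 50% subsamples, mimicking subsamplingRate=0.5 at two iterations
> s1 = rng.choice(col, size=col.size // 2, replace=False)
> s2 = rng.choice(col, size=col.size // 2, replace=False)
>
> # maximum difference between the two sets of split candidates is tiny
> print(np.max(np.abs(find_splits(s1) - find_splits(s2))))
> {code}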
>
> So I think we could reuse the splits and the treePoints across all iterations:
> at iteration 0, compute the splits on the whole training dataset and use them
> to generate the treePoints once.
> At each subsequent iteration, generate the baggedPoints directly from those
> treePoints. With this, we no longer need to persist/unpersist an internal
> training dataset for each tree.
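>
> A minimal NumPy sketch of the proposed flow (not Spark code; find_splits and
> the loop body are hypothetical stand-ins for RandomForest.findSplits,
> TreePoint and BaggedPoint): the data is binned once up front, and each
> iteration only draws new bagging weights over the shared binned points.
> {code:python}
> import numpy as np
>
> rng = np.random.default_rng(0)
> X = rng.normal(size=(10_000, 4))  # toy training features
>
> def find_splits(X, num_bins=32):
>     # split candidates per feature, computed once on the whole dataset
>     qs = np.linspace(0, 1, num_bins + 1)[1:-1]
>     return [np.quantile(X[:, j], qs) for j in range(X.shape[1])]
>
> splits = find_splits(X)
> # "treePoints": each value replaced by its bin index, computed once and reused
> tree_points = np.column_stack(
>     [np.searchsorted(splits[j], X[:, j]) for j in range(X.shape[1])])
>
> num_iterations, subsampling_rate = 3, 0.5
> for it in range(num_iterations):
>     # "baggedPoints": per-iteration subsample weights over the shared treePoints;
>     # no re-binning and no per-tree persist/unpersist of a new dataset
>     weights = rng.binomial(1, subsampling_rate, size=tree_points.shape[0])
>     # ... fit tree `it` on (tree_points, weights) against the current residuals ...
> {code}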
>
>