zhengruifeng opened a new pull request #27103: [SPARK-30381][ML] Refactor GBT 
to reuse treePoints for all trees
URL: https://github.com/apache/spark/pull/27103
 
 
   ### What changes were proposed in this pull request?
   Make GBT reuse splits and treePoints for all trees:
   1, reuse splits/treePoints for all trees:
   The existing implementation finds feature splits and transforms the input 
vectors to treePoints separately for each tree, while other well-known 
implementations such as XGBoost and LightGBM build global splits/binned 
features once and reuse them for all trees.
   This causes a small behavior change: the existing implementation builds 
splits on a randomly sampled dataset at each iteration, so splits may differ 
among trees. (If the dataset is large, or the sampling rate is high, the 
splits at different iterations should be similar.)
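
   The idea of building splits and binned features once can be sketched in 
plain Scala as follows. This is only an illustration under assumed names 
(`GlobalBinning`, `findSplits`, `binOf`, `binAll` are hypothetical and are not 
Spark APIs), and the midpoint-based threshold selection is a simplification, 
not the split-finding logic used in Spark.

```scala
// Hypothetical sketch: compute thresholds once, bin every row once,
// then let every boosting iteration read the same binned rows.
object GlobalBinning {

  /** For each feature, pick at most maxBins - 1 thresholds from the data. */
  def findSplits(data: Seq[Array[Double]], maxBins: Int): Array[Array[Double]] = {
    require(data.nonEmpty && maxBins >= 2)
    val numFeatures = data.head.length
    val maxSplits = maxBins - 1
    Array.tabulate(numFeatures) { f =>
      val distinct = data.map(row => row(f)).distinct.sorted
      // Candidate thresholds: midpoints between consecutive distinct values.
      val midpoints =
        distinct.sliding(2).collect { case Seq(a, b) => (a + b) / 2.0 }.toArray
      if (midpoints.length <= maxSplits) midpoints
      else {
        // Too many candidates: keep an evenly strided subset.
        val stride = midpoints.length.toDouble / maxSplits
        Array.tabulate(maxSplits)(i => midpoints((i * stride).toInt))
      }
    }
  }

  /** Bin index = position of the first threshold that is >= value. */
  def binOf(value: Double, thresholds: Array[Double]): Int = {
    val idx = java.util.Arrays.binarySearch(thresholds, value)
    if (idx >= 0) idx else -idx - 1
  }

  /** Bin every feature of every row; done once, before any tree is trained. */
  def binAll(data: Seq[Array[Double]], splits: Array[Array[Double]]): Seq[Array[Int]] =
    data.map(row => Array.tabulate(row.length)(f => binOf(row(f), splits(f))))
}
```

   In this sketch, `findSplits` and `binAll` would be called once before the 
boosting loop, and each tree would then train on the same binned rows instead 
of recomputing splits per iteration.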
   
   2, do not cache input vectors:
   The existing implementation caches the input twice: (1) 
`input: RDD[Instance]` is used to compute/update predictions and errors; 
(2) at each iteration, the input is transformed to bagged points, which are 
cached during that iteration.
   In this PR, `input: RDD[Instance]` is no longer cached, since it is only 
used three times: (1) to compute metadata; (2) to find splits; (3) to be 
transformed to treePoints.
   Instead, the treePoints `RDD[TreePoint]` are cached. At each iteration, 
they are converted to bagged points by attaching an extra 
`labelWithWeights: RDD[(Double, Int, Double)]` containing 
residual/sampleCount/weight information; this RDD is relatively small (like 
the cached `norms` in KMeans).
   To compute/update predictions and errors, a new prediction method on 
binned features is added in `Node`.
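
   A minimal sketch of the two pieces described above, using local 
collections instead of RDDs: predicting directly from binned features, and 
attaching per-iteration residual/sampleCount/weight tuples to cached binned 
points. All names here (`BinnedGbtSketch`, `Internal`, `Leaf`, 
`predictBinned`, `Bagged`, `attach`) are hypothetical, and the rule "go left 
when the bin index is at most the split's bin index" is an assumption made 
for illustration, not a description of the code in this PR.

```scala
// Hypothetical sketch, not the Node/TreePoint classes from Spark.
object BinnedGbtSketch {

  sealed trait Node
  final case class Leaf(prediction: Double) extends Node
  // Assumed rule: a sample goes left when its bin for `feature` is <= splitBin.
  final case class Internal(feature: Int, splitBin: Int, left: Node, right: Node)
    extends Node

  /** Walk the tree using bin indices only; raw feature values are not needed. */
  @annotation.tailrec
  def predictBinned(node: Node, bins: Array[Int]): Double = node match {
    case Leaf(p) => p
    case Internal(f, s, l, r) => predictBinned(if (bins(f) <= s) l else r, bins)
  }

  /** A binned point plus the small per-iteration data attached to it. */
  final case class Bagged(bins: Array[Int], residual: Double, sampleCount: Int, weight: Double)

  /** Pair cached binned points with this iteration's (residual, count, weight). */
  def attach(
      points: Seq[Array[Int]],
      labelWithWeights: Seq[(Double, Int, Double)]): Seq[Bagged] =
    points.zip(labelWithWeights).map { case (bins, (res, cnt, w)) =>
      Bagged(bins, res, cnt, w)
    }
}
```

   With RDDs, the pairing step would presumably be an element-wise 
combination of the cached treePoints with the much smaller per-iteration RDD, 
so only the small RDD has to be rebuilt at each iteration.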
   
   ### Why are the changes needed?
   For performance improvement:
   1, 40%~50% faster than the existing implementation
   2, saves 40%~60% of RAM
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   Existing test suites and several manual tests in the REPL.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
