zhengruifeng edited a comment on issue #27103: [SPARK-30381][ML] Refactor GBT to reuse treePoints for all trees
URL: https://github.com/apache/spark/pull/27103#issuecomment-571064302

Test code:

```scala
import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel

// Load a9a and duplicate it 2^8 = 256x to get a reasonably large training set.
var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
(0 until 8).foreach { _ => df = df.union(df) }
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count
df.count
df.count

val gbt = new GBTRegressor().setMaxIter(10)
val warmup = gbt.fit(df)  // warm-up run, excluded from timing

val start = System.currentTimeMillis; val gbtm = gbt.fit(df); val end = System.currentTimeMillis; end - start
gbtm.evaluateEachIteration(df, "squared")

val gbt2 = new GBTRegressor().setMaxIter(10).setSubsamplingRate(0.8)
val start2 = System.currentTimeMillis; val gbtm2 = gbt2.fit(df); val end2 = System.currentTimeMillis; end2 - start2
gbtm2.evaluateEachIteration(df, "squared")
```

Result: about 48% faster than the existing impl.

| dur_gbt (new, ms) | dur_gbt2 (new, ms) | dur_gbt (old, ms) | dur_gbt2 (old, ms) |
|------|------|------|------|
| 133214 | 134787 | 197777 | 188205 |

| loss (new) | loss2 (new) | loss (old) | loss2 (old) |
|------|------|------|------|
| 0.4283679582338368 | 0.42678864636469305 | 0.4283679582338368 | 0.42678864636469305 |

The losses are identical at each iteration, so loss convergence is unchanged.

RAM usage:

Existing impl (screenshot omitted): RAM used for the training dataset: 2.3G + 5.0G = 7.3G. rdd965/rdd991/rdd1017 are the internal RDDs created at each iteration.

This PR (screenshot omitted): RAM used for the training dataset: 4.4G + 761M = 5.1G.
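For reference, the per-RDD memory figures above were read off cached-RDD storage info; here is a minimal sketch of pulling the same numbers programmatically instead of from screenshots (assuming a running `spark` session; `getRDDStorageInfo` is a `DeveloperApi` in Spark core and reports the same storage data as the web UI):

```scala
// List every cached RDD with its in-memory and on-disk footprint.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(f"rdd${info.id}%-5d ${info.name}%-40s " +
    f"mem=${info.memSize.toDouble / (1L << 30)}%.2fGB " +
    f"disk=${info.diskSize.toDouble / (1L << 30)}%.2fGB")
}
```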
