zhengruifeng commented on issue #27103: [SPARK-30381][ML] Refactor GBT to reuse treePoints for all trees
URL: https://github.com/apache/spark/pull/27103#issuecomment-571064302
 
 
   testcode:
   ```scala
   import org.apache.spark.ml.regression._
   import org.apache.spark.storage.StorageLevel
   
   var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
   
   (0 until 8).foreach{ _ => df = df.union(df) }
   df.persist(StorageLevel.MEMORY_AND_DISK)
   
   df.count
   df.count
   df.count
   
   val gbt = new GBTRegressor().setMaxIter(10)
   val gbtm = gbt.fit(df)
   
   
   val start = System.currentTimeMillis; val gbtm = gbt.fit(df); val end = System.currentTimeMillis; end - start
   
   gbtm.evaluateEachIteration(df, "squared")
   
   
   val gbt2 = new GBTRegressor().setMaxIter(10).setSubsamplingRate(0.8)
   
   val start = System.currentTimeMillis; val gbtm2 = gbt2.fit(df); val end = System.currentTimeMillis; end - start
   
   gbtm2.evaluateEachIteration(df, "squared")
   ```
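
The inline `start`/`end` timing in the test code could also be wrapped in a small helper. This is only a sketch: `timed` is a hypothetical name, not a Spark API, and the workload below is a stand-in so the snippet runs without a Spark session.

```scala
// Hypothetical helper, not part of Spark: evaluates `body` once and returns
// its result together with the elapsed wall-clock time in milliseconds.
def timed[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  (result, elapsedMs)
}

// Stand-in workload; in spark-shell the body would be e.g. `gbt.fit(df)`.
val (sum, ms) = timed { (1L to 1000000L).sum }
println(s"sum=$sum took ${ms}ms")
```

In spark-shell this would presumably be used as `val (gbtm, ms) = timed { gbt.fit(df) }`, which avoids redefining `start` and `end` for each run.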
   
   result:
   About 48% faster than the existing implementation (durations in milliseconds, as measured with `System.currentTimeMillis`):
   |dur_gbt(new) | dur_gbt2(new) | dur_gbt(old) | dur_gbt2(old) |
   |------|----------|------------|----------|
   |197777|188205|133214|134787|
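
As a quick check, the 48% figure appears to correspond to the ratio between the two `gbt` durations in the table. The sketch below recomputes it; `relativeGapPercent` is a hypothetical helper name, and the durations are assumed to be milliseconds.

```scala
// Durations copied from the table above; assumed to be milliseconds.
val gbtDurations = (197777L, 133214L)
val gbt2Durations = (188205L, 134787L)

// How much longer the slower run is than the faster one, as a percentage.
def relativeGapPercent(a: Long, b: Long): Double = {
  val longer = math.max(a, b).toDouble
  val shorter = math.min(a, b).toDouble
  (longer / shorter - 1.0) * 100.0
}

val gbtGap = relativeGapPercent(gbtDurations._1, gbtDurations._2)
val gbt2Gap = relativeGapPercent(gbt2Durations._1, gbt2Durations._2)
println(f"gbt: $gbtGap%.1f%%, gbt2: $gbt2Gap%.1f%%")
```
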
   
   |loss(new) | loss2(new) | loss(old) | loss2(old) |
   |------|----------|------------|----------|
   |0.4283679582338368|0.42678864636469305|0.4283679582338368|0.42678864636469305|
   
   The convergence of `gbt` should be identical in both implementations, since `splits` are always built on the whole dataset and are therefore the same.
   The convergence of `gbt2` does not have to be identical. In the tests above the losses happen to match, possibly because the input df is generated by repeating `a9a` 1024 times.
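
Since `evaluateEachIteration` returns one loss per iteration as an `Array[Double]`, the convergence of two models can be compared element-wise. A minimal sketch, assuming an absolute tolerance is acceptable; `sameConvergence` is a hypothetical helper and the arrays are illustrative values, not measured results.

```scala
// Hypothetical helper: true when two per-iteration loss sequences have the
// same length and agree element-wise within an absolute tolerance.
def sameConvergence(a: Array[Double], b: Array[Double], tol: Double = 1e-12): Boolean =
  a.length == b.length && a.zip(b).forall { case (x, y) => math.abs(x - y) <= tol }

// Illustrative values only, not taken from the benchmark.
val lossesA = Array(0.50, 0.45, 0.43)
val lossesB = Array(0.50, 0.45, 0.43)
val lossesC = Array(0.50, 0.46, 0.43)

println(sameConvergence(lossesA, lossesB)) // true
println(sameConvergence(lossesA, lossesC)) // false
```

In spark-shell the inputs would presumably be the arrays returned by `evaluateEachIteration(df, "squared")` for the old and new builds.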
   
