zhengruifeng opened a new pull request #27070: [SPARK-9612][ML][FOLLOWUP] fix 
GBT support weights if subsamplingRate<1
URL: https://github.com/apache/spark/pull/27070
 
 
   ### What changes were proposed in this pull request?
   Fix `BaggedPoint.convertToBaggedRDD` so that instance weights are preserved when `subsamplingRate < 1.0`.
   
   ### Why are the changes needed?
   1, `baggedPoint: BaggedPoint[TreePoint]` is used during tree growth to find the best split;
   2, `BaggedPoint[TreePoint]` contains two weights:
   ```scala
   class BaggedPoint[Datum](val datum: Datum, val subsampleCounts: Array[Int], val sampleWeight: Double = 1.0)
   class TreePoint(val label: Double, val binnedFeatures: Array[Int], val weight: Double)
   ```
   3, Only the field `sampleWeight` in `BaggedPoint` is used; the field `weight` in `TreePoint` is never used when finding splits;
   4, The method `BaggedPoint.convertToBaggedRDD` was changed in https://github.com/apache/spark/pull/21632, but that change targeted only decision trees, so only the following code path was updated:
   ```
   if (numSubsamples == 1 && subsamplingRate == 1.0) {
     convertToBaggedRDDWithoutSampling(input, extractSampleWeight)
   }
   ```
   5, In https://github.com/apache/spark/pull/25926, I made GBT support weights, but only tested it with the default `subsamplingRate == 1`.
   GBT with `subsamplingRate < 1` converts treePoints to baggedPoints via
   ```scala
   convertToBaggedRDDSamplingWithoutReplacement(input, subsamplingRate, numSubsamples, seed)
   ```
   in which the original weights from `weightCol` are discarded and every `sampleWeight` is assigned the default 1.0.
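
   To illustrate the idea in point 5, here is a minimal sketch with no Spark dependency, using simplified stand-ins for the two classes quoted above. The object `BaggingSketch` and the helper `sampleWithoutReplacement` are hypothetical names invented for this example, and the extra `extractSampleWeight` argument is an assumption about how the weight could be threaded through; this is not the code in the patch.

   ```scala
   import scala.util.Random

   // Simplified stand-ins for the Spark classes quoted above.
   final case class TreePoint(label: Double, binnedFeatures: Array[Int], weight: Double)
   final case class BaggedPoint[Datum](
       datum: Datum,
       subsampleCounts: Array[Int],
       sampleWeight: Double = 1.0)

   object BaggingSketch {
     // Hypothetical helper: Bernoulli sampling without replacement over a local
     // Seq (not an RDD) that keeps each instance's weight instead of falling
     // back to the default sampleWeight of 1.0.
     def sampleWithoutReplacement[Datum](
         input: Seq[Datum],
         subsamplingRate: Double,
         numSubsamples: Int,
         seed: Long,
         extractSampleWeight: Datum => Double): Seq[BaggedPoint[Datum]] = {
       val rng = new Random(seed)
       input.map { datum =>
         // For each subsample, the instance is included once or not at all.
         val counts = Array.fill(numSubsamples) {
           if (rng.nextDouble() < subsamplingRate) 1 else 0
         }
         BaggedPoint(datum, counts, extractSampleWeight(datum))
       }
     }
   }
   ```

   Calling `BaggingSketch.sampleWithoutReplacement(points, 0.7, 1, 42L, (p: TreePoint) => p.weight)` yields bagged points whose `sampleWeight` equals the original `weight` of each `TreePoint`, whatever the sampled counts turn out to be.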
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   Updated the test suites.
