Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4677#discussion_r25143389
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala ---
    @@ -158,6 +158,63 @@ class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {
           }
         }
       }
    +
    +  test("runWithValidation performs better on a validation dataset (Regression)") {
    +    // Set numIterations large enough so that it early stops.
    +    val numIterations = 20
    +    val trainRdd = sc.parallelize(GradientBoostedTreesSuite.trainData, 2)
    +    val validateRdd = sc.parallelize(GradientBoostedTreesSuite.validateData, 2)
    +
    +    val treeStrategy = new Strategy(algo = Regression, impurity = Variance, maxDepth = 2,
    +      categoricalFeaturesInfo = Map.empty)
    +    Array(SquaredError, AbsoluteError).foreach { error =>
    +      val boostingStrategy =
    +        new BoostingStrategy(treeStrategy, error, numIterations, validationTol = 0.0)
    +
    +      val gbtValidate = new GradientBoostedTrees(boostingStrategy).runWithValidation(
    +        trainRdd, validateRdd)
    +      assert(gbtValidate.numTrees != numIterations)
    +
    +      val gbt = GradientBoostedTrees.train(trainRdd, boostingStrategy)
    +      val errorWithoutValidation = error.computeError(gbt, validateRdd)
    +      val errorWithValidation = error.computeError(gbtValidate, validateRdd)
    +      assert(errorWithValidation < errorWithoutValidation)
    +    }
    +  }
    +
    +  test("runWithValidation performs better on a validation dataset (Classification)") {
    +    // Set numIterations large enough so that it early stops.
    +    val numIterations = 20
    +    val trainRdd = sc.parallelize(GradientBoostedTreesSuite.trainData, 2)
    +    val validateRdd = sc.parallelize(GradientBoostedTreesSuite.validateData, 2)
    +
    +    val treeStrategy = new Strategy(algo = Classification, impurity = Variance, maxDepth = 2,
    +      categoricalFeaturesInfo = Map.empty)
    +    val boostingStrategy =
    +      new BoostingStrategy(treeStrategy, LogLoss, numIterations, validationTol = 0.0)
    +
    +    // Test that it stops early.
    +    val gbtValidate = new GradientBoostedTrees(boostingStrategy).runWithValidation(
    +      trainRdd, validateRdd)
    +    assert(gbtValidate.numTrees != numIterations)
    +
    +    // Remap labels to {-1, 1}
    +    val remappedInput = validateRdd.map(x => new LabeledPoint(2 * x.label - 1, x.features))
    +
    +    // The error checked for internally in the GradientBoostedTrees is based on Regression.
    +    // Hence for the validation model, the Classification error need not be strictly less than
    +    // that done with validation.
    +    val gbtValidateRegressor = new GradientBoostedTreesModel(
    --- End diff --
    
    I have addressed all of your comments except this one.
    I am testing with validationInput only. Sorry if the variable name is
    confusing.
    
    I think what happens is that, for the dataset being tested here, the number
    of correctly classified labels is the same whether or not I run with
    validation. That is, when I run without validation, the validation error
    might increase, but the number of labels that are predicted correctly does
    not change.
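
    To illustrate the point above, here is a minimal standalone sketch (plain
    Scala, no Spark) of how the loss on a validation set can grow while the
    count of correctly predicted labels stays the same. The labels and margins
    are made-up numbers, not taken from the test data, and the loss is assumed
    to have the form 2 * log(1 + exp(-2 * y * F)) with labels in {-1, 1}; the
    object and method names are hypothetical.

```scala
object LossVsAccuracy {
  // Hypothetical labels in {-1, +1} and raw ensemble outputs (margins).
  val labels: Seq[Double] = Seq(1.0, 1.0, -1.0, -1.0)
  // Margins from a model that stopped early (made-up values).
  val marginsEarlyStop: Seq[Double] = Seq(0.8, 0.3, -0.6, 0.2)
  // Margins after more boosting iterations (made-up values): the one wrong
  // prediction becomes more confidently wrong, but no prediction changes sign.
  val marginsAllTrees: Seq[Double] = Seq(1.5, 0.3, -0.6, 0.9)

  // Mean of 2 * log(1 + exp(-2 * y * F)); assumed loss form, see lead-in.
  def meanLogLoss(ys: Seq[Double], fs: Seq[Double]): Double =
    ys.zip(fs).map { case (y, f) => 2.0 * math.log1p(math.exp(-2.0 * y * f)) }.sum / ys.size

  // Fraction of points where the sign of the margin matches the label.
  def accuracy(ys: Seq[Double], fs: Seq[Double]): Double =
    ys.zip(fs).count { case (y, f) => (if (f > 0) 1.0 else -1.0) == y }.toDouble / ys.size

  def main(args: Array[String]): Unit = {
    println(s"early stop: loss=${meanLogLoss(labels, marginsEarlyStop)}, " +
      s"accuracy=${accuracy(labels, marginsEarlyStop)}")
    println(s"all trees:  loss=${meanLogLoss(labels, marginsAllTrees)}, " +
      s"accuracy=${accuracy(labels, marginsAllTrees)}")
  }
}
```

    In this toy case the loss is higher for the second set of margins while the
    accuracy is identical, so an assertion on classification error with strict
    inequality would not hold even though early stopping did help the loss.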

