GitHub user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11903#discussion_r57082871
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -58,8 +67,8 @@ class DecisionTree @Since("1.0.0") (private val strategy: Strategy)
        */
       @Since("1.2.0")
       def run(input: RDD[LabeledPoint]): DecisionTreeModel = {
    -    // Note: random seed will not be used since numTrees = 1.
    -    val rf = new RandomForest(strategy, numTrees = 1, featureSubsetStrategy = "all", seed = 0)
    +    val rf = new RandomForest(strategy, numTrees = 1, featureSubsetStrategy = "all",
    --- End diff --
    
    Across the tree/ensemble libraries in Spark, unit tests generally check that
    ML and MLlib results are equal. For decision trees, ML uses a random seed and
    MLlib effectively ignores it. By some stroke of luck, this doesn't cause a
    problem: `findSplits` uses the random seed only to subsample continuous
    features for split calculations on large datasets, and the unit tests do not
    use datasets large enough to trigger that. However, GBTs subsample the input
    for single decision trees (this is not allowed or tested in DecisionTree*),
    and that subsampling also uses the random seed. The unit tests then fail
    because MLlib doesn't use the same random seed as ML. Instead of rewriting
    the tests, it was a fairly simple fix to add a random seed to GBT in MLlib,
    which required adding it to MLlib decision trees as well.
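
    To illustrate why the seeds have to match, here is a minimal sketch in plain
    Scala. It is not Spark code: `subsample` is a hypothetical helper that only
    assumes seeded Bernoulli-style sampling, which may differ from what the tree
    code actually does internally. Two runs agree exactly only when they are
    given the same seed.

    ```scala
    import scala.util.Random

    object SeededSubsampling {
      // Hypothetical helper (not a Spark API): keep each element with
      // probability `rate`, drawing from a generator initialised with `seed`.
      def subsample[T](data: Seq[T], rate: Double, seed: Long): Seq[T] = {
        val rng = new Random(seed)
        data.filter(_ => rng.nextDouble() < rate)
      }

      def main(args: Array[String]): Unit = {
        val data = 1 to 1000
        val a = subsample(data, rate = 0.5, seed = 42L)
        val b = subsample(data, rate = 0.5, seed = 42L)
        val c = subsample(data, rate = 0.5, seed = 0L)
        // Same seed: the two samples are identical element for element.
        println(s"same seed, same sample: ${a == b}")
        // Different seed: the samples will in general differ, so any model
        // trained on them can differ too.
        println(s"different seed, same sample: ${a == c}")
      }
    }
    ```

    If one side hard-codes its seed while the other takes it from a parameter,
    the two sides train on different subsamples, and a test that compares their
    results exactly can no longer be expected to pass.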


