[jira] [Commented] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath

Jiaqi Guo (Jira) Fri, 27 Sep 2019 12:21:15 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939712#comment-16939712
 ]


Jiaqi Guo commented on SPARK-29232:
-----------------------------------

[~aman_omer], here is [an example from the Spark documentation|https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression].
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestRegressor()
  .setNumTrees(5)
  .setMaxDepth(10)
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

// Chain indexer and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
{code}
This gives you a random forest model called rfModel. I modified the max depth 
to 10 for the trees. 
{code:java}
rfModel.extractParamMap()
// Printout
res23: org.apache.spark.ml.param.ParamMap = { rfr_8197914ca605-cacheNodeIds: 
false, rfr_8197914ca605-checkpointInterval: 10, 
rfr_8197914ca605-featureSubsetStrategy: auto, rfr_8197914ca605-featuresCol: 
indexedFeatures, rfr_8197914ca605-impurity: variance, 
rfr_8197914ca605-labelCol: label, rfr_8197914ca605-maxBins: 32, 
rfr_8197914ca605-maxDepth: 10, rfr_8197914ca605-maxMemoryInMB: 256, 
rfr_8197914ca605-minInfoGain: 0.0, rfr_8197914ca605-minInstancesPerNode: 1, 
rfr_8197914ca605-numTrees: 5, rfr_8197914ca605-predictionCol: prediction, 
rfr_8197914ca605-seed: 235498149, rfr_8197914ca605-subsamplingRate: 1.0 }
{code}
As you can see, the maxDepth here is correct (10). However, look at what happens when we check the parameter map of one of the trees:
{code:java}
rfModel.trees(0).extractParamMap()
// Printout
res22: org.apache.spark.ml.param.ParamMap = { dtr_bfcfc13f1334-cacheNodeIds: 
false, dtr_bfcfc13f1334-checkpointInterval: 10, dtr_bfcfc13f1334-featuresCol: 
features, dtr_bfcfc13f1334-impurity: variance, dtr_bfcfc13f1334-labelCol: 
label, dtr_bfcfc13f1334-maxBins: 32, dtr_bfcfc13f1334-maxDepth: 5, 
dtr_bfcfc13f1334-maxMemoryInMB: 256, dtr_bfcfc13f1334-minInfoGain: 0.0, 
dtr_bfcfc13f1334-minInstancesPerNode: 1, dtr_bfcfc13f1334-predictionCol: 
prediction, dtr_bfcfc13f1334-seed: 1366634793 }
{code}
The max depth stays at the default value of 5, and featuresCol is likewise the default "features" rather than "indexedFeatures". In fact, the parameter maps of the individual trees contain only the default decision tree values.
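
A quick way to see the mismatch side by side is to print, for each tree, the maxDepth recorded in its parameter map next to the depth the tree actually has. This is a minimal sketch that assumes the same spark-shell session as above, so rfModel is already defined. The last part, which copies the forest's maxDepth onto each tree via copy(), is only a possible workaround and not something Spark does itself; the names forestMaxDepth and patchedTrees are purely illustrative.
{code:java}
import org.apache.spark.ml.param.ParamMap

// maxDepth as configured on the forest (10 in the example above).
val forestMaxDepth = rfModel.getMaxDepth

// Compare the param-map value with the depth each tree really has.
rfModel.trees.zipWithIndex.foreach { case (tree, i) =>
  println(s"tree $i: param maxDepth = ${tree.getMaxDepth}, " +
    s"actual depth = ${tree.depth}, forest maxDepth = $forestMaxDepth")
}

// Possible workaround (an assumption, not an official fix): copy each tree
// with the forest's maxDepth written into the copy's param map.
val patchedTrees = rfModel.trees.map { tree =>
  tree.copy(ParamMap(tree.maxDepth -> forestMaxDepth))
}
println(patchedTrees.head.getMaxDepth)
{code}
Note that this only patches maxDepth. Any other parameter set on the forest (featuresCol, maxBins, and so on) would need the same treatment before the trees' parameter maps could be trusted.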

> RandomForestRegressionModel does not update the parameter maps of the 
> DecisionTreeRegressionModels underneath
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29232
>                 URL: https://issues.apache.org/jira/browse/SPARK-29232
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: Jiaqi Guo
>            Priority: Major
>
> We trained a RandomForestRegressionModel, and tried to access the trees. Even 
> though the DecisionTreeRegressionModel is correctly built with the proper 
> parameters from random forest, the parameter map is not updated, and still 
> contains only the default value. 
> For example, if a RandomForestRegressor was trained with maxDepth of 12, then 
> accessing the tree information, extractParamMap still returns the default 
> values, with maxDepth=5. Calling the depth itself of 
> DecisionTreeRegressionModel returns the correct value of 12 though.
> This creates issues when we want to access each individual tree and build the 
> trees back up for the random forest estimator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

