[
https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939712#comment-16939712
]
Jiaqi Guo commented on SPARK-29232:
-----------------------------------
[~aman_omer], here is [an example from the Spark documentation|https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression].
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestRegressor()
  .setNumTrees(5)
  .setMaxDepth(10)
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
// Chain indexer and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))
// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
{code}
This gives you a random forest model called rfModel, trained with maxDepth set
to 10 instead of the default 5.
{code:java}
rfModel.extractParamMap()
// Printout
res23: org.apache.spark.ml.param.ParamMap = { rfr_8197914ca605-cacheNodeIds:
false, rfr_8197914ca605-checkpointInterval: 10,
rfr_8197914ca605-featureSubsetStrategy: auto, rfr_8197914ca605-featuresCol:
indexedFeatures, rfr_8197914ca605-impurity: variance,
rfr_8197914ca605-labelCol: label, rfr_8197914ca605-maxBins: 32,
rfr_8197914ca605-maxDepth: 10, rfr_8197914ca605-maxMemoryInMB: 256,
rfr_8197914ca605-minInfoGain: 0.0, rfr_8197914ca605-minInstancesPerNode: 1,
rfr_8197914ca605-numTrees: 5, rfr_8197914ca605-predictionCol: prediction,
rfr_8197914ca605-seed: 235498149, rfr_8197914ca605-subsamplingRate: 1.0 }
{code}
As you can see, the maxDepth reported by the forest is correct. However, check
the parameter map of one of its trees:
{code:java}
rfModel.trees(0).extractParamMap()
// Printout
res22: org.apache.spark.ml.param.ParamMap = { dtr_bfcfc13f1334-cacheNodeIds:
false, dtr_bfcfc13f1334-checkpointInterval: 10, dtr_bfcfc13f1334-featuresCol:
features, dtr_bfcfc13f1334-impurity: variance, dtr_bfcfc13f1334-labelCol:
label, dtr_bfcfc13f1334-maxBins: 32, dtr_bfcfc13f1334-maxDepth: 5,
dtr_bfcfc13f1334-maxMemoryInMB: 256, dtr_bfcfc13f1334-minInfoGain: 0.0,
dtr_bfcfc13f1334-minInstancesPerNode: 1, dtr_bfcfc13f1334-predictionCol:
prediction, dtr_bfcfc13f1334-seed: 1366634793 }
{code}
The tree's maxDepth stays at the default value of 5, and its featuresCol is
features rather than indexedFeatures. In fact, the parameter maps of the
individual trees only report the default decision tree values.
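To see the mismatch across all trees at once, here is a minimal sketch that continues the spark-shell session above. compareDepths and treesWithForestMaxDepth are hypothetical helper names, not Spark APIs. The copy-based workaround is only an assumption about how the reported values could be patched on the caller's side. It is not the fix for this issue, and it only covers maxDepth.
{code:java}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, RandomForestRegressionModel}

// Hypothetical helper: print the forest-level maxDepth param next to each tree's
// reported maxDepth param and the depth that tree was actually grown to.
def compareDepths(forest: RandomForestRegressionModel): Unit = {
  println(s"forest maxDepth param = ${forest.getMaxDepth}")
  forest.trees.zipWithIndex.foreach { case (tree, i) =>
    println(s"tree $i: maxDepth param = ${tree.getMaxDepth}, actual depth = ${tree.depth}")
  }
}

// Hypothetical workaround: copy each tree and write the forest's maxDepth into the
// copy's param map. The pair must be keyed by the tree's own maxDepth Param.
def treesWithForestMaxDepth(
    forest: RandomForestRegressionModel): Array[DecisionTreeRegressionModel] = {
  forest.trees.map { tree =>
    tree.copy(ParamMap(tree.maxDepth -> forest.getMaxDepth))
  }
}

compareDepths(rfModel)
val patchedTrees = treesWithForestMaxDepth(rfModel)
patchedTrees.foreach(t => println(s"patched maxDepth param = ${t.getMaxDepth}"))
{code}
Other parameters that differ, such as featuresCol, could be passed in the same ParamMap if they are needed when rebuilding the trees.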
> RandomForestRegressionModel does not update the parameter maps of the
> DecisionTreeRegressionModels underneath
> -------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-29232
> URL: https://issues.apache.org/jira/browse/SPARK-29232
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.0
> Reporter: Jiaqi Guo
> Priority: Major
>
> We trained a RandomForestRegressionModel, and tried to access the trees. Even
> though the DecisionTreeRegressionModel is correctly built with the proper
> parameters from random forest, the parameter map is not updated, and still
> contains only the default value.
> For example, if a RandomForestRegressor was trained with maxDepth of 12, then
> accessing the tree information, extractParamMap still returns the default
> values, with maxDepth=5. Calling the depth itself of
> DecisionTreeRegressionModel returns the correct value of 12 though.
> This creates issues when we want to access each individual tree and build the
> trees back up for the random forest estimator.