[
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-25959:
------------------------------------
Assignee: Apache Spark
> Difference in featureImportances results on computed vs saved models
> --------------------------------------------------------------------
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.2.0
> Reporter: Suraj Nayak
> Assignee: Apache Spark
> Priority: Major
>
> I tried to implement GBT and found that the feature Importance computed while
> the model was fit is different when the same model was saved into a storage
> and loaded back.
>
> I also found that once the persistent model is loaded and saved back again
> and loaded, the feature importance remains the same.
>
> Not sure if its bug while storing and reading the model first time or am
> missing some parameter that need to be set before saving the model (thus
> model is picking some defaults - causing feature importance to change)
>
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]