[ https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677752#comment-16677752 ]
shahid commented on SPARK-25959: -------------------------------- Thanks. I will analyze the issue. > Difference in featureImportances results on computed vs saved models > -------------------------------------------------------------------- > > Key: SPARK-25959 > URL: https://issues.apache.org/jira/browse/SPARK-25959 > Project: Spark > Issue Type: Bug > Components: ML, MLlib > Affects Versions: 2.2.0 > Reporter: Suraj Nayak > Priority: Major > > I tried to implement GBT and found that the feature Importance computed while > the model was fit is different when the same model was saved into a storage > and loaded back. > > I also found that once the persistent model is loaded and saved back again > and loaded, the feature importance remains the same. > > Not sure if its bug while storing and reading the model first time or am > missing some parameter that need to be set before saving the model (thus > model is picking some defaults - causing feature importance to change) > > *Below is the test code:* > val testDF = Seq( > (1, 3, 2, 1, 1), > (3, 2, 1, 2, 0), > (2, 2, 1, 1, 0), > (3, 4, 2, 2, 0), > (2, 2, 1, 3, 1) > ).toDF("a", "b", "c", "d", "e") > val featureColumns = testDF.columns.filter(_ != "e") > // Assemble the features into a vector > val assembler = new > VectorAssembler().setInputCols(featureColumns).setOutputCol("features") > // Transform the data to get the feature data set > val featureDF = assembler.transform(testDF) > // Train a GBT model. > val gbt = new GBTClassifier() > .setLabelCol("e") > .setFeaturesCol("features") > .setMaxDepth(2) > .setMaxBins(5) > .setMaxIter(10) > .setSeed(10) > .fit(featureDF) > gbt.transform(featureDF).show(false) > // Write out the model > featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* Prints > (d,0.5931875075767403) > (a,0.3747184548362353) > (b,0.03209403758702444) > (c,0.0) > */ > gbt.write.overwrite().save("file:///tmp/test123") > println("Reading model again") > val gbtload = GBTClassificationModel.load("file:///tmp/test123") > featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* > Prints > (d,0.6455841215290767) > (a,0.3316126797964181) > (b,0.022803198674505094) > (c,0.0) > */ > gbtload.write.overwrite().save("file:///tmp/test123_rewrite") > val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite") > featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* prints > (d,0.6455841215290767) > (a,0.3316126797964181) > (b,0.022803198674505094) > (c,0.0) > */ -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org