[ https://issues.apache.org/jira/browse/SPARK-35370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342559#comment-17342559 ]

Avenash Kabeera commented on SPARK-35370:
-----------------------------------------

Some additional details.  My model was saved in Parquet format, and after some 
research I found that Parquet column names should be treated case-insensitively.  
I confirmed this by trying a workaround: loading my model, renaming the column 
to "nodeData", and resaving it, but everything I tried ended up saving the model 
with the column "nodedata".  Given this case-insensitive behavior, doesn't it 
make more sense for the fix mentioned above, which is supposed to support 
loading Spark 2 models, to check for "nodedata" rather than "nodeData"?
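
For illustration, here is a minimal sketch (the model path is a placeholder, not 
taken from this report) of inspecting the schema that was actually written to 
disk and resolving the node-data column case-insensitively instead of by an 
exact "nodeData" match:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// "/path/to/model/data" is a placeholder: the tree node data of a saved
// RandomForestClassificationModel is typically stored under the model
// directory's "data" subfolder.
val nodeDf = spark.read.parquet("/path/to/model/data")
nodeDf.printSchema()  // shows whether the stored column is "nodeData" or "nodedata"

// A case-insensitive lookup sidesteps the exact-casing check that fails in the loader.
val nodeDataCol = nodeDf.schema.fieldNames
  .find(_.equalsIgnoreCase("nodeData"))
  .getOrElse(sys.error("no node data column found"))
{code}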

> IllegalArgumentException when loading a PipelineModel with Spark 3
> ------------------------------------------------------------------
>
>                 Key: SPARK-35370
>                 URL: https://issues.apache.org/jira/browse/SPARK-35370
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.1.0, 3.1.1
>         Environment: spark 3.1.1
>            Reporter: Avenash Kabeera
>            Priority: Minor
>              Labels: V3, decisiontree, scala, treemodels
>
> Hi, 
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-33398, 
> which fixed an exception when loading a model in Spark 3 that was trained in 
> Spark 2.  After incorporating this fix in my project, I ran into another 
> issue, which was introduced in the fix 
> [https://github.com/apache/spark/pull/30889/files].
> While loading my random forest model, which was trained in Spark 2.2, I ran 
> into the following exception:
> {code:java}
> 16:03:34 ERROR Instrumentation:73 - java.lang.IllegalArgumentException: nodeData does not exist. Available: treeid, nodedata
>  at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
>  at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147)
>  at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
>  at org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:522)
>  at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:420)
>  at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:410)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
>  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>  at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
>  at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337){code}
> When I looked at the data for the model, I saw that the schema uses 
> "*nodedata*" instead of "*nodeData*".  Here is what my model looks like:
> {code:java}
> +------+-----------------------------------------------------------------------------------------------------------------+
> |treeid|nodedata                                                                                                         |
> +------+-----------------------------------------------------------------------------------------------------------------+
> |12    |{0, 1.0, 0.20578590428109744, [249222.0, 1890856.0], 0.046774779237015784, 1, 128, {1, [0.7468856332819338], -1}}|
> |12    |{1, 1.0, 0.49179982674596906, [173902.0, 224985.0], 0.022860340952237657, 2, 65, {4, [0.6627218934911243], -1}}  |
> |12    |{2, 0.0, 0.4912259578159168, [90905.0, 69638.0], 0.10950848921275999, 3, 34, {9, [0.13666873125270484], -1}}     |
> |12    |{3, 1.0, 0.4308078797704775, [23317.0, 50941.0], 0.04311282777881931, 4, 19, {10, [0.506218002482692], -1}}      | {code}
> I'm new to Spark, and the training of this model predates me, so I can't say 
> whether specifying the column as "nodedata" was specific to our code or comes 
> from internal Spark code.  I suspect it's internal Spark code.


