[ https://issues.apache.org/jira/browse/SPARK-49615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu reopened SPARK-49615:
--------------------------------
> Feature transformers are case sensitive when unintended
> -------------------------------------------------------
>
> Key: SPARK-49615
> URL: https://issues.apache.org/jira/browse/SPARK-49615
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib, Spark Core
> Affects Versions: 3.4.3
> Reporter: Chhavi Bansal
> Assignee: Weichen Xu
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Hi team,
> The feature transformers documented at
> https://spark.apache.org/docs/latest/ml-features
> are case sensitive even though the configuration
>
> {code:java}
> spark.conf.get("spark.sql.caseSensitive") {code}
>
> is set to false. Users of all these transformers are forced to match the
> exact case of the column name in the DataFrame.
>
> {code:java}
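> import scala.collection.JavaConverters._
> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
>
> // Sample rows; the fourth field maps to the mixed-case column below.
> val data = List(
>   Row("the movie was great", "positive", 10, "greatest of all time"),
>   Row("the movie was average", "negative", 11, "just average things, average storyline"),
>   Row("movie was fun", "positive", 2, "superb screen play"))
> val schema = new StructType()
>   .add("comments", StringType, true)
>   .add("reviews", StringType, true)
>   .add("counts", IntegerType, true)
>   .add("Additional_COMMENTS", StringType, true)
> val df = spark.createDataFrame(data.asJava, schema)
> // The input column is given in lower case, unlike the schema.
> val si = new StringIndexer()
>   .setInputCol("additional_comments")
>   .setOutputCol("si_additional_comments")
> si.fit(df).transform(df).show()
> {code}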
> The above code fails with
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column additional_comments does not exist.
>   at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>   at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
>   at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
>   at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
>   at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema(StringIndexer.scala:123)
>   at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema$(StringIndexer.scala:115)
> {code}
> This means the input column must be given as "Additional_COMMENTS", in
> exactly the same case as in the DataFrame schema.
>
> When spark.sql.caseSensitive is set to false, I think the transformers
> should accept column names in any case.
>
> Can someone please help fix this bug for all transformers?
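
The behavior requested above can be sketched in plain Scala. This is only an illustration of the idea, not the fix that went into Spark: the object ColumnResolution and its method resolveColumn are hypothetical names, and the caseSensitive flag stands in for the value of spark.sql.caseSensitive. Ambiguity between columns that differ only by case is not handled here.

```scala
// Hypothetical helper, not part of Spark: resolves a requested column name
// against a schema's field names, honoring a case-sensitivity flag.
object ColumnResolution {
  def resolveColumn(
      fieldNames: Seq[String],
      requested: String,
      caseSensitive: Boolean): Option[String] = {
    if (caseSensitive) {
      // Only an exact match is accepted.
      fieldNames.find(_ == requested)
    } else {
      // Prefer an exact match, then fall back to a case-insensitive one.
      fieldNames.find(_ == requested)
        .orElse(fieldNames.find(_.equalsIgnoreCase(requested)))
    }
  }
}
```

Until a fix is available, a similar lookup over df.columns could serve as a workaround: resolve the name to the case actually used in the schema and pass the result to setInputCol.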