Chhavi Bansal created SPARK-49615:
-------------------------------------

             Summary: Feature transformers are case sensitive when unintended
                 Key: SPARK-49615
                 URL: https://issues.apache.org/jira/browse/SPARK-49615
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib, Spark Core
    Affects Versions: 3.4.3
            Reporter: Chhavi Bansal


Hi team,

https://spark.apache.org/docs/latest/ml-features

The feature transformers are case sensitive even though the configuration 

 
{code:java}
spark.conf.get("spark.sql.caseSensitive") {code}
 

is set to false. Users of these transformers are forced to match the exact 
case of the column name in the DataFrame.

 
{code:java}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import scala.collection.JavaConverters._

val data = List(
  Row("the movie was great", "positive", 10, "greatest of all time"),
  Row("the movie was average", "negative", 11, "just average things, average storyline"),
  Row("movie was fun", "positive", 2, "superb screen play"))
val schema = new StructType()
  .add("comments", StringType, true)
  .add("reviews", StringType, true)
  .add("counts", IntegerType, true)
  .add("Additional_COMMENTS", StringType, true)
val df = spark.createDataFrame(data.asJava, schema)
// The input column name differs from the schema name only in case.
val si = new StringIndexer()
  .setInputCol("additional_comments")
  .setOutputCol("si_additional_comments")
si.fit(df).transform(df).show() {code}
The above code fails with:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Input column additional_comments does not exist.
    at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
    at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema(StringIndexer.scala:123)
    at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema$(StringIndexer.scala:115)
 {code}
This means that the column "additional_comments" must be provided in exactly 
the same case as it appears in the DataFrame schema ("Additional_COMMENTS"). 

 

I think that when spark.sql.caseSensitive is set to false, we should be able 
to refer to column names in any case.
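
To illustrate the behaviour I am asking for, here is a minimal sketch of a case-insensitive lookup that could be applied on the caller side in the meantime. ColumnResolver and resolve are hypothetical names of my own and are not part of Spark. The sketch assumes that a name matching more than one column (for example both "a" and "A") should be treated as unresolved rather than guessed.

{code:scala}
// Hypothetical helper (not a Spark API): resolve a requested column name
// against the actual column names, ignoring case.
object ColumnResolver {
  def resolve(columns: Seq[String], requested: String): Option[String] = {
    val matches = columns.filter(_.equalsIgnoreCase(requested))
    // Return a name only when exactly one column matches; otherwise None.
    if (matches.size == 1) Some(matches.head) else None
  }
}
{code}

With the DataFrame above, ColumnResolver.resolve(df.columns.toSeq, "additional_comments") would give the schema's own spelling, which can then be passed to setInputCol. This only works around the problem for one call site; it does not change how the transformers validate their input columns.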

 


Can someone please help fix this bug for all transformers?



