[
https://issues.apache.org/jira/browse/SPARK-48652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-48652:
----------------------------------
Priority: Critical (was: Blocker)
> Casting Issue in Spark SQL: String Column Compared to Integer Value Yields
> Empty Results
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-48652
> URL: https://issues.apache.org/jira/browse/SPARK-48652
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 3.3.2
> Reporter: Abhishek Singh
> Priority: Critical
> Labels: newbie, pull-request-available
>
> In Spark SQL, comparing a string column to an integer value can lead to
> unexpected results due to type casting resulting in an empty result set.
> {code:java}
> case class Person(id: String, name: String)
> val personDF = Seq(Person("a", "amit"), Person("b", "abhishek")).toDF()
> personDF.createOrReplaceTempView("person_ddf")
> val sqlQuery = "SELECT * FROM person_ddf WHERE id <> -1"
> val resultDF = spark.sql(sqlQuery)
> resultDF.show() // Empty result due to type casting issue
> {code}
> Below is the logical and physical plan which I m getting
> {code:java}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter NOT ('id = -1)
> +- 'UnresolvedRelation [person_ddf], [], false
> == Analyzed Logical Plan ==
> id: string, name: string
> Project [id#356, name#357]
> +- Filter NOT (cast(id#356 as int) = -1)
> +- SubqueryAlias person_ddf
> +- View (`person_ddf`, [id#356,name#357])
> +- LocalRelation [id#356, name#357]{code}
> *But when I m using the same query and table in Redshift which is based on
> PostGreSQL. I am getting the desired result.*
> {code:java}
> select * from person where id <> -1; {code}
> Explain plan obtained in Redshift.
> {code:java}
> XN Seq Scan on person (cost=0.00..0.03 rows=1 width=336)
> Filter: ((id)::text <> '-1'::text) {code}
>
> In the execution plan for Spark, the ID column is cast as an integer, while
> in Redshift, the ID column is cast as a varchar.
> Shouldn't Spark SQL handle this the same way as Redshift, using the datatype
> of the ID column rather than the datatype of -1?
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]