[ 
https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6201:
---------------------------------
    Description: 
Suppose we have the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq("{\"a\": \"1\"}", "{\"a\": \"2\"}", "{\"a\": \"3\"}"))).registerTempTable("d")
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
=>
Array([1], [2])
{code}

Here d.a and the constants 1, 2 are first cast to Double and then compared, as the plan shows:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType)))
{noformat}
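As a plain-Scala sanity check (an illustration of the plan's semantics, not Spark internals), the Double coercion explains why the string "1" matches the integer literal 1:

```scala
// Illustration only: mimic the plan's CAST-to-Double semantics.
// Both the String column value and the Int literal become Double,
// so "1" and 1 compare equal after the cast.
object CastCheck {
  def main(args: Array[String]): Unit = {
    val columnValue = "1"   // d.a is a String
    val literal     = 1     // the Int literal in the predicate
    val matches = columnValue.toDouble == literal.toDouble
    println(matches)        // true
  }
}
```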

However, if I use

{code}
sql("select * from d where d.a in (1,2)").collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}
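A plausible explanation (an assumption about the InSet predicate's internals, not verified against the source): the optimized IN check does a raw set-membership test on the unconverted values, and in plain Scala a String never equals an Int:

```scala
// Sketch of the suspected root cause: membership in a Set[Any] built
// from the Int literals, checked against the String column value
// without any coercion.
object InsetCheck {
  def main(args: Array[String]): Unit = {
    val hset: Set[Any] = Set(1, 2)  // literals stay Ints
    println(hset.contains("1"))     // false: String "1" != Int 1
    println(hset.contains(1))       // true
  }
}
```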


*It seems the INSET implementation in Spark SQL doesn't coerce types implicitly, whereas Hive does. We should make Spark SQL conform to Hive's behavior, even though implicit coercion is confusing when comparing a String with an Int.*
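Until that is fixed, a workaround sketch (untested against a cluster; the idea is to make the IN-list literals share the column's type, e.g. `d.a in ('1', '2')` or `cast(d.a as double) in (1, 2)`). The principle in plain Scala:

```scala
// Workaround principle: when the literals share the column's String
// type, the raw membership check succeeds without any coercion.
object WorkaroundPrinciple {
  def main(args: Array[String]): Unit = {
    val typedSet: Set[Any] = Set("1", "2")  // literals as Strings
    println(typedSet.contains("1"))         // true: same types
  }
}
```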

Jianshi


  was:
Suppose we have the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq("{\"a\": \"1\"}", "{\"a\": \"2\"}", "{\"a\": \"3\"}"))).registerTempTable("d")
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
=>
Array([1], [2])
{code}

Here d.a and the constants 1, 2 are first cast to Double and then compared, as the plan shows:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType)))
{noformat}

However, if I use

{code}
sql("select * from d where d.a in (1,2)").collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}

*But it seems the INSET implementation in Spark SQL doesn't coerce types implicitly, whereas Hive does.*

Jianshi



> INSET should coerce types
> -------------------------
>
>                 Key: SPARK-6201
>                 URL: https://issues.apache.org/jira/browse/SPARK-6201
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.3.0, 1.2.1
>            Reporter: Jianshi Huang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
