Imran Rashid created SPARK-22108:
------------------------------------

             Summary: Logical Inconsistency in Timestamp Cast
                 Key: SPARK-22108
                 URL: https://issues.apache.org/jira/browse/SPARK-22108
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Imran Rashid


When Spark compares a Timestamp and a String, the cast depends on the type of 
operation.  Equality casts the string to a timestamp; other comparisons, eg. <= 
or >= cast the timestamp to a String.  This can lead to logical 
inconsistencies, where {{a == b}} but not {{a <= b}}.  

The difference appears to be intentional, from SPARK-8420 : 
https://github.com/apache/spark/commit/a333a72e029d2546a66b36d6b3458e965430c530

{noformat}
... we casted `TimestampType` and `DataType` to `StringType` when it was 
involved
in a binary comparison with a `StringType`.  This allowed comparing a timestamp 
with
 a partial date as a user would expect.
 - `time > "2014-06-10"`
 - `time > "2014"`
{noformat}

However, this motivation seems flawed; casting to Strings also leads to other 
unnatural comparisons, eg. "2015" < "5" by string comparison -- this extra test 
case in [{{DateFunctionsSuite."timestamp comparison with date 
strings"}}|https://github.com/apache/spark/blob/3e6a714c9ee97ef13b3f2010babded3b63fd9d74/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala#L76]
 would fail:

{code}
    checkAnswer(
      df.select("t").filter($"t" >= "5"),
      Row(Date.valueOf("2015-01-01")) :: Nil)
{code}

Here's an example showing the logical inconsistency:

{noformat}
scala> spark.sql("""select
cast('2015-01-01' as timestamp) >= '2015',
cast('2015-01-01' as timestamp) <= '2015',
cast('2015-01-01' as timestamp) = '2015'""").show()
+-------------------------------------------------------
+-------------------------------------------------------
+---------------------------------------------------------+
|(CAST(CAST(2015-01-01 AS TIMESTAMP) AS STRING) >= 2015)
|(CAST(CAST(2015-01-01 AS TIMESTAMP) AS STRING) <= 2015)
|(CAST(2015-01-01 AS TIMESTAMP) = CAST(2015 AS TIMESTAMP))|
+-------------------------------------------------------
+-------------------------------------------------------
+---------------------------------------------------------+
|                                                   true
|                                                  false
|                                                     true|
+-------------------------------------------------------
+-------------------------------------------------------
+---------------------------------------------------------+
{noformat}

FWIW, I looked at the sql spec and other engines.  As near as I can tell, the 
SQL spec does not specify what implicit casts should be applied (but I could be 
wrong).  Hive always returns null.  Postgres has a few more cases: for literal 
inputs, it doesn't accept anything that is just a year, eg "2015".  And when 
joining two different tables, it forces the user to explicitly specify the cast.

{noformat}
psql (9.6.2)
Type "help" for help.

irashid=# select
irashid-# cast('2015-01-01' as timestamp) >= '2015',
irashid-# cast('2015-01-01' as timestamp) <= '2015',
irashid-# cast('2015-01-01' as timestamp) = '2015'
irashid-# ;
ERROR:  invalid input syntax for type timestamp: "2015"
LINE 2: cast('2015-01-01' as timestamp) >= '2015',
                                           ^
irashid=# select
irashid-# cast('2015-01-01' as timestamp) >= '2015-01-01',
irashid-# cast('2015-01-01' as timestamp) <= '2015-01-01',
irashid-# cast('2015-01-01' as timestamp) = '2015-01-01'
irashid-# ;
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | t        | t
(1 row)



irashid=# create table cast_test_ts (x varchar, ts timestamp);
CREATE TABLE
irashid=# insert into cast_test_ts (x, ts) values ('2015-01-01', '2015-01-01');
INSERT 0 1
irashid=# select * from cast_test_ts;
     x      |         ts          
------------+---------------------
 2015-01-01 | 2015-01-01 00:00:00
(1 row)

irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (a.x = 
b.ts);
ERROR:  operator does not exist: character varying = timestamp without time zone
LINE 1: ....ts from cast_test_ts a join cast_test_ts b on (a.x = b.ts);
                                                               ^
HINT:  No operator matches the given name and argument type(s). You might need 
to add explicit type casts.
irashid=# 
irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (cast(a.x 
as timestamp) = b.ts);
     x      |         ts          
------------+---------------------
 2015-01-01 | 2015-01-01 00:00:00
(1 row)

irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (a.x = 
cast(b.ts as varchar));
 x | ts 
---+----
(0 rows)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to