Imran Rashid created SPARK-22108:
------------------------------------
Summary: Logical Inconsistency in Timestamp Cast
Key: SPARK-22108
URL: https://issues.apache.org/jira/browse/SPARK-22108
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: Imran Rashid
When Spark compares a Timestamp and a String, the cast depends on the type of
operation. Equality casts the string to a timestamp; other comparisons, eg. <=
or >= cast the timestamp to a String. This can lead to logical
inconsistencies, where {{a == b}} but not {{a <= b}}.
The difference appears to be intentional, from SPARK-8420 :
https://github.com/apache/spark/commit/a333a72e029d2546a66b36d6b3458e965430c530
{noformat}
... we casted `TimestampType` and `DataType` to `StringType` when it was
involved
in a binary comparison with a `StringType`. This allowed comparing a timestamp
with
a partial date as a user would expect.
- `time > "2014-06-10"`
- `time > "2014"`
{noformat}
However, this motivation seems flawed; casting to Strings also leads to other
unnatural comparisons, eg. "2015" < "5" by string comparison -- this extra test
case in [{{DateFunctionsSuite."timestamp comparison with date
strings"}}|https://github.com/apache/spark/blob/3e6a714c9ee97ef13b3f2010babded3b63fd9d74/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala#L76]
would fail:
{code}
checkAnswer(
df.select("t").filter($"t" >= "5"),
Row(Date.valueOf("2015-01-01")) :: Nil)
{code}
Here's an example showing the logical inconsistency:
{noformat}
scala> spark.sql("""select
cast('2015-01-01' as timestamp) >= '2015',
cast('2015-01-01' as timestamp) <= '2015',
cast('2015-01-01' as timestamp) = '2015'""").show()
+-------------------------------------------------------
+-------------------------------------------------------
+---------------------------------------------------------+
|(CAST(CAST(2015-01-01 AS TIMESTAMP) AS STRING) >= 2015)
|(CAST(CAST(2015-01-01 AS TIMESTAMP) AS STRING) <= 2015)
|(CAST(2015-01-01 AS TIMESTAMP) = CAST(2015 AS TIMESTAMP))|
+-------------------------------------------------------
+-------------------------------------------------------
+---------------------------------------------------------+
| true
| false
| true|
+-------------------------------------------------------
+-------------------------------------------------------
+---------------------------------------------------------+
{noformat}
FWIW, I looked at the sql spec and other engines. As near as I can tell, the
SQL spec does not specify what implicit casts should be applied (but I could be
wrong). Hive always returns null. Postgres has a few more cases: for literal
inputs, it doesn't accept anything that is just a year, eg "2015". And when
joining two different tables, it forces the user to explicitly specify the cast.
{noformat}
psql (9.6.2)
Type "help" for help.
irashid=# select
irashid-# cast('2015-01-01' as timestamp) >= '2015',
irashid-# cast('2015-01-01' as timestamp) <= '2015',
irashid-# cast('2015-01-01' as timestamp) = '2015'
irashid-# ;
ERROR: invalid input syntax for type timestamp: "2015"
LINE 2: cast('2015-01-01' as timestamp) >= '2015',
^
irashid=# select
irashid-# cast('2015-01-01' as timestamp) >= '2015-01-01',
irashid-# cast('2015-01-01' as timestamp) <= '2015-01-01',
irashid-# cast('2015-01-01' as timestamp) = '2015-01-01'
irashid-# ;
?column? | ?column? | ?column?
----------+----------+----------
t | t | t
(1 row)
irashid=# create table cast_test_ts (x varchar, ts timestamp);
CREATE TABLE
irashid=# insert into cast_test_ts (x, ts) values ('2015-01-01', '2015-01-01');
INSERT 0 1
irashid=# select * from cast_test_ts;
x | ts
------------+---------------------
2015-01-01 | 2015-01-01 00:00:00
(1 row)
irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (a.x =
b.ts);
ERROR: operator does not exist: character varying = timestamp without time zone
LINE 1: ....ts from cast_test_ts a join cast_test_ts b on (a.x = b.ts);
^
HINT: No operator matches the given name and argument type(s). You might need
to add explicit type casts.
irashid=#
irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (cast(a.x
as timestamp) = b.ts);
x | ts
------------+---------------------
2015-01-01 | 2015-01-01 00:00:00
(1 row)
irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (a.x =
cast(b.ts as varchar));
x | ts
---+----
(0 rows)
{noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]