Adrian Bridgett created SPARK-22006:
---------------------------------------
Summary: date/datetime comparisons should avoid casting
Key: SPARK-22006
URL: https://issues.apache.org/jira/browse/SPARK-22006
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.2.0
Reporter: Adrian Bridgett
Priority: Minor
I believe there's a relatively simple optimisation that can be done here:
comparing a timestamp column with a date involves a cast, whereas comparing
it with a datetime avoids the cast (and pushes the filter down into Parquet):
{code}
df.filter(df['local_read_at'] > datetime.date(2017, 1, 1)).count()
{code}
Results in a plan of:
{code}
+- *Filter (isnotnull(local_read_at#324) && (cast(local_read_at#324 as
string) > 2017-01-01))
+- *FileScan parquet [local_read_at#324] Batched: true, Format:
Parquet, Location: InMemoryFileIndex[s3a://...], PartitionFilters: [],
PushedFilters: [IsNotNull(local_read_at)], ReadSchema:
struct<local_read_at:timestamp>
{code}
Whereas:
{code}
df.filter(df['local_read_at'] > datetime.datetime(2017,1,1)).count()
{code}
Results in:
{code}
+- *Filter (isnotnull(local_read_at#324) && (local_read_at#324 >
1483228800000000))
+- *FileScan parquet [local_read_at#324] Batched: true, Format:
Parquet, Location: InMemoryFileIndex[s3a://...], PartitionFilters: [],
PushedFilters: [IsNotNull(local_read_at), GreaterThan(local_read_at,2017-01-01
00:00:00.0)], ReadSchema: struct<local_read_at:timestamp>
{code}
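Until such an optimisation exists, one possible workaround on the caller's side is to promote the date to a datetime at midnight before building the filter, so the comparison takes the second path shown above. The following is only a minimal sketch: the helper names `to_midnight` and `epoch_micros_utc` are hypothetical and not part of Spark, and the microsecond check assumes the naive datetime is interpreted as UTC.

```python
import calendar
import datetime


def to_midnight(value):
    """Promote a date to a datetime at 00:00:00; pass datetimes through unchanged."""
    # datetime.datetime is a subclass of datetime.date, so check it first.
    if isinstance(value, datetime.datetime):
        return value
    return datetime.datetime.combine(value, datetime.time.min)


def epoch_micros_utc(value):
    """Microseconds since the Unix epoch, treating the naive datetime as UTC."""
    return calendar.timegm(value.timetuple()) * 1_000_000 + value.microsecond


cutoff = to_midnight(datetime.date(2017, 1, 1))
# With a DataFrame `df` as in the report, the filter would then be:
#   df.filter(df['local_read_at'] > cutoff).count()
```

Under the UTC assumption, `epoch_micros_utc(cutoff)` evaluates to 1483228800000000, which matches the literal in the second plan.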