Adrian Bridgett created SPARK-22006:
---------------------------------------

             Summary: date/datetime comparisons should avoid casting
                 Key: SPARK-22006
                 URL: https://issues.apache.org/jira/browse/SPARK-22006
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Adrian Bridgett
            Priority: Minor


I believe there's a relatively simple optimisation that can be done here - 
comparing timestamps with dates involves a cast whereas comparing with 
datetimes avoids this (and pushes the query down into parquet:

{code}
df.filter(df['local_read_at'] > datetime.date('2017-01-01')).count()
{code}
Results in a plan of:
{code}
         +- *Filter (isnotnull(local_read_at#324) && (cast(local_read_at#324 as 
string) > 2017-01-01))
            +- *FileScan parquet [local_read_at#324] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[s3a://...], PartitionFilters: [], 
PushedFilters: [IsNotNull(local_read_at)], ReadSchema: 
struct<local_read_at:timestamp>
{code}

Whereas:
{code}
df.filter(df['local_read_at'] > datetime.datetime(2017,1,1)).count()
{code}
Results in:
{code}
         +- *Filter (isnotnull(local_read_at#324) && (local_read_at#324 > 
1483228800000000))
            +- *FileScan parquet [local_read_at#324] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[s3a://...], PartitionFilters: [], 
PushedFilters: [IsNotNull(local_read_at), GreaterThan(local_read_at,2017-01-01 
00:00:00.0)], ReadSchema: struct<local_read_at:timestamp>
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to