Github user hindog commented on the issue:
https://github.com/apache/spark/pull/17174
I believe another performance impact here comes from the `cast` expression failing to match during filter pushdown: when a cast is involved, the filter on the timestamp will NOT get pushed down to the reader. You can see this in the [original ticket's](https://issues.apache.org/jira/browse/SPARK-19145) query plan output for both versions:
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
>                +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>
Note that `PushedFilters` is missing the `GreaterThanOrEqual` and `LessThanOrEqual` predicates in the former query but includes them in the latter. I've narrowed down where the pushed filters get dropped to [DataSourceStrategy.translateFilter](https://github.com/apache/spark/blob/1a29fec8e278a98e69f2e2b6faa11332e8550f30/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L439), where there is no case matching the `Cast` expression, so the cast-wrapped filter is removed from the list of predicates to be pushed down.
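To illustrate the mechanism, here's a minimal, self-contained sketch (not Spark's actual code; the case classes and the `translateFilter` body are simplified stand-ins) of why the `Cast` blocks translation: the pattern match expects a bare attribute reference on one side of the comparison, so a cast-wrapped attribute falls through to the catch-all and no data source filter is produced.

```scala
// Simplified analogues of Catalyst expressions (hypothetical, for illustration).
sealed trait Expression
case class Attribute(name: String) extends Expression
case class Literal(value: Any) extends Expression
case class Cast(child: Expression, dataType: String) extends Expression
case class GreaterThanOrEqual(left: Expression, right: Expression) extends Expression

// Sketch of the translateFilter idea: only a plain attribute-vs-literal
// comparison translates into a pushable data source filter.
def translateFilter(predicate: Expression): Option[String] = predicate match {
  case GreaterThanOrEqual(Attribute(name), Literal(v)) =>
    Some(s"GreaterThanOrEqual($name,$v)")
  // There is no case for Cast(...), so a cast-wrapped attribute
  // matches nothing and the predicate is dropped from pushdown.
  case _ => None
}

// time >= 1483404831000000  -> translated, gets pushed down
val direct = translateFilter(
  GreaterThanOrEqual(Attribute("time"), Literal(1483404831000000L)))

// cast(time as string) >= '2017-01-02 19:53:51'  -> NOT pushed down
val casted = translateFilter(
  GreaterThanOrEqual(Cast(Attribute("time"), "string"),
    Literal("2017-01-02 19:53:51")))

println(direct) // Some(GreaterThanOrEqual(time,1483404831000000))
println(casted) // None
```

This matches the plans above: the direct timestamp comparison yields a pushable filter, while the cast-wrapped one leaves only `IsNotNull(time)` in `PushedFilters`.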