Github user hindog commented on the issue:
https://github.com/apache/spark/pull/17174
I believe another performance impact here comes from the `cast` expression failing to match during filter pushdown: when a cast is involved, the filter on the timestamp will NOT get pushed down to the reader. You can see this in the [original ticket's](https://issues.apache.org/jira/browse/SPARK-19145) query plan output for both versions:
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
>                +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>
Note that `PushedFilters` is missing the `GreaterThanOrEqual` and `LessThanOrEqual` predicates in the former query but includes them in the latter. I've narrowed down where the pushed filters get dropped to [DataSourceStrategy.translateFilter](https://github.com/apache/spark/blob/1a29fec8e278a98e69f2e2b6faa11332e8550f30/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L439), where there is no case matching the `Cast` expression, so the cast-wrapped filter is removed from the list of predicates to be pushed down.
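To illustrate the mechanism, here's a minimal, self-contained sketch (not Spark's actual code; the case classes and the `translateFilter` body are simplified stand-ins) of why the `Cast` blocks translation: the pattern match expects a bare attribute reference on one side of the comparison, so a cast-wrapped attribute falls through to the catch-all and no data source filter is produced.

```scala
// Simplified analogues of Catalyst expressions (hypothetical, for illustration).
sealed trait Expression
case class Attribute(name: String) extends Expression
case class Literal(value: Any) extends Expression
case class Cast(child: Expression, dataType: String) extends Expression
case class GreaterThanOrEqual(left: Expression, right: Expression) extends Expression

// Sketch of the translateFilter idea: only a plain attribute-vs-literal
// comparison translates into a pushable data source filter.
def translateFilter(predicate: Expression): Option[String] = predicate match {
  case GreaterThanOrEqual(Attribute(name), Literal(v)) =>
    Some(s"GreaterThanOrEqual($name,$v)")
  // There is no case for Cast(...), so a cast-wrapped attribute
  // matches nothing and the predicate is dropped from pushdown.
  case _ => None
}

// time >= 1483404831000000  -> translated, gets pushed down
val direct = translateFilter(
  GreaterThanOrEqual(Attribute("time"), Literal(1483404831000000L)))

// cast(time as string) >= '2017-01-02 19:53:51'  -> NOT pushed down
val casted = translateFilter(
  GreaterThanOrEqual(Cast(Attribute("time"), "string"),
    Literal("2017-01-02 19:53:51")))

println(direct) // Some(GreaterThanOrEqual(time,1483404831000000))
println(casted) // None
```

This matches the plans above: the direct timestamp comparison yields a pushable filter, while the cast-wrapped one leaves only `IsNotNull(time)` in `PushedFilters`.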