Github user maropu commented on the issue:
https://github.com/apache/spark/pull/17174
I looked into the code and came up with another possible solution for this issue:
detect specific binary comparisons in the `Optimizer` and replace them with
specialized ones.
POC: https://github.com/apache/spark/compare/master...maropu:SPARK-19145
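A minimal sketch of the idea (illustrative only; the names `OptimizeBinaryComparisons` and `SpecializedLessThan` are hypothetical and not the actual POC code, and the `Cast` pattern arity varies across Spark versions): an optimizer rule pattern-matches a comparison where one operand is a string cast to timestamp and swaps in a specialized expression that avoids the per-row cast.

```scala
// Hedged sketch, not compile-checked against a specific Spark version.
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.TimestampType

object OptimizeBinaryComparisons extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Detect `CAST(strCol AS TIMESTAMP) < tsCol` and replace it with a
    // specialized comparison that skips the per-row cast.
    case LessThan(Cast(left, TimestampType), right)
        if right.dataType == TimestampType =>
      SpecializedLessThan(left, right) // hypothetical specialized expression
  }
}
```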
Quick benchmarks:
```
scala> spark.range(10000000L).selectExpr("CAST(current_timestamp AS string) c0", "current_timestamp + interval 1 day AS c1").repartition(1).write.parquet("/tmp/t")

scala> :paste
def timer[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0) + "s")
  result
}

// base number
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("CAST(c0 AS TIMESTAMP) < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 21.861871939s

// without this patch
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("CAST(c0 AS TIMESTAMP) < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 48.424095598s

// with this patch
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("c0 < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 22.459832991s
```
WDYT? cc: @hvanhovell @gatorsmile @tanejagagan