Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/17174
  
    I looked into the code and came up with another solution for this issue: 
detect specific binary comparisons in `Optimizer`, and then replace them with 
specialized ones.
    POC: https://github.com/apache/spark/compare/master...maropu:SPARK-19145
    
    quick benchmarks
    ```
    scala> spark.range(10000000L).selectExpr("CAST(current_timestamp AS string) 
c0", "current_timestamp + interval 1 day AS 
c1").repartition(1).write.parquet("/tmp/t")
    scala> :paste
    def timer[R](block: => R): R = {
      val t0 = System.nanoTime()
      val result = block
      val t1 = System.nanoTime()
      println("Elapsed time: " + ((t1 - t0) / 1000000000.0) + "s")
      result
    }
    
    // base number
    scala> timer { 
spark.read.parquet("/tmp/t").repartition(1).selectExpr("CAST(c0 AS TIMESTAMP) < 
c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
    Elapsed time: 21.861871939s    
    
    // without this patch
    scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("c0 < 
c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
    Elapsed time: 48.424095598s   
    
    // with this patch
    scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("c0 < 
c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
    Elapsed time: 22.459832991s                                                 
    
    ```
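    To illustrate the rewrite idea from the POC: a minimal, Spark-independent 
sketch of an optimizer-style rule over a toy expression tree. All names here 
(`StringTimestampLessThan`, `SpecializeComparisons`, the toy `Expr` hierarchy) 
are hypothetical and not Spark's actual Catalyst classes; the real rule would 
pattern-match on Catalyst expressions inside `Optimizer`.
    ```scala
    // Toy expression tree standing in for Catalyst expressions.
    sealed trait Expr
    case class Col(name: String) extends Expr
    case class Cast(child: Expr, to: String) extends Expr
    case class LessThan(left: Expr, right: Expr) extends Expr
    // Hypothetical specialized node that compares a string column against a
    // timestamp without a per-row cast.
    case class StringTimestampLessThan(left: Expr, right: Expr) extends Expr

    object SpecializeComparisons {
      // Top-down rewrite: detect LessThan(Cast(x, "timestamp"), y) and
      // replace it with the specialized comparison; recurse elsewhere.
      def apply(e: Expr): Expr = e match {
        case LessThan(Cast(l, "timestamp"), r) =>
          StringTimestampLessThan(apply(l), apply(r))
        case LessThan(l, r) => LessThan(apply(l), apply(r))
        case Cast(c, t)     => Cast(apply(c), t)
        case leaf           => leaf
      }
    }
    ```
    For example, `SpecializeComparisons(LessThan(Cast(Col("c0"), "timestamp"), Col("c1")))` 
yields `StringTimestampLessThan(Col("c0"), Col("c1"))`, while expressions that 
don't match the pattern are left unchanged.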
    WDYT? cc: @hvanhovell @gatorsmile @tanejagagan 

