Github user zecevicp commented on the issue: https://github.com/apache/spark/pull/21109 Hey Liang-Chi, thanks for looking into this. Yes, the problem can be circumvented by changing the join condition as you describe, but only in the benchmark case, because my "expensive function" was a bit misleading. The problem is not in the function itself, but in the number of rows that are checked for each pair of matching equi-join keys. I changed the benchmark test case now so to better demonstrate this. I completely removed the expensive function and I'm only doing a count on the matched rows. The results are the following. Without the optimization: ``` AMD EPYC 7401 24-Core Processor sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative --------------------------------------------------------------------------------------------- sort merge join wholestage off 30956 / 31374 0.0 75575.5 1.0X sort merge join wholestage on 10864 / 11043 0.0 26523.6 2.8X ``` With the optimization: ``` AMD EPYC 7401 24-Core Processor sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative --------------------------------------------------------------------------------------------- sort merge join wholestage off 30734 / 31135 0.0 75035.2 1.0X sort merge join wholestage on 959 / 1040 0.4 2341.3 32.0X ``` This shows a 10x improvement over the non-optimized case (as I already said, this depends on the range condition, number of matched rows, the calculated function, etc.). Regarding your second question as to why is the "wholestage off" case in the optimized version so slow, that is because the optimization is turned off when the wholestage code generation is turned off. And that is simply because it was too hard to debug it and I figured the wholestage generation is on by default, so I'm guessing (and hoping) that it would not be too hard of a requirement to have to turn wholestage codegen on if you want to use this optimization.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org