IvanVergiliev commented on a change in pull request #24068: [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion
URL: https://github.com/apache/spark/pull/24068#discussion_r293204913
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
##########
@@ -362,6 +394,13 @@ object FilterPushdownBenchmark extends BenchmarkBase with SQLHelper {
}
runBenchmark(s"Pushdown benchmark with many filters") {
+ // This benchmark and the next one are similar in that they both test predicate pushdown
+ // where the filter itself is very large. There have been cases where the filter conversion
+ // would take minutes to hours for large filters due to it being implemented with exponential
+ // complexity in the height of the filter tree.
+ // The difference between these two benchmarks is that this one benchmarks pushdown with a
+ // large string filter (`a AND b AND c ...`), whereas the next one benchmarks pushdown with
+ // a large Column-based filter (`col(a) || (col(b) || (col(c)...))`).
Review comment:
@cloud-fan the two go through different code paths. The string-based benchmark was
added in https://github.com/apache/spark/pull/22313, but it does not expose the
slowness that occurs when passing a `Column` filter directly. In other words, the
string-based case was already fast before this PR. The case this PR fixes is
specifically passing a `Column` directly to something like `df.filter(Column)`.
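To make the complexity issue above concrete, here is a minimal toy model (not Spark's actual `OrcFilters` code; the `Pred`, `Leaf`, `And`, and `ConversionSketch` names are hypothetical) contrasting a naive converter that re-converts each child while checking convertibility, giving O(2^height) work on a deep filter tree, with a converter that converts each child exactly once:

```scala
// Hypothetical sketch of the exponential-vs-linear conversion pattern.
sealed trait Pred
case class Leaf(name: String) extends Pred
case class And(left: Pred, right: Pred) extends Pred

object ConversionSketch {
  var calls = 0 // counts recursive conversion calls

  // Naive converter: the And case converts each child twice -- once to check
  // that it is convertible, and once more to build the result. On a chain of
  // ANDs this doubles the work at every level: O(2^height) calls in total.
  def naive(p: Pred): Option[String] = {
    calls += 1
    p match {
      case Leaf(n) => Some(n)
      case And(l, r) =>
        if (naive(l).isDefined && naive(r).isDefined) {
          Some(s"(${naive(l).get} AND ${naive(r).get})")
        } else {
          None
        }
    }
  }

  // Fixed converter: each child is converted exactly once and the result is
  // reused, so the total work is linear in the number of nodes.
  def linear(p: Pred): Option[String] = {
    calls += 1
    p match {
      case Leaf(n) => Some(n)
      case And(l, r) =>
        for { lc <- linear(l); rc <- linear(r) } yield s"($lc AND $rc)"
    }
  }

  def main(args: Array[String]): Unit = {
    // Left-deep chain of 20 ANDs over 21 leaves, mimicking `a AND b AND c ...`.
    val tree = (1 to 20).foldLeft[Pred](Leaf("p0")) { (acc, i) =>
      And(acc, Leaf(s"p$i"))
    }
    calls = 0
    naive(tree)
    println(s"naive calls:  $calls")  // grows as ~2^height
    calls = 0
    linear(tree)
    println(s"linear calls: $calls")  // one call per node: 41
  }
}
```

At height 20 the naive converter already makes millions of recursive calls while the linear one makes 41; at the heights hit by the benchmarks above, the naive path takes minutes to hours.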