IvanVergiliev commented on a change in pull request #24068: [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion
URL: https://github.com/apache/spark/pull/24068#discussion_r288015721
 
 

 ##########
 File path: sql/core/v1.2.1/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ##########
 @@ -150,138 +118,232 @@ private[sql] object OrcFilters extends OrcFiltersBase {
       dataTypeMap: Map[String, DataType],
       expression: Filter,
       builder: Builder): Option[Builder] = {
-    createBuilder(dataTypeMap, expression, builder, canPartialPushDownConjuncts = true)
+    filterAndBuild(dataTypeMap, expression, builder)
   }
 
+  sealed trait ActionType
+  case object FilterAction extends ActionType
+  case object BuildAction extends ActionType
+
   /**
-   * @param dataTypeMap a map from the attribute name to its data type.
-   * @param expression the input filter predicates.
-   * @param builder the input SearchArgument.Builder.
-   * @param canPartialPushDownConjuncts whether a subset of conjuncts of predicates can be pushed
-   *                                    down safely. Pushing ONLY one side of AND down is safe to
-   *                                    do at the top level, or when none of its ancestors is a NOT or an OR.
-   * @return the builder so far.
+   * Builds a SearchArgument for a Filter by first trimming the non-convertible nodes, and then
+   * only building the remaining convertible nodes.
+   *
+   * Doing the conversion in this way avoids the computational complexity problems introduced by
+   * checking whether a node is convertible while building it. The approach implemented here has
+   * complexity that's linear in the size of the Filter tree - O(number of Filter nodes) - we run
+   * a single pass over the tree to trim it, and then another pass over the trimmed tree to convert
+   * the remaining nodes.
+   *
+   * The alternative approach of checking-while-building can (and did) result in complexity that's
+   * exponential in the height of the tree, causing performance problems for Filters with as few
+   * as ~35 nodes if they were skewed.
    */
-  private def createBuilder(
+  private def filterAndBuild(
       dataTypeMap: Map[String, DataType],
       expression: Filter,
-      builder: Builder,
-      canPartialPushDownConjuncts: Boolean): Option[Builder] = {
+      builder: Builder
+  ): Option[Builder] = {
     def getType(attribute: String): PredicateLeaf.Type =
       getPredicateLeafType(dataTypeMap(attribute))
 
     import org.apache.spark.sql.sources._
 
-    expression match {
-      case And(left, right) =>
-        // Here, it is not safe to just convert one side and remove the other side
-        // if we do not understand what the parent filters are.
-        //
-        // Here is an example used to explain the reason.
-        // Let's say we have NOT(a = 2 AND b in ('1')) and we do not understand how to
-        // convert b in ('1'). If we only convert a = 2, we will end up with a filter
-        // NOT(a = 2), which will generate wrong results.
-        //
-        // Pushing one side of AND down is only safe to do at the top level or in the child
-        // AND before hitting NOT or OR conditions, and in this case, the unsupported predicate
-        // can be safely removed.
-        val leftBuilderOption =
-          createBuilder(dataTypeMap, left, newBuilder, canPartialPushDownConjuncts)
-        val rightBuilderOption =
-          createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts)
-        (leftBuilderOption, rightBuilderOption) match {
-          case (Some(_), Some(_)) =>
-            for {
-              lhs <- createBuilder(dataTypeMap, left,
-                builder.startAnd(), canPartialPushDownConjuncts)
-              rhs <- createBuilder(dataTypeMap, right, lhs, canPartialPushDownConjuncts)
-            } yield rhs.end()
-
-          case (Some(_), None) if canPartialPushDownConjuncts =>
-            createBuilder(dataTypeMap, left, builder, canPartialPushDownConjuncts)
+    // The performAction method can run both the filtering and building operations for a given
+    // node - we signify which one we want with the `actionType` parameter.
+    //
+    // There are a couple of benefits to coupling the two operations like this:
+    // 1. All the logic for a given predicate is grouped logically in the same place. You don't
+    //   have to scroll across the whole file to see what the filter action for an And is while
+    //   you're looking at the build action.
+    // 2. It's much easier to keep the implementations of the two operations up-to-date with
+    //   each other. If the `filter` and `build` operations are implemented as separate case-matches
+    //   in different methods, it's very easy to change one without appropriately updating the
+    //   other. For example, if we add a new supported node type to `filter`, it would be very
+    //   easy to forget to update `build` to support it too, thus leading to conversion errors.
+    //
+    // Doing things this way does have some annoying side effects:
+    // - We need to return an `Either`, with one action type always returning a Left and the other
+    //   always returning a Right.
+    // - We always need to pass the canPartialPushDownConjuncts parameter even though the build
 
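To make the linear-complexity claim in the new doc comment concrete, here is a minimal sketch of the trim-then-build idea over a toy filter tree. All names in it (`Node`, `trim`, `build`) are hypothetical stand-ins: the real patch operates on Spark's `Filter` and ORC's `SearchArgument.Builder`, and additionally tracks whether partial pushdown of conjuncts is safe under NOT/OR ancestors.

```scala
sealed trait Node
case class Leaf(convertible: Boolean, repr: String) extends Node
case class AndNode(left: Node, right: Node) extends Node
case class OrNode(left: Node, right: Node) extends Node

object TwoPassSketch extends App {
  // Pass 1: drop non-convertible subtrees. Each node is visited exactly once.
  def trim(node: Node): Option[Node] = node match {
    case leaf: Leaf => if (leaf.convertible) Some(leaf) else None
    case AndNode(left, right) =>
      // With no enclosing NOT/OR it is safe to keep just one side of an AND;
      // the real code only allows this where partial pushdown is safe.
      (trim(left), trim(right)) match {
        case (Some(l), Some(r)) => Some(AndNode(l, r))
        case (one, other)       => one.orElse(other)
      }
    case OrNode(left, right) =>
      // An OR is convertible only if both children are.
      for { l <- trim(left); r <- trim(right) } yield OrNode(l, r)
  }

  // Pass 2: build the trimmed tree. Every remaining node is convertible by
  // construction, so nothing is re-checked: total work stays O(nodes).
  def build(node: Node): String = node match {
    case Leaf(_, repr)        => repr
    case AndNode(left, right) => s"and(${build(left)}, ${build(right)})"
    case OrNode(left, right)  => s"or(${build(left)}, ${build(right)})"
  }

  val tree = AndNode(Leaf(convertible = true, "a = 2"),
                     Leaf(convertible = false, "b in ('1')"))
  println(trim(tree).map(build)) // Some(a = 2)
}
```

The old checking-while-building scheme, by contrast, converted each child once into a throwaway `newBuilder` to see whether conversion succeeds and then again to actually emit it, so each AND/OR level could double the work done below it.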
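The NOT(a = 2 AND b in ('1')) example from the removed comment can also be checked mechanically. The sketch below uses plain Scala functions as stand-ins for the two predicates; the names and the sample row are hypothetical, chosen only to show that dropping the unconvertible conjunct underneath a NOT changes the result.

```scala
object PartialPushdownUnderNot extends App {
  case class Row(a: Int, b: String)

  // NOT(a = 2 AND b in ('1')): the original predicate.
  val original: Row => Boolean = row => !(row.a == 2 && Set("1").contains(row.b))
  // NOT(a = 2): what you get by silently dropping the unconvertible conjunct.
  val trimmed: Row => Boolean = row => !(row.a == 2)

  val row = Row(a = 2, b = "2")
  assert(original(row)) // the original predicate keeps this row...
  assert(!trimmed(row)) // ...but the trimmed one wrongly filters it out
}
```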
 Review comment:
   Yep, that's a good idea - implemented.
   
   I had a similar thought around type members per action, but couldn't get it to work, so I kind of gave up on putting more things there.
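   For illustration, one shape the "type members per action" idea could take is sketched below. All names are hypothetical and this is not code from the patch. Note the casts: Scala 2 does not refine an abstract type member through a pattern match, which is plausibly the sort of obstacle the comment alludes to; the `Either`-based version in the diff avoids this at the cost of wrapping every result.

```scala
object TypeMemberSketch extends App {
  sealed trait ActionType { type ReturnType }
  case object FilterAction extends ActionType { type ReturnType = Option[String] }
  case object BuildAction extends ActionType { type ReturnType = String }

  // The dependent result type means callers get the precise type back with
  // no Either, but the implementation needs casts to satisfy the compiler.
  def performAction(action: ActionType)(input: String): action.ReturnType =
    action match {
      case FilterAction =>
        val filtered: Option[String] = if (input.nonEmpty) Some(input) else None
        filtered.asInstanceOf[action.ReturnType]
      case BuildAction =>
        s"built($input)".asInstanceOf[action.ReturnType]
    }

  val trimmed: Option[String] = performAction(FilterAction)("a = 2")
  val built: String = performAction(BuildAction)("a = 2")
  println((trimmed, built))
}
```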
