IvanVergiliev commented on a change in pull request #24068: [SPARK-27105][SQL]
Optimize away exponential complexity in ORC predicate conversion
URL: https://github.com/apache/spark/pull/24068#discussion_r288946083
##########
File path:
sql/core/v1.2.1/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
##########
@@ -143,145 +115,231 @@ private[sql] object OrcFilters extends OrcFiltersBase {
case _ => value
}
+ import org.apache.spark.sql.sources._
+ import OrcFilters._
+
+ private[sql] def trimUnconvertibleFilters(expression: Filter):
Option[Filter] = {
+ performFilter(expression, canPartialPushDownConjuncts = true)
+ }
+
/**
- * Build a SearchArgument and return the builder so far.
+ * Builds a SearchArgument for a Filter by first trimming the
non-convertible nodes, and then
+ * only building the remaining convertible nodes.
+ *
+ * Doing the conversion in this way avoids the computational complexity
problems introduced by
+ * checking whether a node is convertible while building it. The approach
implemented here has
+ * complexity that's linear in the size of the Filter tree - O(number of
Filter nodes) - we run
+ * a single pass over the tree to trim it, and then another pass on the
trimmed tree to convert
+ * the remaining nodes.
+ *
+ * The alternative approach of checking-while-building can (and did) result
+ * in exponential complexity in the height of the tree, causing perf
problems with Filters with
+ * as few as ~35 nodes if they were skewed.
*/
- private def buildSearchArgument(
- dataTypeMap: Map[String, DataType],
+ private[sql] def buildSearchArgument(
expression: Filter,
builder: Builder): Option[Builder] = {
- createBuilder(dataTypeMap, expression, builder,
canPartialPushDownConjuncts = true)
+ performFilter(expression, canPartialPushDownConjuncts = true).map { filter
=>
Review comment:
`performFilter` now has only a single usage - the call in
`trimUnconvertibleFilters`. I'm considering removing it completely and just
calling `performAction`, although I like the symmetry of having both
`performFilter` and `updateBuilder` as mini-wrappers around `performAction`.
Let me know if you think I should get rid of `performFilter`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]