Github user yhuai commented on a diff in the pull request:
https://github.com/apache/spark/pull/10377#discussion_r48374502
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
@@ -26,15 +26,47 @@ import org.apache.spark.Logging
import org.apache.spark.sql.sources._
/**
- * It may be optimized by push down partial filters. But we are conservative here.
- * Because if some filters fail to be parsed, the tree may be corrupted,
- * and cannot be used anymore.
+ * Helper object for building ORC `SearchArgument`s, which are used for ORC predicate push-down.
+ *
+ * Due to limitation of ORC `SearchArgument` builder, we had to end up with a pretty weird double-
+ * checking pattern when converting `And`/`Or`/`Not` filters.
+ *
+ * An ORC `SearchArgument` must be built in one pass using a single builder. For example, you can't
+ * build `a = 1` and `b = 2` first, and then combine them into `a = 1 AND b = 2`. This is quite
+ * different from the cases in Spark SQL or Parquet, where complex filters can be easily built using
+ * existing simpler ones.
+ *
+ * The annoying part is that, `SearchArgument` builder methods like `startAnd()`, `startOr()`, and
+ * `startNot()` mutate internal state of the builder instance. This forces us to translate all
+ * convertible filters with a single builder instance. However, before actually converting a filter,
+ * we've no idea whether it can be recognized by ORC or not. Thus, when an inconvertible filter is
+ * found, we may already end up with a builder whose internal state is inconsistent.
+ *
+ * For example, to convert an `And` filter with builder `b`, we call `b.startAnd()` first, and then
+ * try to convert its children. Say we convert `left` child successfully, but find that `right`
+ * child is inconvertible. Alas, `b.startAnd()` call can't be rolled back, and `b` is inconsistent
+ * now.
+ *
+ * The workaround employed here is that, for `And`/`Or`/`Not`, we first try to convert their
--- End diff --
Do we need to mention `And`/`Or`/`Not` here? `createFilter` deals with the
top-level filters (which are implicitly connected by `AND`), right? I think it
is important to emphasize that `createFilter` is for top-level filters.
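
To make the distinction concrete, here is a minimal sketch of the
double-checking pattern the doc comment describes, next to a `createFilter`
that only handles top-level filters. It is not the code from this PR:
`ToyBuilder` is a hypothetical stand-in for the ORC `SearchArgument` builder
(it only records tokens), `ToyOrcFilters` and the small `Filter` hierarchy are
made up for illustration, and `StringContains` is simply assumed to be
inconvertible.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy filter hierarchy (assumption: stands in for org.apache.spark.sql.sources.Filter).
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class StringContains(attribute: String, value: String) extends Filter
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter
case class Not(child: Filter) extends Filter

// Hypothetical builder: like the ORC one, every call mutates internal state
// and nothing can be rolled back.
final class ToyBuilder {
  private val tokens = ArrayBuffer.empty[String]
  def startAnd(): ToyBuilder = { tokens += "AND("; this }
  def startOr(): ToyBuilder = { tokens += "OR("; this }
  def startNot(): ToyBuilder = { tokens += "NOT("; this }
  def equalTo(attribute: String, value: Any): ToyBuilder = { tokens += s"$attribute=$value"; this }
  def end(): ToyBuilder = { tokens += ")"; this }
  def build(): String = tokens.mkString(" ")
}

object ToyOrcFilters {
  private def newBuilder(): ToyBuilder = new ToyBuilder

  // Top-level filters are implicitly AND-ed, so an inconvertible one can be
  // dropped as a whole; the remaining ones still form a valid (weaker) predicate.
  def createFilter(filters: Seq[Filter]): Option[String] = {
    // Dry run: test each top-level filter against a throwaway builder.
    val convertible = filters.filter(f => buildSearchArgument(f, newBuilder()).isDefined)
    for {
      conjunction <- convertible.reduceOption[Filter]((l, r) => And(l, r))
      builder <- buildSearchArgument(conjunction, newBuilder())
    } yield builder.build()
  }

  // For And/Or/Not, first convert the children with throwaway builders; only
  // when all of them succeed is the real builder mutated.
  private def buildSearchArgument(filter: Filter, builder: ToyBuilder): Option[ToyBuilder] =
    filter match {
      case EqualTo(attribute, value) => Some(builder.equalTo(attribute, value))
      case StringContains(_, _) => None // assumed inconvertible in this sketch
      case And(left, right) =>
        for {
          _ <- buildSearchArgument(left, newBuilder())
          _ <- buildSearchArgument(right, newBuilder())
          lhs <- buildSearchArgument(left, builder.startAnd())
          rhs <- buildSearchArgument(right, lhs)
        } yield rhs.end()
      case Or(left, right) =>
        for {
          _ <- buildSearchArgument(left, newBuilder())
          _ <- buildSearchArgument(right, newBuilder())
          lhs <- buildSearchArgument(left, builder.startOr())
          rhs <- buildSearchArgument(right, lhs)
        } yield rhs.end()
      case Not(child) =>
        for {
          _ <- buildSearchArgument(child, newBuilder())
          negated <- buildSearchArgument(child, builder.startNot())
        } yield negated.end()
    }
}
```

In this sketch, an inconvertible top-level filter is skipped on its own, while
an `Or` with one inconvertible child is rejected as a whole before the real
builder is touched.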