Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/22313#discussion_r214775155
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
@@ -71,12 +71,24 @@ private[orc] object OrcFilters {
     for {
       // Combines all convertible filters using `And` to produce a single conjunction
-      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
+      conjunction <- buildTree(convertibleFilters)
--- End diff ---
For the first question, I don't think Parquet has the same issue because
Parquet uses `canMakeFilterOn`, while ORC tries to build a full result (with
a fresh builder) to check whether a filter is convertible.
For the second question, in ORC we already do the first half (the `flatMap`)
to compute `convertibleFilters`, but we can change it to use `filters.filter`.
```scala
val convertibleFilters = for {
  filter <- filters
  _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
} yield filter
```
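To illustrate the equivalence, here is a minimal, self-contained sketch: `tryConvert` is a hypothetical stand-in for `buildSearchArgument` that returns `Some` when a filter is convertible, and the `for`/`flatMap` form selects exactly the same filters as a plain `filters.filter`:

```scala
// Sketch only: `tryConvert` is a hypothetical placeholder for
// buildSearchArgument, returning Some(filter) when conversion succeeds.
object FilterSelection {
  def tryConvert(filter: String): Option[String] =
    if (filter.startsWith("ok")) Some(filter) else None

  // The for-comprehension form used in OrcFilters: flatMap over the
  // Option result, keeping the original filter on success.
  def viaFor(filters: Seq[String]): Seq[String] =
    for {
      f <- filters
      _ <- tryConvert(f)
    } yield f

  // The equivalent, more direct filter-based form.
  def viaFilter(filters: Seq[String]): Seq[String] =
    filters.filter(f => tryConvert(f).isDefined)

  def main(args: Array[String]): Unit = {
    val filters = Seq("ok1", "bad", "ok2")
    assert(viaFor(filters) == viaFilter(filters))
    println(viaFor(filters)) // both keep only the convertible filters
  }
}
```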
And the second half, `reduceOption(FilterApi.and)`, was the original ORC
code, which generated a skewed tree with exponential time complexity. We
need to use `buildTree` instead.
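The difference between the two reductions can be sketched with a simplified model (not Spark's actual `buildTree`; `Leaf` and `And` are hypothetical predicate nodes): splitting the sequence in half at each level yields a tree of logarithmic depth, while a left fold yields a chain of linear depth.

```scala
// Hypothetical predicate nodes standing in for ORC filter expressions.
sealed trait Pred
case class Leaf(name: String) extends Pred
case class And(left: Pred, right: Pred) extends Pred

object BuildTree {
  // Balanced reduction in the spirit of buildTree: split at the midpoint
  // and combine the halves, giving depth O(log n).
  def buildTree(preds: Seq[Pred]): Option[Pred] = preds match {
    case Seq()  => None
    case Seq(p) => Some(p)
    case _ =>
      val (l, r) = preds.splitAt(preds.length / 2)
      Some(And(buildTree(l).get, buildTree(r).get))
  }

  def depth(p: Pred): Int = p match {
    case Leaf(_)   => 1
    case And(l, r) => 1 + math.max(depth(l), depth(r))
  }

  def main(args: Array[String]): Unit = {
    val leaves = (1 to 1024).map(i => Leaf(s"p$i"))
    val balanced = buildTree(leaves).get
    // reduceLeft(And(_, _)) mirrors reduceOption: a left-skewed chain.
    val skewed = leaves.reduceLeft[Pred](And(_, _))
    assert(depth(balanced) == 11)   // log2(1024) + 1
    assert(depth(skewed) == 1024)   // one level per filter
    println(s"balanced=${depth(balanced)} skewed=${depth(skewed)}")
  }
}
```

The skewed chain is what made repeated re-building of the search argument blow up; the balanced tree keeps each rebuild cheap.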
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]