Pavan Lanka created ORC-743:
-------------------------------

             Summary: Conversion of SArg into Filters, to take advantage of 
LazyIO
                 Key: ORC-743
                 URL: https://issues.apache.org/jira/browse/ORC-743
             Project: ORC
          Issue Type: New Feature
          Components: Reader
            Reporter: Pavan Lanka
            Assignee: Pavan Lanka


ORC-742 introduces lazy evaluation of the non-filter columns in the presence of 
filters. This issue builds on that work by converting SArgs into filters.
h3. SArg to Filter

SArg to Filter converts the passed SArg into a filter. This enables automatic 
compatibility with both Spark and Hive, as they already push Search Arguments 
down to ORC.

The SArg is automatically converted into a [Vector 
Filter|https://github.pie.apple.com/planka/orc/blob/4_optimize_filter/java/core/src/java/org/apache/orc/filter/VectorFilter.java],
 which is applied during the read process.
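
To illustrate the idea, here is a minimal sketch of turning a predicate tree into a filter that narrows the selected rows of a batch. It does not use the ORC or Hive APIs: {{Expr}}, {{Leaf}}, {{And}}, {{Or}}, {{BatchFilter}} and {{toFilter}} are hypothetical names invented for this example, and columns are plain {{long[]}} arrays rather than column vectors.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.function.LongPredicate;
import java.util.stream.Collectors;

public class SargToFilterSketch {
    // Hypothetical predicate tree standing in for a search argument.
    static abstract class Expr {}

    static final class Leaf extends Expr {
        final int column;
        final LongPredicate test;
        Leaf(int column, LongPredicate test) { this.column = column; this.test = test; }
    }

    static final class And extends Expr {
        final List<Expr> children;
        And(Expr... children) { this.children = Arrays.asList(children); }
    }

    static final class Or extends Expr {
        final List<Expr> children;
        Or(Expr... children) { this.children = Arrays.asList(children); }
    }

    // Hypothetical filter: returns the selected row indexes that pass.
    interface BatchFilter {
        int[] apply(long[][] columns, int[] selected);
    }

    // Converts the tree once; the resulting filter can be reused for every batch.
    static BatchFilter toFilter(Expr expr) {
        if (expr instanceof Leaf) {
            Leaf leaf = (Leaf) expr;
            return (columns, selected) -> Arrays.stream(selected)
                .filter(row -> leaf.test.test(columns[leaf.column][row]))
                .toArray();
        }
        if (expr instanceof And) {
            List<BatchFilter> children = toFilters(((And) expr).children);
            // AND: each child only sees rows that survived the earlier children.
            return (columns, selected) -> {
                int[] current = selected;
                for (BatchFilter child : children) {
                    current = child.apply(columns, current);
                }
                return current;
            };
        }
        List<BatchFilter> children = toFilters(((Or) expr).children);
        // OR: a row passes if any child accepts it; input row order is kept.
        return (columns, selected) -> {
            BitSet accepted = new BitSet();
            for (BatchFilter child : children) {
                for (int row : child.apply(columns, selected)) {
                    accepted.set(row);
                }
            }
            return Arrays.stream(selected).filter(row -> accepted.get(row)).toArray();
        };
    }

    static List<BatchFilter> toFilters(List<Expr> exprs) {
        return exprs.stream().map(SargToFilterSketch::toFilter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long[][] columns = { {1, 5, 12, 7}, {0, 3, 3, 9} };  // columns x and y
        // x < 10 AND (y = 3 OR y = 9)
        Expr sarg = new And(
            new Leaf(0, v -> v < 10),
            new Or(new Leaf(1, v -> v == 3), new Leaf(1, v -> v == 9)));
        int[] passed = toFilter(sarg).apply(columns, new int[] {0, 1, 2, 3});
        System.out.println(Arrays.toString(passed));  // prints [1, 3]
    }
}
```

The sketch walks the tree as given, without rewriting it first, so it works the same way on normalized and unnormalized predicates.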

The search argument builder should allow skipping normalization during the 
[build|https://github.com/apache/hive/blob/storage-branch-2.7/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java#L491].
 This has already been proposed as part of HIVE-24458.

Normalization performs very poorly in the presence of multilevel predicates, as 
the following benchmark shows.
||Benchmark||(fSize)||(fType)||(normalize)||Mode||Cnt||Score||Error||Units||
|ComplexFilterBench.filter|2|vector|true|avgt|20|74.321|± 0.156|us/op|
|ComplexFilterBench.filter|2|vector|false|avgt|20|78.119|± 0.351|us/op|
|ComplexFilterBench.filter|4|vector|true|avgt|20|267.405|± 1.202|us/op|
|ComplexFilterBench.filter|4|vector|false|avgt|20|136.284|± 0.637|us/op|
|ComplexFilterBench.filter|8|vector|true|avgt|20|9907.765|± 49.208|us/op|
|ComplexFilterBench.filter|8|vector|false|avgt|20|247.714|± 0.651|us/op|

Explanation:
 * *fSize* identifies the size of the OR clause that will be normalized.
 * *normalize* identifies whether normalization was applied to the Search 
Argument.

Observations:
 * Normalizing the search argument results in a significant performance penalty 
because the operator tree explodes in size.
 ** In the case where an AND includes 8 ORs, the unnormalized version takes 
about *97.5%* less time per operation (247.714 vs. 9907.765 us/op, roughly 40x 
faster).
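
The growth of the operator tree can be sketched as follows. This assumes that normalization means rewriting the predicate into conjunctive normal form, and that the predicate is an OR over several two-leaf ANDs; the exact shape used by the benchmark is not stated here, so this shape is an assumption. {{CnfGrowthSketch}}, {{orOfAnds}} and {{toCnf}} are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CnfGrowthSketch {
    // Builds (a1 AND b1) OR (a2 AND b2) OR ... with orSize conjunctions.
    static List<List<String>> orOfAnds(int orSize) {
        List<List<String>> conjunctions = new ArrayList<>();
        for (int i = 1; i <= orSize; i++) {
            conjunctions.add(Arrays.asList("a" + i, "b" + i));
        }
        return conjunctions;
    }

    // Distributes OR over AND: every resulting clause picks one leaf from
    // each conjunction, so the clause count is the product of their sizes.
    static List<List<String>> toCnf(List<List<String>> conjunctions) {
        List<List<String>> clauses = new ArrayList<>();
        clauses.add(new ArrayList<>());
        for (List<String> conjunction : conjunctions) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> clause : clauses) {
                for (String leaf : conjunction) {
                    List<String> extended = new ArrayList<>(clause);
                    extended.add(leaf);
                    next.add(extended);
                }
            }
            clauses = next;
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (int orSize : new int[] {2, 4, 8}) {
            int clauses = toCnf(orOfAnds(orSize)).size();
            System.out.println(orSize + " ANDs under one OR -> "
                + clauses + " OR clauses under one AND");
        }
        // prints 4, 16 and 256 clauses for sizes 2, 4 and 8
    }
}
```

Under these assumptions the unnormalized tree grows linearly with the OR size (16 leaves at size 8), while the normalized tree grows exponentially (256 clauses of 8 leaves each).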



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
