Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
......................................................................


Patch Set 5:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@10
PS4, Line 10: ++ reader supports pushing down predicates to skip
            : unreleated RowGroups. The pushed down predicates will be 
evaludated on
            : file indexes (i.e. statistics and bloom filter indexes). Note 
that only
            : EQUALS and IN-list predicates can leverage bloom
> May reword as
Sorry for making this unclear.

* The native ORC library can accept many kinds of predicates, not just EQUALS 
and IN-list predicates, but also comparison (e.g. <, >, >=) and IS-[NOT]-NULL 
predicates, etc. They can both be used to skip unreleated ORC RowGroups.
* Each ORC files can have optional bloom filters on different columns.
* Only EQUALS and IN-list predicates can leverage these file-level bloom 
filters.

Updated the sentenses. But not sure if they are clear enough.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@18
PS4, Line 18: indexes.
            :
            : This patch adds runtime IN-list filters for this
> Suggest to mention it after the introduction section. That is, right before
Done


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@30
PS4, Line 30:
> nit. remove
change to "with"? It means the IN-list has 4 items.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@34
PS4, Line 34:  ps_partkey and l_suppkey = ps_suppkey;
            :
> Not sure if this is right. I thought IN-list will be done inside ORC librar
You are right but not sure we have misunderstanding here. There are two kinds 
of bloom filters:

* Runtime bloom filters generated by Impala
* Bloom filter indexes in the ORC files (generated by Hive when inserting the 
table)

If the lineitem table is generated with bloom filter indexes, the runtime 
IN-list filter can have a better filter rate.
Updated the sentense.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221: / Only apply runtime filters on non-partition columns.
> Looks like this can be done in FE.
Yeah, the check is done by FE: 
https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L936

Here we only check the flag set by FE.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271: < filter->id();
> I wonder if this method UpdateSearchArgumentWithFilters() is called only on
PrepareSearchArguments() will be called multiple times after this patch. Thus 
the same as UpdateSearchArgumentWithFilters(). The reason is runtime filters 
will arrive in runtime. So we re-build the SearchArgument each time we start 
reading a new ORC stripe.

However, the above situation seems impossible. When an IN-list filter arrived, 
it won't be updated anymore. So the predicate should remain the same.

BTW, I updated the method comment of PrepareSearchArguments(). Please let me 
know if it's unclear.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc@959
PS4, Line 959: //TODO: IN-list filter threshold (default 1024).
> Sounds like this is quite important.  When the # items in the list in HJ bu
Yeah, I added this in the commit message in PS5. Also added the query option.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc@599
PS4, Line 599: In-l
> In-list size?
Done


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc@40
PS4, Line 40:
            :     case TRuntimeFilterType::IN_LIST: {
> Seems to me IN_list will shine in performance when applied to partition col
Thanks for catching this! I thought this will only be evaludated in rows level. 
I should add the skip logic in scanners.

EDIT: moved the check to HdfsScanner::EvalRuntimeFilter()


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h
File be/src/runtime/runtime-filter.inline.h:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@32
PS4, Line 32: switch (filter_desc()
> Switch on filter_desc().type to save some IF tests?
Good point! Done.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@43
PS4, Line 43: line bool RuntimeFilt
> Switch on filter_desc().type to save some IF tests?
Done


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter-ir.cc
File be/src/util/in-list-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter-ir.cc@30
PS4, Line 30:  ent
> Turn this into a query option?
Done


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter.h
File be/src/util/in-list-filter.h:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter.h@89
PS4, Line 89: ll nume
> May consider the exact type (int8_t, int16_t, int32_t or int64_t), similar
Yeah, I planned to change the implementation in the future. So choose int64_t 
for simplicity. Added a TODO for this.


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394
PS4, Line 394: r
> Seems we need also check the inner side too?
I think it's assumed that both sides are casted to the same type. EQUALS 
predicates are analyzed in BinaryPredicate#analyzeImpl(). A builtin function 
will be matched if the parameters can be casted [1]. All EQUALS builtin 
functions are registered in the way that the two args are in the same type [2]. 
The parameters are then wrapped with CAST expressions in castForFunctionCall() 
[3].

There is a check for this in BinaryPredicate#toThrift() [4]. I'll also add a 
check here in case I misunderstand something.

[1] 
https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java#L205
[2] 
https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java#L126-L127
[3] 
https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java#L238
[4] 
https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java#L184-L185


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@742
PS4, Line 742:   public int compare(RuntimeFilter a, RuntimeFilter b) {
> Just wonder why not for partitioned HJs. We may use the following to decide
I think it's very likely that partitioned HJs will exceed the threshold. But if 
the build side has a selective predicate, it could still be a good scenerio for 
IN-list filter.

Can we do this in a follow-up JIRA? Some BE codes currently assume that only 
broadcast joins have runtime IN-list filtlers.


http://gerrit.cloudera.org:8080/#/c/18141/4/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/18141/4/tests/query_test/test_runtime_filters.py@322
PS4, Line 322:
> nit. remove
Done



--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Tue, 08 Feb 2022 02:30:28 +0000
Gerrit-HasComments: Yes

Reply via email to