Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
......................................................................


Patch Set 4:

(16 comments)

Looks good to me!

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@10
PS4, Line 10: Unfortunately they can't leverage the bloom filters in
            : ORC files. Because only EQUALS and IN-list predicates can 
leverage them
            : to skip unrelated ORC RowGroups, and we can't convert runtime 
bloom
            : filters or min-max filters into such predicates.
May reword as

Unfortunately the native ORC library can only accept EQUALS and IN-list to skip 
related ORC RowGroups, to which both runtime bloom or min-max filters can't be 
converted.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@18
PS4, Line 18: Evaluating runtime IN-list filters is much slower than evaluating
            : runtime bloom filters due to the current simple implementation 
(i.e.
            : std::unorder_set). So we disable it at row level.
Suggest to mention it after the introduction section. That is, right before the 
TODO.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@30
PS4, Line 30: of
nit. remove


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@34
PS4, Line 34: so the pushed down IN-list filter can have a better
            : filter rate.
Not sure if this is right. I thought IN-list will be done inside ORC library 
layer and bloom in impala layer.

Maybe say it as: Note that in-list filters and bloom filters are orthogonal 
because of different operation locations, it is desirable to keep the bloom 
filters in the query plan.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221: f (filter->IsBoundByPartitionColumn(GetScanNodeId())) continue;
Looks like this can be done in FE.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271: PrepareInListPredicate
I wonder if this method UpdateSearchArgumentWithFilters() is called only once. 
Since PrepareInListPredicate() can put the ORC predicate in two forms, and if 
this method is called multiple times, then we could end up with the following 
interesting situation:

1. List of 1 item -> EQUALS form;
2. List of 4 times -> IN-LIST form;

The final form should be the IN-LIST form including the item from 1.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc@959
PS4, Line 959: //TODO: IN-list filter threshold (default 1024).
Sounds like this is quite important.  When the # items in the list in HJ 
builder is over the threshold, we set the filter to ALWAYS TRUE.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc@599
PS4, Line 599: List
In-list size?


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc@40
PS4, Line 40: Evaluating IN-list filter is much slower than evaluating the 
corresponding bloom
            :     // filter. Skip it until we improve its performance.
Seems to me IN_list will shine in performance when applied to partition columns.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h
File be/src/runtime/runtime-filter.inline.h:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@32
PS4, Line 32: if (is_bloom_filter()
Switch on filter_desc().type to save some IF tests?


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@43
PS4, Line 43: if (is_bloom_filter()
Switch on filter_desc().type to save some IF tests?


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter-ir.cc
File be/src/util/in-list-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter-ir.cc@30
PS4, Line 30: 1024
Turn this into a query option?


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter.h
File be/src/util/in-list-filter.h:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter.h@89
PS4, Line 89: int64_t
May consider the exact type (int8_t, int16_t, int32_t or int64_t), similar to 
min/max filters, to save memory space.


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394
PS4, Line 394: 0
Seems we need also check the inner side too?


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@742
PS4, Line 742: // Don't generate IN-list filters for partitioned joins.
Just wonder why not for partitioned HJs. We may use the following to decide.

1) Partitioned or broadcasted HJ
2) Each join column =has stats and the total # of rows <= in-list threshold 
(say 1024)


http://gerrit.cloudera.org:8080/#/c/18141/4/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/18141/4/tests/query_test/test_runtime_filters.py@322
PS4, Line 322: both
nit. remove



--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Comment-Date: Fri, 04 Feb 2022 17:57:38 +0000
Gerrit-HasComments: Yes

Reply via email to