Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18141 )
Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables ...................................................................... Patch Set 5: (16 comments) http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@10 PS4, Line 10: ++ reader supports pushing down predicates to skip : unreleated RowGroups. The pushed down predicates will be evaludated on : file indexes (i.e. statistics and bloom filter indexes). Note that only : EQUALS and IN-list predicates can leverage bloom > May reword as Sorry for making this unclear. * The native ORC library can accept many kinds of predicates, not just EQUALS and IN-list predicates, but also comparison (e.g. <, >, >=) and IS-[NOT]-NULL predicates, etc. They can both be used to skip unreleated ORC RowGroups. * Each ORC files can have optional bloom filters on different columns. * Only EQUALS and IN-list predicates can leverage these file-level bloom filters. Updated the sentenses. But not sure if they are clear enough. http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@18 PS4, Line 18: indexes. : : This patch adds runtime IN-list filters for this > Suggest to mention it after the introduction section. That is, right before Done http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@30 PS4, Line 30: > nit. remove change to "with"? It means the IN-list has 4 items. http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@34 PS4, Line 34: ps_partkey and l_suppkey = ps_suppkey; : > Not sure if this is right. I thought IN-list will be done inside ORC librar You are right but not sure we have misunderstanding here. There are two kinds of bloom filters: * Runtime bloom filters generated by Impala * Bloom filter indexes in the ORC files (generated by Hive when inserting the table) If the lineitem table is generated with bloom filter indexes, the runtime IN-list filter can have a better filter rate. Updated the sentense. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221 PS4, Line 1221: / Only apply runtime filters on non-partition columns. > Looks like this can be done in FE. Yeah, the check is done by FE: https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L936 Here we only check the flag set by FE. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271 PS4, Line 1271: < filter->id(); > I wonder if this method UpdateSearchArgumentWithFilters() is called only on PrepareSearchArguments() will be called multiple times after this patch. Thus the same as UpdateSearchArgumentWithFilters(). The reason is runtime filters will arrive in runtime. So we re-build the SearchArgument each time we start reading a new ORC stripe. However, the above situation seems impossible. When an IN-list filter arrived, it won't be updated anymore. So the predicate should remain the same. BTW, I updated the method comment of PrepareSearchArguments(). Please let me know if it's unclear. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc@959 PS4, Line 959: //TODO: IN-list filter threshold (default 1024). > Sounds like this is quite important. When the # items in the list in HJ bu Yeah, I added this in the commit message in PS5. Also added the query option. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc@599 PS4, Line 599: In-l > In-list size? Done http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc File be/src/runtime/runtime-filter-ir.cc: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc@40 PS4, Line 40: : case TRuntimeFilterType::IN_LIST: { > Seems to me IN_list will shine in performance when applied to partition col Thanks for catching this! I thought this will only be evaludated in rows level. I should add the skip logic in scanners. EDIT: moved the check to HdfsScanner::EvalRuntimeFilter() http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h File be/src/runtime/runtime-filter.inline.h: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@32 PS4, Line 32: switch (filter_desc() > Switch on filter_desc().type to save some IF tests? Good point! Done. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@43 PS4, Line 43: line bool RuntimeFilt > Switch on filter_desc().type to save some IF tests? Done http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter-ir.cc File be/src/util/in-list-filter-ir.cc: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter-ir.cc@30 PS4, Line 30: ent > Turn this into a query option? Done http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter.h File be/src/util/in-list-filter.h: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter.h@89 PS4, Line 89: ll nume > May consider the exact type (int8_t, int16_t, int32_t or int64_t), similar Yeah, I planned to change the implementation in the future. So choose int64_t for simplicity. Added a TODO for this. http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394 PS4, Line 394: r > Seems we need also check the inner side too? I think it's assumed that both sides are casted to the same type. EQUALS predicates are analyzed in BinaryPredicate#analyzeImpl(). A builtin function will be matched if the parameters can be casted [1]. All EQUALS builtin functions are registered in the way that the two args are in the same type [2]. The parameters are then wrapped with CAST expressions in castForFunctionCall() [3]. There is a check for this in BinaryPredicate#toThrift() [4]. I'll also add a check here in case I misunderstand something. [1] https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java#L205 [2] https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java#L126-L127 [3] https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java#L238 [4] https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java#L184-L185 http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@742 PS4, Line 742: public int compare(RuntimeFilter a, RuntimeFilter b) { > Just wonder why not for partitioned HJs. We may use the following to decide I think it's very likely that partitioned HJs will exceed the threshold. But if the build side has a selective predicate, it could still be a good scenerio for IN-list filter. Can we do this in a follow-up JIRA? Some BE codes currently assume that only broadcast joins have runtime IN-list filtlers. http://gerrit.cloudera.org:8080/#/c/18141/4/tests/query_test/test_runtime_filters.py File tests/query_test/test_runtime_filters.py: http://gerrit.cloudera.org:8080/#/c/18141/4/tests/query_test/test_runtime_filters.py@322 PS4, Line 322: > nit. remove Done -- To view, visit http://gerrit.cloudera.org:8080/18141 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501 Gerrit-Change-Number: 18141 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Tue, 08 Feb 2022 02:30:28 +0000 Gerrit-HasComments: Yes
