Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18141 )
Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables ...................................................................... Patch Set 8: (8 comments) Replied to and added some more. Can you please also point out the explain output with in-list filters? Love to see them. It is unfortunate that there are massive number of filter Ids changes due to the introduction of the in-list type. I think some day we should re-assign the Ids at the end of compilation so that they are consecutive. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221 PS4, Line 1221: > Sorry that I'm not quite understand these. > > > I was originally thinking that when the target of a IN-list > filter is partition columns, then the target can be removed in FE. > > Doing the test here means such targets are retained in the plan > and do not contribute. > > Do you mean eliminating the partitions in FE? The IN-list filters > are generated in runtime based on the build side data of hash > joins. I'm afraid we are unable to eliminate them in the plan. > Instead, we will eliminate them in runtime in the code link you > pasted, ie. HdfsScanNodeBase::PartitionPassesFilters(). Did I miss > something? > > > Personally, I feel we should allow the target to be a partition > column in this patch to pick up good performance gain, especially > for large tables with hundreds of partitions. The code to deal with > partition column is here: > https://github.com/apache/impala/blob/master/be/src/exec/hdfs-scan-node-base.cc#L922. > Seems your code will work out of box in this situation if line > @1221 is removed. > > UpdateSearchArgumentWithFilters() is only used in the orc scanner > to push down filters into the ORC lib. We need line 1221 since > partition columns don't exist in the ORC files. > > The logics of HdfsScanNodeBase::PartitionPassesFilters() still > apply on IN-list filters. I don't see it skip using IN-list > filters. So we already support it that filtering out unrelated > partitions by the IN-list filters. Or did I miss something? Okay. I think you are right. The line at 1221 is a protection for not applying the filter on the data files. Sorry I missed that one. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271 PS4, Line 1271: ataType predicate_type > > Calling PrepareSearchArguments() for each ORC stripe may be an overkill. It seems to me starting filtering without waiting for the merge version to arrive can produce incorrect/non-deterministic results. For example, assume values [1, 2, 10] in the first stripe, and the merged filter is [1, 2]. If a partial filter [2] arrives and is applied, then [1, 10] will be eliminated. However [1] is the answer. Since all filter predicates are conjunctive, it is okay to use a subset of it, which may reduce the filtering efficiency. But the result is still correct. Each filter must be the merged version though. http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/ImpalaService.thrift@725 PS8, Line 725: RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT nit. IN_LIST_FILTER_ENTRY_LIMIT? http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/Query.thrift File common/thrift/Query.thrift: http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/Query.thrift@578 PS8, Line 578: runtime_in_list_filter_entry_limit nit. in_list_filter_entry_limit? http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394 PS4, Line 394: r > The above casting is handled in BE in the orc scanner, because the underlyi Okay. Agree casting in BE is the right way to go if data types in orc file can be different from table schema. But doing a feasibility check here for the inner should be done for the reasons mentioned. http://gerrit.cloudera.org:8080/#/c/18141/8/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/18141/8/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@689 PS8, Line 689: 8 I wonder if this can be improved a little bit, especially for int type, to save some spaces. It is impossible for a column in ORC data file to contain 8-byte integer while the column type is 4-byte int, right? http://gerrit.cloudera.org:8080/#/c/18141/8/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/18141/8/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@8 PS8, Line 8: 3.42K Saw quite many changes on cardinality. Can you explain the reason? http://gerrit.cloudera.org:8080/#/c/18141/8/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test: http://gerrit.cloudera.org:8080/#/c/18141/8/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test@13 PS8, Line 13: HDFS Is this change introduced in this patch? -- To view, visit http://gerrit.cloudera.org:8080/18141 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501 Gerrit-Change-Number: 18141 Gerrit-PatchSet: 8 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Wed, 16 Feb 2022 17:01:53 +0000 Gerrit-HasComments: Yes
