Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18141 )
Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables ...................................................................... Patch Set 6: (6 comments) Thanks! http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@30 PS4, Line 30: > change to "with"? It means the IN-list has 4 items. Okay. http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@34 PS4, Line 34: ps_partkey and l_suppkey = ps_suppkey; : > You are right but not sure we have misunderstanding here. There are two kin Good to know! Thanks for the explanation. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221 PS4, Line 1221: f (in_list_filter->AlwaysTrue()) continue; > Yeah, the check is done by FE: https://github.com/apache/impala/blob/6c845e I was originally thinking that when the target of a IN-list filter is partition columns, then the target can be removed in FE. Doing the test here means such targets are retained in the plan and do not contribute. Personally, I feel we should allow the target to be a partition column in this patch to pick up good performance gain, especially for large tables with hundreds of partitions. The code to deal with partition column is here: https://github.com/apache/impala/blob/master/be/src/exec/hdfs-scan-node-base.cc#L922. Seems your code will work out of box in this situation if line @1221 is removed. http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271 PS4, Line 1271: > PrepareSearchArguments() will be called multiple times after this patch. Th Okay. Calling PrepareSearchArguments() for each ORC stripe may be an overkill. My understanding is that there is a consolidation step to merge the filters from different partitions (for PARTITIONED HJ). Only the merged filter can arrive at the scan node. For BROADCAST HJ, such merge step os not needed. http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394 PS4, Line 394: r > I think it's assumed that both sides are casted to the same type. EQUALS pr It also depends on how ORC layer handles the types. >From https://orc.apache.org/api/orc-core/org/apache/orc/Reader.Options.html, >https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.html?is-external=true > and >https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.html, > it seems the literal list can only take one of the four primitive typed >objects: Integer, Long, Double, or String. Denote such a type T. Then >technically, it is sufficient that both the inner and the outer, after >optional casting, are of type T. Note also that we need to verify the >surviving column values because of IN-list predicates being mapped to ORC >bloom filters. The rules of casting may be like this, in the order of priority. 1. If either the inner or outer is small/tiny int, cast both to int; 2. If either is less than or equal to int, cast both to int; 3. If either is less than or equal to big int, cast both to big int; 4. If either is less than or equal to double, cast both to double; 5. If either is SQL character types, cast both to string; I think it is a good idea to verify the types here to make it possible to detect type mismatch early. http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@742 PS4, Line 742: public int compare(RuntimeFilter a, RuntimeFilter b) { > I think it's very likely that partitioned HJs will exceed the threshold. Bu Sounds like a good idea to handle partitioned HJs in another JIRA. We can borrow BE code from min/max filters to handle both 1) and 2). -- To view, visit http://gerrit.cloudera.org:8080/18141 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501 Gerrit-Change-Number: 18141 Gerrit-PatchSet: 6 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Mon, 14 Feb 2022 17:04:02 +0000 Gerrit-HasComments: Yes
