Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
......................................................................


Patch Set 8:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221:
Sorry that I'm not quite understand these.

> I was originally thinking that when the target of a IN-list filter is 
> partition columns, then the target can be removed in FE.
> Doing the test here means such targets are retained in the plan and do not 
> contribute.

Do you mean eliminating the partitions in FE? The IN-list filters are generated 
in runtime based on the build side data of hash joins. I'm afraid we are unable 
to eliminate them in the plan. Instead, we will eliminate them in runtime in 
the code link you pasted, ie. HdfsScanNodeBase::PartitionPassesFilters(). Did I 
miss something?

> Personally, I feel we should allow the target to be a partition column in 
> this patch to pick up good performance gain, especially for large tables with 
> hundreds of partitions. The code to deal with partition column is here: 
> https://github.com/apache/impala/blob/master/be/src/exec/hdfs-scan-node-base.cc#L922.
>  Seems your code will work out of box in this situation if line @1221 is 
> removed.

UpdateSearchArgumentWithFilters() is only used in the orc scanner to push down 
filters into the ORC lib. We need line 1221 since partition columns don't exist 
in the ORC files.

The logics of HdfsScanNodeBase::PartitionPassesFilters() still apply on IN-list 
filters. I don't see it skip using IN-list filters. So we already support it 
that filtering out unrelated partitions by the IN-list filters. Or did I miss 
something?


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271: ataType predicate_type
> Calling PrepareSearchArguments() for each ORC stripe may be an overkill.

Yeah, it could be an overkill if we have lots of predicates and runtime IN-list 
filters to push down. Runtime filters arrive randomly so we need to call this 
whenever there is a new runtime filter arrive. I think we can improve this by 
checking the arrival filters count in PrepareSearchArguments() and return if no 
new IN-list filters arrive.

> My understanding is that there is a consolidation step to merge the filters 
> from different partitions (for PARTITIONED HJ). Only the merged filter can 
> arrive at the scan node. For BROADCAST HJ, such merge step os not needed.

Yeah, we don't have the merge step for IN-list filter. However, they can arrive 
here since the coordinator will still publish them.


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394
PS4, Line 394: r
> It also depends on how ORC layer handles the types.
The above casting is handled in BE in the orc scanner, because the underlying 
ORC files could have different schemas. We can only know the file schema after 
we parse the file footer. The casting codes are in 
HdfsOrcScanner::GetSearchArgumentLiteral().

I think in FE, we just need to make sure these types are supported in BE. The 
BE codes will cast values based on the ORC file schema, or skip using the 
filter if the casting failed.

BTW, the Java implementation of the ORC lib is slightly different to its C++ 
implementation. The ORC C++ lib currently supports these types: 
https://github.com/apache/orc/blob/rel/release-1.7.0/c++/include/orc/sargs/Literal.hh#L72-L110



--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Wed, 16 Feb 2022 02:55:14 +0000
Gerrit-HasComments: Yes

Reply via email to