Zoltán Borók-Nagy created IMPALA-12308:
------------------------------------------
Summary: Implement DIRECTED distribution mode for Iceberg tables
Key: IMPALA-12308
URL: https://issues.apache.org/jira/browse/IMPALA-12308
Project: IMPALA
Issue Type: Bug
Components: Backend, Frontend
Reporter: Zoltán Borók-Nagy
Currently there are two distribution modes for JOIN operators:
* BROADCAST: RHS is delivered to all executors of LHS
* PARTITIONED: both LHS and RHS are shuffled across executors
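As a rough illustration of the two existing modes, the sketch below models executors as plain strings and rows as Python tuples. The function names broadcast and partitioned, and the hash-modulo routing, are assumptions made for this example and are not Impala's actual exchange code.

```python
def broadcast(rhs_rows, executors):
    """BROADCAST: every executor of the LHS receives a full copy of the RHS."""
    return {ex: list(rhs_rows) for ex in executors}


def partitioned(rows, executors, key):
    """PARTITIONED: each row goes to the executor selected by hashing its join key.

    Applying the same function to LHS and RHS sends matching keys to the
    same executor, at the cost of moving both inputs over the network.
    """
    out = {ex: [] for ex in executors}
    for row in rows:
        target = executors[hash(key(row)) % len(executors)]
        out[target].append(row)
    return out
```

With BROADCAST the RHS is copied once per executor, while with PARTITIONED every row of both sides may cross the network exactly once.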
We implement reading of an Iceberg V2 table (with position delete files) via an
ANTI JOIN operator. LHS is the SCAN operator of the data records, RHS is the
SCAN operator of the delete records. The delete records contain the (file_path, pos)
information of the deleted rows.
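The ANTI JOIN semantics can be sketched as follows. This is a minimal model, assuming data rows are dictionaries carrying their own file_path and pos, and delete records are (file_path, pos) tuples; the function name anti_join and the row layout are hypothetical.

```python
def anti_join(data_rows, delete_records):
    """Return the data rows that have no matching (file_path, pos) delete record."""
    deleted = set(delete_records)
    return [
        row for row in data_rows
        if (row["file_path"], row["pos"]) not in deleted
    ]


data_rows = [
    {"file_path": "a.parquet", "pos": 0, "value": "r0"},
    {"file_path": "a.parquet", "pos": 1, "value": "r1"},
    {"file_path": "b.parquet", "pos": 0, "value": "r2"},
]
delete_records = [("a.parquet", 1)]
live_rows = anti_join(data_rows, delete_records)
```

Here live_rows keeps the rows at a.parquet position 0 and b.parquet position 0, because only (a.parquet, 1) is marked as deleted.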
This means we can invent another distribution mode, just for Iceberg V2 tables
with position deletes: DIRECTED distribution mode.
At scheduling time we must save the information about the data SCAN operators,
i.e. on which nodes they are going to be executed. The LHS doesn't need to be
shuffled over the network.
The RHS can then use the scheduling information to send each delete record only
to the host that processes the corresponding data file.
This minimizes network communication.
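The routing idea behind DIRECTED mode could look roughly like the sketch below. It assumes the scheduler exposes a mapping from data file path to the host that scans it; the names file_to_host and route_deletes_directed are hypothetical, and dropping delete records for unscheduled files is a simplification of this example.

```python
from collections import defaultdict


def route_deletes_directed(delete_records, file_to_host):
    """Group delete records by the host that scans the referenced data file."""
    per_host = defaultdict(list)
    for file_path, pos in delete_records:
        host = file_to_host.get(file_path)
        if host is None:
            # Assumption: no data scan was scheduled for this file, so the
            # record cannot match any LHS row and is dropped in this sketch.
            continue
        per_host[host].append((file_path, pos))
    return dict(per_host)


file_to_host = {"a.parquet": "host1", "b.parquet": "host2"}
delete_records = [
    ("a.parquet", 3),
    ("b.parquet", 0),
    ("a.parquet", 7),
    ("c.parquet", 1),
]
routed = route_deletes_directed(delete_records, file_to_host)
```

Each delete record is sent to at most one host, and the data rows never leave the host that scanned them.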
We can also add further optimizations to the Iceberg V2 operator
(IcebergDeleteNode):
* Compare the pointers of the file paths instead of doing string compare
* All tuples in a row batch belong to the same file, and positions are in
ascending order
** Only one lookup is needed from the hash table
** We can add fast paths to skip testing the whole row batch (when the row
batch's position range is outside of the delete position range)
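The single-lookup and fast-path ideas can be illustrated with the sketch below. It assumes the hash table maps each file path to a sorted list of deleted positions; filter_batch and deletes_by_file are hypothetical names, and the real IcebergDeleteNode may organize its data differently.

```python
from bisect import bisect_left


def filter_batch(file_path, positions, deletes_by_file):
    """Return the positions of one row batch that survive the position deletes.

    Assumes every row in the batch comes from file_path, positions is
    ascending, and each list in deletes_by_file is sorted ascending.
    """
    if not positions:
        return []
    # One hash-table lookup serves the whole batch.
    deleted = deletes_by_file.get(file_path)
    if not deleted:
        return list(positions)
    # Fast path: the batch's position range lies outside the delete range.
    if positions[-1] < deleted[0] or positions[0] > deleted[-1]:
        return list(positions)
    # Slow path: walk both sorted sequences once.
    i = bisect_left(deleted, positions[0])
    kept = []
    for pos in positions:
        while i < len(deleted) and deleted[i] < pos:
            i += 1
        if i == len(deleted) or deleted[i] != pos:
            kept.append(pos)
    return kept


deletes_by_file = {"a.parquet": [5, 6, 20]}
before_range = filter_batch("a.parquet", [0, 1, 2], deletes_by_file)
overlapping = filter_batch("a.parquet", [4, 5, 6, 7], deletes_by_file)
```

The first call takes the fast path because positions 0 to 2 lie below the smallest deleted position; the second call overlaps the delete range and is filtered row by row.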
--
This message was sent by Atlassian Jira
(v8.20.10#820010)