Zoltán Borók-Nagy created IMPALA-12308:
------------------------------------------

             Summary: Implement DIRECTED distribution mode for Iceberg tables
                 Key: IMPALA-12308
                 URL: https://issues.apache.org/jira/browse/IMPALA-12308
             Project: IMPALA
          Issue Type: Bug
          Components: Backend, Frontend
            Reporter: Zoltán Borók-Nagy


Currently there are two distribution modes for JOIN-operators:
* BROADCAST: RHS is delivered to all executors of LHS
* PARTITIONED: both LHS and RHS are shuffled across executors

We implement reading of an Iceberg V2 table (with position delete files) via an 
ANTI JOIN operator. LHS is the SCAN operator of the data records, RHS is the 
SCAN operator of the delete records. The delete record contain (file_path, pos) 
information of the deleted rows.

This means we can invent another distribution mode, just for Iceberg V2 tables 
with position deletes: DIRECTED distribution mode.

At scheduling we must save the information about data SCAN operators, i.e. on 
which nodes are they going to be executed. The LHS don't need to be shuffled 
over the network.
The delete records of RHS can use the scheduling information to transfer delete 
records to the hosts that process the corresponding data file.

This minimizes network communication.
We can also add further optimizations to the Iceberg V2 operator 
(IcebergDeleteNode):
* Compare the pointers of the file paths instead of doing string compare
* Each tuple in a rowbatch belong to the same file, and positions are in 
ascending order
** Onlyone lookup is needed from the Hash table
** We can add fast paths to skip testing the whole rowbatch (when the row 
batch's position range is outside of the delete position range)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to