[
https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabor Kaszab updated IMPALA-12308:
----------------------------------
Issue Type: Improvement (was: Bug)
> Implement DIRECTED distribution mode for Iceberg tables
> -------------------------------------------------------
>
> Key: IMPALA-12308
> URL: https://issues.apache.org/jira/browse/IMPALA-12308
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend, Frontend
> Reporter: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg, performance
>
> Currently there are two distribution modes for JOIN-operators:
> * BROADCAST: RHS is delivered to all executors of LHS
> * PARTITIONED: both LHS and RHS are shuffled across executors
> We implement reading of an Iceberg V2 table (with position delete files) via
> an ANTI JOIN operator. LHS is the SCAN operator of the data records, RHS is
> the SCAN operator of the delete records. The delete record contain
> (file_path, pos) information of the deleted rows.
> This means we can invent another distribution mode, just for Iceberg V2
> tables with position deletes: DIRECTED distribution mode.
> At scheduling we must save the information about data SCAN operators, i.e. on
> which nodes are they going to be executed. The LHS don't need to be shuffled
> over the network.
> The delete records of RHS can use the scheduling information to transfer
> delete records to the hosts that process the corresponding data file.
> This minimizes network communication.
> We can also add further optimizations to the Iceberg V2 operator
> (IcebergDeleteNode):
> * Compare the pointers of the file paths instead of doing string compare
> * Each tuple in a rowbatch belong to the same file, and positions are in
> ascending order
> ** Onlyone lookup is needed from the Hash table
> ** We can add fast paths to skip testing the whole rowbatch (when the row
> batch's position range is outside of the delete position range)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]