[ https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785706#comment-17785706 ]

ASF subversion and git services commented on IMPALA-12308:
----------------------------------------------------------

Commit 18c31fd383d81082b06071b03b49650ec89e7086 in impala's branch 
refs/heads/master from Gabor Kaszab
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=18c31fd38 ]

IMPALA-12308: DIRECTED distribution mode for V2 Iceberg tables

For Iceberg tables, when joining the data files with the delete files,
both of the current distribution modes (broadcast, partitioned) are
wasteful. The idea is that each row read from a delete file contains the
name of the data file that this particular delete row refers to, so if
we knew where that data file is scheduled, we could send that delete row
directly there.

This patch enhances the scheduler to collect information about which
data file is scheduled on which host. Since the scan nodes for the data
files are on the same hosts as the Iceberg join nodes, we can send the
delete file rows directly to those specific hosts.
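The routing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Impala's actual C++ implementation; all function and variable names here are made up:

```python
# Illustrative sketch of DIRECTED routing of delete-file rows, assuming
# the scheduler has produced (host, data_file_path) scan assignments.

def build_file_to_host_map(scan_assignments):
    """scan_assignments: iterable of (host, data_file_path) pairs recorded
    at scheduling time. A data file may span several scan ranges, so it can
    map to more than one host (as observed with Hive-written tables)."""
    file_to_hosts = {}
    for host, path in scan_assignments:
        file_to_hosts.setdefault(path, set()).add(host)
    return file_to_hosts

def route_delete_row(delete_row, file_to_hosts):
    """delete_row is a (file_path, pos) pair from a position delete file.
    Returns the hosts that must receive it; with DIRECTED mode this is the
    small set of hosts scanning that data file rather than every executor."""
    file_path, _pos = delete_row
    return file_to_hosts.get(file_path, set())
```

In scenario a) below each data file maps to a single host, so every delete row goes to exactly one destination; in scenario b) a file split across scan ranges maps to several hosts, which is the extra sending cost measured there.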

Functional testing:
 - Re-run full test suite to check for regressions.

Performance testing:
1) Local machine: SELECT COUNT(1) FROM TPCH10_parquet.lineitem
Around 15% of the rows are deleted.
As the table is unpartitioned, I got a small number of delete files of
relatively large size.
Query runtime decreased by ~80%

2) Local machine: SELECT COUNT(1) FROM TPCDS10_store_sales
Around 15% of the rows are deleted.
The table is partitioned, which results in more but smaller delete
files.
Query runtime decreased by ~50%

3) Performance testing on a multi-node cluster with data stored on S3.
SELECT COUNT(1) FROM a scaled store_sales table having ~8.6B rows, of
which ~15% are deleted.
Here we had 2 scenarios:
  a) Table is written by Impala: each delete file row is sent to exactly
     one host.
  b) Table is written by Hive: here the data files are apparently
     bigger, and one data file might be split across multiple scan
     ranges. As a result, one delete file row might be sent to multiple
     hosts. The runtime difference compared to the a) run is the time
     spent sending the extra delete file rows.
  - Results with 10-node
    a) Runtime decreased by ~80%.
    b) Runtime decreased by ~60%.
  - Results with 20-node
    a) Runtime decreased by ~65%.
    b) Runtime decreased by ~42%.
  - Results with 40-node
    a) Runtime decreased by ~55%.
    b) Runtime decreased by ~42%.

Change-Id: I212afd7c9e94551a1c50a40ccb0e3c1f7ecdf3d2
Reviewed-on: http://gerrit.cloudera.org:8080/20548
Reviewed-by: Tamas Mate <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Implement DIRECTED distribution mode for Iceberg tables
> -------------------------------------------------------
>
>                 Key: IMPALA-12308
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12308
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend, Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Gabor Kaszab
>            Priority: Major
>              Labels: impala-iceberg, performance
>
> Currently there are two distribution modes for JOIN-operators:
> * BROADCAST: RHS is delivered to all executors of LHS
> * PARTITIONED: both LHS and RHS are shuffled across executors
> We implement reading of an Iceberg V2 table (with position delete files) via 
> an ANTI JOIN operator. LHS is the SCAN operator of the data records, RHS is 
> the SCAN operator of the delete records. The delete records contain 
> (file_path, pos) information for the deleted rows.
> This means we can invent another distribution mode, just for Iceberg V2 
> tables with position deletes: DIRECTED distribution mode.
> At scheduling we must save the information about data SCAN operators, i.e. 
> which nodes they are going to be executed on. The LHS doesn't need to be 
> shuffled over the network.
> The delete records of RHS can use the scheduling information to transfer 
> delete records to the hosts that process the corresponding data file.
> This minimizes network communication.
> We can also add further optimizations to the Iceberg V2 operator 
> (IcebergDeleteNode):
> * Compare the pointers of the file paths instead of doing string compare
> * Each tuple in a row batch belongs to the same file, and positions are in 
> ascending order
> ** Only one lookup is needed in the hash table
> ** We can add fast paths to skip testing the whole row batch (when the row 
> batch's position range is outside of the delete position range)
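The fast path proposed in the issue can be sketched as follows. This is an illustrative Python sketch under the assumptions quoted above (one file per row batch, ascending positions), not Impala's IcebergDeleteNode; `filter_batch` and its arguments are hypothetical names:

```python
# Illustrative sketch of the proposed fast path: since all rows in a batch
# belong to one data file and positions ascend, a single lookup fetches that
# file's sorted delete positions, and the whole batch can be skipped when
# its position range does not overlap the delete position range.

import bisect

def filter_batch(positions, deleted_positions):
    """positions: ascending row positions of one data file in a row batch.
    deleted_positions: sorted delete positions for the same file (the one
    hash-table lookup in the real operator). Returns surviving positions."""
    if not positions or not deleted_positions:
        return list(positions)
    # Fast path: the batch's position range lies entirely outside the
    # delete position range, so no row in it can be deleted.
    if positions[-1] < deleted_positions[0] or positions[0] > deleted_positions[-1]:
        return list(positions)
    out = []
    for pos in positions:
        i = bisect.bisect_left(deleted_positions, pos)
        if i >= len(deleted_positions) or deleted_positions[i] != pos:
            out.append(pos)
    return out
```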



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
