[jira] [Commented] (IMPALA-12327) Iceberg V2 operator wrong results in PARTITIONED mode

ASF subversion and git services (Jira) Mon, 21 Aug 2023 15:30:04 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-12327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757111#comment-17757111
 ]


ASF subversion and git services commented on IMPALA-12327:
----------------------------------------------------------

Commit a34f7ce63299c72ef45a99b01bb4e80210befbff in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=a34f7ce63 ]

IMPALA-12342: Erasure coding build fails on loading iceberg_lineitem_multiblock

Before this patch we tried to load the table
iceberg_lineitem_multiblock with an HDFS block size of 524288. This
failed in builds that use HDFS erasure coding, which requires a block
size of at least 1048576.

This patch increases the block size to 1048576. With this block size
the table still triggers the bug that was fixed by IMPALA-12327. To
have more tests with multiblock tables, this patch also adds the table
iceberg_lineitem_sixblocks and a few tests with different MT_DOP
settings.

Testing:
 * tested in build with HDFS EC

Change-Id: Iad15a335407c12578eb822bb1cb4450647502e50
Reviewed-on: http://gerrit.cloudera.org:8080/20359
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Iceberg V2 operator wrong results in PARTITIONED mode
> -----------------------------------------------------
>
>                 Key: IMPALA-12327
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12327
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 4.3.0
>
>
> The Iceberg delete node tries to do mini merge-joins between data records and 
> delete records. This works in BROADCAST mode, and most of the time in 
> PARTITIONED mode as well. The Iceberg delete node wrongly assumed that if 
> the rows in a row batch belong to the same file and arrive in ascending 
> order, we don't need to update the IcebergDeleteState, which tracks the 
> state of the probing.
> But when PARTITIONED mode is used, we cannot rely on ascending row order, not 
> even inside row batches, not even when the previous file path is the same as 
> the current one.
> This is because files with multiple blocks can be processed by multiple hosts 
> in parallel, and the rows are then hash-exchanged based on their file 
> paths. The exchange receiver at the LHS then coalesces the row batches from 
> multiple senders, so the row IDs arrive out of order.
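
The failure mode described above can be illustrated with a small,
self-contained sketch. This is not Impala's code and not the actual fix;
the function names, the file name f.parquet, and the delete positions are
all hypothetical. naive_probe keeps a forward-only cursor into the sorted
delete positions and resets it only when the file path changes, which
mirrors the faulty assumption. robust_probe looks up every row
independently with a binary search, one possible way to stay correct when
rows arrive out of order.

```python
from bisect import bisect_left


def naive_probe(batch, deletes):
    """Merge-join style probe that assumes ascending positions per file."""
    out = []
    prev_file, cursor = None, 0
    for path, pos in batch:
        dels = deletes.get(path, [])
        if path != prev_file:
            # Probe state is reset only when the file path changes.
            prev_file, cursor = path, 0
        # The cursor only ever moves forward.
        while cursor < len(dels) and dels[cursor] < pos:
            cursor += 1
        if cursor < len(dels) and dels[cursor] == pos:
            continue  # row is deleted
        out.append((path, pos))
    return out


def robust_probe(batch, deletes):
    """Per-row lookup that makes no assumption about row order."""
    out = []
    for path, pos in batch:
        dels = deletes.get(path, [])
        i = bisect_left(dels, pos)
        if i < len(dels) and dels[i] == pos:
            continue  # row is deleted
        out.append((path, pos))
    return out


# Hypothetical data: one file with two blocks, scanned by two hosts.
deletes = {"f.parquet": [1, 5]}  # sorted deleted row positions
block_a = [("f.parquet", p) for p in range(0, 4)]  # rows 0..3
block_b = [("f.parquet", p) for p in range(4, 8)]  # rows 4..7

# The receiver coalesces batches from both senders, so positions of the
# same file are no longer ascending within the batch.
batch = block_b[:2] + block_a + block_b[2:]

naive_rows = naive_probe(batch, deletes)
robust_rows = robust_probe(batch, deletes)
```

In this sketch the naive probe advances its cursor past delete position 1
while handling row 4, so when row 1 arrives later it is emitted even though
it was deleted. The per-row lookup drops both deleted rows regardless of
arrival order.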



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
