Zoltán Borók-Nagy created IMPALA-12327:
------------------------------------------

             Summary: Iceberg V2 operator wrong results in PARTITIONED mode
                 Key: IMPALA-12327
                 URL: https://issues.apache.org/jira/browse/IMPALA-12327
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Zoltán Borók-Nagy


The Iceberg delete node tries to do mini merge-joins between data records and 
delete records. This works in DISTRIBUTED mode, and most of the time in 
PARTITIONED mode as well. The Iceberg delete node had the wrong assumption that 
if the rows in a row batch belong to the same file, and come in ascending 
order, we don't need to update the IcebergDeleteState which tracks the state of 
the probing.

But when PARTITIONED mode is used, we cannot rely on ascending row order, not 
even inside row batches, not even when the previous file path is the same as 
the current one.

This is because files with multiple blocks can be processed by multiple hosts 
in parallel, then the rows are getting hash-exchanged based on their file 
paths. Then the exchange-receiver at the LHS coalesces the row batches from 
multiple senders, hence the row IDs getting unordered.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to