[ 
https://issues.apache.org/jira/browse/IMPALA-13768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-13768:
---------------------------------------
    Description: 
IcebergDeleteBuilder assumes that it only receives delete records for the 
paths of data files that are scheduled for its corresponding SCAN operator.

This assumption does not hold in the following cases:
 * a single-node plan is executed (no DIRECTED mode, no filtering)
 * the number of output channels is 1 (again, no DIRECTED mode, no filtering)
 * there is a bug in DIRECTED mode, see below

In KrpcDataStreamSender::Send(), the variable 'skipped_prev_row' is set but never checked: 
[https://github.com/apache/impala/blob/1b6395b8db09d271bd166bf501bdf7038d8be644/be/src/runtime/krpc-data-stream-sender.cc#L1174]
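
To illustrate how an unchecked flag of this kind could let redundant delete records reach a builder, here is a minimal, self-contained sketch. It is a simplified model and not the Impala code: DeleteRecord, RouteDirected, CountInvalid, path_to_channel and the check_skipped_prev_row switch are hypothetical names introduced for illustration, and the caching of the previous row's routing decision is an assumption about how the sender behaves.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model; these are not Impala's actual types.
struct DeleteRecord {
  std::string file_path;  // data file the delete record refers to
  long pos;               // row position inside that data file
};

typedef std::vector<std::pair<int, DeleteRecord> > SentRows;

// Models a DIRECTED-style sender: each delete record goes to the channel whose
// scanner reads the referenced data file; records for unscheduled files are
// dropped. Consecutive records often share a file path, so the decision made
// for the previous row is cached and reused (an assumption of this sketch).
// With check_skipped_prev_row == false the "previous row was skipped" flag is
// ignored, so a stale channel is reused for rows that should be dropped.
SentRows RouteDirected(const std::vector<DeleteRecord>& rows,
                       const std::map<std::string, int>& path_to_channel,
                       bool check_skipped_prev_row) {
  SentRows sent;
  const std::string* prev_path = NULL;
  int prev_channel = -1;
  bool skipped_prev_row = false;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    const DeleteRecord& row = rows[i];
    if (prev_path != NULL && row.file_path == *prev_path) {
      // Same data file as the previous row: reuse the cached decision.
      if (check_skipped_prev_row && skipped_prev_row) continue;
      if (prev_channel >= 0) sent.push_back(std::make_pair(prev_channel, row));
      continue;
    }
    prev_path = &row.file_path;
    std::map<std::string, int>::const_iterator it =
        path_to_channel.find(row.file_path);
    if (it == path_to_channel.end()) {
      // No scanner is scheduled for this file: drop the row and remember that.
      // Note that prev_channel still holds the channel of an earlier file.
      skipped_prev_row = true;
      continue;
    }
    skipped_prev_row = false;
    prev_channel = it->second;
    sent.push_back(std::make_pair(prev_channel, row));
  }
  return sent;
}

// Models the builder-side assumption: a record is invalid for a channel if the
// referenced data file is not scheduled for that channel's scanner.
int CountInvalid(const SentRows& sent,
                 const std::map<std::string, int>& path_to_channel) {
  int invalid = 0;
  for (std::size_t i = 0; i < sent.size(); ++i) {
    std::map<std::string, int>::const_iterator it =
        path_to_channel.find(sent[i].second.file_path);
    if (it == path_to_channel.end() || it->second != sent[i].first) ++invalid;
  }
  return invalid;
}
```

In this model, feeding one record for a scheduled file followed by two records for an unscheduled file drops only the first of the two when the flag is ignored; the second reuses the stale channel and arrives at a builder that does not expect its path. With the flag checked, both are dropped.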

Repro:
{noformat}
create table ice_invalid_deletes (bi bigint, year int)
partitioned by spec (year)
stored as iceberg tblproperties ('format-version'='2');

insert into ice_invalid_deletes select bigint_col, year from 
functional.alltypes where month = 10;

with v as (select max(bi) as max_bi from ice_invalid_deletes) insert into 
ice_invalid_deletes select bi + v.max_bi, year from v, ice_invalid_deletes;

delete from ice_invalid_deletes where bi % 11 = 0;

-- All of the following queries result in an error:
-- single output channel
select count(*) from ice_invalid_deletes where year=2010 and bi = 180;
-- bug in KrpcDataStreamSender::Send
select count(*) from ice_invalid_deletes where year>2000 and bi = 180;
-- single node plan
set num_nodes=1;
select count(*) from ice_invalid_deletes where year>2000 and bi = 180;
{noformat}

  was:
IcebergDeleteBuilder assumes that it should only receive delete records for 
paths of data files that are scheduled for its corresponding SCAN operator.

It is not true in the following cases:
 * single node plan is executed (no DIRECTED mode, no filtering)
 * number of output channels is 1 (again, no DIRECTED mode, no filtering)
 * bug in DIRECTED mode, see below

In KrpcDataStreamSender::Send(), variable 'skipped_prev_row' is never checked: 
https://github.com/apache/impala/blob/1b6395b8db09d271bd166bf501bdf7038d8be644/be/src/runtime/krpc-data-stream-sender.cc#L1174

Repro:
{noformat}
create table ice_invalid_deletes (bi bigint, year int)
partitioned by spec (year)
stored as iceberg tblproperties ('format-version'='2');

insert into ice_invalid_deletes select bigint_col, year from 
functional.alltypes where month = 10;

with v as (select max(bi) as max_bi from ice_invalid_deletes) insert into 
ice_invalid_deletes select bi + v.max_bi, year from v, ice_invalid_deletes;

delete from ice_invalid_deletes where bi % 11 = 0;

-- All the followings result in error:
-- single output channel
select count(*) from ice_invalid_deletes where year=2010 and bi = 180;
-- bug in KrpcDataStreamSender::Send
select count(*) from ice_invalid_deletes where year>2010 and bi = 180;
-- single node plan
set num_nodes=1;
select count(*) from ice_invalid_deletes where year>2010 and bi = 180;{noformat}


> Redundant Iceberg delete records are shuffled around which cause error 
> "Invalid file path arrived at builder"
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13768
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13768
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg


