[ https://issues.apache.org/jira/browse/IMPALA-13768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltán Borók-Nagy updated IMPALA-13768:
---------------------------------------
Description:
IcebergDeleteBuilder assumes that it only receives delete records for
paths of data files that are scheduled for its corresponding SCAN operator.
This assumption does not hold in the following cases:
* a single-node plan is executed (no DIRECTED mode, no filtering)
* the number of output channels is 1 (again, no DIRECTED mode, no filtering)
* a bug in DIRECTED mode is hit (see below)
In KrpcDataStreamSender::Send(), the variable 'skipped_prev_row' is never checked:
[https://github.com/apache/impala/blob/1b6395b8db09d271bd166bf501bdf7038d8be644/be/src/runtime/krpc-data-stream-sender.cc#L1174]
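To illustrate how an unchecked 'skipped_prev_row' flag could let redundant delete records through, here is a minimal Python model of DIRECTED-style routing. It is only a sketch under assumptions, not the code at the link above: the function route_delete_rows, its check_skipped parameter, and the file names are hypothetical, and the real sender may differ in its details. The model assumes that rows for the same data file arrive adjacently and that the sender caches the routing decision of the previous row.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, int]  # (data file path, position of the deleted row)


def route_delete_rows(
    rows: List[Row],
    file_to_channels: Dict[str, List[int]],
    check_skipped: bool,
) -> Dict[int, List[Row]]:
    """Route delete rows to channels, caching the previous row's decision.

    Hypothetical model: file_to_channels lists only the data files that are
    scheduled for some scan; rows of other files should be dropped.
    """
    sent: Dict[int, List[Row]] = {}
    prev_path: Optional[str] = None
    prev_channels: List[int] = []
    skipped_prev_row = False

    for path, pos in rows:
        if path == prev_path:
            # Fast path: same data file as the previous row.
            if check_skipped and skipped_prev_row:
                continue  # the previous row was dropped, so drop this one too
            # Without the check, a stale prev_channels list is reused here.
            for channel in prev_channels:
                sent.setdefault(channel, []).append((path, pos))
            continue

        prev_path = path
        channels = file_to_channels.get(path)
        if channels is None:
            # Data file is not scheduled anywhere: drop the row.
            skipped_prev_row = True
            continue

        skipped_prev_row = False
        prev_channels = channels
        for channel in channels:
            sent.setdefault(channel, []).append((path, pos))

    return sent


# Only a.parq is scheduled (on channel 0); b.parq was filtered out.
rows = [("a.parq", 0), ("b.parq", 1), ("b.parq", 2)]
scheduled = {"a.parq": [0]}

buggy = route_delete_rows(rows, scheduled, check_skipped=False)
fixed = route_delete_rows(rows, scheduled, check_skipped=True)
```

In this model, ignoring the flag sends the second b.parq record to channel 0 even though that channel never scans b.parq, which is the kind of input that would make the builder report an invalid file path. Honoring the flag drops both b.parq records.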
Repro:
{noformat}
create table ice_invalid_deletes (bi bigint, year int)
partitioned by spec (year)
stored as iceberg tblproperties ('format-version'='2');
insert into ice_invalid_deletes select bigint_col, year from
functional.alltypes where month = 10;
with v as (select max(bi) as max_bi from ice_invalid_deletes) insert into
ice_invalid_deletes select bi + v.max_bi, year from v, ice_invalid_deletes;
delete from ice_invalid_deletes where bi % 11 = 0;
-- All of the following queries result in an error:
-- single output channel
select count(*) from ice_invalid_deletes where year=2010 and bi = 180;
-- bug in KrpcDataStreamSender::Send
select count(*) from ice_invalid_deletes where year>2000 and bi = 180;
-- single node plan
set num_nodes=1;
select count(*) from ice_invalid_deletes where year>2000 and bi = 180;{noformat}
was:
IcebergDeleteBuilder assumes that it only receives delete records for
paths of data files that are scheduled for its corresponding SCAN operator.
This assumption does not hold in the following cases:
* a single-node plan is executed (no DIRECTED mode, no filtering)
* the number of output channels is 1 (again, no DIRECTED mode, no filtering)
* a bug in DIRECTED mode is hit (see below)
In KrpcDataStreamSender::Send(), the variable 'skipped_prev_row' is never checked:
https://github.com/apache/impala/blob/1b6395b8db09d271bd166bf501bdf7038d8be644/be/src/runtime/krpc-data-stream-sender.cc#L1174
Repro:
{noformat}
create table ice_invalid_deletes (bi bigint, year int)
partitioned by spec (year)
stored as iceberg tblproperties ('format-version'='2');
insert into ice_invalid_deletes select bigint_col, year from
functional.alltypes where month = 10;
with v as (select max(bi) as max_bi from ice_invalid_deletes) insert into
ice_invalid_deletes select bi + v.max_bi, year from v, ice_invalid_deletes;
delete from ice_invalid_deletes where bi % 11 = 0;
-- All of the following queries result in an error:
-- single output channel
select count(*) from ice_invalid_deletes where year=2010 and bi = 180;
-- bug in KrpcDataStreamSender::Send
select count(*) from ice_invalid_deletes where year>2010 and bi = 180;
-- single node plan
set num_nodes=1;
select count(*) from ice_invalid_deletes where year>2010 and bi = 180;{noformat}
> Redundant Iceberg delete records are shuffled around which cause error
> "Invalid file path arrived at builder"
> -------------------------------------------------------------------------------------------------------------
>
> Key: IMPALA-13768
> URL: https://issues.apache.org/jira/browse/IMPALA-13768
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg