xinmingchang created HIVE-28120:
-----------------------------------
Summary: When insert overwrite the iceberg table, data will loss
if the sql contains union all
Key: HIVE-28120
URL: https://issues.apache.org/jira/browse/HIVE-28120
Project: Hive
Issue Type: Bug
Components: Iceberg integration
Affects Versions: 4.0.0-beta-1
Environment: hadoop version: 3.3.1
hive version: 4.0.0-beta-1
iceberg version: 1.3.0
Reporter: xinmingchang
h1. step1:create iceberg test table
create table tmp.test_iceberg_overwrite_union_all(
a string,
ptime bigint
)
partitioned by spec(ptime)
stored by iceberg
stored as parquet
location
'hdfs://ns1/user/hive/warehouse_4/tmp.db/test_iceberg_overwrite_union_all'
;
h1. step2:create data table
create table tmp.data(
a string
)
stored by iceberg
stored as parquet
location 'hdfs://ns1/user/hive/warehouse_4/tmp.db/data'
;
insert into tmp.data values('1'),('2'),('3');
h1. step3:write data to iceberg test table
insert overwrite table tmp.test_iceberg_overwrite_union_all
select a,20240315120000 as ptime
from tmp.data
union all
select a,20240315120000 as ptime
from tmp.data
;
select * from tmp.test_iceberg_overwrite_union_all;
the result only has 3 records:
+-------------------------------------+-----------------------------------------+
| test_iceberg_overwrite_union_all.a | test_iceberg_overwrite_union_all.ptime
|
+-------------------------------------+-----------------------------------------+
| 1 | 20240315120000
|
| 2 | 20240315120000
|
| 3 | 20240315120000
|
+-------------------------------------+-----------------------------------------+
the rigtht result should be:
+-------------------------------------+-----------------------------------------+
| test_iceberg_overwrite_union_all.a | test_iceberg_overwrite_union_all.ptime
|
+-------------------------------------+-----------------------------------------+
| 1 | 20240315120000
|
| 2 | 20240315120000
|
| 3 | 20240315120000
|
| 1 | 20240315120000
|
| 2 | 20240315120000
|
| 3 | 20240315120000
|
+-------------------------------------+-----------------------------------------+
According to the hiveserver log, this query will start two jobs, and each job
will be committed. The problem is that the job that is committed later is also
an overwrite, causing the result of the first commit to be overwritten.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)