[
https://issues.apache.org/jira/browse/HIVE-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xinmingchang updated HIVE-28120:
--------------------------------
Description:
h1. step1:create iceberg test table
create table tmp.test_iceberg_overwrite_union_all(
a string,
ptime bigint
)
partitioned by spec(ptime)
stored by iceberg
stored as parquet
location
'hdfs://ns1/user/hive/warehouse_4/tmp.db/test_iceberg_overwrite_union_all'
;
h1. step2:create data table
create table tmp.data(
a string
)
stored by iceberg
stored as parquet
location 'hdfs://ns1/user/hive/warehouse_4/tmp.db/data'
;
insert into tmp.data values('1'),('2'),('3');
h1. step3:write data to iceberg test table
insert overwrite table tmp.test_iceberg_overwrite_union_all
select a,20240315120000 as ptime
from tmp.data
union all
select a,20240315120000 as ptime
from tmp.data
;
select * from tmp.test_iceberg_overwrite_union_all;
the result only has 3 records:
{+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
|test_iceberg_overwrite_union_all.a |test_iceberg_overwrite_union_all.ptime |
{+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
|1 |20240315120000 |
|2 |20240315120000 |
|3 |20240315120000 |
{+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
the right result should has 6 records:
{+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
|test_iceberg_overwrite_union_all.a |test_iceberg_overwrite_union_all.ptime |
{+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
|1 |20240315120000 |
|2 |20240315120000 |
|3 |20240315120000 |
|1 |20240315120000 |
|2 |20240315120000 |
|3 |20240315120000 |
{+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
According to the hiveserver log, this query will start two jobs, and each job
will be committed. The problem is that the job that is committed later is also
an overwrite, causing the result of the first commit to be overwritten. like
this:
2024-03-05T22:10:12,995 INFO [iceberg-commit-table-pool-0]:
hive.HiveIcebergOutputCommitter (:()) - Committing job has started for table:
default_iceberg.tmp.test_iceberg_overwrite_union_all
2024-03-05T22:10:13,081 INFO [iceberg-commit-table-pool-1]:
hive.HiveIcebergOutputCommitter (:()) - Committing job has started for table:
default_iceberg.tmp.test_iceberg_overwrite_union_all
2024-03-05T22:10:15,152 INFO [iceberg-commit-table-pool-0]:
hive.HiveIcebergOutputCommitter (:()) - Overwrite commit took 2157 ms for
table: default_iceberg.tmp.test_iceberg_overwrite_union_all with 1 file(s)
2024-03-05T22:10:16,980 INFO [iceberg-commit-table-pool-1]:
hive.HiveIcebergOutputCommitter (:()) - Overwrite commit took 3899 ms for
table: default_iceberg.tmp.test_iceberg_overwrite_union_all with 1 file(s)
was:
h1. step1:create iceberg test table
create table tmp.test_iceberg_overwrite_union_all(
a string,
ptime bigint
)
partitioned by spec(ptime)
stored by iceberg
stored as parquet
location
'hdfs://ns1/user/hive/warehouse_4/tmp.db/test_iceberg_overwrite_union_all'
;
h1. step2:create data table
create table tmp.data(
a string
)
stored by iceberg
stored as parquet
location 'hdfs://ns1/user/hive/warehouse_4/tmp.db/data'
;
insert into tmp.data values('1'),('2'),('3');
h1. step3:write data to iceberg test table
insert overwrite table tmp.test_iceberg_overwrite_union_all
select a,20240315120000 as ptime
from tmp.data
union all
select a,20240315120000 as ptime
from tmp.data
;
select * from tmp.test_iceberg_overwrite_union_all;
the result only has 3 records:
{+}------------------------------------{-}{-}{+}----------------------------------------+
|test_iceberg_overwrite_union_all.a |test_iceberg_overwrite_union_all.ptime |
{+}------------------------------------{-}{-}{+}----------------------------------------+
|1 |20240315120000 |
|2 |20240315120000 |
|3 |20240315120000 |
{+}------------------------------------{-}{-}{+}----------------------------------------+
the right result should has 6 records:
{+}------------------------------------{-}{-}{+}----------------------------------------+
|test_iceberg_overwrite_union_all.a |test_iceberg_overwrite_union_all.ptime |
{+}------------------------------------{-}{-}{+}----------------------------------------+
|1 |20240315120000 |
|2 |20240315120000 |
|3 |20240315120000 |
|1 |20240315120000 |
|2 |20240315120000 |
|3 |20240315120000 |
{+}------------------------------------{-}{-}{+}----------------------------------------+
According to the hiveserver log, this query will start two jobs, and each job
will be committed. The problem is that the job that is committed later is also
an overwrite, causing the result of the first commit to be overwritten.
> When insert overwrite the iceberg table, data will loss if the sql contains
> union all
> -------------------------------------------------------------------------------------
>
> Key: HIVE-28120
> URL: https://issues.apache.org/jira/browse/HIVE-28120
> Project: Hive
> Issue Type: Bug
> Components: Iceberg integration
> Affects Versions: 4.0.0-beta-1
> Environment: hadoop version: 3.3.1
> hive version: 4.0.0-beta-1
> iceberg version: 1.3.0
> Reporter: xinmingchang
> Priority: Critical
>
> h1. step1:create iceberg test table
> create table tmp.test_iceberg_overwrite_union_all(
> a string,
> ptime bigint
> )
> partitioned by spec(ptime)
> stored by iceberg
> stored as parquet
> location
> 'hdfs://ns1/user/hive/warehouse_4/tmp.db/test_iceberg_overwrite_union_all'
> ;
> h1. step2:create data table
> create table tmp.data(
> a string
> )
> stored by iceberg
> stored as parquet
> location 'hdfs://ns1/user/hive/warehouse_4/tmp.db/data'
> ;
> insert into tmp.data values('1'),('2'),('3');
> h1. step3:write data to iceberg test table
> insert overwrite table tmp.test_iceberg_overwrite_union_all
> select a,20240315120000 as ptime
> from tmp.data
> union all
> select a,20240315120000 as ptime
> from tmp.data
> ;
> select * from tmp.test_iceberg_overwrite_union_all;
>
> the result only has 3 records:
> {+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
> |test_iceberg_overwrite_union_all.a |test_iceberg_overwrite_union_all.ptime
> |
> {+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
> |1 |20240315120000 |
> |2 |20240315120000 |
> |3 |20240315120000 |
> {+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
> the right result should has 6 records:
> {+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
> |test_iceberg_overwrite_union_all.a |test_iceberg_overwrite_union_all.ptime
> |
> {+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
> |1 |20240315120000 |
> |2 |20240315120000 |
> |3 |20240315120000 |
> |1 |20240315120000 |
> |2 |20240315120000 |
> |3 |20240315120000 |
> {+}-----------------------------------{-}{{-}}{-}{-}{+}---------------------------------------+
>
> According to the hiveserver log, this query will start two jobs, and each job
> will be committed. The problem is that the job that is committed later is
> also an overwrite, causing the result of the first commit to be overwritten.
> like this:
> 2024-03-05T22:10:12,995 INFO [iceberg-commit-table-pool-0]:
> hive.HiveIcebergOutputCommitter (:()) - Committing job has started for table:
> default_iceberg.tmp.test_iceberg_overwrite_union_all
> 2024-03-05T22:10:13,081 INFO [iceberg-commit-table-pool-1]:
> hive.HiveIcebergOutputCommitter (:()) - Committing job has started for table:
> default_iceberg.tmp.test_iceberg_overwrite_union_all
> 2024-03-05T22:10:15,152 INFO [iceberg-commit-table-pool-0]:
> hive.HiveIcebergOutputCommitter (:()) - Overwrite commit took 2157 ms for
> table: default_iceberg.tmp.test_iceberg_overwrite_union_all with 1 file(s)
> 2024-03-05T22:10:16,980 INFO [iceberg-commit-table-pool-1]:
> hive.HiveIcebergOutputCommitter (:()) - Overwrite commit took 3899 ms for
> table: default_iceberg.tmp.test_iceberg_overwrite_union_all with 1 file(s)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)