[jira] [Updated] (HIVE-29120) When hive.default.fileformat is set to TextFile or ORC, using MR and Spark engines to query bucketed tables may result in data loss issues.

Junkun Sheng (Jira) Sun, 03 Aug 2025 19:11:12 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-29120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Junkun Sheng updated HIVE-29120:
--------------------------------
    Description: 
When hive.default.fileformat is set to TextFile or ORC, using MR and Spark 
engines to query bucketed tables may result in data loss issues.*

*The SQL that caused data loss is as follows:*
set hive.execution.engine=mr;
set hive.default.fileformat=TextFile;
drop table if exists lxw_test;
drop table if exists lxw_test1;
drop table if exists woa_all_user_info_his;
create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED 
BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED 
BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
create table woa_all_user_info_his(imei string,sndaid string,data_time 
string,pt string) row format delimited fields terminated by ',' stored as 
textfile;
insert into table woa_all_user_info_his 
values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
insert into table woa_all_user_info_his 
values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
insert overwrite table lxw_test  select imei,sndaid,data_time     from 
woa_all_user_info_his    where pt = '2012-05-28';
insert overwrite table lxw_test1  select imei,sndaid,data_time    from 
woa_all_user_info_his    where pt = '2012-05-28';
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select count(1) from lxw_test1 a  join lxw_test b  on a.imei = b.imei ;

expect result：
+------+
|   _c1   |
+------+
|    6      |
+------+

but, actual result:
+------+
|    _c1  |
+------+
|     5    |
+------+

  was:
??citation??*When hive.default.fileformat is set to TextFile or ORC, using MR 
and Spark engines to query bucketed tables may result in data loss issues.*

*The SQL that caused data loss is as follows:*
set hive.execution.engine=mr;
set hive.default.fileformat=TextFile;
drop table if exists lxw_test;
drop table if exists lxw_test1;
drop table if exists woa_all_user_info_his;
create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED 
BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED 
BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
create table woa_all_user_info_his(imei string,sndaid string,data_time 
string,pt string) row format delimited fields terminated by ',' stored as 
textfile;
insert into table woa_all_user_info_his 
values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
insert into table woa_all_user_info_his 
values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
insert overwrite table lxw_test  select imei,sndaid,data_time     from 
woa_all_user_info_his    where pt = '2012-05-28';
insert overwrite table lxw_test1  select imei,sndaid,data_time    from 
woa_all_user_info_his    where pt = '2012-05-28';
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select count(1) from lxw_test1 a  join lxw_test b  on a.imei = b.imei ;

expect result：
+------+
|   _c1   |
+------+
|    6      |
+------+

but, actual result:
+------+
|    _c1  |
+------+
|     5    |
+------+


> When hive.default.fileformat is set to TextFile or ORC, using MR and Spark 
> engines to query bucketed tables may result in data loss issues.
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29120
>                 URL: https://issues.apache.org/jira/browse/HIVE-29120
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 3.1.0
>            Reporter: Junkun Sheng
>            Priority: Major
>
> When hive.default.fileformat is set to TextFile or ORC, using MR and Spark 
> engines to query bucketed tables may result in data loss issues.*
> *The SQL that caused data loss is as follows:*
> set hive.execution.engine=mr;
> set hive.default.fileformat=TextFile;
> drop table if exists lxw_test;
> drop table if exists lxw_test1;
> drop table if exists woa_all_user_info_his;
> create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED 
> BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
> create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED 
> BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
> create table woa_all_user_info_his(imei string,sndaid string,data_time 
> string,pt string) row format delimited fields terminated by ',' stored as 
> textfile;
> insert into table woa_all_user_info_his 
> values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
> insert into table woa_all_user_info_his 
> values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
> insert overwrite table lxw_test  select imei,sndaid,data_time     from 
> woa_all_user_info_his    where pt = '2012-05-28';
> insert overwrite table lxw_test1  select imei,sndaid,data_time    from 
> woa_all_user_info_his    where pt = '2012-05-28';
> set hive.optimize.bucketmapjoin = true;
> set hive.optimize.bucketmapjoin.sortedmerge = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> select count(1) from lxw_test1 a  join lxw_test b  on a.imei = b.imei ;
> expect result：
> +------+
> |   _c1   |
> +------+
> |    6      |
> +------+
> but, actual result:
> +------+
> |    _c1  |
> +------+
> |     5    |
> +------+



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-29120) When hive.default.fileformat is set to TextFile or ORC, using MR and Spark engines to query bucketed tables may result in data loss issues.

Reply via email to