[
https://issues.apache.org/jira/browse/HIVE-29120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junkun Sheng updated HIVE-29120:
--------------------------------
Description:
*When hive.default.fileformat is set to TextFile or ORC, using MR and Spark
engines to query bucketed tables may result in data loss issues.*
*The SQL that caused data loss is as follows:*
set hive.execution.engine=mr;
set hive.default.fileformat=TextFile;
drop table if exists lxw_test;
drop table if exists lxw_test1;
drop table if exists woa_all_user_info_his;
create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED
BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED
BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
create table woa_all_user_info_his(imei string,sndaid string,data_time
string,pt string) row format delimited fields terminated by ',' stored as
textfile;
insert into table woa_all_user_info_his
values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
insert into table woa_all_user_info_his
values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
insert overwrite table lxw_test select imei,sndaid,data_time from
woa_all_user_info_his where pt = '2012-05-28';
insert overwrite table lxw_test1 select imei,sndaid,data_time from
woa_all_user_info_his where pt = '2012-05-28';
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select count(1) from lxw_test1 a join lxw_test b on a.imei = b.imei ;
expect result:
+------+
| _c1 |
+------+
| 6 |
+------+
but, actual result:
+------+
| _c1 |
+------+
| 5 |
+------+
was:
When hive.default.fileformat is set to TextFile or ORC, using MR and Spark
engines to query bucketed tables may result in data loss issues.
The SQL that caused data loss is as follows:
set hive.execution.engine=mr;
set hive.default.fileformat=TextFile;
drop table if exists lxw_test;
drop table if exists lxw_test1;
drop table if exists woa_all_user_info_his;
create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED
BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED
BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
create table woa_all_user_info_his(imei string,sndaid string,data_time
string,pt string) row format delimited fields terminated by ',' stored as
textfile;
insert into table woa_all_user_info_his
values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
insert into table woa_all_user_info_his
values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
insert overwrite table lxw_test select imei,sndaid,data_time from
woa_all_user_info_his where pt = '2012-05-28';
insert overwrite table lxw_test1 select imei,sndaid,data_time from
woa_all_user_info_his where pt = '2012-05-28';
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select count(1) from lxw_test1 a join lxw_test b on a.imei = b.imei ;
expect result:
+------+
| _c1 |
+------+
| 6 |
+------+
but, actual result:
+------+
| _c1 |
+------+
| 5 |
+------+
> When hive.default.fileformat is set to TextFile or ORC, using MR and Spark
> engines to query bucketed tables may result in data loss issues.
> -------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29120
> URL: https://issues.apache.org/jira/browse/HIVE-29120
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 3.1.0
> Reporter: Junkun Sheng
> Priority: Major
>
> *When hive.default.fileformat is set to TextFile or ORC, using MR and Spark
> engines to query bucketed tables may result in data loss issues.*
> *The SQL that caused data loss is as follows:*
> set hive.execution.engine=mr;
> set hive.default.fileformat=TextFile;
> drop table if exists lxw_test;
> drop table if exists lxw_test1;
> drop table if exists woa_all_user_info_his;
> create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED
> BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
> create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED
> BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
> create table woa_all_user_info_his(imei string,sndaid string,data_time
> string,pt string) row format delimited fields terminated by ',' stored as
> textfile;
> insert into table woa_all_user_info_his
> values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
> insert into table woa_all_user_info_his
> values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
> insert overwrite table lxw_test select imei,sndaid,data_time from
> woa_all_user_info_his where pt = '2012-05-28';
> insert overwrite table lxw_test1 select imei,sndaid,data_time from
> woa_all_user_info_his where pt = '2012-05-28';
> set hive.optimize.bucketmapjoin = true;
> set hive.optimize.bucketmapjoin.sortedmerge = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> select count(1) from lxw_test1 a join lxw_test b on a.imei = b.imei ;
> expect result:
> +------+
> | _c1 |
> +------+
> | 6 |
> +------+
> but, actual result:
> +------+
> | _c1 |
> +------+
> | 5 |
> +------+
--
This message was sent by Atlassian Jira
(v8.20.10#820010)