[
https://issues.apache.org/jira/browse/HIVE-29120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junkun Sheng updated HIVE-29120:
--------------------------------
Description:
When hive.default.fileformat is set to TextFile or ORC, using MR and Spark
engines to query bucketed tables may result in data loss issues.*
*The SQL that caused data loss is as follows:*
set hive.execution.engine=mr;
set hive.default.fileformat=TextFile;
drop table if exists lxw_test;
drop table if exists lxw_test1;
drop table if exists woa_all_user_info_his;
create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED
BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED
BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
create table woa_all_user_info_his(imei string,sndaid string,data_time
string,pt string) row format delimited fields terminated by ',' stored as
textfile;
insert into table woa_all_user_info_his
values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
insert into table woa_all_user_info_his
values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
insert overwrite table lxw_test select imei,sndaid,data_time from
woa_all_user_info_his where pt = '2012-05-28';
insert overwrite table lxw_test1 select imei,sndaid,data_time from
woa_all_user_info_his where pt = '2012-05-28';
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select count(1) from lxw_test1 a join lxw_test b on a.imei = b.imei ;
expect result:
+------+
| _c1 |
+------+
| 6 |
+------+
but, actual result:
+------+
| _c1 |
+------+
| 5 |
+------+
was:
??citation??*When hive.default.fileformat is set to TextFile or ORC, using MR
and Spark engines to query bucketed tables may result in data loss issues.*
*The SQL that caused data loss is as follows:*
set hive.execution.engine=mr;
set hive.default.fileformat=TextFile;
drop table if exists lxw_test;
drop table if exists lxw_test1;
drop table if exists woa_all_user_info_his;
create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED
BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED
BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
create table woa_all_user_info_his(imei string,sndaid string,data_time
string,pt string) row format delimited fields terminated by ',' stored as
textfile;
insert into table woa_all_user_info_his
values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
insert into table woa_all_user_info_his
values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
insert overwrite table lxw_test select imei,sndaid,data_time from
woa_all_user_info_his where pt = '2012-05-28';
insert overwrite table lxw_test1 select imei,sndaid,data_time from
woa_all_user_info_his where pt = '2012-05-28';
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select count(1) from lxw_test1 a join lxw_test b on a.imei = b.imei ;
expect result:
+------+
| _c1 |
+------+
| 6 |
+------+
but, actual result:
+------+
| _c1 |
+------+
| 5 |
+------+
> When hive.default.fileformat is set to TextFile or ORC, using MR and Spark
> engines to query bucketed tables may result in data loss issues.
> -------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29120
> URL: https://issues.apache.org/jira/browse/HIVE-29120
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 3.1.0
> Reporter: Junkun Sheng
> Priority: Major
>
> When hive.default.fileformat is set to TextFile or ORC, using MR and Spark
> engines to query bucketed tables may result in data loss issues.*
> *The SQL that caused data loss is as follows:*
> set hive.execution.engine=mr;
> set hive.default.fileformat=TextFile;
> drop table if exists lxw_test;
> drop table if exists lxw_test1;
> drop table if exists woa_all_user_info_his;
> create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED
> BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
> create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED
> BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
> create table woa_all_user_info_his(imei string,sndaid string,data_time
> string,pt string) row format delimited fields terminated by ',' stored as
> textfile;
> insert into table woa_all_user_info_his
> values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
> insert into table woa_all_user_info_his
> values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
> insert overwrite table lxw_test select imei,sndaid,data_time from
> woa_all_user_info_his where pt = '2012-05-28';
> insert overwrite table lxw_test1 select imei,sndaid,data_time from
> woa_all_user_info_his where pt = '2012-05-28';
> set hive.optimize.bucketmapjoin = true;
> set hive.optimize.bucketmapjoin.sortedmerge = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> select count(1) from lxw_test1 a join lxw_test b on a.imei = b.imei ;
> expect result:
> +------+
> | _c1 |
> +------+
> | 6 |
> +------+
> but, actual result:
> +------+
> | _c1 |
> +------+
> | 5 |
> +------+
--
This message was sent by Atlassian Jira
(v8.20.10#820010)