[ 
https://issues.apache.org/jira/browse/HIVE-29120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junkun Sheng updated HIVE-29120:
--------------------------------
    Description: 
When hive.default.fileformat is set to TextFile or ORC, using MR and Spark 
engines to query bucketed tables may result in data loss issues.*

*The SQL that caused data loss is as follows:*
set hive.execution.engine=mr;
set hive.default.fileformat=TextFile;
drop table if exists lxw_test;
drop table if exists lxw_test1;
drop table if exists woa_all_user_info_his;
create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED 
BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED 
BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
create table woa_all_user_info_his(imei string,sndaid string,data_time 
string,pt string) row format delimited fields terminated by ',' stored as 
textfile;
insert into table woa_all_user_info_his 
values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
insert into table woa_all_user_info_his 
values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
insert overwrite table lxw_test  select imei,sndaid,data_time     from 
woa_all_user_info_his    where pt = '2012-05-28';
insert overwrite table lxw_test1  select imei,sndaid,data_time    from 
woa_all_user_info_his    where pt = '2012-05-28';
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select count(1) from lxw_test1 a  join lxw_test b  on a.imei = b.imei ;

expect result:
+------+
|   _c1   |
+------+
|    6      |
+------+

but, actual result:
+------+
|    _c1  |
+------+
|     5    |
+------+

  was:
??citation??*When hive.default.fileformat is set to TextFile or ORC, using MR 
and Spark engines to query bucketed tables may result in data loss issues.*

*The SQL that caused data loss is as follows:*
set hive.execution.engine=mr;
set hive.default.fileformat=TextFile;
drop table if exists lxw_test;
drop table if exists lxw_test1;
drop table if exists woa_all_user_info_his;
create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED 
BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED 
BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
create table woa_all_user_info_his(imei string,sndaid string,data_time 
string,pt string) row format delimited fields terminated by ',' stored as 
textfile;
insert into table woa_all_user_info_his 
values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
insert into table woa_all_user_info_his 
values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
insert overwrite table lxw_test  select imei,sndaid,data_time     from 
woa_all_user_info_his    where pt = '2012-05-28';
insert overwrite table lxw_test1  select imei,sndaid,data_time    from 
woa_all_user_info_his    where pt = '2012-05-28';
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select count(1) from lxw_test1 a  join lxw_test b  on a.imei = b.imei ;

expect result:
+------+
|   _c1   |
+------+
|    6      |
+------+

but, actual result:
+------+
|    _c1  |
+------+
|     5    |
+------+


> When hive.default.fileformat is set to TextFile or ORC, using MR and Spark 
> engines to query bucketed tables may result in data loss issues.
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29120
>                 URL: https://issues.apache.org/jira/browse/HIVE-29120
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 3.1.0
>            Reporter: Junkun Sheng
>            Priority: Major
>
> When hive.default.fileformat is set to TextFile or ORC, using MR and Spark 
> engines to query bucketed tables may result in data loss issues.*
> *The SQL that caused data loss is as follows:*
> set hive.execution.engine=mr;
> set hive.default.fileformat=TextFile;
> drop table if exists lxw_test;
> drop table if exists lxw_test1;
> drop table if exists woa_all_user_info_his;
> create table lxw_test(imei string,sndaid string,data_time string) CLUSTERED 
> BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
> create table lxw_test1(imei string,sndaid string,data_time string) CLUSTERED 
> BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
> create table woa_all_user_info_his(imei string,sndaid string,data_time 
> string,pt string) row format delimited fields terminated by ',' stored as 
> textfile;
> insert into table woa_all_user_info_his 
> values('wo','ni','ta','2012-05-28'),('wom','nim','tam','2012-05-28');
> insert into table woa_all_user_info_his 
> values('wto','nti','tat','2012-05-28'),('wom','nim','tam','2012-05-28');
> insert overwrite table lxw_test  select imei,sndaid,data_time     from 
> woa_all_user_info_his    where pt = '2012-05-28';
> insert overwrite table lxw_test1  select imei,sndaid,data_time    from 
> woa_all_user_info_his    where pt = '2012-05-28';
> set hive.optimize.bucketmapjoin = true;
> set hive.optimize.bucketmapjoin.sortedmerge = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> select count(1) from lxw_test1 a  join lxw_test b  on a.imei = b.imei ;
> expect result:
> +------+
> |   _c1   |
> +------+
> |    6      |
> +------+
> but, actual result:
> +------+
> |    _c1  |
> +------+
> |     5    |
> +------+



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to