Navis created HIVE-3244:
---------------------------
Summary: Add table property which constrains sorting/bucketing for data loading
Key: HIVE-3244
URL: https://issues.apache.org/jira/browse/HIVE-3244
Project: Hive
Issue Type: Improvement
Components: Query Processor
Affects Versions: 0.10.0
Environment: ubuntu 10.10
Reporter: Navis
Assignee: Navis
Priority: Minor
This ticket is intended to implement "INSERT INTO" for bucketed tables.
With the hive.enforce.bucketing option, a user can append data to a bucketed
table. But the current implementation relies on the lexical order of file names
to determine the bucket number of each file, and that order does not always
match the bucket numbers (for example, once an append adds files such as
000000_0_copy_1).
So if each file name is suffixed with its bucket number when the file is
inserted (moved), the bucket number can be recovered correctly wherever it is
needed, such as in BucketMapJoinOptimizer.
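A minimal sketch of how a bucket number could be read back from such a suffix. It assumes the "_[N]" suffix form that appears in the plans in this ticket. The class and method names are hypothetical and are not taken from the prototype patch or from Hive.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: reads the bucket number from a
// file name ending in "_[N]" (suffix form assumed from the plans in this ticket).
final class BucketSuffix {
    private static final Pattern SUFFIX = Pattern.compile("_\\[(\\d+)\\]$");

    // Returns the bucket number, or empty if the name carries no usable suffix.
    static OptionalInt bucketOf(String fileName) {
        Matcher m = SUFFIX.matcher(fileName);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // Digits that do not fit in an int are treated as "no bucket".
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        String[] names = {"000000_0_[0]", "000003_0_copy_1_[3]", "000002_0"};
        for (String name : names) {
            System.out.println(name + " -> " + bucketOf(name));
        }
    }
}
```

Because the bucket number comes from the suffix and not from the position of the file in a sorted listing, an appended "copy" file still maps to the same bucket as the original file.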
With simple prototype code, which will be attached after this ticket is filed,
the test query
{noformat}
create table bucket_test (key int, value string)
  clustered by (key) sorted by (key) into 4 buckets
  TBLPROPERTIES ('FORCEDBUCKETING'='TRUE', 'FORCEDSORTING'='TRUE');
set hive.optimize.bucketmapjoin = true;

-- first insert, then the first plan
insert into table bucket_test select key, value from src1;
explain extended
  select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;

-- second insert (append), then the second plan
insert into table bucket_test select key, value from src1;
explain extended
  select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;
{noformat}
produced the bucket mappings below
{noformat}
1. first plan
b {000000_0_[0]=[000000_0_[0]], 000001_0_[1]=[000001_0_[1]],
000002_0_[2]=[000002_0_[2]], 000003_0_[3]=[000003_0_[3]]}
2. second plan
b {000000_0_[0]=[000000_0_[0], 000000_0_copy_1_[0]],
000000_0_copy_1_[0]=[000000_0_[0], 000000_0_copy_1_[0]],
000001_0_[1]=[000001_0_[1], 000001_0_copy_1_[1]],
000001_0_copy_1_[1]=[000001_0_[1], 000001_0_copy_1_[1]],
000002_0_[2]=[000002_0_[2], 000002_0_copy_1_[2]],
000002_0_copy_1_[2]=[000002_0_[2], 000002_0_copy_1_[2]],
000003_0_[3]=[000003_0_[3], 000003_0_copy_1_[3]],
000003_0_copy_1_[3]=[000003_0_[3], 000003_0_copy_1_[3]]}
{noformat}
Currently, I've disabled direct loading via 'LOAD DATA' into tables with forced
bucketing. But with proper file name validation, that could be allowed.
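One possible shape for such a name validation, as a sketch only. It assumes that "valid" means every file to be loaded ends in a "_[N]" suffix with N smaller than the number of buckets of the table. This rule, the class name and the method name are all assumptions and are not taken from the prototype.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check, not part of Hive: accepts a set of files for LOAD DATA
// only if each name ends in "_[N]" with 0 <= N < numBuckets (assumed rule).
final class LoadNameCheck {
    private static final Pattern SUFFIX = Pattern.compile("_\\[(\\d+)\\]$");

    static boolean allNamesValid(List<String> fileNames, int numBuckets) {
        for (String name : fileNames) {
            Matcher m = SUFFIX.matcher(name);
            if (!m.find()) {
                return false; // no bucket suffix at all
            }
            int bucket;
            try {
                bucket = Integer.parseInt(m.group(1));
            } catch (NumberFormatException e) {
                return false; // suffix digits do not fit in an int
            }
            if (bucket >= numBuckets) {
                return false; // bucket number outside the table's bucket count
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("000000_0_[0]", "000001_0_copy_1_[1]");
        System.out.println(allNamesValid(files, 4));
    }
}
```

A stricter variant could also require that the files were actually written with the table's bucketing and sort order, which a name check alone cannot confirm.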