Navis created HIVE-3244:
---------------------------

             Summary: Add table property which constrains sorting/bucketing 
for data loading
                 Key: HIVE-3244
                 URL: https://issues.apache.org/jira/browse/HIVE-3244
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor
    Affects Versions: 0.10.0
         Environment: ubuntu 10.10
            Reporter: Navis
            Assignee: Navis
            Priority: Minor


This ticket is intended to implement "INSERT INTO" for bucketed tables.

With the hive.enforce.bucketing option, a user can append data to a bucketed 
table. But the current implementation relies on the lexical order of file names 
to determine each file's bucket number, and that order does not always match 
the actual bucket.
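
To illustrate the problem, here is a minimal Java sketch (not Hive code) that assigns bucket numbers purely by position in the lexically sorted file listing. The class and method names are hypothetical, and the file names are assumed to be what a table directory would contain after a second insert, taken from the second plan below with the trailing bucket suffix removed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical demo class; it only mimics "bucket number = position in sorted listing".
class LexicalBucketOrderDemo {

    // Returns file name -> bucket number, where the bucket number is simply
    // the file's index in the lexically sorted list of names.
    static Map<String, Integer> bucketsByPosition(List<String> fileNames) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = 0;
        for (String name : new TreeSet<>(fileNames)) {
            result.put(name, position++);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed directory contents after appending once to a 2-bucket table.
        List<String> files = Arrays.asList(
                "000000_0", "000001_0", "000000_0_copy_1", "000001_0_copy_1");
        bucketsByPosition(files).forEach(
                (name, bucket) -> System.out.println(name + " -> " + bucket));
    }
}
```

In this sketch "000000_0_copy_1" sorts directly after "000000_0" and so lands at position 1, although it holds rows of bucket 0, and "000001_0" is pushed to position 2.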

So if each file name is suffixed with its bucket number when the file is 
inserted (moved), the bucket number can be read back correctly when it is 
needed, such as in BucketMapJoinOptimizer.
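
As a rough sketch of how such a suffix could be read back, the Java snippet below extracts the bucket number from a name ending in "_[n]". That suffix format is an assumption based on the names printed in the plans below; this message does not confirm the exact on-disk naming used by the prototype, and the class and method names here are hypothetical rather than part of Hive.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the prototype's actual implementation.
class BucketSuffixParser {

    // Assumed convention: the file name ends with "_[<bucket number>]".
    private static final Pattern SUFFIX = Pattern.compile("_\\[(\\d+)\\]$");

    // Returns the bucket number encoded in the suffix, or empty if the
    // name carries no such suffix (or the number does not fit in an int).
    static OptionalInt bucketNumber(String fileName) {
        Matcher m = SUFFIX.matcher(fileName);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(bucketNumber("000000_0_copy_1_[0]"));
        System.out.println(bucketNumber("000003_0_[3]"));
        System.out.println(bucketNumber("000000_0"));
    }
}
```

Returning an empty value for unsuffixed names leaves room for a caller to reject such files or fall back to another rule, which is also the kind of name validation a 'LOAD DATA' path would need.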

With simple prototype code, which will be attached after writing this, the 
test query
{noformat}
create table bucket_test (key int, value string)
clustered by (key) sorted by (key) into 4 buckets
TBLPROPERTIES ('FORCEDBUCKETING'='TRUE', 'FORCEDSORTING'='TRUE');

set hive.optimize.bucketmapjoin = true;

insert into table bucket_test select key, value from src1;
explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b 
on a.key=b.key;

insert into table bucket_test select key, value from src1;
explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b 
on a.key=b.key;
{noformat}

produced the plans below
{noformat}
1. first plan
 b {000000_0_[0]=[000000_0_[0]], 000001_0_[1]=[000001_0_[1]], 
000002_0_[2]=[000002_0_[2]], 000003_0_[3]=[000003_0_[3]]}

2. second plan
 b {000000_0_[0]=[000000_0_[0], 000000_0_copy_1_[0]], 
000000_0_copy_1_[0]=[000000_0_[0], 000000_0_copy_1_[0]], 
000001_0_[1]=[000001_0_[1], 000001_0_copy_1_[1]], 
000001_0_copy_1_[1]=[000001_0_[1], 000001_0_copy_1_[1]], 
000002_0_[2]=[000002_0_[2], 000002_0_copy_1_[2]], 
000002_0_copy_1_[2]=[000002_0_[2], 000002_0_copy_1_[2]], 
000003_0_[3]=[000003_0_[3], 000003_0_copy_1_[3]], 
000003_0_copy_1_[3]=[000003_0_[3], 000003_0_copy_1_[3]]}
{noformat}

Currently, I've prevented direct loading via 'LOAD DATA' into forced-bucketing 
tables. But with proper file name validation, that could be allowed.
