[ https://issues.apache.org/jira/browse/HIVE-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Navis updated HIVE-3244:
------------------------
    Status: Patch Available  (was: Open)

https://reviews.facebook.net/D4053

> Add table property which constraints sorting/bucketing for data loading
> -----------------------------------------------------------------------
>
>                 Key: HIVE-3244
>                 URL: https://issues.apache.org/jira/browse/HIVE-3244
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.10.0
>         Environment: ubuntu 10.10
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>
> This ticket is intended to implement "INSERT INTO" for bucketed tables.
> With the hive.enforce.bucketing option, a user can append data to a bucketed
> table. But the current implementation relies on the lexical order of file
> names to determine the bucket number of each file, and that assumption does
> not always hold.
> If the file name is suffixed with the bucket number when inserting (moving),
> the bucket number can be recovered correctly when it is needed, such as in
> BucketMapJoinOptimizer.
> With simple prototype code, which will be attached after writing this, the
> test query
> {noformat}
> create table bucket_test (key int, value string)
>   clustered by (key) sorted by (key) into 4 buckets
>   TBLPROPERTIES ('FORCEDBUCKETING'='TRUE', 'FORCEDSORTING'='TRUE');
> set hive.optimize.bucketmapjoin = true;
> insert into table bucket_test select key, value from src1;
> explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;
> insert into table bucket_test select key, value from src1;
> explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;
> {noformat}
> produced the results below
> {noformat}
> 1. first plan
> b {000000_0_[0]=[000000_0_[0]],
>    000001_0_[1]=[000001_0_[1]],
>    000002_0_[2]=[000002_0_[2]],
>    000003_0_[3]=[000003_0_[3]]}
> 2. second plan
> b {000000_0_[0]=[000000_0_[0], 000000_0_copy_1_[0]],
>    000000_0_copy_1_[0]=[000000_0_[0], 000000_0_copy_1_[0]],
>    000001_0_[1]=[000001_0_[1], 000001_0_copy_1_[1]],
>    000001_0_copy_1_[1]=[000001_0_[1], 000001_0_copy_1_[1]],
>    000002_0_[2]=[000002_0_[2], 000002_0_copy_1_[2]],
>    000002_0_copy_1_[2]=[000002_0_[2], 000002_0_copy_1_[2]],
>    000003_0_[3]=[000003_0_[3], 000003_0_copy_1_[3]],
>    000003_0_copy_1_[3]=[000003_0_[3], 000003_0_copy_1_[3]]}
> {noformat}
> Currently, I've prevented direct loading via 'LOAD DATA' for forced-bucketing
> tables. But with proper name validation, that could be allowed.
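
The file-name suffix idea described in the ticket can be illustrated outside Hive. The Python sketch below is not the prototype patch. It assumes that the trailing `_[N]` shown in the plan output is the bucket-number suffix, and the helper names (`bucket_from_name`, `bucket_by_position`, `group_by_bucket`) are hypothetical. It contrasts a position-based guess from lexical order with reading the bucket number from the name.

```python
import re
from collections import defaultdict

# Assumption: the bucket number is the trailing "_[N]" seen in the plan output.
BUCKET_SUFFIX = re.compile(r"_\[(\d+)\]$")


def bucket_from_name(file_name):
    """Return the bucket number encoded in the suffix, or None if there is none."""
    match = BUCKET_SUFFIX.search(file_name)
    return int(match.group(1)) if match else None


def bucket_by_position(file_names):
    """Position-based guess: the i-th file in lexical order is taken as bucket i."""
    return {name: index for index, name in enumerate(sorted(file_names))}


def group_by_bucket(file_names):
    """Group files by the bucket number read from each name's suffix."""
    groups = defaultdict(list)
    for name in sorted(file_names):
        bucket = bucket_from_name(name)
        if bucket is None:
            raise ValueError("no bucket suffix in file name: " + name)
        groups[bucket].append(name)
    return dict(groups)


# File names taken from the second plan in the ticket (after the second insert).
files = [
    "000000_0_[0]", "000000_0_copy_1_[0]",
    "000001_0_[1]", "000001_0_copy_1_[1]",
    "000002_0_[2]", "000002_0_copy_1_[2]",
    "000003_0_[3]", "000003_0_copy_1_[3]",
]

by_position = bucket_by_position(files)
by_suffix = group_by_bucket(files)

# Files whose position in lexical order disagrees with the suffix.
mismatched = [n for n in files if by_position[n] != bucket_from_name(n)]
```

Under these assumptions, the appended `_copy_1` files shift the lexical positions, so the position-based guess disagrees with the suffix for most files, while grouping by suffix pairs each original file with its copy, as in the second plan.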