[ https://issues.apache.org/jira/browse/HIVE-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123117#comment-14123117 ]
Xuefu Zhang commented on HIVE-7956:
-----------------------------------

[~lirui] Thank you very much for the detailed analysis. Besides what Brock outlined above, I think the following is probably the easiest:

1. Use HiveKey as the key type in our code. This is also necessary for join, so we have to do it anyway.
2. Define the row stored in RowContainer as <byte[], BytesWritable> instead of either the current <BytesWritable, BytesWritable> or <HiveKey, BytesWritable>. We get the bytes for HiveKey using Kryo. We can do this when we copy the <key, value> pair in HiveBaseFunctionResultList.collect(). That is, we copy the byte array representing HiveKey rather than copying it as a BytesWritable (see the sketch below).

Please let me know what you think.

> When inserting into a bucketed table, all data goes to a single bucket [Spark Branch]
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-7956
>                 URL: https://issues.apache.org/jira/browse/HIVE-7956
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>
> I created a bucketed table:
> {code}
> create table testBucket(x int, y string) clustered by(x) into 10 buckets;
> {code}
> Then I run a query like:
> {code}
> set hive.enforce.bucketing = true;
> insert overwrite table testBucket select intCol, stringCol from src;
> {code}
> Here {{src}} is a simple textfile-based table containing 40,000,000 records
> (not bucketed). The query launches 10 reduce tasks, but all the data goes to
> only one of them.
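To make step 2 of the proposal concrete, here is a minimal sketch of the Kryo round trip for the key. The Kryo and HiveKey classes are real; the {{HiveKeyBytes}} helper and its method names are hypothetical, and the sketch assumes Kryo's default field-based serialization can handle HiveKey (a BytesWritable subclass with a no-arg constructor) without explicit registration:

{code}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.hadoop.hive.ql.io.HiveKey;

// Hypothetical helper illustrating the copy proposed for
// HiveBaseFunctionResultList.collect(): store the key in the
// RowContainer as a Kryo-serialized byte[] instead of a BytesWritable.
public class HiveKeyBytes {

  // Kryo instances are not thread-safe; real code would keep one per thread.
  private static final Kryo KRYO = new Kryo();

  /** Serialize the HiveKey into a standalone byte[]. */
  public static byte[] toBytes(HiveKey key) {
    Output out = new Output(64, -1);  // small initial buffer, unbounded growth
    KRYO.writeObject(out, key);       // assumes default serialization copies
    out.close();                      // all of HiveKey's fields
    return out.toBytes();
  }

  /** Restore a HiveKey from the copied bytes. */
  public static HiveKey fromBytes(byte[] bytes) {
    return KRYO.readObject(new Input(bytes), HiveKey.class);
  }
}
{code}

The point of round-tripping through Kryo rather than copying into a plain BytesWritable is that it preserves everything HiveKey carries beyond the raw key bytes, in particular the hash code that decides which bucket (reducer) a row goes to.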