[ 
https://issues.apache.org/jira/browse/HADOOP-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645318#action_12645318
 ] 

Zheng Shao commented on HADOOP-4591:
------------------------------------

Instead of marking a column as "special" in the create table statement, we 
could specify the extra partition column at insertion time:

CREATE TABLE tname (cname1 INT)
COMMENT 'This is a table'
PARTITIONED BY(dt STRING, pcol INT)
STORED AS SEQUENCEFILE; 

INSERT OVERWRITE tname PARTITION (dt = '2008-11-04') (pcol, cname1)
SELECT another.a, another.b FROM another;

This follows the standard SQL INSERT syntax with an explicit column list: 
http://dev.mysql.com/doc/refman/5.0/en/insert.html


The implementation would involve several steps:
1. Change the grammar to support specifying columns in the insert clause (the 
default will be all columns except partition columns), and change the plan 
generation to support that;
2. Allow specifying partition columns in the insert clause, and change the 
execution code to create a new file for each new partition key;
3. Optimization: The drawback of step 2 is that if we have 100 reducers, then 
each partition can end up with 100 files. We could first generate the result 
of the select, then count the number of rows in each partition, and then 
decide whether to run another map-reduce job to gather all rows of the same 
partition together.
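
To make steps 2 and 3 concrete, here is a rough sketch in Python (not Hive 
code). It only illustrates the idea: route each row to a partition built from 
the static PARTITION clause plus the partition column values found in the row, 
and then use per-reducer row counts to decide which partitions might deserve 
a follow-up merge job. The function names, the data layout, and the 
min_rows_per_file threshold are all assumptions made for illustration; the 
actual decision rule is still open.

```python
from collections import Counter, defaultdict


def route_rows(rows, static_partition, dynamic_cols):
    """Step 2 sketch: group rows by their full partition spec.

    rows             -- list of dicts, one per row of the SELECT result
    static_partition -- dict from the PARTITION clause, e.g. {"dt": "2008-11-04"}
    dynamic_cols     -- partition columns whose values come from the row itself
    """
    partitions = defaultdict(list)
    for row in rows:
        spec = dict(static_partition)
        for col in dynamic_cols:
            spec[col] = row[col]
        # A sorted tuple of (column, value) pairs is a hashable partition key.
        key = tuple(sorted(spec.items()))
        # Partition column values live in the partition spec, not in the data.
        data = {k: v for k, v in row.items() if k not in dynamic_cols}
        partitions[key].append(data)
    return partitions


def partitions_to_merge(reducer_outputs, min_rows_per_file):
    """Step 3 sketch: pick partitions that look fragmented.

    reducer_outputs   -- one dict per reducer: partition key -> rows written
    min_rows_per_file -- hypothetical threshold; below it a merge job is suggested
    """
    rows = Counter()
    files = Counter()
    for output in reducer_outputs:
        for key, count in output.items():
            if count > 0:
                rows[key] += count
                files[key] += 1  # each reducer writes one file per partition
    return sorted(
        key for key in rows
        if files[key] > 1 and rows[key] / files[key] < min_rows_per_file
    )
```

With 100 reducers, a partition holding only a few hundred rows would show up 
in the result of partitions_to_merge, while a partition with millions of rows 
would be left alone.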


Overall this will be a very useful feature.


> Tag columns as partitioning columns
> -----------------------------------
>
>                 Key: HADOOP-4591
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4591
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: contrib/hive
>            Reporter: Venky Iyer
>
>     CREATE TABLE tname (INT cname1, INT pcol PARTITIONING )
>     COMMENT 'This is a table' 
>     PARTITIONED BY(dt STRING) 
>     STORED AS SEQUENCEFILE; 
> The goal here is to annotate a column as being a "partitioning" column. 
> Consider pcol in the above example. It is annotated with 'PARTITIONING', 
> which implies that the create table
> has 
> PARTITIONED BY (dt, pcol)
> and every write to this table has implicitly
> INSERT OVERWRITE tname PARTITION (pcol='X')
> WHERE output.pcol = 'X'
> for every distinct value X that pcol takes.
> This is ideally an addition on top of the explicit partitioning that is 
> already in the syntax, so that if I said
> INSERT OVERWRITE tname PARTITION (dt='D')
> it would still go into the partition (dt='D', pcol='Y') when the value of 
> pcol is Y.
> It would be up to the user to make sure the cardinality of these columns is 
> reasonable, and that enough data goes into each partition that there is some 
> net benefit (just as it is in the explicit case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
