[
https://issues.apache.org/jira/browse/HADOOP-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645318#action_12645318
]
Zheng Shao commented on HADOOP-4591:
------------------------------------
Instead of specifying which column will be "special" in the create table
statement, we can also specify that at insertion time:
CREATE TABLE tname (cname1 INT)
COMMENT 'This is a table'
PARTITIONED BY(dt STRING, pcol INT)
STORED AS SEQUENCEFILE;
INSERT OVERWRITE tname PARTITION (dt = '2008-11-04') (pcol, cname1)
SELECT another.a, another.b FROM another;
This conforms to the standard SQL Syntax:
http://dev.mysql.com/doc/refman/5.0/en/insert.html
The implementation would involves several steps:
1. Change the grammar to support specifying columns in the insert clause (the
default will be all columns except partition columns), and change the plan
generation to support that;
2. Allow specifying partition columns in the insert clause, change the
execution code to create a new file on each new partition key.
3. Optimization: The bad thing about 2 is that if we have 100 reducers, then
each of the partition will have 100 files. We could first generate the result
of the select, and then do a count to see the number of rows in each partition,
and then decide whether we should do another map-reduce job to gather all rows
in the same partition together or not.
Overall this will be a very useful feature.
> Tag columns as partitioning columns
> -----------------------------------
>
> Key: HADOOP-4591
> URL: https://issues.apache.org/jira/browse/HADOOP-4591
> Project: Hadoop Core
> Issue Type: Wish
> Components: contrib/hive
> Reporter: Venky Iyer
>
> CREATE TABLE tname (INT cname1, INT pcol PARTITIONING )
> COMMENT 'This is a table'
> PARTITIONED BY(dt STRING)
> STORED AS SEQUENCEFILE;
> The goal here is to annotate a column as being a "partitioning" column.
> Consider pcol in the above example. It is annotated with 'PARTITIONING',
> which implies that the create table
> has
> PARTITIONED BY (dt, pcol)
> and every write to this table has implicitly
> INSERT OVERWRITE tname PARTITION (pcol='X')
> WHERE output.pcol = 'X'
> for every distinct value X that pcol takes.
> This is ideally an addition on top of the explicit partitioning that is
> already in the syntax, so that if I said
> INSERT OVERWRITE tname PARTITION (dt='D')
> it would still go into the partition (dt='D", pcol='Y') when the value of
> pcol is Y.
> It would be up to the user to make sure the cardinality of these columns is
> reasonable, and that enough data goes into each partition that there is some
> net benefit (just as it is in the explicit case).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.