[
https://issues.apache.org/jira/browse/DRILL-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595278#comment-14595278
]
Aman Sinha commented on DRILL-3246:
-----------------------------------
[~jni] Partitioning for CTAS is supported in some MPP systems, although the
syntax is not exactly the same, see:
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_AS.html (this has
a 'distkey' clause instead of 'partition by').
Since 'partition by' is more closely associated with a DDL statement, it would
seem natural to have it after the CREATE TABLE. I do understand the motivation
for doing it after the SELECT for cases where the columns are not explicitly
provided in the CREATE TABLE or with the '*' column in the SELECT list.
However, we should think about potential implications when extending the SELECT
syntax. I think we would have to test it more. Could we not require the
CTAS to always specify the columns in CREATE TABLE ? This will ensure that
aliases, arbitrary expressions in the SELECT list will continue to work.
> Query planning support for partition by clause in Drill's CTAS statement
> ------------------------------------------------------------------------
>
> Key: DRILL-3246
> URL: https://issues.apache.org/jira/browse/DRILL-3246
> Project: Apache Drill
> Issue Type: New Feature
> Components: Query Planning & Optimization
> Affects Versions: 1.0.0
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
> Fix For: 1.1.0
>
>
> We are going to add "PARTITION BY" clause in Drill's CTAS statement. The
> "PARTITION BY" clause will specify the list of columns out of the result
> table's column list that will be used to partition the data.
> CREATE TABLE table_name [ (col_name, .... ) ]
> [PARTITION BY (col_name, ...)]
> AS SELECT_STATEMENT;
> Semantics restriction for the PARTITION BY clause:
> - All the columns in the PARTITION BY clause have to be in the table's
> column list, or the SELECT_STATEMENT has a * column, when the base table in
> the SELECT_STATEMENT is schema-less. Otherwise, an query validation error
> would be raised.
> - When the partition column is resolved to * column in a schema-less query,
> this * column could not be a result of join operation. This restriction is
> added, since for * out of join operation, query planner would not know which
> table might produce this partition column.
> Example :
> {code}
> create table mytable1 partition by (r_regionkey) as
> select r_regionkey, r_name from cp.`tpch/region.parquet`
> {code}
> {code}
> create table mytable2 partition by (r_regionkey) as
> select * from cp.`tpch/region.parquet`
> {code}
> {code}
> create table mytable3 partition by (r_regionkey) as
> select r.r_regionkey, r.r_name, n.n_nationkey, n.n_name
> from cp.`tpch/nation.parquet` n, cp.`tpch/region.parquet` r
> where n.n_regionkey = r.r_regionkey
> {code}
> Invalid case 1: Partition column is not in table's column list.
> {code}
> create table mytable4 partition by (r_regionkey2) as
> select r_regionkey, r_name from cp.`tpch/region.parquet`
> {code}
> Invalid case 2: Partition column is resolved to * out of a join operator.
> {code}
> create table mytable5 partition by (r_regionkey) as
> select *
> from cp.`tpch/nation.parquet` n, cp.`tpch/region.parquet` r
> where n.n_regionkey = r.r_regionkey
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)