Repository: carbondata Updated Branches: refs/heads/master 4bd11072b -> 2c88bffdc
[CARBONDATA-1325] Add partition guidance doc This closes #1258 Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/2c88bffd Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/2c88bffd Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/2c88bffd Branch: refs/heads/master Commit: 2c88bffdc816e3e45e02dd1c655348ab96dfe1ac Parents: 4bd1107 Author: lionelcao <[email protected]> Authored: Tue Aug 15 18:26:42 2017 +0800 Committer: chenliang613 <[email protected]> Committed: Sat Aug 26 23:34:48 2017 +0800 ---------------------------------------------------------------------- README.md | 1 + docs/partition-guide.md | 161 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 162 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/carbondata/blob/2c88bffd/README.md ---------------------------------------------------------------------- diff --git a/README.md b/README.md index 3226853..7033238 100644 --- a/README.md +++ b/README.md @@ -48,6 +48,7 @@ CarbonData is built using Apache Maven, to [build CarbonData](https://github.com * [Data Management](https://github.com/apache/carbondata/blob/master/docs/data-management.md) * [DDL Operations on CarbonData](https://github.com/apache/carbondata/blob/master/docs/ddl-operation-on-carbondata.md) * [DML Operations on CarbonData](https://github.com/apache/carbondata/blob/master/docs/dml-operation-on-carbondata.md) +* [Partition Table](https://https://github.com/apache/carbondata/blob/master/docs/partition-guide.md) * [Cluster Installation and Deployment](https://github.com/apache/carbondata/blob/master/docs/installation-guide.md) * [Configuring Carbondata](https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md) * [FAQ](https://github.com/apache/carbondata/blob/master/docs/faq.md) http://git-wip-us.apache.org/repos/asf/carbondata/blob/2c88bffd/docs/partition-guide.md ---------------------------------------------------------------------- diff --git a/docs/partition-guide.md b/docs/partition-guide.md new file mode 100644 index 0000000..2a0df76 --- /dev/null +++ b/docs/partition-guide.md @@ -0,0 +1,161 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +### CarbonData Partition Table Guidance +This guidance illustrates how to create & use partition table in CarbonData. + +* [Create Partition Table](#create-partition-table) + - [Create Hash Partition Table](#create-hash-partition-table) + - [Create Range Partition Table](#create-range-partition-table) + - [Create List Partition Table](#create-list-partition-table) +* [Show Partitions](#show-partitions) +* [Maintain the Partitions](#maintain-the-partitions) +* [Partition Id](#partition-id) +* [Tips](#tips) + +### Create Partition Table + +##### Create Hash Partition Table +``` + CREATE TABLE [IF NOT EXISTS] [db_name.]table_name + [(col_name data_type , ...)] + PARTITIONED BY (partition_col_name data_type) + STORED BY 'carbondata' + [TBLPROPERTIES ('PARTITION_TYPE'='HASH', + 'PARTITION_NUM'='N' ...)] + //N is the number of hash partitions +``` + +Example: +``` + create table if not exists hash_partition_table( + col_A String, + col_B Int, + col_C Long, + col_D Decimal(10,2), + col_F Timestamp + ) partitioned by (col_E Long) + stored by 'carbondata' + tblproperties('partition_type'='Hash','partition_num'='9') +``` + +##### Create Range Partition Table +``` + CREATE TABLE [IF NOT EXISTS] [db_name.]table_name + [(col_name data_type , ...)] + PARTITIONED BY (partition_col_name data_type) + STORED BY 'carbondata' + [TBLPROPERTIES ('PARTITION_TYPE'='RANGE', + 'RANGE_INFO'='2014-01-01, 2015-01-01, 2016-01-01' ...)] +``` +Notes: +1. The 'RANGE_INFO' defined in table properties must be in ascending order. +2. If the partition column is Date/Timestamp type, the format could be defined in CarbonProperties. By default it's yyyy-MM-dd. + +Example: +``` + create table if not exists hash_partition_table( + col_A String, + col_B Int, + col_C Long, + col_D Decimal(10,2), + col_E Long + ) partitioned by (col_F Timestamp) + stored by 'carbondata' + tblproperties('partition_type'='Range', + 'range_info'='2015-01-01, 2016-01-01, 2017-01-01, 2017-02-01') +``` + +##### Create List Partition Table +``` + CREATE TABLE [IF NOT EXISTS] [db_name.]table_name + [(col_name data_type , ...)] + PARTITIONED BY (partition_col_name data_type) + STORED BY 'carbondata' + [TBLPROPERTIES ('PARTITION_TYPE'='LIST', + 'LIST_INFO'='A, B, C' ...)] +``` +Notes: +1. List partition support list info in one level group. + +Exampleï¼ +``` + create table if not exists hash_partition_table( + col_B Int, + col_C Long, + col_D Decimal(10,2), + col_E Long, + col_F Timestamp + ) partitioned by (col_A String) + stored by 'carbondata' + tblproperties('partition_type'='List', + 'list_info'='aaaa, bbbb, (cccc, dddd), eeee') +``` + + +### Show Partitions +Execute following command to get the partition information +``` + SHOW PARTITIONS [db_name.]table_name + +``` + +### Maintain the Partitions +##### Add a new partition +``` + ALTER TABLE [db_name].table_name ADD PARTITION('new_partition') +``` +##### Split a partition +``` + ALTER TABLE [db_name].table_name SPLIT PARTITION(partition_id) INTO('new_partition1', 'new_partition2'...) +``` +##### Drop a partition +``` + //Drop partition definition only and keep data + ALTER TABLE [db_name].table_name DROP PARTITION(partition_id) + + //Drop both partition definition and data + ALTER TABLE [db_name].table_name DROP PARTITION(partition_id) WITH DATA +``` +Notes: +1. For the 1st case(keep data), + * if the table is a range partition table, data will be merged into the next partition, and if the dropped partition is the last one, then data will be merged into default partition. + * if the table is a list partition table, data will be merged into default partition. +2. Drop default partition is not allowed, but you can use DELETE statement to delete data in default partition. +3. partition_id could be got from SHOW PARTITIONS command. +4. Hash partition table is not supported for the ADD, SPLIT, DROP command. + +### Partition Id +In Carbondata, we don't use folders to divide partitions(just like hive did), instead we use partition id to replace the task id. +It could make use of the characteristic and meanwhile reduce some metadata. +``` +SegmentDir/0_batchno0-0-1502703086921.carbonindex + ^ +SegmentDir/part-0-0_batchno0-0-1502703086921.carbondata + ^ +``` + +### Tips +Here are some tips to improve query performance of carbon partition table: +##### 1. Do some analysis before choose the proper partition column +The distribution of data on some column could be very skew, building a skewed partition table is meaningless, so do some basic statistic analysis to avoid creating partition table on an extremely skewed column. +##### 2. Exclude partition column from sort columns +If you have many dimensions need to be sorted, then exclude partition column from sort columns, that will put other dimensions in a better position of sorting. +##### 3. Remember to add filter on partition column when writing SQLs +When writing SQLs on partition table, try to use filters on partition column.
