Repository: carbondata
Updated Branches:
  refs/heads/master ab9c2c083 -> 9a69d638b


http://git-wip-us.apache.org/repos/asf/carbondata/blob/9a69d638/docs/partition-guide.md
----------------------------------------------------------------------
diff --git a/docs/partition-guide.md b/docs/partition-guide.md
deleted file mode 100644
index b0b7862..0000000
--- a/docs/partition-guide.md
+++ /dev/null
@@ -1,188 +0,0 @@
-<!--
-    Licensed to the Apache Software Foundation (ASF) under one
-    or more contributor license agreements.  See the NOTICE file
-    distributed with this work for additional information
-    regarding copyright ownership.  The ASF licenses this file
-    to you under the Apache License, Version 2.0 (the
-    "License"); you may not use this file except in compliance
-    with the License.  You may obtain a copy of the License at
-
-      http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing,
-    software distributed under the License is distributed on an
-    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-    KIND, either express or implied.  See the License for the
-    specific language governing permissions and limitations
-    under the License.
--->
-
-# CarbonData Partition Table Guide
-This tutorial is designed to provide a quick introduction to creating and using partition tables in Apache CarbonData.
-
-* [Create Partition Table](#create-partition-table)
-  - [Create Hash Partition Table](#create-hash-partition-table)
-  - [Create Range Partition Table](#create-range-partition-table)
-  - [Create List Partition Table](#create-list-partition-table)
-* [Show Partitions](#show-partitions)
-* [Maintaining the Partitions](#maintaining-the-partitions)
-* [Partition Id](#partition-id)
-* [Useful Tips](#useful-tips)
-
-## Create Partition Table
-
-### Create Hash Partition Table
-
-```
-   CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
-                    [(col_name data_type , ...)]
-   PARTITIONED BY (partition_col_name data_type)
-   STORED BY 'carbondata'
-   [TBLPROPERTIES ('PARTITION_TYPE'='HASH',
-                   'PARTITION_NUM'='N' ...)]
-   //N is the number of hash partitions
-```
-
-Example:
-
-```
-   create table if not exists hash_partition_table(
-      col_A String,
-      col_B Int,
-      col_C Long,
-      col_D Decimal(10,2),
-      col_F Timestamp
-   ) partitioned by (col_E Long)
-   stored by 'carbondata'
-   tblproperties('partition_type'='Hash','partition_num'='9')
-```
-
-### Create Range Partition Table
-
-```
-   CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
-                    [(col_name data_type , ...)]
-   PARTITIONED BY (partition_col_name data_type)
-   STORED BY 'carbondata'
-   [TBLPROPERTIES ('PARTITION_TYPE'='RANGE',
-                   'RANGE_INFO'='2014-01-01, 2015-01-01, 2016-01-01' ...)]
-```
-
-**Note:**
-
-- The 'RANGE_INFO' must be defined in ascending order in the table properties.
-
-- The default format for a partition column of Date/Timestamp type is yyyy-MM-dd. Alternate formats for Date/Timestamp can be defined in CarbonProperties.
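-
-  For example, a minimal carbon.properties sketch (the `carbon.timestamp.format` key is an assumption here and should be verified against your CarbonData version):
-
-```
-# hypothetical override of the default Timestamp format
-carbon.timestamp.format=yyyy-MM-dd HH:mm:ss
-```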
-
-Example:
-
-```
-   create table if not exists range_partition_table(
-      col_A String,
-      col_B Int,
-      col_C Long,
-      col_D Decimal(10,2),
-      col_E Long
-   ) partitioned by (col_F Timestamp)
-   stored by 'carbondata'
-   tblproperties('partition_type'='Range',
-   'range_info'='2015-01-01, 2016-01-01, 2017-01-01, 2017-02-01')
-```
-
-### Create List Partition Table
-
-```
-   CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
-                    [(col_name data_type , ...)]
-   PARTITIONED BY (partition_col_name data_type)
-   STORED BY 'carbondata'
-   [TBLPROPERTIES ('PARTITION_TYPE'='LIST',
-                   'LIST_INFO'='A, B, C' ...)]
-```
-**Note:**
-- 'LIST_INFO' supports grouping values at one level only; a group such as (cccc, dddd) in the example below forms a single partition and cannot be nested.
-
-Example:
-
-```
-   create table if not exists list_partition_table(
-      col_B Int,
-      col_C Long,
-      col_D Decimal(10,2),
-      col_E Long,
-      col_F Timestamp
-   ) partitioned by (col_A String)
-   stored by 'carbondata'
-   tblproperties('partition_type'='List',
-   'list_info'='aaaa, bbbb, (cccc, dddd), eeee')
-```
-
-
-## Show Partitions
-Execute the following command to get the partition information of a table:
-
-```
-   SHOW PARTITIONS [db_name.]table_name
-```
-
-## Maintaining the Partitions
-### Add a new partition
-
-```
-   ALTER TABLE [db_name].table_name ADD PARTITION('new_partition')
-```
-### Split a partition
-
-```
-   ALTER TABLE [db_name].table_name SPLIT PARTITION(partition_id)
-   INTO('new_partition1', 'new_partition2'...)
-```
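-
-For example, on the list partition table above (an illustrative sketch; the partition id is hypothetical and should be fetched via [SHOW PARTITIONS](#show-partitions)):
-
-```
-   ALTER TABLE list_partition_table ADD PARTITION('ffff')
-   ALTER TABLE list_partition_table SPLIT PARTITION(3) INTO('cccc', 'dddd')
-```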
-
-### Drop a partition
-
-```
-   //Drop partition definition only and keep data
-   ALTER TABLE [db_name].table_name DROP PARTITION(partition_id)
-
-   //Drop both partition definition and data
-   ALTER TABLE [db_name].table_name DROP PARTITION(partition_id) WITH DATA
-```
-
-**Note**:
-
-- In the first case, where the data in the table is preserved, there can be multiple scenarios as described below:
-
-   * if the table is a range partition table, data will be merged into the next partition, and if the dropped partition is the last partition, then data will be merged into the default partition.
-
-   * if the table is a list partition table, data will be merged into the default partition.
-
-- Dropping the default partition is not allowed, but the DELETE statement can be used to delete data in the default partition (see the example below).
-
-- The partition_id can be fetched using the [SHOW PARTITIONS](#show-partitions) command.
-
-- Hash partition tables do not support the ADD, SPLIT and DROP commands.
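-
-For example (an illustrative sketch; the partition id is hypothetical and should be fetched via SHOW PARTITIONS):
-
-```
-   //drop partition 2 but keep its data (merged into the default partition)
-   ALTER TABLE list_partition_table DROP PARTITION(2)
-
-   //delete rows that were merged into the default partition
-   DELETE FROM list_partition_table WHERE col_A = 'eeee'
-```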
-
-## Partition Id
-Unlike Hive, CarbonData does not use folders to divide partitions; instead, the partition id replaces the task id in data file names. This makes use of the partition characteristic while reducing the amount of metadata.
-
-```
-SegmentDir/0_batchno0-0-1502703086921.carbonindex
-           ^
-SegmentDir/part-0-0_batchno0-0-1502703086921.carbondata
-                  ^
-```
-
-## Useful Tips
-Here are some useful tips to improve query performance of CarbonData partition tables:
-
-**Prior analysis of proper partition column**
-
-The distribution of data based on an arbitrary column can be skewed, and building a partition table on a skewed column is of little use. Some basic statistical analysis before creating the partition table helps avoid an extremely skewed column.
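-
-For example, a quick skew check on a candidate partition column before creating the table (source_table and col_E are hypothetical here):
-
-```
-   //count rows per candidate partition value; a few huge groups indicate skew
-   select col_E, count(*) as cnt
-   from source_table
-   group by col_E
-   order by cnt desc
-```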
-
-**Exclude partition column from sort columns**
-
-If you have many dimensions that need to be sorted, exclude the partition column from the sort columns; this allows the other dimensions to be sorted efficiently.
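-
-A minimal sketch (assuming the SORT_COLUMNS table property; note that col_E, the partition column, is left out of the sort columns):
-
-```
-   create table if not exists hash_partition_table(
-      col_A String,
-      col_B Int,
-      col_C Long,
-      col_D Decimal(10,2),
-      col_F Timestamp
-   ) partitioned by (col_E Long)
-   stored by 'carbondata'
-   tblproperties('partition_type'='Hash','partition_num'='9',
-   'sort_columns'='col_A,col_B')
-```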
-
-**Remember to add filter on partition column when writing SQL**
-
-When writing SQL on a partition table, try to use filters on the partition column.
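-
-For example, with the hash partition table above (filtering on col_E, the partition column, lets CarbonData prune partitions):
-
-```
-   select col_A, col_B from hash_partition_table
-   where col_E = 100
-```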

http://git-wip-us.apache.org/repos/asf/carbondata/blob/9a69d638/docs/quick-start-guide.md
----------------------------------------------------------------------
diff --git a/docs/quick-start-guide.md b/docs/quick-start-guide.md
index d833679..84f871d 100644
--- a/docs/quick-start-guide.md
+++ b/docs/quick-start-guide.md
@@ -1,20 +1,18 @@
 <!--
-    Licensed to the Apache Software Foundation (ASF) under one
-    or more contributor license agreements.  See the NOTICE file
-    distributed with this work for additional information
-    regarding copyright ownership.  The ASF licenses this file
-    to you under the Apache License, Version 2.0 (the
-    "License"); you may not use this file except in compliance
-    with the License.  You may obtain a copy of the License at
+    Licensed to the Apache Software Foundation (ASF) under one or more 
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership. 
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with 
+    the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
 
-    Unless required by applicable law or agreed to in writing,
-    software distributed under the License is distributed on an
-    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-    KIND, either express or implied.  See the License for the
-    specific language governing permissions and limitations
-    under the License.
+    Unless required by applicable law or agreed to in writing, software 
+    distributed under the License is distributed on an "AS IS" BASIS, 
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and 
+    limitations under the License.
 -->
 
 # Quick Start
@@ -87,7 +85,8 @@ scala>carbon.sql("CREATE TABLE
 scala>carbon.sql("LOAD DATA INPATH '/path/to/sample.csv'
                   INTO TABLE test_table")
 ```
-**NOTE**: Please provide the real file path of `sample.csv` for the above script.
+**NOTE**: Please provide the real file path of `sample.csv` for the above script.
+If you encounter a "tablestatus.lock" issue, please refer to [troubleshooting](troubleshooting.md).
 
 ###### Query Data from a Table
 

http://git-wip-us.apache.org/repos/asf/carbondata/blob/9a69d638/docs/release-guide.md
----------------------------------------------------------------------
diff --git a/docs/release-guide.md b/docs/release-guide.md
index d33dc3e..c63bc1b 100644
--- a/docs/release-guide.md
+++ b/docs/release-guide.md
@@ -1,20 +1,18 @@
 <!--
-    Licensed to the Apache Software Foundation (ASF) under one
-    or more contributor license agreements.  See the NOTICE file
-    distributed with this work for additional information
-    regarding copyright ownership.  The ASF licenses this file
-    to you under the Apache License, Version 2.0 (the
-    "License"); you may not use this file except in compliance
-    with the License.  You may obtain a copy of the License at
+    Licensed to the Apache Software Foundation (ASF) under one or more 
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership. 
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with 
+    the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
 
-    Unless required by applicable law or agreed to in writing,
-    software distributed under the License is distributed on an
-    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-    KIND, either express or implied.  See the License for the
-    specific language governing permissions and limitations
-    under the License.
+    Unless required by applicable law or agreed to in writing, software 
+    distributed under the License is distributed on an "AS IS" BASIS, 
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and 
+    limitations under the License.
 -->
 
 # Apache CarbonData Release Guide

http://git-wip-us.apache.org/repos/asf/carbondata/blob/9a69d638/docs/supported-data-types-in-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/supported-data-types-in-carbondata.md b/docs/supported-data-types-in-carbondata.md
index 4ef9987..6c21508 100644
--- a/docs/supported-data-types-in-carbondata.md
+++ b/docs/supported-data-types-in-carbondata.md
@@ -1,20 +1,18 @@
 <!--
-    Licensed to the Apache Software Foundation (ASF) under one
-    or more contributor license agreements.  See the NOTICE file
-    distributed with this work for additional information
-    regarding copyright ownership.  The ASF licenses this file
-    to you under the Apache License, Version 2.0 (the
-    "License"); you may not use this file except in compliance
-    with the License.  You may obtain a copy of the License at
+    Licensed to the Apache Software Foundation (ASF) under one or more 
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership. 
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with 
+    the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
 
-    Unless required by applicable law or agreed to in writing,
-    software distributed under the License is distributed on an
-    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-    KIND, either express or implied.  See the License for the
-    specific language governing permissions and limitations
-    under the License.
+    Unless required by applicable law or agreed to in writing, software 
+    distributed under the License is distributed on an "AS IS" BASIS, 
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and 
+    limitations under the License.
 -->
 
 #  Data Types

http://git-wip-us.apache.org/repos/asf/carbondata/blob/9a69d638/docs/troubleshooting.md
----------------------------------------------------------------------
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 5464997..7d66ee0 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -1,26 +1,47 @@
 <!--
-    Licensed to the Apache Software Foundation (ASF) under one
-    or more contributor license agreements.  See the NOTICE file
-    distributed with this work for additional information
-    regarding copyright ownership.  The ASF licenses this file
-    to you under the Apache License, Version 2.0 (the
-    "License"); you may not use this file except in compliance
-    with the License.  You may obtain a copy of the License at
+    Licensed to the Apache Software Foundation (ASF) under one or more 
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership. 
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with 
+    the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
 
-    Unless required by applicable law or agreed to in writing,
-    software distributed under the License is distributed on an
-    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-    KIND, either express or implied.  See the License for the
-    specific language governing permissions and limitations
-    under the License.
+    Unless required by applicable law or agreed to in writing, software 
+    distributed under the License is distributed on an "AS IS" BASIS, 
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and 
+    limitations under the License.
 -->
 
 # Troubleshooting
 This tutorial is designed to provide troubleshooting for end users and developers
 who are building, deploying, and using CarbonData.
 
+## Getting a "tablestatus.lock" issue when loading data
+
+  **Symptom**
+```
+17/11/11 16:48:13 ERROR LocalFileLock: main hdfs:/localhost:9000/carbon/store/default/hdfstable/tablestatus.lock (No such file or directory)
+java.io.FileNotFoundException: hdfs:/localhost:9000/carbon/store/default/hdfstable/tablestatus.lock (No such file or directory)
+       at java.io.FileOutputStream.open0(Native Method)
+       at java.io.FileOutputStream.open(FileOutputStream.java:270)
+       at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
+       at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
+```
+
+  **Possible Cause**
+  If you use <hdfs path> as the store path when creating the CarbonSession, you may get this error because the default lock type is LOCALLOCK.
+
+  **Procedure**
+  Before creating the CarbonSession, set the lock type as below:
+  ```
+  import org.apache.carbondata.core.util.CarbonProperties
+  import org.apache.carbondata.core.constants.CarbonCommonConstants
+  CarbonProperties.getInstance().addProperty(CarbonCommonConstants.LOCK_TYPE, "HDFSLOCK")
+  ```
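+
+  Alternatively, the same setting can be placed in carbon.properties (a sketch; it assumes LOCK_TYPE maps to the `carbon.lock.type` key, which should be verified for your version):
+  ```
+  # hypothetical equivalent of the Scala snippet above
+  carbon.lock.type=HDFSLOCK
+  ```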
+
 ## Failed to load thrift libraries
 
   **Symptom**

http://git-wip-us.apache.org/repos/asf/carbondata/blob/9a69d638/docs/useful-tips-on-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/useful-tips-on-carbondata.md b/docs/useful-tips-on-carbondata.md
index d1d4a8c..30485da 100644
--- a/docs/useful-tips-on-carbondata.md
+++ b/docs/useful-tips-on-carbondata.md
@@ -1,137 +1,94 @@
 <!--
-    Licensed to the Apache Software Foundation (ASF) under one
-    or more contributor license agreements.  See the NOTICE file
-    distributed with this work for additional information
-    regarding copyright ownership.  The ASF licenses this file
-    to you under the Apache License, Version 2.0 (the
-    "License"); you may not use this file except in compliance
-    with the License.  You may obtain a copy of the License at
+    Licensed to the Apache Software Foundation (ASF) under one or more 
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership. 
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with 
+    the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
 
-    Unless required by applicable law or agreed to in writing,
-    software distributed under the License is distributed on an
-    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-    KIND, either express or implied.  See the License for the
-    specific language governing permissions and limitations
-    under the License.
+    Unless required by applicable law or agreed to in writing, software 
+    distributed under the License is distributed on an "AS IS" BASIS, 
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and 
+    limitations under the License.
 -->
 
 # Useful Tips
-This tutorial guides you to create CarbonData Tables and optimize performance.
-The following sections will elaborate on the above topics :
+  This tutorial guides you to create CarbonData Tables and optimize performance.
+  The following sections elaborate on these topics:
 
-* [Suggestions to create CarbonData Table](#suggestions-to-create-carbondata-table)
-* [Configuration for Optimizing Data Loading performance for Massive Data](#configuration-for-optimizing-data-loading-performance-for-massive-data)
-* [Optimizing Mass Data Loading](#configurations-for-optimizing-carbondata-performance)
+  * [Suggestions to create CarbonData Table](#suggestions-to-create-carbondata-table)
+  * [Configuration for Optimizing Data Loading performance for Massive Data](#configuration-for-optimizing-data-loading-performance-for-massive-data)
+  * [Optimizing Mass Data Loading](#configurations-for-optimizing-carbondata-performance)
 
 ## Suggestions to Create CarbonData Table
 
-Recently CarbonData was used to analyze performance of Telecommunication field.
-The results of the analysis for table creation with dimensions ranging from
-10 thousand to 10 billion rows and 100 to 300 columns have been summarized below.
+  For example, the results of the analysis for table creation with dimensions ranging from 10 thousand to 10 billion rows and 100 to 300 columns have been summarized below.
+  The following table describes some of the columns from the table used.
 
-The following table describes some of the columns from the table used.
+  - **Table Column Description**
 
+  | Column Name | Data Type     | Cardinality | Attribution |
+  |-------------|---------------|-------------|-------------|
+  | msisdn      | String        | 30 million  | Dimension   |
+  | BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
+  | HOST        | String        | 1 million   | Dimension   |
+  | Dime_1      | String        | 1 Thousand  | Dimension   |
+  | counter_1   | Decimal       | NA          | Measure     |
+  | counter_2   | Numeric(20,0) | NA          | Measure     |
+  | ...         | ...           | NA          | Measure     |
+  | counter_100 | Decimal       | NA          | Measure     |
 
-**Table Column Description**
 
-| Column Name | Data Type     | Cardinality | Attribution |
-|-------------|---------------|-------------|-------------|
-| msisdn      | String        | 30 million  | Dimension   |
-| BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
-| HOST        | String        | 1 million   | Dimension   |
-| Dime_1      | String        | 1 Thousand  | Dimension   |
-| counter_1   | Numeric(20,0) | NA          | Measure     |
-| ...         | ...           | NA          | Measure     |
-| counter_100 | Numeric(20,0) | NA          | Measure     |
-
-CarbonData has more than 50 test cases, on the basis of these we have following suggestions to enhance the query performance :
-
-
-
-* **Put the frequently-used column filter in the beginning**
+  - **Put the frequently-used column filter in the beginning**
 
  For example, if the MSISDN filter is used in most of the queries then we must put MSISDN in the first column.
-The create table command can be modified as suggested below :
-
-```
-  create table carbondata_table(
-  msisdn String,
-  ...
-  )STORED BY 'org.apache.carbondata.format'
-  TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,..',
-  'DICTIONARY_INCLUDE'='...');
+  The create table command can be modified as suggested below:
 
-  Example:
+  ```
   create table carbondata_table(
     msisdn String,
-    BEGIN_TIME bigint
-    )STORED BY 'org.apache.carbondata.format'
-    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN',
-    'DICTIONARY_INCLUDE'='BEGIN_TIME');
-
-```
+    BEGIN_TIME bigint,
+    HOST String,
+    Dime_1 String,
+    counter_1 Decimal,
+    ...
+    )STORED BY 'carbondata'
+    TBLPROPERTIES ('SORT_COLUMNS'='msisdn, Dime_1')
+  ```
 
   Now the query with MSISDN in the filter will be more efficient.
 
-
-* **Put the frequently-used columns in the order of low to high cardinality**
+  - **Put the frequently-used columns in the order of low to high cardinality**
 
  If the table in the specified query has multiple columns which are frequently used to filter the results, it is suggested to put
  the columns in the order of cardinality low to high. This ordering of frequently used columns improves the compression ratio and
  enhances the performance of queries with filter on these columns.
 
  For example if MSISDN, HOST and Dime_1 are frequently-used columns, then the column order of table is suggested as
-  Dime_1>HOST>MSISDN as Dime_1 has the lowest cardinality.
+  Dime_1>HOST>MSISDN, because Dime_1 has the lowest cardinality.
   The create table command can be modified as suggested below :
 
-```
+  ```
   create table carbondata_table(
-  Dime_1 String,
-  HOST String,
-  MSISDN String,
-  ...
-  )STORED BY 'org.apache.carbondata.format'
-  TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST..',
-  'DICTIONARY_INCLUDE'='Dime_1..');
-
-  Example:
-  create table carbondata_table(
-    Dime_1 String,
-    HOST String,
-    MSISDN String
-    )STORED BY 'org.apache.carbondata.format'
-    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST',
-    'DICTIONARY_INCLUDE'='Dime_1');
-
-
-```
-
-
-* **Put the Dimension type columns in order of low to high cardinality**
-
-  If the columns used to filter are not frequently used, then it is suggested to order all the columns of dimension type in order of low to high cardinality.
-The create table command can be modified as below :
-
-```
-  create table carbondata_table(
-    Dime_1 String,
-    BEGIN_TIME bigint,
-    END_TIME bigint,
-    HOST String,
-    MSISDN String
-    ...
-    )STORED BY 'org.apache.carbondata.format'
-    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST...',
-    'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME...');
-```
-
-
-* **For measure type columns with non high accuracy, replace Numeric(20,0) data type with Double data type**
-
-  For columns of measure type, not requiring high accuracy, it is suggested to replace Numeric data type with Double to enhance
-query performance. The create table command can be modified as below :
+      msisdn String,
+      BEGIN_TIME bigint,
+      HOST String,
+      Dime_1 String,
+      counter_1 Decimal,
+      ...
+      )STORED BY 'carbondata'
+      TBLPROPERTIES ('SORT_COLUMNS'='Dime_1, HOST, MSISDN')
+  ```
+
+  - **For measure type columns of non-high accuracy, replace the Numeric(20,0) data type with Double**
+
+  For columns of measure type not requiring high accuracy, it is suggested to replace the Numeric data type with Double to enhance query performance.
+  The create table command can be modified as below:
 
 ```
   create table carbondata_table(
@@ -140,25 +97,20 @@ query performance. The create table command can be modified as below :
     END_TIME bigint,
     HOST String,
     MSISDN String,
-    counter_1 double,
+    counter_1 decimal,
     counter_2 double,
     ...
-    counter_100 double
-    )STORED BY 'org.apache.carbondata.format'
-    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST...',
-    'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME...');
+    )STORED BY 'carbondata'
+    TBLPROPERTIES ('SORT_COLUMNS'='Dime_1, HOST, MSISDN')
 ```
  The result of performance analysis of test-case shows reduction in query execution time from 15 to 3 seconds, thereby improving performance by nearly 5 times.
 
+ - **Columns of incremental character should be re-arranged at the end of dimensions**
 
-* **Columns of incremental character should be re-arranged at the end of dimensions**
-
-  Consider the following scenario where data is loaded each day and the begin_time is incremental for each load, it is
-suggested to put begin_time at the end of dimensions.
-
+  Consider the following scenario where data is loaded each day and begin_time is incremental for each load; it is suggested to put begin_time at the end of the dimensions.
  Incremental values are efficient in using min/max index. The create table command can be modified as below :
 
-```
+  ```
   create table carbondata_table(
     Dime_1 String,
     HOST String,
@@ -169,67 +121,52 @@ suggested to put begin_time at the end of dimensions.
     END_TIME bigint,
     ...
     counter_100 double
-    )STORED BY 'org.apache.carbondata.format'
-    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST...',
-    'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME....');
-```
-
-
-* **Avoid adding high cardinality columns to dictionary**
-
-  If the system has low memory configuration, then it is suggested to exclude high cardinality columns from the dictionary to
-enhance load performance. Creation of dictionary for high cardinality columns at time of load will degrade load performance due to
-excessive memory usage.
-
-  By default CarbonData determines the cardinality at the first data load and allows for dictionary creation only if the cardinality is less than
-1 million.
-
-
+    )STORED BY 'carbondata'
+    TBLPROPERTIES ('SORT_COLUMNS'='Dime_1, HOST, MSISDN')
+  ```
 
 ## Configuration for Optimizing Data Loading performance for Massive Data
 
 
- CarbonData supports large data load, in this process sorting data while loading consumes a lot of memory and disk IO and
- this can result sometimes in "Out Of Memory" exception.
- If you do not have much memory to use, then you may prefer to slow the speed of data loading instead of data load failure.
- You can configure CarbonData by tuning following properties in carbon.properties file to get a better performance.
-
-| Parameter | Default Value | Description/Tuning |
-|-----------|-------------|--------|
-|carbon.number.of.cores.while.loading|Default: 2.This value should be >= 2|Specifies the number of cores used for data processing during data loading in CarbonData. |
-|carbon.sort.size|Default: 100000. The value should be >= 100.|Threshold to write local file in sort step when loading data|
-|carbon.sort.file.write.buffer.size|Default:  50000.|DataOutputStream buffer. |
-|carbon.number.of.cores.block.sort|Default: 7 | If you have huge memory and cpus, increase it as you will|
-|carbon.merge.sort.reader.thread|Default: 3 |Specifies the number of cores used for temp file merging during data loading in CarbonData.|
-|carbon.merge.sort.prefetch|Default: true | You may want set this value to false if you have not enough memory|
+  CarbonData supports large data loads; in this process, sorting the data while loading consumes a lot of memory and disk IO,
+  which can sometimes result in an "Out Of Memory" exception.
+  If you do not have much memory to use, then you may prefer to slow down the data loading instead of risking a data load failure.
+  You can configure CarbonData by tuning the following properties in the carbon.properties file to get better performance.
 
+  | Parameter | Default Value | Description/Tuning |
+  |-----------|-------------|--------|
+  |carbon.number.of.cores.while.loading|Default: 2. This value should be >= 2|Specifies the number of cores used for data processing during data loading in CarbonData.|
+  |carbon.sort.size|Default: 100000. The value should be >= 100.|Threshold for writing local files in the sort step when loading data|
+  |carbon.sort.file.write.buffer.size|Default: 50000.|DataOutputStream buffer.|
+  |carbon.number.of.cores.block.sort|Default: 7|If you have huge memory and many CPUs, increase it as you wish|
+  |carbon.merge.sort.reader.thread|Default: 3|Specifies the number of cores used for temp file merging during data loading in CarbonData.|
+  |carbon.merge.sort.prefetch|Default: true|You may want to set this value to false if you do not have enough memory|
 
-For example, if there are  10 million records ,and i have only 16 cores ,64GB memory, will be loaded to CarbonData table.
-Using the default configuration  always fail in sort step. Modify carbon.properties as suggested below:
+  For example, suppose 10 million records are to be loaded into a CarbonData table on a machine with only 16 cores and 64GB of memory.
+  Using the default configuration, the load always fails in the sort step. Modify carbon.properties as suggested below:
 
-
-```
-carbon.number.of.cores.block.sort=1
-carbon.merge.sort.reader.thread=1
-carbon.sort.size=5000
-carbon.sort.file.write.buffer.size=5000
-carbon.merge.sort.prefetch=false
-```
+  ```
+  carbon.number.of.cores.block.sort=1
+  carbon.merge.sort.reader.thread=1
+  carbon.sort.size=5000
+  carbon.sort.file.write.buffer.size=5000
+  carbon.merge.sort.prefetch=false
+  ```
 
 ## Configurations for Optimizing CarbonData Performance
 
-Recently we did some performance POC on CarbonData for Finance and telecommunication Field. It involved detailed queries and aggregation
-scenarios. After the completion of POC, some of the configurations impacting the performance have been identified and tabulated below :
-
-| Parameter | Location | Used For  | Description | Tuning |
-|----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During the loading of data, local temp is used to sort the data. This number specifies the minimum number of intermediate files after which the merge sort has to be initiated. | Increasing the parameter to a higher value will improve the load performance. For example, when we increase the value from 20 to 100, it increases the data load performance from 35MB/S to more than 50MB/S. Higher values of this parameter consumes more memory during the load. |
-| carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more number of CPUs, then you can increase the number of CPUs, which will increase the performance. For example if we increase the value from 2 to 4 then the CSV reading performance can increase about 1 times |
-| carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and number of compacted segments to be merged in stage 2. | Each CarbonData load will create one segment, if every load is small in size it will generate many small file over a period of time impacting the query performance. Configuring this parameter will merge the small segment to one big segment which will sort the data and improve the performance. For Example in one telecommunication scenario, the performance improves about 2 times after minor compaction. |
-| spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
-| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
-| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
-| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
-| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property when you encounter disk hotspot problem during data loading. |
-
-Note: If your CarbonData instance is provided only for query, you may specify the property 'spark.speculation=true' which is in conf directory of spark.
+  Recently we did some performance POC on CarbonData for the finance and telecommunication fields. It involved detailed query and aggregation
+  scenarios. After the completion of the POC, some of the configurations impacting performance have been identified and tabulated below:
+
+  | Parameter | Location | Used For  | Description | Tuning |
+  |-----------|----------|-----------|-------------|--------|
+  | carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During data loading, local temp files are used to sort the data. This number specifies the minimum number of intermediate files after which the merge sort is initiated. | Increasing this parameter will improve load performance. For example, increasing the value from 20 to 100 raised data load performance from 35MB/s to more than 50MB/s. Higher values consume more memory during the load. |
+  | carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more CPUs, you can increase this number to improve performance. For example, increasing the value from 2 to 4 can nearly double CSV reading performance. |
+  | carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and the number of compacted segments to be merged in stage 2. | Each CarbonData load creates one segment; if every load is small, many small files accumulate over time and degrade query performance. Configuring this parameter merges the small segments into one big, sorted segment, improving performance. For example, in one telecommunication scenario, performance improved about 2 times after minor compaction. |
+  | spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of tasks started for a Spark shuffle. | The value can be 1 to 2 times the number of executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
+  | spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData queries. | In the bank scenario, 4 CPU cores and 15 GB per executor gave good performance. More is not always better; these values must be configured properly when resources are limited. For example, in the bank scenario each node had plenty of CPU (32 cores) but less memory (64 GB), so more CPU could not be given without reducing memory. With 4 cores and 12GB per executor, GC sometimes occurred during the query, degrading query time from 3 seconds to more than 15 seconds. In such a scenario, increase the memory or decrease the CPU cores. |
+  | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size for records returned from the block scan. | In limit scenarios this parameter is very important. For example, if your query limit is 1000 but this value is set to 3000, the scan returns 3000 records while Spark takes only 1000 rows, so the remaining 2000 are wasted. In one finance test case with limit 1000, setting it to 100 roughly doubled performance compared to setting it to 12000. |
+  | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use YARN local directories for multi-table load disk load balance | If set to true, CarbonData uses YARN local directories for multi-table load disk load balance, improving data load performance. |
+  | carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData uses all YARN local directories during data load for disk load balance, improving data load performance. Enable this property when you encounter disk hotspot problems during data loading. |
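+
+  For example, a spark-defaults.conf sketch using the illustrative values from the table above (tune these for your own cluster):
+
+  ```
+  spark.sql.shuffle.partitions=32
+  spark.executor.cores=4
+  spark.executor.memory=15g
+  ```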
+
+  Note: If your CarbonData instance is used only for queries, you may set the property 'spark.speculation=true' in the conf directory of Spark.
