Repository: carbondata-site
Updated Branches:
  refs/heads/asf-site 324588f48 -> bee563340
http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/44eed099/src/site/markdown/segment-management-on-carbondata.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/segment-management-on-carbondata.md b/src/site/markdown/segment-management-on-carbondata.md
new file mode 100644
index 0000000..a519c88
--- /dev/null
+++ b/src/site/markdown/segment-management-on-carbondata.md
@@ -0,0 +1,154 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements. See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+
+## SEGMENT MANAGEMENT
+
+Each load into CarbonData is written into a separate folder called a Segment. Segments are a powerful
+concept that helps to maintain data consistency and enables easy transaction management. CarbonData
+provides DML (Data Manipulation Language) commands to maintain the segments.
+
+- [Show Segments](#show-segment)
+- [Delete Segment by ID](#delete-segment-by-id)
+- [Delete Segment by Date](#delete-segment-by-date)
+- [Query Data with Specified Segments](#query-data-with-specified-segments)
+
+### SHOW SEGMENT
+
+  This command is used to list the segments of a CarbonData table.
+
+  ```
+  SHOW [HISTORY] SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments
+  ```
+
+  Example:
+  Show visible segments
+  ```
+  SHOW SEGMENTS FOR TABLE CarbonDatabase.CarbonTable LIMIT 4
+  ```
+  Show all segments, including invisible segments
+  ```
+  SHOW HISTORY SEGMENTS FOR TABLE CarbonDatabase.CarbonTable LIMIT 4
+  ```
+
+### DELETE SEGMENT BY ID
+
+  This command is used to delete a segment by using its segment ID. Each segment has a unique segment ID associated with it.
+  Using this segment ID, you can remove the segment.
+
+  The following command will get the segment ID.
+
+  ```
+  SHOW SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments
+  ```
+
+  After you retrieve the segment ID of the segment that you want to delete, execute the following command to delete the selected segment.
+
+  ```
+  DELETE FROM TABLE [db_name.]table_name WHERE SEGMENT.ID IN (segment_id1, segment_id2, ...)
+  ```
+
+  Example:
+
+  ```
+  DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0)
+  DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0,5,8)
+  ```
+
+### DELETE SEGMENT BY DATE
+
+  This command allows you to delete CarbonData segment(s) from the store based on the date provided in the DML command.
+  Segments created before the specified date will be removed from the store.
+
+  ```
+  DELETE FROM TABLE [db_name.]table_name WHERE SEGMENT.STARTTIME BEFORE DATE_VALUE
+  ```
+
+  Example:
+  ```
+  DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.STARTTIME BEFORE '2017-06-01 12:05:06'
+  ```
+
+### QUERY DATA WITH SPECIFIED SEGMENTS
+
+  This command is used to read data from specified segments during CarbonScan.
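+
+  For illustration, here is an end-to-end sketch of this flow in plain SQL, combining the commands detailed below; the database name `db`, the table name `carbontable1`, and the segment IDs are taken from the examples further down and stand in for your own names.
+
+  ```
+  -- 1. List the segments and note the IDs to be queried
+  SHOW SEGMENTS FOR TABLE db.carbontable1;
+
+  -- 2. Restrict subsequent scans of this table to segments 1 and 3
+  SET carbon.input.segments.db.carbontable1 = 1,3;
+
+  -- 3. Queries on the table now read only the specified segments
+  SELECT count(*) FROM db.carbontable1;
+
+  -- 4. Reset the property so that all segments are read again
+  SET carbon.input.segments.db.carbontable1 = *;
+  ```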
+
+  Get the Segment ID:
+  ```
+  SHOW SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments
+  ```
+
+  Set the segment IDs for the table:
+  ```
+  SET carbon.input.segments.<database_name>.<table_name> = <list of segment IDs>
+  ```
+
+  **NOTE:**
+  carbon.input.segments: Specifies the segment IDs to be queried. This property allows you to query specified segments of the specified table. The CarbonScan will read data from the specified segments only.
+
+  If you want to query with segment reading in multi-threading mode, use CarbonSession.threadSet instead of the SET query:
+  ```
+  CarbonSession.threadSet("carbon.input.segments.<database_name>.<table_name>","<list of segment IDs>");
+  ```
+
+  Reset the segment IDs:
+  ```
+  SET carbon.input.segments.<database_name>.<table_name> = *;
+  ```
+
+  To reset in multi-threading mode, use CarbonSession.threadSet in the same way:
+  ```
+  CarbonSession.threadSet("carbon.input.segments.<database_name>.<table_name>","*");
+  ```
+
+  **Examples:**
+
+  * Example to show the list of segment IDs, segment status, and other required details, and then specify the list of segments to be read.
+
+  ```
+  SHOW SEGMENTS FOR TABLE carbontable1;
+
+  SET carbon.input.segments.db.carbontable1 = 1,3,9;
+  ```
+
+  * Example to query with segment reading in multi-threading mode:
+
+  ```
+  CarbonSession.threadSet("carbon.input.segments.db.carbontable_Multi_Thread","1,3");
+  ```
+
+  * Example for threadSet in a multi-threaded environment (the following shows how it is used in Scala code):
+
+  ```
+  import scala.concurrent.Future
+  import scala.concurrent.ExecutionContext.Implicits.global
+
+  def main(args: Array[String]) {
+    Future {
+      CarbonSession.threadSet("carbon.input.segments.db.carbontable_Multi_Thread","1")
+      spark.sql("select count(empno) from db.carbontable_Multi_Thread").show()
+    }
+  }
+  ```
+
+
+<script>
+$(function() {
+  // Show selected style on nav item
+  $('.b-nav__docs').addClass('selected');
+  // Display docs subnav items
+  if (!$('.b-nav__docs').parent().hasClass('nav__item__with__subs--expanded')) {
+    $('.b-nav__docs').parent().toggleClass('nav__item__with__subs--expanded');
+  }
+});
+</script>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/44eed099/src/site/markdown/streaming-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/streaming-guide.md b/src/site/markdown/streaming-guide.md
index 32d24dc..2f8aa5e 100644
--- a/src/site/markdown/streaming-guide.md
+++ b/src/site/markdown/streaming-guide.md
@@ -259,3 +259,16 @@ ALTER TABLE streaming_table COMPACT 'close_streaming'
 5. if the table has dictionary columns, it will not support concurrent data loading.
 6. block delete "streaming" segment while the streaming ingestion is running.
 7. block drop the streaming table while the streaming ingestion is running.
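+
+For illustration, a minimal sketch of the 'close_streaming' compaction referenced above, with the caveats from the limitations list as comments; the table name `streaming_table` comes from the surrounding example and stands in for your own streaming table.
+
+```
+-- Close the streaming segments of the table so that they can be converted to the normal columnar format.
+-- As the limitations above note, streaming segments cannot be deleted and the table cannot be dropped
+-- while the streaming ingestion job is still running, so stop ingestion before managing the segments.
+ALTER TABLE streaming_table COMPACT 'close_streaming'
+```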
+
+
+<script>
+$(function() {
+  // Show selected style on nav item
+  $('.b-nav__docs').addClass('selected');
+
+  // Display docs subnav items
+  if (!$('.b-nav__docs').parent().hasClass('nav__item__with__subs--expanded')) {
+    $('.b-nav__docs').parent().toggleClass('nav__item__with__subs--expanded');
+  }
+});
+</script>

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/44eed099/src/site/markdown/supported-data-types-in-carbondata.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/supported-data-types-in-carbondata.md b/src/site/markdown/supported-data-types-in-carbondata.md
index eb74a2e..35e41ba 100644
--- a/src/site/markdown/supported-data-types-in-carbondata.md
+++ b/src/site/markdown/supported-data-types-in-carbondata.md
@@ -36,7 +36,7 @@
     * VARCHAR
     **NOTE**: For string longer than 32000 characters, use `LONG_STRING_COLUMNS` in table property.
-    Please refer to TBLProperties in [CreateTable](https://github.com/apache/carbondata/blob/master/docs/data-management-on-carbondata.md#create-table) for more information.
+    Please refer to TBLProperties in [CreateTable](./ddl-of-carbondata.md#create-table) for more information.
 
   * Complex Types
     * arrays: ARRAY``<data_type>``
@@ -45,4 +45,16 @@
     **NOTE**: Only 2 level complex type schema is supported for now.
 
   * Other Types
-    * BOOLEAN
\ No newline at end of file
+    * BOOLEAN
+
+<script>
+$(function() {
+  // Show selected style on nav item
+  $('.b-nav__docs').addClass('selected');
+
+  // Display docs subnav items
+  if (!$('.b-nav__docs').parent().hasClass('nav__item__with__subs--expanded')) {
+    $('.b-nav__docs').parent().toggleClass('nav__item__with__subs--expanded');
+  }
+});
+</script>

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/44eed099/src/site/markdown/timeseries-datamap-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/timeseries-datamap-guide.md b/src/site/markdown/timeseries-datamap-guide.md
index 135188d..d3ef3c6 100644
--- a/src/site/markdown/timeseries-datamap-guide.md
+++ b/src/site/markdown/timeseries-datamap-guide.md
@@ -22,18 +22,19 @@
 * [Data Management](#data-management-with-pre-aggregate-tables)
 
 ## Timeseries DataMap Introduction (Alpha Feature)
-Timeseries DataMap a pre-aggregate table implementation based on 'pre-aggregate' DataMap.
+Timeseries DataMap is a pre-aggregate table implementation based on 'pre-aggregate' DataMap.
 Difference is that Timeseries DataMap has built-in understanding of time hierarchy and
 levels: year, month, day, hour, minute, so that it supports automatic roll-up in time dimension
 for query.
 
+**CAUTION:** The current version of CarbonData does not support roll-up. It will be implemented in future versions.
+
 The data loading, querying, compaction command and its behavior is the same as preaggregate DataMap.
-Please refer to [Pre-aggregate DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/preaggregate-datamap-guide.md)
+Please refer to [Pre-aggregate DataMap](./preaggregate-datamap-guide.md)
 for more information.
 
 To use this datamap, user can create multiple timeseries datamap on the main table which has
-a *event_time* column, one datamap for one time granularity. Then Carbondata can do automatic
-roll-up for queries on the main table.
+an *event_time* column, one datamap for one time granularity.
For example, below statement effectively create multiple pre-aggregate tables on main table called **timeseries** @@ -88,20 +89,10 @@ DMPROPERTIES ( ) AS SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), avg(price) FROM sales GROUP BY order_time, country, sex - -CREATE DATAMAP agg_minute -ON TABLE sales -USING "timeseries" -DMPROPERTIES ( - 'event_time'='order_time', - 'minute_granularity'='1', -) AS -SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex ``` For querying timeseries data, Carbondata has builtin support for following time related UDF -to enable automatically roll-up to the desired aggregation level + ``` timeseries(timeseries column name, 'aggregation level') ``` @@ -111,7 +102,7 @@ SELECT timeseries(order_time, 'hour'), sum(quantity) FROM sales GROUP BY timeser ``` It is **not necessary** to create pre-aggregate tables for each granularity unless required for -query. Carbondata can roll-up the data and fetch it. +query. For Example: For main table **sales** , if following timeseries datamaps were created for day level and hour level pre-aggregate @@ -138,7 +129,7 @@ level and hour level pre-aggregate avg(price) FROM sales GROUP BY order_time, country, sex ``` -Queries like below will be rolled-up and hit the timeseries datamaps +Queries like below will not be rolled-up and hit the main table ``` Select timeseries(order_time, 'month'), sum(quantity) from sales group by timeseries(order_time, 'month') @@ -155,9 +146,21 @@ the future CarbonData release. ## Compacting timeseries datamp -Refer to Compaction section in [preaggregation datamap](https://github.com/apache/carbondata/blob/master/docs/datamap/preaggregate-datamap-guide.md). +Refer to Compaction section in [preaggregation datamap](./preaggregate-datamap-guide.md). Same applies to timeseries datamap. ## Data Management on timeseries datamap -Refer to Data Management section in [preaggregation datamap](https://github.com/apache/carbondata/blob/master/docs/datamap/preaggregate-datamap-guide.md). -Same applies to timeseries datamap. \ No newline at end of file +Refer to Data Management section in [preaggregation datamap](./preaggregate-datamap-guide.md). +Same applies to timeseries datamap. + +<script> +$(function() { + // Show selected style on nav item + $('.b-nav__datamap').addClass('selected'); + + if (!$('.b-nav__datamap').parent().hasClass('nav__item__with__subs--expanded')) { + // Display datamap subnav items + $('.b-nav__datamap').parent().toggleClass('nav__item__with__subs--expanded'); + } +}); +</script> http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/44eed099/src/site/markdown/troubleshooting.md ---------------------------------------------------------------------- diff --git a/src/site/markdown/troubleshooting.md b/src/site/markdown/troubleshooting.md deleted file mode 100644 index 0156121..0000000 --- a/src/site/markdown/troubleshooting.md +++ /dev/null @@ -1,267 +0,0 @@ -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to you under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. 
You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -# Troubleshooting -This tutorial is designed to provide troubleshooting for end users and developers -who are building, deploying, and using CarbonData. - -## When loading data, gets tablestatus.lock issues: - - **Symptom** -``` -17/11/11 16:48:13 ERROR LocalFileLock: main hdfs:/localhost:9000/carbon/store/default/hdfstable/tablestatus.lock (No such file or directory) -java.io.FileNotFoundException: hdfs:/localhost:9000/carbon/store/default/hdfstable/tablestatus.lock (No such file or directory) - at java.io.FileOutputStream.open0(Native Method) - at java.io.FileOutputStream.open(FileOutputStream.java:270) - at java.io.FileOutputStream.<init>(FileOutputStream.java:213) - at java.io.FileOutputStream.<init>(FileOutputStream.java:101) -``` - - **Possible Cause** - If you use `<hdfs path>` as store path when creating carbonsession, may get the errors,because the default is LOCALLOCK. - - **Procedure** - Before creating carbonsession, sets as below: - ``` - import org.apache.carbondata.core.util.CarbonProperties - import org.apache.carbondata.core.constants.CarbonCommonConstants - CarbonProperties.getInstance().addProperty(CarbonCommonConstants.LOCK_TYPE, "HDFSLOCK") - ``` - -## Failed to load thrift libraries - - **Symptom** - - Thrift throws following exception : - - ``` - thrift: error while loading shared libraries: - libthriftc.so.0: cannot open shared object file: No such file or directory - ``` - - **Possible Cause** - - The complete path to the directory containing the libraries is not configured correctly. - - **Procedure** - - Follow the Apache thrift docs at [https://thrift.apache.org/docs/install](https://thrift.apache.org/docs/install) to install thrift correctly. - -## Failed to launch the Spark Shell - - **Symptom** - - The shell prompts the following error : - - ``` - org.apache.spark.sql.CarbonContext$$anon$$apache$spark$sql$catalyst$analysis - $OverrideCatalog$_setter_$org$apache$spark$sql$catalyst$analysis - $OverrideCatalog$$overrides_$e - ``` - - **Possible Cause** - - The Spark Version and the selected Spark Profile do not match. - - **Procedure** - - 1. Ensure your spark version and selected profile for spark are correct. - - 2. Use the following command : - -``` -"mvn -Pspark-2.1 -Dspark.version {yourSparkVersion} clean package" -``` -Note : Refrain from using "mvn clean package" without specifying the profile. - -## Failed to execute load query on cluster. - - **Symptom** - - Load query failed with the following exception: - - ``` - Dictionary file is locked for updation. - ``` - - **Possible Cause** - - The carbon.properties file is not identical in all the nodes of the cluster. - - **Procedure** - - Follow the steps to ensure the carbon.properties file is consistent across all the nodes: - - 1. Copy the carbon.properties file from the master node to all the other nodes in the cluster. - For example, you can use ssh to copy this file to all the nodes. - - 2. For the changes to take effect, restart the Spark cluster. - -## Failed to execute insert query on cluster. 
- - **Symptom** - - Load query failed with the following exception: - - ``` - Dictionary file is locked for updation. - ``` - - **Possible Cause** - - The carbon.properties file is not identical in all the nodes of the cluster. - - **Procedure** - - Follow the steps to ensure the carbon.properties file is consistent across all the nodes: - - 1. Copy the carbon.properties file from the master node to all the other nodes in the cluster. - For example, you can use scp to copy this file to all the nodes. - - 2. For the changes to take effect, restart the Spark cluster. - -## Failed to connect to hiveuser with thrift - - **Symptom** - - We get the following exception : - - ``` - Cannot connect to hiveuser. - ``` - - **Possible Cause** - - The external process does not have permission to access. - - **Procedure** - - Ensure that the Hiveuser in mysql must allow its access to the external processes. - -## Failed to read the metastore db during table creation. - - **Symptom** - - We get the following exception on trying to connect : - - ``` - Cannot read the metastore db - ``` - - **Possible Cause** - - The metastore db is dysfunctional. - - **Procedure** - - Remove the metastore db from the carbon.metastore in the Spark Directory. - -## Failed to load data on the cluster - - **Symptom** - - Data loading fails with the following exception : - - ``` - Data Load failure exception - ``` - - **Possible Cause** - - The following issue can cause the failure : - - 1. The core-site.xml, hive-site.xml, yarn-site and carbon.properties are not consistent across all nodes of the cluster. - - 2. Path to hdfs ddl is not configured correctly in the carbon.properties. - - **Procedure** - - Follow the steps to ensure the following configuration files are consistent across all the nodes: - - 1. Copy the core-site.xml, hive-site.xml, yarn-site,carbon.properties files from the master node to all the other nodes in the cluster. - For example, you can use scp to copy this file to all the nodes. - - Note : Set the path to hdfs ddl in carbon.properties in the master node. - - 2. For the changes to take effect, restart the Spark cluster. - - - -## Failed to insert data on the cluster - - **Symptom** - - Insertion fails with the following exception : - - ``` - Data Load failure exception - ``` - - **Possible Cause** - - The following issue can cause the failure : - - 1. The core-site.xml, hive-site.xml, yarn-site and carbon.properties are not consistent across all nodes of the cluster. - - 2. Path to hdfs ddl is not configured correctly in the carbon.properties. - - **Procedure** - - Follow the steps to ensure the following configuration files are consistent across all the nodes: - - 1. Copy the core-site.xml, hive-site.xml, yarn-site,carbon.properties files from the master node to all the other nodes in the cluster. - For example, you can use scp to copy this file to all the nodes. - - Note : Set the path to hdfs ddl in carbon.properties in the master node. - - 2. For the changes to take effect, restart the Spark cluster. - -## Failed to execute Concurrent Operations(Load,Insert,Update) on table by multiple workers. - - **Symptom** - - Execution fails with the following exception : - - ``` - Table is locked for updation. - ``` - - **Possible Cause** - - Concurrency not supported. - - **Procedure** - - Worker must wait for the query execution to complete and the table to release the lock for another query execution to succeed. - -## Failed to create a table with a single numeric column. 
- - **Symptom** - - Execution fails with the following exception : - - ``` - Table creation fails. - ``` - - **Possible Cause** - - Behaviour not supported. - - **Procedure** - - A single column that can be considered as dimension is mandatory for table creation. http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/44eed099/src/site/markdown/useful-tips-on-carbondata.md ---------------------------------------------------------------------- diff --git a/src/site/markdown/useful-tips-on-carbondata.md b/src/site/markdown/useful-tips-on-carbondata.md deleted file mode 100644 index 641a7f3..0000000 --- a/src/site/markdown/useful-tips-on-carbondata.md +++ /dev/null @@ -1,179 +0,0 @@ -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to you under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -# Useful Tips - This tutorial guides you to create CarbonData Tables and optimize performance. - The following sections will elaborate on the below topics : - - * [Suggestions to create CarbonData Table](#suggestions-to-create-carbondata-table) - * [Configuration for Optimizing Data Loading performance for Massive Data](#configuration-for-optimizing-data-loading-performance-for-massive-data) - * [Optimizing Mass Data Loading](#configurations-for-optimizing-carbondata-performance) - -## Suggestions to Create CarbonData Table - - For example, the results of the analysis for table creation with dimensions ranging from 10 thousand to 10 billion rows and 100 to 300 columns have been summarized below. - The following table describes some of the columns from the table used. - - - **Table Column Description** - - | Column Name | Data Type | Cardinality | Attribution | - |-------------|---------------|-------------|-------------| - | msisdn | String | 30 million | Dimension | - | BEGIN_TIME | BigInt | 10 Thousand | Dimension | - | HOST | String | 1 million | Dimension | - | Dime_1 | String | 1 Thousand | Dimension | - | counter_1 | Decimal | NA | Measure | - | counter_2 | Numeric(20,0) | NA | Measure | - | ... | ... | NA | Measure | - | counter_100 | Decimal | NA | Measure | - - - - **Put the frequently-used column filter in the beginning** - - For example, MSISDN filter is used in most of the query then we must put the MSISDN in the first column. - The create table command can be modified as suggested below : - - ``` - create table carbondata_table( - msisdn String, - BEGIN_TIME bigint, - HOST String, - Dime_1 String, - counter_1, Decimal - ... - - )STORED BY 'carbondata' - TBLPROPERTIES ('SORT_COLUMNS'='msisdn, Dime_1') - ``` - - Now the query with MSISDN in the filter will be more efficient. 
- - - **Put the frequently-used columns in the order of low to high cardinality** - - If the table in the specified query has multiple columns which are frequently used to filter the results, it is suggested to put - the columns in the order of cardinality low to high. This ordering of frequently used columns improves the compression ratio and - enhances the performance of queries with filter on these columns. - - For example, if MSISDN, HOST and Dime_1 are frequently-used columns, then the column order of table is suggested as - Dime_1>HOST>MSISDN, because Dime_1 has the lowest cardinality. - The create table command can be modified as suggested below : - - ``` - create table carbondata_table( - msisdn String, - BEGIN_TIME bigint, - HOST String, - Dime_1 String, - counter_1, Decimal - ... - - )STORED BY 'carbondata' - TBLPROPERTIES ('SORT_COLUMNS'='Dime_1, HOST, MSISDN') - ``` - - - **For measure type columns with non high accuracy, replace Numeric(20,0) data type with Double data type** - - For columns of measure type, not requiring high accuracy, it is suggested to replace Numeric data type with Double to enhance query performance. - The create table command can be modified as below : - -``` - create table carbondata_table( - Dime_1 String, - BEGIN_TIME bigint, - END_TIME bigint, - HOST String, - MSISDN String, - counter_1 decimal, - counter_2 double, - ... - )STORED BY 'carbondata' - TBLPROPERTIES ('SORT_COLUMNS'='Dime_1, HOST, MSISDN') -``` - The result of performance analysis of test-case shows reduction in query execution time from 15 to 3 seconds, thereby improving performance by nearly 5 times. - - - **Columns of incremental character should be re-arranged at the end of dimensions** - - Consider the following scenario where data is loaded each day and the begin_time is incremental for each load, it is suggested to put begin_time at the end of dimensions. - Incremental values are efficient in using min/max index. The create table command can be modified as below : - - ``` - create table carbondata_table( - Dime_1 String, - HOST String, - MSISDN String, - counter_1 double, - counter_2 double, - BEGIN_TIME bigint, - END_TIME bigint, - ... - counter_100 double - )STORED BY 'carbondata' - TBLPROPERTIES ('SORT_COLUMNS'='Dime_1, HOST, MSISDN') - ``` - - **NOTE:** - + BloomFilter can be created to enhance performance for queries with precise equal/in conditions. You can find more information about it in BloomFilter datamap [document](https://github.com/apache/carbondata/blob/master/docs/datamap/bloomfilter-datamap-guide.md). - - -## Configuration for Optimizing Data Loading performance for Massive Data - - - CarbonData supports large data load, in this process sorting data while loading consumes a lot of memory and disk IO and - this can result sometimes in "Out Of Memory" exception. - If you do not have much memory to use, then you may prefer to slow the speed of data loading instead of data load failure. - You can configure CarbonData by tuning following properties in carbon.properties file to get a better performance. - - | Parameter | Default Value | Description/Tuning | - |-----------|-------------|--------| - |carbon.number.of.cores.while.loading|Default: 2.This value should be >= 2|Specifies the number of cores used for data processing during data loading in CarbonData. | - |carbon.sort.size|Default: 100000. The value should be >= 100.|Threshold to write local file in sort step when loading data| - |carbon.sort.file.write.buffer.size|Default: 50000.|DataOutputStream buffer. 
| - |carbon.number.of.cores.block.sort|Default: 7 | If you have huge memory and CPUs, increase it as you will| - |carbon.merge.sort.reader.thread|Default: 3 |Specifies the number of cores used for temp file merging during data loading in CarbonData.| - |carbon.merge.sort.prefetch|Default: true | You may want set this value to false if you have not enough memory| - - For example, if there are 10 million records, and i have only 16 cores, 64GB memory, will be loaded to CarbonData table. - Using the default configuration always fail in sort step. Modify carbon.properties as suggested below: - - ``` - carbon.number.of.cores.block.sort=1 - carbon.merge.sort.reader.thread=1 - carbon.sort.size=5000 - carbon.sort.file.write.buffer.size=5000 - carbon.merge.sort.prefetch=false - ``` - -## Configurations for Optimizing CarbonData Performance - - Recently we did some performance POC on CarbonData for Finance and telecommunication Field. It involved detailed queries and aggregation - scenarios. After the completion of POC, some of the configurations impacting the performance have been identified and tabulated below : - - | Parameter | Location | Used For | Description | Tuning | - |----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| - | carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During the loading of data, local temp is used to sort the data. This number specifies the minimum number of intermediate files after which the merge sort has to be initiated. | Increasing the parameter to a higher value will improve the load performance. For example, when we increase the value from 20 to 100, it increases the data load performance from 35MB/S to more than 50MB/S. Higher values of this parameter consumes more memory during the load. | - | carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more number of CPUs, then you can increase the number of CPUs, which will increase the performance. For example if we increase the value from 2 to 4 then the CSV reading performance can increase about 1 times | - | carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and number of compacted segments to be merged in stage 2. | Each CarbonData load will create one segment, if every load is small in size it will generate many small file over a period of time impacting the query performance. 
Configuring this parameter will merge the small segment to one big segment which will sort the data and improve the performance. For Example in one telecommunication scenario, the performance improves about 2 times after minor compaction. | - | spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. | - | spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. | - | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. | - | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. | - | carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property when you encounter disk hotspot problem during data loading. | - | carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specify the name of compressor to compress the intermediate sort temporary files during sort procedure in data loading. | The optional values are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' and empty. By default, empty means that Carbondata will not compress the sort temp files. This parameter will be useful if you encounter disk bottleneck. | - | carbon.load.skewedDataOptimization.enabled | spark/carbonlib/carbon.properties | Data loading | Whether to enable size based block allocation strategy for data loading. | When loading, carbondata will use file size based block allocation strategy for task distribution. It will make sure that all the executors process the same size of data -- It's useful if the size of your input data files varies widely, say 1MB~1GB. 
| - | carbon.load.min.size.enabled | spark/carbonlib/carbon.properties | Data loading | Whether to enable node minumun input data size allocation strategy for data loading.| When loading, carbondata will use node minumun input data size allocation strategy for task distribution. It will make sure the node load the minimum amount of data -- It's useful if the size of your input data files very small, say 1MB~256MB,Avoid generating a large number of small files. | - - Note: If your CarbonData instance is provided only for query, you may specify the property 'spark.speculation=true' which is in conf directory of spark.
