[carbondata] branch master updated: [CARBONDATA-3215] Optimize the documentation

raghunandan Thu, 17 Jan 2019 21:40:33 -0800

This is an automated email from the ASF dual-hosted git repository.

raghunandan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git



The following commit(s) were added to refs/heads/master by this push:
     new b828d0d  [CARBONDATA-3215] Optimize the documentation
b828d0d is described below

commit b828d0da9c5b0d63b7285f7d801edc4f0a949f5a
Author: xubo245 <xub...@huawei.com>
AuthorDate: Fri Dec 28 20:37:16 2018 +0800

    [CARBONDATA-3215] Optimize the documentation
    
    When users use global dictionary, local dictionary, and non-dictionary in the code,
    they may be confused. The same applies to MVDataMap and IndexDataMap. This PR
    describes and lists them.
    
    1. Describe global dictionary, local dictionary, and non-dictionary together in the doc
    2. List MVDataMap and IndexDataMap
    
    This closes #3033
---
 docs/datamap-developer-guide.md |   8 +-
 docs/ddl-of-carbondata.md       | 166 ++++++++++++++++++++--------------------
 2 files changed, 89 insertions(+), 85 deletions(-)

diff --git a/docs/datamap-developer-guide.md b/docs/datamap-developer-guide.md
index c74aa1b..e1fa355 100644
--- a/docs/datamap-developer-guide.md
+++ b/docs/datamap-developer-guide.md
@@ -19,16 +19,16 @@
 
 ### Introduction
 DataMap is a data structure that can be used to accelerate certain query of the table. Different DataMap can be implemented by developers. 
-Currently, there are two 2 types of DataMap supported:
-1. IndexDataMap: DataMap that leverages index to accelerate filter query
-2. MVDataMap: DataMap that leverages Materialized View to accelerate OLAP style query, like SPJG query (select, predicate, join, groupby)
+Currently, there are two types of DataMap supported:
+1. IndexDataMap: DataMap that leverages index to accelerate filter query. Lucene DataMap and BloomFiler DataMap belong to this type of DataMaps.
+2. MVDataMap: DataMap that leverages Materialized View to accelerate olap style query, like SPJG query (select, predicate, join, groupby). Preaggregate, timeseries and mv DataMap belong to this type of DataMaps.
 
 ### DataMap Provider
 When user issues `CREATE DATAMAP dm ON TABLE main USING 'provider'`, the corresponding DataMapProvider implementation will be created and initialized. 
 Currently, the provider string can be:
 1. preaggregate: A type of MVDataMap that do pre-aggregate of single table
 2. timeseries: A type of MVDataMap that do pre-aggregate based on time dimension of the table
-3. class name IndexDataMapFactory  implementation: Developer can implement new type of IndexDataMap by extending IndexDataMapFactory
+3. class name IndexDataMapFactory implementation: Developer can implement new type of IndexDataMap by extending IndexDataMapFactory
 
 When user issues `DROP DATAMAP dm ON TABLE main`, the corresponding DataMapProvider interface will be called.
 
diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index aaa2eda..b9b391b 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -21,13 +21,13 @@ CarbonData DDL statements are documented here,which includes:
 
 * [CREATE TABLE](#create-table)
   * [Dictionary Encoding](#dictionary-encoding-configuration)
+  * [Local Dictionary](#local-dictionary-configuration)
   * [Inverted Index](#inverted-index-configuration)
   * [Sort Columns](#sort-columns-configuration)
   * [Sort Scope](#sort-scope-configuration)
   * [Table Block Size](#table-block-size-configuration)
   * [Table Compaction](#table-compaction-configuration)
   * [Streaming](#streaming)
-  * [Local Dictionary](#local-dictionary-configuration)
   * [Caching Column Min/Max](#caching-minmax-value-for-required-columns)
   * [Caching Level](#caching-at-block-or-blocklet-level)
   * [Hive/Parquet folder Structure](#support-flat-folder-same-as-hiveparquet)
@@ -121,8 +121,91 @@ CarbonData DDL statements are documented here,which includes:
      TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2')
      ```
 
-     **NOTE**: Dictionary Include/Exclude for complex child columns is not supported.
+     **NOTE**: 
+      * Dictionary Include/Exclude for complex child columns is not supported. 
  
+      * Dictionary is global. Except global dictionary, there are local dictionary and non-dictionary in CarbonData.
+      
+   - ##### Local Dictionary Configuration
+
+   Columns for which dictionary is not generated needs more storage space and in turn more IO. Also since more data will have to be read during query, query performance also would suffer.Generating dictionary per blocklet for such columns would help in saving storage space and assist in improving query performance as carbondata is optimized for handling dictionary encoded columns more effectively.Generating dictionary internally per blocklet is termed as local dictionary. Please refer to [...]
 
+   Local Dictionary helps in:
+   1. Getting more compression.
+   2. Filter queries and full scan queries will be faster as filter will be done on encoded data.
+   3. Reducing the store size and memory footprint as only unique values will be stored as part of local dictionary and corresponding data will be stored as encoded data.
+   4. Getting higher IO throughput.
+
+   **NOTE:** 
+
+   * Following Data Types are Supported for Local Dictionary:
+      * STRING
+      * VARCHAR
+      * CHAR
+
+   * Following Data Types are not Supported for Local Dictionary: 
+      * SMALLINT
+      * INTEGER
+      * BIGINT
+      * DOUBLE
+      * DECIMAL
+      * TIMESTAMP
+      * DATE
+      * BOOLEAN
+      * FLOAT
+      * BYTE
+   * In case of multi-level complex dataType columns, primitive string/varchar/char columns are considered for local dictionary generation.
+
+   System Level Properties for Local Dictionary: 
+   
+   
+   | Properties | Default value | Description |
+   | ---------- | ------------- | ----------- |
+   | carbon.local.dictionary.enable | false | By default, Local Dictionary will be disabled for the carbondata table. |
+   | carbon.local.dictionary.decoder.fallback | true | Page Level data will not be maintained for the blocklet. During fallback, actual data will be retrieved from the encoded page data using local dictionary. **NOTE:** Memory footprint decreases significantly as compared to when this property is set to false |
+    
+   Local Dictionary can be configured using the following properties during create table command: 
+          
+
+| Properties | Default value | Description |
+| ---------- | ------------- | ----------- |
+| LOCAL_DICTIONARY_ENABLE | false | Whether to enable local dictionary generation. **NOTE:** If this property is defined, it will override the value configured at system level by '***carbon.local.dictionary.enable***'.Local dictionary will be generated for all string/varchar/char columns unless LOCAL_DICTIONARY_INCLUDE, LOCAL_DICTIONARY_EXCLUDE is configured. |
+| LOCAL_DICTIONARY_THRESHOLD | 10000 | The maximum cardinality of a column upto which carbondata can try to generate local dictionary (maximum - 100000). **NOTE:** When LOCAL_DICTIONARY_THRESHOLD is defined for Complex columns, the count of distinct records of all child columns are summed up. |
+| LOCAL_DICTIONARY_INCLUDE | string/varchar/char columns| Columns for which Local Dictionary has to be generated.**NOTE:** Those string/varchar/char columns which are added into DICTIONARY_INCLUDE option will not be considered for local dictionary generation. This property needs to be configured only when local dictionary needs to be generated for few columns, skipping others. This property takes effect only when **LOCAL_DICTIONARY_ENABLE** is true or **carbon.local.dictionary.enable** i [...]
+| LOCAL_DICTIONARY_EXCLUDE | none | Columns for which Local Dictionary need not be generated. This property needs to be configured only when local dictionary needs to be skipped for few columns, generating for others. This property takes effect only when **LOCAL_DICTIONARY_ENABLE** is true or **carbon.local.dictionary.enable** is true |
+
+   **Fallback behavior:** 
+
+   * When the cardinality of a column exceeds the threshold, it triggers a fallback and the generated dictionary will be reverted and data loading will be continued without dictionary encoding.
+   
+   * In case of complex columns, fallback is triggered when the summation value of all child columns' distinct records exceeds the defined LOCAL_DICTIONARY_THRESHOLD value.
+
+   **NOTE:** When fallback is triggered, the data loading performance will decrease as encoded data will be discarded and the actual data is written to the temporary sort files.
+
+   **Points to be noted:**
+
+   * Reduce Block size:
+   
+      Number of Blocks generated is less in case of Local Dictionary as compression ratio is high. This may reduce the number of tasks launched during query, resulting in degradation of query performance if the pruned blocks are less compared to the number of parallel tasks which can be run. So it is recommended to configure smaller block size which in turn generates more number of blocks.
+      
+### Example:
+
+   ```
+   CREATE TABLE carbontable(             
+     column1 string,             
+     column2 string,             
+     column3 LONG)
+   STORED AS carbondata
+   TBLPROPERTIES('LOCAL_DICTIONARY_ENABLE'='true','LOCAL_DICTIONARY_THRESHOLD'='1000',
+   'LOCAL_DICTIONARY_INCLUDE'='column1','LOCAL_DICTIONARY_EXCLUDE'='column2')
+   ```
+
+   **NOTE:** 
+
+   * We recommend to use Local Dictionary when cardinality is high but is distributed across multiple loads
+   * On a large cluster, decoding data can become a bottleneck for global dictionary as there will be many remote reads. In this scenario, it is better to use Local Dictionary.
+   * When cardinality is less, but loads are repetitive, it is better to use global dictionary as local dictionary generates multiple dictionary files at blocklet level increasing redundancy.
+   * If want to use non-dictionary, users can set LOCAL_DICTIONARY_ENABLE as false and don't set DICTIONARY_INCLUDE.
+      
    - ##### Inverted Index Configuration
 
      By default inverted index is disabled as store size will be reduced, it can be enabled by using a table property. It might help to improve compression ratio and query speed, especially for low cardinality columns which are in reward position.
@@ -224,85 +307,6 @@ CarbonData DDL statements are documented here,which includes:
      TBLPROPERTIES ('streaming'='true')
      ```
 
-   - ##### Local Dictionary Configuration
-
-   Columns for which dictionary is not generated needs more storage space and in turn more IO. Also since more data will have to be read during query, query performance also would suffer.Generating dictionary per blocklet for such columns would help in saving storage space and assist in improving query performance as carbondata is optimized for handling dictionary encoded columns more effectively.Generating dictionary internally per blocklet is termed as local dictionary. Please refer to [...]
-
-   Local Dictionary helps in:
-   1. Getting more compression.
-   2. Filter queries and full scan queries will be faster as filter will be done on encoded data.
-   3. Reducing the store size and memory footprint as only unique values will be stored as part of local dictionary and corresponding data will be stored as encoded data.
-   4. Getting higher IO throughput.
-
-   **NOTE:** 
-
-   * Following Data Types are Supported for Local Dictionary:
-      * STRING
-      * VARCHAR
-      * CHAR
-
-   * Following Data Types are not Supported for Local Dictionary: 
-      * SMALLINT
-      * INTEGER
-      * BIGINT
-      * DOUBLE
-      * DECIMAL
-      * TIMESTAMP
-      * DATE
-      * BOOLEAN
-      * FLOAT
-      * BYTE
-   * In case of multi-level complex dataType columns, primitive string/varchar/char columns are considered for local dictionary generation.
-
-   System Level Properties for Local Dictionary: 
-   
-   
-   | Properties | Default value | Description |
-   | ---------- | ------------- | ----------- |
-   | carbon.local.dictionary.enable | false | By default, Local Dictionary will be disabled for the carbondata table. |
-   | carbon.local.dictionary.decoder.fallback | true | Page Level data will not be maintained for the blocklet. During fallback, actual data will be retrieved from the encoded page data using local dictionary. **NOTE:** Memory footprint decreases significantly as compared to when this property is set to false |
-    
-   Local Dictionary can be configured using the following properties during create table command: 
-          
-
-| Properties | Default value | Description |
-| ---------- | ------------- | ----------- |
-| LOCAL_DICTIONARY_ENABLE | false | Whether to enable local dictionary generation. **NOTE:** If this property is defined, it will override the value configured at system level by '***carbon.local.dictionary.enable***'.Local dictionary will be generated for all string/varchar/char columns unless LOCAL_DICTIONARY_INCLUDE, LOCAL_DICTIONARY_EXCLUDE is configured. |
-| LOCAL_DICTIONARY_THRESHOLD | 10000 | The maximum cardinality of a column upto which carbondata can try to generate local dictionary (maximum - 100000). **NOTE:** When LOCAL_DICTIONARY_THRESHOLD is defined for Complex columns, the count of distinct records of all child columns are summed up. |
-| LOCAL_DICTIONARY_INCLUDE | string/varchar/char columns| Columns for which Local Dictionary has to be generated.**NOTE:** Those string/varchar/char columns which are added into DICTIONARY_INCLUDE option will not be considered for local dictionary generation. This property needs to be configured only when local dictionary needs to be generated for few columns, skipping others. This property takes effect only when **LOCAL_DICTIONARY_ENABLE** is true or **carbon.local.dictionary.enable** i [...]
-| LOCAL_DICTIONARY_EXCLUDE | none | Columns for which Local Dictionary need not be generated. This property needs to be configured only when local dictionary needs to be skipped for few columns, generating for others. This property takes effect only when **LOCAL_DICTIONARY_ENABLE** is true or **carbon.local.dictionary.enable** is true |
-
-   **Fallback behavior:** 
-
-   * When the cardinality of a column exceeds the threshold, it triggers a fallback and the generated dictionary will be reverted and data loading will be continued without dictionary encoding.
-   
-   * In case of complex columns, fallback is triggered when the summation value of all child columns' distinct records exceeds the defined LOCAL_DICTIONARY_THRESHOLD value.
-
-   **NOTE:** When fallback is triggered, the data loading performance will decrease as encoded data will be discarded and the actual data is written to the temporary sort files.
-
-   **Points to be noted:**
-
-   * Reduce Block size:
-   
-      Number of Blocks generated is less in case of Local Dictionary as compression ratio is high. This may reduce the number of tasks launched during query, resulting in degradation of query performance if the pruned blocks are less compared to the number of parallel tasks which can be run. So it is recommended to configure smaller block size which in turn generates more number of blocks.
-      
-### Example:
-
-   ```
-   CREATE TABLE carbontable(             
-     column1 string,             
-     column2 string,             
-     column3 LONG)
-   STORED AS carbondata
-   TBLPROPERTIES('LOCAL_DICTIONARY_ENABLE'='true','LOCAL_DICTIONARY_THRESHOLD'='1000',
-   'LOCAL_DICTIONARY_INCLUDE'='column1','LOCAL_DICTIONARY_EXCLUDE'='column2')
-   ```
-
-   **NOTE:** 
-
-   * We recommend to use Local Dictionary when cardinality is high but is distributed across multiple loads
-   * On a large cluster, decoding data can become a bottleneck for global dictionary as there will be many remote reads. In this scenario, it is better to use Local Dictionary.
-   * When cardinality is less, but loads are repetitive, it is better to use global dictionary as local dictionary generates multiple dictionary files at blocklet level increasing redundancy.
 
    - ##### Caching Min/Max Value for Required Columns
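
The Local Dictionary rules moved by this diff (the table-level property overriding the system-level one, and fallback once cardinality exceeds the threshold) can be sketched roughly as below. This is an illustrative Python sketch only, not CarbonData code: the function names `local_dictionary_enabled` and `triggers_fallback` are hypothetical, and only the property names and default values come from the documentation in the diff.

```python
from typing import Optional, Sequence

# Default taken from the docs in the diff: LOCAL_DICTIONARY_THRESHOLD = 10000.
DEFAULT_THRESHOLD = 10000


def local_dictionary_enabled(table_property: Optional[bool],
                             system_property: bool = False) -> bool:
    """Hypothetical helper: resolve the effective enable flag.

    table_property models LOCAL_DICTIONARY_ENABLE (None when not defined);
    system_property models carbon.local.dictionary.enable (default false).
    When the table property is defined, it overrides the system value.
    """
    return system_property if table_property is None else table_property


def triggers_fallback(distinct_counts: Sequence[int],
                      threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Hypothetical helper: decide whether fallback is triggered.

    distinct_counts holds one entry for a primitive column, or one entry
    per child column for a complex column; the counts are summed and
    fallback happens when the sum exceeds the threshold.
    """
    return sum(distinct_counts) > threshold


if __name__ == "__main__":
    print(local_dictionary_enabled(None))          # system default applies
    print(local_dictionary_enabled(True, False))   # table property wins
    print(triggers_fallback([6000, 5000]))         # complex column, summed
```

With the example table from the diff (`LOCAL_DICTIONARY_THRESHOLD` set to 1000), a `column1` holding more than 1000 distinct values would fall back to loading without dictionary encoding under this reading of the docs.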
