This is an automated email from the ASF dual-hosted git repository.
akashrn5 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new 9122342 [CARBONDATA-3791] Fix documentation for various features
9122342 is described below
commit 9122342bdad83b50370435357c28aab0d51a8970
Author: kunal642 <[email protected]>
AuthorDate: Sun May 3 21:43:37 2020 +0530
[CARBONDATA-3791] Fix documentation for various features
Why is this PR needed?
Fix documentation for various features
What changes were proposed in this PR?
1. Added doc for writing data through Hive
2. Added doc for the alter upgrade_segment command
3. Fixed other minor documentation issues
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3738
---
docs/ddl-of-carbondata.md | 32 +++++++++++++++-----------------
docs/hive-guide.md | 22 +++++++++++++++-------
docs/index-server.md | 25 ++++++++++++++++---------
3 files changed, 46 insertions(+), 33 deletions(-)
diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index 84b18f3..3165f4e 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -20,7 +20,6 @@
CarbonData DDL statements are documented here,which includes:
* [CREATE TABLE](#create-table)
- * [Dictionary Encoding](#dictionary-encoding-configuration)
* [Local Dictionary](#local-dictionary-configuration)
* [Inverted Index](#inverted-index-configuration)
* [Sort Columns](#sort-columns-configuration)
@@ -31,7 +30,7 @@ CarbonData DDL statements are documented here,which includes:
* [Caching Column Min/Max](#caching-minmax-value-for-required-columns)
* [Caching Level](#caching-at-block-or-blocklet-level)
* [Hive/Parquet folder Structure](#support-flat-folder-same-as-hiveparquet)
- * [Extra Long String columns](#string-longer-than-32000-characters)
+ * [Long String columns](#string-longer-than-32000-characters)
* [Compression for Table](#compression-for-table)
* [Bad Records Path](#bad-records-path)
* [Load Minimum Input File Size](#load-minimum-data-size)
@@ -115,7 +114,7 @@ CarbonData DDL statements are documented here,which includes:
- ##### Local Dictionary Configuration
- Columns for which dictionary is not generated needs more storage space and
in turn more IO. Also since more data will have to be read during query, query
performance also would suffer.Generating dictionary per blocklet for such
columns would help in saving storage space and assist in improving query
performance as carbondata is optimized for handling dictionary encoded columns
more effectively.Generating dictionary internally per blocklet is termed as
local dictionary. Please refer to [...]
+ Columns for which dictionary is not generated need more storage space and
in turn more IO. Also, since more data will have to be read during query, query
performance would also suffer. Generating a dictionary per blocklet for such
columns helps in saving storage space and assists in improving query
performance, as carbondata is optimized for handling dictionary encoded columns
more effectively. Generating the dictionary internally per blocklet is termed as
local dictionary. Please refer to [...]
Local Dictionary helps in:
1. Getting more compression.
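As a quick illustration of the local dictionary settings described above, here is a minimal, hedged sketch of a table definition; the table and column names are placeholders, and the LOCAL_DICTIONARY_* property values are examples rather than recommendations.
```
-- placeholder table; enable local dictionary and pick include/exclude columns explicitly
CREATE TABLE carbon_sample (column1 STRING, column2 STRING, column3 INT)
STORED AS carbondata
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='true',
               'LOCAL_DICTIONARY_THRESHOLD'='10000',
               'LOCAL_DICTIONARY_INCLUDE'='column1',
               'LOCAL_DICTIONARY_EXCLUDE'='column2')
```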
@@ -200,7 +199,7 @@ CarbonData DDL statements are documented here,which includes:
**NOTE**: Columns specified in INVERTED_INDEX should also be present in
SORT_COLUMNS.
```
- TBLPROPERTIES ('SORT_COLUMNS'='column2,column3','NO_INVERTED_INDEX'='column1', 'INVERTED_INDEX'='column2, column3')
+ TBLPROPERTIES ('SORT_COLUMNS'='column2,column3', 'INVERTED_INDEX'='column2, column3')
```
- ##### Sort Columns Configuration
@@ -215,7 +214,7 @@ CarbonData DDL statements are documented here,which includes:
TBLPROPERTIES ('SORT_COLUMNS'='column1, column3')
```
- **NOTE**: Sort_Columns for Complex datatype columns and binary data type
is not supported.
+ **NOTE**: Sort_Columns is not supported for Complex datatype columns and for the binary, double, float, and decimal data types.
- ##### Sort Scope Configuration
@@ -240,7 +239,7 @@ CarbonData DDL statements are documented here,which includes:
revenue INT)
STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMNS'='productName,storeCity',
- 'SORT_SCOPE'='NO_SORT')
+ 'SORT_SCOPE'='LOCAL_SORT')
```
**NOTE:** CarbonData also supports "using carbondata". Find example code at
[SparkSessionExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/SparkSessionExample.scala)
in the CarbonData repo.
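As a hedged illustration of the "using carbondata" form mentioned in the note above (a sketch only; the schema simply mirrors the example earlier in this section):
```
-- same columns as the STORED AS carbondata example above, written in datasource syntax
CREATE TABLE IF NOT EXISTS productSalesTable (
  productNumber INT,
  productName STRING,
  storeCity STRING,
  revenue INT)
USING carbondata
```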
@@ -453,11 +452,11 @@ CarbonData DDL statements are documented here,which includes:
- ##### Compression for table
Data compression is also supported by CarbonData.
- By default, Snappy is used to compress the data. CarbonData also supports
ZSTD compressor.
+ By default, Snappy is used to compress the data. CarbonData also supports
ZSTD and GZIP compressors.
+
User can specify the compressor in the table property:
-
```
- TBLPROPERTIES('carbon.column.compressor'='snappy')
+ TBLPROPERTIES('carbon.column.compressor'='GZIP')
```
or
```
@@ -588,7 +587,7 @@ CarbonData DDL statements are documented here,which includes:
| STORED AS carbondata
| LOCATION '$storeLocation/origin'
""".stripMargin)
- checkAnswer(sql("SELECT count(*) from source"), sql("SELECT count(*) from origin"))
+ sql("SELECT count(*) from source").show()
```
### Create external table on Non-Transactional table data location.
@@ -608,12 +607,10 @@ CarbonData DDL statements are documented here,which includes:
This can be SDK output or C++ SDK output. Refer [SDK Guide](./sdk-guide.md)
and [C++ SDK Guide](./csdk-guide.md).
**Note:**
- 1. Dropping of the external table should not delete the files present in the
location.
+ 1. Dropping the external table will not delete the files present in the location.
2. When external table is created on non-transactional table data,
external table will be registered with the schema of carbondata files.
- If multiple files with different schema is present, exception will be
thrown.
- So, If table registered with one schema and files are of different schema,
- suggest to drop the external table and create again to register table with
new schema.
+ If multiple files have the same column with different datatypes, then an exception will be thrown.
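For context, a hedged sketch of creating such an external table over SDK output; the table name and location path are placeholders, and the schema is inferred from the carbondata files at that location.
```
-- external, non-transactional table over files written by the SDK (placeholder path)
CREATE EXTERNAL TABLE sdk_output_table STORED AS carbondata
LOCATION '/user/sdk/output'
```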
## CREATE DATABASE
@@ -680,6 +677,7 @@ CarbonData DDL statements are documented here,which includes:
**NOTE:** Add Complex datatype columns is not supported.
Users can specify which columns to include and exclude for local dictionary
generation after adding new columns. These will be appended with the already
existing local dictionary include and exclude columns of main table
respectively.
+
```
ALTER TABLE carbon ADD COLUMNS (a1 STRING, b1 STRING)
TBLPROPERTIES('LOCAL_DICTIONARY_INCLUDE'='a1','LOCAL_DICTIONARY_EXCLUDE'='b1')
```
@@ -1038,7 +1036,7 @@ Users can specify which columns to include and exclude for local dictionary gene
```
This shows the overall memory consumed in the cache by categories - index
files, dictionary and
- datamaps. This also shows the cache usage by all the tables and children
tables in the current
+ indexes. This also shows the cache usage by all the tables and children
tables in the current
database.
```sql
@@ -1054,7 +1052,7 @@ Users can specify which columns to include and exclude for local dictionary gene
```
This shows detailed information on cache usage by the table `tableName` and
its carbonindex files,
- its dictionary files, its datamaps and children tables.
+ its dictionary files, its indexes and children tables.
This command is not allowed on child tables.
@@ -1063,7 +1061,7 @@ Users can specify which columns to include and exclude for local dictionary gene
```
This clears any entry in cache by the table `tableName`, its carbonindex
files,
- its dictionary files, its datamaps and children tables.
+ its dictionary files, its indexes and children tables.
This command is not allowed on child tables.
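For reference, a hedged example of the cache commands described in this section; `tableName` is a placeholder, and the statements assume the SHOW METACACHE / DROP METACACHE syntax covered earlier in this guide.
```
-- overall cache usage for the current database, then for a single table
SHOW METACACHE;
SHOW METACACHE ON TABLE tableName;
-- evict that table's index, dictionary and child table entries from the cache
DROP METACACHE ON TABLE tableName;
```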
diff --git a/docs/hive-guide.md b/docs/hive-guide.md
index 1941168..982ee03 100644
--- a/docs/hive-guide.md
+++ b/docs/hive-guide.md
@@ -64,7 +64,7 @@ carbon.sql("LOAD DATA INPATH '<hdfs store path>/sample.csv' INTO TABLE hive_carb
scala>carbon.sql("SELECT * FROM hive_carbon").show()
```
-## Query Data in Hive
+## Configure Carbon in Hive
### Configure hive classpath
```
mkdir hive/auxlibs/
@@ -93,6 +93,17 @@ Carbon Jars to be copied to the above paths.
$HIVE_HOME/bin/beeline
```
+### Write data from hive
+
+ - Write data from hive into carbondata format.
+
+ ```
+create table hive_carbon(id int, name string, scale decimal, country string, salary double) stored by 'org.apache.carbondata.hive.CarbonStorageHandler';
+insert into hive_carbon select * from parquetTable;
+```
+
+**Note**: Only non-transactional tables are supported when created through Hive. This means that the standard carbon folder structure will not be followed and all files will be written in a flat folder structure.
+
### Query data from hive
- This is to read the carbon table through Hive. It is the integration of the
carbon with Hive.
@@ -105,13 +116,10 @@ These properties helps to recursively traverse through the directories to read t
### Example
```
- - In case if the carbon table is not set with the SERDE and the
INPUTFORMAT/OUTPUTFORMAT, user can create a new hive managed table like below
with the required details for the hive to read.
-create table hive_carbon_1(id int, name string, scale decimal, country string,
salary double) ROW FORMAT SERDE 'org.apache.carbondata.hive.CarbonHiveSerDe'
WITH SERDEPROPERTIES
('mapreduce.input.carboninputformat.databaseName'='default',
'mapreduce.input.carboninputformat.tableName'='HIVE_CARBON_EXAMPLE') STORED AS
INPUTFORMAT 'org.apache.carbondata.hive.MapredCarbonInputFormat' OUTPUTFORMAT
'org.apache.carbondata.hive.MapredCarbonOutputFormat' LOCATION
'location_to_the_carbon_table';
-
- Query the table
-select * from hive_carbon_1;
-select count(*) from hive_carbon_1;
-select * from hive_carbon_1 order by id;
+select * from hive_carbon;
+select count(*) from hive_carbon;
+select * from hive_carbon order by id;
```
### Note
diff --git a/docs/index-server.md b/docs/index-server.md
index 62e239d..6dde633 100644
--- a/docs/index-server.md
+++ b/docs/index-server.md
@@ -19,9 +19,8 @@
## Background
-Carbon currently prunes and caches all block/blocklet datamap index
information into the driver for
-normal table, for Bloom/Index datamaps the JDBC driver will launch a job to
prune and cache the
-datamaps in executors.
+Carbon currently prunes and caches all block/blocklet index information in the driver for
+normal tables. For Bloom/Lucene indexes, the JDBC driver will launch a job to
prune and cache the indexes in the executors.
This causes the driver to become a bottleneck in the following ways:
1. If the cache size becomes huge(70-80% of the driver memory) then there can
be excessive GC in
@@ -52,8 +51,7 @@ This mapping will be maintained for each table and will enable the index server
cache location for each segment.
2. Cache size held by each executor:
- This mapping will be used to distribute the segments equally(on the basis
of size) among the
- executors.
+This mapping will be used to distribute the segments equally (on the basis of size) among the executors.
Once a request is received each segment would be iterated over and
checked against tableToExecutorMapping to find if a executor is already
@@ -82,6 +80,15 @@ the pruned blocklets which would be further used for result fetching.
**Note:** Multiple JDBC drivers can connect to the index server to use the
cache.
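As a usage sketch, distributed pruning through the index server is typically switched on with the `carbon.enable.index.server` property; the statements below are a hedged example, and the `default.table1` database/table suffix is a placeholder.
```
-- enable index server based pruning for the current session
SET carbon.enable.index.server = true;
-- assumed per-table form: carbon.enable.index.server.<db_name>.<table_name>
SET carbon.enable.index.server.default.table1 = true;
```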
+## Enabling Size based distribution for Legacy stores
+The default round robin based distribution causes unequal distribution of
cache among the executors, which can cause any one of the executors to be
bloated with too much cache, resulting in performance degradation.
+This problem can be solved by running the `upgrade_segment` command, which
will fill the data size values for each segment in the tablestatus file. Any
cache loaded after this can use the traditional size based distribution.
+
+#### Example
+```
+alter table table1 compact 'upgrade_segment';
+```
+
## Reallocation of executor
In case executor(s) become dead/unavailable then the segments that were
earlier being handled by those would be reassigned to some other
@@ -102,7 +109,7 @@ In case of any failure the index server would fallback to embedded mode
which means that the JDBCServer would take care of distributed pruning.
A similar job would be fired by the JDBCServer which would take care of
pruning using its own executors. If for any reason the embedded mode
-also fails to prune the datamaps then the job would be passed on to
+also fails to prune the indexes then the job would be passed on to
driver.
**NOTE:** In case of embedded mode a job would be fired after pruning to clear
the
@@ -120,7 +127,7 @@ The user can set the location for these files by using 'carbon.indexserver.temp.
the files are written in the path /tmp/indexservertmp.
## Prepriming
-As each query is responsible for caching the pruned datamaps, thus a lot of
execution time is wasted in reading the
+As each query is responsible for caching the pruned indexes, a lot of
execution time is wasted in reading the
files and caching the datmaps for the first query.
To avoid this problem we have introduced Pre-Priming which allows each data
manipulation command like load, insert etc
to fire a request to the index server to load the corresponding segments into
the index server.
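A hedged example of turning pre-priming on, assuming the `carbon.indexserver.enable.prepriming` property referenced just below can be set for the session; if dynamic setting is not supported it would instead go into carbon.properties.
```
-- ask the index server to pre-load segment indexes on load/insert (assumed dynamic property)
SET carbon.indexserver.enable.prepriming = true;
```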
@@ -152,11 +159,11 @@ The user can enable prepriming by using 'carbon.indexserver.enable.prepriming' =
| carbon.index.server.ip | NA | Specify the IP/HOST on which the server
would be started. Better to specify the private IP. |
| carbon.index.server.port | NA | The port on which the index server has to be
started. |
|carbon.index.server.max.worker.threads| 500 | Number of RPC handlers to open
for accepting the requests from JDBC driver. Max accepted value is Integer.Max.
Refer: [Hive
configuration](https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L3441)
|
-|carbon.max.executor.lru.cache.size| NA | Maximum memory **(in MB)** upto
which the executor process can cache the data (DataMaps and reverse dictionary
values). Only integer values greater than 0 are accepted. **NOTE:** Mandatory
for the user to set. |
+|carbon.max.executor.lru.cache.size| NA | Maximum memory **(in MB)** upto
which the executor process can cache the data (Indexes and reverse dictionary
values). Only integer values greater than 0 are accepted. **NOTE:** Mandatory
for the user to set. |
|carbon.index.server.max.jobname.length|NA|The max length of the job to show
in the index server application UI. For bigger queries this may impact
performance as the whole string would be sent from JDBCServer to IndexServer.|
|carbon.max.executor.threads.for.block.pruning|4| max executor threads used
for block pruning. |
|carbon.index.server.inmemory.serialization.threshold.inKB|300|Max in memory
serialization size after reaching threshold data will be written to file. Min
value that the user can set is 0KB and max is 102400KB. |
-|carbon.indexserver.temp.path|tablePath| The folder to write the split files
if in memory datamap size for network transfers crossed the
'carbon.index.server.inmemory.serialization.threshold.inKB' limit.|
+|carbon.indexserver.temp.path|tablePath| The folder to write the split files
if the in-memory index cache size for network transfers crosses the
'carbon.index.server.inmemory.serialization.threshold.inKB' limit.|
##### spark-defaults.conf(only for secure mode)