This is an automated email from the ASF dual-hosted git repository.
akashrn5 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new 9122342 [CARBONDATA-3791] Fix documentation for various features
9122342 is described below
commit 9122342bdad83b50370435357c28aab0d51a8970
Author: kunal642 <[email protected]>
AuthorDate: Sun May 3 21:43:37 2020 +0530
[CARBONDATA-3791] Fix documentation for various features
Why is this PR needed?
Fix documentation for various features
What changes were proposed in this PR?
1. Added doc for writing data through Hive
2. Added doc for the alter upgrade_segment command
3. Fixed other minor documentation issues
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3738
---
docs/ddl-of-carbondata.md | 32 +++++++++++++++-----------------
docs/hive-guide.md | 22 +++++++++++++++-------
docs/index-server.md | 25 ++++++++++++++++---------
3 files changed, 46 insertions(+), 33 deletions(-)
diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index 84b18f3..3165f4e 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -20,7 +20,6 @@
CarbonData DDL statements are documented here,which includes:
* [CREATE TABLE](#create-table)
- * [Dictionary Encoding](#dictionary-encoding-configuration)
* [Local Dictionary](#local-dictionary-configuration)
* [Inverted Index](#inverted-index-configuration)
* [Sort Columns](#sort-columns-configuration)
@@ -31,7 +30,7 @@ CarbonData DDL statements are documented here,which includes:
* [Caching Column Min/Max](#caching-minmax-value-for-required-columns)
* [Caching Level](#caching-at-block-or-blocklet-level)
* [Hive/Parquet folder Structure](#support-flat-folder-same-as-hiveparquet)
- * [Extra Long String columns](#string-longer-than-32000-characters)
+ * [Long String columns](#string-longer-than-32000-characters)
* [Compression for Table](#compression-for-table)
* [Bad Records Path](#bad-records-path)
* [Load Minimum Input File Size](#load-minimum-data-size)
@@ -115,7 +114,7 @@ CarbonData DDL statements are documented here,which includes:
- ##### Local Dictionary Configuration
- Columns for which dictionary is not generated needs more storage space and
in turn more IO. Also since more data will have to be read during query, query
performance also would suffer.Generating dictionary per blocklet for such
columns would help in saving storage space and assist in improving query
performance as carbondata is optimized for handling dictionary encoded columns
more effectively.Generating dictionary internally per blocklet is termed as
local dictionary. Please refer to [...]
+ Columns for which dictionary is not generated need more storage space and
in turn more IO. Also, since more data will have to be read during query, query
performance would also suffer. Generating a dictionary per blocklet for such
columns helps in saving storage space and assists in improving query
performance, as carbondata is optimized for handling dictionary encoded columns
more effectively. Generating the dictionary internally per blocklet is termed as
local dictionary. Please refer to [...]
Local Dictionary helps in:
1. Getting more compression.
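As a quick illustration of the local dictionary settings described above, here is a minimal, hedged sketch of a table definition; the table and column names are placeholders, and the LOCAL_DICTIONARY_* property values are examples rather than recommendations.
```
-- placeholder table; enable local dictionary and pick include/exclude columns explicitly
CREATE TABLE carbon_sample (column1 STRING, column2 STRING, column3 INT)
STORED AS carbondata
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='true',
               'LOCAL_DICTIONARY_THRESHOLD'='10000',
               'LOCAL_DICTIONARY_INCLUDE'='column1',
               'LOCAL_DICTIONARY_EXCLUDE'='column2')
```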
@@ -200,7 +199,7 @@ CarbonData DDL statements are documented here,which includes:
**NOTE**: Columns specified in INVERTED_INDEX should also be present in
SORT_COLUMNS.
```
- TBLPROPERTIES ('SORT_COLUMNS'='column2,column3','NO_INVERTED_INDEX'='column1', 'INVERTED_INDEX'='column2, column3')
+ TBLPROPERTIES ('SORT_COLUMNS'='column2,column3', 'INVERTED_INDEX'='column2, column3')
```
- ##### Sort Columns Configuration
@@ -215,7 +214,7 @@ CarbonData DDL statements are documented here,which includes:
TBLPROPERTIES ('SORT_COLUMNS'='column1, column3')
```
- **NOTE**: Sort_Columns for Complex datatype columns and binary data type
is not supported.
+ **NOTE**: Sort_Columns is not supported for Complex datatype columns and for the binary, double, float, and decimal data types.
- ##### Sort Scope Configuration
@@ -240,7 +239,7 @@ CarbonData DDL statements are documented here,which includes:
revenue INT)
STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMNS'='productName,storeCity',
- 'SORT_SCOPE'='NO_SORT')
+ 'SORT_SCOPE'='LOCAL_SORT')
```
**NOTE:** CarbonData also supports "using carbondata". Find example code at
[SparkSessionExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/SparkSessionExample.scala)
in the CarbonData repo.
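As a hedged illustration of the "using carbondata" form mentioned in the note above (a sketch only; the schema simply mirrors the example earlier in this section):
```
-- same columns as the STORED AS carbondata example above, written in datasource syntax
CREATE TABLE IF NOT EXISTS productSalesTable (
  productNumber INT,
  productName STRING,
  storeCity STRING,
  revenue INT)
USING carbondata
```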
@@ -453,11 +452,11 @@ CarbonData DDL statements are documented here,which includes:
- ##### Compression for table
Data compression is also supported by CarbonData.
- By default, Snappy is used to compress the data. CarbonData also supports
ZSTD compressor.
+ By default, Snappy is used to compress the data. CarbonData also supports
ZSTD and GZIP compressors.
+
User can specify the compressor in the table property:
-
```
- TBLPROPERTIES('carbon.column.compressor'='snappy')
+ TBLPROPERTIES('carbon.column.compressor'='GZIP')
```
or
```
@@ -588,7 +587,7 @@ CarbonData DDL statements are documented here,which includes:
| STORED AS carbondata
| LOCATION '$storeLocation/origin'
""".stripMargin)
- checkAnswer(sql("SELECT count(*) from source"), sql("SELECT count(*) from origin"))
+ sql("SELECT count(*) from source").show()
```
### Create external table on Non-Transactional table data location.
@@ -608,12 +607,10 @@ CarbonData DDL statements are documented here,which includes:
This can be SDK output or C++ SDK output. Refer [SDK Guide](./sdk-guide.md)
and [C++ SDK Guide](./csdk-guide.md).
**Note:**
- 1. Dropping of the external table should not delete the files present in the
location.
+ 1. Dropping the external table will not delete the files present in the location.
2. When external table is created on non-transactional table data,
external table will be registered with the schema of carbondata files.
- If multiple files with different schema is present, exception will be
thrown.
- So, If table registered with one schema and files are of different schema,
- suggest to drop the external table and create again to register table with
new schema.
+ If multiple files have the same column with different datatypes, then an exception will be thrown.
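For context, a hedged sketch of creating such an external table over SDK output; the table name and location path are placeholders, and the schema is inferred from the carbondata files at that location.
```
-- external, non-transactional table over files written by the SDK (placeholder path)
CREATE EXTERNAL TABLE sdk_output_table STORED AS carbondata
LOCATION '/user/sdk/output'
```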
## CREATE DATABASE
@@ -680,6 +677,7 @@ CarbonData DDL statements are documented here,which includes:
**NOTE:** Add Complex datatype columns is not supported.
Users can specify which columns to include and exclude for local dictionary
generation after adding new columns. These will be appended with the already
existing local dictionary include and exclude columns of main table
respectively.
+
```
ALTER TABLE carbon ADD COLUMNS (a1 STRING, b1 STRING)
TBLPROPERTIES('LOCAL_DICTIONARY_INCLUDE'='a1','LOCAL_DICTIONARY_EXCLUDE'='b1')
```
@@ -1038,7 +1036,7 @@ Users can specify which columns to include and exclude for local dictionary gene
```
This shows the overall memory consumed in the cache by categories - index
files, dictionary and
- datamaps. This also shows the cache usage by all the tables and children
tables in the current
+ indexes. This also shows the cache usage by all the tables and children
tables in the current
database.
```sql
@@ -1054,7 +1052,7 @@ Users can specify which columns to include and exclude for local dictionary gene
```
This shows detailed information on cache usage by the table `tableName` and
its carbonindex files,
- its dictionary files, its datamaps and children tables.
+ its dictionary files, its indexes and children tables.
This command is not allowed on child tables.
@@ -1063,7 +1061,7 @@ Users can specify which columns to include and exclude for local dictionary gene
```
This clears any entry in cache by the table `tableName`, its carbonindex
files,
- its dictionary files, its datamaps and children tables.
+ its dictionary files, its indexes and children tables.
This command is not allowed on child tables.
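For reference, a hedged example of the cache commands described in this section; `tableName` is a placeholder, and the statements assume the SHOW METACACHE / DROP METACACHE syntax covered earlier in this guide.
```
-- overall cache usage for the current database, then for a single table
SHOW METACACHE;
SHOW METACACHE ON TABLE tableName;
-- evict that table's index, dictionary and child table entries from the cache
DROP METACACHE ON TABLE tableName;
```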
diff --git a/docs/hive-guide.md b/docs/hive-guide.md
index 1941168..982ee03 100644
--- a/docs/hive-guide.md
+++ b/docs/hive-guide.md
@@ -64,7 +64,7 @@ carbon.sql("LOAD DATA INPATH '<hdfs store path>/sample.csv' INTO TABLE hive_carb
scala>carbon.sql("SELECT * FROM hive_carbon").show()
```
-## Query Data in Hive
+## Configure Carbon in Hive
### Configure hive classpath
```
mkdir hive/auxlibs/
@@ -93,6 +93,17 @@ Carbon Jars to be copied to the above paths.
$HIVE_HOME/bin/beeline
```
+### Write data from hive
+
+ - Write data from hive into carbondata format.
+
+ ```
+create table hive_carbon(id int, name string, scale decimal, country string, salary double) stored by 'org.apache.carbondata.hive.CarbonStorageHandler';
+insert into hive_carbon select * from parquetTable;
+```
+
+**Note**: Only non-transactional tables are supported when created through Hive. This means that the standard carbon folder structure will not be followed and all files will be written in a flat folder structure.
+
### Query data from hive
- This is to read the carbon table through Hive. It is the integration of the
carbon with Hive.
@@ -105,13 +116,10 @@ These properties helps to recursively traverse through the directories to read t
### Example
```
- - In case if the carbon table is not set with the SERDE and the
INPUTFORMAT/OUTPUTFORMAT, user can create a new hive managed table like below
with the required details for the hive to read.
-create table hive_carbon_1(id int, name string, scale decimal, country string,
salary double) ROW FORMAT SERDE 'org.apache.carbondata.hive.CarbonHiveSerDe'
WITH SERDEPROPERTIES
('mapreduce.input.carboninputformat.databaseName'='default',
'mapreduce.input.carboninputformat.tableName'='HIVE_CARBON_EXAMPLE') STORED AS
INPUTFORMAT 'org.apache.carbondata.hive.MapredCarbonInputFormat' OUTPUTFORMAT
'org.apache.carbondata.hive.MapredCarbonOutputFormat' LOCATION
'location_to_the_carbon_table';
-
- Query the table
-select * from hive_carbon_1;
-select count(*) from hive_carbon_1;
-select * from hive_carbon_1 order by id;
+select * from hive_carbon;
+select count(*) from hive_carbon;
+select * from hive_carbon order by id;
```
### Note
diff --git a/docs/index-server.md b/docs/index-server.md
index 62e239d..6dde633 100644
--- a/docs/index-server.md
+++ b/docs/index-server.md
@@ -19,9 +19,8 @@
## Background
-Carbon currently prunes and caches all block/blocklet datamap index
information into the driver for
-normal table, for Bloom/Index datamaps the JDBC driver will launch a job to
prune and cache the
-datamaps in executors.
+Carbon currently prunes and caches all block/blocklet index information in the driver for
+normal tables. For Bloom/Lucene indexes, the JDBC driver will launch a job to
prune and cache the indexes in the executors.
This causes the driver to become a bottleneck in the following ways:
1. If the cache size becomes huge(70-80% of the driver memory) then there can
be excessive GC in
@@ -52,8 +51,7 @@ This mapping will be maintained for each table and will enable the index server
cache location for each segment.
2. Cache size held by each executor:
- This mapping will be used to distribute the segments equally(on the basis
of size) among the
- executors.
+This mapping will be used to distribute the segments equally (on the basis of size) among the executors.
Once a request is received each segment would be iterated over and
checked against tableToExecutorMapping to find if a executor is already
@@ -82,6 +80,15 @@ the pruned blocklets which would be further used for result fetching.
**Note:** Multiple JDBC drivers can connect to the index server to use the
cache.
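As a usage sketch, distributed pruning through the index server is typically switched on with the `carbon.enable.index.server` property; the statements below are a hedged example, and the `default.table1` database/table suffix is a placeholder.
```
-- enable index server based pruning for the current session
SET carbon.enable.index.server = true;
-- assumed per-table form: carbon.enable.index.server.<db_name>.<table_name>
SET carbon.enable.index.server.default.table1 = true;
```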
+## Enabling Size based distribution for Legacy stores
+The default round robin based distribution causes unequal distribution of
cache among the executors, which can cause any one of the executors to be
bloated with too much cache, resulting in performance degradation.
+This problem can be solved by running the `upgrade_segment` command, which
will fill the data size values for each segment in the tablestatus file. Any
cache loaded after this can use the traditional size based distribution.
+
+#### Example
+```
+alter table table1 compact 'upgrade_segment';
+```
+
## Reallocation of executor
In case executor(s) become dead/unavailable then the segments that were
earlier being handled by those would be reassigned to some other
@@ -102,7 +109,7 @@ In case of any failure the index server would fallback to embedded mode
which means that the JDBCServer would take care of distributed pruning.
A similar job would be fired by the JDBCServer which would take care of
pruning using its own executors. If for any reason the embedded mode
-also fails to prune the datamaps then the job would be passed on to
+also fails to prune the indexes then the job would be passed on to
driver.
**NOTE:** In case of embedded mode a job would be fired after pruning to clear
the
@@ -120,7 +127,7 @@ The user can set the location for these files by using 'carbon.indexserver.temp.
the files are written in the path /tmp/indexservertmp.
## Prepriming
-As each query is responsible for caching the pruned datamaps, thus a lot of
execution time is wasted in reading the
+As each query is responsible for caching the pruned indexes, a lot of
execution time is wasted in reading the
files and caching the datmaps for the first query.
To avoid this problem we have introduced Pre-Priming which allows each data
manipulation command like load, insert etc
to fire a request to the index server to load the corresponding segments into
the index server.
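A hedged example of turning pre-priming on, assuming the `carbon.indexserver.enable.prepriming` property referenced just below can be set for the session; if dynamic setting is not supported it would instead go into carbon.properties.
```
-- ask the index server to pre-load segment indexes on load/insert (assumed dynamic property)
SET carbon.indexserver.enable.prepriming = true;
```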
@@ -152,11 +159,11 @@ The user can enable prepriming by using 'carbon.indexserver.enable.prepriming' =
| carbon.index.server.ip | NA | Specify the IP/HOST on which the server
would be started. Better to specify the private IP. |
| carbon.index.server.port | NA | The port on which the index server has to be
started. |
|carbon.index.server.max.worker.threads| 500 | Number of RPC handlers to open
for accepting the requests from JDBC driver. Max accepted value is Integer.Max.
Refer: [Hive
configuration](https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L3441)
|
-|carbon.max.executor.lru.cache.size| NA | Maximum memory **(in MB)** upto
which the executor process can cache the data (DataMaps and reverse dictionary
values). Only integer values greater than 0 are accepted. **NOTE:** Mandatory
for the user to set. |
+|carbon.max.executor.lru.cache.size| NA | Maximum memory **(in MB)** upto
which the executor process can cache the data (Indexes and reverse dictionary
values). Only integer values greater than 0 are accepted. **NOTE:** Mandatory
for the user to set. |
|carbon.index.server.max.jobname.length|NA|The max length of the job to show
in the index server application UI. For bigger queries this may impact
performance as the whole string would be sent from JDBCServer to IndexServer.|
|carbon.max.executor.threads.for.block.pruning|4| max executor threads used
for block pruning. |
|carbon.index.server.inmemory.serialization.threshold.inKB|300|Max in memory
serialization size after reaching threshold data will be written to file. Min
value that the user can set is 0KB and max is 102400KB. |
-|carbon.indexserver.temp.path|tablePath| The folder to write the split files
if in memory datamap size for network transfers crossed the
'carbon.index.server.inmemory.serialization.threshold.inKB' limit.|
+|carbon.indexserver.temp.path|tablePath| The folder to write the split files
if the in-memory index cache size for network transfers crosses the
'carbon.index.server.inmemory.serialization.threshold.inKB' limit.|
##### spark-defaults.conf(only for secure mode)