This is an automated email from the ASF dual-hosted git repository.
ajantha pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new 4fec616 [CARBONDATA-3791] Fix spelling, validating links for
quick-start guide and configuration parameters
4fec616 is described below
commit 4fec616dadcc52e867f17ccadfec88e9d0b656ce
Author: Mahesh Raju Somalaraju <[email protected]>
AuthorDate: Mon May 4 00:00:07 2020 +0530
[CARBONDATA-3791] Fix spelling, validating links for quick-start guide and
configuration parameters
Why is this PR needed?
Fix spelling, validating links for quick-start guide and configuration
parameters
What changes were proposed in this PR?
.md file changes
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3740
---
docs/configuration-parameters.md | 46 ++++++++++------------
docs/quick-start-guide.md | 85 +++++++++++++++++++++-------------------
2 files changed, 65 insertions(+), 66 deletions(-)
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index dc105a8..9428e00 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -31,10 +31,9 @@ This section provides the details of all the configurations
required for the Car
| Property | Default Value | Description |
|----------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
-| carbon.storelocation | spark.sql.warehouse.dir property value | Location
where CarbonData will create the store, and write the data in its custom
format. If not specified,the path defaults to spark.sql.warehouse.dir property.
**NOTE:** Store location should be in one of the carbon supported filesystems.
Like HDFS or S3. It is not recommended to use this property. |
| carbon.ddl.base.hdfs.url | (none) | To simplify and shorten the path to be
specified in DDL/DML commands, this property is supported. This property is
used to configure the HDFS relative path, the path configured in
carbon.ddl.base.hdfs.url will be appended to the HDFS path configured in
fs.defaultFS of core-site.xml. If this path is configured, then the user need not
pass the complete path during data load. For example: If the absolute path of the csv
file is hdfs://10.18.101.155:54310/data/cnb [...]
| carbon.badRecords.location | (none) | CarbonData can detect the records not
conforming to defined table schema and isolate them as bad records. This
property is used to specify where to store such bad records. |
-| carbon.streaming.auto.handoff.enabled | true | CarbonData supports storing
of streaming data. To have high throughput for streaming, the data is written
in Row format which is highly optimized for write, but performs poorly for
query. When this property is true and when the streaming data size reaches
***carbon.streaming.segment.max.size***, CabonData will automatically convert
the data to columnar format and optimize it for faster querying.**NOTE:** It is
not recommended to keep the d [...]
+| carbon.streaming.auto.handoff.enabled | true | CarbonData supports storing
of streaming data. To have high throughput for streaming, the data is written
in Row format which is highly optimized for write, but performs poorly for
query. When this property is true and when the streaming data size reaches
***carbon.streaming.segment.max.size***, CarbonData will automatically convert
the data to columnar format and optimize it for faster querying. **NOTE:** It
is not recommended to keep the [...]
| carbon.streaming.segment.max.size | 1024000000 | CarbonData writes streaming
data in row format which is optimized for high write throughput. This property
defines the maximum size of data to be held in row format, beyond which it will
be converted to columnar format in order to support high performance query,
provided ***carbon.streaming.auto.handoff.enabled*** is true. **NOTE:** Setting
higher value will impact the streaming ingestion. The value has to be
configured in bytes. |
| carbon.segment.lock.files.preserve.hours | 48 | In order to support parallel
data loading onto the same table, CarbonData sequences (locks) at the
granularity of segments. Operations affecting the segment (like IUD, alter) are
blocked from parallel operations. This property value indicates the number of
hours the segment lock files will be preserved after dataload. These lock files
will be deleted with the clean command after the configured number of hours. |
| carbon.timestamp.format | yyyy-MM-dd HH:mm:ss | CarbonData can understand
data of timestamp type and process it in a special manner. The
format of Timestamp data may be different from that understood by CarbonData by
default. This configuration allows users to specify the format of Timestamp in
their data. |
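For illustration, a minimal sketch of how a custom timestamp format typically comes into play at load time; the table, the CSV path, and the use of a load-level TIMESTAMPFORMAT option are assumptions, not taken from the table above:
```
-- Hypothetical table and path; TIMESTAMPFORMAT is assumed to be accepted as a
-- load option mirroring carbon.timestamp.format.
CREATE TABLE IF NOT EXISTS sales_events (id INT, sold_at TIMESTAMP) STORED AS carbondata;

LOAD DATA INPATH 'hdfs://hdfs-path/sales_events.csv' INTO TABLE sales_events
OPTIONS('TIMESTAMPFORMAT'='yyyy/MM/dd HH:mm:ss');
```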
@@ -58,9 +57,9 @@ This section provides the details of all the configurations
required for the Car
| carbon.concurrent.lock.retries | 100 | CarbonData supports concurrent data
loading onto same table. To ensure the loading status is correctly updated into
the system, locks are used to sequence the status update step. This
configuration specifies the maximum number of retries to obtain the lock for
updating the load status. **NOTE:** This value is high because the more
concurrent loads that happen, the greater the chance of failing to obtain the lock when
tried. Adjust this value according t [...]
| carbon.concurrent.lock.retry.timeout.sec | 1 | Specifies the interval
between the retries to obtain the lock for concurrent operations. **NOTE:**
Refer to ***carbon.concurrent.lock.retries*** for understanding why CarbonData
uses locks during data loading operations. |
| carbon.csv.read.buffersize.byte | 1048576 | CarbonData uses Hadoop
InputFormat to read the csv files. This configuration value is used to pass
buffer size as input for the Hadoop MR job when reading the csv files. This
value is configured in bytes. **NOTE:** Refer to
***org.apache.hadoop.mapreduce.InputFormat*** documentation for additional
information. |
-| carbon.loading.prefetch | false | CarbonData uses univocity parser to read
csv files. This configuration is used to inform the parser whether it can
prefetch the data from csv files to speed up the reading.**NOTE:** Enabling
prefetch improves the data loading performance, but needs higher memory to keep
more records which are read ahead from disk. |
-| carbon.skip.empty.line | false | The csv files givent to CarbonData for
loading can contain empty lines. Based on the business scenario, this empty
line might have to be ignored or needs to be treated as NULL value for all
columns. In order to define this business behavior, this configuration is
provided.**NOTE:** In order to consider NULL values for non string columns and
continue with data load, ***carbon.bad.records.action*** need to be set to
**FORCE**;else data load will be failed [...]
-| carbon.number.of.cores.while.loading | 2 | Number of cores to be used while
loading data. This also determines the number of threads to be used to read the
input files (csv) in parallel.**NOTE:** This configured value is used in every
data loading step to parallelize the operations. Configuring a higher value can
lead to increased early thread pre-emption by OS and there by reduce the
overall performance. |
+| carbon.loading.prefetch | false | CarbonData uses univocity parser to read
csv files. This configuration is used to inform the parser whether it can
prefetch the data from csv files to speed up the reading. **NOTE:** Enabling
prefetch improves the data loading performance, but needs higher memory to keep
more records which are read ahead from disk. |
+| carbon.skip.empty.line | false | The csv files given to CarbonData for
loading can contain empty lines. Based on the business scenario, this empty
line might have to be ignored or needs to be treated as NULL value for all
columns. In order to define this business behavior, this configuration is
provided. **NOTE:** In order to consider NULL values for non-string columns and
continue with the data load, ***carbon.bad.records.action*** needs to be set to
**FORCE**; else data load will be faile [...]
+| carbon.number.of.cores.while.loading | 2 | Number of cores to be used while
loading data. This also determines the number of threads to be used to read the
input files (csv) in parallel. **NOTE:** This configured value is used in every
data loading step to parallelize the operations. Configuring a higher value can
lead to increased early thread pre-emption by the OS and thereby reduce the
overall performance. |
| enable.unsafe.sort | true | CarbonData supports unsafe operations of Java to
avoid GC overhead for certain operations. This configuration enables to use
unsafe functions in CarbonData. **NOTE:** For operations like data loading,
which generate more short-lived Java objects, Java GC can be a bottleneck.
Using unsafe can overcome the GC overhead and improve the overall performance. |
| enable.offheap.sort | true | CarbonData supports storing data in off-heap
memory for certain operations during data loading and query. This helps to
avoid the Java GC and thereby improve the overall performance. This
configuration enables using off-heap memory for sorting of data during data
loading. **NOTE:** The ***enable.unsafe.sort*** configuration needs to be
set to true in order to use off-heap memory. |
| carbon.load.sort.scope | NO_SORT [If sort columns are not specified while
creating table] and LOCAL_SORT [If sort columns are specified] | CarbonData can
support various sorting options to match the balance between load and query
performance. LOCAL_SORT: All the data given to an executor in the single load
is fully sorted and written to carbondata files. Data loading performance is
reduced a little as the entire data needs to be sorted in the executor.
GLOBAL_SORT: Entire data in the d [...]
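As an illustration of the sort scope options above, a minimal sketch that fixes the scope at table creation time; the table name and the SORT_COLUMNS/SORT_SCOPE table properties used here are assumptions:
```
-- Hypothetical table; SORT_COLUMNS and SORT_SCOPE are assumed table properties.
CREATE TABLE IF NOT EXISTS orders (
  order_id INT,
  city STRING,
  amount DOUBLE
)
STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMNS'='city', 'SORT_SCOPE'='LOCAL_SORT');
```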
@@ -84,10 +83,10 @@ This section provides the details of all the configurations
required for the Car
| carbon.timegranularity | SECOND | The configuration is used to specify the
data granularity level such as DAY, HOUR, MINUTE, or SECOND. This helps to
store more than 68 years of data into CarbonData. |
| carbon.use.local.dir | true | CarbonData, during data loading, writes files
to local temp directories before copying the files to HDFS. This configuration
is used to specify whether CarbonData can write locally to tmp directory of the
container or to the YARN application directory. |
| carbon.sort.temp.compressor | SNAPPY | CarbonData writes every
***carbon.sort.size*** number of records to intermediate temp files during data
loading to ensure memory footprint is within limits. These temporary files can
be compressed and written in order to save the storage space. This
configuration specifies the name of compressor to be used to compress the
intermediate sort temp files during sort procedure in data loading. The valid
values are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' a [...]
-| carbon.load.skewedDataOptimization.enabled | false | During data
loading,CarbonData would divide the number of blocks equally so as to ensure
all executors process same number of blocks. This mechanism satisfies most of
the scenarios and ensures maximum parallel processing for optimal data loading
performance. In some business scenarios, there might be scenarios where the
size of blocks vary significantly and hence some executors would have to do
more work if they get blocks containing [...]
+| carbon.load.skewedDataOptimization.enabled | false | During data
loading, CarbonData divides the number of blocks equally so as to ensure
all executors process the same number of blocks. This mechanism satisfies most of
the scenarios and ensures maximum parallel processing for optimal data loading
performance. In some business scenarios, there might be scenarios where the
size of blocks vary significantly and hence some executors would have to do
more work if they get blocks containing [...]
| enable.data.loading.statistics | false | CarbonData has extensive logging
which would be useful for debugging issues related to performance or hard to
locate issues. This configuration when made ***true*** would log additional
data loading statistics information to more accurately locate the issues being
debugged. **NOTE:** Enabling this would log more debug information to log
files, thereby increasing the log file size significantly in a short span of
time. It is advised to configure [...]
| carbon.dictionary.chunk.size | 10000 | CarbonData generates dictionary keys
and writes them to separate dictionary file during data loading. To optimize
the IO, this configuration determines the number of dictionary keys to be
persisted to dictionary file at a time. **NOTE:** Writing to file also serves
as a commit point for the generated dictionary. Keeping more values in memory
causes more data loss during a system or application failure. It is advised to
alter this configuration jud [...]
-| carbon.load.directWriteToStorePath.enabled | false | During data load, all
the carbondata files are written to local disk and finally copied to the target
store location in HDFS/S3. Enabling this parameter will make carbondata files
to be written directly onto target HDFS/S3 location bypassing the local
disk.**NOTE:** Writing directly to HDFS/S3 saves local disk IO(once for writing
the files and again for copying to HDFS/S3) there by improving the performance.
But the drawback is when [...]
+| carbon.load.directWriteToStorePath.enabled | false | During data load, all
the carbondata files are written to local disk and finally copied to the target
store location in HDFS/S3. Enabling this parameter will make carbondata files
to be written directly onto target HDFS/S3 location bypassing the local disk.
**NOTE:** Writing directly to HDFS/S3 saves local disk IO (once for writing the
files and again for copying to HDFS/S3), thereby improving the performance. But
the drawback is when [...]
| carbon.options.serialization.null.format | \N | Based on the business
scenarios, some columns might need to be loaded with null values. As null value
cannot be written in csv files, some special characters might be adopted to
specify null values. This configuration can be used to specify the null values
format in the data being loaded. |
| carbon.column.compressor | snappy | CarbonData will compress the column
values using the compressor specified by this configuration. Currently
CarbonData supports 'snappy', 'zstd' and 'gzip' compressors. |
| carbon.minmax.allowed.byte.count | 200 | CarbonData will write the min max
values for string/varchar types column using the byte count specified by this
configuration. Max value is 1000 bytes(500 characters) and Min value is 10
bytes(5 characters). **NOTE:** This property is useful for reducing the store
size thereby improving the query performance but can lead to query degradation
if value is not configured properly. | |
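For illustration, a minimal sketch of choosing a column compressor; it assumes the compressor can also be supplied per table through TBLPROPERTIES, and the table name is hypothetical:
```
-- Hypothetical table; 'carbon.column.compressor' is assumed to be accepted in TBLPROPERTIES.
CREATE TABLE IF NOT EXISTS events (
  event_id BIGINT,
  payload STRING
)
STORED AS carbondata
TBLPROPERTIES ('carbon.column.compressor'='zstd');
```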
@@ -101,19 +100,19 @@ This section provides the details of all the
configurations required for the Car
| Parameter | Default Value | Description |
|-----------------------------------------------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| carbon.number.of.cores.while.compacting | 2 | Number of cores to be used
while compacting data. This also determines the number of threads to be used to
read carbondata files in parallel. |
-| carbon.compaction.level.threshold | 4, 3 | Each CarbonData load will create
one segment, if every load is small in size it will generate many small file
over a period of time impacting the query performance. This configuration is
for minor compaction which decides how many segments to be merged.
Configuration is of the form (x,y). Compaction will be triggered for every x
segments and form a single level 1 compacted segment. When the number of
compacted level 1 segments reach y, compact [...]
+| carbon.compaction.level.threshold | 4, 3 | Each CarbonData load will create
one segment; if every load is small in size, it will generate many small files
over a period of time impacting the query performance. This configuration is
for minor compaction which decides how many segments to be merged.
Configuration is of the form (x,y). Compaction will be triggered for every x
segments and form a single level 1 compacted segment. When the number of
compacted level 1 segments reach y, compact [...]
| carbon.major.compaction.size | 1024 | To improve query performance, all
the segments can be merged and compacted into a single segment up to the configured
size. This major compaction size can be configured using this parameter. Segments
whose total size is below this threshold will be merged. This value is
expressed in MB. |
-| carbon.horizontal.compaction.enable | true | CarbonData supports
DELETE/UPDATE functionality by creating delta data files for existing
carbondata files. These delta files would grow as more number of DELETE/UPDATE
operations are performed. Compaction of these delta files are termed as
horizontal compaction. This configuration is used to turn ON/OFF horizontal
compaction. After every DELETE and UPDATE statement, horizontal compaction may
occur in case the delta (DELETE/ UPDATE) files be [...]
+| carbon.horizontal.compaction.enable | true | CarbonData supports
DELETE/UPDATE functionality by creating delta data files for existing
carbondata files. These delta files grow as more DELETE/UPDATE
operations are performed. Compaction of these delta files is termed
horizontal compaction. This configuration is used to turn ON/OFF horizontal
compaction. After every DELETE and UPDATE statement, horizontal compaction may
occur in case the delta (DELETE/ UPDATE) files be [...]
| carbon.horizontal.update.compaction.threshold | 1 | This configuration
specifies the threshold limit on number of UPDATE delta files within a segment.
In case the number of delta files goes beyond the threshold, the UPDATE delta
files within the segment become eligible for horizontal compaction and are
compacted into a single UPDATE delta file. Values range from 1 to 10000. |
| carbon.horizontal.delete.compaction.threshold | 1 | This configuration
specifies the threshold limit on number of DELETE delta files within a block of
a segment. In case the number of delta files goes beyond the threshold, the
DELETE delta files for the particular block of the segment become eligible for
horizontal compaction and are compacted into a single DELETE delta file. Values
range from 1 to 10000. |
-| carbon.update.segment.parallelism | 1 | CarbonData processes the UPDATE
operations by grouping records belonging to a segment into a single executor
task. When the amount of data to be updated is more, this behavior causes
problems like restarting of executor due to low memory and data-spill related
errors. This property specifies the parallelism for each segment during
update.**NOTE:** It is recommended to set this value to a multiple of the
number of executors for balance. Values ran [...]
-| carbon.numberof.preserve.segments | 0 | If the user wants to preserve some
number of segments from being compacted then he can set this configuration.
Example: carbon.numberof.preserve.segments = 2 then 2 latest segments will
always be excluded from the compaction. No segments will be preserved by
default.**NOTE:** This configuration is useful when the chances of input data
can be wrong due to environment scenarios. Preserving some of the latest
segments from being compacted can help t [...]
-| carbon.allowed.compaction.days | 0 | This configuration is used to control
on the number of recent segments that needs to be compacted, ignoring the older
ones. This configuration is in days. For Example: If the configuration is 2,
then the segments which are loaded in the time frame of past 2 days only will
get merged. Segments which are loaded earlier than 2 days will not be merged.
This configuration is disabled by default.**NOTE:** This configuration is
useful when a bulk of histor [...]
-| carbon.enable.auto.load.merge | false | Compaction can be automatically
triggered once data load completes. This ensures that the segments are merged
in time and thus query times does not increase with increase in segments. This
configuration enables to do compaction along with data loading.**NOTE:
**Compaction will be triggered once the data load completes. But the status of
data load wait till the compaction is completed. Hence it might look like data
loading time has increased, but [...]
+| carbon.update.segment.parallelism | 1 | CarbonData processes the UPDATE
operations by grouping records belonging to a segment into a single executor
task. When the amount of data to be updated is large, this behavior causes
problems like executor restarts due to low memory and data-spill related
errors. This property specifies the parallelism for each segment during update.
**NOTE:** It is recommended to set this value to a multiple of the number of
executors for balance. Values ra [...]
+| carbon.numberof.preserve.segments | 0 | If the user wants to preserve some
number of segments from being compacted, this configuration can be set.
Example: if carbon.numberof.preserve.segments = 2, then the 2 latest segments will
always be excluded from the compaction. No segments will be preserved by
default. **NOTE:** This configuration is useful when there is a chance that the input data
is wrong due to environment scenarios. Preserving some of the latest
segments from being compacted can help [...]
+| carbon.allowed.compaction.days | 0 | This configuration is used to control
the number of recent segments that need to be compacted, ignoring the older
ones. This configuration is in days. For Example: If the configuration is 2,
then the segments which are loaded in the time frame of past 2 days only will
get merged. Segments which are loaded earlier than 2 days will not be merged.
This configuration is disabled by default. **NOTE:** This configuration is
useful when a bulk of histo [...]
+| carbon.enable.auto.load.merge | false | Compaction can be automatically
triggered once data load completes. This ensures that the segments are merged
in time and thus query times do not increase with the increase in segments. This
configuration enables compaction along with data loading. **NOTE:**
Compaction will be triggered once the data load completes. But the status of the
data load waits till the compaction is completed. Hence it might look like data
loading time has increased, but [...]
| carbon.enable.page.level.reader.in.compaction|true|Enabling page level
reader for compaction reduces the memory usage while compacting a large number of
segments. It allows reading only page by page instead of reading the whole blocklet
to memory. **NOTE:** Please refer to
[file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format)
to understand the storage format of CarbonData and concepts of pages.|
-| carbon.concurrent.compaction | true | Compaction of different tables can be
executed concurrently. This configuration determines whether to compact all
qualifying tables in parallel or not. **NOTE: **Compacting concurrently is a
resource demanding operation and needs more resources there by affecting the
query performance also. This configuration is **deprecated** and might be
removed in future releases. |
-| carbon.compaction.prefetch.enable | false | Compaction operation is similar
to Query + data load where in data from qualifying segments are queried and
data loading performed to generate a new single segment. This configuration
determines whether to query ahead data from segments and feed it for data
loading. **NOTE: **This configuration is disabled by default as it needs extra
resources for querying extra data. Based on the memory availability on the
cluster, user can enable it to imp [...]
-| carbon.merge.index.in.segment | true | Each CarbonData file has a companion
CarbonIndex file which maintains the metadata about the data. These CarbonIndex
files are read and loaded into driver and is used subsequently for pruning of
data during queries. These CarbonIndex files are very small in size(few KB) and
are many. Reading many small files from HDFS is not efficient and leads to slow
IO performance. Hence these CarbonIndex files belonging to a segment can be
combined into a sin [...]
+| carbon.concurrent.compaction | true | Compaction of different tables can be
executed concurrently. This configuration determines whether to compact all
qualifying tables in parallel or not. **NOTE:** Compacting concurrently is a
resource-demanding operation and needs more resources, thereby also affecting
query performance. This configuration is **deprecated** and might be
removed in future releases. |
+| carbon.compaction.prefetch.enable | false | Compaction operation is similar
to Query + data load, wherein data from qualifying segments is queried and
data loading is performed to generate a new single segment. This configuration
determines whether to query ahead data from segments and feed it for data
loading. **NOTE:** This configuration is disabled by default as it needs extra
resources for querying extra data. Based on the memory availability on the
cluster, user can enable it to imp [...]
+| carbon.merge.index.in.segment | true | Each CarbonData file has a companion
CarbonIndex file which maintains the metadata about the data. These CarbonIndex
files are read and loaded into the driver and are used subsequently for pruning of
data during queries. These CarbonIndex files are very small in size (a few KB) and
numerous. Reading many small files from HDFS is not efficient and leads to slow
IO performance. Hence these CarbonIndex files belonging to a segment can be
combined into a sin [...]
| carbon.enable.range.compaction | true | To configure whether range-based compaction
is to be used for RANGE_COLUMN. If true, the data will still be present in ranges
after compaction. |
| carbon.si.segment.merge | false | Making this true degrades the LOAD
performance. When the number of small files increases for SI segments (it can
happen as the number of columns will be less and we store position id and reference
columns), the user can either set this to true, which will merge the data files for
upcoming loads, or run the SI refresh command which does this job for all segments.
(REFRESH INDEX <index_table>) |
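A minimal sketch of the SI refresh command mentioned in the last row; the index and table names are hypothetical, and the ON TABLE clause is assumed:
```
-- Hypothetical index and table names; merges SI data files for existing segments.
REFRESH INDEX idx_city ON TABLE sales_events;
```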
@@ -125,7 +124,7 @@ This section provides the details of all the configurations
required for the Car
| carbon.max.executor.lru.cache.size | -1 | Maximum memory **(in MB)** upto
which the executor process can cache the data (BTree and reverse dictionary
values). Default value of -1 means there is no memory limit for caching. Only
integer values greater than 0 are accepted. **NOTE:** If this parameter is not
configured, then the value of ***carbon.max.driver.lru.cache.size*** will be
used. |
| max.query.execution.time | 60 | Maximum time allowed for one query to be
executed. The value is in minutes. |
| carbon.enableMinMax | true | CarbonData maintains the metadata which enables
to prune unnecessary files from being scanned as per the query conditions. To
achieve pruning, the min and max values of each column are maintained. Based on the filter
condition in the query, certain data can be skipped from scanning by matching
the filter value against the min,max values of the column(s) present in that
carbondata file. This pruning enhances query performance significantly. |
-| carbon.dynamical.location.scheduler.timeout | 5 | CarbonData has its own
scheduling algorithm to suggest to Spark on how many tasks needs to be launched
and how much work each task need to do in a Spark cluster for any query on
CarbonData. To determine the number of tasks that can be scheduled, knowing the
count of active executors is necessary. When dynamic allocation is enabled on a
YARN based spark cluster, executor processes are shutdown if no request is
received for a particular a [...]
+| carbon.dynamical.location.scheduler.timeout | 5 | CarbonData has its own
scheduling algorithm to suggest to Spark how many tasks need to be launched
and how much work each task needs to do in a Spark cluster for any query on
CarbonData. To determine the number of tasks that can be scheduled, knowing the
count of active executors is necessary. When dynamic allocation is enabled on a
YARN based spark cluster, executor processes are shut down if no request is
received for a particular a [...]
| carbon.scheduler.min.registered.resources.ratio | 0.8 | Specifies the
minimum resource (executor) ratio needed for starting the block distribution.
The default value is 0.8, which indicates 80% of the requested resource is
allocated for starting block distribution. The minimum value is 0.1 and the
maximum value is 1.0. |
| carbon.search.enabled (Alpha Feature) | false | If set to true, it will use
CarbonReader to do distributed scan directly instead of using compute framework
like spark, thus avoiding limitation of compute framework like SQL optimizer
and task scheduling overhead. |
| carbon.search.query.timeout | 10s | Time within which the result is expected
from the workers, beyond which the query is terminated |
@@ -137,7 +136,7 @@ This section provides the details of all the configurations
required for the Car
| carbon.enable.vector.reader | true | Spark added vector processing to
optimize CPU cache misses and thereby increase query performance. This
configuration enables fetching data as a columnar batch of size 4*1024 rows
instead of fetching data row by row and providing it to Spark, so that there is an
improvement in select query performance. |
| carbon.task.distribution | block | CarbonData has its own scheduling
algorithm to suggest to Spark how many tasks need to be launched and how
much work each task needs to do in a Spark cluster for any query on CarbonData.
Each of these task distribution suggestions has its own advantages and
disadvantages. Based on the customer use case, appropriate task distribution
can be configured. **block**: Setting this value will launch one task per block.
This setting is suggested in case of [...]
| carbon.custom.block.distribution | false | CarbonData has its own scheduling
algorithm to suggest to Spark how many tasks need to be launched and how
much work each task needs to do in a Spark cluster for any query on CarbonData.
When this configuration is true, CarbonData would distribute the available
blocks to be scanned among the available number of cores. For example: if there
are 10 blocks to be scanned and only 3 tasks can be run (only 3 executor cores
available in the cluster) [...]
-| enable.query.statistics | false | CarbonData has extensive logging which
would be useful for debugging issues related to performance or hard to locate
issues. This configuration when made ***true*** would log additional query
statistics information to more accurately locate the issues being
debugged.**NOTE:** Enabling this would log more debug information to log files,
there by increasing the log files size significantly in short span of time. It
is advised to configure the log files s [...]
+| enable.query.statistics | false | CarbonData has extensive logging which
would be useful for debugging issues related to performance or hard to locate
issues. This configuration when made ***true*** would log additional query
statistics information to more accurately locate the issues being debugged.
**NOTE:** Enabling this would log more debug information to log files, thereby
increasing the log file size significantly in a short span of time. It is
advised to configure the log files [...]
| enable.unsafe.in.query.processing | false | CarbonData supports unsafe
operations of Java to avoid GC overhead for certain operations. This
configuration enables to use unsafe functions in CarbonData while scanning the
data during query. |
| carbon.max.driver.threads.for.block.pruning | 4 | Number of threads used for
driver pruning when the carbon files are more than 100k. This
configuration can be used to set the number of threads between 1 and 4. |
| carbon.heap.memory.pooling.threshold.bytes | 1048576 | CarbonData supports
unsafe operations of Java to avoid GC overhead for certain operations. Using
unsafe, memory can be allocated on Java Heap or off heap. This configuration
controls the allocation mechanism on Java HEAP. If the heap memory allocations
of the given size are greater than or equal to this value, they should go through the
pooling mechanism. But if this size is set to -1, they should not go through the
pooling mechanism. Default [...]
@@ -205,21 +204,18 @@ RESET
| Properties | Description
|
| ----------------------------------------- |
------------------------------------------------------------ |
-| carbon.options.bad.records.logger.enable | CarbonData can identify the
records that are not conformant to schema and isolate them as bad records.
Enabling this configuration will make CarbonData to log such bad
records.**NOTE:** If the input data contains many bad records, logging them
will slow down the over all data loading throughput. The data load operation
status would depend on the configuration in ***carbon.bad.records.action***. |
-| carbon.options.bad.records.logger.enable | To enable or disable bad record
logger. |
-| carbon.options.bad.records.action | This property can have four
types of actions for bad records FORCE, REDIRECT, IGNORE and FAIL. If set to
FORCE then it auto-corrects the data by storing the bad records as NULL. If set
to REDIRECT then bad records are written to the raw CSV instead of being
loaded. If set to IGNORE then bad records are neither loaded nor written to the
raw CSV. If set to FAIL then data loading fails if any bad records are found. |
+| carbon.options.bad.records.logger.enable | To enable or disable the bad
record logger. CarbonData can identify the records that do not conform to the
schema and isolate them as bad records. Enabling this configuration will make
CarbonData log such bad records. **NOTE:** If the input data contains many
bad records, logging them will slow down the overall data loading throughput.
The data load operation status would depend on the configuration in
***carbon.bad.records.action***. | [...]
+| carbon.options.bad.records.action | This property has four types of
bad record actions: FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE then it
auto-corrects the data by storing the bad records as NULL. If set to REDIRECT
then bad records are written to the raw CSV instead of being loaded. If set to
IGNORE then bad records are neither loaded nor written to the raw CSV. If set
to FAIL then data loading fails if any bad records are found. |
| carbon.options.is.empty.data.bad.record | If false, then empty ("" or ''
or ,,) data will not be considered as bad record and vice versa. |
-| carbon.options.batch.sort.size.inmb | Size of batch data to keep in
memory, as a thumb rule it supposed to be less than 45% of
sort.inmemory.size.inmb otherwise it may spill intermediate data to disk. |
| carbon.options.bad.record.path | Specifies the HDFS path where
bad records need to be stored. |
-| carbon.custom.block.distribution | Specifies whether to use the
Spark or Carbon block distribution feature.**NOTE: **Refer to [Query
Configuration](#query-configuration)#carbon.custom.block.distribution for more
details on CarbonData scheduler. |
+| carbon.custom.block.distribution | Specifies whether to use the
Spark or Carbon block distribution feature. **NOTE:** Refer to [Query
Configuration](#query-configuration)#carbon.custom.block.distribution for more
details on CarbonData scheduler. |
| enable.unsafe.sort | Specifies whether to use unsafe
sort during data loading. Unsafe sort reduces the garbage collection during
data load operation, resulting in better performance. |
| carbon.options.date.format | Specifies the date format of
the date columns in the data being loaded. |
| carbon.options.timestamp.format | Specifies the timestamp format
of the time stamp columns in the data being loaded |
| carbon.options.sort.scope | Specifies how the current data
load should be sorted. This sort parameter is at the table level.
**NOTE:** Refer to [Data Loading
Configuration](#data-loading-configuration)#carbon.sort.scope for detailed
information. |
-| carbon.table.load.sort.scope.db_name.table_name | Overrides the SORT_SCOPE
provided in CREATE TABLE. |
+| carbon.table.load.sort.scope.<db_name>.<table_name> | Overrides the
SORT_SCOPE provided in CREATE TABLE. |
| carbon.options.global.sort.partitions | Specifies the number of
partitions to be used during global sort. |
| carbon.options.serialization.null.format | Default Null value
representation in the data being loaded. **NOTE:** Refer to [Data Loading
Configuration](#data-loading-configuration)#carbon.options.serialization.null.format
for detailed information. |
-| carbon.query.directQueryOnDataMap.enabled | Specifies whether datamap can be
queried directly. This is useful for debugging purposes.**NOTE: **Refer to
[Query Configuration](#query-configuration) for detailed information. |
**Examples:**
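For instance, dynamically configurable properties such as those listed above are typically set per session with SET and cleared with RESET; the values shown here are only a sketch:
```
SET carbon.options.bad.records.action=FORCE;
SET carbon.options.sort.scope=LOCAL_SORT;

-- Clear dynamically set properties for the session.
RESET;
```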
diff --git a/docs/quick-start-guide.md b/docs/quick-start-guide.md
index 4635cdb..e08b7b7 100644
--- a/docs/quick-start-guide.md
+++ b/docs/quick-start-guide.md
@@ -16,10 +16,10 @@
-->
# Quick Start
-This tutorial provides a quick introduction to using CarbonData. To follow
along with this guide, first download a packaged release of CarbonData from the
[CarbonData
website](https://dist.apache.org/repos/dist/release/carbondata/).Alternatively
it can be created following [Building
CarbonData](https://github.com/apache/carbondata/tree/master/build) steps.
+This tutorial provides a quick introduction to using CarbonData. To follow along
with this guide, download a packaged release of CarbonData from the [CarbonData
website](https://dist.apache.org/repos/dist/release/carbondata/).
Alternatively, it can be created following [Building
CarbonData](https://github.com/apache/carbondata/tree/master/build) steps.
## Prerequisites
-* CarbonData supports Spark versions upto 2.2.1.Please download Spark package
from [Spark website](https://spark.apache.org/downloads.html)
+* CarbonData supports Spark versions up to 2.4. Please download the Spark package
from the [Spark website](https://spark.apache.org/downloads.html)
* Create a sample.csv file using the following commands. The CSV file is
required for loading data into CarbonData
@@ -36,10 +36,10 @@ This tutorial provides a quick introduction to using
CarbonData. To follow along
## Integration
### Integration with Execution Engines
-CarbonData can be integrated with Spark,Presto and Hive execution engines. The
below documentation guides on Installing and Configuring with these execution
engines.
+CarbonData can be integrated with Spark, Presto, Flink and Hive execution
engines. The documentation below guides you through installing and configuring
CarbonData with these execution engines.
#### Spark
-[Installing and Configuring CarbonData to run locally with Spark SQL CLI
(version:
2.3+)](#installing-and-configuring-carbondata-to-run-locally-with-spark-sql)
+[Installing and Configuring CarbonData to run locally with Spark SQL
CLI](#installing-and-configuring-carbondata-to-run-locally-with-spark-sql-cli)
[Installing and Configuring CarbonData to run locally with Spark
Shell](#installing-and-configuring-carbondata-to-run-locally-with-spark-shell)
@@ -66,9 +66,9 @@ CarbonData can be integrated with Spark,Presto and Hive
execution engines. The b
#### Alluxio
[CarbonData supports read and write with Alluxio](./alluxio-guide.md)
-## Installing and Configuring CarbonData to run locally with Spark SQL CLI
(version: 2.3+)
+## Installing and Configuring CarbonData to run locally with Spark SQL CLI
-In Spark SQL CLI, it use CarbonExtensions to customize the SparkSession with
CarbonData's parser, analyzer, optimizer and physical planning strategy rules
in Spark.
+This works with Spark 2.3 and later versions. The Spark SQL CLI uses
CarbonExtensions to customize the SparkSession with CarbonData's parser,
analyzer, optimizer and physical planning strategy rules in Spark.
To enable CarbonExtensions, we need to add the following configuration.
|Key|Value|
@@ -95,7 +95,9 @@ STORED AS carbondata;
###### Loading Data to a Table
```
-LOAD DATA INPATH '/path/to/sample.csv' INTO TABLE test_table;
+LOAD DATA INPATH '/local-path/sample.csv' INTO TABLE test_table;
+
+LOAD DATA INPATH 'hdfs://hdfs-path/sample.csv' INTO TABLE test_table;
```
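A quick sanity check after the load (a sketch; it assumes the test_table created above):
```
SELECT COUNT(*) FROM test_table;
```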
```
@@ -119,7 +121,7 @@ GROUP BY city;
## Installing and Configuring CarbonData to run locally with Spark Shell
-Apache Spark Shell provides a simple way to learn the API, as well as a
powerful tool to analyze data interactively. Please visit [Apache Spark
Documentation](http://spark.apache.org/docs/latest/) for more details on Spark
shell.
+Apache Spark Shell provides a simple way to learn the API, as well as a
powerful tool to analyze data interactively. Please visit [Apache Spark
Documentation](http://spark.apache.org/docs/latest/) for more details on the
Spark shell.
#### Basics
@@ -129,7 +131,7 @@ Start Spark shell by running the following command in the
Spark directory:
```
./bin/spark-shell --jars <carbondata assembly jar path>
```
-**NOTE**: Path where packaged release of CarbonData was downloaded or assembly
jar will be available after [building
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
and can be copied from `./assembly/target/scala-2.1x/carbondata_xxx.jar`
+**NOTE**: Use the path where the packaged release of CarbonData was downloaded, or the assembly
jar, which will be available after [building
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
and can be copied from `./assembly/target/scala-2.1x/apache-carbondata_xxx.jar`
In this shell, SparkSession is readily available as `spark` and Spark context
is readily available as `sc`.
@@ -234,9 +236,9 @@ carbon.sql(
### Procedure
-1. [Build the
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from
`./assembly/target/scala-2.1x/carbondata_xxx.jar`.
+1. [Build the
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from
`./assembly/target/scala-2.1x/apache-carbondata_xxx.jar`.
-2. Copy `./assembly/target/scala-2.1x/carbondata_xxx.jar` to
`$SPARK_HOME/carbonlib` folder.
+2. Copy `./assembly/target/scala-2.1x/apache-carbondata_xxx.jar` to
`$SPARK_HOME/carbonlib` folder.
**NOTE**: Create the carbonlib folder if it does not exist inside
`$SPARK_HOME` path.
@@ -253,13 +255,7 @@ carbon.sql(
| spark.driver.extraJavaOptions | `-Dcarbon.properties.filepath =
$SPARK_HOME/conf/carbon.properties` | A string of extra JVM options to pass to
the driver. For instance, GC settings or other logging. |
| spark.executor.extraJavaOptions | `-Dcarbon.properties.filepath =
$SPARK_HOME/conf/carbon.properties` | A string of extra JVM options to pass to
executors. For instance, GC settings or other logging. **NOTE**: You can enter
multiple values separated by space. |
-7. Add the following properties in `$SPARK_HOME/conf/carbon.properties` file:
-
-| Property | Required | Description
| Example | Remark
|
-| -------------------- | -------- |
------------------------------------------------------------ |
------------------------------------ | ----------------------------- |
-| carbon.storelocation | NO | Location where data CarbonData will create
the store and write the data in its own format. If not specified then it takes
spark.sql.warehouse.dir path. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose
to set HDFS directory |
-
-8. Verify the installation. For example:
+7. Verify the installation. For example:
```
./bin/spark-shell \
@@ -268,7 +264,9 @@ carbon.sql(
--executor-memory 2G
```
-**NOTE**: Make sure you have permissions for CarbonData JARs and files through
which driver and executor will start.
+**NOTE**:
+ - property "carbon.storelocation" is deprecated in carbondata 2.0 version.
Only the users who used this property in previous versions can still use it in
carbon 2.0 version.
+ - Make sure you have permissions for CarbonData JARs and files through which
driver and executor will start.
## Installing and Configuring CarbonData on Spark on YARN Cluster
@@ -284,7 +282,7 @@ carbon.sql(
The following steps are only for Driver Nodes. (Driver nodes are the ones
which start the spark context.)
-1. [Build the
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from
`./assembly/target/scala-2.1x/carbondata_xxx.jar` and copy to
`$SPARK_HOME/carbonlib` folder.
+1. [Build the
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from
`./assembly/target/scala-2.1x/apache-carbondata_xxx.jar` and copy to
`$SPARK_HOME/carbonlib` folder.
**NOTE**: Create the carbonlib folder if it does not exist inside
`$SPARK_HOME` path.
@@ -310,13 +308,7 @@ mv carbondata.tar.gz carbonlib/
| spark.driver.extraClassPath | Extra classpath entries to prepend to the
classpath of the driver. **NOTE**: If SPARK_CLASSPATH is defined in
spark-env.sh, then comment it and append the value in below parameter
spark.driver.extraClassPath. | `$SPARK_HOME/carbonlib/*`
|
| spark.driver.extraJavaOptions | A string of extra JVM options to pass to
the driver. For instance, GC settings or other logging. |
`-Dcarbon.properties.filepath = $SPARK_HOME/conf/carbon.properties` |
-5. Add the following properties in `$SPARK_HOME/conf/carbon.properties`:
-
-| Property | Required | Description
| Example | Default Value
|
-| -------------------- | -------- |
------------------------------------------------------------ |
------------------------------------ | ----------------------------- |
-| carbon.storelocation | NO | Location where CarbonData will create the
store and write the data in its own format. If not specified then it takes
spark.sql.warehouse.dir path. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose
to set HDFS directory |
-
-6. Verify the installation.
+5. Verify the installation.
```
./bin/spark-shell \
@@ -327,6 +319,7 @@ mv carbondata.tar.gz carbonlib/
```
**NOTE**:
+ - property "carbon.storelocation" is deprecated in carbondata 2.0 version.
Only the users who used this property in previous versions can still use it in
carbon 2.0 version.
- Make sure you have permissions for CarbonData JARs and files through which
driver and executor will start.
- If using Spark + Hive 1.1.X, the carbondata assembly jar and
carbondata-hive jar need to be added to the parameter 'spark.sql.hive.metastore.jars' in the
spark-default.conf file.
@@ -343,13 +336,27 @@ b. Run the following command to start the CarbonData
thrift server.
```
./bin/spark-submit \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
+$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR
+```
+
+| Parameter | Description
| Example |
+| ------------------- |
------------------------------------------------------------ |
---------------------------------------------------------- |
+| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the
`$SPARK_HOME/carbonlib/` folder. | apache-carbondata-xx.jar |
+
+c. Run the following command to work with S3 storage.
+
+```
+./bin/spark-submit \
+--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
+$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <access_key> <secret_key> <endpoint>
```
| Parameter | Description
| Example |
| ------------------- |
------------------------------------------------------------ |
---------------------------------------------------------- |
-| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the
`$SPARK_HOME/carbonlib/` folder. |
carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar |
-| carbon_store_path | This is a parameter to the CarbonThriftServer class.
This a HDFS path where CarbonData files will be kept. Strongly Recommended to
put same as carbon.storelocation parameter of carbon.properties. If not
specified then it takes spark.sql.warehouse.dir path. |
`hdfs://<host_name>:port/user/hive/warehouse/carbon.store` |
+| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the
`$SPARK_HOME/carbonlib/` folder. | apache-carbondata-xx.jar |
+| access_key | Access key for S3 storage | |
+| secret_key | Secret key for S3 storage | |
+| endpoint | Endpoint for connecting to S3 storage | |
**NOTE**: From Spark 1.6, by default the Thrift server runs in multi-session
mode, which means each JDBC/ODBC connection owns a copy of its own SQL
configuration and temporary function registry. Cached tables are still shared
though. If you prefer to run the Thrift server in single-session mode and share
all SQL configuration and temporary function registry, please set option
`spark.sql.hive.thriftServer.singleSession` to `true`. You may either add this
option to `spark-defaults.conf`, [...]
@@ -357,7 +364,7 @@ $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR
<carbon_store_path>
./bin/spark-submit \
--conf spark.sql.hive.thriftServer.singleSession=true \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
+$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR
```
**But** in single-session mode, if one user changes the database from one
connection, the database of the other connections will be changed too.
@@ -369,8 +376,7 @@ $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR
<carbon_store_path>
```
./bin/spark-submit \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
-hdfs://<host_name>:port/user/hive/warehouse/carbon.store
+$SPARK_HOME/carbonlib/apache-carbondata-xxx.jar
```
- Start with Fixed executors and resources.
@@ -382,8 +388,7 @@ hdfs://<host_name>:port/user/hive/warehouse/carbon.store
--driver-memory 20G \
--executor-memory 250G \
--executor-cores 32 \
-$SPARK_HOME/carbonlib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
-hdfs://<host_name>:port/user/hive/warehouse/carbon.store
+$SPARK_HOME/carbonlib/apache-carbondata-xxx.jar
```
### Connecting to CarbonData Thrift Server Using Beeline.
@@ -401,9 +406,9 @@ Example
## Installing and Configuring CarbonData on Presto
-**NOTE:** **CarbonData tables cannot be created nor loaded from Presto. User
need to create CarbonData Table and load data into it
+**NOTE:** **CarbonData tables cannot be created nor loaded from Presto. User
needs to create CarbonData Table and load data into it
either with
[Spark](#installing-and-configuring-carbondata-to-run-locally-with-spark-shell)
or [SDK](./sdk-guide.md) or [C++ SDK](./csdk-guide.md).
-Once the table is created,it can be queried from Presto.**
+Once the table is created, it can be queried from Presto.**
Please refer to the Presto guide linked below.
@@ -411,7 +416,7 @@ prestodb guide - [prestodb](./prestodb-guide.md)
prestosql guide - [prestosql](./prestosql-guide.md)
-Once installed the presto with carbonData as per above guide,
+Once Presto is installed with CarbonData as per the above guide,
you can use the Presto CLI on the coordinator to query data sources in the
catalog using the Presto workers.
List the schemas(databases) available
@@ -438,6 +443,4 @@ Query from the available tables
select * from carbon_table;
```
-**Note :** Create Tables and data loads should be done before executing
queries as we can not create carbon table from this interface.
-
-```
+**Note:** Table creation and data loading should be done before executing queries,
as we cannot create a carbon table from this interface.