This is an automated email from the ASF dual-hosted git repository.
ajantha pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new 4fec616 [CARBONDATA-3791] Fix spelling, validating links for
quick-start guide and configuration parameters
4fec616 is described below
commit 4fec616dadcc52e867f17ccadfec88e9d0b656ce
Author: Mahesh Raju Somalaraju <[email protected]>
AuthorDate: Mon May 4 00:00:07 2020 +0530
[CARBONDATA-3791] Fix spelling, validating links for quick-start guide and
configuration parameters
Why is this PR needed?
Fix spelling, validating links for quick-start guide and configuration
parameters
What changes were proposed in this PR?
.md file changes
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3740
---
docs/configuration-parameters.md | 46 ++++++++++------------
docs/quick-start-guide.md | 85 +++++++++++++++++++++-------------------
2 files changed, 65 insertions(+), 66 deletions(-)
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index dc105a8..9428e00 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -31,10 +31,9 @@ This section provides the details of all the configurations
required for the Car
| Property | Default Value | Description |
|----------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
-| carbon.storelocation | spark.sql.warehouse.dir property value | Location
where CarbonData will create the store, and write the data in its custom
format. If not specified,the path defaults to spark.sql.warehouse.dir property.
**NOTE:** Store location should be in one of the carbon supported filesystems.
Like HDFS or S3. It is not recommended to use this property. |
| carbon.ddl.base.hdfs.url | (none) | To simplify and shorten the path to be
specified in DDL/DML commands, this property is supported. This property is
used to configure the HDFS relative path, the path configured in
carbon.ddl.base.hdfs.url will be appended to the HDFS path configured in
fs.defaultFS of core-site.xml. If this path is configured, then the user need not
pass the complete path during data load. For example: If the absolute path of the csv
file is hdfs://10.18.101.155:54310/data/cnb [...]
| carbon.badRecords.location | (none) | CarbonData can detect the records not
conforming to defined table schema and isolate them as bad records. This
property is used to specify where to store such bad records. |
-| carbon.streaming.auto.handoff.enabled | true | CarbonData supports storing
of streaming data. To have high throughput for streaming, the data is written
in Row format which is highly optimized for write, but performs poorly for
query. When this property is true and when the streaming data size reaches
***carbon.streaming.segment.max.size***, CabonData will automatically convert
the data to columnar format and optimize it for faster querying.**NOTE:** It is
not recommended to keep the d [...]
+| carbon.streaming.auto.handoff.enabled | true | CarbonData supports storing
of streaming data. To have high throughput for streaming, the data is written
in Row format which is highly optimized for write, but performs poorly for
query. When this property is true and when the streaming data size reaches
***carbon.streaming.segment.max.size***, CarbonData will automatically convert
the data to columnar format and optimize it for faster querying. **NOTE:** It
is not recommended to keep the [...]
| carbon.streaming.segment.max.size | 1024000000 | CarbonData writes streaming
data in row format which is optimized for high write throughput. This property
defines the maximum size of data to be held in row format, beyond which it will
be converted to columnar format in order to support high performance query,
provided ***carbon.streaming.auto.handoff.enabled*** is true. **NOTE:** Setting
higher value will impact the streaming ingestion. The value has to be
configured in bytes. |
| carbon.segment.lock.files.preserve.hours | 48 | In order to support parallel
data loading onto the same table, CarbonData sequences (locks) at the
granularity of segments. Operations affecting the segment (like IUD, alter) are
blocked from parallel operations. This property value indicates the number of
hours the segment lock files will be preserved after dataload. These lock files
will be deleted with the clean command after the configured number of hours. |
| carbon.timestamp.format | yyyy-MM-dd HH:mm:ss | CarbonData can understand
data of timestamp type and process it in a special manner. The
format of Timestamp data may be different from that understood by CarbonData by
default. This configuration allows users to specify the format of Timestamp in
their data. |
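For illustration, a minimal sketch of how a custom timestamp format typically comes into play at load time; the table, the CSV path, and the use of a load-level TIMESTAMPFORMAT option are assumptions, not taken from the table above:
```
-- Hypothetical table and path; TIMESTAMPFORMAT is assumed to be accepted as a
-- load option mirroring carbon.timestamp.format.
CREATE TABLE IF NOT EXISTS sales_events (id INT, sold_at TIMESTAMP) STORED AS carbondata;

LOAD DATA INPATH 'hdfs://hdfs-path/sales_events.csv' INTO TABLE sales_events
OPTIONS('TIMESTAMPFORMAT'='yyyy/MM/dd HH:mm:ss');
```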
@@ -58,9 +57,9 @@ This section provides the details of all the configurations
required for the Car
| carbon.concurrent.lock.retries | 100 | CarbonData supports concurrent data
loading onto same table. To ensure the loading status is correctly updated into
the system, locks are used to sequence the status update step. This
configuration specifies the maximum number of retries to obtain the lock for
updating the load status. **NOTE:** This value is high because the more
concurrent loads that happen, the greater the chance of failing to obtain the lock when
tried. Adjust this value according t [...]
| carbon.concurrent.lock.retry.timeout.sec | 1 | Specifies the interval
between the retries to obtain the lock for concurrent operations. **NOTE:**
Refer to ***carbon.concurrent.lock.retries*** for understanding why CarbonData
uses locks during data loading operations. |
| carbon.csv.read.buffersize.byte | 1048576 | CarbonData uses Hadoop
InputFormat to read the csv files. This configuration value is used to pass
buffer size as input for the Hadoop MR job when reading the csv files. This
value is configured in bytes. **NOTE:** Refer to
***org.apache.hadoop.mapreduce.InputFormat*** documentation for additional
information. |
-| carbon.loading.prefetch | false | CarbonData uses univocity parser to read
csv files. This configuration is used to inform the parser whether it can
prefetch the data from csv files to speed up the reading.**NOTE:** Enabling
prefetch improves the data loading performance, but needs higher memory to keep
more records which are read ahead from disk. |
-| carbon.skip.empty.line | false | The csv files givent to CarbonData for
loading can contain empty lines. Based on the business scenario, this empty
line might have to be ignored or needs to be treated as NULL value for all
columns. In order to define this business behavior, this configuration is
provided.**NOTE:** In order to consider NULL values for non string columns and
continue with data load, ***carbon.bad.records.action*** need to be set to
**FORCE**;else data load will be failed [...]
-| carbon.number.of.cores.while.loading | 2 | Number of cores to be used while
loading data. This also determines the number of threads to be used to read the
input files (csv) in parallel.**NOTE:** This configured value is used in every
data loading step to parallelize the operations. Configuring a higher value can
lead to increased early thread pre-emption by OS and there by reduce the
overall performance. |
+| carbon.loading.prefetch | false | CarbonData uses univocity parser to read
csv files. This configuration is used to inform the parser whether it can
prefetch the data from csv files to speed up the reading. **NOTE:** Enabling
prefetch improves the data loading performance, but needs higher memory to keep
more records which are read ahead from disk. |
+| carbon.skip.empty.line | false | The csv files given to CarbonData for
loading can contain empty lines. Based on the business scenario, this empty
line might have to be ignored or needs to be treated as NULL value for all
columns. In order to define this business behavior, this configuration is
provided. **NOTE:** In order to consider NULL values for non-string columns and
continue with the data load, ***carbon.bad.records.action*** needs to be set to
**FORCE**; else data load will be faile [...]
+| carbon.number.of.cores.while.loading | 2 | Number of cores to be used while
loading data. This also determines the number of threads to be used to read the
input files (csv) in parallel. **NOTE:** This configured value is used in every
data loading step to parallelize the operations. Configuring a higher value can
lead to increased early thread pre-emption by the OS and thereby reduce the
overall performance. |
| enable.unsafe.sort | true | CarbonData supports unsafe operations of Java to
avoid GC overhead for certain operations. This configuration enables to use
unsafe functions in CarbonData. **NOTE:** For operations like data loading,
which generate more short-lived Java objects, Java GC can be a bottleneck.
Using unsafe can overcome the GC overhead and improve the overall performance. |
| enable.offheap.sort | true | CarbonData supports storing data in off-heap
memory for certain operations during data loading and query. This helps to
avoid the Java GC and thereby improve the overall performance. This
configuration enables using off-heap memory for sorting of data during data
loading. **NOTE:** The ***enable.unsafe.sort*** configuration needs to be
set to true in order to use off-heap memory. |
| carbon.load.sort.scope | NO_SORT [If sort columns are not specified while
creating table] and LOCAL_SORT [If sort columns are specified] | CarbonData can
support various sorting options to match the balance between load and query
performance. LOCAL_SORT: All the data given to an executor in the single load
is fully sorted and written to carbondata files. Data loading performance is
reduced a little as the entire data needs to be sorted in the executor.
GLOBAL_SORT: Entire data in the d [...]
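As an illustration of the sort scope options above, a minimal sketch that fixes the scope at table creation time; the table name and the SORT_COLUMNS/SORT_SCOPE table properties used here are assumptions:
```
-- Hypothetical table; SORT_COLUMNS and SORT_SCOPE are assumed table properties.
CREATE TABLE IF NOT EXISTS orders (
  order_id INT,
  city STRING,
  amount DOUBLE
)
STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMNS'='city', 'SORT_SCOPE'='LOCAL_SORT');
```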
@@ -84,10 +83,10 @@ This section provides the details of all the configurations
required for the Car
| carbon.timegranularity | SECOND | The configuration is used to specify the
data granularity level such as DAY, HOUR, MINUTE, or SECOND. This helps to
store more than 68 years of data into CarbonData. |
| carbon.use.local.dir | true | CarbonData, during data loading, writes files
to local temp directories before copying the files to HDFS. This configuration
is used to specify whether CarbonData can write locally to tmp directory of the
container or to the YARN application directory. |
| carbon.sort.temp.compressor | SNAPPY | CarbonData writes every
***carbon.sort.size*** number of records to intermediate temp files during data
loading to ensure memory footprint is within limits. These temporary files can
be compressed and written in order to save the storage space. This
configuration specifies the name of compressor to be used to compress the
intermediate sort temp files during sort procedure in data loading. The valid
values are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' a [...]
-| carbon.load.skewedDataOptimization.enabled | false | During data
loading,CarbonData would divide the number of blocks equally so as to ensure
all executors process same number of blocks. This mechanism satisfies most of
the scenarios and ensures maximum parallel processing for optimal data loading
performance. In some business scenarios, there might be scenarios where the
size of blocks vary significantly and hence some executors would have to do
more work if they get blocks containing [...]
+| carbon.load.skewedDataOptimization.enabled | false | During data
loading, CarbonData divides the number of blocks equally so as to ensure
all executors process the same number of blocks. This mechanism satisfies most of
the scenarios and ensures maximum parallel processing for optimal data loading
performance. In some business scenarios, there might be scenarios where the
size of blocks vary significantly and hence some executors would have to do
more work if they get blocks containing [...]
| enable.data.loading.statistics | false | CarbonData has extensive logging
which would be useful for debugging issues related to performance or hard to
locate issues. This configuration when made ***true*** would log additional
data loading statistics information to more accurately locate the issues being
debugged. **NOTE:** Enabling this would log more debug information to log
files, thereby increasing the log file size significantly in a short span of
time. It is advised to configure [...]
| carbon.dictionary.chunk.size | 10000 | CarbonData generates dictionary keys
and writes them to separate dictionary file during data loading. To optimize
the IO, this configuration determines the number of dictionary keys to be
persisted to dictionary file at a time. **NOTE:** Writing to file also serves
as a commit point for the generated dictionary. Keeping more values in memory
causes more data loss during a system or application failure. It is advised to
alter this configuration jud [...]
-| carbon.load.directWriteToStorePath.enabled | false | During data load, all
the carbondata files are written to local disk and finally copied to the target
store location in HDFS/S3. Enabling this parameter will make carbondata files
to be written directly onto target HDFS/S3 location bypassing the local
disk.**NOTE:** Writing directly to HDFS/S3 saves local disk IO(once for writing
the files and again for copying to HDFS/S3) there by improving the performance.
But the drawback is when [...]
+| carbon.load.directWriteToStorePath.enabled | false | During data load, all
the carbondata files are written to local disk and finally copied to the target
store location in HDFS/S3. Enabling this parameter will make carbondata files
to be written directly onto target HDFS/S3 location bypassing the local disk.
**NOTE:** Writing directly to HDFS/S3 saves local disk IO (once for writing the
files and again for copying to HDFS/S3), thereby improving the performance. But
the drawback is when [...]
| carbon.options.serialization.null.format | \N | Based on the business
scenarios, some columns might need to be loaded with null values. As null value
cannot be written in csv files, some special characters might be adopted to
specify null values. This configuration can be used to specify the null values
format in the data being loaded. |
| carbon.column.compressor | snappy | CarbonData will compress the column
values using the compressor specified by this configuration. Currently
CarbonData supports 'snappy', 'zstd' and 'gzip' compressors. |
| carbon.minmax.allowed.byte.count | 200 | CarbonData will write the min max
values for string/varchar types column using the byte count specified by this
configuration. Max value is 1000 bytes(500 characters) and Min value is 10
bytes(5 characters). **NOTE:** This property is useful for reducing the store
size thereby improving the query performance but can lead to query degradation
if value is not configured properly. | |
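For illustration, a minimal sketch of choosing a column compressor; it assumes the compressor can also be supplied per table through TBLPROPERTIES, and the table name is hypothetical:
```
-- Hypothetical table; 'carbon.column.compressor' is assumed to be accepted in TBLPROPERTIES.
CREATE TABLE IF NOT EXISTS events (
  event_id BIGINT,
  payload STRING
)
STORED AS carbondata
TBLPROPERTIES ('carbon.column.compressor'='zstd');
```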
@@ -101,19 +100,19 @@ This section provides the details of all the
configurations required for the Car
| Parameter | Default Value | Description |
|-----------------------------------------------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| carbon.number.of.cores.while.compacting | 2 | Number of cores to be used
while compacting data. This also determines the number of threads to be used to
read carbondata files in parallel. |
-| carbon.compaction.level.threshold | 4, 3 | Each CarbonData load will create
one segment, if every load is small in size it will generate many small file
over a period of time impacting the query performance. This configuration is
for minor compaction which decides how many segments to be merged.
Configuration is of the form (x,y). Compaction will be triggered for every x
segments and form a single level 1 compacted segment. When the number of
compacted level 1 segments reach y, compact [...]
+| carbon.compaction.level.threshold | 4, 3 | Each CarbonData load will create
one segment; if every load is small in size, it will generate many small files
over a period of time impacting the query performance. This configuration is
for minor compaction which decides how many segments to be merged.
Configuration is of the form (x,y). Compaction will be triggered for every x
segments and form a single level 1 compacted segment. When the number of
compacted level 1 segments reach y, compact [...]
| carbon.major.compaction.size | 1024 | To improve query performance, all
the segments can be merged and compacted into a single segment up to the configured
size. This major compaction size can be configured using this parameter. Segments
whose total size is below this threshold will be merged. This value is
expressed in MB. |
-| carbon.horizontal.compaction.enable | true | CarbonData supports
DELETE/UPDATE functionality by creating delta data files for existing
carbondata files. These delta files would grow as more number of DELETE/UPDATE
operations are performed. Compaction of these delta files are termed as
horizontal compaction. This configuration is used to turn ON/OFF horizontal
compaction. After every DELETE and UPDATE statement, horizontal compaction may
occur in case the delta (DELETE/ UPDATE) files be [...]
+| carbon.horizontal.compaction.enable | true | CarbonData supports
DELETE/UPDATE functionality by creating delta data files for existing
carbondata files. These delta files grow as more DELETE/UPDATE
operations are performed. Compaction of these delta files is termed
horizontal compaction. This configuration is used to turn ON/OFF horizontal
compaction. After every DELETE and UPDATE statement, horizontal compaction may
occur in case the delta (DELETE/ UPDATE) files be [...]
| carbon.horizontal.update.compaction.threshold | 1 | This configuration
specifies the threshold limit on number of UPDATE delta files within a segment.
In case the number of delta files goes beyond the threshold, the UPDATE delta
files within the segment become eligible for horizontal compaction and are
compacted into a single UPDATE delta file. Values range from 1 to 10000. |
| carbon.horizontal.delete.compaction.threshold | 1 | This configuration
specifies the threshold limit on number of DELETE delta files within a block of
a segment. In case the number of delta files goes beyond the threshold, the
DELETE delta files for the particular block of the segment become eligible for
horizontal compaction and are compacted into a single DELETE delta file. Values
range from 1 to 10000. |
-| carbon.update.segment.parallelism | 1 | CarbonData processes the UPDATE
operations by grouping records belonging to a segment into a single executor
task. When the amount of data to be updated is more, this behavior causes
problems like restarting of executor due to low memory and data-spill related
errors. This property specifies the parallelism for each segment during
update.**NOTE:** It is recommended to set this value to a multiple of the
number of executors for balance. Values ran [...]
-| carbon.numberof.preserve.segments | 0 | If the user wants to preserve some
number of segments from being compacted then he can set this configuration.
Example: carbon.numberof.preserve.segments = 2 then 2 latest segments will
always be excluded from the compaction. No segments will be preserved by
default.**NOTE:** This configuration is useful when the chances of input data
can be wrong due to environment scenarios. Preserving some of the latest
segments from being compacted can help t [...]
-| carbon.allowed.compaction.days | 0 | This configuration is used to control
on the number of recent segments that needs to be compacted, ignoring the older
ones. This configuration is in days. For Example: If the configuration is 2,
then the segments which are loaded in the time frame of past 2 days only will
get merged. Segments which are loaded earlier than 2 days will not be merged.
This configuration is disabled by default.**NOTE:** This configuration is
useful when a bulk of histor [...]
-| carbon.enable.auto.load.merge | false | Compaction can be automatically
triggered once data load completes. This ensures that the segments are merged
in time and thus query times does not increase with increase in segments. This
configuration enables to do compaction along with data loading.**NOTE:
**Compaction will be triggered once the data load completes. But the status of
data load wait till the compaction is completed. Hence it might look like data
loading time has increased, but [...]
+| carbon.update.segment.parallelism | 1 | CarbonData processes the UPDATE
operations by grouping records belonging to a segment into a single executor
task. When the amount of data to be updated is large, this behavior causes
problems like executor restarts due to low memory and data-spill related
errors. This property specifies the parallelism for each segment during update.
**NOTE:** It is recommended to set this value to a multiple of the number of
executors for balance. Values ra [...]
+| carbon.numberof.preserve.segments | 0 | If the user wants to preserve some
number of segments from being compacted, this configuration can be set.
Example: if carbon.numberof.preserve.segments = 2, then the 2 latest segments will
always be excluded from the compaction. No segments will be preserved by
default. **NOTE:** This configuration is useful when there is a chance that the input data
is wrong due to environment scenarios. Preserving some of the latest
segments from being compacted can help [...]
+| carbon.allowed.compaction.days | 0 | This configuration is used to control
the number of recent segments that need to be compacted, ignoring the older
ones. This configuration is in days. For Example: If the configuration is 2,
then the segments which are loaded in the time frame of past 2 days only will
get merged. Segments which are loaded earlier than 2 days will not be merged.
This configuration is disabled by default. **NOTE:** This configuration is
useful when a bulk of histo [...]
+| carbon.enable.auto.load.merge | false | Compaction can be automatically
triggered once data load completes. This ensures that the segments are merged
in time and thus query times do not increase with the increase in segments. This
configuration enables compaction along with data loading. **NOTE:**
Compaction will be triggered once the data load completes. But the status of the
data load waits till the compaction is completed. Hence it might look like data
loading time has increased, but [...]
| carbon.enable.page.level.reader.in.compaction|true|Enabling page level
reader for compaction reduces the memory usage while compacting a large number of
segments. It allows reading only page by page instead of reading the whole blocklet
to memory. **NOTE:** Please refer to
[file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format)
to understand the storage format of CarbonData and concepts of pages.|
-| carbon.concurrent.compaction | true | Compaction of different tables can be
executed concurrently. This configuration determines whether to compact all
qualifying tables in parallel or not. **NOTE: **Compacting concurrently is a
resource demanding operation and needs more resources there by affecting the
query performance also. This configuration is **deprecated** and might be
removed in future releases. |
-| carbon.compaction.prefetch.enable | false | Compaction operation is similar
to Query + data load where in data from qualifying segments are queried and
data loading performed to generate a new single segment. This configuration
determines whether to query ahead data from segments and feed it for data
loading. **NOTE: **This configuration is disabled by default as it needs extra
resources for querying extra data. Based on the memory availability on the
cluster, user can enable it to imp [...]
-| carbon.merge.index.in.segment | true | Each CarbonData file has a companion
CarbonIndex file which maintains the metadata about the data. These CarbonIndex
files are read and loaded into driver and is used subsequently for pruning of
data during queries. These CarbonIndex files are very small in size(few KB) and
are many. Reading many small files from HDFS is not efficient and leads to slow
IO performance. Hence these CarbonIndex files belonging to a segment can be
combined into a sin [...]
+| carbon.concurrent.compaction | true | Compaction of different tables can be
executed concurrently. This configuration determines whether to compact all
qualifying tables in parallel or not. **NOTE:** Compacting concurrently is a
resource-demanding operation and needs more resources, thereby also affecting
query performance. This configuration is **deprecated** and might be
removed in future releases. |
+| carbon.compaction.prefetch.enable | false | Compaction operation is similar
to Query + data load, wherein data from qualifying segments is queried and
data loading is performed to generate a new single segment. This configuration
determines whether to query ahead data from segments and feed it for data
loading. **NOTE:** This configuration is disabled by default as it needs extra
resources for querying extra data. Based on the memory availability on the
cluster, user can enable it to imp [...]
+| carbon.merge.index.in.segment | true | Each CarbonData file has a companion
CarbonIndex file which maintains the metadata about the data. These CarbonIndex
files are read and loaded into the driver and are used subsequently for pruning of
data during queries. These CarbonIndex files are very small in size (a few KB) and
numerous. Reading many small files from HDFS is not efficient and leads to slow
IO performance. Hence these CarbonIndex files belonging to a segment can be
combined into a sin [...]
| carbon.enable.range.compaction | true | To configure whether range-based compaction
is to be used for RANGE_COLUMN. If true, the data will still be present in ranges
after compaction. |
| carbon.si.segment.merge | false | Making this true degrades the LOAD
performance. When the number of small files increases for SI segments (it can
happen as the number of columns will be less and we store position id and reference
columns), the user can either set this to true, which will merge the data files for
upcoming loads, or run the SI refresh command which does this job for all segments.
(REFRESH INDEX <index_table>) |
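A minimal sketch of the SI refresh command mentioned in the last row; the index and table names are hypothetical, and the ON TABLE clause is assumed:
```
-- Hypothetical index and table names; merges SI data files for existing segments.
REFRESH INDEX idx_city ON TABLE sales_events;
```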
@@ -125,7 +124,7 @@ This section provides the details of all the configurations
required for the Car
| carbon.max.executor.lru.cache.size | -1 | Maximum memory **(in MB)** upto
which the executor process can cache the data (BTree and reverse dictionary
values). Default value of -1 means there is no memory limit for caching. Only
integer values greater than 0 are accepted. **NOTE:** If this parameter is not
configured, then the value of ***carbon.max.driver.lru.cache.size*** will be
used. |
| max.query.execution.time | 60 | Maximum time allowed for one query to be
executed. The value is in minutes. |
| carbon.enableMinMax | true | CarbonData maintains the metadata which enables
to prune unnecessary files from being scanned as per the query conditions. To
achieve pruning, the min and max values of each column are maintained. Based on the filter
condition in the query, certain data can be skipped from scanning by matching
the filter value against the min,max values of the column(s) present in that
carbondata file. This pruning enhances query performance significantly. |
-| carbon.dynamical.location.scheduler.timeout | 5 | CarbonData has its own
scheduling algorithm to suggest to Spark on how many tasks needs to be launched
and how much work each task need to do in a Spark cluster for any query on
CarbonData. To determine the number of tasks that can be scheduled, knowing the
count of active executors is necessary. When dynamic allocation is enabled on a
YARN based spark cluster, executor processes are shutdown if no request is
received for a particular a [...]
+| carbon.dynamical.location.scheduler.timeout | 5 | CarbonData has its own
scheduling algorithm to suggest to Spark how many tasks need to be launched
and how much work each task needs to do in a Spark cluster for any query on
CarbonData. To determine the number of tasks that can be scheduled, knowing the
count of active executors is necessary. When dynamic allocation is enabled on a
YARN based spark cluster, executor processes are shut down if no request is
received for a particular a [...]
| carbon.scheduler.min.registered.resources.ratio | 0.8 | Specifies the
minimum resource (executor) ratio needed for starting the block distribution.
The default value is 0.8, which indicates 80% of the requested resource is
allocated for starting block distribution. The minimum value is 0.1 and the
maximum value is 1.0. |
| carbon.search.enabled (Alpha Feature) | false | If set to true, it will use
CarbonReader to do distributed scan directly instead of using compute framework
like spark, thus avoiding limitation of compute framework like SQL optimizer
and task scheduling overhead. |
| carbon.search.query.timeout | 10s | Time within which the result is expected
from the workers, beyond which the query is terminated |
@@ -137,7 +136,7 @@ This section provides the details of all the configurations
required for the Car
| carbon.enable.vector.reader | true | Spark added vector processing to
optimize CPU cache misses and thereby increase query performance. This
configuration enables fetching data as a columnar batch of size 4*1024 rows
instead of fetching data row by row and providing it to Spark, so that there is an
improvement in select query performance. |
| carbon.task.distribution | block | CarbonData has its own scheduling
algorithm to suggest to Spark how many tasks need to be launched and how
much work each task needs to do in a Spark cluster for any query on CarbonData.
Each of these task distribution suggestions has its own advantages and
disadvantages. Based on the customer use case, appropriate task distribution
can be configured. **block**: Setting this value will launch one task per block.
This setting is suggested in case of [...]
| carbon.custom.block.distribution | false | CarbonData has its own scheduling
algorithm to suggest to Spark how many tasks need to be launched and how
much work each task needs to do in a Spark cluster for any query on CarbonData.
When this configuration is true, CarbonData would distribute the available
blocks to be scanned among the available number of cores. For example: if there
are 10 blocks to be scanned and only 3 tasks can be run (only 3 executor cores
available in the cluster) [...]
-| enable.query.statistics | false | CarbonData has extensive logging which
would be useful for debugging issues related to performance or hard to locate
issues. This configuration when made ***true*** would log additional query
statistics information to more accurately locate the issues being
debugged.**NOTE:** Enabling this would log more debug information to log files,
there by increasing the log files size significantly in short span of time. It
is advised to configure the log files s [...]
+| enable.query.statistics | false | CarbonData has extensive logging which
would be useful for debugging issues related to performance or hard to locate
issues. This configuration when made ***true*** would log additional query
statistics information to more accurately locate the issues being debugged.
**NOTE:** Enabling this would log more debug information to log files, thereby
increasing the log file size significantly in a short span of time. It is
advised to configure the log files [...]
| enable.unsafe.in.query.processing | false | CarbonData supports unsafe
operations of Java to avoid GC overhead for certain operations. This
configuration enables to use unsafe functions in CarbonData while scanning the
data during query. |
| carbon.max.driver.threads.for.block.pruning | 4 | Number of threads used for
driver pruning when the carbon files are more than 100k. This
configuration can be used to set the number of threads between 1 and 4. |
| carbon.heap.memory.pooling.threshold.bytes | 1048576 | CarbonData supports
unsafe operations of Java to avoid GC overhead for certain operations. Using
unsafe, memory can be allocated on Java Heap or off heap. This configuration
controls the allocation mechanism on Java HEAP. If the heap memory allocations
of the given size are greater than or equal to this value, they should go through the
pooling mechanism. But if this size is set to -1, they should not go through the
pooling mechanism. Default [...]
@@ -205,21 +204,18 @@ RESET
| Properties | Description
|
| ----------------------------------------- |
------------------------------------------------------------ |
-| carbon.options.bad.records.logger.enable | CarbonData can identify the
records that are not conformant to schema and isolate them as bad records.
Enabling this configuration will make CarbonData to log such bad
records.**NOTE:** If the input data contains many bad records, logging them
will slow down the over all data loading throughput. The data load operation
status would depend on the configuration in ***carbon.bad.records.action***. |
-| carbon.options.bad.records.logger.enable | To enable or disable bad record
logger. |
-| carbon.options.bad.records.action | This property can have four
types of actions for bad records FORCE, REDIRECT, IGNORE and FAIL. If set to
FORCE then it auto-corrects the data by storing the bad records as NULL. If set
to REDIRECT then bad records are written to the raw CSV instead of being
loaded. If set to IGNORE then bad records are neither loaded nor written to the
raw CSV. If set to FAIL then data loading fails if any bad records are found. |
+| carbon.options.bad.records.logger.enable | To enable or disable the bad
record logger. CarbonData can identify the records that do not conform to the
schema and isolate them as bad records. Enabling this configuration will make
CarbonData log such bad records. **NOTE:** If the input data contains many
bad records, logging them will slow down the overall data loading throughput.
The data load operation status would depend on the configuration in
***carbon.bad.records.action***. | [...]
+| carbon.options.bad.records.action | This property has four types of
bad record actions: FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE then it
auto-corrects the data by storing the bad records as NULL. If set to REDIRECT
then bad records are written to the raw CSV instead of being loaded. If set to
IGNORE then bad records are neither loaded nor written to the raw CSV. If set
to FAIL then data loading fails if any bad records are found. |
| carbon.options.is.empty.data.bad.record | If false, then empty ("" or ''
or ,,) data will not be considered as bad record and vice versa. |
-| carbon.options.batch.sort.size.inmb | Size of batch data to keep in
memory, as a thumb rule it supposed to be less than 45% of
sort.inmemory.size.inmb otherwise it may spill intermediate data to disk. |
| carbon.options.bad.record.path | Specifies the HDFS path where
bad records need to be stored. |
-| carbon.custom.block.distribution | Specifies whether to use the
Spark or Carbon block distribution feature.**NOTE: **Refer to [Query
Configuration](#query-configuration)#carbon.custom.block.distribution for more
details on CarbonData scheduler. |
+| carbon.custom.block.distribution | Specifies whether to use the
Spark or Carbon block distribution feature. **NOTE:** Refer to [Query
Configuration](#query-configuration)#carbon.custom.block.distribution for more
details on CarbonData scheduler. |
| enable.unsafe.sort | Specifies whether to use unsafe
sort during data loading. Unsafe sort reduces the garbage collection during
data load operation, resulting in better performance. |
| carbon.options.date.format | Specifies the date format of
the date columns in the data being loaded. |
| carbon.options.timestamp.format | Specifies the timestamp format
of the time stamp columns in the data being loaded |
| carbon.options.sort.scope | Specifies how the current data
load should be sorted. This sort parameter is at the table level.
**NOTE:** Refer to [Data Loading
Configuration](#data-loading-configuration)#carbon.sort.scope for detailed
information. |
-| carbon.table.load.sort.scope.db_name.table_name | Overrides the SORT_SCOPE
provided in CREATE TABLE. |
+| carbon.table.load.sort.scope.<db_name>.<table_name> | Overrides the
SORT_SCOPE provided in CREATE TABLE. |
| carbon.options.global.sort.partitions | Specifies the number of
partitions to be used during global sort. |
| carbon.options.serialization.null.format | Default Null value
representation in the data being loaded. **NOTE:** Refer to [Data Loading
Configuration](#data-loading-configuration)#carbon.options.serialization.null.format
for detailed information. |
-| carbon.query.directQueryOnDataMap.enabled | Specifies whether datamap can be
queried directly. This is useful for debugging purposes.**NOTE: **Refer to
[Query Configuration](#query-configuration) for detailed information. |
**Examples:**
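For instance, dynamically configurable properties such as those listed above are typically set per session with SET and cleared with RESET; the values shown here are only a sketch:
```
SET carbon.options.bad.records.action=FORCE;
SET carbon.options.sort.scope=LOCAL_SORT;

-- Clear dynamically set properties for the session.
RESET;
```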
diff --git a/docs/quick-start-guide.md b/docs/quick-start-guide.md
index 4635cdb..e08b7b7 100644
--- a/docs/quick-start-guide.md
+++ b/docs/quick-start-guide.md
@@ -16,10 +16,10 @@
-->
# Quick Start
-This tutorial provides a quick introduction to using CarbonData. To follow
along with this guide, first download a packaged release of CarbonData from the
[CarbonData
website](https://dist.apache.org/repos/dist/release/carbondata/).Alternatively
it can be created following [Building
CarbonData](https://github.com/apache/carbondata/tree/master/build) steps.
+This tutorial provides a quick introduction to using CarbonData. To follow along
with this guide, download a packaged release of CarbonData from the [CarbonData
website](https://dist.apache.org/repos/dist/release/carbondata/).
Alternatively, it can be created following [Building
CarbonData](https://github.com/apache/carbondata/tree/master/build) steps.
## Prerequisites
-* CarbonData supports Spark versions upto 2.2.1.Please download Spark package
from [Spark website](https://spark.apache.org/downloads.html)
+* CarbonData supports Spark versions up to 2.4. Please download the Spark package
from the [Spark website](https://spark.apache.org/downloads.html)
* Create a sample.csv file using the following commands. The CSV file is
required for loading data into CarbonData
@@ -36,10 +36,10 @@ This tutorial provides a quick introduction to using
CarbonData. To follow along
## Integration
### Integration with Execution Engines
-CarbonData can be integrated with Spark,Presto and Hive execution engines. The
below documentation guides on Installing and Configuring with these execution
engines.
+CarbonData can be integrated with Spark, Presto, Flink and Hive execution
engines. The documentation below guides you through installing and configuring
CarbonData with these execution engines.
#### Spark
-[Installing and Configuring CarbonData to run locally with Spark SQL CLI
(version:
2.3+)](#installing-and-configuring-carbondata-to-run-locally-with-spark-sql)
+[Installing and Configuring CarbonData to run locally with Spark SQL
CLI](#installing-and-configuring-carbondata-to-run-locally-with-spark-sql-cli)
[Installing and Configuring CarbonData to run locally with Spark
Shell](#installing-and-configuring-carbondata-to-run-locally-with-spark-shell)
@@ -66,9 +66,9 @@ CarbonData can be integrated with Spark,Presto and Hive
execution engines. The b
#### Alluxio
[CarbonData supports read and write with Alluxio](./alluxio-guide.md)
-## Installing and Configuring CarbonData to run locally with Spark SQL CLI
(version: 2.3+)
+## Installing and Configuring CarbonData to run locally with Spark SQL CLI
-In Spark SQL CLI, it use CarbonExtensions to customize the SparkSession with
CarbonData's parser, analyzer, optimizer and physical planning strategy rules
in Spark.
+This works with Spark 2.3 and later versions. The Spark SQL CLI uses
CarbonExtensions to customize the SparkSession with CarbonData's parser,
analyzer, optimizer and physical planning strategy rules in Spark.
To enable CarbonExtensions, we need to add the following configuration.
|Key|Value|
@@ -95,7 +95,9 @@ STORED AS carbondata;
###### Loading Data to a Table
```
-LOAD DATA INPATH '/path/to/sample.csv' INTO TABLE test_table;
+LOAD DATA INPATH '/local-path/sample.csv' INTO TABLE test_table;
+
+LOAD DATA INPATH 'hdfs://hdfs-path/sample.csv' INTO TABLE test_table;
```
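A quick sanity check after the load (a sketch; it assumes the test_table created above):
```
SELECT COUNT(*) FROM test_table;
```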
```
@@ -119,7 +121,7 @@ GROUP BY city;
## Installing and Configuring CarbonData to run locally with Spark Shell
-Apache Spark Shell provides a simple way to learn the API, as well as a
powerful tool to analyze data interactively. Please visit [Apache Spark
Documentation](http://spark.apache.org/docs/latest/) for more details on Spark
shell.
+Apache Spark Shell provides a simple way to learn the API, as well as a
powerful tool to analyze data interactively. Please visit [Apache Spark
Documentation](http://spark.apache.org/docs/latest/) for more details on the
Spark shell.
#### Basics
@@ -129,7 +131,7 @@ Start Spark shell by running the following command in the
Spark directory:
```
./bin/spark-shell --jars <carbondata assembly jar path>
```
-**NOTE**: Path where packaged release of CarbonData was downloaded or assembly
jar will be available after [building
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
and can be copied from `./assembly/target/scala-2.1x/carbondata_xxx.jar`
+**NOTE**: Use the path where the packaged release of CarbonData was downloaded, or the assembly
jar, which will be available after [building
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
and can be copied from `./assembly/target/scala-2.1x/apache-carbondata_xxx.jar`
In this shell, SparkSession is readily available as `spark` and Spark context
is readily available as `sc`.
@@ -234,9 +236,9 @@ carbon.sql(
### Procedure
-1. [Build the
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from
`./assembly/target/scala-2.1x/carbondata_xxx.jar`.
+1. [Build the
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from
`./assembly/target/scala-2.1x/apache-carbondata_xxx.jar`.
-2. Copy `./assembly/target/scala-2.1x/carbondata_xxx.jar` to
`$SPARK_HOME/carbonlib` folder.
+2. Copy `./assembly/target/scala-2.1x/apache-carbondata_xxx.jar` to
`$SPARK_HOME/carbonlib` folder.
**NOTE**: Create the carbonlib folder if it does not exist inside
`$SPARK_HOME` path.
@@ -253,13 +255,7 @@ carbon.sql(
| spark.driver.extraJavaOptions | `-Dcarbon.properties.filepath =
$SPARK_HOME/conf/carbon.properties` | A string of extra JVM options to pass to
the driver. For instance, GC settings or other logging. |
| spark.executor.extraJavaOptions | `-Dcarbon.properties.filepath =
$SPARK_HOME/conf/carbon.properties` | A string of extra JVM options to pass to
executors. For instance, GC settings or other logging. **NOTE**: You can enter
multiple values separated by space. |
-7. Add the following properties in `$SPARK_HOME/conf/carbon.properties` file:
-
-| Property | Required | Description
| Example | Remark
|
-| -------------------- | -------- |
------------------------------------------------------------ |
------------------------------------ | ----------------------------- |
-| carbon.storelocation | NO | Location where data CarbonData will create
the store and write the data in its own format. If not specified then it takes
spark.sql.warehouse.dir path. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose
to set HDFS directory |
-
-8. Verify the installation. For example:
+7. Verify the installation. For example:
```
./bin/spark-shell \
@@ -268,7 +264,9 @@ carbon.sql(
--executor-memory 2G
```
-**NOTE**: Make sure you have permissions for CarbonData JARs and files through
which driver and executor will start.
+**NOTE**:
+ - property "carbon.storelocation" is deprecated in carbondata 2.0 version.
Only the users who used this property in previous versions can still use it in
carbon 2.0 version.
+ - Make sure you have permissions for CarbonData JARs and files through which
driver and executor will start.
## Installing and Configuring CarbonData on Spark on YARN Cluster
@@ -284,7 +282,7 @@ carbon.sql(
The following steps are only for Driver Nodes. (Driver nodes are the ones
which start the spark context.)
-1. [Build the
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from
`./assembly/target/scala-2.1x/carbondata_xxx.jar` and copy to
`$SPARK_HOME/carbonlib` folder.
+1. [Build the
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from
`./assembly/target/scala-2.1x/apache-carbondata_xxx.jar` and copy to
`$SPARK_HOME/carbonlib` folder.
**NOTE**: Create the carbonlib folder if it does not exist inside
`$SPARK_HOME` path.
@@ -310,13 +308,7 @@ mv carbondata.tar.gz carbonlib/
| spark.driver.extraClassPath | Extra classpath entries to prepend to the
classpath of the driver. **NOTE**: If SPARK_CLASSPATH is defined in
spark-env.sh, then comment it and append the value in below parameter
spark.driver.extraClassPath. | `$SPARK_HOME/carbonlib/*`
|
| spark.driver.extraJavaOptions | A string of extra JVM options to pass to
the driver. For instance, GC settings or other logging. |
`-Dcarbon.properties.filepath = $SPARK_HOME/conf/carbon.properties` |
-5. Add the following properties in `$SPARK_HOME/conf/carbon.properties`:
-
-| Property | Required | Description
| Example | Default Value
|
-| -------------------- | -------- |
------------------------------------------------------------ |
------------------------------------ | ----------------------------- |
-| carbon.storelocation | NO | Location where CarbonData will create the
store and write the data in its own format. If not specified then it takes
spark.sql.warehouse.dir path. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose
to set HDFS directory |
-
-6. Verify the installation.
+5. Verify the installation.
```
./bin/spark-shell \
@@ -327,6 +319,7 @@ mv carbondata.tar.gz carbonlib/
```
**NOTE**:
+ - property "carbon.storelocation" is deprecated in carbondata 2.0 version.
Only the users who used this property in previous versions can still use it in
carbon 2.0 version.
- Make sure you have permissions for CarbonData JARs and files through which
driver and executor will start.
- If using Spark + Hive 1.1.X, the carbondata assembly jar and
carbondata-hive jar need to be added to the parameter 'spark.sql.hive.metastore.jars' in the
spark-default.conf file.
@@ -343,13 +336,27 @@ b. Run the following command to start the CarbonData
thrift server.
```
./bin/spark-submit \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
+$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR
+```
+
+| Parameter | Description
| Example |
+| ------------------- |
------------------------------------------------------------ |
---------------------------------------------------------- |
+| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the
`$SPARK_HOME/carbonlib/` folder. | apache-carbondata-xx.jar |
+
+c. Run the following command to work with S3 storage.
+
+```
+./bin/spark-submit \
+--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
+$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <access_key> <secret_key> <endpoint>
```
| Parameter | Description
| Example |
| ------------------- |
------------------------------------------------------------ |
---------------------------------------------------------- |
-| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the
`$SPARK_HOME/carbonlib/` folder. |
carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar |
-| carbon_store_path | This is a parameter to the CarbonThriftServer class.
This a HDFS path where CarbonData files will be kept. Strongly Recommended to
put same as carbon.storelocation parameter of carbon.properties. If not
specified then it takes spark.sql.warehouse.dir path. |
`hdfs://<host_name>:port/user/hive/warehouse/carbon.store` |
+| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the
`$SPARK_HOME/carbonlib/` folder. | apache-carbondata-xx.jar |
+| access_key | Access key for S3 storage | |
+| secret_key | Secret key for S3 storage | |
+| endpoint | Endpoint for connecting to S3 storage | |
**NOTE**: From Spark 1.6, by default the Thrift server runs in multi-session
mode, which means each JDBC/ODBC connection owns a copy of its own SQL
configuration and temporary function registry. Cached tables are still shared
though. If you prefer to run the Thrift server in single-session mode and share
all SQL configuration and temporary function registry, please set option
`spark.sql.hive.thriftServer.singleSession` to `true`. You may either add this
option to `spark-defaults.conf`, [...]
@@ -357,7 +364,7 @@ $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR
<carbon_store_path>
./bin/spark-submit \
--conf spark.sql.hive.thriftServer.singleSession=true \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
+$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR
```
**But** in single-session mode, if one user changes the database from one
connection, the database of the other connections will be changed too.
@@ -369,8 +376,7 @@ $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR
<carbon_store_path>
```
./bin/spark-submit \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
-hdfs://<host_name>:port/user/hive/warehouse/carbon.store
+$SPARK_HOME/carbonlib/apache-carbondata-xxx.jar
```
- Start with Fixed executors and resources.
@@ -382,8 +388,7 @@ hdfs://<host_name>:port/user/hive/warehouse/carbon.store
--driver-memory 20G \
--executor-memory 250G \
--executor-cores 32 \
-$SPARK_HOME/carbonlib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
-hdfs://<host_name>:port/user/hive/warehouse/carbon.store
+$SPARK_HOME/carbonlib/apache-carbondata-xxx.jar
```
### Connecting to CarbonData Thrift Server Using Beeline.
@@ -401,9 +406,9 @@ Example
## Installing and Configuring CarbonData on Presto
-**NOTE:** **CarbonData tables cannot be created nor loaded from Presto. User
need to create CarbonData Table and load data into it
+**NOTE:** **CarbonData tables cannot be created nor loaded from Presto. User
needs to create CarbonData Table and load data into it
either with
[Spark](#installing-and-configuring-carbondata-to-run-locally-with-spark-shell)
or [SDK](./sdk-guide.md) or [C++ SDK](./csdk-guide.md).
-Once the table is created,it can be queried from Presto.**
+Once the table is created, it can be queried from Presto.**
Please refer to the Presto guide linked below.
@@ -411,7 +416,7 @@ prestodb guide - [prestodb](./prestodb-guide.md)
prestosql guide - [prestosql](./prestosql-guide.md)
-Once installed the presto with carbonData as per above guide,
+Once Presto is installed with CarbonData as per the above guide,
you can use the Presto CLI on the coordinator to query data sources in the
catalog using the Presto workers.
List the schemas(databases) available
@@ -438,6 +443,4 @@ Query from the available tables
select * from carbon_table;
```
-**Note :** Create Tables and data loads should be done before executing
queries as we can not create carbon table from this interface.
-
-```
+**Note:** Table creation and data loading should be done before executing queries,
as we cannot create a carbon table from this interface.