This is an automated email from the ASF dual-hosted git repository.
akashrn5 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new 7d94691 [CARBONDATA-4240]: Added missing properties on the
configurations page
7d94691 is described below
commit 7d94691deb3300624ce4b22c4563cb4b9da776fa
Author: pratyakshsharma <[email protected]>
AuthorDate: Wed Oct 27 13:51:07 2021 +0530
[CARBONDATA-4240]: Added missing properties on the configurations page
Why is this PR needed?
A few user-facing properties that were missing from the configurations page have been added.
What changes were proposed in this PR?
Addition of missing properties
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #4210
---
docs/configuration-parameters.md | 31 +++++++++++++++++++++++++++----
docs/ddl-of-carbondata.md | 2 +-
docs/quick-start-guide.md | 4 ++--
3 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index 73bf2ce..c24518a 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -52,6 +52,11 @@ This section provides the details of all the configurations
required for the Car
| carbon.trash.retention.days | 7 | This parameter specifies the number of days after which the timestamp-based subdirectories in the trash folder expire. Allowed Min value = 0, Allowed Max value = 365 days |
| carbon.clean.file.force.allowed | false | This parameter specifies whether the clean files operation with the force option is allowed. |
| carbon.cdc.minmax.pruning.enabled | false | This parameter defines whether min/max pruning should be performed on the target table based on the source data. It is useful when data is not sparse across the target table, which results in better pruning. |
+| carbon.blocklet.size | 64 MB | A CarbonData file consists of blocklets, which in turn consist of column pages. As per the latest V3 format, the default size of a blocklet is 64 MB. It is recommended not to change this value except for specific use cases. |
+| carbon.date.format | yyyy-MM-dd | This property specifies the format used for parsing incoming date field values. |
+| carbon.lock.class | (none) | This specifies the implementation of the ICarbonLock interface to be used for acquiring locks during concurrent operations. |
+| carbon.data.file.version | V3 | This specifies the CarbonData file format version. The file format has evolved over time from V1 to V3 in terms of metadata storage and IO-level pruning capabilities. You can find more details [here](https://carbondata.apache.org/file-structure-of-carbondata.html#carbondata-file-format). |
+| spark.carbon.hive.schema.store | false | CarbonData currently supports two different types of metastores for storing schemas. This property specifies whether the Hive metastore is to be used for storing and retrieving table schemas. |
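These system-level keys normally live in `carbon.properties`; as a minimal sketch (assuming a CarbonData-enabled Spark build, with illustrative values only), the same properties can be set programmatically:

```
import org.apache.carbondata.core.util.CarbonProperties

// Illustrative values; equivalently, put these key/value pairs in carbon.properties.
val props = CarbonProperties.getInstance()
props.addProperty("carbon.date.format", "yyyy-MM-dd")
props.addProperty("carbon.data.file.version", "V3")
```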
## Data Loading Configuration
@@ -70,6 +75,7 @@ This section provides the details of all the configurations
required for the Car
| carbon.load.global.sort.partitions | 0 | The number of partitions to use when shuffling data for global sort. The default value 0 means using the same number of map tasks as reduce tasks. **NOTE:** In general, it is recommended to have 2-3 tasks per CPU core in your cluster. |
| carbon.sort.size | 100000 | Number of records to hold in memory to sort and write intermediate sort temp files. **NOTE:** Memory required for data loading will increase if this value is set higher. Moreover, each thread will cache this amount of records. The number of threads is configured by *carbon.number.of.cores.while.loading*. |
| carbon.options.bad.records.logger.enable | false | CarbonData can identify the records that are not conformant to the schema and isolate them as bad records. Enabling this configuration will make CarbonData log such bad records. **NOTE:** If the input data contains many bad records, logging them will slow down the overall data loading throughput. The data load operation status would depend on the configuration in ***carbon.bad.records.action***. |
+| carbon.options.bad.records.action | FAIL | This property supports four bad-record actions: FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE, the data is auto-corrected by storing the bad records as NULL. If set to REDIRECT, bad records are written to the raw CSV instead of being loaded. If set to IGNORE, bad records are neither loaded nor written to the raw CSV. If set to FAIL, data loading fails if any bad records are found. Also this property can be set at differ [...]
| carbon.bad.records.action | FAIL | In addition to identifying bad records, CarbonData can take certain actions on such data. This configuration supports four types of actions for bad records, namely FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE, the data is auto-corrected by storing the bad records as NULL. If set to REDIRECT, bad records are written to the raw CSV instead of being loaded. If set to IGNORE, bad records are neither loaded nor written to the raw CSV. If [...]
| carbon.options.is.empty.data.bad.record | false | Based on the business scenarios, empty ("" or '' or ,,) data can be valid or invalid. This configuration controls how empty data should be treated by CarbonData. If false, then empty ("" or '' or ,,) data will not be considered as bad record and vice versa. |
| carbon.options.bad.record.path | (none) | Specifies the HDFS path where bad
records are to be stored. By default the value is Null. This path must be
configured by the user if ***carbon.options.bad.records.logger.enable*** is
**true** or ***carbon.bad.records.action*** is **REDIRECT**. |
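A hedged sketch of a load exercising the bad-records options above; the table name, input path and redirect path are placeholders, and `carbon` is the CarbonData-enabled SparkSession from the quick start guide:

```
// Placeholders throughout: adjust the table, input path and redirect path.
carbon.sql(
  """
    | LOAD DATA INPATH 'hdfs://hacluster/data/sample.csv'
    | INTO TABLE test_table
    | OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true',
    | 'BAD_RECORDS_ACTION'='REDIRECT',
    | 'BAD_RECORD_PATH'='hdfs://hacluster/tmp/carbon/badrecords')
  """.stripMargin)
```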
@@ -93,12 +99,15 @@ This section provides the details of all the configurations
required for the Car
| carbon.options.serialization.null.format | \N | Based on the business scenarios, some columns might need to be loaded with null values. As null values cannot be written in CSV files, some special characters might be adopted to specify them. This configuration can be used to specify the null value format in the data being loaded. |
| carbon.column.compressor | snappy | CarbonData will compress the column
values using the compressor specified by this configuration. Currently
CarbonData supports 'snappy', 'zstd' and 'gzip' compressors. |
| carbon.minmax.allowed.byte.count | 200 | CarbonData will write the min max values for string/varchar type columns using the byte count specified by this configuration. Max value is 1000 bytes (500 characters) and Min value is 10 bytes (5 characters). **NOTE:** This property is useful for reducing the store size, thereby improving the query performance, but can lead to query degradation if the value is not configured properly. |
-| carbon.merge.index.failure.throw.exception | true | It is used to configure
whether or not merge index failure should result in data load failure also. |
| carbon.binary.decoder | None | Supports configurable decoding for loading binary data. Two decoders are supported: base64 and hex. |
| carbon.local.dictionary.size.threshold.inmb | 4 | Size-based threshold for local dictionary in MB; maximum allowed size is 16 MB. |
-| carbon.enable.bad.record.handling.for.insert | false | by default, disable
the bad record and converter step during "insert into" |
-| carbon.load.si.repair | true | by default, enable loading for failed
segments in SI during load/insert command |
+| carbon.enable.bad.record.handling.for.insert | false | By default, the bad record and converter step is disabled during "insert into". |
+| carbon.load.si.repair | true | By default, loading of failed segments in SI during the load/insert command is enabled. |
| carbon.si.repair.limit | (none) | Number of failed segments to be loaded in
SI when repairing missing segments in SI, by default load all the missing
segments. Supports value from 0 to 2147483646 |
+| carbon.complex.delimiter.level.1 | # | This delimiter is used for parsing complex data type columns. The level 1 delimiter splits the complex type data column in a row (e.g., a\001b\001c --> Array = {a,b,c}). |
+| carbon.complex.delimiter.level.2 | $ | This delimiter splits the complex type nested data column in a row. The level 1 delimiter is applied first, then the level 2 delimiter, based on the complex data type (e.g., a\002b\001c\002d --> Array of Array = {{a,b},{c,d}}). |
+| carbon.complex.delimiter.level.3 | @ | This delimiter splits the complex type nested data column in a row. The level 1, level 2 and then level 3 delimiters are applied based on the complex data type. Used in case of nested Complex Map type (e.g., 'a\003b\002b\003c\001aa\003bb\002cc\003dd' --> Array of Map = {{a -> b, b -> c},{aa -> bb, cc -> dd}}). |
+| carbon.complex.delimiter.level.4 | (none) | All levels of delimiters are used for parsing complex data type columns, applied depending on the depth of the given data type. The level 4 delimiter is used for parsing complex values after the level 3 delimiter has already been applied. |
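The same delimiters can be overridden per load through the corresponding load options; a sketch with placeholder table and path names:

```
// Level 1 separates array elements, level 2 separates nested elements.
carbon.sql(
  """
    | LOAD DATA INPATH 'hdfs://hacluster/data/complex.csv'
    | INTO TABLE complex_table
    | OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='#',
    | 'COMPLEX_DELIMITER_LEVEL_2'='$')
  """.stripMargin)
```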
## Compaction Configuration
@@ -113,12 +122,13 @@ This section provides the details of all the
configurations required for the Car
| carbon.numberof.preserve.segments | 0 | If the user wants to preserve some number of segments from being compacted, this configuration can be set. Example: carbon.numberof.preserve.segments = 2 means the 2 latest segments will always be excluded from compaction. No segments are preserved by default. **NOTE:** This configuration is useful when there is a chance that the input data is wrong due to environment scenarios. Preserving some of the latest segments from being compacted can help [...]
| carbon.allowed.compaction.days | 0 | This configuration is used to control the number of recent segments that need to be compacted, ignoring the older ones. This configuration is in days. For example: if the configuration is 2, then only the segments loaded in the past 2 days will get merged. Segments which were loaded earlier than 2 days will not be merged. This configuration is disabled by default. **NOTE:** This configuration is useful when a bulk of histo [...]
| carbon.enable.auto.load.merge | false | Compaction can be automatically triggered once a data load completes. This ensures that segments are merged in time and thus query times do not increase with an increasing number of segments. This configuration enables compaction along with data loading. **NOTE:** Compaction will be triggered once the data load completes, but the data load status waits until the compaction is completed. Hence it might look like data loading time has increased, but [...]
-| carbon.enable.page.level.reader.in.compaction|false|Enabling page level
reader for compaction reduces the memory usage while compacting more number of
segments. It allows reading only page by page instead of reading whole blocklet
to memory. **NOTE:** Please refer to
[file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format)
to understand the storage format of CarbonData and concepts of pages.|
+| carbon.enable.page.level.reader.in.compaction | false | Enabling the page level reader for compaction reduces memory usage while compacting a larger number of segments. It allows reading page by page instead of reading the whole blocklet into memory. **NOTE:** Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData and the concept of pages.|
| carbon.concurrent.compaction | true | Compaction of different tables can be executed concurrently. This configuration determines whether to compact all qualifying tables in parallel or not. **NOTE:** Compacting concurrently is a resource-demanding operation and needs more resources, thereby also affecting query performance. This configuration is **deprecated** and might be removed in future releases. |
| carbon.compaction.prefetch.enable | false | A compaction operation is similar to Query + data load, wherein data from qualifying segments is queried and data loading is performed to generate a new single segment. This configuration determines whether to query data ahead from segments and feed it for data loading. **NOTE:** This configuration is disabled by default as it needs extra resources for querying extra data. Based on the memory availability on the cluster, user can enable it to imp [...]
| carbon.enable.range.compaction | true | Configures whether range-based compaction is to be used for RANGE_COLUMN. If true, the data remains partitioned into ranges after compaction as well. |
| carbon.si.segment.merge | false | Setting this to true degrades LOAD performance. When the number of small files increases for SI segments (this can happen since the number of columns is small and we store position id and reference columns), the user can either set this to true, which will merge the data files for upcoming loads, or run the SI refresh command, which does this job for all segments (REFRESH INDEX <index_table>). |
| carbon.partition.data.on.tasklevel | false | When enabled, tasks launched for a local sort partition load will be based on one node one task, and compaction will be performed at task level for a partition. Load performance might be degraded because the number of tasks launched is equal to the number of nodes in case of local sort. For compaction, memory consumption will be lower, as more tasks will be launched for a partition. |
+| carbon.minor.compaction.size | (none) | Minor compaction originally worked based on the number of segments (4 by default). However, in that scenario there was no control over the size of segments to be compacted. This parameter was introduced to exclude segments whose size is greater than the configured threshold, so that the overall IO and time taken decrease. |
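A hedged sketch of combining the size cap with a minor compaction; the threshold is assumed to be in MB and the table name is a placeholder:

```
// Assumption: value is interpreted in MB. Segments above ~512 MB are
// excluded from the minor compaction triggered below.
carbon.sql("SET carbon.minor.compaction.size=512")
carbon.sql("ALTER TABLE sales COMPACT 'MINOR'")
```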
## Query Configuration
@@ -151,6 +161,16 @@ This section provides the details of all the
configurations required for the Car
| carbon.partition.max.driver.lru.cache.size | -1 | Maximum memory **(in MB)** up to which the driver can cache partition metadata. Beyond this, the least recently used data will be removed from the cache before loading a new set of values. |
| carbon.mapOrderPushDown.<db_name>_<table_name>.column | empty | If the order by column is in the sort columns, specify that sort column here to avoid ordering at the map task. |
| carbon.metacache.expiration.seconds | Long.MAX_VALUE | Expiration time **(in
seconds)** for tableInfo cache in CarbonMetadata and tableModifiedTime in
CarbonFileMetastore, after the time configured since last access to the cache
entry, tableInfo and tableModifiedTime will be removed from each cache. Recent
access will refresh the timer. Default value of Long.MAX_VALUE means the cache
will not be expired by time. **NOTE:** At the time when cache is being expired,
queries on the table ma [...]
+| is.driver.instance | false | This parameter decides whether the LRU cache for storing indexes needs to be created on the driver. By default, it is created on the executors. |
+| carbon.input.metrics.update.interval | 500000 | This property determines the number of records queried after which input metrics are updated to Spark. It can also be set dynamically within the Spark session itself. |
+| carbon.use.bitset.pipe.line | true | CarbonData has various optimizations for faster query execution, and setting this property speeds up filter queries. If set to true, the bitset is passed from one filter to the next, resulting in incremental filtering and improving overall performance. |
+
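Since carbon.input.metrics.update.interval is dynamically configurable, it can be tuned per session; a sketch with an illustrative value:

```
// Update Spark input metrics every 100000 records instead of the default 500000.
carbon.sql("SET carbon.input.metrics.update.interval=100000")
```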
+## Index Configuration
+| Parameter | Default Value | Description |
+|--------------------------------------|---------------|---------------------------------------------------|
+| carbon.lucene.index.stop.words | false | By default, Lucene does not create an index for stop words like 'is', 'the' etc. This flag is used to override that behaviour. |
+| carbon.load.dateformat.setlenient.enable | false | This property enables lenient parsing of timestamp/date data in the load flow if parsing fails with an invalid timestamp data error. For example: 1941-03-15 00:00:00 is a valid time in the Asia/Calcutta zone but is invalid and will fail to parse in the Asia/Shanghai zone, as DST was observed there and clocks were turned forward 1 hour to 1941-03-15 01:00:00. |
+| carbon.indexserver.tempfolder.deletetime | 10800000 | This specifies the time period in milliseconds after which the temp folder gets deleted from the index server. |
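A minimal sketch (assuming a CarbonData-enabled Spark build) of overriding the Lucene stop-word behaviour via the CarbonProperties API; equivalently, the key can be added to carbon.properties before the session starts:

```
import org.apache.carbondata.core.util.CarbonProperties

// Index stop words such as 'is' and 'the' as well (default behaviour skips them).
CarbonProperties.getInstance().addProperty("carbon.lucene.index.stop.words", "true")
```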
## Data Mutation Configuration
| Parameter | Default Value | Description |
@@ -237,6 +257,9 @@ RESET
| carbon.enable.index.server | To use index server for caching
and pruning. This property can be used for a session or for a particular table
with ***carbon.enable.index.server.<db_name>.<table_name>***. |
| carbon.reorder.filter | This property can be used to enable/disable filter reordering. Should be disabled only when the user has optimized the filter condition. |
| carbon.mapOrderPushDown.<db_name>_<table_name>.column | If the order by column is in the sort columns, specify that sort column here to avoid ordering at the map task. |
+| carbon.load.dateformat.setlenient.enable | To enable parsing of
timestamp/date data in load flow if the parsing fails with invalid timestamp
data error. **NOTE** Refer to [Index
Configuration](#index-configuration)#carbon.load.dateformat.setlenient.enable
for detailed information. |
+| carbon.minor.compaction.size | Puts an upper limit on the size of segments
to be included for compaction. **NOTE** Refer to [Compaction
Configuration](#compaction-configuration)#carbon.minor.compaction.size for
detailed information. |
+| carbon.input.metrics.update.interval | Determines the number of records queried after which input metrics are updated to Spark. **NOTE** Refer to [Query Configuration](#query-configuration)#carbon.input.metrics.update.interval for detailed information. |
**Examples:**
* Add or Update:
diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index b37b3ab..dbf616b 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -641,7 +641,7 @@ CarbonData DDL statements are documented here,which
includes:
This function creates a new database. By default the database is created in the location 'spark.sql.warehouse.dir', but you can also specify a custom location by configuring 'spark.sql.warehouse.dir'; the configuration 'carbon.storelocation' has been deprecated.
**Note:**
- For simplicity, we recommended you remove the configuration of
carbon.storelocation. If carbon.storelocaiton and spark.sql.warehouse.dir are
configured to different paths, exception will be thrown when CREATE DATABASE
and DROP DATABASE to avoid inconsistent database location.
+ For simplicity, we recommend you remove the configuration of carbon.storelocation. If carbon.storelocation and spark.sql.warehouse.dir are configured to different paths, an exception will be thrown on CREATE DATABASE and DROP DATABASE to avoid an inconsistent database location.
```
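A hedged sketch of the recommended setup, with a placeholder database name: leave carbon.storelocation unset and let 'spark.sql.warehouse.dir' drive the database location:

```
// With carbon.storelocation removed, the database lands under
// spark.sql.warehouse.dir, keeping CREATE/DROP DATABASE consistent.
carbon.sql("CREATE DATABASE IF NOT EXISTS sample_db")
```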
diff --git a/docs/quick-start-guide.md b/docs/quick-start-guide.md
index 4782917..0d9cee1 100644
--- a/docs/quick-start-guide.md
+++ b/docs/quick-start-guide.md
@@ -259,7 +259,7 @@ carbon.sql(
3. Add the carbonlib folder path in the Spark classpath. (Edit
`$SPARK_HOME/conf/spark-env.sh` file and modify the value of `SPARK_CLASSPATH`
by appending `$SPARK_HOME/carbonlib/*` to the existing value)
-4. Copy the `./conf/carbon.properties.template` file from CarbonData
repository to `$SPARK_HOME/conf/` folder and rename the file to
`carbon.properties`.
+4. Copy the `./conf/carbon.properties.template` file from the CarbonData repository to the `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`. All the CarbonData-related properties are configured in this file (see the sanity-check sketch after this list).
5. Repeat Step 2 to Step 5 in all the nodes of the cluster.
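Once carbon.properties is in place, a quick sanity check can be run from the CarbonData-enabled spark-shell; a sketch (getProperty returns null for keys that are not set):

```
import org.apache.carbondata.core.util.CarbonProperties

// Hypothetical check that a key configured in carbon.properties was picked up.
val fmt = CarbonProperties.getInstance().getProperty("carbon.date.format")
println(s"carbon.date.format = $fmt")
```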
@@ -304,7 +304,7 @@ carbon.sql(
**NOTE**: Create the carbonlib folder if it does not exist inside the `$SPARK_HOME` path.
-2. Copy the `./conf/carbon.properties.template` file from CarbonData
repository to `$SPARK_HOME/conf/` folder and rename the file to
`carbon.properties`.
+2. Copy the `./conf/carbon.properties.template` file from the CarbonData repository to the `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`. All the CarbonData-related properties are configured in this file.
3. Create a `tar.gz` file of the carbonlib folder and move it inside the carbonlib folder.