This is an automated email from the ASF dual-hosted git repository.
akashrn5 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new 7d94691 [CARBONDATA-4240]: Added missing properties on the
configurations page
7d94691 is described below
commit 7d94691deb3300624ce4b22c4563cb4b9da776fa
Author: pratyakshsharma <[email protected]>
AuthorDate: Wed Oct 27 13:51:07 2021 +0530
[CARBONDATA-4240]: Added missing properties on the configurations page
Why is this PR needed?
A few user-facing properties that were missing from the configurations page have been added.
What changes were proposed in this PR?
Addition of missing properties
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #4210
---
docs/configuration-parameters.md | 31 +++++++++++++++++++++++++++----
docs/ddl-of-carbondata.md | 2 +-
docs/quick-start-guide.md | 4 ++--
3 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index 73bf2ce..c24518a 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -52,6 +52,11 @@ This section provides the details of all the configurations
required for the Car
| carbon.trash.retention.days | 7 | This parameter specifies the number of days after which the timestamp-based subdirectories in the trash folder expire. Allowed Min value = 0, Allowed Max value = 365 days |
| carbon.clean.file.force.allowed | false | This parameter specifies whether the clean files operation with the force option is allowed. |
| carbon.cdc.minmax.pruning.enabled | false | This parameter defines whether min/max pruning should be performed on the target table based on the source data. It is useful when data is not sparse across the target table, which results in better pruning. |
+| carbon.blocklet.size | 64 MB | A CarbonData file consists of blocklets, which in turn consist of column pages. As per the latest V3 format, the default size of a blocklet is 64 MB. It is recommended not to change this value except for specific use cases. |
+| carbon.date.format | yyyy-MM-dd | This property specifies the format used for parsing incoming date field values. |
+| carbon.lock.class | (none) | This specifies the implementation of the ICarbonLock interface to be used for acquiring locks during concurrent operations. |
+| carbon.data.file.version | V3 | This specifies the CarbonData file format version. The file format has evolved over time from V1 to V3 in terms of metadata storage and IO-level pruning capabilities. You can find more details [here](https://carbondata.apache.org/file-structure-of-carbondata.html#carbondata-file-format). |
+| spark.carbon.hive.schema.store | false | CarbonData currently supports two different types of metastores for storing schemas. This property specifies whether the Hive metastore is to be used for storing and retrieving table schemas. |
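These system-level keys normally live in `carbon.properties`; as a minimal sketch (assuming a CarbonData-enabled Spark build, with illustrative values only), the same properties can be set programmatically:

```
import org.apache.carbondata.core.util.CarbonProperties

// Illustrative values; equivalently, put these key/value pairs in carbon.properties.
val props = CarbonProperties.getInstance()
props.addProperty("carbon.date.format", "yyyy-MM-dd")
props.addProperty("carbon.data.file.version", "V3")
```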
## Data Loading Configuration
@@ -70,6 +75,7 @@ This section provides the details of all the configurations
required for the Car
| carbon.load.global.sort.partitions | 0 | The number of partitions to use when shuffling data for global sort. The default value 0 means using the same number of map tasks as reduce tasks. **NOTE:** In general, it is recommended to have 2-3 tasks per CPU core in your cluster. |
| carbon.sort.size | 100000 | Number of records to hold in memory to sort and write intermediate sort temp files. **NOTE:** Memory required for data loading will increase if this value is set higher. Moreover, each thread will cache this amount of records. The number of threads is configured by *carbon.number.of.cores.while.loading*. |
| carbon.options.bad.records.logger.enable | false | CarbonData can identify the records that are not conformant to the schema and isolate them as bad records. Enabling this configuration will make CarbonData log such bad records. **NOTE:** If the input data contains many bad records, logging them will slow down the overall data loading throughput. The data load operation status would depend on the configuration in ***carbon.bad.records.action***. |
+| carbon.options.bad.records.action | FAIL | This property supports four bad-record actions: FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE, the data is auto-corrected by storing the bad records as NULL. If set to REDIRECT, bad records are written to the raw CSV instead of being loaded. If set to IGNORE, bad records are neither loaded nor written to the raw CSV. If set to FAIL, data loading fails if any bad records are found. Also this property can be set at differ [...]
| carbon.bad.records.action | FAIL | In addition to identifying bad records, CarbonData can take certain actions on such data. This configuration supports four types of actions for bad records, namely FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE, the data is auto-corrected by storing the bad records as NULL. If set to REDIRECT, bad records are written to the raw CSV instead of being loaded. If set to IGNORE, bad records are neither loaded nor written to the raw CSV. If [...]
| carbon.options.is.empty.data.bad.record | false | Based on the business scenarios, empty ("" or '' or ,,) data can be valid or invalid. This configuration controls how empty data should be treated by CarbonData. If false, then empty ("" or '' or ,,) data will not be considered as bad record and vice versa. |
| carbon.options.bad.record.path | (none) | Specifies the HDFS path where bad
records are to be stored. By default the value is Null. This path must be
configured by the user if ***carbon.options.bad.records.logger.enable*** is
**true** or ***carbon.bad.records.action*** is **REDIRECT**. |
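A hedged sketch of a load exercising the bad-records options above; the table name, input path and redirect path are placeholders, and `carbon` is the CarbonData-enabled SparkSession from the quick start guide:

```
// Placeholders throughout: adjust the table, input path and redirect path.
carbon.sql(
  """
    | LOAD DATA INPATH 'hdfs://hacluster/data/sample.csv'
    | INTO TABLE test_table
    | OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true',
    | 'BAD_RECORDS_ACTION'='REDIRECT',
    | 'BAD_RECORD_PATH'='hdfs://hacluster/tmp/carbon/badrecords')
  """.stripMargin)
```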
@@ -93,12 +99,15 @@ This section provides the details of all the configurations
required for the Car
| carbon.options.serialization.null.format | \N | Based on the business scenarios, some columns might need to be loaded with null values. As null values cannot be written in CSV files, some special characters might be adopted to specify them. This configuration can be used to specify the null value format in the data being loaded. |
| carbon.column.compressor | snappy | CarbonData will compress the column
values using the compressor specified by this configuration. Currently
CarbonData supports 'snappy', 'zstd' and 'gzip' compressors. |
| carbon.minmax.allowed.byte.count | 200 | CarbonData will write the min max values for string/varchar type columns using the byte count specified by this configuration. Max value is 1000 bytes (500 characters) and Min value is 10 bytes (5 characters). **NOTE:** This property is useful for reducing the store size, thereby improving the query performance, but can lead to query degradation if the value is not configured properly. |
-| carbon.merge.index.failure.throw.exception | true | It is used to configure
whether or not merge index failure should result in data load failure also. |
| carbon.binary.decoder | None | Supports configurable decoding for loading binary data. Two decoders are supported: base64 and hex. |
| carbon.local.dictionary.size.threshold.inmb | 4 | Size-based threshold for local dictionary in MB; maximum allowed size is 16 MB. |
-| carbon.enable.bad.record.handling.for.insert | false | by default, disable
the bad record and converter step during "insert into" |
-| carbon.load.si.repair | true | by default, enable loading for failed
segments in SI during load/insert command |
+| carbon.enable.bad.record.handling.for.insert | false | By default, the bad record and converter step is disabled during "insert into". |
+| carbon.load.si.repair | true | By default, loading of failed segments in SI during the load/insert command is enabled. |
| carbon.si.repair.limit | (none) | Number of failed segments to be loaded in
SI when repairing missing segments in SI, by default load all the missing
segments. Supports value from 0 to 2147483646 |
+| carbon.complex.delimiter.level.1 | # | This delimiter is used for parsing complex data type columns. The level 1 delimiter splits the complex type data column in a row (e.g., a\001b\001c --> Array = {a,b,c}). |
+| carbon.complex.delimiter.level.2 | $ | This delimiter splits the complex type nested data column in a row. The level 1 delimiter is applied first, then the level 2 delimiter, based on the complex data type (e.g., a\002b\001c\002d --> Array of Array = {{a,b},{c,d}}). |
+| carbon.complex.delimiter.level.3 | @ | This delimiter splits the complex type nested data column in a row. The level 1, level 2 and then level 3 delimiters are applied based on the complex data type. Used in case of nested Complex Map type (e.g., 'a\003b\002b\003c\001aa\003bb\002cc\003dd' --> Array of Map = {{a -> b, b -> c},{aa -> bb, cc -> dd}}). |
+| carbon.complex.delimiter.level.4 | (none) | All levels of delimiters are used for parsing complex data type columns, applied depending on the depth of the given data type. The level 4 delimiter is used for parsing complex values after the level 3 delimiter has already been applied. |
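The same delimiters can be overridden per load through the corresponding load options; a sketch with placeholder table and path names:

```
// Level 1 separates array elements, level 2 separates nested elements.
carbon.sql(
  """
    | LOAD DATA INPATH 'hdfs://hacluster/data/complex.csv'
    | INTO TABLE complex_table
    | OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='#',
    | 'COMPLEX_DELIMITER_LEVEL_2'='$')
  """.stripMargin)
```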
## Compaction Configuration
@@ -113,12 +122,13 @@ This section provides the details of all the
configurations required for the Car
| carbon.numberof.preserve.segments | 0 | If the user wants to preserve some number of segments from being compacted, this configuration can be set. Example: carbon.numberof.preserve.segments = 2 means the 2 latest segments will always be excluded from compaction. No segments are preserved by default. **NOTE:** This configuration is useful when there is a chance that the input data is wrong due to environment scenarios. Preserving some of the latest segments from being compacted can help [...]
| carbon.allowed.compaction.days | 0 | This configuration is used to control the number of recent segments that need to be compacted, ignoring the older ones. This configuration is in days. For example: if the configuration is 2, then only the segments loaded in the past 2 days will get merged. Segments which were loaded earlier than 2 days will not be merged. This configuration is disabled by default. **NOTE:** This configuration is useful when a bulk of histo [...]
| carbon.enable.auto.load.merge | false | Compaction can be automatically triggered once a data load completes. This ensures that segments are merged in time and thus query times do not increase with an increasing number of segments. This configuration enables compaction along with data loading. **NOTE:** Compaction will be triggered once the data load completes, but the data load status waits until the compaction is completed. Hence it might look like data loading time has increased, but [...]
-| carbon.enable.page.level.reader.in.compaction|false|Enabling page level
reader for compaction reduces the memory usage while compacting more number of
segments. It allows reading only page by page instead of reading whole blocklet
to memory. **NOTE:** Please refer to
[file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format)
to understand the storage format of CarbonData and concepts of pages.|
+| carbon.enable.page.level.reader.in.compaction | false | Enabling the page level reader for compaction reduces memory usage while compacting a larger number of segments. It allows reading page by page instead of reading the whole blocklet into memory. **NOTE:** Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData and the concept of pages.|
| carbon.concurrent.compaction | true | Compaction of different tables can be executed concurrently. This configuration determines whether to compact all qualifying tables in parallel or not. **NOTE:** Compacting concurrently is a resource-demanding operation and needs more resources, thereby also affecting query performance. This configuration is **deprecated** and might be removed in future releases. |
| carbon.compaction.prefetch.enable | false | A compaction operation is similar to Query + data load, wherein data from qualifying segments is queried and data loading is performed to generate a new single segment. This configuration determines whether to query data ahead from segments and feed it for data loading. **NOTE:** This configuration is disabled by default as it needs extra resources for querying extra data. Based on the memory availability on the cluster, user can enable it to imp [...]
| carbon.enable.range.compaction | true | Configures whether range-based compaction is to be used for RANGE_COLUMN. If true, the data remains partitioned into ranges after compaction as well. |
| carbon.si.segment.merge | false | Setting this to true degrades LOAD performance. When the number of small files increases for SI segments (this can happen since the number of columns is small and we store position id and reference columns), the user can either set this to true, which will merge the data files for upcoming loads, or run the SI refresh command, which does this job for all segments (REFRESH INDEX <index_table>). |
| carbon.partition.data.on.tasklevel | false | When enabled, tasks launched for a local sort partition load will be based on one node one task, and compaction will be performed at task level for a partition. Load performance might be degraded because the number of tasks launched is equal to the number of nodes in case of local sort. For compaction, memory consumption will be lower, as more tasks will be launched for a partition. |
+| carbon.minor.compaction.size | (none) | Minor compaction originally worked based on the number of segments (4 by default). However, in that scenario there was no control over the size of segments to be compacted. This parameter was introduced to exclude segments whose size is greater than the configured threshold, so that the overall IO and time taken decrease. |
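A hedged sketch of combining the size cap with a minor compaction; the threshold is assumed to be in MB and the table name is a placeholder:

```
// Assumption: value is interpreted in MB. Segments above ~512 MB are
// excluded from the minor compaction triggered below.
carbon.sql("SET carbon.minor.compaction.size=512")
carbon.sql("ALTER TABLE sales COMPACT 'MINOR'")
```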
## Query Configuration
@@ -151,6 +161,16 @@ This section provides the details of all the
configurations required for the Car
| carbon.partition.max.driver.lru.cache.size | -1 | Maximum memory **(in MB)** up to which the driver can cache partition metadata. Beyond this, the least recently used data will be removed from the cache before loading a new set of values. |
| carbon.mapOrderPushDown.<db_name>_<table_name>.column | empty | If the order by column is in the sort columns, specify that sort column here to avoid ordering at the map task. |
| carbon.metacache.expiration.seconds | Long.MAX_VALUE | Expiration time **(in
seconds)** for tableInfo cache in CarbonMetadata and tableModifiedTime in
CarbonFileMetastore, after the time configured since last access to the cache
entry, tableInfo and tableModifiedTime will be removed from each cache. Recent
access will refresh the timer. Default value of Long.MAX_VALUE means the cache
will not be expired by time. **NOTE:** At the time when cache is being expired,
queries on the table ma [...]
+| is.driver.instance | false | This parameter decides whether the LRU cache for storing indexes needs to be created on the driver. By default, it is created on the executors. |
+| carbon.input.metrics.update.interval | 500000 | This property determines the number of records queried after which input metrics are updated to Spark. It can also be set dynamically within the Spark session itself. |
+| carbon.use.bitset.pipe.line | true | CarbonData has various optimizations for faster query execution, and setting this property speeds up filter queries. If set to true, the bitset is passed from one filter to the next, resulting in incremental filtering and improving overall performance. |
+
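Since carbon.input.metrics.update.interval is dynamically configurable, it can be tuned per session; a sketch with an illustrative value:

```
// Update Spark input metrics every 100000 records instead of the default 500000.
carbon.sql("SET carbon.input.metrics.update.interval=100000")
```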
+## Index Configuration
+| Parameter | Default Value | Description |
+|--------------------------------------|---------------|---------------------------------------------------|
+| carbon.lucene.index.stop.words | false | By default, Lucene does not create an index for stop words like 'is', 'the' etc. This flag is used to override that behaviour. |
+| carbon.load.dateformat.setlenient.enable | false | This property enables lenient parsing of timestamp/date data in the load flow if parsing fails with an invalid timestamp data error. For example: 1941-03-15 00:00:00 is a valid time in the Asia/Calcutta zone but is invalid and will fail to parse in the Asia/Shanghai zone, as DST was observed there and clocks were turned forward 1 hour to 1941-03-15 01:00:00. |
+| carbon.indexserver.tempfolder.deletetime | 10800000 | This specifies the time period in milliseconds after which the temp folder gets deleted from the index server. |
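A minimal sketch (assuming a CarbonData-enabled Spark build) of overriding the Lucene stop-word behaviour via the CarbonProperties API; equivalently, the key can be added to carbon.properties before the session starts:

```
import org.apache.carbondata.core.util.CarbonProperties

// Index stop words such as 'is' and 'the' as well (default behaviour skips them).
CarbonProperties.getInstance().addProperty("carbon.lucene.index.stop.words", "true")
```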
## Data Mutation Configuration
| Parameter | Default Value | Description |
@@ -237,6 +257,9 @@ RESET
| carbon.enable.index.server | To use index server for caching
and pruning. This property can be used for a session or for a particular table
with ***carbon.enable.index.server.<db_name>.<table_name>***. |
| carbon.reorder.filter | This property can be used to enable/disable filter reordering. Should be disabled only when the user has optimized the filter condition. |
| carbon.mapOrderPushDown.<db_name>_<table_name>.column | If the order by column is in the sort columns, specify that sort column here to avoid ordering at the map task. |
+| carbon.load.dateformat.setlenient.enable | To enable parsing of
timestamp/date data in load flow if the parsing fails with invalid timestamp
data error. **NOTE** Refer to [Index
Configuration](#index-configuration)#carbon.load.dateformat.setlenient.enable
for detailed information. |
+| carbon.minor.compaction.size | Puts an upper limit on the size of segments
to be included for compaction. **NOTE** Refer to [Compaction
Configuration](#compaction-configuration)#carbon.minor.compaction.size for
detailed information. |
+| carbon.input.metrics.update.interval | Determines the number of records queried after which input metrics are updated to Spark. **NOTE** Refer to [Query Configuration](#query-configuration)#carbon.input.metrics.update.interval for detailed information. |
**Examples:**
* Add or Update:
diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index b37b3ab..dbf616b 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -641,7 +641,7 @@ CarbonData DDL statements are documented here,which
includes:
This function creates a new database. By default the database is created in the location 'spark.sql.warehouse.dir', but you can also specify a custom location by configuring 'spark.sql.warehouse.dir'; the configuration 'carbon.storelocation' has been deprecated.
**Note:**
- For simplicity, we recommended you remove the configuration of
carbon.storelocation. If carbon.storelocaiton and spark.sql.warehouse.dir are
configured to different paths, exception will be thrown when CREATE DATABASE
and DROP DATABASE to avoid inconsistent database location.
+ For simplicity, we recommend you remove the configuration of carbon.storelocation. If carbon.storelocation and spark.sql.warehouse.dir are configured to different paths, an exception will be thrown on CREATE DATABASE and DROP DATABASE to avoid an inconsistent database location.
```
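A hedged sketch of the recommended setup, with a placeholder database name: leave carbon.storelocation unset and let 'spark.sql.warehouse.dir' drive the database location:

```
// With carbon.storelocation removed, the database lands under
// spark.sql.warehouse.dir, keeping CREATE/DROP DATABASE consistent.
carbon.sql("CREATE DATABASE IF NOT EXISTS sample_db")
```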
diff --git a/docs/quick-start-guide.md b/docs/quick-start-guide.md
index 4782917..0d9cee1 100644
--- a/docs/quick-start-guide.md
+++ b/docs/quick-start-guide.md
@@ -259,7 +259,7 @@ carbon.sql(
3. Add the carbonlib folder path in the Spark classpath. (Edit
`$SPARK_HOME/conf/spark-env.sh` file and modify the value of `SPARK_CLASSPATH`
by appending `$SPARK_HOME/carbonlib/*` to the existing value)
-4. Copy the `./conf/carbon.properties.template` file from CarbonData
repository to `$SPARK_HOME/conf/` folder and rename the file to
`carbon.properties`.
+4. Copy the `./conf/carbon.properties.template` file from the CarbonData repository to the `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`. All the CarbonData-related properties are configured in this file (see the sanity-check sketch after this list).
5. Repeat Step 2 to Step 5 in all the nodes of the cluster.
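Once carbon.properties is in place, a quick sanity check can be run from the CarbonData-enabled spark-shell; a sketch (getProperty returns null for keys that are not set):

```
import org.apache.carbondata.core.util.CarbonProperties

// Hypothetical check that a key configured in carbon.properties was picked up.
val fmt = CarbonProperties.getInstance().getProperty("carbon.date.format")
println(s"carbon.date.format = $fmt")
```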
@@ -304,7 +304,7 @@ carbon.sql(
**NOTE**: Create the carbonlib folder if it does not exist inside the `$SPARK_HOME` path.
-2. Copy the `./conf/carbon.properties.template` file from CarbonData
repository to `$SPARK_HOME/conf/` folder and rename the file to
`carbon.properties`.
+2. Copy the `./conf/carbon.properties.template` file from the CarbonData repository to the `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`. All the CarbonData-related properties are configured in this file.
3. Create a `tar.gz` file of the carbonlib folder and move it inside the carbonlib folder.