This is an automated email from the ASF dual-hosted git repository.

kunalkapoor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git


The following commit(s) were added to refs/heads/master by this push:
     new b31c2a5  [CARBONDATA-3791]: Query and Compaction changes in configuration parameters
b31c2a5 is described below

commit b31c2a537133762ab596bd7bda73fe486f946b9b
Author: Vikram Ahuja <[email protected]>
AuthorDate: Wed May 6 20:27:13 2020 +0530

    [CARBONDATA-3791]: Query and Compaction changes in configuration parameters
    
    Why is this PR needed?
    Query and Compaction changes in configuration parameters as per the code.
    
    What changes were proposed in this PR?
    Query and Compaction changes in configuration parameters as per the code.
    
    This closes #3742
---
 docs/configuration-parameters.md | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index 9428e00..e4a8159 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -109,7 +109,7 @@ This section provides the details of all the configurations required for the Car
 | carbon.numberof.preserve.segments | 0 | If the user wants to preserve some number of segments from being compacted then he can set this configuration. Example: carbon.numberof.preserve.segments = 2 then 2 latest segments will always be excluded from the compaction. No segments will be preserved by default. **NOTE:** This configuration is useful when the chances of input data can be wrong due to environment scenarios. Preserving some of the latest segments from being compacted can help  [...]
 | carbon.allowed.compaction.days | 0 | This configuration is used to control on the number of recent segments that needs to be compacted, ignoring the older ones. This configuration is in days. For Example: If the configuration is 2, then the segments which are loaded in the time frame of past 2 days only will get merged. Segments which are loaded earlier than 2 days will not be merged. This configuration is disabled by default. **NOTE:** This configuration is useful when a bulk of histo [...]
 | carbon.enable.auto.load.merge | false | Compaction can be automatically triggered once data load completes. This ensures that the segments are merged in time and thus query times does not increase with increase in segments. This configuration enables to do compaction along with data loading. **NOTE:** Compaction will be triggered once the data load completes. But the status of data load wait till the compaction is completed. Hence it might look like data loading time has increased, but [...]
-| carbon.enable.page.level.reader.in.compaction|true|Enabling page level reader for compaction reduces the memory usage while compacting more number of segments. It allows reading only page by page instead of reading whole blocklet to memory. **NOTE:** Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData and concepts of pages.|
+| carbon.enable.page.level.reader.in.compaction|false|Enabling page level reader for compaction reduces the memory usage while compacting more number of segments. It allows reading only page by page instead of reading whole blocklet to memory. **NOTE:** Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData and concepts of pages.|
 | carbon.concurrent.compaction | true | Compaction of different tables can be executed concurrently. This configuration determines whether to compact all qualifying tables in parallel or not. **NOTE:** Compacting concurrently is a resource demanding operation and needs more resources there by affecting the query performance also. This configuration is **deprecated** and might be removed in future releases. |
 | carbon.compaction.prefetch.enable | false | Compaction operation is similar to Query + data load where in data from qualifying segments are queried and data loading performed to generate a new single segment. This configuration determines whether to query ahead data from segments and feed it for data loading. **NOTE:** This configuration is disabled by default as it needs extra resources for querying extra data. Based on the memory availability on the cluster, user can enable it to imp [...]
 | carbon.merge.index.in.segment | true | Each CarbonData file has a companion CarbonIndex file which maintains the metadata about the data. These CarbonIndex files are read and loaded into driver and is used subsequently for pruning of data during queries. These CarbonIndex files are very small in size(few KB) and are many. Reading many small files from HDFS is not efficient and leads to slow IO performance. Hence these CarbonIndex files belonging to a segment can be combined into  a sin [...]
@@ -126,12 +126,6 @@ This section provides the details of all the configurations required for the Car
 | carbon.enableMinMax | true | CarbonData maintains the metadata which enables to prune unnecessary files from being scanned as per the query conditions. To achieve pruning, Min,Max of each column is maintined.Based on the filter condition in the query, certain data can be skipped from scanning by matching the filter value against the min,max values of the column(s) present in that carbondata file. This pruning enhances query performance significantly. |
 | carbon.dynamical.location.scheduler.timeout | 5 | CarbonData has its own scheduling algorithm to suggest to Spark on how many tasks needs to be launched and how much work each task need to do in a Spark cluster for any query on CarbonData. To determine the number of tasks that can be scheduled, knowing the count of active executors is necessary. When dynamic allocation is enabled on a YARN based spark cluster, executor processes are shutdown if no request is received for a particular a [...]
 | carbon.scheduler.min.registered.resources.ratio | 0.8 | Specifies the minimum resource (executor) ratio needed for starting the block distribution. The default value is 0.8, which indicates 80% of the requested resource is allocated for starting block distribution. The minimum value is 0.1 min and the maximum value is 1.0. |
-| carbon.search.enabled (Alpha Feature) | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. |
-| carbon.search.query.timeout | 10s | Time within which the result is expected from the workers, beyond which the query is terminated |
-| carbon.search.scan.thread | num of cores available in worker node | Number of cores to be used in each worker for performing scan. |
-| carbon.search.master.port | 10020 | Port on which the search master listens for incoming query requests |
-| carbon.search.worker.port | 10021 | Port on which search master communicates with the workers. |
-| carbon.search.worker.workload.limit | 10 * *carbon.search.scan.thread* | Maximum number of active requests that can be sent to a worker. Beyond which the request needs to be rescheduled for later time or to a different worker. |
 | carbon.detail.batch.size | 100 | The buffer size to store records, returned from the block scan. In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12 [...]
 | carbon.enable.vector.reader | true | Spark added vector processing to optimize cpu cache miss and there by increase the query performance. This configuration enables to fetch data as columnar batch of size 4*1024 rows instead of fetching data row by row and provide it to spark so that there is improvement in  select queries performance. |
 | carbon.task.distribution | block | CarbonData has its own scheduling algorithm to suggest to Spark on how many tasks needs to be launched and how much work each task need to do in a Spark cluster for any query on CarbonData. Each of these task distribution suggestions has its own advantages and disadvantages. Based on the customer use case, appropriate task distribution can be configured.**block**: Setting this value will launch one task per block. This setting is suggested in case of  [...]
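
For readers skimming the diff: the only behavioural documentation change in this commit is the default of carbon.enable.page.level.reader.in.compaction moving from true to false. As a minimal sketch (property names are taken from the table above; the values shown are illustrative, not recommendations), the compaction-related options discussed here would be set in carbon.properties like so:

```properties
# Keep the 2 most recent segments out of compaction (default 0: preserve none)
carbon.numberof.preserve.segments=2
# Only compact segments loaded within the last 2 days (default 0: disabled)
carbon.allowed.compaction.days=2
# Trigger compaction automatically after each data load (default false)
carbon.enable.auto.load.merge=true
# Page-level reader during compaction; default documented as false by this commit
carbon.enable.page.level.reader.in.compaction=false
```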
