Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r214954243
  
    --- Diff: docs/useful-tips-on-carbondata.md ---
    @@ -158,18 +156,18 @@
     Recently we did some performance POC on CarbonData for Finance and telecommunication Field. It involved detailed queries and aggregation
     scenarios. After the completion of POC, some of the configurations impacting the performance have been identified and tabulated below :
     
    -  | Parameter | Location | Used For | Description | Tuning |
    -  |---|---|---|---|---|
    -  | carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During data loading, a local temp directory is used to sort the data. This number specifies the minimum number of intermediate files after which the merge sort has to be initiated. | Increasing this parameter to a higher value will improve the load performance. For example, increasing the value from 20 to 100 improved the data load performance from 35 MB/s to more than 50 MB/s. Higher values of this parameter consume more memory during the load. |
    -  | carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more CPUs available, you can increase the number of cores, which will increase the performance. For example, increasing the value from 2 to 4 can roughly double the CSV reading performance. |
    -  | carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and the number of compacted segments to be merged in stage 2. | Each CarbonData load creates one segment; if every load is small in size, it will generate many small files over time, impacting query performance. Configuring this parameter will merge the small segments into one big segment, which will sort the data and improve the performance. For example, in one telecommunication scenario the performance improved about 2 times after minor compaction. |
    -  | spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of tasks started for a Spark shuffle. | The value can be 1 to 2 times the number of executor cores. In an aggregation scenario, reducing the value from 200 to 32 reduced the query time from 17 seconds to 9 seconds. |
    -  | spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for the CarbonData query. | In the bank scenario, giving each executor 4 CPU cores and 15 GB of memory achieved good performance. More is not always better for these two values; they need to be configured properly when resources are limited. For example, in the bank scenario each node has enough CPU (32 cores) but limited memory (64 GB), so we cannot assign more CPU cores without enough memory. With 4 cores and 12 GB per executor, for instance, GC sometimes occurs during the query and degrades the query performance heavily, from 3 seconds to more than 15 seconds; in that case you need to increase the memory or decrease the number of CPU cores. |
    -  | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records returned from the block scan. | This parameter is very important in limit scenarios. For example, if your query limit is 1000 but this value is set to 3000, 3000 records are fetched from the scan while Spark only takes 1000 rows, so the remaining 2000 are wasted. In one finance test case, after setting it to 100, the performance in the limit 1000 scenario increased about 2 times compared to setting this value to 12000. |
    -  | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use YARN local directories for multi-table load disk load balance. | If this is set to true, CarbonData will use YARN local directories for multi-table load disk load balance, which will improve the data load performance. |
    -  | carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance. | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, which will improve the data load performance. Please enable this property when you encounter disk hotspot problems during data loading. |
    -  | carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specifies the name of the compressor used to compress the intermediate sort temp files during the sort procedure in data loading. | The optional values are 'SNAPPY', 'GZIP', 'BZIP2', 'LZ4', 'ZSTD' and empty. By default, empty means that CarbonData will not compress the sort temp files. This parameter is useful if you encounter a disk bottleneck. |
    -  | carbon.load.skewedDataOptimization.enabled | spark/carbonlib/carbon.properties | Data loading | Whether to enable the size-based block allocation strategy for data loading. | When loading, CarbonData will use a file-size-based block allocation strategy for task distribution. It makes sure that all the executors process roughly the same amount of data -- it is useful if the size of your input data files varies widely, say 1MB~1GB. |
    -  | carbon.load.min.size.enabled | spark/carbonlib/carbon.properties | Data loading | Whether to enable the node minimum input data size allocation strategy for data loading. | When loading, CarbonData will use the node minimum input data size allocation strategy for task distribution. It makes sure that each node loads at least the minimum amount of data -- it is useful if your input data files are very small, say 1MB~256MB, and avoids generating a large number of small files. |
    -  
    +| Parameter | Location | Used For | Description | Tuning |
    +|---|---|---|---|---|
    +| carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During data loading, a local temp directory is used to sort the data. This number specifies the minimum number of intermediate files after which the merge sort has to be initiated. | Increasing this parameter to a higher value will improve the load performance. For example, increasing the value from 20 to 100 improved the data load performance from 35 MB/s to more than 50 MB/s. Higher values of this parameter consume more memory during the load. |
    +| carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more CPUs available, you can increase the number of cores, which will increase the performance. For example, increasing the value from 2 to 4 can roughly double the CSV reading performance. |
    +| carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and the number of compacted segments to be merged in stage 2. | Each CarbonData load creates one segment; if every load is small in size, it will generate many small files over time, impacting query performance. Configuring this parameter will merge the small segments into one big segment, which will sort the data and improve the performance. For example, in one telecommunication scenario the performance improved about 2 times after minor compaction. |
    +| spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of tasks started for a Spark shuffle. | The value can be 1 to 2 times the number of executor cores. In an aggregation scenario, reducing the value from 200 to 32 reduced the query time from 17 seconds to 9 seconds. |
    +| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for the CarbonData query. | In the bank scenario, giving each executor 4 CPU cores and 15 GB of memory achieved good performance. More is not always better for these two values; they need to be configured properly when resources are limited. For example, in the bank scenario each node has enough CPU (32 cores) but limited memory (64 GB), so we cannot assign more CPU cores without enough memory. With 4 cores and 12 GB per executor, for instance, GC sometimes occurs during the query and degrades the query performance heavily, from 3 seconds to more than 15 seconds; in that case you need to increase the memory or decrease the number of CPU cores. |
    +| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records returned from the block scan. | This parameter is very important in limit scenarios. For example, if your query limit is 1000 but this value is set to 3000, 3000 records are fetched from the scan while Spark only takes 1000 rows, so the remaining 2000 are wasted. In one finance test case, after setting it to 100, the performance in the limit 1000 scenario increased about 2 times compared to setting this value to 12000. |
    +| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use YARN local directories for multi-table load disk load balance. | If this is set to true, CarbonData will use YARN local directories for multi-table load disk load balance, which will improve the data load performance. |
    +| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance. | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, which will improve the data load performance. Please enable this property when you encounter disk hotspot problems during data loading. |
    +| carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specifies the name of the compressor used to compress the intermediate sort temp files during the sort procedure in data loading. | The optional values are 'SNAPPY', 'GZIP', 'BZIP2', 'LZ4' and empty. By default, empty means that CarbonData will not compress the sort temp files. This parameter is useful if you encounter a disk bottleneck. |
    --- End diff ---
    
    Missed during rebase; have added it back.
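    For anyone picking up these tips, here is a minimal sketch of how the tuned values quoted in the table above might be applied. The numbers are only the examples cited there (100 intermediate files, 4 loading cores, batch size 100, 32 shuffle partitions, 4 cores / 15 GB per executor), not general recommendations, and SNAPPY is just one of the listed compressor options:
    
        # spark/carbonlib/carbon.properties (illustrative values taken from the table)
        carbon.sort.intermediate.files.limit=100
        carbon.number.of.cores.while.loading=4
        carbon.detail.batch.size=100
        carbon.use.local.dir=true
        carbon.use.multiple.temp.dir=true
        carbon.sort.temp.compressor=SNAPPY
    
        # spark/conf/spark-defaults.conf (illustrative values taken from the table)
        spark.sql.shuffle.partitions    32
        spark.executor.cores            4
        spark.executor.memory           15g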

