GitHub user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2592#discussion_r207083747
--- Diff: docs/useful-tips-on-carbondata.md ---
@@ -158,18 +156,18 @@
Recently we did some performance POC on CarbonData for Finance and
telecommunication Field. It involved detailed queries and aggregation
scenarios. After the completion of POC, some of the configurations
impacting the performance have been identified and tabulated below :
- | Parameter | Location | Used For | Description | Tuning |
-
|----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
- | carbon.sort.intermediate.files.limit |
spark/carbonlib/carbon.properties | Data loading | During the loading of data,
local temp is used to sort the data. This number specifies the minimum number
of intermediate files after which the merge sort has to be initiated. |
Increasing the parameter to a higher value will improve the load performance.
For example, when we increase the value from 20 to 100, it increases the data
load performance from 35MB/S to more than 50MB/S. Higher values of this
parameter consumes more memory during the load. |
- | carbon.number.of.cores.while.loading |
spark/carbonlib/carbon.properties | Data loading | Specifies the number of
cores used for data processing during data loading in CarbonData. | If you have
more number of CPUs, then you can increase the number of CPUs, which will
increase the performance. For example if we increase the value from 2 to 4 then
the CSV reading performance can increase about 1 times |
- | carbon.compaction.level.threshold | spark/carbonlib/carbon.properties
| Data loading and Querying | For minor compaction, specifies the number of
segments to be merged in stage 1 and number of compacted segments to be merged
in stage 2. | Each CarbonData load will create one segment, if every load is
small in size it will generate many small file over a period of time impacting
the query performance. Configuring this parameter will merge the small segment
to one big segment which will sort the data and improve the performance. For
Example in one telecommunication scenario, the performance improves about 2
times after minor compaction. |
- | spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf |
Querying | The number of task started when spark shuffle. | The value can be 1
to 2 times as much as the executor cores. In an aggregation scenario, reducing
the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
- | spark.executor.instances/spark.executor.cores/spark.executor.memory |
spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores,
and memory used for CarbonData query. | In the bank scenario, we provide the 4
CPUs cores and 15 GB for each executor which can get good performance. This 2
value does not mean more the better. It needs to be configured properly in case
of limited resources. For example, In the bank scenario, it has enough CPU 32
cores each node but less memory 64 GB each node. So we cannot give more CPU but
less memory. For example, when 4 cores and 12GB for each executor. It sometimes
happens GC during the query which impact the query performance very much from
the 3 second to more than 15 seconds. In this scenario need to increase the
memory or decrease the CPU cores. |
- | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data
loading | The buffer size to store records, returned from the block scan. | In
limit scenario this parameter is very important. For example your query limit
is 1000. But if we set this value to 3000 that means we get 3000 records from
scan but spark will only take 1000 rows. So the 2000 remaining are useless. In
one Finance test case after we set it to 100, in the limit 1000 scenario the
performance increase about 2 times in comparison to if we set this value to
12000. |
- | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data
loading | Whether use YARN local directories for multi-table load disk load
balance | If this is set it to true CarbonData will use YARN local directories
for multi-table load disk load balance, that will improve the data load
performance. |
- | carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties |
Data loading | Whether to use multiple YARN local directories during table data
loading for disk load balance | After enabling 'carbon.use.local.dir', if this
is set to true, CarbonData will use all YARN local directories during data load
for disk load balance, that will improve the data load performance. Please
enable this property when you encounter disk hotspot problem during data
loading. |
- | carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data
loading | Specify the name of compressor to compress the intermediate sort
temporary files during sort procedure in data loading. | The optional values
are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' and empty. By default, empty means
that Carbondata will not compress the sort temp files. This parameter will be
useful if you encounter disk bottleneck. |
- | carbon.load.skewedDataOptimization.enabled |
spark/carbonlib/carbon.properties | Data loading | Whether to enable size based
block allocation strategy for data loading. | When loading, carbondata will use
file size based block allocation strategy for task distribution. It will make
sure that all the executors process the same size of data -- It's useful if the
size of your input data files varies widely, say 1MB~1GB. |
- | carbon.load.min.size.enabled | spark/carbonlib/carbon.properties |
Data loading | Whether to enable node minumun input data size allocation
strategy for data loading.| When loading, carbondata will use node minumun
input data size allocation strategy for task distribution. It will make sure
the node load the minimum amount of data -- It's useful if the size of your
input data files very small, say 1MB~256MB,Avoid generating a large number of
small files. |
-
+| Parameter | Location | Used For | Description | Tuning |
+|----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties
| Data loading | During the loading of data, local temp is used to sort the
data. This number specifies the minimum number of intermediate files after
which the merge sort has to be initiated. | Increasing the parameter to a
higher value will improve the load performance. For example, when we increase
the value from 20 to 100, it increases the data load performance from 35MB/S to
more than 50MB/S. Higher values of this parameter consumes more memory during
the load. |
+| carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties
| Data loading | Specifies the number of cores used for data processing during
data loading in CarbonData. | If you have more number of CPUs, then you can
increase the number of CPUs, which will increase the performance. For example
if we increase the value from 2 to 4 then the CSV reading performance can
increase about 1 times |
+| carbon.compaction.level.threshold | spark/carbonlib/carbon.properties |
Data loading and Querying | For minor compaction, specifies the number of
segments to be merged in stage 1 and number of compacted segments to be merged
in stage 2. | Each CarbonData load will create one segment, if every load is
small in size it will generate many small file over a period of time impacting
the query performance. Configuring this parameter will merge the small segment
to one big segment which will sort the data and improve the performance. For
Example in one telecommunication scenario, the performance improves about 2
times after minor compaction. |
+| spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying
| The number of task started when spark shuffle. | The value can be 1 to 2
times as much as the executor cores. In an aggregation scenario, reducing the
number from 200 to 32 reduced the query time from 17 to 9 seconds. |
+| spark.executor.instances/spark.executor.cores/spark.executor.memory |
spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores,
and memory used for CarbonData query. | In the bank scenario, we provide the 4
CPUs cores and 15 GB for each executor which can get good performance. This 2
value does not mean more the better. It needs to be configured properly in case
of limited resources. For example, In the bank scenario, it has enough CPU 32
cores each node but less memory 64 GB each node. So we cannot give more CPU but
less memory. For example, when 4 cores and 12GB for each executor. It sometimes
happens GC during the query which impact the query performance very much from
the 3 second to more than 15 seconds. In this scenario need to increase the
memory or decrease the CPU cores. |
+| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data
loading | The buffer size to store records, returned from the block scan. | In
limit scenario this parameter is very important. For example your query limit
is 1000. But if we set this value to 3000 that means we get 3000 records from
scan but spark will only take 1000 rows. So the 2000 remaining are useless. In
one Finance test case after we set it to 100, in the limit 1000 scenario the
performance increase about 2 times in comparison to if we set this value to
12000. |
+| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading
| Whether use YARN local directories for multi-table load disk load balance |
If this is set it to true CarbonData will use YARN local directories for
multi-table load disk load balance, that will improve the data load
performance. |
+| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data
loading | Whether to use multiple YARN local directories during table data
loading for disk load balance | After enabling 'carbon.use.local.dir', if this
is set to true, CarbonData will use all YARN local directories during data load
for disk load balance, that will improve the data load performance. Please
enable this property when you encounter disk hotspot problem during data
loading. |
+| carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data
loading | Specify the name of compressor to compress the intermediate sort
temporary files during sort procedure in data loading. | The optional values
are 'SNAPPY','GZIP','BZIP2','LZ4' and empty. By default, empty means that
Carbondata will not compress the sort temp files. This parameter will be useful
if you encounter disk bottleneck. |
--- End diff --
Why did we remove 'ZSTD' from the list of supported compressors?
---