[doris] branch master updated: add the batch interval time of sink in spark connector doc (#16501)

jiafengzheng Wed, 08 Feb 2023 16:39:48 -0800

This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git



The following commit(s) were added to refs/heads/master by this push:
     new 2d7a9c9c11 add the batch interval time of sink in spark connector doc (#16501)
2d7a9c9c11 is described below

commit 2d7a9c9c1157ee9f3fed242f739c609ad8fcdecc
Author: Hong Liu <[email protected]>
AuthorDate: Thu Feb 9 08:39:30 2023 +0800

    add the batch interval time of sink in spark connector doc (#16501)
---
 docs/en/docs/ecosystem/spark-doris-connector.md    | 1 +
 docs/zh-CN/docs/ecosystem/spark-doris-connector.md | 1 +
 2 files changed, 2 insertions(+)

diff --git a/docs/en/docs/ecosystem/spark-doris-connector.md b/docs/en/docs/ecosystem/spark-doris-connector.md
index 8a7545f478..56d403f2e7 100644
--- a/docs/en/docs/ecosystem/spark-doris-connector.md
+++ b/docs/en/docs/ecosystem/spark-doris-connector.md
@@ -275,6 +275,7 @@ kafkaSource.selectExpr("CAST(key AS STRING)", "CAST(value as STRING)")
 | sink.properties.*     | --               | The stream load parameters.<br /> <br /> eg:<br /> sink.properties.column_separator' = ','<br /> <br /> |
 | doris.sink.task.partition.size | --                | The number of partitions corresponding to the Writing task. After filtering and other operations, the number of partitions written in Spark RDD may be large, but the number of records corresponding to each Partition is relatively small, resulting in increased writing frequency and waste of computing resources. The smaller this value is set, the less Doris write frequency and less Doris merge pressure. It is generally used with doris. [...]
 | doris.sink.task.use.repartition | false             | Whether to use repartition mode to control the number of partitions written by Doris. The default value is false, and coalesce is used (note: if there is no Spark action before the write, the whole computation will be less parallel). If it is set to true, then repartition is used (note: you can set the final number of partitions at the cost of shuffle). |
+| doris.sink.batch.interval.ms | 50 | The interval time of each batch sink, unit ms. |
 
 ### SQL & Dataframe Configuration
 
diff --git a/docs/zh-CN/docs/ecosystem/spark-doris-connector.md b/docs/zh-CN/docs/ecosystem/spark-doris-connector.md
index 209e53d34e..6c0b8a5c17 100644
--- a/docs/zh-CN/docs/ecosystem/spark-doris-connector.md
+++ b/docs/zh-CN/docs/ecosystem/spark-doris-connector.md
@@ -283,6 +283,7 @@ kafkaSource.selectExpr("CAST(key AS STRING)", "CAST(value as STRING)")
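 | sink.properties.*                | --                | Stream Load import parameters.<br/>For example: 'sink.properties.column_separator' = ', ' |
 | doris.sink.task.partition.size   | --                | The number of partitions for the Doris write task. After filtering and similar operations, a Spark RDD may end up writing a large number of partitions, each holding relatively few records, which increases write frequency and wastes computing resources.<br/>The smaller this value, the lower the Doris write frequency and the less merge pressure on Doris. This parameter is used together with doris.sink.task.use.repartition. |
 | doris.sink.task.use.repartition  | false             | Whether to use repartition to control the number of partitions written to Doris. The default is false, which uses coalesce (note: if there is no Spark action before the write, the parallelism of the whole computation may drop).<br/>If set to true, repartition is used (note: the final number of partitions can be set, at the cost of an extra shuffle). |
+| doris.sink.batch.interval.ms | 50 | The interval time of each batch sink, in ms. |
 
 ### SQL and Dataframe-specific configuration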
 

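As a rough illustration of how the options touched by this commit might be assembled for a write, here is a minimal Python sketch. The helper `doris_sink_options` and its argument names are hypothetical and not part of the connector. The keys `doris.sink.batch.interval.ms`, `doris.sink.task.partition.size`, `doris.sink.task.use.repartition` and the `sink.properties.` prefix come from the table in the diff above. The keys `doris.table.identifier` and `doris.fenodes` are assumed to be the connector's general options and do not appear in this diff, and the table name and FE address are placeholders.

```python
from typing import Dict, Optional


def doris_sink_options(
    table_identifier: str,
    fenodes: str,
    batch_interval_ms: int = 50,
    task_partition_size: Optional[int] = None,
    use_repartition: bool = False,
    stream_load_props: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build a string-to-string option map for a Doris sink (hypothetical helper)."""
    if batch_interval_ms < 0:
        raise ValueError("batch_interval_ms must be non-negative")

    opts = {
        # Assumed general connector options; not shown in the diff above.
        "doris.table.identifier": table_identifier,
        "doris.fenodes": fenodes,
        # Documented by this commit: interval of each batch sink, in ms (default 50).
        "doris.sink.batch.interval.ms": str(batch_interval_ms),
        # false -> coalesce, true -> repartition (costs a shuffle).
        "doris.sink.task.use.repartition": str(use_repartition).lower(),
    }
    # The partition-size option has no default ("--"), so it is only set on request.
    if task_partition_size is not None:
        opts["doris.sink.task.partition.size"] = str(task_partition_size)
    # Stream load parameters are passed through under the sink.properties. prefix.
    for key, value in (stream_load_props or {}).items():
        opts["sink.properties." + key] = value
    return opts


opts = doris_sink_options(
    "example_db.example_tbl",  # placeholder table
    "fe_host:8030",            # placeholder FE address
    batch_interval_ms=100,
    task_partition_size=4,
    stream_load_props={"column_separator": ","},
)
```

With a live Spark session and the connector on the classpath, a map like this would presumably be handed to the writer along the lines of `df.write.format("doris").options(**opts).save()`; that call is not exercised here.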

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
