zhuzhurk commented on a change in pull request #18757:
URL: https://github.com/apache/flink/pull/18757#discussion_r829769397
##########
File path: docs/content/docs/deployment/elastic_scaling.md
##########
@@ -153,7 +153,7 @@ The behavior of Adaptive Scheduler is configured by [all configuration options c
## Adaptive Batch Scheduler
-The Adaptive Batch Scheduler can automatically decide parallelisms of
operators for batch jobs. If an operator is not set with a parallelism, the
scheduler will decide parallelism for it according to the size of its consumed
datasets (Note that the decided parallelism can only be a power of 2, see ["The
decided parallelism can only be a power of 2"](#limitations-2) for details).
This can bring many benefits:
+The Adaptive Batch Scheduler can automatically decide parallelisms of
operators for batch jobs. If an operator is not set with a parallelism, the
scheduler will decide parallelism for it according to the size of its consumed
datasets (Note that the decided parallelism can only be a power of 2, see ["The
decided parallelism will be a power of 2"](#limitations-2) for details). This
can bring many benefits:
Review comment:
How about excluding this note? It is already covered in the limitations
part, and I find it complicates the understanding of the motivation part.
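For context on the hunk above, here is a minimal, hedged Java sketch of the described behavior: an operator whose parallelism is left unset gets its parallelism decided by the scheduler, while an explicitly set parallelism is respected. The config key/value `jobmanager.scheduler: AdaptiveBatch` and all concrete numbers are illustrative assumptions, not part of this diff.

```java
// Hedged sketch: enable the Adaptive Batch Scheduler (assumed config key) and leave
// one operator's parallelism unset so the scheduler decides it from the consumed data size.
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AdaptiveBatchSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed config key/value for enabling the Adaptive Batch Scheduler.
        conf.setString("jobmanager.scheduler", "AdaptiveBatch");
        // execution.batch-shuffle-mode is left at its default (ALL-EXCHANGES-BLOCKING),
        // which is what the scheduler requires.

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromSequence(1, 1_000_000)
                // No setParallelism() here: the scheduler decides this operator's parallelism.
                .map(new MapFunction<Long, Long>() {
                    @Override
                    public Long map(Long value) {
                        return value * 2;
                    }
                })
                .print()
                // An explicitly set parallelism is respected and left untouched.
                .setParallelism(1);

        env.execute("adaptive-batch-sketch");
    }
}
```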
##########
File path: docs/content/docs/deployment/elastic_scaling.md
##########
@@ -191,8 +191,8 @@ Adaptive Batch Scheduler will only decide parallelism for operators whose parall
- **Batch jobs only**: Adaptive Batch Scheduler only supports batch jobs.
Exception will be thrown if a streaming job is submitted.
- **ALL-EXCHANGES-BLOCKING jobs only**: At the moment, Adaptive Batch
Scheduler only supports jobs whose [shuffle mode]({{< ref
"docs/deployment/config" >}}#execution-batch-shuffle-mode) is
`ALL-EXCHANGES-BLOCKING`.
-- **The decided parallelism can only be a power of 2**: In order to make the
subpartitoins evenly consumed by downstream tasks, user should configure the
[`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism) to be a power of 2
(2^N), and the decided parallelism will also be a power of 2 (2^M and M < N).
-- **No support for serveral file APIs**: No support for
`StreamExecutionEnvironment#readFile` `StreamExecutionEnvironment#readTextFile`
`StreamExecutionEnvironment#createInput(FileInputFormat, ...)` and all data
sources using these APIs. When using these APIs, there will be a separate
monitoring task (called a `Custom File Source`) as a predecessor to the actual
data sources, which Adaptive Batch Scheduler cannot handle.
+- **The decided parallelism will be a power of 2**: In order to ensure
downstream tasks to consume the same count of subpartitions, the configuration
option [`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism) should be set to be a
power of 2 (2^N), and the decided parallelism will also be a power of 2 (2^M
and M <= N).
+- **FileInputFormat sources are not supported**: FileInputFormat sources
include `StreamExecutionEnvironment#readFile(...)`
`StreamExecutionEnvironment#readTextFile(...)` and
`StreamExecutionEnvironment#createInput(FileInputFormat, ...)` are not
supported. Users should use the new sources([FileSystem DataStream
Connector]({{< ref "docs/connectors/datastream/filesystem.md" >}}) or
[FileSystem SQL Connector]({{< ref "docs/connectors/table/filesystem.md" >}}))
to read files when using the Adaptive Batch Scheduler.
Review comment:
I would state it like "FileInputFormat sources are not supported,
including ..."
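As a hedged illustration of the recommended replacement, this sketch reads a text file with the new FileSystem DataStream connector (`FileSource`) instead of `StreamExecutionEnvironment#readTextFile`, so no standalone `Custom File Source` monitoring task is created and the job can still be handled by the Adaptive Batch Scheduler. The reader class name `TextLineInputFormat` and the input path are assumptions and may differ between Flink versions.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FileSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded source built on the new Source API, as recommended by the docs
        // when the Adaptive Batch Scheduler is used.
        FileSource<String> source =
                FileSource.forRecordStreamFormat(
                                new TextLineInputFormat(), new Path("/tmp/input"))
                        .build();

        DataStream<String> lines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

        lines.print();
        env.execute("file-source-sketch");
    }
}
```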
##########
File path: docs/content.zh/docs/deployment/elastic_scaling.md
##########
@@ -168,9 +168,9 @@ Adaptive Batch Scheduler 是一种可以自动推导每个算子并行度的批
- 由于 ["只支持所有数据交换都为 BLOCKING 模式的作业"](#局限性-2), 需要将
[`execution.batch-shuffle-mode`]({{< ref "docs/deployment/config"
>}}#execution-batch-shuffle-mode) 配置为 `ALL-EXCHANGES-BLOCKING`(默认值) 。
除此之外,使用 Adaptive Batch Scheduler 时,以下相关配置也可以调整:
-- [`jobmanager.adaptive-batch-scheduler.min-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-min-parallelism): 允许自动设置的并行度最小值
-- [`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism): 允许自动设置的并行度最大值
-- [`jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-avg-data-volume-per-task):
期望每个任务平均处理的数据量大小
+- [`jobmanager.adaptive-batch-scheduler.min-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-min-parallelism): 允许自动设置的并行度最小值。需要配置为
2^N,否则也会被自动调整为 2^N。
Review comment:
Suggest changing "2^N" to "2 的幂" (i.e., "a power of 2"); the N may confuse users.
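For readers skimming the thread, here is a hedged sketch of setting the three tuning options listed in the hunk above on a Flink `Configuration`; the concrete values are illustrative assumptions, not recommendations.

```java
import org.apache.flink.configuration.Configuration;

public class AdaptiveBatchTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Lower bound for the automatically decided parallelism.
        conf.setInteger("jobmanager.adaptive-batch-scheduler.min-parallelism", 8);

        // Upper bound for the automatically decided parallelism; per this thread,
        // a power of 2 is expected here (other values get adjusted automatically).
        conf.setInteger("jobmanager.adaptive-batch-scheduler.max-parallelism", 128);

        // Average amount of data each task is expected to process.
        conf.setString("jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task", "8 gb");

        System.out.println(conf);
    }
}
```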
##########
File path: docs/content.zh/docs/deployment/elastic_scaling.md
##########
@@ -188,8 +188,8 @@ Adaptive Batch Scheduler 只会为用户未指定并行度的算子(并行度
### 局限性
- **只支持批作业**: Adaptive Batch Scheduler 只支持批作业。当提交的是一个流作业时,会抛出异常。
- **只支持所有数据交换都为 BLOCKING 模式的作业**: 目前 Adaptive Batch Scheduler 只支持 [shuffle
mode]({{< ref "docs/deployment/config" >}}#execution-batch-shuffle-mode) 为
ALL-EXCHANGES-BLOCKING 的作业。
-- **推导的并行度只能是 2 的幂次**: 为了使子分区可以均匀分配给下游任务,用户需要将
[`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism) 配置为 2^N, 推导出的并行度会是
2^M, 且满足 M < N。
-- **不支持一些文件操作 API**: 不支持 `StreamExecutionEnvironment#readFile`
`StreamExecutionEnvironment#readTextFile`
`StreamExecutionEnvironment#createInput(FileInputFormat, ...)` 和所有使用了这些 API 的
source. 当使用了这些 API 时,会有一个独立的监控任务 (`Custom File Source`) 在真正的 source 前,Adaptive
Batch Scheduler 无法处理这种情况。
+- **推导的并行度是 2 的幂次**:
为了使子分区可以均匀分配给下游任务,[`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{<
ref "docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism) 应该被配置为 2^N, 推导出的并行度会是
2^M, 且满足 M <= N。
+- **不支持 FileInputFormat 类型的 source**: 不支持 FileInputFormat 类型的 source, 包括
`StreamExecutionEnvironment#readFile(...)`
`StreamExecutionEnvironment#readTextFile(...)` 和
`StreamExecutionEnvironment#createInput(FileInputFormat, ...)`。 当使用 Adaptive
Batch Scheduler 时,用户应该使用新设计的 Source API ([FileSystem DataStream Connector]({{<
ref "docs/connectors/datastream/filesystem.md" >}}) or [FileSystem SQL
Connector]({{< ref "docs/connectors/table/filesystem.md" >}})) 来读取文件.
Review comment:
Suggest changing "新设计" to "新版" (i.e., "newly designed" -> "new").
##########
File path: docs/content.zh/docs/deployment/elastic_scaling.md
##########
@@ -188,8 +188,8 @@ Adaptive Batch Scheduler 只会为用户未指定并行度的算子(并行度
### 局限性
- **只支持批作业**: Adaptive Batch Scheduler 只支持批作业。当提交的是一个流作业时,会抛出异常。
- **只支持所有数据交换都为 BLOCKING 模式的作业**: 目前 Adaptive Batch Scheduler 只支持 [shuffle
mode]({{< ref "docs/deployment/config" >}}#execution-batch-shuffle-mode) 为
ALL-EXCHANGES-BLOCKING 的作业。
-- **推导的并行度只能是 2 的幂次**: 为了使子分区可以均匀分配给下游任务,用户需要将
[`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism) 配置为 2^N, 推导出的并行度会是
2^M, 且满足 M < N。
-- **不支持一些文件操作 API**: 不支持 `StreamExecutionEnvironment#readFile`
`StreamExecutionEnvironment#readTextFile`
`StreamExecutionEnvironment#createInput(FileInputFormat, ...)` 和所有使用了这些 API 的
source. 当使用了这些 API 时,会有一个独立的监控任务 (`Custom File Source`) 在真正的 source 前,Adaptive
Batch Scheduler 无法处理这种情况。
+- **推导的并行度是 2 的幂次**:
为了使子分区可以均匀分配给下游任务,[`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{<
ref "docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism) 应该被配置为 2^N, 推导出的并行度会是
2^M, 且满足 M <= N。
+- **不支持 FileInputFormat 类型的 source**: 不支持 FileInputFormat 类型的 source, 包括
`StreamExecutionEnvironment#readFile(...)`
`StreamExecutionEnvironment#readTextFile(...)` 和
`StreamExecutionEnvironment#createInput(FileInputFormat, ...)`。 当使用 Adaptive
Batch Scheduler 时,用户应该使用新设计的 Source API ([FileSystem DataStream Connector]({{<
ref "docs/connectors/datastream/filesystem.md" >}}) or [FileSystem SQL
Connector]({{< ref "docs/connectors/table/filesystem.md" >}})) 来读取文件.
Review comment:
Suggest replacing the English "or" with "或".
##########
File path: docs/content.zh/docs/deployment/elastic_scaling.md
##########
@@ -168,9 +168,9 @@ Adaptive Batch Scheduler 是一种可以自动推导每个算子并行度的批
- 由于 ["只支持所有数据交换都为 BLOCKING 模式的作业"](#局限性-2), 需要将
[`execution.batch-shuffle-mode`]({{< ref "docs/deployment/config"
>}}#execution-batch-shuffle-mode) 配置为 `ALL-EXCHANGES-BLOCKING`(默认值) 。
除此之外,使用 Adaptive Batch Scheduler 时,以下相关配置也可以调整:
-- [`jobmanager.adaptive-batch-scheduler.min-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-min-parallelism): 允许自动设置的并行度最小值
-- [`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism): 允许自动设置的并行度最大值
-- [`jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-avg-data-volume-per-task):
期望每个任务平均处理的数据量大小
+- [`jobmanager.adaptive-batch-scheduler.min-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-min-parallelism): 允许自动设置的并行度最小值。需要配置为
2^N,否则也会被自动调整为 2^N。
Review comment:
Suggest changing "否则也会被自动调整为 2^N" to "否则也会被自动调整为最接近且大于其的 2 的幂" (i.e., "otherwise it will be automatically adjusted up to the closest power of 2 above it").
##########
File path: docs/content.zh/docs/deployment/elastic_scaling.md
##########
@@ -188,8 +188,8 @@ Adaptive Batch Scheduler 只会为用户未指定并行度的算子(并行度
### 局限性
- **只支持批作业**: Adaptive Batch Scheduler 只支持批作业。当提交的是一个流作业时,会抛出异常。
- **只支持所有数据交换都为 BLOCKING 模式的作业**: 目前 Adaptive Batch Scheduler 只支持 [shuffle
mode]({{< ref "docs/deployment/config" >}}#execution-batch-shuffle-mode) 为
ALL-EXCHANGES-BLOCKING 的作业。
-- **推导的并行度只能是 2 的幂次**: 为了使子分区可以均匀分配给下游任务,用户需要将
[`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism) 配置为 2^N, 推导出的并行度会是
2^M, 且满足 M < N。
-- **不支持一些文件操作 API**: 不支持 `StreamExecutionEnvironment#readFile`
`StreamExecutionEnvironment#readTextFile`
`StreamExecutionEnvironment#createInput(FileInputFormat, ...)` 和所有使用了这些 API 的
source. 当使用了这些 API 时,会有一个独立的监控任务 (`Custom File Source`) 在真正的 source 前,Adaptive
Batch Scheduler 无法处理这种情况。
+- **推导的并行度是 2 的幂次**:
为了使子分区可以均匀分配给下游任务,[`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{<
ref "docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism) 应该被配置为 2^N, 推导出的并行度会是
2^M, 且满足 M <= N。
Review comment:
Suggest changing the heading "推导的并行度是 2 的幂次" to "推导出的并行度是 2 的幂" (i.e., "the derived parallelism is a power of 2").
##########
File path: docs/content.zh/docs/deployment/elastic_scaling.md
##########
@@ -168,9 +168,9 @@ Adaptive Batch Scheduler 是一种可以自动推导每个算子并行度的批
- 由于 ["只支持所有数据交换都为 BLOCKING 模式的作业"](#局限性-2), 需要将
[`execution.batch-shuffle-mode`]({{< ref "docs/deployment/config"
>}}#execution-batch-shuffle-mode) 配置为 `ALL-EXCHANGES-BLOCKING`(默认值) 。
除此之外,使用 Adaptive Batch Scheduler 时,以下相关配置也可以调整:
-- [`jobmanager.adaptive-batch-scheduler.min-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-min-parallelism): 允许自动设置的并行度最小值
-- [`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism): 允许自动设置的并行度最大值
-- [`jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-avg-data-volume-per-task):
期望每个任务平均处理的数据量大小
+- [`jobmanager.adaptive-batch-scheduler.min-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-min-parallelism): 允许自动设置的并行度最小值。需要配置为
2^N,否则也会被自动调整为 2^N。
+- [`jobmanager.adaptive-batch-scheduler.max-parallelism`]({{< ref
"docs/deployment/config"
>}}#jobmanager-adaptive-batch-scheduler-max-parallelism): 允许自动设置的并行度最大值。需要配置为
2^N,否则也会被自动调整为 2^N。
Review comment:
Suggest changing "否则也会被自动调整为 2^N" to "否则也会被自动调整为最接近且小于其的 2 的幂" (i.e., "otherwise it will be automatically adjusted down to the closest power of 2 below it").
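To make the two suggested adjustments concrete, here is a small illustrative sketch (not Flink's actual implementation): a min-parallelism that is not already a power of 2 is adjusted up to the closest power of 2 above it, and a max-parallelism that is not already a power of 2 is adjusted down to the closest power of 2 below it.

```java
public class PowerOfTwoAdjustment {
    // Smallest power of 2 that is >= n (illustrative only).
    static int roundUpToPowerOfTwo(int n) {
        int p = 1;
        while (p < n) {
            p <<= 1;
        }
        return p;
    }

    // Largest power of 2 that is <= n (illustrative only).
    static int roundDownToPowerOfTwo(int n) {
        return Integer.highestOneBit(n);
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(12));    // min-parallelism 12 -> 16
        System.out.println(roundDownToPowerOfTwo(100)); // max-parallelism 100 -> 64
    }
}
```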
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]