mddxhj commented on code in PR #19503:
URL: https://github.com/apache/flink/pull/19503#discussion_r906670868
##########
docs/content.zh/docs/ops/state/large_state_tuning.md:
##########
@@ -26,122 +26,102 @@ under the License.
# 大状态与 Checkpoint 调优
-This page gives a guide how to configure and tune applications that use large
state.
+本文提供了如何配置和调整使用大状态的应用程序指南。
-## Overview
+## 概述
-For Flink applications to run reliably at large scale, two conditions must be
fulfilled:
+Flink 应用要想在大规模场景下可靠地运行,必须要满足如下两个条件:
- - The application needs to be able to take checkpoints reliably
+ - 应用程序需要能够可靠地创建 checkpoints。
- - The resources need to be sufficient catch up with the input data streams
after a failure
+ - 在应用故障后,需要有足够的资源追赶数据输入流。
-The first sections discuss how to get well performing checkpoints at scale.
-The last section explains some best practices concerning planning how many
resources to use.
+第一部分讨论如何大规模获得良好性能的 checkpoints。
+后一部分解释了一些关于要规划使用多少资源的最佳实践。
-## Monitoring State and Checkpoints
+## 监控状态和 Checkpoints
-The easiest way to monitor checkpoint behavior is via the UI's checkpoint
section. The documentation
-for [checkpoint monitoring]({{< ref
"docs/ops/monitoring/checkpoint_monitoring" >}}) shows how to access the
available checkpoint
-metrics.
+监控 checkpoint 行为最简单的方法是通过 UI 的 checkpoint 部分。 [监控 Checkpoint]({{< ref
"docs/ops/monitoring/checkpoint_monitoring" >}}) 的文档说明了如何查看可用的 checkpoint 指标。
-The two numbers (both exposed via Task level [metrics]({{< ref
"docs/ops/metrics" >}}#checkpointing)
-and in the [web interface]({{< ref "docs/ops/monitoring/checkpoint_monitoring"
>}})) that are of particular interest when scaling
-up checkpoints are:
+这两个指标(均通过 Task 级别 [Checkpointing 指标]({{< ref "docs/ops/metrics"
>}}#checkpointing) 展示)
+以及在 [监控 Checkpoint]({{< ref "docs/ops/monitoring/checkpoint_monitoring"
>}}))中,当看 checkpoint 详细信息时,特别有趣的是:
- - The time until operators receive their first checkpoint barrier
- When the time to trigger the checkpoint is constantly very high, it means
that the *checkpoint barriers* need a long
- time to travel from the source to the operators. That typically indicates
that the system is operating under a
- constant backpressure.
+ - 算子收到第一个 checkpoint barrier 的时间。当触发 checkpoint 的延迟时间一直很高时,这意味着 *checkpoint
barrier* 需要很长时间才能从 source 到达 operators。 这通常表明系统处于反压下运行。
- - The alignment duration, which is defined as the time between receiving
first and the last checkpoint barrier.
- During unaligned `exactly-once` checkpoints and `at-least-once`
checkpoints subtasks are processing all of the
- data from the upstream subtasks without any interruptions. However with
aligned `exactly-once` checkpoints,
- the channels that have already received a checkpoint barrier are blocked
from sending further data until
- all of the remaining channels catch up and receive theirs checkpoint
barriers (alignment time).
+ - Alignment Duration,为处理第一个和最后一个 checkpoint barrier 之间的时间。在 unaligned
checkpoints 下,`exactly-once` 和 `at-least-once` checkpoints 的 subtasks 处理来自上游
subtasks 的所有数据,且没有任何中断。
+ 然而,对于 aligned `exactly-once` checkpoints,已经收到 checkpoint barrier
的通道被阻止继续发送数据,直到所有剩余的通道都赶上并接收它们的 checkpoint barrier(对齐时间)。
-Both of those values should ideally be low - higher amounts means that
checkpoint barriers traveling through the job graph
-slowly, due to some back-pressure (not enough resources to process the
incoming records). This can also be observed
-via increased end-to-end latency of processed records. Note that those numbers
can be occasionally high in the presence of
-a transient backpressure, data skew, or network issues.
+理想情况下,这两个值都应该很低 - 较高的数值意味着 由于存在反压(没有足够的资源来处理传入的记录),导致checkpoint barriers
在作业中的移动速度较慢,这也可以通过处理记录的端到端延迟在增加来观察到。
+请注意,在出现瞬态反压、数据倾斜或网络问题时,这些数值偶尔会很高。
-[Unaligned checkpoints]({{< ref "docs/ops/state/checkpoints"
>}}#unaligned-checkpoints) can be used to speed up the propagation time
-of the checkpoint barriers. However please note, that this does not solve the
underlying problem that's causing the backpressure
-in the first place (and end-to-end records latency will remain high).
-## Tuning Checkpointing
+[Unaligned checkpoints]({{< ref "docs/ops/state/checkpoints"
>}}#unaligned-checkpoints) 可用于加快传播时间的 checkpoint barriers。
但是请注意,这并不能解决导致反压的根本问题(端到端记录延迟仍然很高)。
-Checkpoints are triggered at regular intervals that applications can
configure. When a checkpoint takes longer
-to complete than the checkpoint interval, the next checkpoint is not triggered
before the in-progress checkpoint
-completes. By default the next checkpoint will then be triggered immediately
once the ongoing checkpoint completes.
+## Checkpoint 调优
-When checkpoints end up frequently taking longer than the base interval (for
example because state
-grew larger than planned, or the storage where checkpoints are stored is
temporarily slow),
-the system is constantly taking checkpoints (new ones are started immediately
once ongoing once finish).
-That can mean that too many resources are constantly tied up in checkpointing
and that the operators make too
-little progress. This behavior has less impact on streaming applications that
use asynchronously checkpointed state,
-but may still have an impact on overall application performance.
+应用程序可以配置定期触发 checkpoints。 当 checkpoint 完成时间超过 checkpoint 间隔时,在正在进行的 checkpoint
完成之前,不会触发下一个 checkpoint。默认情况下,一旦正在进行的 checkpoint 完成,将立即触发下一个 checkpoint。
-To prevent such a situation, applications can define a *minimum duration
between checkpoints*:
-`StreamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(milliseconds)`
+当 checkpoints 完成的时间经常超过 checkpoints 基本间隔时(例如,因为状态比计划的更大,或者访问 checkpoints
所在的存储系统暂时变慢),
+系统不断地进行 checkpoints(一旦完成,新的 checkpoints 就会立即启动)。这可能意味着过多的资源被不断地束缚在
checkpointing 中,并且 checkpoint 算子进行得缓慢。
+此行为对使用 checkpointed 状态的流式应用程序的影响较小,但仍可能对整体应用程序性能产生影响。
Review Comment:
```suggestion
此行为对使用异步 checkpointed 状态的流式应用程序的影响较小,但仍可能对整体应用程序性能产生影响。
```
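For context on the paragraph this suggestion touches, here is a minimal sketch of the scheduling behaviour the hunk describes: a checkpoint that overruns the interval causes the next one to start immediately, unless a minimum pause is configured. This is a simplified model under stated assumptions, not Flink code. The function `simulate` and its parameter names are hypothetical, and the model assumes at most one checkpoint in flight. The actual setting is the `setMinPauseBetweenCheckpoints(milliseconds)` call quoted in the diff.

```python
def simulate(durations_ms, interval_ms, min_pause_ms=0):
    """Return (trigger, completion) times for a series of checkpoints.

    Simplified model (an assumption, not Flink's scheduler):
    the next checkpoint starts at the later of
      - the previous trigger time plus the configured interval, and
      - the previous completion time plus the minimum pause.
    """
    timeline = []
    trigger = 0
    for duration in durations_ms:
        completion = trigger + duration
        timeline.append((trigger, completion))
        trigger = max(trigger + interval_ms, completion + min_pause_ms)
    return timeline


# Checkpoints that overrun a 10 s interval: without a pause, each new
# checkpoint begins the moment the previous one finishes.
back_to_back = simulate([15_000, 15_000, 5_000], interval_ms=10_000)

# With a 5 s minimum pause, operators get a gap between checkpoints.
with_pause = simulate([15_000, 15_000, 5_000], interval_ms=10_000, min_pause_ms=5_000)
```

In the first run every trigger time equals the previous completion time, which is the "constantly taking checkpoints" situation the paragraph warns about. The second run leaves a 5 s gap after each completion.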