klion26 commented on a change in pull request #12237:
URL: https://github.com/apache/flink/pull/12237#discussion_r433926976
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
## Event Time and Watermarks
-### Introduction
+<a name="Introduction"></a>
+### 概要
-Flink explicitly supports three different notions of time:
+Flink 明确支持以下三种时间语义:
-* _event time:_ the time when an event occurred, as recorded by the device producing (or storing) the event
+* _事件时间(event time):_ 事件产生的时间,记录的是设备生产(或者存储)事件的时间
-* _ingestion time:_ a timestamp recorded by Flink at the moment it ingests the event
+* _摄取时间(ingestion time):_ Flink 读取事件时记录的时间
-* _processing time:_ the time when a specific operator in your pipeline is processing the event
+* _处理时间(processing time):_ Flink pipeline 中具体算子处理事件的时间
-For reproducible results, e.g., when computing the maximum price a stock reached during the first
-hour of trading on a given day, you should use event time. In this way the result won't depend on
-when the calculation is performed. This kind of real-time application is sometimes performed using
-processing time, but then the results are determined by the events that happen to be processed
-during that hour, rather than the events that occurred then. Computing analytics based on processing
-time causes inconsistencies, and makes it difficult to re-analyze historic data or test new
-implementations.
+为了获得可重现的结果,例如在计算过去的特定一天里第一个小时股票的最高价格时,我们应该使用事件时间。这样的话,无论
+什么时间去计算都不会影响输出结果。然而有些人,在实时计算应用中使用处理时间,这样的话,输出结果就会被处理时间点所决
Review comment:
Line 44 here needs to be merged with line 43; otherwise there will be a space between “无论” and “什么”.
Lines 45 and 44 also need to be merged; otherwise there will be a space between “所决” and “定”.
`This kind of real-time application is sometimes performed using processing time, but then the results are determined by the events that happen to be processed during that hour, rather than the events that occurred then.`
This sentence can simply be translated directly, for example `然后使用 processing time 的那些实时应用程序的处理结果,则由那一小时中所处理的数据所决定` (it doesn't have to be this exact translation; feel free to phrase it in your own words).
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
+为了获得可重现的结果,例如在计算过去的特定一天里第一个小时股票的最高价格时,我们应该使用事件时间。这样的话,无论
+什么时间去计算都不会影响输出结果。然而有些人,在实时计算应用中使用处理时间,这样的话,输出结果就会被处理时间点所决
+定,而不是生产事件的时间。基于处理时间会导致多次计算的结果不一致,也可能会导致再次分析历史数据或者测试新代码变得异常困难。
-### Working with Event Time
+<a name="Working-with-Event-Time"></a>
+### 使用 Event Time
-By default, Flink will use processing time. To change this, you can set the Time Characteristic:
+Flink 在默认情况下是使用处理时间。也可以通过下面配置来告诉 Flink 选择哪种时间语义:
{% highlight java %}
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
{% endhighlight %}
-If you want to use event time, you will also need to supply a Timestamp Extractor and Watermark
-Generator that Flink will use to track the progress of event time. This will be covered in the
-section below on [Working with Watermarks]({% link
-training/streaming_analytics.zh.md %}#working-with-watermarks), but first we should explain what
-watermarks are.
+如果想要使用事件时间,需要额外给 Flink 提供一个时间戳的提取器和 Watermark 生成器,Flink 将使用它们来跟踪事件时间的进度。这
+将在选节[使用 Watermarks]({% link training/streaming_analytics.zh.md %}#Working-with-Watermarks)中介绍,但是首先我们需要解释一下
+ watermarks 是什么。
### Watermarks
-Let's work through a simple example that will show why watermarks are needed, and how they work.
+让我们通过一个简单的示例来演示为什么需要 watermarks 及其工作方式。
-In this example you have a stream of timestamped events that arrive somewhat out of order, as shown
-below. The numbers shown are timestamps that indicate when these events actually occurred. The first
-event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2,
-and so on:
+在此示例中,我们将看到带有混乱时间戳的事件流,如下所示。显示的数字表达的是这些事件实际发生时间的时间戳。到达的
+第一个事件发生在时间4,随后发生的事件发生在更早的时间2,依此类推:
Review comment:
```suggestion
第一个事件发生在时间 4,随后发生的事件发生在更早的时间 2,依此类推:
```
1. This line needs to be merged with the previous one; otherwise there is a space between “达到的” and “第一个事件”.
2. In `第一个事件发生在时间 4,随后发生的事件发生在更早的时间 2`, the phrasing `时间 4` and `时间 2` reads a little differently from natural Chinese,
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
-By default, Flink will use processing time. To change this, you can set the Time Characteristic:
+Flink 在默认情况下是使用处理时间。也可以通过下面配置来告诉 Flink 选择哪种时间语义:
Review comment:
Would `也可以通过下面的方式继续修改` be a bit better here?
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
<div class="text-center" style="font-size: x-large; word-spacing: 0.5em; margin: 1em 0em;">
··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
</div>
-Now imagine that you are trying create a stream sorter. This is meant to be an application that
-processes each event from a stream as it arrives, and emits a new stream containing the same events,
-but ordered by their timestamps.
+假设我们要对数据流排序,我们想要达到的目的是:应用程序应该在数据流里的事件到达时就处理每个事件,并发出包含相同
+事件但按其时间戳排序的新流。
-Some observations:
+让我们重新审视这些数据:
-(1) The first element your stream sorter sees is the 4, but you can't just immediately release it as
-the first element of the sorted stream. It may have arrived out of order, and an earlier event might
-yet arrive. In fact, you have the benefit of some god-like knowledge of this stream's future, and
-you can see that your stream sorter should wait at least until the 2 arrives before producing any
-results.
+(1) 我们的排序器第一个看到的数据是4,但是我们不能立即将其作为已排序流的第一个元素释放。因为我们并不能确定它是
+有序的,并且较早的事件有可能并未到达。事实上,如果站在上帝视角,我们知道,必须要等到2到来时,排序器才可以有事件输出。
-*Some buffering, and some delay, is necessary.*
+*需要一些缓冲,需要一些时间,但这都是值得的*
-(2) If you do this wrong, you could end up waiting forever. First the sorter saw an event from time
-4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe.
-Maybe not. You could wait forever and never see a 1.
+(2) 接下来的这一步,如果我们选择的是固执的等待,我们永远不会有结果。首先,我们从时间4看到了一个事件,然后从时
+间2看到了一个事件。可是,时间戳小于2的事件接下来会不会到来呢?可能会,也可能不会。再次站在上帝视角,我们知道,我
+们永远不会看到1。
-*Eventually you have to be courageous and emit the 2 as the start of the sorted stream.*
+*最终,我们必须勇于承担责任,并发出指令,把2作为已排序的事件流的开始*
-(3) What you need then is some sort of policy that defines when, for any given timestamped event, to
-stop waiting for the arrival of earlier events.
+(3)然后,我们需要一种策略,该策略定义:对于任何给定时间戳的事件,Flink何时停止等待较早事件的到来。
-*This is precisely what watermarks do* — they define when to stop waiting for earlier events.
+*这正是 watermarks 的作用* — 它们定义何时停止等待较早的事件。
-Event time processing in Flink depends on *watermark generators* that insert special timestamped
-elements into the stream, called *watermarks*. A watermark for time _t_ is an assertion that the
-stream is (probably) now complete up through time _t_.
+Flink 中事件时间的处理取决于 *watermark 生成器*,后者将带有时间戳的特殊元素插入流中形成 *watermarks*。事件
+时间 _t_ 的 watermark 代表 _t_ 之后(很可能)不会有新的元素到达。
-When should this stream sorter stop waiting, and push out the 2 to start the sorted stream? When a
-watermark arrives with a timestamp of 2, or greater.
+事件流的排序器应何时停止等待,并推出2以启动已分类的流?当 watermark 以2或更大的时间戳到达时!
-(4) You might imagine different policies for deciding how to generate watermarks.
+(4) 我们可能会思考,如何决定 watermarks 的不同生成策略
-Each event arrives after some delay, and these delays vary, so some events are delayed more than
-others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink
-refers to this strategy as *bounded-out-of-orderness* watermarking. It is easy to imagine more
-complex approaches to watermarking, but for most applications a fixed delay works well enough.
+每个事件都会延迟一段时间后到达,然而这些延迟有所不同,有些事件可能其他事件延迟得更多。一种简单的方法是假定这些
+延迟受某个最大延迟的限制。Flink 将此策略称为 *最大无序边界(bounded-out-of-orderness)* watermark。当然,我们可以想像
+出更好的生成 watermark 的方法,但是对于大多数应用而言,固定延迟策略已经足够了。
-### Latency vs. Completeness
+<a name="Latency-vs.-Completeness"></a>
+### 延迟 VS 正确性
-Another way to think about watermarks is that they give you, the developer of a streaming
-application, control over the tradeoff between latency and completeness. Unlike in batch processing,
-where one has the luxury of being able to have complete knowledge of the input before producing any
-results, with streaming you must eventually stop waiting to see more of the input, and produce some
-sort of result.
+watermarks 给了开发者流处理的一种选择,它们使开发人员在开发应用程序是可以控制延迟和完整性之间的权衡。与批处理不同,批
+处理中的奢侈之处在于可以在产生任何结果之前完全了解输入,而使用流式传输,我们不被允许等待所有的时间都产生了,才
+输出排序好的数据,这与流相违背。
-You can either configure your watermarking aggressively, with a short bounded delay, and thereby
-take the risk of producing results with rather incomplete knowledge of the input -- i.e., a possibly
-wrong result, produced quickly. Or you can wait longer, and produce results that take advantage of
-having more complete knowledge of the input stream(s).
+我们可以把 watermarks 的边界时间配置的相对较短,从而冒着在输入了解不完全的情况下产生结果的风险-即可能会很快产生错误的
+结果。或者,您可以等待更长的时间,并利用对输入流的更全面的了解来产生结果。
-It is also possible to implement hybrid solutions that produce initial results quickly, and then
-supply updates to those results as additional (late) data is processed. This is a good approach for
-some applications.
+当然也可以实施混合解决方案,先快速产生初步结果,然后在处理其他(最新)数据时向这些结果提供更新。对于有一些对延
+迟的容忍程度很低,但是又对结果有很严格的要求的场景下,或许是一个福音。
-### Lateness
+<a name="Latency"></a>
+### 延迟
-Lateness is defined relative to the watermarks. A `Watermark(t)` asserts that the stream is complete
-up through time _t_; any event following this watermark whose timestamp is ≤ _t_ is late.
+延迟是相对于 watermarks 定义的。`Watermark(t)` 判定事件流的时间已经到达了 _t_; watermark 之后的时间戳为 ≤ _t_ 的
+任何事件都被称之为延迟事件。
-### Working with Watermarks
+<a name="Working-with-Watermarks"></a>
+### 使用Watermarks
Review comment:
```suggestion
### 使用 Watermarks
```
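Stepping outside the review for a moment: the *bounded-out-of-orderness* strategy the quoted hunk describes can be illustrated with a tiny, Flink-free sketch. The class and method names below are invented for illustration (this is not Flink's API or implementation); the idea is simply that the watermark trails the largest event timestamp seen so far by a fixed maximum expected delay.

```java
// Toy model of bounded-out-of-orderness watermarking, NOT Flink's actual
// implementation: the watermark trails the largest event timestamp seen
// so far by a fixed maximum expected delay.
public class BoundedOutOfOrdernessSketch {
    private final long maxDelay;
    private long maxTimestampSeen = Long.MIN_VALUE;

    public BoundedOutOfOrdernessSketch(long maxDelay) {
        this.maxDelay = maxDelay;
    }

    // Observe one event timestamp and return the watermark after it.
    public long onEvent(long eventTimestamp) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
        return maxTimestampSeen - maxDelay;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch wm = new BoundedOutOfOrdernessSketch(3);
        for (long t : new long[] {4, 2, 7, 5, 9}) {
            // An event whose timestamp is <= the current watermark is late.
            System.out.println("event " + t + " -> watermark " + wm.onEvent(t));
        }
    }
}
```

On the example stream, the watermark after seeing 4 is 1, stays at 1 when the out-of-order 2 arrives, and only advances as larger timestamps show up, which matches the "stop waiting once a watermark with timestamp 2 or greater arrives" behavior discussed in the hunk.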
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
-(2) If you do this wrong, you could end up waiting forever. First the sorter saw an event from time
-4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe.
-Maybe not. You could wait forever and never see a 1.
+(2) 接下来的这一步,如果我们选择的是固执的等待,我们永远不会有结果。首先,我们从时间4看到了一个事件,然后从时
+间2看到了一个事件。可是,时间戳小于2的事件接下来会不会到来呢?可能会,也可能不会。再次站在上帝视角,我们知道,我
Review comment:
```suggestion
间 2 看到了一个事件。可是,时间戳小于 2 的事件接下来会不会到来呢?可能会,也可能不会。再次站在上帝视角,我们知道,我
```
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
-If you want to use event time, you will also need to supply a Timestamp Extractor and Watermark
-Generator that Flink will use to track the progress of event time. This will be covered in the
-section below on [Working with Watermarks]({% link
-training/streaming_analytics.zh.md %}#working-with-watermarks), but first we should explain what
-watermarks are.
+如果想要使用事件时间,需要额外给 Flink 提供一个时间戳的提取器和 Watermark 生成器,Flink 将使用它们来跟踪事件时间的进度。这
Review comment:
```suggestion
如果想要使用事件时间,需要额外给 Flink 提供一个时间戳提取器和 Watermark 生成器,Flink 将使用它们来跟踪事件时间的进度。这
```
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
-(2) If you do this wrong, you could end up waiting forever. First the sorter saw an event from time
-4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe.
-Maybe not. You could wait forever and never see a 1.
+(2) 接下来的这一步,如果我们选择的是固执的等待,我们永远不会有结果。首先,我们从时间4看到了一个事件,然后从时
Review comment:
Would `固执的等待` -> `无尽的等待` be better?
```suggestion
(2) 接下来的这一步,如果我们选择的是固执的等待,我们永远不会有结果。首先,我们从时间 4 看到了一个事件,然后从时
```
This should say that we first saw a `time 4` event and then a `time 2` event; the sentence could be rephrased accordingly.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
-For reproducible results, e.g., when computing the maximum price a stock reached during the first
-hour of trading on a given day, you should use event time. In this way the result won't depend on
-when the calculation is performed. This kind of real-time application is sometimes performed using
-processing time, but then the results are determined by the events that happen to be processed
-during that hour, rather than the events that occurred then. Computing analytics based on processing
-time causes inconsistencies, and makes it difficult to re-analyze historic data or test new
-implementations.
+为了获得可重现的结果,例如在计算过去的特定一天里第一个小时股票的最高价格时,我们应该使用事件时间。这样的话,无论
+什么时间去计算都不会影响输出结果。然而有些人,在实时计算应用中使用处理时间,这样的话,输出结果就会被处理时间点所决
+定,而不是生产事件的时间。基于处理时间会导致多次计算的结果不一致,也可能会导致再次分析历史数据或者测试新代码变得异常困难。
Review comment:
The sentence `基于处理时间会导致多次计算的结果不一致` does not read smoothly: `基于处理时间` itself cannot "lead to" `结果不一致`.
It should probably be `基于处理时间的计算`, or some other action phrase.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
-Now imagine that you are trying create a stream sorter. This is meant to be an application that
-processes each event from a stream as it arrives, and emits a new stream containing the same events,
-but ordered by their timestamps.
+假设我们要对数据流排序,我们想要达到的目的是:应用程序应该在数据流里的事件到达时就处理每个事件,并发出包含相同
+事件但按其时间戳排序的新流。
Review comment:
Line 74 needs to be merged with line 73. Also, could this sentence be polished further?
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -437,37 +407,32 @@ stream
.reduce(<same reduce function>)
{% endhighlight %}
-You might expect Flink's runtime to be smart enough to do this parallel
pre-aggregation for you
-(provided you are using a ReduceFunction or AggregateFunction), but it's not.
+可能我们会猜测以 Flink 的能力,想要做到这样看起来是可行的(前提是您使用的是ReduceFunction或AggregateFunction),但不是。
-The reason why this works is that the events produced by a time window are
assigned timestamps
-based on the time at the end of the window. So, for example, all of the events
produced
-by an hour-long window will have timestamps marking the end of an hour. Any
subsequent window
-consuming those events should have a duration that is the same as, or a
multiple of, the
-previous window.
+之所以可行,是因为时间窗口产生的事件是根据窗口结束时的时间分配时间戳的。例如,一个小时小时的窗口所产生的所有事
+件都将带有标记一个小时结束的时间戳。后面的窗口内的数据消费和前面的流产生的数据是一致的。
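The timestamp arithmetic behind window-after-window chaining can be sketched in plain Java (a hypothetical illustration; `end - 1` mirrors Flink's `TimeWindow.maxTimestamp()`, but the class below is not Flink code):

```java
public class WindowChaining {
    // A window [start, start + size) stamps its result with end - 1.
    public static long resultTimestamp(long windowStart, long windowSize) {
        return windowStart + windowSize - 1;
    }

    // Aligned start of the follow-up window that a given timestamp falls into.
    public static long followUpWindowStart(long ts, long windowSize) {
        return ts - Math.floorMod(ts, windowSize);
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        long resultTs = resultTimestamp(0, hour); // result of the hour window [0, 1h)
        // The result lands inside the follow-up hour window starting at 0, so
        // hour-window results chain cleanly into windows of the same (or a
        // multiple) duration.
        System.out.println(followUpWindowStart(resultTs, hour)); // 0
    }
}
```

A follow-up window of twice the duration works the same way, since every hour boundary is also a two-hour-window boundary.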
-#### No Results for Empty TimeWindows
+<a name="No-Results-for-Empty-TimeWindows"></a>
+#### 空的时间窗口不会产出结果
-Windows are only created when events are assigned to them. So if there are no
events in a given time
-frame, no results will be reported.
+事件会触发窗口的创建。换句话说,如果在特定的窗口内没有事件,就不会有窗口,就不会有输出结果。
#### Late Events Can Cause Late Merges
-Session windows are based on an abstraction of windows that can _merge_. Each
element is initially
-assigned to a new window, after which windows are merged whenever the gap
between them is small
-enough. In this way, a late event can bridge the gap separating two previously
separate sessions,
-producing a late merge.
+会话窗口的实现是基于窗口的一个抽象能力,窗口可以_聚合_。会话窗口中的每个数据在初始被消费时,都会被分配一个新的
+窗口,但是如果窗口之间的间隔足够小,多个窗口就会被聚合。延迟事件可以弥合两个先前分开的会话间隔,从而产生
+一个虽然有延迟但是更加准确地结果。
{% top %}
-## Hands-on
+## 动手练习
-The hands-on exercise that goes with this section is the [Hourly Tips
+本节附带的动手练习是[Hourly Tips
Review comment:
```suggestion
本节附带的动手练习是 [Hourly Tips
```
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -397,36 +369,34 @@ stream.
.process(...);
{% endhighlight %}
-When the allowed lateness is greater than zero, only those events that are so
late that they would
-be dropped are sent to the side output (if it has been configured).
+当允许的延迟大于零时,只有那些超过最大无序边界以至于会被丢弃的事件才会被发送到侧输出流(如果已配置)。
-### Surprises
+<a name="Surprises"></a>
+### 什么是惊喜
-Some aspects of Flink's windowing API may not behave in the way you would
expect. Based on
-frequently asked questions on the [flink-user mailing
-list](https://flink.apache.org/community.html#mailing-lists) and elsewhere,
here are some facts
-about windows that may surprise you.
+Flink 的窗口 API 某些方面有一些奇怪的行为,可能无法按照我们期望的方式运行。 根
Review comment:
Would `可能和我们期望行为不一致` read better than `可能无法按照我们期望的方式运行`?
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
-Some observations:
+让我们重新审视这些数据:
-(1) The first element your stream sorter sees is the 4, but you can't just
immediately release it as
-the first element of the sorted stream. It may have arrived out of order, and
an earlier event might
-yet arrive. In fact, you have the benefit of some god-like knowledge of this
stream's future, and
-you can see that your stream sorter should wait at least until the 2 arrives
before producing any
-results.
+(1) 我们的排序器第一个看到的数据是4,但是我们不能立即将其作为已排序流的第一个元素释放。因为我们并不能确定它是
+有序的,并且较早的事件有可能并未到达。事实上,如果站在上帝视角,我们知道,必须要等到2到来时,排序器才可以有事件输出。
-*Some buffering, and some delay, is necessary.*
+*需要一些缓冲,需要一些时间,但这都是值得的*
-(2) If you do this wrong, you could end up waiting forever. First the sorter
saw an event from time
-4, and then an event from time 2. Will an event with a timestamp less than 2
ever arrive? Maybe.
-Maybe not. You could wait forever and never see a 1.
+(2) 接下来的这一步,如果我们选择的是固执的等待,我们永远不会有结果。首先,我们从时间4看到了一个事件,然后从时
+间2看到了一个事件。可是,时间戳小于2的事件接下来会不会到来呢?可能会,也可能不会。再次站在上帝视角,我们知道,我
+们永远不会看到1。
-*Eventually you have to be courageous and emit the 2 as the start of the
sorted stream.*
+*最终,我们必须勇于承担责任,并发出指令,把2作为已排序的事件流的开始*
-(3) What you need then is some sort of policy that defines when, for any given
timestamped event, to
-stop waiting for the arrival of earlier events.
+(3)然后,我们需要一种策略,该策略定义:对于任何给定时间戳的事件,Flink何时停止等待较早事件的到来。
-*This is precisely what watermarks do* — they define when to stop waiting for
earlier events.
+*这正是 watermarks 的作用* — 它们定义何时停止等待较早的事件。
-Event time processing in Flink depends on *watermark generators* that insert
special timestamped
-elements into the stream, called *watermarks*. A watermark for time _t_ is an
assertion that the
-stream is (probably) now complete up through time _t_.
+Flink 中事件时间的处理取决于 *watermark 生成器*,后者将带有时间戳的特殊元素插入流中形成 *watermarks*。事件
+时间 _t_ 的 watermark 代表 _t_ 之后(很可能)不会有新的元素到达。
-When should this stream sorter stop waiting, and push out the 2 to start the
sorted stream? When a
-watermark arrives with a timestamp of 2, or greater.
+事件流的排序器应何时停止等待,并推出2以启动已分类的流?当 watermark 以2或更大的时间戳到达时!
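The sorter's buffer-then-release behavior described above can be sketched in plain Java (a hypothetical illustration, not Flink code): events are held in a priority queue, and a watermark for time _t_ releases every buffered event with a timestamp at or below _t_, in timestamp order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class WatermarkSorter {
    private final PriorityQueue<Long> buffer = new PriorityQueue<>();

    // Out-of-order arrival: just buffer the event's timestamp.
    public void onEvent(long timestamp) {
        buffer.add(timestamp);
    }

    // Watermark(t): the stream is (probably) complete up through t,
    // so it is now safe to emit everything <= t, in sorted order.
    public List<Long> onWatermark(long t) {
        List<Long> emitted = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek() <= t) {
            emitted.add(buffer.poll());
        }
        return emitted;
    }

    public static void main(String[] args) {
        WatermarkSorter sorter = new WatermarkSorter();
        sorter.onEvent(4);
        sorter.onEvent(2);
        sorter.onEvent(7);
        System.out.println(sorter.onWatermark(2)); // [2]    -- 4 and 7 keep waiting
        System.out.println(sorter.onWatermark(7)); // [4, 7]
    }
}
```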
-(4) You might imagine different policies for deciding how to generate
watermarks.
+(4) 我们可能会思考,如何决定 watermarks 的不同生成策略
-Each event arrives after some delay, and these delays vary, so some events are
delayed more than
-others. One simple approach is to assume that these delays are bounded by some
maximum delay. Flink
-refers to this strategy as *bounded-out-of-orderness* watermarking. It is easy
to imagine more
-complex approaches to watermarking, but for most applications a fixed delay
works well enough.
+每个事件都会延迟一段时间后到达,然而这些延迟有所不同,有些事件可能其他事件延迟得更多。一种简单的方法是假定这些
+延迟受某个最大延迟的限制。Flink 将此策略称为 *最大无序边界(bounded-out-of-orderness)*
watermark。当然,我们可以想像
+出更好的生成 watermark 的方法,但是对于大多数应用而言,固定延迟策略已经足够了。
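The bounded-out-of-orderness strategy can be sketched in a few lines of plain Java (a hypothetical illustration of the idea, not Flink's actual watermark-generator API): the watermark simply trails the largest timestamp seen so far by the fixed maximum expected delay.

```java
public class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrderness;          // the assumed maximum delay, in ms
    private long maxTimestampSeen = Long.MIN_VALUE;

    public BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
    }

    // Called for every event; tracks the highest timestamp observed so far.
    public void onEvent(long eventTimestamp) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
    }

    // The current watermark asserts: no more events with timestamp <= this value.
    public long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrderness;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch wm = new BoundedOutOfOrdernessSketch(10_000);
        wm.onEvent(50_000);  // event at t = 50s
        wm.onEvent(43_000);  // a late event never moves the watermark backwards
        System.out.println(wm.currentWatermark()); // 40000
    }
}
```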
-### Latency vs. Completeness
+<a name="Latency-vs.-Completeness"></a>
+### 延迟 VS 正确性
-Another way to think about watermarks is that they give you, the developer of
a streaming
-application, control over the tradeoff between latency and completeness.
Unlike in batch processing,
-where one has the luxury of being able to have complete knowledge of the input
before producing any
-results, with streaming you must eventually stop waiting to see more of the
input, and produce some
-sort of result.
+watermarks 给了开发者流处理的一种选择,它们使开发人员在开发应用程序是可以控制延迟和完整性之间的权衡。与批处理不同,批
+处理中的奢侈之处在于可以在产生任何结果之前完全了解输入,而使用流式传输,我们不被允许等待所有的时间都产生了,才
+输出排序好的数据,这与流相违背。
-You can either configure your watermarking aggressively, with a short bounded
delay, and thereby
-take the risk of producing results with rather incomplete knowledge of the
input -- i.e., a possibly
-wrong result, produced quickly. Or you can wait longer, and produce results
that take advantage of
-having more complete knowledge of the input stream(s).
+我们可以把 watermarks 的边界时间配置的相对较短,从而冒着在输入了解不完全的情况下产生结果的风险-即可能会很快产生错误的
+结果。或者,您可以等待更长的时间,并利用对输入流的更全面的了解来产生结果。
-It is also possible to implement hybrid solutions that produce initial results
quickly, and then
-supply updates to those results as additional (late) data is processed. This
is a good approach for
-some applications.
+当然也可以实施混合解决方案,先快速产生初步结果,然后在处理其他(最新)数据时向这些结果提供更新。对于有一些对延
+迟的容忍程度很低,但是又对结果有很严格的要求的场景下,或许是一个福音。
-### Lateness
+<a name="Latency"></a>
+### 延迟
-Lateness is defined relative to the watermarks. A `Watermark(t)` asserts that
the stream is complete
-up through time _t_; any event following this watermark whose timestamp is
≤ _t_ is late.
+延迟是相对于 watermarks 定义的。`Watermark(t)` 判定事件流的时间已经到达了 _t_; watermark 之后的时间戳为 ≤
_t_ 的
+任何事件都被称之为延迟事件。
Review comment:
Merge this line with the previous one.
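The lateness and side-output rules quoted above can be summarized as a loose sketch (hypothetical; in Flink, lateness and dropping are actually evaluated per window against the window's end time, not against a single global threshold as simplified here):

```java
public class LatenessSketch {
    public enum Disposition { ON_TIME, LATE, DROPPED }

    // Classify an event's timestamp against the current watermark and an
    // allowed-lateness budget.
    public static Disposition classify(long eventTs, long watermark, long allowedLateness) {
        if (eventTs > watermark) {
            return Disposition.ON_TIME;               // ahead of the watermark
        }
        if (eventTs > watermark - allowedLateness) {
            return Disposition.LATE;                  // late, but still accepted
        }
        return Disposition.DROPPED;                   // only these would reach the side output
    }

    public static void main(String[] args) {
        System.out.println(classify(105, 100, 10)); // ON_TIME
        System.out.println(classify(95, 100, 10));  // LATE
        System.out.println(classify(85, 100, 10));  // DROPPED
    }
}
```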
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
+(1) 我们的排序器第一个看到的数据是4,但是我们不能立即将其作为已排序流的第一个元素释放。因为我们并不能确定它是
+有序的,并且较早的事件有可能并未到达。事实上,如果站在上帝视角,我们知道,必须要等到2到来时,排序器才可以有事件输出。
Review comment:
```suggestion
有序的,并且较早的事件有可能并未到达。事实上,如果站在上帝视角,我们知道,必须要等到 2 到来时,排序器才可以有事件输出。
```
Line 79 needs to be merged with line 78.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
+(1) 我们的排序器第一个看到的数据是4,但是我们不能立即将其作为已排序流的第一个元素释放。因为我们并不能确定它是
Review comment:
```suggestion
(1) 我们的排序器第一个看到的数据是 4,但是我们不能立即将其作为已排序流的第一个元素释放。因为我们并不能确定它是
```
Would `第一个元素输出` be better? `释放` is not a good fit here.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -397,36 +369,34 @@ stream.
.process(...);
{% endhighlight %}
-When the allowed lateness is greater than zero, only those events that are so
late that they would
-be dropped are sent to the side output (if it has been configured).
+当允许的延迟大于零时,只有那些超过最大无序边界以至于会被丢弃的事件才会被发送到侧输出流(如果已配置)。
-### Surprises
+<a name="Surprises"></a>
+### 什么是惊喜
-Some aspects of Flink's windowing API may not behave in the way you would
expect. Based on
-frequently asked questions on the [flink-user mailing
-list](https://flink.apache.org/community.html#mailing-lists) and elsewhere,
here are some facts
-about windows that may surprise you.
+Flink 的窗口 API 某些方面有一些奇怪的行为,可能无法按照我们期望的方式运行。 根
+据[Flink用户邮件列表](https://flink.apache.org/community.html#mailing-lists)
和其他地方一些频繁被问起
+的问题, 以下是一些有关Windows的底层事实,这些信息可能会让您感到惊讶。
-#### Sliding Windows Make Copies
+<a name="Sliding-Windows-Make-Copies"></a>
+#### 滑动窗口会复制出很多的事件
-Sliding window assigners can create lots of window objects, and will copy each
event into every
-relevant window. For example, if you have sliding windows every 15 minutes
that are 24-hours in
-length, each event will be copied into 4 * 24 = 96 windows.
+滑动窗口分配器可以创建许多窗口对象,并将每个事件复制到每个相关的窗口中。例如,如果您每隔15分钟就有24小时的滑动
+窗口,则每个事件将被复制到4 * 24 = 96个窗口中。
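The copy count stated above follows from the assigner arithmetic, sketched here in plain Java (a hypothetical illustration, not Flink's assigner code): a sliding window of size S with slide s places each event into S / s windows.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowCopies {
    // All slide-aligned window starts whose [start, start + size) contains ts.
    public static List<Long> windowStartsFor(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, slide); // latest aligned start <= ts
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long size = 24L * 60 * 60 * 1000;  // 24-hour windows
        long slide = 15L * 60 * 1000;      // sliding every 15 minutes
        // Each event is copied into 24h / 15min = 96 windows.
        System.out.println(windowStartsFor(0, size, slide).size()); // 96
    }
}
```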
-#### Time Windows are Aligned to the Epoch
+<a name="Time-Windows-are-Aligned-to-the-Epoch"></a>
+#### 时间窗口会和时间对齐
-Just because you are using hour-long processing-time windows and start your
application running at
-12:05 does not mean that the first window will close at 1:05. The first window
will be 55 minutes
-long and close at 1:00.
+仅仅因为我们使用的是一个小时的处理时间窗口并在12:05开始运行您的应用程序,并不意味着第一个窗口将在1:05关闭。第
+一个窗口将长55分钟,并在1:00关闭。
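The 12:05 example can be checked with the alignment arithmetic, sketched in plain Java (a simplified illustration with offset 0; Flink's tumbling and sliding assigners additionally support an offset parameter):

```java
public class WindowAlignment {
    // Start of the size-aligned window containing ts: round down to the window size.
    public static long windowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        long jobStart = 12 * hour + 5 * 60_000L;        // app starts at 12:05
        long start = windowStart(jobStart, hour);       // window starts at 12:00, not 12:05
        long end = start + hour;                        // and closes at 13:00
        // Only 55 minutes remain in the first window.
        System.out.println((end - jobStart) / 60_000L); // 55
    }
}
```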
-Note, however, that the tumbling and sliding window assigners take an optional
offset parameter
-that can be used to change the alignment of the windows. See
-[Tumbling Windows]({% link dev/stream/operators/windows.zh.md
%}#tumbling-windows) and
-[Sliding Windows]({% link dev/stream/operators/windows.zh.md
%}#sliding-windows) for details.
+ 请注意,滑动窗口和滚动窗口分配器所采用的offset参数可用于改变窗口的对齐方式。有关详细的信息,请参见
+[滚动窗口]({% link dev/stream/operators/windows.zh.md %}#tumbling-windows) 和
+[滑动窗口]({% link dev/stream/operators/windows.zh.md %}#sliding-windows) 。
#### Windows Can Follow Windows
-For example, it works to do this:
+这样做是行得通的:
Review comment:
This sentence needs adjusting; the point here is that the example below illustrates the issue of `window 后面接 window`.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
+事件流的排序器应何时停止等待,并推出2以启动已分类的流?当 watermark 以2或更大的时间戳到达时!
Review comment:
```suggestion
事件流的排序器应何时停止等待,并推出 2 以启动已分类的流?当 watermark 以 2 或更大的时间戳到达时!
```
`当 watermark 以 2 或更大的时间戳到达时`: this sentence needs adjusting:
1. it does not read like natural Chinese;
2. the wording can be simplified.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -167,38 +148,36 @@ public static class TimestampsAndWatermarks
}
{% endhighlight %}
-Note that the constructor for `BoundedOutOfOrdernessTimestampExtractor` takes
a parameter which
-specifies the maximum expected out-of-orderness (10 seconds, in this example).
+请仔细观察, `BoundedOutOfOrdernessTimestampExtractor`
的构造函数使用了一个常量用来指定最大无序边界程度(在此示例中为10秒)。
{% top %}
## Windows
-Flink features very expressive window semantics.
+Flink 在窗口的场景处理上非常有表现力。
-In this section you will learn:
+在本节中,我们将学习:
-* how windows are used to compute aggregates on unbounded streams,
-* which types of windows Flink supports, and
-* how to implement a DataStream program with a windowed aggregation
+* 如何使用窗口来计算无界流上的聚合,
+* Flink 支持哪种类型的窗口,以及
+* 如何使用窗口聚合来实现 DataStream 程序
-### Introduction
+<a name="Introduction"></a>
+### 概要
-It is natural when doing stream processing to want to compute aggregated
analytics on bounded subsets
-of the streams in order to answer questions like these:
+我们在操作无界数据流时,经常需要应对以下问题,我们经常把无界数据流分解成有界数据流聚合分析:
-* number of page views per minute
-* number of sessions per user per week
-* maximum temperature per sensor per minute
+* 每分钟的浏览量
+* 每位用户每周的会话数
+* 每个传感器每分钟的最高温度
-Computing windowed analytics with Flink depends on two principal abstractions:
_Window Assigners_
-that assign events to windows (creating new window objects as necessary), and
_Window Functions_
-that are applied to the events assigned to a window.
+用 Flink 计算窗口分析取决于两个主要的抽象操作:_Window Assigners_,将事件分配给窗口(根据需要创建新的窗口对象),
+以及 _Window Functions_,处理窗口内的数据。
-Flink's windowing API also has notions of _Triggers_, which determine when to
call the window
-function, and _Evictors_, which can remove elements collected in a window.
+Flink 的窗口 API 还具有 _Triggers_ 和 _Evictors_ 的概念,_Triggers_ 确定何时调用窗口函数,而
_Evictors_ 则
+可以删除在窗口中收集的元素。
-In its basic form, you apply windowing to a keyed stream like this:
+举一个简单的例子,我们一般这样使用键控事件流(基于key分组的输入事件流):
Review comment:
```suggestion
举一个简单的例子,我们一般这样使用键控事件流(基于 key 分组的输入事件流):
```
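What the keyed-windowing pattern under review computes can be sketched in plain Java (a hypothetical illustration, not the DataStream API): group events by key and by the aligned window containing their timestamp, then aggregate each group.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyedTumblingCount {
    public static class Event {
        final String key;
        final long timestamp;
        public Event(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    // Count events per (key, tumbling window) pair.
    public static Map<String, Integer> countPerKeyAndWindow(Event[] events, long windowSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (Event e : events) {
            long windowStart = e.timestamp - Math.floorMod(e.timestamp, windowSize);
            counts.merge(e.key + "@" + windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Event[] events = {
            new Event("sensorA", 10), new Event("sensorA", 40),
            new Event("sensorB", 20), new Event("sensorA", 70),
        };
        // sensorA has 2 events in window [0, 60) and 1 in [60, 120); sensorB has 1 in [0, 60).
        System.out.println(countPerKeyAndWindow(events, 60));
    }
}
```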
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -397,36 +369,34 @@ stream.
+Flink 的窗口 API 某些方面有一些奇怪的行为,可能无法按照我们期望的方式运行。 根
+据[Flink用户邮件列表](https://flink.apache.org/community.html#mailing-lists)
和其他地方一些频繁被问起
Review comment:
```suggestion
据 [Flink 用户邮件列表](https://flink.apache.org/community.html#mailing-lists)
和其他地方一些频繁被问起
```
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -397,36 +369,34 @@ stream.
+的问题, 以下是一些有关Windows的底层事实,这些信息可能会让您感到惊讶。
Review comment:
```suggestion
的问题, 以下是一些有关 Windows 的底层事实,这些信息可能会让您感到惊讶。
```
The meaning here is that questions from the user mailing list and elsewhere show that the following aspects of windows may not match users' expectations.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -437,37 +407,32 @@ stream
+会话窗口的实现是基于窗口的一个抽象能力,窗口可以_聚合_。会话窗口中的每个数据在初始被消费时,都会被分配一个新的
Review comment:
```suggestion
会话窗口的实现是基于窗口的一个抽象能力,窗口可以 _聚合_。会话窗口中的每个数据在初始被消费时,都会被分配一个新的
```
You can check the rendered result on a locally started server.
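The hunk above describes how session windows _merge_: each event initially gets its own window, and a late event can bridge the gap between two previously separate sessions. A dependency-free sketch of that merging logic (class and method names are ours, not Flink API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Each event starts its own [t, t + gap) window; any two windows whose
// intervals overlap are merged into one session.
public class SessionMergeSketch {
    static List<long[]> sessions(List<Long> timestamps, long gap) {
        List<long[]> windows = new ArrayList<>();
        for (long t : timestamps) {
            windows.add(new long[] {t, t + gap});
        }
        windows.sort(Comparator.comparingLong(w -> w[0]));
        List<long[]> merged = new ArrayList<>();
        for (long[] w : windows) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && w[0] < last[1]) {
                last[1] = Math.max(last[1], w[1]); // overlap: merge into the session
            } else {
                merged.add(w);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        long gap = 10;
        // Events at 0 and 18 form two separate sessions...
        System.out.println(sessions(List.of(0L, 18L), gap).size());     // 2
        // ...until a late event at 9 bridges them: a late merge into one.
        System.out.println(sessions(List.of(0L, 18L, 9L), gap).size()); // 1
    }
}
```

This is only the interval-merging idea; Flink additionally has to merge the per-window state and re-fire triggers when a late merge happens.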
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -397,36 +369,34 @@ stream.
.process(...);
{% endhighlight %}
-When the allowed lateness is greater than zero, only those events that are so
late that they would
-be dropped are sent to the side output (if it has been configured).
+当允许的延迟大于零时,只有那些超过最大无序边界以至于会被丢弃的事件才会被发送到侧输出流(如果已配置)。
-### Surprises
+<a name="Surprises"></a>
+### 什么是惊喜
Review comment:
This translation needs polishing. The intended meaning is "surprises for the user": behavior that differs from what users expect. Some concrete cases are given below.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
## Event Time and Watermarks
-### Introduction
+<a name="Introduction"></a>
+### 概要
-Flink explicitly supports three different notions of time:
+Flink 明确支持以下三种时间语义:
-* _event time:_ the time when an event occurred, as recorded by the device
producing (or storing) the event
+* _事件时间(event time):_ 事件产生的时间,记录的是设备生产(或者存储)事件的时间
-* _ingestion time:_ a timestamp recorded by Flink at the moment it ingests the
event
+* _摄取时间(ingestion time):_ Flink 读取事件时记录的时间
-* _processing time:_ the time when a specific operator in your pipeline is
processing the event
+* _处理时间(processing time):_ Flink pipeline 中具体算子处理事件的时间
-For reproducible results, e.g., when computing the maximum price a stock
reached during the first
-hour of trading on a given day, you should use event time. In this way the
result won't depend on
-when the calculation is performed. This kind of real-time application is
sometimes performed using
-processing time, but then the results are determined by the events that happen
to be processed
-during that hour, rather than the events that occurred then. Computing
analytics based on processing
-time causes inconsistencies, and makes it difficult to re-analyze historic
data or test new
-implementations.
+为了获得可重现的结果,例如在计算过去的特定一天里第一个小时股票的最高价格时,我们应该使用事件时间。这样的话,无论
+什么时间去计算都不会影响输出结果。然而有些人,在实时计算应用中使用处理时间,这样的话,输出结果就会被处理时间点所决
+定,而不是生产事件的时间。基于处理时间会导致多次计算的结果不一致,也可能会导致再次分析历史数据或者测试新代码变得异常困难。
-### Working with Event Time
+<a name="Working-with-Event-Time"></a>
+### 使用 Event Time
-By default, Flink will use processing time. To change this, you can set the
Time Characteristic:
+Flink 在默认情况下是使用处理时间。也可以通过下面配置来告诉 Flink 选择哪种时间语义:
{% highlight java %}
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
{% endhighlight %}
-If you want to use event time, you will also need to supply a Timestamp
Extractor and Watermark
-Generator that Flink will use to track the progress of event time. This will
be covered in the
-section below on [Working with Watermarks]({% link
-training/streaming_analytics.zh.md %}#working-with-watermarks), but first we
should explain what
-watermarks are.
+如果想要使用事件时间,需要额外给 Flink 提供一个时间戳的提取器和 Watermark 生成器,Flink 将使用它们来跟踪事件时间的进度。这
+将在选节[使用 Watermarks]({% link training/streaming_analytics.zh.md
%}#Working-with-Watermarks)中介绍,但是首先我们需要解释一下
Review comment:
```suggestion
将在选节[使用 Watermarks]({% link training/streaming_analytics.zh.md
%}#Working-with-Watermarks) 中介绍,但是首先我们需要解释一下
```
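The hunk above says a timestamp extractor and watermark generator are needed to track event-time progress. The core logic such a generator implements can be sketched without any Flink dependency (the class name is ours, not Flink API): the watermark trails the largest timestamp seen so far by a fixed out-of-orderness bound.

```java
// Minimal sketch of bounded-out-of-orderness watermarking.
public class WatermarkSketch {
    private final long maxOutOfOrderness;
    private long maxTimestampSeen = Long.MIN_VALUE;

    public WatermarkSketch(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Called for each event; returns the current watermark, i.e. the
    // point in event time before which no more events are expected.
    public long onEvent(long eventTimestamp) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
        return maxTimestampSeen - maxOutOfOrderness;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(10);
        System.out.println(wm.onEvent(4));  // 4 - 10 = -6
        System.out.println(wm.onEvent(2));  // max seen is still 4 -> -6
        System.out.println(wm.onEvent(15)); // 15 - 10 = 5
    }
}
```

Note the out-of-order event (2) does not move the watermark backwards; watermarks only advance.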
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
## Event Time and Watermarks
-### Introduction
+<a name="Introduction"></a>
+### 概要
-Flink explicitly supports three different notions of time:
+Flink 明确支持以下三种时间语义:
-* _event time:_ the time when an event occurred, as recorded by the device
producing (or storing) the event
+* _事件时间(event time):_ 事件产生的时间,记录的是设备生产(或者存储)事件的时间
-* _ingestion time:_ a timestamp recorded by Flink at the moment it ingests the
event
+* _摄取时间(ingestion time):_ Flink 读取事件时记录的时间
-* _processing time:_ the time when a specific operator in your pipeline is
processing the event
+* _处理时间(processing time):_ Flink pipeline 中具体算子处理事件的时间
-For reproducible results, e.g., when computing the maximum price a stock
reached during the first
-hour of trading on a given day, you should use event time. In this way the
result won't depend on
-when the calculation is performed. This kind of real-time application is
sometimes performed using
-processing time, but then the results are determined by the events that happen
to be processed
-during that hour, rather than the events that occurred then. Computing
analytics based on processing
-time causes inconsistencies, and makes it difficult to re-analyze historic
data or test new
-implementations.
+为了获得可重现的结果,例如在计算过去的特定一天里第一个小时股票的最高价格时,我们应该使用事件时间。这样的话,无论
+什么时间去计算都不会影响输出结果。然而有些人,在实时计算应用中使用处理时间,这样的话,输出结果就会被处理时间点所决
+定,而不是生产事件的时间。基于处理时间会导致多次计算的结果不一致,也可能会导致再次分析历史数据或者测试新代码变得异常困难。
-### Working with Event Time
+<a name="Working-with-Event-Time"></a>
+### 使用 Event Time
-By default, Flink will use processing time. To change this, you can set the
Time Characteristic:
+Flink 在默认情况下是使用处理时间。也可以通过下面配置来告诉 Flink 选择哪种时间语义:
{% highlight java %}
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
{% endhighlight %}
-If you want to use event time, you will also need to supply a Timestamp
Extractor and Watermark
-Generator that Flink will use to track the progress of event time. This will
be covered in the
-section below on [Working with Watermarks]({% link
-training/streaming_analytics.zh.md %}#working-with-watermarks), but first we
should explain what
-watermarks are.
+如果想要使用事件时间,需要额外给 Flink 提供一个时间戳的提取器和 Watermark 生成器,Flink 将使用它们来跟踪事件时间的进度。这
+将在选节[使用 Watermarks]({% link training/streaming_analytics.zh.md
%}#Working-with-Watermarks)中介绍,但是首先我们需要解释一下
+ watermarks 是什么。
### Watermarks
-Let's work through a simple example that will show why watermarks are needed,
and how they work.
+让我们通过一个简单的示例来演示为什么需要 watermarks 及其工作方式。
-In this example you have a stream of timestamped events that arrive somewhat
out of order, as shown
-below. The numbers shown are timestamps that indicate when these events
actually occurred. The first
-event to arrive happened at time 4, and it is followed by an event that
happened earlier, at time 2,
-and so on:
+在此示例中,我们将看到带有混乱时间戳的事件流,如下所示。显示的数字表达的是这些事件实际发生时间的时间戳。到达的
+第一个事件发生在时间4,随后发生的事件发生在更早的时间2,依此类推:
<div class="text-center" style="font-size: x-large; word-spacing: 0.5em;
margin: 1em 0em;">
··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
</div>
-Now imagine that you are trying create a stream sorter. This is meant to be an
application that
-processes each event from a stream as it arrives, and emits a new stream
containing the same events,
-but ordered by their timestamps.
+假设我们要对数据流排序,我们想要达到的目的是:应用程序应该在数据流里的事件到达时就处理每个事件,并发出包含相同
+事件但按其时间戳排序的新流。
-Some observations:
+让我们重新审视这些数据:
-(1) The first element your stream sorter sees is the 4, but you can't just
immediately release it as
-the first element of the sorted stream. It may have arrived out of order, and
an earlier event might
-yet arrive. In fact, you have the benefit of some god-like knowledge of this
stream's future, and
-you can see that your stream sorter should wait at least until the 2 arrives
before producing any
-results.
+(1) 我们的排序器第一个看到的数据是4,但是我们不能立即将其作为已排序流的第一个元素释放。因为我们并不能确定它是
+有序的,并且较早的事件有可能并未到达。事实上,如果站在上帝视角,我们知道,必须要等到2到来时,排序器才可以有事件输出。
-*Some buffering, and some delay, is necessary.*
+*需要一些缓冲,需要一些时间,但这都是值得的*
-(2) If you do this wrong, you could end up waiting forever. First the sorter
saw an event from time
-4, and then an event from time 2. Will an event with a timestamp less than 2
ever arrive? Maybe.
-Maybe not. You could wait forever and never see a 1.
+(2) 接下来的这一步,如果我们选择的是固执的等待,我们永远不会有结果。首先,我们从时间4看到了一个事件,然后从时
+间2看到了一个事件。可是,时间戳小于2的事件接下来会不会到来呢?可能会,也可能不会。再次站在上帝视角,我们知道,我
+们永远不会看到1。
Review comment:
Line 85 needs to be merged with the line above. You can run `./docs/build.sh -p` and preview the result.
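The stream-sorter thought experiment quoted above (buffer events, wait a bounded amount of event time, then emit) can be sketched in plain Java; the names here are ours, for illustration only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Buffer events in a priority queue and release only those whose timestamp
// is at or below the watermark (max timestamp seen minus a fixed bound).
public class StreamSorterSketch {
    private final long bound;
    private final PriorityQueue<Long> buffer = new PriorityQueue<>();
    private long maxSeen = Long.MIN_VALUE;

    StreamSorterSketch(long bound) { this.bound = bound; }

    List<Long> onEvent(long t) {
        buffer.add(t);
        maxSeen = Math.max(maxSeen, t);
        long watermark = maxSeen - bound;
        List<Long> out = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek() <= watermark) {
            out.add(buffer.poll()); // safe to emit: no earlier event is expected
        }
        return out;
    }

    public static void main(String[] args) {
        StreamSorterSketch sorter = new StreamSorterSketch(3);
        for (long e : new long[] {4, 2, 7, 11, 9}) {
            System.out.println(e + " -> emits " + sorter.onEvent(e));
        }
    }
}
```

With bound 3, nothing is emitted for 4 and 2; when 7 arrives the watermark reaches 4 and the sorter emits [2, 4] in order. An event older than the bound would still be dropped or buffered forever, which is exactly the trade-off the text describes.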
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -397,36 +369,34 @@ stream.
.process(...);
{% endhighlight %}
-When the allowed lateness is greater than zero, only those events that are so
late that they would
-be dropped are sent to the side output (if it has been configured).
+当允许的延迟大于零时,只有那些超过最大无序边界以至于会被丢弃的事件才会被发送到侧输出流(如果已配置)。
-### Surprises
+<a name="Surprises"></a>
+### 什么是惊喜
-Some aspects of Flink's windowing API may not behave in the way you would
expect. Based on
-frequently asked questions on the [flink-user mailing
-list](https://flink.apache.org/community.html#mailing-lists) and elsewhere,
here are some facts
-about windows that may surprise you.
+Flink 的窗口 API 某些方面有一些奇怪的行为,可能无法按照我们期望的方式运行。 根
+据[Flink用户邮件列表](https://flink.apache.org/community.html#mailing-lists)
和其他地方一些频繁被问起
+的问题, 以下是一些有关Windows的底层事实,这些信息可能会让您感到惊讶。
-#### Sliding Windows Make Copies
+<a name="Sliding-Windows-Make-Copies"></a>
+#### 滑动窗口会复制出很多的事件
-Sliding window assigners can create lots of window objects, and will copy each
event into every
-relevant window. For example, if you have sliding windows every 15 minutes
that are 24-hours in
-length, each event will be copied into 4 * 24 = 96 windows.
+滑动窗口分配器可以创建许多窗口对象,并将每个事件复制到每个相关的窗口中。例如,如果您每隔15分钟就有24小时的滑动
+窗口,则每个事件将被复制到4 * 24 = 96个窗口中。
-#### Time Windows are Aligned to the Epoch
+<a name="Time-Windows-are-Aligned-to-the-Epoch"></a>
+#### 时间窗口会和时间对齐
-Just because you are using hour-long processing-time windows and start your
application running at
-12:05 does not mean that the first window will close at 1:05. The first window
will be 55 minutes
-long and close at 1:00.
+仅仅因为我们使用的是一个小时的处理时间窗口并在12:05开始运行您的应用程序,并不意味着第一个窗口将在1:05关闭。第
+一个窗口将长55分钟,并在1:00关闭。
Review comment:
```suggestion
一个窗口将长 55 分钟,并在 1:00 关闭。
```
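The "aligned to the epoch" behavior discussed in this hunk comes down to rounding a timestamp down to the nearest window boundary. The formula below is essentially what Flink uses internally for window starts (the demo values and class name are ours):

```java
public class WindowAlignmentSketch {
    // Start of the tumbling window (of the given size and offset) that
    // contains the timestamp, for non-negative inputs.
    static long windowStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    public static void main(String[] args) {
        long hour = 60 * 60 * 1000L;              // one-hour windows
        long t1205 = 12 * hour + 5 * 60 * 1000L;  // 12:05, in ms since "midnight"
        // Starting the job at 12:05 still yields the window [12:00, 13:00):
        System.out.println(windowStart(t1205, 0, hour) == 12 * hour); // true
    }
}
```

So the first window the 12:05 event lands in opened at 12:00 and closes at 1:00, which is why it is only 55 minutes long from the job's point of view.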
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -29,123 +29,104 @@ under the License.
## Event Time and Watermarks
-### Introduction
+<a name="Introduction"></a>
+### 概要
-Flink explicitly supports three different notions of time:
+Flink 明确支持以下三种时间语义:
-* _event time:_ the time when an event occurred, as recorded by the device
producing (or storing) the event
+* _事件时间(event time):_ 事件产生的时间,记录的是设备生产(或者存储)事件的时间
-* _ingestion time:_ a timestamp recorded by Flink at the moment it ingests the
event
+* _摄取时间(ingestion time):_ Flink 读取事件时记录的时间
-* _processing time:_ the time when a specific operator in your pipeline is
processing the event
+* _处理时间(processing time):_ Flink pipeline 中具体算子处理事件的时间
-For reproducible results, e.g., when computing the maximum price a stock
reached during the first
-hour of trading on a given day, you should use event time. In this way the
result won't depend on
-when the calculation is performed. This kind of real-time application is
sometimes performed using
-processing time, but then the results are determined by the events that happen
to be processed
-during that hour, rather than the events that occurred then. Computing
analytics based on processing
-time causes inconsistencies, and makes it difficult to re-analyze historic
data or test new
-implementations.
+为了获得可重现的结果,例如在计算过去的特定一天里第一个小时股票的最高价格时,我们应该使用事件时间。这样的话,无论
+什么时间去计算都不会影响输出结果。然而有些人,在实时计算应用中使用处理时间,这样的话,输出结果就会被处理时间点所决
+定,而不是生产事件的时间。基于处理时间会导致多次计算的结果不一致,也可能会导致再次分析历史数据或者测试新代码变得异常困难。
-### Working with Event Time
+<a name="Working-with-Event-Time"></a>
+### 使用 Event Time
-By default, Flink will use processing time. To change this, you can set the
Time Characteristic:
+Flink 在默认情况下是使用处理时间。也可以通过下面配置来告诉 Flink 选择哪种时间语义:
{% highlight java %}
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
{% endhighlight %}
-If you want to use event time, you will also need to supply a Timestamp
Extractor and Watermark
-Generator that Flink will use to track the progress of event time. This will
be covered in the
-section below on [Working with Watermarks]({% link
-training/streaming_analytics.zh.md %}#working-with-watermarks), but first we
should explain what
-watermarks are.
+如果想要使用事件时间,需要额外给 Flink 提供一个时间戳的提取器和 Watermark 生成器,Flink 将使用它们来跟踪事件时间的进度。这
+将在选节[使用 Watermarks]({% link training/streaming_analytics.zh.md
%}#Working-with-Watermarks)中介绍,但是首先我们需要解释一下
+ watermarks 是什么。
### Watermarks
-Let's work through a simple example that will show why watermarks are needed,
and how they work.
+让我们通过一个简单的示例来演示为什么需要 watermarks 及其工作方式。
-In this example you have a stream of timestamped events that arrive somewhat
out of order, as shown
-below. The numbers shown are timestamps that indicate when these events
actually occurred. The first
-event to arrive happened at time 4, and it is followed by an event that
happened earlier, at time 2,
-and so on:
+在此示例中,我们将看到带有混乱时间戳的事件流,如下所示。显示的数字表达的是这些事件实际发生时间的时间戳。到达的
+第一个事件发生在时间4,随后发生的事件发生在更早的时间2,依此类推:
<div class="text-center" style="font-size: x-large; word-spacing: 0.5em;
margin: 1em 0em;">
··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
</div>
-Now imagine that you are trying create a stream sorter. This is meant to be an
application that
-processes each event from a stream as it arrives, and emits a new stream
containing the same events,
-but ordered by their timestamps.
+假设我们要对数据流排序,我们想要达到的目的是:应用程序应该在数据流里的事件到达时就处理每个事件,并发出包含相同
+事件但按其时间戳排序的新流。
-Some observations:
+让我们重新审视这些数据:
-(1) The first element your stream sorter sees is the 4, but you can't just
immediately release it as
-the first element of the sorted stream. It may have arrived out of order, and
an earlier event might
-yet arrive. In fact, you have the benefit of some god-like knowledge of this
stream's future, and
-you can see that your stream sorter should wait at least until the 2 arrives
before producing any
-results.
+(1) 我们的排序器第一个看到的数据是4,但是我们不能立即将其作为已排序流的第一个元素释放。因为我们并不能确定它是
+有序的,并且较早的事件有可能并未到达。事实上,如果站在上帝视角,我们知道,必须要等到2到来时,排序器才可以有事件输出。
-*Some buffering, and some delay, is necessary.*
+*需要一些缓冲,需要一些时间,但这都是值得的*
-(2) If you do this wrong, you could end up waiting forever. First the sorter
saw an event from time
-4, and then an event from time 2. Will an event with a timestamp less than 2
ever arrive? Maybe.
-Maybe not. You could wait forever and never see a 1.
+(2) 接下来的这一步,如果我们选择的是固执的等待,我们永远不会有结果。首先,我们从时间4看到了一个事件,然后从时
+间2看到了一个事件。可是,时间戳小于2的事件接下来会不会到来呢?可能会,也可能不会。再次站在上帝视角,我们知道,我
+们永远不会看到1。
-*Eventually you have to be courageous and emit the 2 as the start of the
sorted stream.*
+*最终,我们必须勇于承担责任,并发出指令,把2作为已排序的事件流的开始*
Review comment:
`并发出指令` needs to be revised here; that is not the intended meaning.
```suggestion
*最终,我们必须勇于承担责任,并发出指令,把 2 作为已排序的事件流的开始*
```
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -167,38 +148,36 @@ public static class TimestampsAndWatermarks
}
{% endhighlight %}
-Note that the constructor for `BoundedOutOfOrdernessTimestampExtractor` takes
a parameter which
-specifies the maximum expected out-of-orderness (10 seconds, in this example).
+请仔细观察, `BoundedOutOfOrdernessTimestampExtractor`
的构造函数使用了一个常量用来指定最大无序边界程度(在此示例中为10秒)。
{% top %}
## Windows
-Flink features very expressive window semantics.
+Flink 在窗口的场景处理上非常有表现力。
-In this section you will learn:
+在本节中,我们将学习:
-* how windows are used to compute aggregates on unbounded streams,
-* which types of windows Flink supports, and
-* how to implement a DataStream program with a windowed aggregation
+* 如何使用窗口来计算无界流上的聚合,
+* Flink 支持哪种类型的窗口,以及
+* 如何使用窗口聚合来实现 DataStream 程序
-### Introduction
+<a name="Introduction"></a>
Review comment:
Adding an anchor tag to these headings is a good habit; in-page links will not break that way 👍
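The hunk above lists what the Windows section teaches, including computing aggregates over an unbounded stream. The core idea of a windowed aggregation can be sketched without Flink (names are ours, for illustration): bucket each timestamped value into its hour-long tumbling window and keep a running max per window.

```java
import java.util.Map;
import java.util.TreeMap;

public class WindowedMaxSketch {
    // events: pairs of {timestampMs, value}; returns max value per window start.
    static Map<Long, Long> maxPerHour(long[][] events) {
        Map<Long, Long> maxByWindowStart = new TreeMap<>();
        long hour = 60 * 60 * 1000L;
        for (long[] e : events) {
            long windowStart = e[0] - (e[0] % hour); // tumbling window for this event
            maxByWindowStart.merge(windowStart, e[1], Math::max);
        }
        return maxByWindowStart;
    }

    public static void main(String[] args) {
        long hour = 60 * 60 * 1000L;
        long[][] events = {{10, 5}, {20, 9}, {hour + 1, 3}};
        System.out.println(maxPerHour(events)); // {0=9, 3600000=3}
    }
}
```

A real Flink job would express the same thing with `keyBy(...).window(...).reduce(...)` or an `AggregateFunction`, incrementally, instead of over a finished collection.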
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -397,36 +369,34 @@ stream.
.process(...);
{% endhighlight %}
-When the allowed lateness is greater than zero, only those events that are so
late that they would
-be dropped are sent to the side output (if it has been configured).
+当允许的延迟大于零时,只有那些超过最大无序边界以至于会被丢弃的事件才会被发送到侧输出流(如果已配置)。
-### Surprises
+<a name="Surprises"></a>
+### 什么是惊喜
-Some aspects of Flink's windowing API may not behave in the way you would
expect. Based on
-frequently asked questions on the [flink-user mailing
-list](https://flink.apache.org/community.html#mailing-lists) and elsewhere,
here are some facts
-about windows that may surprise you.
+Flink 的窗口 API 某些方面有一些奇怪的行为,可能无法按照我们期望的方式运行。 根
+据[Flink用户邮件列表](https://flink.apache.org/community.html#mailing-lists)
和其他地方一些频繁被问起
+的问题, 以下是一些有关Windows的底层事实,这些信息可能会让您感到惊讶。
-#### Sliding Windows Make Copies
+<a name="Sliding-Windows-Make-Copies"></a>
+#### 滑动窗口会复制出很多的事件
Review comment:
This needs improvement.
##########
File path: docs/training/streaming_analytics.zh.md
##########
@@ -397,36 +369,34 @@ stream.
.process(...);
{% endhighlight %}
-When the allowed lateness is greater than zero, only those events that are so
late that they would
-be dropped are sent to the side output (if it has been configured).
+当允许的延迟大于零时,只有那些超过最大无序边界以至于会被丢弃的事件才会被发送到侧输出流(如果已配置)。
-### Surprises
+<a name="Surprises"></a>
+### 什么是惊喜
-Some aspects of Flink's windowing API may not behave in the way you would
expect. Based on
-frequently asked questions on the [flink-user mailing
-list](https://flink.apache.org/community.html#mailing-lists) and elsewhere,
here are some facts
-about windows that may surprise you.
+Flink 的窗口 API 某些方面有一些奇怪的行为,可能无法按照我们期望的方式运行。 根
+据[Flink用户邮件列表](https://flink.apache.org/community.html#mailing-lists)
和其他地方一些频繁被问起
+的问题, 以下是一些有关Windows的底层事实,这些信息可能会让您感到惊讶。
-#### Sliding Windows Make Copies
+<a name="Sliding-Windows-Make-Copies"></a>
+#### 滑动窗口会复制出很多的事件
-Sliding window assigners can create lots of window objects, and will copy each
event into every
-relevant window. For example, if you have sliding windows every 15 minutes
that are 24-hours in
-length, each event will be copied into 4 * 24 = 96 windows.
+滑动窗口分配器可以创建许多窗口对象,并将每个事件复制到每个相关的窗口中。例如,如果您每隔15分钟就有24小时的滑动
+窗口,则每个事件将被复制到4 * 24 = 96个窗口中。
Review comment:
```suggestion
窗口,则每个事件将被复制到 4 * 24 = 96 个窗口中。
```
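The 4 * 24 = 96 figure in the hunk above is just window size divided by slide: every event belongs to one window per slide step that still covers it. A quick check (names are ours):

```java
public class SlidingCopiesSketch {
    // Number of sliding windows each event is copied into,
    // assuming the window size is a multiple of the slide.
    static long copiesPerEvent(long windowSizeMs, long slideMs) {
        return windowSizeMs / slideMs;
    }

    public static void main(String[] args) {
        long size = 24 * 60 * 60 * 1000L; // 24-hour windows
        long slide = 15 * 60 * 1000L;     // sliding every 15 minutes
        System.out.println(copiesPerEvent(size, slide)); // 96
    }
}
```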
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]