klion26 commented on a change in pull request #12237:
URL: https://github.com/apache/flink/pull/12237#discussion_r449576043
##########
File path: docs/learn-flink/streaming_analytics.zh.md
##########
@@ -437,37 +376,29 @@ stream
.reduce(<same reduce function>)
{% endhighlight %}
-You might expect Flink's runtime to be smart enough to do this parallel pre-aggregation for you
-(provided you are using a ReduceFunction or AggregateFunction), but it's not.
+可能我们会猜测以 Flink 的能力,想要做到这样看起来是可行的(前提是你使用的是 ReduceFunction 或 AggregateFunction),但不是。
-The reason why this works is that the events produced by a time window are assigned timestamps
-based on the time at the end of the window. So, for example, all of the events produced
-by an hour-long window will have timestamps marking the end of an hour. Any subsequent window
-consuming those events should have a duration that is the same as, or a multiple of, the
-previous window.
+之所以可行,是因为时间窗口产生的事件是根据窗口结束时的时间分配时间戳的。例如,一个小时小时的窗口所产生的所有事件都将带有标记一个小时结束的时间戳。后面的窗口内的数据消费和前面的流产生的数据是一致的。
-#### No Results for Empty TimeWindows
+<a name="No-Results-for-Empty-TimeWindows"></a>
+#### 空的时间窗口不会输出结果
-Windows are only created when events are assigned to them. So if there are no events in a given time
-frame, no results will be reported.
+事件会触发窗口的创建。换句话说,如果在特定的窗口内没有事件,就不会有窗口,就不会有输出结果。
#### Late Events Can Cause Late Merges
-Session windows are based on an abstraction of windows that can _merge_. Each element is initially
-assigned to a new window, after which windows are merged whenever the gap between them is small
-enough. In this way, a late event can bridge the gap separating two previously separate sessions,
-producing a late merge.
+会话窗口的实现是基于窗口的一个抽象能力,窗口可以 _聚合_。会话窗口中的每个数据在初始被消费时,都会被分配一个新的窗口,但是如果窗口之间的间隔足够小,多个窗口就会被聚合。延迟事件可以弥合两个先前分开的会话间隔,从而产生一个虽然有延迟但是更加准确地结果。
{% top %}
-## Hands-on
+## 动手练习
-The hands-on exercise that goes with this section is the [Hourly Tips
+本节附带的动手练习是[Hourly Tips
Review comment:
```suggestion
本节附带的动手练习是 [Hourly Tips
```
Please change this spot (note the added space before the link).
##########
File path: docs/learn-flink/streaming_analytics.zh.md
##########
@@ -385,36 +331,30 @@ stream.
.process(...);
{% endhighlight %}
-When the allowed lateness is greater than zero, only those events that are so late that they would
-be dropped are sent to the side output (if it has been configured).
+当允许的延迟大于零时,只有那些超过最大无序边界以至于会被丢弃的事件才会被发送到侧输出流(如果已配置)。
-### Surprises
+<a name="Surprises"></a>
+### 深入了解窗口操作
-Some aspects of Flink's windowing API may not behave in the way you would expect. Based on
-frequently asked questions on the [flink-user mailing
-list](https://flink.apache.org/community.html#mailing-lists) and elsewhere, here are some facts
-about windows that may surprise you.
+Flink 的窗口 API 某些方面有一些奇怪的行为,可能和我们预期的行为不一致。 根据 [Flink 用户邮件列表](https://flink.apache.org/community.html#mailing-lists) 和其他地方一些频繁被问起的问题, 以下是一些有关 Windows 的底层事实,这些信息可能会让您感到惊讶。
-#### Sliding Windows Make Copies
+<a name="Sliding-Windows-Make-Copies"></a>
+#### 滑动窗口是通过复制来实现的
-Sliding window assigners can create lots of window objects, and will copy each event into every
-relevant window. For example, if you have sliding windows every 15 minutes that are 24-hours in
-length, each event will be copied into 4 * 24 = 96 windows.
+滑动窗口分配器可以创建许多窗口对象,并将每个事件复制到每个相关的窗口中。例如,如果您每隔 15 分钟就有 24 小时的滑动窗口,则每个事件将被复制到 4 * 24 = 96 个窗口中。
-#### Time Windows are Aligned to the Epoch
+<a name="Time-Windows-are-Aligned-to-the-Epoch"></a>
+#### 时间窗口会和时间对齐
-Just because you are using hour-long processing-time windows and start your application running at
-12:05 does not mean that the first window will close at 1:05. The first window will be 55 minutes
-long and close at 1:00.
+仅仅因为我们使用的是一个小时的处理时间窗口并在 12:05 开始运行您的应用程序,并不意味着第一个窗口将在 1:05 关闭。第一个窗口将长 55 分钟,并在 1:00 关闭。
-Note, however, that the tumbling and sliding window assigners take an optional offset parameter
-that can be used to change the alignment of the windows. See
-[Tumbling Windows]({% link dev/stream/operators/windows.zh.md %}#tumbling-windows) and
-[Sliding Windows]({% link dev/stream/operators/windows.zh.md %}#sliding-windows) for details.
+ 请注意,滑动窗口和滚动窗口分配器所采用的 offset 参数可用于改变窗口的对齐方式。有关详细的信息,请参见
+[滚动窗口]({% link dev/stream/operators/windows.zh.md %}#tumbling-windows) 和
+[滑动窗口]({% link dev/stream/operators/windows.zh.md %}#sliding-windows) 。
-#### Windows Can Follow Windows
+#### window 后面可以接 window
Review comment:
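I suggest adding an `<a>` tag here; in addition, the other `<a>` tags could be changed to all lowercase. Here is a small tip: open the English documentation, check the URL of the corresponding heading, and use its anchor as the value of `name` in `<a name=...>`.
Please add `<a>` tags in the other places in this document as well.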
##########
File path: docs/learn-flink/streaming_analytics.zh.md
##########
@@ -29,123 +29,86 @@ under the License.
## Event Time and Watermarks
-### Introduction
+<a name="Introduction"></a>
+### 概要
-Flink explicitly supports three different notions of time:
+Flink 明确支持以下三种时间语义:
-* _event time:_ the time when an event occurred, as recorded by the device producing (or storing) the event
+* _事件时间(event time):_ 事件产生的时间,记录的是设备生产(或者存储)事件的时间
-* _ingestion time:_ a timestamp recorded by Flink at the moment it ingests the event
+* _摄取时间(ingestion time):_ Flink 读取事件时记录的时间
-* _processing time:_ the time when a specific operator in your pipeline is processing the event
+* _处理时间(processing time):_ Flink pipeline 中具体算子处理事件的时间
-For reproducible results, e.g., when computing the maximum price a stock reached during the first
-hour of trading on a given day, you should use event time. In this way the result won't depend on
-when the calculation is performed. This kind of real-time application is sometimes performed using
-processing time, but then the results are determined by the events that happen to be processed
-during that hour, rather than the events that occurred then. Computing analytics based on processing
-time causes inconsistencies, and makes it difficult to re-analyze historic data or test new
-implementations.
+为了获得可重现的结果,例如在计算过去的特定一天里第一个小时股票的最高价格时,我们应该使用事件时间。这样的话,无论什么时间去计算都不会影响输出结果。然而如果使用 processing time 的话,实时应用程序的结果是由程序运行的时间所决定。多次运行基于 processing time 的实时程序,可能得到的结果都不相同,也可能会导致再次分析历史数据或者测试新代码变得异常困难。
-### Working with Event Time
+<a name="Working-with-Event-Time"></a>
+### 使用 Event Time
-By default, Flink will use processing time. To change this, you can set the Time Characteristic:
+Flink 在默认情况下是使用处理时间。也可以通过下面配置来告诉 Flink 选择哪种时间语义:
{% highlight java %}
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
{% endhighlight %}
-If you want to use event time, you will also need to supply a Timestamp Extractor and Watermark
-Generator that Flink will use to track the progress of event time. This will be covered in the
-section below on [Working with Watermarks]({% link
-learn-flink/streaming_analytics.zh.md %}#working-with-watermarks), but first we should explain what
-watermarks are.
+如果想要使用事件时间,需要额外给 Flink 提供一个时间戳提取器和 Watermark 生成器,Flink 将使用它们来跟踪事件时间的进度。这将在选节[使用 Watermarks]({% link learn-flink/streaming_analytics.zh.md %}#Working-with-Watermarks) 中介绍,但是首先我们需要解释一下 watermarks 是什么。
### Watermarks
-Let's work through a simple example that will show why watermarks are needed, and how they work.
+让我们通过一个简单的示例来演示为什么需要 watermarks 及其工作方式。
-In this example you have a stream of timestamped events that arrive somewhat out of order, as shown
-below. The numbers shown are timestamps that indicate when these events actually occurred. The first
-event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2,
-and so on:
+在此示例中,我们将看到带有混乱时间戳的事件流,如下所示。显示的数字表达的是这些事件实际发生时间的时间戳。到达的第一个事件发生在时间 4,随后发生的事件发生在更早的时间 2,依此类推:
<div class="text-center" style="font-size: x-large; word-spacing: 0.5em;
margin: 1em 0em;">
··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
</div>
-Now imagine that you are trying create a stream sorter. This is meant to be an application that
-processes each event from a stream as it arrives, and emits a new stream containing the same events,
-but ordered by their timestamps.
+假设我们要对数据流排序,我们想要达到的目的是:应用程序应该在数据流里的事件到达时就有一个算子(我们暂且称之为排序)开始处理事件,这个算子所输出的流是按照时间戳排序好的。
-Some observations:
+让我们重新审视这些数据:
-(1) The first element your stream sorter sees is the 4, but you can't just immediately release it as
-the first element of the sorted stream. It may have arrived out of order, and an earlier event might
-yet arrive. In fact, you have the benefit of some god-like knowledge of this stream's future, and
-you can see that your stream sorter should wait at least until the 2 arrives before producing any
-results.
+(1) 我们的排序器看到的第一个事件的时间戳是 4,但是我们不能立即将其作为已排序的流释放。因为我们并不能确定它是有序的,并且较早的事件有可能并未到达。事实上,如果站在上帝视角,我们知道,必须要等到时间戳为 2 的元素到来时,排序器才可以有事件输出。
-*Some buffering, and some delay, is necessary.*
+*需要一些缓冲,需要一些时间,但这都是值得的*
-(2) If you do this wrong, you could end up waiting forever. First the sorter saw an event from time
-4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe.
-Maybe not. You could wait forever and never see a 1.
+(2) 接下来的这一步,如果我们选择的是固执的等待,我们永远不会有结果。首先,我们看到了时间戳为 4 的事件,然后看到了时间戳为 2 的事件。可是,时间戳小于 2 的事件接下来会不会到来呢?可能会,也可能不会。再次站在上帝视角,我们知道,我们永远不会看到时间戳 1。
-*Eventually you have to be courageous and emit the 2 as the start of the sorted stream.*
+*最终,我们必须勇于承担责任,并发出指令,把带有时间戳 2 的事件作为已排序的事件流的开始*
-(3) What you need then is some sort of policy that defines when, for any given timestamped event, to
-stop waiting for the arrival of earlier events.
+(3)然后,我们需要一种策略,该策略定义:对于任何给定时间戳的事件,Flink 何时停止等待较早事件的到来。
-*This is precisely what watermarks do* — they define when to stop waiting for earlier events.
+*这正是 watermarks 的作用* — 它们定义何时停止等待较早的事件。
-Event time processing in Flink depends on *watermark generators* that insert special timestamped
-elements into the stream, called *watermarks*. A watermark for time _t_ is an assertion that the
-stream is (probably) now complete up through time _t_.
+Flink 中事件时间的处理取决于 *watermark 生成器*,后者将带有时间戳的特殊元素插入流中形成 *watermarks*。事件时间 _t_ 的 watermark 代表 _t_ 之前(很可能)都已经到达。
-When should this stream sorter stop waiting, and push out the 2 to start the sorted stream? When a
-watermark arrives with a timestamp of 2, or greater.
+当 watermark 以 2 或更大的时间戳到达时,事件流的排序器应停止等待,并输出 2 作为已经排序好的流。
-(4) You might imagine different policies for deciding how to generate watermarks.
+(4) 我们可能会思考,如何决定 watermarks 的不同生成策略
-Each event arrives after some delay, and these delays vary, so some events are delayed more than
-others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink
-refers to this strategy as *bounded-out-of-orderness* watermarking. It is easy to imagine more
-complex approaches to watermarking, but for most applications a fixed delay works well enough.
+每个事件都会延迟一段时间后到达,然而这些延迟有所不同,有些事件可能比其他事件延迟得更多。一种简单的方法是假定这些延迟受某个最大延迟的限制。Flink 将此策略称为 *最大无序边界(bounded-out-of-orderness)* watermark。当然,我们可以想像出更好的生成 watermark 的方法,但是对于大多数应用而言,固定延迟策略已经足够了。
-### Latency vs. Completeness
+<a name="Latency-vs-Completeness"></a>
+### 延迟 VS 正确性
-Another way to think about watermarks is that they give you, the developer of a streaming
-application, control over the tradeoff between latency and completeness. Unlike in batch processing,
-where one has the luxury of being able to have complete knowledge of the input before producing any
-results, with streaming you must eventually stop waiting to see more of the input, and produce some
-sort of result.
+watermarks 给了开发者流处理的一种选择,它们使开发人员在开发应用程序时可以控制延迟和完整性之间的权衡。与批处理不同,批处理中的奢侈之处在于可以在产生任何结果之前完全了解输入,而使用流式传输,我们不被允许等待所有的时间都产生了,才输出排序好的数据,这与流相违背。
-You can either configure your watermarking aggressively, with a short bounded delay, and thereby
-take the risk of producing results with rather incomplete knowledge of the input -- i.e., a possibly
-wrong result, produced quickly. Or you can wait longer, and produce results that take advantage of
-having more complete knowledge of the input stream(s).
+我们可以把 watermarks 的边界时间配置的相对较短,从而冒着在输入了解不完全的情况下产生结果的风险-即可能会很快产生错误结果。或者,你可以等待更长的时间,并利用对输入流的更全面的了解来产生结果。
-It is also possible to implement hybrid solutions that produce initial results quickly, and then
-supply updates to those results as additional (late) data is processed. This is a good approach for
-some applications.
+当然也可以实施混合解决方案,先快速产生初步结果,然后在处理其他(最新)数据时向这些结果提供更新。对于有一些对延迟的容忍程度很低,但是又对结果有很严格的要求的场景下,或许是一个福音。
-### Lateness
+<a name="Latency"></a>
+### 延迟
-Lateness is defined relative to the watermarks. A `Watermark(t)` asserts that the stream is complete
-up through time _t_; any event following this watermark whose timestamp is ≤ _t_ is late.
+延迟是相对于 watermarks 定义的。`Watermark(t)` 判定事件流的时间已经到达了 _t_; watermark 之后的时间戳为 ≤ _t_ 的任何事件都被称之为延迟事件。
Review comment:
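Would this sentence read better if “`Watermark(t)` 判定事件流的时间已经到达了 _t_” were changed to “`Watermark(t)` 表示时间 _t_ 之前的事件都已经达到” (that is, "`Watermark(t)` indicates that all events before time _t_ have arrived")?
Would “watermark 之后的时间戳为 ≤ _t_ 的” -> "watermark 之后的时间戳 ≤ _t_ 的" (dropping the redundant “为”) also be better?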
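For reference while settling on the wording, the definition under discussion can be sketched in plain Java. This is only an illustration of the rule stated in the English source (an event following `Watermark(t)` with timestamp ≤ _t_ is late); `LatenessSketch` and `isLate` are hypothetical names and not part of the Flink API.

```java
public class LatenessSketch {

    // Hypothetical helper (not a Flink API): an event that arrives after
    // Watermark(t) is considered late when its timestamp is <= t.
    public static boolean isLate(long eventTimestamp, long currentWatermark) {
        return eventTimestamp <= currentWatermark;
    }

    public static void main(String[] args) {
        long watermark = 5L; // assume Watermark(5) has already been seen

        // Events arriving after the watermark, with illustrative timestamps.
        for (long ts : new long[] {4L, 5L, 6L}) {
            System.out.println("event " + ts + " late? " + isLate(ts, watermark));
        }
    }
}
```

Whichever phrasing is chosen, it should make clear that the boundary is inclusive: an event whose timestamp equals _t_ counts as late.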