yangyichao-mango commented on a change in pull request #12311:
URL: https://github.com/apache/flink/pull/12311#discussion_r432833068
##########
File path: docs/training/index.zh.md
##########
@@ -29,158 +29,90 @@ under the License.
* This will be replaced by the TOC
{:toc}
-## Goals and Scope of this Training
+## Goals and Scope of This Tutorial
-This training presents an introduction to Apache Flink that includes just enough to get you started
-writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
-(ultimately important) details. The focus is on providing straightforward introductions to Flink's
-APIs for managing state and time, with the expectation that having mastered these fundamentals,
-you'll be much better equipped to pick up the rest of what you need to know from the more detailed
-reference documentation. The links at the end of each section will lead you to where you
-can learn more.
+This tutorial introduces the basic concepts of Apache Flink. Although it omits many important details, mastering this material is enough for you to build scalable, parallel streaming ETL, analytics, and event-driven applications. It focuses on state management and time in Flink's APIs; once you have mastered these fundamentals, you will be much better equipped to pick up everything else you need from the more detailed reference documentation. Links at the end of each section guide you to further material.
-Specifically, you will learn:
+Specifically, in this tutorial you will learn:
-- how to implement streaming data processing pipelines
-- how and why Flink manages state
-- how to use event time to consistently compute accurate analytics
-- how to build event-driven applications on continuous streams
-- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to compute analytics consistently and accurately
+- how to build event-driven applications on continuous streams
+- how Flink provides fault-tolerant, stateful stream processing with exactly-once semantics
-This training focuses on four critical concepts: continuous processing of streaming data, event
-time, stateful stream processing, and state snapshots. This page introduces these concepts.
+This tutorial focuses on four concepts: continuous processing of streaming data, event time, stateful stream processing, and state snapshots. They are introduced below.
-{% info Note %} Accompanying this training is a set of hands-on exercises that will
-guide you through learning how to work with the concepts being presented. A link to the relevant
-exercise is provided at the end of each section.
+{% info Note %} Each section of this tutorial comes with a hands-on exercise that guides you through applying the concepts it presents, and a link to the code for the relevant exercise is provided at the end of each section.
{% top %}
-## Stream Processing
+## Stream Processing
-Streams are data's natural habitat. Whether it is events from web servers, trades from a stock
-exchange, or sensor readings from a machine on a factory floor, data is created as part of a stream.
-But when you analyze data, you can either organize your processing around _bounded_ or _unbounded_
-streams, and which of these paradigms you choose has profound consequences.
+In the natural world, data is produced as streams. Whether it is event data from web servers, trades from a stock exchange, or sensor readings from machines on a factory floor, the data originates as a stream. But when you analyze data, you can organize your processing around either _bounded_ or _unbounded_ streams, and which of these two models you choose has profound consequences for how your program executes and processes the data.
<img src="{{ site.baseurl }}/fig/bounded-unbounded.png" alt="Bounded and
unbounded streams" class="offset" width="90%" />
-**Batch processing** is the paradigm at work when you process a bounded data stream. In this mode of
-operation you can choose to ingest the entire dataset before producing any results, which means that
-it is possible, for example, to sort the data, compute global statistics, or produce a final report
-that summarizes all of the input.
+**Batch processing** is the paradigm for processing bounded data streams. In this mode you can choose to ingest the entire dataset before producing any results, which means you can, for example, sort the data, or compute statistics and summaries over the whole dataset, before emitting the result.
-**Stream processing**, on the other hand, involves unbounded data streams. Conceptually, at least,
-the input may never end, and so you are forced to continuously process the data as it arrives.
+**Stream processing**, by contrast, involves unbounded data streams. Conceptually at least, the input never ends, so the program must continuously process data as it arrives.
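To make the contrast concrete, here is a minimal DataStream sketch, assuming a local Flink setup; `fromElements` stands in for a bounded dataset, and the socket host and port are hypothetical placeholders for an unbounded source:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded: the whole dataset is known up front, so results such as a
        // global sort or summary can wait until all of the input has been read.
        DataStream<Integer> bounded = env.fromElements(3, 1, 4, 1, 5);

        // Unbounded: a socket never signals end-of-input, so each record has to
        // be processed as it arrives (host and port are hypothetical).
        DataStream<String> unbounded = env.socketTextStream("localhost", 9999);

        bounded.print();
        unbounded.print();

        env.execute("Bounded vs. unbounded sketch");
    }
}
```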
-In Flink, applications are composed of **streaming dataflows** that may be transformed by
-user-defined **operators**. These dataflows form directed graphs that start with one or more
-**sources**, and end in one or more **sinks**.
+In Flink, applications are composed of **streaming dataflows** that are transformed by user-defined **operators**. These dataflows form directed graphs that start with one or more **sources** and end in one or more **sinks**.
<img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream
program, and its dataflow." class="offset" width="80%" />
-Often there is a one-to-one correspondence between the transformations in the program and the
-operators in the dataflow. Sometimes, however, one transformation may consist of multiple operators.
+Usually there is a one-to-one correspondence between the transformations in the program and the operators in the dataflow. Sometimes, however, one transformation may consist of multiple operators, as shown in the figure above.
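As a minimal sketch of this source → operators → sink structure (the socket address below is a hypothetical stand-in for a production source such as Kafka):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class DataflowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: one node at the start of the directed graph.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Operator: a user-defined transformation applied to the stream.
        DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String line, Collector<String> out) {
                for (String word : line.split("\\s+")) {
                    out.collect(word);
                }
            }
        });

        // Sink: one node at the end of the graph.
        words.print();

        env.execute("Dataflow sketch");
    }
}
```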
-An application may consume real-time data from streaming sources such as message queues or
-distributed logs, like Apache Kafka or Kinesis. But flink can also consume bounded, historic data
-from a variety of data sources. Similarly, the streams of results being produced by a Flink
-application can be sent to a wide variety of systems that can be connected as sinks.
+A Flink application can consume real-time data from streaming sources such as message queues or distributed logs (for example, Apache Kafka or Kinesis), and it can also consume bounded, historical data from a variety of sources. Likewise, the streams of results produced by a Flink application can be sent to a wide variety of systems connected as sinks.
<img src="{{ site.baseurl }}/fig/flink-application-sources-sinks.png"
alt="Flink application with sources and sinks" class="offset" width="90%" />
-### Parallel Dataflows
+### Parallel Dataflows
-Programs in Flink are inherently parallel and distributed. During execution, a
-*stream* has one or more **stream partitions**, and each *operator* has one or
-more **operator subtasks**. The operator subtasks are independent of one
-another, and execute in different threads and possibly on different machines or
-containers.
+Flink programs are inherently parallel and distributed. During execution, a *stream* has one or more **stream partitions**, and each *operator* has one or more **operator subtasks**. The operator subtasks are independent of one another, and execute in different threads, possibly on different machines or containers.
-The number of operator subtasks is the **parallelism** of that particular
-operator.
-Different operators of the same program may have different levels of
-parallelism.
+The number of operator subtasks is the **parallelism** of that particular operator. Different operators within the same program may have different levels of parallelism.
<img src="{{ site.baseurl }}/fig/parallel_dataflow.svg" alt="A parallel
dataflow" class="offset" width="80%" />
-Streams can transport data between two operators in a *one-to-one* (or
-*forwarding*) pattern, or in a *redistributing* pattern:
-
- - **One-to-one** streams (for example between the *Source* and the *map()*
-   operators in the figure above) preserve the partitioning and ordering of
-   the elements. That means that subtask[1] of the *map()* operator will see
-   the same elements in the same order as they were produced by subtask[1] of
-   the *Source* operator.
-
- - **Redistributing** streams (as between *map()* and *keyBy/window* above, as
-   well as between *keyBy/window* and *Sink*) change the partitioning of
-   streams. Each *operator subtask* sends data to different target subtasks,
-   depending on the selected transformation. Examples are *keyBy()* (which
-   re-partitions by hashing the key), *broadcast()*, or *rebalance()* (which
-   re-partitions randomly). In a *redistributing* exchange the ordering among
-   the elements is only preserved within each pair of sending and receiving
-   subtasks (for example, subtask[1] of *map()* and subtask[2] of
-   *keyBy/window*). So, for example, the redistribution between the keyBy/window and
-   the Sink operators shown above introduces non-determinism regarding the
-   order in which the aggregated results for different keys arrive at the Sink.
+Data can be transported between two Flink operators in a *one-to-one* (*forwarding*) pattern or in a *redistributing* pattern:
+
+ - The **one-to-one** pattern (for example between the *Source* and *map()* operators in the figure above) preserves the partitioning and ordering of the elements. That means the data received by subtask[1] of the *map()* operator, and its order, are exactly the same as the data produced by subtask[1] of the *Source* operator.
Review comment:
> `preserves the partitioning and ordering of the elements`
> My understanding is that this means the downstream partitioning is guaranteed to match the upstream partitioning (data from a given partition goes to the same downstream partition), and the order of the elements is also preserved. In the current wording, "preserving the ordering" reads fine, but is there a better way to describe "preserving the partitioning"? Thx.

My personal understanding is that, in the sentence `That means the data received by subtask[1] of the *map()* operator, and its order, are exactly the same as the data produced by subtask[1] of the *Source* operator`, the phrases `the data received` and `exactly the same as the data produced` are meant to express that `data from a given partition goes to the same downstream partition`, although not as intuitively as the description above. I will improve the wording, thanks.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]