RocMarshal commented on a change in pull request #16852:
URL: https://github.com/apache/flink/pull/16852#discussion_r697991286
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -209,46 +159,30 @@ print()
{{< /tab >}}
{{< /tabs >}}
-Once you specified the complete program you need to **trigger the program
-execution** by calling `execute()` on the `StreamExecutionEnvironment`.
-Depending on the type of the `ExecutionEnvironment` the execution will be
-triggered on your local machine or submit your program for execution on a
-cluster.
+一旦指定了完整的程序,需要调用 `StreamExecutionEnvironment` 的 `execute()` 方法来**触发程序执行**。根据 `ExecutionEnvironment` 的类型,执行会在你的本地机器上触发,或将你的程序提交到某个集群上执行。
-The `execute()` method will wait for the job to finish and then return a
-`JobExecutionResult`, this contains execution times and accumulator results.
+`execute()` 方法将等待作业完成,然后返回一个 `JobExecutionResult` ,其中包含执行时间和累加器结果。
-If you don't want to wait for the job to finish, you can trigger asynchronous
-job execution by calling `executeAsync()` on the `StreamExecutionEnvironment`.
-It will return a `JobClient` with which you can communicate with the job you
-just submitted. For instance, here is how to implement the semantics of
-`execute()` by using `executeAsync()`.
+如果不想等待作业完成,可以通过调用 `StreamExecutionEnvironment` 的 `executeAsync()` 方法来触发作业异步执行。它会返回一个 `JobClient` ,你可以通过它与刚刚提交的作业进行通信。如下是使用 `executeAsync()` 实现 `execute()` 语义的示例。
```java
final JobClient jobClient = env.executeAsync();
final JobExecutionResult jobExecutionResult =
jobClient.getJobExecutionResult().get();
```
-That last part about program execution is crucial to understanding when and how
-Flink operations are executed. All Flink programs are executed lazily: When the
-program's main method is executed, the data loading and transformations do not
-happen directly. Rather, each operation is created and added to a dataflow
-graph. The operations are actually executed when the execution is explicitly
-triggered by an `execute()` call on the execution environment. Whether the
-program is executed locally or on a cluster depends on the type of execution
-environment
+关于程序执行的最后一部分对于理解何时以及如何执行 Flink 算子是至关重要的。所有 Flink 程序都是惰性执行的:当程序的 main 方法被执行时,数据加载和转换不会直接发生。相反,每个算子都被创建并添加到 dataflow 形成的有向图。当执行被执行环境的 `execute()` 方法显式地触发时,这些算子才会真正执行。程序是在本地执行还是在集群上执行取决于执行环境的类型。
-The lazy evaluation lets you construct sophisticated programs that Flink
-executes as one holistically planned unit.
+惰性计算让你构建复杂的程序,Flink 将其作为一个整体规划的单元来执行。
Review comment:
Maybe you could do it better.
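
For readers following the lazy-execution paragraph under review here, a minimal, self-contained sketch (not part of the PR; the class and job names are invented) of what "operators are only created and added to the dataflow graph" means in practice:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LazyExecutionDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // These calls only create operators and add them to the dataflow graph;
        // no data is loaded or transformed yet.
        env.fromElements("flink", "is", "lazy")
           .map(String::toUpperCase)
           .print();

        // Execution is only triggered here, locally or on a cluster
        // depending on how the environment was created.
        env.execute("lazy-evaluation-demo");
    }
}
```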
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -209,46 +159,30 @@ print()
{{< /tab >}}
{{< /tabs >}}
-Once you specified the complete program you need to **trigger the program
-execution** by calling `execute()` on the `StreamExecutionEnvironment`.
-Depending on the type of the `ExecutionEnvironment` the execution will be
-triggered on your local machine or submit your program for execution on a
-cluster.
+一旦指定了完整的程序,需要调用 `StreamExecutionEnvironment` 的 `execute()` 方法来**触发程序执行**。根据 `ExecutionEnvironment` 的类型,执行会在你的本地机器上触发,或将你的程序提交到某个集群上执行。
-The `execute()` method will wait for the job to finish and then return a
-`JobExecutionResult`, this contains execution times and accumulator results.
+`execute()` 方法将等待作业完成,然后返回一个 `JobExecutionResult` ,其中包含执行时间和累加器结果。
Review comment:
```suggestion
`execute()` 方法将等待作业完成,然后返回一个 `JobExecutionResult`,其中包含执行时间和累加器结果。
```
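
Since the suggested line mentions both execution times and accumulator results, here is a hedged sketch of how the two come back from a blocking `execute()` call (the accumulator name "num-records" and the demo class are invented for illustration):

```java
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BlockingExecuteDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
           .map(new RichMapFunction<String, String>() {
               private final IntCounter numRecords = new IntCounter();

               @Override
               public void open(Configuration parameters) {
                   // Register the accumulator under a made-up name.
                   getRuntimeContext().addAccumulator("num-records", numRecords);
               }

               @Override
               public String map(String value) {
                   numRecords.add(1);
                   return value;
               }
           })
           .print();

        // execute() blocks until the job finishes, then hands back the results.
        JobExecutionResult result = env.execute("blocking-execute-demo");
        System.out.println("net runtime (ms): " + result.getNetRuntime());
        System.out.println("records seen: " + result.<Integer>getAccumulatorResult("num-records"));
    }
}
```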
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -209,46 +159,30 @@ print()
{{< /tab >}}
{{< /tabs >}}
-Once you specified the complete program you need to **trigger the program
-execution** by calling `execute()` on the `StreamExecutionEnvironment`.
-Depending on the type of the `ExecutionEnvironment` the execution will be
-triggered on your local machine or submit your program for execution on a
-cluster.
+一旦指定了完整的程序,需要调用 `StreamExecutionEnvironment` 的 `execute()` 方法来**触发程序执行**。根据 `ExecutionEnvironment` 的类型,执行会在你的本地机器上触发,或将你的程序提交到某个集群上执行。
-The `execute()` method will wait for the job to finish and then return a
-`JobExecutionResult`, this contains execution times and accumulator results.
+`execute()` 方法将等待作业完成,然后返回一个 `JobExecutionResult` ,其中包含执行时间和累加器结果。
-If you don't want to wait for the job to finish, you can trigger asynchronous
-job execution by calling `executeAsync()` on the `StreamExecutionEnvironment`.
-It will return a `JobClient` with which you can communicate with the job you
-just submitted. For instance, here is how to implement the semantics of
-`execute()` by using `executeAsync()`.
+如果不想等待作业完成,可以通过调用 `StreamExecutionEnvironment` 的 `executeAsync()` 方法来触发作业异步执行。它会返回一个 `JobClient` ,你可以通过它与刚刚提交的作业进行通信。如下是使用 `executeAsync()` 实现 `execute()` 语义的示例。
```java
final JobClient jobClient = env.executeAsync();
final JobExecutionResult jobExecutionResult =
jobClient.getJobExecutionResult().get();
```
-That last part about program execution is crucial to understanding when and how
-Flink operations are executed. All Flink programs are executed lazily: When the
-program's main method is executed, the data loading and transformations do not
-happen directly. Rather, each operation is created and added to a dataflow
-graph. The operations are actually executed when the execution is explicitly
-triggered by an `execute()` call on the execution environment. Whether the
-program is executed locally or on a cluster depends on the type of execution
-environment
+关于程序执行的最后一部分对于理解何时以及如何执行 Flink 算子是至关重要的。所有 Flink 程序都是惰性执行的:当程序的 main 方法被执行时,数据加载和转换不会直接发生。相反,每个算子都被创建并添加到 dataflow 形成的有向图。当执行被执行环境的 `execute()` 方法显式地触发时,这些算子才会真正执行。程序是在本地执行还是在集群上执行取决于执行环境的类型。
-The lazy evaluation lets you construct sophisticated programs that Flink
-executes as one holistically planned unit.
+惰性计算让你构建复杂的程序,Flink 将其作为一个整体规划的单元来执行。
{{< top >}}
-Example Program
+<a name="example-program"></a>
+
+示例程序
---------------
-The following program is a complete, working example of streaming window word count application, that counts the
-words coming from a web socket in 5 second windows. You can copy & paste the code to run it locally.
+如下是一个完整、可运行的程序示例,它是基于流窗口的单词统计应用程序,计算 5 秒窗口内来自 Web 套接字的单词数。你可以复制并粘贴代码以在本地运行。
Review comment:
```suggestion
如下是一个完整的、可运行的程序示例,它是基于流窗口的单词统计应用程序,计算 5 秒窗口内来自 Web 套接字的单词数。你可以复制并粘贴代码以在本地运行。
```
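
The hunk refers to the WindowWordCount example program without quoting its listing. A condensed sketch of such a program with the standard DataStream API follows (the class name and exact structure are illustrative, not the file's actual listing):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           // Split each line into (word, 1) pairs.
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.split("\\s")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           // Lambdas lose generic type information to erasure, so declare it explicitly.
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(value -> value.f0)
           // Count per word in 5-second processing-time windows.
           .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
           .sum(1)
           .print();

        env.execute("Window WordCount");
    }
}
```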
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -209,46 +159,30 @@ print()
{{< /tab >}}
{{< /tabs >}}
-Once you specified the complete program you need to **trigger the program
-execution** by calling `execute()` on the `StreamExecutionEnvironment`.
-Depending on the type of the `ExecutionEnvironment` the execution will be
-triggered on your local machine or submit your program for execution on a
-cluster.
+一旦指定了完整的程序,需要调用 `StreamExecutionEnvironment` 的 `execute()` 方法来**触发程序执行**。根据 `ExecutionEnvironment` 的类型,执行会在你的本地机器上触发,或将你的程序提交到某个集群上执行。
-The `execute()` method will wait for the job to finish and then return a
-`JobExecutionResult`, this contains execution times and accumulator results.
+`execute()` 方法将等待作业完成,然后返回一个 `JobExecutionResult` ,其中包含执行时间和累加器结果。
-If you don't want to wait for the job to finish, you can trigger asynchronous
-job execution by calling `executeAsync()` on the `StreamExecutionEnvironment`.
-It will return a `JobClient` with which you can communicate with the job you
-just submitted. For instance, here is how to implement the semantics of
-`execute()` by using `executeAsync()`.
+如果不想等待作业完成,可以通过调用 `StreamExecutionEnvironment` 的 `executeAsync()` 方法来触发作业异步执行。它会返回一个 `JobClient` ,你可以通过它与刚刚提交的作业进行通信。如下是使用 `executeAsync()` 实现 `execute()` 语义的示例。
Review comment:
```suggestion
如果不想等待作业完成,可以通过调用 `StreamExecutionEnvironment` 的 `executeAsync()` 方法来触发作业异步执行。它会返回一个 `JobClient`,你可以通过它与刚刚提交的作业进行通信。如下是使用 `executeAsync()` 实现 `execute()` 语义的示例。
```
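
As a companion to the suggested wording, a small sketch of what else the returned `JobClient` can be used for besides `getJobExecutionResult()` (the job name is invented; the calls shown are standard `JobClient` methods):

```java
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AsyncExecuteDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();

        // Returns as soon as the job has been submitted.
        JobClient jobClient = env.executeAsync("async-execute-demo");

        System.out.println("submitted job " + jobClient.getJobID());
        System.out.println("status: " + jobClient.getJobStatus().get());

        // Block only when the result is actually needed.
        jobClient.getJobExecutionResult().get();
    }
}
```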
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -319,248 +253,206 @@ object WindowWordCount {
{{< /tab >}}
{{< /tabs >}}
-To run the example program, start the input stream with netcat first from a terminal:
+要运行示例程序,首先从终端使用 netcat 启动输入流:
```bash
nc -lk 9999
```
-Just type some words hitting return for a new word. These will be the input to the
-word count program. If you want to see counts greater than 1, type the same word again and again within
-5 seconds (increase the window size from 5 seconds if you cannot type that fast ☺).
+只需输入一些单词,然后按回车键即可传入新单词。这些将作为单词统计程序的输入。如果想查看大于 1 的计数,在 5 秒内重复输入相同的单词即可(如果无法快速输入,则可以将窗口大小从 5 秒增加 ☺)。
{{< top >}}
+<a name="data-sources"></a>
+
Data Sources
------------
{{< tabs "8104e62c-db79-40b0-8519-0063e9be791f" >}}
{{< tab "Java" >}}
-Sources are where your program reads its input from. You can attach a source to your program by
-using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
-source functions, but you can always write your own custom sources by implementing the `SourceFunction`
-for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
-`RichParallelSourceFunction` for parallel sources.
-
-There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
-
-File-based:
+Source 是你程序从中读取其输入的地方。你可以用 `StreamExecutionEnvironment.addSource(sourceFunction)` 将一个 source 关联到你的程序。Flink 自带了许多预先实现的 source functions,不过你总是可以编写自定义的 source,可以为非并行 source 实现 `SourceFunction` 接口,也可以为并行 source 实现 `ParallelSourceFunction` 接口或继承 `RichParallelSourceFunction`。
-- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
+通过 `StreamExecutionEnvironment` 可以访问多种预定义的 stream source:
-- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.
+基于文件:
-- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.
+- `readTextFile(path)` - 读取文本文件,例如遵守 TextInputFormat 规范的文件,逐行读取并将它们作为字符串返回。
- *IMPLEMENTATION:*
+- `readFile(fileInputFormat, path)` - 按照指定的文件输入格式读取(一次)文件。
- Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once depending on the `watchType`), find the files to be processed, divide them in *splits*, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read multiple splits, one-by-one.
+- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - 这是前两个方法内部调用的方法。它基于给定的 `fileInputFormat` 读取路径`path` 上的文件。根据提供的 `watchType` 的不同,source 可能定期(每 `interval` 毫秒)监控路径上的新数据(watchType 为 `FileProcessingMode.PROCESS_CONTINUOUSLY`),或者处理一次当前路径中的数据然后退出 (watchType 为 `FileProcessingMode.PROCESS_ONCE`)。使用 `pathFilter`,用户可以进一步排除正在处理的文件。
Review comment:
```suggestion
- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - 这是前两个方法内部调用的方法。它基于给定的 `fileInputFormat` 读取路径 `path` 上的文件。根据提供的 `watchType` 的不同,source 可能定期(每 `interval` 毫秒)监控路径上的新数据(watchType 为 `FileProcessingMode.PROCESS_CONTINUOUSLY`),或者处理一次当前路径中的数据然后退出(watchType 为 `FileProcessingMode.PROCESS_ONCE`)。使用 `pathFilter`,用户可以进一步排除正在处理的文件。
```
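
The corrected bullet describes the `readFile` variant with a `watchType`. A minimal sketch of the continuous-monitoring mode follows (the watched path and the 10-second interval are made up):

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class ReadFileDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String path = "file:///tmp/input";  // made-up directory to watch
        TextInputFormat format = new TextInputFormat(new Path(path));

        // PROCESS_CONTINUOUSLY: re-scan the path every 10 seconds for new data;
        // PROCESS_ONCE would read what is there now and then exit.
        DataStream<String> lines =
                env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000L);

        lines.print();
        env.execute("read-file-demo");
    }
}
```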
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -319,248 +253,206 @@ object WindowWordCount {
{{< /tab >}}
{{< /tabs >}}
-To run the example program, start the input stream with netcat first from a terminal:
+要运行示例程序,首先从终端使用 netcat 启动输入流:
```bash
nc -lk 9999
```
-Just type some words hitting return for a new word. These will be the input to the
-word count program. If you want to see counts greater than 1, type the same word again and again within
-5 seconds (increase the window size from 5 seconds if you cannot type that fast ☺).
+只需输入一些单词,然后按回车键即可传入新单词。这些将作为单词统计程序的输入。如果想查看大于 1 的计数,在 5 秒内重复输入相同的单词即可(如果无法快速输入,则可以将窗口大小从 5 秒增加 ☺)。
{{< top >}}
+<a name="data-sources"></a>
+
Data Sources
------------
{{< tabs "8104e62c-db79-40b0-8519-0063e9be791f" >}}
{{< tab "Java" >}}
-Sources are where your program reads its input from. You can attach a source to your program by
-using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
-source functions, but you can always write your own custom sources by implementing the `SourceFunction`
-for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
-`RichParallelSourceFunction` for parallel sources.
-
-There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
-
-File-based:
+Source 是你程序从中读取其输入的地方。你可以用 `StreamExecutionEnvironment.addSource(sourceFunction)` 将一个 source 关联到你的程序。Flink 自带了许多预先实现的 source functions,不过你总是可以编写自定义的 source,可以为非并行 source 实现 `SourceFunction` 接口,也可以为并行 source 实现 `ParallelSourceFunction` 接口或继承 `RichParallelSourceFunction`。
-- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
+通过 `StreamExecutionEnvironment` 可以访问多种预定义的 stream source:
-- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.
+基于文件:
-- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.
+- `readTextFile(path)` - 读取文本文件,例如遵守 TextInputFormat 规范的文件,逐行读取并将它们作为字符串返回。
- *IMPLEMENTATION:*
+- `readFile(fileInputFormat, path)` - 按照指定的文件输入格式读取(一次)文件。
- Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once depending on the `watchType`), find the files to be processed, divide them in *splits*, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read multiple splits, one-by-one.
+- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - 这是前两个方法内部调用的方法。它基于给定的 `fileInputFormat` 读取路径`path` 上的文件。根据提供的 `watchType` 的不同,source 可能定期(每 `interval` 毫秒)监控路径上的新数据(watchType 为 `FileProcessingMode.PROCESS_CONTINUOUSLY`),或者处理一次当前路径中的数据然后退出 (watchType 为 `FileProcessingMode.PROCESS_ONCE`)。使用 `pathFilter`,用户可以进一步排除正在处理的文件。
- *IMPORTANT NOTES:*
+ *实现:*
- 1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.
+ 在底层,Flink 将文件读取过程拆分为两个子任务,即 *目录监控* 和 *数据读取*。每个子任务都由一个单独的实体实现。监控由单个**非并行**(并行度 = 1)任务实现,而读取由多个并行运行的任务执行。后者的并行度和作业的并行度相等。单个监控任务的作用是扫描目录(定期或仅扫描一次,取决于 `watchType`),找到要处理的文件,将它们划分为 *分片*,并将这些分片分配给下游 reader。Reader 是将实际获取数据的角色。每个分片只能被一个 reader 读取,而一个 reader 可以一个一个地读取多个分片。
- 2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading the file contents. Of course the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.
+ *重要提示:*
-Socket-based:
+ 1. 如果 `watchType` 设置为 `FileProcessingMode.PROCESS_CONTINUOUSLY`,当一个文件被修改时,它的内容会被完全重新处理。这可能会打破 "精确一次" 的语义,因为在文件末尾追加数据将导致重新处理文件的**所有**内容。
-- `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.
+ 2. 如果 `watchType` 设置为 `FileProcessingMode.PROCESS_ONCE`,source 扫描**一次**路径然后退出,无需等待 reader 读完文件内容。当然,reader 会继续读取数据,直到所有文件内容都读完。关闭 source 会导致在那之后不再有检查点。 这可能会导致节点故障后恢复速度变慢,因为作业将从最后一个检查点恢复读取。
-Collection-based:
+基于套接字:
-- `fromCollection(Collection)` - Creates a data stream from the Java Java.util.Collection. All elements
- in the collection must be of the same type.
+- `socketTextStream` - 从套接字读取。元素可以由分隔符分隔。
-- `fromCollection(Iterator, Class)` - Creates a data stream from an iterator. The class specifies the
- data type of the elements returned by the iterator.
+基于集合:
-- `fromElements(T ...)` - Creates a data stream from the given sequence of objects. All objects must be
- of the same type.
+- `fromCollection(Collection)` - 从 Java Java.util.Collection 创建数据流。集合中的所有元素必须属于同一类型。
+
+- `fromCollection(Iterator, Class)` - 从迭代器创建数据流。class 参数指定迭代器返回元素的数据类型。
+
+- `fromElements(T ...)` - 从给定的对象序列中创建数据流。所有的对象必须属于同一类型。
+
+- `fromParallelCollection(SplittableIterator, Class)` - 从迭代器并行创建数据流。class 参数指定迭代器返回元素的数据类型。
+
+- `generateSequence(from, to)` - 基于给定间隔内的数字序列并行生成数据流。
-- `fromParallelCollection(SplittableIterator, Class)` - Creates a data stream from an iterator, in
-  parallel. The class specifies the data type of the elements returned by the iterator.
+自定义:
-- `generateSequence(from, to)` - Generates the sequence of numbers in the given interval, in
-  parallel.
-
-Custom:
-
-- `addSource` - Attach a new source function. For example, to read from Apache Kafka you can use
-  `addSource(new FlinkKafkaConsumer<>(...))`. See [connectors]({{< ref "docs/connectors/datastream/overview" >}}) for more details.
+- `addSource` - 关联一个新的 source function。例如,你可以使用 `addSource(new FlinkKafkaConsumer<>(...))` 来从 Apache Kafka 获取数据。更多详细信息见[连接器]({{< ref "docs/connectors/datastream/overview" >}})。
{{< /tab >}}
{{< tab "Scala" >}}
-Sources are where your program reads its input from. You can attach a source to your program by
-using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
-source functions, but you can always write your own custom sources by implementing the `SourceFunction`
-for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
-`RichParallelSourceFunction` for parallel sources.
-
-There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
-
-File-based:
+Source 是你程序从中读取其输入的地方。你可以用 `StreamExecutionEnvironment.addSource(sourceFunction)` 将一个 source 关联到你的程序。Flink 自带了许多预先实现的 source functions,不过你总是可以编写自定义的 source,可以为非并行 source 实现 `SourceFunction` 接口,也可以为并行 source 实现 `ParallelSourceFunction` 接口或继承 `RichParallelSourceFunction`。
Review comment:
Maybe you should check the lines between line 319 and line 366 based on the comments between line 164 and line 318.
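
For the collection-based sources listed in this hunk, a compact sketch follows (the demo class and sample data are invented):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CollectionSourcesDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // fromCollection: all elements must be of the same type.
        List<String> words = Arrays.asList("to", "be", "or", "not");
        env.fromCollection(words).print();

        // fromElements: varargs shortcut for small fixed inputs.
        env.fromElements(1, 2, 3).print();

        // generateSequence: the numbers 0..9, generated in parallel.
        env.generateSequence(0, 9).print();

        env.execute("collection-sources-demo");
    }
}
```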
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -319,248 +253,206 @@ object WindowWordCount {
{{< /tab >}}
{{< /tabs >}}
-To run the example program, start the input stream with netcat first from a terminal:
+要运行示例程序,首先从终端使用 netcat 启动输入流:
```bash
nc -lk 9999
```
-Just type some words hitting return for a new word. These will be the input to the
-word count program. If you want to see counts greater than 1, type the same word again and again within
-5 seconds (increase the window size from 5 seconds if you cannot type that fast ☺).
+只需输入一些单词,然后按回车键即可传入新单词。这些将作为单词统计程序的输入。如果想查看大于 1 的计数,在 5 秒内重复输入相同的单词即可(如果无法快速输入,则可以将窗口大小从 5 秒增加 ☺)。
{{< top >}}
+<a name="data-sources"></a>
+
Data Sources
------------
{{< tabs "8104e62c-db79-40b0-8519-0063e9be791f" >}}
{{< tab "Java" >}}
-Sources are where your program reads its input from. You can attach a source to your program by
-using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
-source functions, but you can always write your own custom sources by implementing the `SourceFunction`
-for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
-`RichParallelSourceFunction` for parallel sources.
-
-There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
-
-File-based:
+Source 是你程序从中读取其输入的地方。你可以用 `StreamExecutionEnvironment.addSource(sourceFunction)` 将一个 source 关联到你的程序。Flink 自带了许多预先实现的 source functions,不过你总是可以编写自定义的 source,可以为非并行 source 实现 `SourceFunction` 接口,也可以为并行 source 实现 `ParallelSourceFunction` 接口或继承 `RichParallelSourceFunction`。
-- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
+通过 `StreamExecutionEnvironment` 可以访问多种预定义的 stream source:
-- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.
+基于文件:
-- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.
+- `readTextFile(path)` - 读取文本文件,例如遵守 TextInputFormat 规范的文件,逐行读取并将它们作为字符串返回。
- *IMPLEMENTATION:*
+- `readFile(fileInputFormat, path)` - 按照指定的文件输入格式读取(一次)文件。
- Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once depending on the `watchType`), find the files to be processed, divide them in *splits*, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read multiple splits, one-by-one.
+- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - 这是前两个方法内部调用的方法。它基于给定的 `fileInputFormat` 读取路径`path` 上的文件。根据提供的 `watchType` 的不同,source 可能定期(每 `interval` 毫秒)监控路径上的新数据(watchType 为 `FileProcessingMode.PROCESS_CONTINUOUSLY`),或者处理一次当前路径中的数据然后退出 (watchType 为 `FileProcessingMode.PROCESS_ONCE`)。使用 `pathFilter`,用户可以进一步排除正在处理的文件。
- *IMPORTANT NOTES:*
+ *实现:*
- 1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.
+ 在底层,Flink 将文件读取过程拆分为两个子任务,即 *目录监控* 和 *数据读取*。每个子任务都由一个单独的实体实现。监控由单个**非并行**(并行度 = 1)任务实现,而读取由多个并行运行的任务执行。后者的并行度和作业的并行度相等。单个监控任务的作用是扫描目录(定期或仅扫描一次,取决于 `watchType`),找到要处理的文件,将它们划分为 *分片*,并将这些分片分配给下游 reader。Reader 是将实际获取数据的角色。每个分片只能被一个 reader 读取,而一个 reader 可以一个一个地读取多个分片。
- 2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading the file contents. Of course the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.
+ *重要提示:*
-Socket-based:
+ 1. 如果 `watchType` 设置为 `FileProcessingMode.PROCESS_CONTINUOUSLY`,当一个文件被修改时,它的内容会被完全重新处理。这可能会打破 "精确一次" 的语义,因为在文件末尾追加数据将导致重新处理文件的**所有**内容。
-- `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.
+ 2. 如果 `watchType` 设置为 `FileProcessingMode.PROCESS_ONCE`,source 扫描**一次**路径然后退出,无需等待 reader 读完文件内容。当然,reader 会继续读取数据,直到所有文件内容都读完。关闭 source 会导致在那之后不再有检查点。 这可能会导致节点故障后恢复速度变慢,因为作业将从最后一个检查点恢复读取。
Review comment:
```suggestion
 2. 如果 `watchType` 设置为 `FileProcessingMode.PROCESS_ONCE`,source 扫描**一次**路径然后退出,无需等待 reader 读完文件内容。当然,reader 会继续读取数据,直到所有文件内容都读完。关闭 source 会导致在那之后不再有检查点。这可能会导致节点故障后恢复速度变慢,因为作业将从最后一个检查点恢复读取。
```
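
This hunk also covers the socket-based source. A minimal sketch of `socketTextStream` with an explicit delimiter follows (host, port, and delimiter are illustrative):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SocketSourceDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read text from a socket; elements are separated by the given delimiter.
        DataStream<String> lines = env.socketTextStream("localhost", 9999, "\n");

        lines.print();
        env.execute("socket-source-demo");
    }
}
```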
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -319,248 +253,206 @@ object WindowWordCount {
{{< /tab >}}
{{< /tabs >}}
-To run the example program, start the input stream with netcat first from a terminal:
+要运行示例程序,首先从终端使用 netcat 启动输入流:
```bash
nc -lk 9999
```
-Just type some words hitting return for a new word. These will be the input to the
-word count program. If you want to see counts greater than 1, type the same word again and again within
-5 seconds (increase the window size from 5 seconds if you cannot type that fast ☺).
+只需输入一些单词,然后按回车键即可传入新单词。这些将作为单词统计程序的输入。如果想查看大于 1 的计数,在 5 秒内重复输入相同的单词即可(如果无法快速输入,则可以将窗口大小从 5 秒增加 ☺)。
{{< top >}}
+<a name="data-sources"></a>
+
Data Sources
------------
{{< tabs "8104e62c-db79-40b0-8519-0063e9be791f" >}}
{{< tab "Java" >}}
-Sources are where your program reads its input from. You can attach a source to your program by
-using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
-source functions, but you can always write your own custom sources by implementing the `SourceFunction`
-for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
-`RichParallelSourceFunction` for parallel sources.
-
-There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
-
-File-based:
+Source 是你程序从中读取其输入的地方。你可以用 `StreamExecutionEnvironment.addSource(sourceFunction)` 将一个 source 关联到你的程序。Flink 自带了许多预先实现的 source functions,不过你总是可以编写自定义的 source,可以为非并行 source 实现 `SourceFunction` 接口,也可以为并行 source 实现 `ParallelSourceFunction` 接口或继承 `RichParallelSourceFunction`。
Review comment:
```suggestion
Source 是你的程序从中读取其输入的地方。你可以用 `StreamExecutionEnvironment.addSource(sourceFunction)` 将一个 source 关联到你的程序。Flink 自带了许多预先实现的 source functions,不过你仍然可以通过实现 `SourceFunction` 接口编写自定义的非并行 source,也可以通过实现 `ParallelSourceFunction` 接口或者继承 `RichParallelSourceFunction` 类编写自定义的并行 sources。
```
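
The suggested wording distinguishes non-parallel `SourceFunction` implementations from parallel ones. A sketch of a minimal non-parallel custom source follows (the counter source is invented for illustration):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CustomSourceDemo {

    /** A non-parallel source that emits an increasing counter until cancelled. */
    public static class CounterSource implements SourceFunction<Long> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            long value = 0;
            while (running) {
                // Emit under the checkpoint lock so snapshots stay consistent.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(value++);
                }
                Thread.sleep(100);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new CounterSource()).print();
        env.execute("custom-source-demo");
    }
}
```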
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]