hapihu commented on a change in pull request #16852:
URL: https://github.com/apache/flink/pull/16852#discussion_r698034007
##########
File path: docs/content.zh/docs/dev/datastream/overview.md
##########
@@ -319,248 +253,206 @@ object WindowWordCount {
{{< /tab >}}
{{< /tabs >}}
-To run the example program, start the input stream with netcat first from a terminal:
+To run the example program, first start the input stream with netcat from a terminal:
```bash
nc -lk 9999
```
-Just type some words hitting return for a new word. These will be the input to the
-word count program. If you want to see counts greater than 1, type the same word again and again within
-5 seconds (increase the window size from 5 seconds if you cannot type that fast ☺).
+Just type some words, hitting return for each new word. These will be the input to the word count program. If you want to see counts greater than 1, type the same word again and again within 5 seconds (increase the window size from 5 seconds if you cannot type that fast ☺).
{{< top >}}
+<a name="data-sources"></a>
+
Data Sources
------------
{{< tabs "8104e62c-db79-40b0-8519-0063e9be791f" >}}
{{< tab "Java" >}}
-Sources are where your program reads its input from. You can attach a source to your program by
-using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
-source functions, but you can always write your own custom sources by implementing the `SourceFunction`
-for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
-`RichParallelSourceFunction` for parallel sources.
-
-There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
-
-File-based:
+Sources are where your program reads its input from. You can attach a source to your program with `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented source functions, but you can always write your own custom sources, either by implementing the `SourceFunction` interface for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending `RichParallelSourceFunction` for parallel sources.
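
As a quick illustration, a minimal sketch of a non-parallel custom source attached with `addSource`; the `CounterSource` class and its one-second emission interval are made up for this example:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CustomSourceExample {

    // Hypothetical non-parallel source: emits an increasing counter once per second.
    public static class CounterSource implements SourceFunction<Long> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            long counter = 0;
            while (running) {
                // Emit under the checkpoint lock so emission and state updates stay consistent.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(counter++);
                }
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Attach the custom source to the program and print its output.
        env.addSource(new CounterSource()).print();
        env.execute("Custom source example");
    }
}
```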
-- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
+There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
-- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.
+File-based:
-- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.
+- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line by line, and returns them as Strings.
- *IMPLEMENTATION:*
+- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.
- Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once depending on the `watchType`), find the files to be processed, divide them in *splits*, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read multiple splits, one-by-one.
+- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process the data currently in the path once and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.
- *IMPORTANT NOTES:*
+ *IMPLEMENTATION:*
- 1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.
+ Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once, depending on the `watchType`), find the files to be processed, divide them into *splits*, and assign these splits to the downstream readers. The readers are the ones that will read the actual data. Each split is read by only one reader, while a reader can read multiple splits, one by one.
- 2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading the file contents. Of course the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.
+ *IMPORTANT NOTES:*
-Socket-based:
+ 1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.
-- `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.
+ 2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading the file contents. Of course, the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.
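
For reference, a minimal sketch of continuous file monitoring with `readFile`; the directory path and the 10-second scan interval below are made-up values:

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Hypothetical directory to monitor; replace with a real path.
String path = "file:///tmp/input";
TextInputFormat format = new TextInputFormat(new Path(path));

// Re-scan the directory every 10 seconds (10,000 ms); new or modified files are (re-)processed.
DataStream<String> lines = env.readFile(
        format,
        path,
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        10_000L);
```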
-Collection-based:
+Socket-based:
-- `fromCollection(Collection)` - Creates a data stream from the Java Java.util.Collection. All elements
-  in the collection must be of the same type.
+- `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.
-- `fromCollection(Iterator, Class)` - Creates a data stream from an iterator. The class specifies the
-  data type of the elements returned by the iterator.
+Collection-based:
-- `fromElements(T ...)` - Creates a data stream from the given sequence of objects. All objects must be
-  of the same type.
+- `fromCollection(Collection)` - Creates a data stream from the Java `java.util.Collection`. All elements in the collection must be of the same type.
+
+- `fromCollection(Iterator, Class)` - Creates a data stream from an iterator. The class specifies the data type of the elements returned by the iterator.
+
+- `fromElements(T ...)` - Creates a data stream from the given sequence of objects. All objects must be of the same type.
+
+- `fromParallelCollection(SplittableIterator, Class)` - Creates a data stream from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.
+
+- `generateSequence(from, to)` - Generates the sequence of numbers in the given interval, in parallel.
-- `fromParallelCollection(SplittableIterator, Class)` - Creates a data stream from an iterator, in
-  parallel. The class specifies the data type of the elements returned by the iterator.
+Custom:
-- `generateSequence(from, to)` - Generates the sequence of numbers in the given interval, in
-  parallel.
-
-Custom:
-
-- `addSource` - Attach a new source function. For example, to read from Apache Kafka you can use
-  `addSource(new FlinkKafkaConsumer<>(...))`. See [connectors]({{< ref "docs/connectors/datastream/overview" >}}) for more details.
+- `addSource` - Attach a new source function. For example, to read from Apache Kafka you can use `addSource(new FlinkKafkaConsumer<>(...))`. See [connectors]({{< ref "docs/connectors/datastream/overview" >}}) for more details.
{{< /tab >}}
{{< tab "Scala" >}}
-Sources are where your program reads its input from. You can attach a source to your program by
-using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
-source functions, but you can always write your own custom sources by implementing the `SourceFunction`
-for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
-`RichParallelSourceFunction` for parallel sources.
-
-There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
-
-File-based:
+Sources are where your program reads its input from. You can attach a source to your program with `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented source functions, but you can always write your own custom sources, either by implementing the `SourceFunction` interface for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending `RichParallelSourceFunction` for parallel sources.
Review comment:
👌
Thank you for your review and suggestions.
I have made the modification as suggested.