This is an automated email from the ASF dual-hosted git repository.

jark pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git
commit 345ac8868b705b7c9be9bb70e6fb54d1d15baa9b
Author: aloys <[email protected]>
AuthorDate: Wed Jun 26 01:22:22 2019 +0800

    [FLINK-12943][docs-zh] Translate "HDFS Connector" page into Chinese

    This closes #8897
---
 docs/dev/connectors/filesystem_sink.zh.md | 80 ++++++++++++-------------------
 1 file changed, 31 insertions(+), 49 deletions(-)

diff --git a/docs/dev/connectors/filesystem_sink.zh.md b/docs/dev/connectors/filesystem_sink.zh.md
index f9a828d..54b0c64 100644
--- a/docs/dev/connectors/filesystem_sink.zh.md
+++ b/docs/dev/connectors/filesystem_sink.zh.md
@@ -1,5 +1,5 @@
 ---
-title: "HDFS Connector"
+title: "HDFS Connector"
 nav-title: Rolling File Sink
 nav-parent_id: connectors
 nav-pos: 5
@@ -23,9 +23,8 @@ specific language governing permissions and limitations
 under the License.
 -->

-This connector provides a Sink that writes partitioned files to any filesystem supported by
-[Hadoop FileSystem](http://hadoop.apache.org). To use this connector, add the
-following dependency to your project:
+This connector writes partitioned files to any filesystem supported by [Hadoop FileSystem](http://hadoop.apache.org).
+Before using it, add the following dependency to your project:

 {% highlight xml %}
 <dependency>
@@ -35,16 +34,11 @@ following dependency to your project:
 </dependency>
 {% endhighlight %}

-Note that the streaming connectors are currently not part of the binary
-distribution. See
-[here]({{site.baseurl}}/dev/projectsetup/dependencies.html)
-for information about how to package the program with the libraries for
-cluster execution.
+Note that the connector is currently not part of the binary distribution. See [here]({{site.baseurl}}/zh/dev/projectsetup/dependencies.html) for information on adding dependencies, packaging the program, and running it on a cluster.

-#### Bucketing File Sink
+#### Bucketing File Sink

-The bucketing behaviour as well as the writing can be configured but we will get to that later.
-This is how you can create a bucketing sink which by default, sinks to rolling files that are split by time:
+Configuring the bucketing behaviour is covered later; for now, this is how to create a bucketing sink that, by default, writes data to rolling files split by time:

 <div class="codetabs" markdown="1">
 <div data-lang="java" markdown="1">
@@ -65,40 +59,30 @@ input.addSink(new BucketingSink[String]("/base/path"))
 </div>
 </div>

-The only required parameter is the base path where the buckets will be
-stored. The sink can be further configured by specifying a custom bucketer, writer and batch size.
-
-By default the bucketing sink will split by the current system time when elements arrive and will
-use the datetime pattern `"yyyy-MM-dd--HH"` to name the buckets. This pattern is passed to
-`DateTimeFormatter` with the current system time and JVM's default timezone to form a bucket path.
-Users can also specify a timezone for the bucketer to format bucket path. A new bucket will be created
-whenever a new date is encountered. For example, if you have a pattern that contains minutes as the
-finest granularity you will get a new bucket every minute. Each bucket is itself a directory that
-contains several part files: each parallel instance of the sink will create its own part file and
-when part files get too big the sink will also create a new part file next to the others. When a
-bucket becomes inactive, the open part file will be flushed and closed. A bucket is regarded as
-inactive when it hasn't been written to recently. By default, the sink checks for inactive buckets
-every minute, and closes any buckets which haven't been written to for over a minute. This
-behaviour can be configured with `setInactiveBucketCheckInterval()` and
-`setInactiveBucketThreshold()` on a `BucketingSink`.
-
-You can also specify a custom bucketer by using `setBucketer()` on a `BucketingSink`. If desired,
-the bucketer can use a property of the element or tuple to determine the bucket directory.
-
-The default writer is `StringWriter`. This will call `toString()` on the incoming elements
-and write them to part files, separated by newline. To specify a custom writer use `setWriter()`
-on a `BucketingSink`. If you want to write Hadoop SequenceFiles you can use the provided
-`SequenceFileWriter` which can also be configured to use compression.
-
-There are two configuration options that specify when a part file should be closed
-and a new one started:
+The only required parameter at initialization is the base path where the bucket files are stored. The bucketing sink can be further configured by specifying a custom bucketer, writer, and batch size.
+
+By default, when elements arrive, the bucketing sink splits them by the current system time and names each bucket
+with the datetime pattern `"yyyy-MM-dd--HH"`. `DateTimeFormatter` then converts the current system time, in the
+JVM's default timezone, into the bucket path according to this pattern. Users can also specify a custom timezone
+for generating the bucket path. A new bucket is created whenever a new date is encountered. For example, if the
+pattern has minutes as the finest granularity, a new bucket is created every minute. Each bucket is itself a
+directory that contains several part files: each parallel instance of the sink creates its own part file, and when
+these files grow too large the sink creates a new part file. When a bucket becomes inactive, the open part file is
+flushed and closed. A bucket is regarded as inactive when it has not been written to recently. By default, the sink
+checks for inactive buckets every minute and closes any bucket that has not been written to for over a minute.
+This behaviour can be configured with `setInactiveBucketCheckInterval()` and `setInactiveBucketThreshold()` on a
+`BucketingSink`.
+
+You can specify a custom bucketer by calling `setBucketer()` on a `BucketingSink`; if desired, the bucketer can use a property of the element or tuple to determine the bucket directory.
+
+The default writer is `StringWriter`. When data arrives, it obtains the content via `toString()`, separates the records with newlines, and writes them to the part files. A custom writer can be set with `setWriter()` on a `BucketingSink`. `SequenceFileWriter` supports writing Hadoop SequenceFiles and can be configured to use compression.
+
+The timing of closing a part file and opening a new one is determined by two settings:

-* By setting a batch size (The default part file size is 384 MB)
-* By setting a batch roll over time interval (The default roll over interval is `Long.MAX_VALUE`)
+* The file size (the default file size is 384 MB)
+* The file rollover interval in milliseconds (the default rollover interval is `Long.MAX_VALUE`)

-A new part file is started when either of these two conditions is satisfied.
+A new part file is created whenever either of the two conditions above is satisfied.

-Example:
+Example:

 <div class="codetabs" markdown="1">
 <div data-lang="java" markdown="1">
@@ -133,17 +117,15 @@ input.addSink(sink)
 </div>
 </div>

-This will create a sink that writes to bucket files that follow this schema:
+The code above creates a sink that writes bucket files following this schema:

 {% highlight plain %}
 /base/path/{date-time}/part-{parallel-task}-{count}
 {% endhighlight %}

-Where `date-time` is the string that we get from the date/time format, `parallel-task` is the index
-of the parallel sink instance and `count` is the running number of part files that were created
-because of the batch size or batch roll over interval.
+`date-time` is the string we get from the date/time format, `parallel-task` is the index of the parallel sink
+instance, and `count` is the running number of part files created because of the file size or the rollover interval.

-For in-depth information, please refer to the JavaDoc for
-[BucketingSink](http://flink.apache.org/docs/latest/api/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.html).
+For more information, please refer to [BucketingSink](http://flink.apache.org/docs/latest/api/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.html).

 {% top %}
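
The page's full configuration example falls mostly outside the hunks quoted above, so here is a minimal, hedged Java sketch (not part of this commit) of how the settings described in the translated text fit together. It assumes the `BucketingSink` API from the `flink-connector-filesystem` dependency shown above; the base path, datetime pattern, timezone, and size/interval values are illustrative placeholders only.

{% highlight java %}
import java.time.ZoneId;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.StringWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class BucketingSinkSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; in practice this would be Kafka, a socket, etc.
        DataStream<String> input = env.fromElements("a", "b", "c");

        BucketingSink<String> sink = new BucketingSink<>("/base/path");

        // Bucket by minute in an explicit timezone instead of the default
        // "yyyy-MM-dd--HH" pattern with the JVM default zone.
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HHmm", ZoneId.of("America/Los_Angeles")));

        // StringWriter is the default writer: toString() per element, newline separated.
        sink.setWriter(new StringWriter<>());

        // Roll to a new part file once it reaches 400 MB ...
        sink.setBatchSize(1024 * 1024 * 400L);
        // ... or after 20 minutes, whichever happens first.
        sink.setBatchRolloverInterval(20 * 60 * 1000L);

        // Check for inactive buckets every minute and close any bucket that has not
        // been written to for over a minute (these match the documented defaults).
        sink.setInactiveBucketCheckInterval(60 * 1000L);
        sink.setInactiveBucketThreshold(60 * 1000L);

        input.addSink(sink);
        env.execute("BucketingSink sketch");
    }
}
{% endhighlight %}

With such a configuration the sink writes files under paths like `/base/path/{date-time}/part-{parallel-task}-{count}`, starting a new part file whenever the size limit or the rollover interval is reached, whichever comes first.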
