(flink) branch master updated: [FLINK-35773][docs] Document s5cmd

pnowojski Wed, 28 Aug 2024 01:21:35 -0700

This is an automated email from the ASF dual-hosted git repository.

pnowojski pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git



The following commit(s) were added to refs/heads/master by this push:
     new 20e6d634663 [FLINK-35773][docs] Document s5cmd
20e6d634663 is described below

commit 20e6d63466324217aa7ae93f51871665f51760f6
Author: Piotr Nowojski <[email protected]>
AuthorDate: Wed Aug 21 13:18:16 2024 +0200

    [FLINK-35773][docs] Document s5cmd
---
 docs/content.zh/docs/deployment/filesystems/s3.md | 34 +++++++++++++++++++++++
 docs/content/docs/deployment/filesystems/s3.md    | 34 +++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/docs/content.zh/docs/deployment/filesystems/s3.md b/docs/content.zh/docs/deployment/filesystems/s3.md
index 12f66375afe..aae2e47eb3c 100644
--- a/docs/content.zh/docs/deployment/filesystems/s3.md
+++ b/docs/content.zh/docs/deployment/filesystems/s3.md
@@ -155,4 +155,38 @@ s3.entropy.length: 4 (default)
 如果文件系统操作没有经过 *"熵注入"* 写入，entropy key 字串将被直接移除。
 `s3.entropy.length` 定义了用于熵注入的随机字母/数字字符的数量。
 
+## s5cmd
+
+Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` can be configured to use the [s5cmd tool](https://github.com/peak/s5cmd) for faster file upload and download.
+[Benchmark results](https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support) show that `s5cmd` can be more than twice as CPU efficient.
+This means either using half the CPU to upload or download the same set of files, or doing so twice as fast with the same amount of available CPU.
+
+To use this feature, the `s5cmd` binary has to be present and accessible to Flink's task managers, for example by embedding it in the Docker image.
+In addition, the path to the `s5cmd` binary has to be configured via:
+```yaml
+s3.s5cmd.path: /path/to/the/s5cmd
+```
+
+The remaining configuration options (with their default values listed below) are:
+```yaml
+# Extra arguments that will be passed directly to the s5cmd call. Please refer to s5cmd's official documentation.
+s3.s5cmd.args: -r 0
+# Maximum size of files that will be uploaded via a single s5cmd call.
+s3.s5cmd.batch.max-size: 1024mb
+# Maximum number of files that will be uploaded via a single s5cmd call.
+s3.s5cmd.batch.max-files: 100
+```
+Both `s3.s5cmd.batch.max-size` and `s3.s5cmd.batch.max-files` control the resource usage of the `s5cmd` binary, to prevent it from overloading the task manager.
+
+It is recommended to first configure Flink and make sure it works without `s5cmd`, and only then enable this feature.
+
+### Credentials
+
+If you are using [access keys](#access-keys-discouraged), they will be passed to `s5cmd`.
+Apart from that, `s5cmd` has its own way of [using credentials](https://github.com/peak/s5cmd?tab=readme-ov-file#specifying-credentials), which is independent of (but similar to) Flink's.
+
+### Limitations
+
+Currently, Flink uses `s5cmd` only during recovery, when RocksDB is used and state files are downloaded from S3.
+
 {{< top >}}
diff --git a/docs/content/docs/deployment/filesystems/s3.md b/docs/content/docs/deployment/filesystems/s3.md
index 389b56244a9..ac11f5341e1 100644
--- a/docs/content/docs/deployment/filesystems/s3.md
+++ b/docs/content/docs/deployment/filesystems/s3.md
@@ -164,4 +164,38 @@ The `s3.entropy.key` defines the string in paths that is replaced by the random
 If a file system operation does not pass the *"inject entropy"* write option, the entropy key substring is simply removed.
 The `s3.entropy.length` defines the number of random alphanumeric characters used for entropy.
 
+## s5cmd
+
+Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` can be configured to use the [s5cmd tool](https://github.com/peak/s5cmd) for faster file upload and download.
+[Benchmark results](https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support) show that `s5cmd` can be more than twice as CPU efficient.
+This means either using half the CPU to upload or download the same set of files, or doing so twice as fast with the same amount of available CPU.
+
+To use this feature, the `s5cmd` binary has to be present and accessible to Flink's task managers, for example by embedding it in the Docker image.
+In addition, the path to the `s5cmd` binary has to be configured via:
+```yaml
+s3.s5cmd.path: /path/to/the/s5cmd
+```
+
+The remaining configuration options (with their default values listed below) are:
+```yaml
+# Extra arguments that will be passed directly to the s5cmd call. Please refer to s5cmd's official documentation.
+s3.s5cmd.args: -r 0
+# Maximum size of files that will be uploaded via a single s5cmd call.
+s3.s5cmd.batch.max-size: 1024mb
+# Maximum number of files that will be uploaded via a single s5cmd call.
+s3.s5cmd.batch.max-files: 100
+```
+Both `s3.s5cmd.batch.max-size` and `s3.s5cmd.batch.max-files` control the resource usage of the `s5cmd` binary, to prevent it from overloading the task manager.
+
+It is recommended to first configure Flink and make sure it works without `s5cmd`, and only then enable this feature.
+
+### Credentials
+
+If you are using [access keys](#access-keys-discouraged), they will be passed to `s5cmd`.
+Apart from that, `s5cmd` has its own way of [using credentials](https://github.com/peak/s5cmd?tab=readme-ov-file#specifying-credentials), which is independent of (but similar to) Flink's.
+
+### Limitations
+
+Currently, Flink uses `s5cmd` only during recovery, when RocksDB is used and state files are downloaded from S3.
+
 {{< top >}}
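
As an illustration of the two batching limits documented in this commit (`s3.s5cmd.batch.max-size` and `s3.s5cmd.batch.max-files`), here is a minimal Python sketch of how a list of files could be split into batches that respect both limits. It is not Flink's implementation: the function name `split_into_batches`, the greedy in-order strategy, and the choice to put a single oversized file into its own batch are all assumptions made for illustration.

```python
from typing import List, Tuple


def split_into_batches(
    files: List[Tuple[str, int]],
    max_size_bytes: int,
    max_files: int,
) -> List[List[str]]:
    """Greedily group (name, size) pairs into batches.

    Hypothetical helper, not part of Flink. A batch is closed as soon as
    adding the next file would exceed either the file-count limit or the
    total-size limit. A file larger than max_size_bytes still gets a
    batch of its own (an assumption of this sketch).
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for name, size in files:
        # Only close a non-empty batch; an empty batch always accepts a file.
        would_overflow = bool(current) and (
            len(current) >= max_files or current_size + size > max_size_bytes
        )
        if would_overflow:
            batches.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size

    if current:
        batches.append(current)
    return batches
```

With limits analogous to the documented defaults, each resulting batch would correspond to one `s5cmd` invocation, so smaller limits mean more calls but less work per call.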
