xintongsong commented on a change in pull request #18430:
URL: https://github.com/apache/flink/pull/18430#discussion_r790487706
##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
```
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI.
-You must include the following jars in Flink's `lib` directory to connect
Flink with gcs:
+### GCS File System plugin
-```xml
-<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop2-uber</artifactId>
- <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use it.
-<dependency>
- <groupId>com.google.cloud.bigdataoss</groupId>
- <artifactId>gcs-connector</artifactId>
- <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the
*gs://* scheme. It uses Google's
[gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector)
Hadoop library to access GCS. It also uses Google's
[google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage)
library to provide `RecoverableWriter` support.
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop
2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref
"docs/connectors/datastream/file_sink" >}}).
Review comment:
These two references are invalid, as CI complained. They were removed /
renamed in FLINK-20188. I think we should now refer to
"docs/content/docs/connectors/datastream/filesystem.md".
##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
```
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI.
-You must include the following jars in Flink's `lib` directory to connect
Flink with gcs:
+### GCS File System plugin
-```xml
-<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop2-uber</artifactId>
- <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use it.
-<dependency>
- <groupId>com.google.cloud.bigdataoss</groupId>
- <artifactId>gcs-connector</artifactId>
- <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the
*gs://* scheme. It uses Google's
[gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector)
Hadoop library to access GCS. It also uses Google's
[google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage)
library to provide `RecoverableWriter` support.
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop
2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref
"docs/connectors/datastream/file_sink" >}}).
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the
`plugins` directory of your Flink distribution before starting Flink, i.e.
-Most operations on GCS require authentication. Please see [the documentation
on Google Cloud Storage
authentication](https://cloud.google.com/storage/docs/authentication) for more
information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
-You can use the following method for authentication
-* Configure via core-site.xml
- You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs
configuration
keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
by adding the configurations to your `flink-conf.yaml`.
Review comment:
Correct me if I'm wrong, but I think the precise description should be:
```suggestion
The underlying Hadoop file system can be [configured using gcs-connector's
Hadoop configuration
keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
by adding the configurations to your `flink-conf.yaml`.
```
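As an illustration, here is a rough sketch of how one of those keys could end
up in `flink-conf.yaml` (the key `fs.gs.project.id` is taken from
gcs-connector's CONFIGURATION.md; the value is a placeholder, and I'm assuming
the `fs.` prefix is dropped as described later on this page):
```yaml
# gcs-connector's fs.gs.project.id, written without the fs. prefix in flink-conf.yaml
gs.project.id: my-example-project
```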
##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
```
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI.
-You must include the following jars in Flink's `lib` directory to connect
Flink with gcs:
+### GCS File System plugin
-```xml
-<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop2-uber</artifactId>
- <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use it.
-<dependency>
- <groupId>com.google.cloud.bigdataoss</groupId>
- <artifactId>gcs-connector</artifactId>
- <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the
*gs://* scheme. It uses Google's
[gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector)
Hadoop library to access GCS. It also uses Google's
[google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage)
library to provide `RecoverableWriter` support.
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop
2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref
"docs/connectors/datastream/file_sink" >}}).
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the
`plugins` directory of your Flink distribution before starting Flink, i.e.
-Most operations on GCS require authentication. Please see [the documentation
on Google Cloud Storage
authentication](https://cloud.google.com/storage/docs/authentication) for more
information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
-You can use the following method for authentication
-* Configure via core-site.xml
- You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs
configuration
keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
by adding the configurations to your `flink-conf.yaml`.
- ```xml
- <configuration>
- <property>
- <name>google.cloud.auth.service.account.enable</name>
- <value>true</value>
- </property>
- <property>
- <name>google.cloud.auth.service.account.json.keyfile</name>
- <value><PATH TO GOOGLE AUTHENTICATION JSON></value>
- </property>
- </configuration>
- ```
+For example, Hadoop has a `fs.gs.http.connect-timeout` configuration key. If
you want to change it, you need to set `gs.http.connect-timeout: xyz` in
`flink-conf.yaml`. Flink will internally translate this back to
`fs.gs.http.connect-timeout`. There is no need to pass configuration parameters
using Hadoop's XML configuration files.
Review comment:
```suggestion
For example, gcs-connector has a `fs.gs.http.connect-timeout` configuration
key. If you want to change it, you need to set `gs.http.connect-timeout: xyz`
in `flink-conf.yaml`. Flink will internally translate this back to
`fs.gs.http.connect-timeout`. There is no need to pass configuration parameters
using Hadoop's XML configuration files.
```
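A concrete sketch of that example in `flink-conf.yaml` (the timeout value is
just a placeholder):
```yaml
# Flink translates this back to gcs-connector's fs.gs.http.connect-timeout
gs.http.connect-timeout: 30000
```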
##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
```
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI.
-You must include the following jars in Flink's `lib` directory to connect
Flink with gcs:
+### GCS File System plugin
-```xml
-<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop2-uber</artifactId>
- <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use it.
-<dependency>
- <groupId>com.google.cloud.bigdataoss</groupId>
- <artifactId>gcs-connector</artifactId>
- <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the
*gs://* scheme. It uses Google's
[gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector)
Hadoop library to access GCS. It also uses Google's
[google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage)
library to provide `RecoverableWriter` support.
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop
2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref
"docs/connectors/datastream/file_sink" >}}).
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the
`plugins` directory of your Flink distribution before starting Flink, i.e.
-Most operations on GCS require authentication. Please see [the documentation
on Google Cloud Storage
authentication](https://cloud.google.com/storage/docs/authentication) for more
information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
-You can use the following method for authentication
-* Configure via core-site.xml
- You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs
configuration
keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
by adding the configurations to your `flink-conf.yaml`.
- ```xml
- <configuration>
- <property>
- <name>google.cloud.auth.service.account.enable</name>
- <value>true</value>
- </property>
- <property>
- <name>google.cloud.auth.service.account.json.keyfile</name>
- <value><PATH TO GOOGLE AUTHENTICATION JSON></value>
- </property>
- </configuration>
- ```
+For example, Hadoop has a `fs.gs.http.connect-timeout` configuration key. If
you want to change it, you need to set `gs.http.connect-timeout: xyz` in
`flink-conf.yaml`. Flink will internally translate this back to
`fs.gs.http.connect-timeout`. There is no need to pass configuration parameters
using Hadoop's XML configuration files.
- You would need to add the following to `flink-conf.yaml`
+`flink-gs-fs-hadoop` can also be configured by setting the following options
in `flink-conf.yaml`:
- ```yaml
- flinkConfiguration:
- fs.hdfs.hadoopconf: <DIRECTORY PATH WHERE core-site.xml IS SAVED>
- ```
+| Key | Description |
+|-----|-------------|
+| gs.writer.temporary.bucket.name | If this property is not set, temporary blobs for in-progress writes via `RecoverableWriter` will be written to same bucket as the final file being written, prefixed with `.inprogress/`. <br><br>Set this property to choose a different bucket to hold the temporary blobs. It is recommended to choose a separate bucket in order to [assign it a TTL](https://cloud.google.com/storage/docs/lifecycle), to provide a mechanism to clean up orphaned blobs that can occur when restoring from check/savepoints.<br><br>If you do use a separate bucket with a TTL for temporary blobs, attempts to restart jobs from check/savepoints after the TTL interval expires may fail. |
Review comment:
Maybe swap the first and second paragraphs?
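Also, a short usage sketch might help readers here (the bucket name is just a
placeholder):
```yaml
# flink-conf.yaml: hold in-progress RecoverableWriter blobs in a dedicated bucket
gs.writer.temporary.bucket.name: my-gcs-temp-bucket
```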