xintongsong commented on a change in pull request #18430:
URL: https://github.com/apache/flink/pull/18430#discussion_r790487706
##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
```
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI.
-You must include the following jars in Flink's `lib` directory to connect
Flink with gcs:
+### GCS File System plugin
-```xml
-<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop2-uber</artifactId>
- <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use it.
-<dependency>
- <groupId>com.google.cloud.bigdataoss</groupId>
- <artifactId>gcs-connector</artifactId>
- <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the
*gs://* scheme. It uses Google's
[gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector)
Hadoop library to access GCS. It also uses Google's
[google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage)
library to provide `RecoverableWriter` support.
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop
2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref
"docs/connectors/datastream/file_sink" >}}).
Review comment:
These two references are invalid, as CI complained. They were removed /
renamed in FLINK-20188. I think we should now refer to
"docs/content/docs/connectors/datastream/filesystem.md".
##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
```
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI.
-You must include the following jars in Flink's `lib` directory to connect
Flink with gcs:
+### GCS File System plugin
-```xml
-<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop2-uber</artifactId>
- <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use it.
-<dependency>
- <groupId>com.google.cloud.bigdataoss</groupId>
- <artifactId>gcs-connector</artifactId>
- <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the
*gs://* scheme. It uses Google's
[gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector)
Hadoop library to access GCS. It also uses Google's
[google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage)
library to provide `RecoverableWriter` support.
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop
2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref
"docs/connectors/datastream/file_sink" >}}).
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the
`plugins` directory of your Flink distribution before starting Flink, i.e.
-Most operations on GCS require authentication. Please see [the documentation
on Google Cloud Storage
authentication](https://cloud.google.com/storage/docs/authentication) for more
information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
-You can use the following method for authentication
-* Configure via core-site.xml
- You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs
configuration
keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
by adding the configurations to your `flink-conf.yaml`.
Review comment:
Correct me if I'm wrong, but I think the precise description should be:
```suggestion
The underlying Hadoop file system can be [configured using gcs-connector's
Hadoop configuration
keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
by adding the configurations to your `flink-conf.yaml`.
```
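As an illustration, here is a rough sketch of how one of those keys could end
up in `flink-conf.yaml` (the key `fs.gs.project.id` is taken from
gcs-connector's CONFIGURATION.md; the value is a placeholder, and I'm assuming
the `fs.` prefix is dropped as described later on this page):
```yaml
# gcs-connector's fs.gs.project.id, written without the fs. prefix in flink-conf.yaml
gs.project.id: my-example-project
```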
##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
```
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI.
-You must include the following jars in Flink's `lib` directory to connect
Flink with gcs:
+### GCS File System plugin
-```xml
-<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop2-uber</artifactId>
- <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use it.
-<dependency>
- <groupId>com.google.cloud.bigdataoss</groupId>
- <artifactId>gcs-connector</artifactId>
- <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the
*gs://* scheme. It uses Google's
[gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector)
Hadoop library to access GCS. It also uses Google's
[google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage)
library to provide `RecoverableWriter` support.
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop
2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref
"docs/connectors/datastream/file_sink" >}}).
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the
`plugins` directory of your Flink distribution before starting Flink, i.e.
-Most operations on GCS require authentication. Please see [the documentation
on Google Cloud Storage
authentication](https://cloud.google.com/storage/docs/authentication) for more
information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
-You can use the following method for authentication
-* Configure via core-site.xml
- You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs
configuration
keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
by adding the configurations to your `flink-conf.yaml`.
- ```xml
- <configuration>
- <property>
- <name>google.cloud.auth.service.account.enable</name>
- <value>true</value>
- </property>
- <property>
- <name>google.cloud.auth.service.account.json.keyfile</name>
- <value><PATH TO GOOGLE AUTHENTICATION JSON></value>
- </property>
- </configuration>
- ```
+For example, Hadoop has a `fs.gs.http.connect-timeout` configuration key. If
you want to change it, you need to set `gs.http.connect-timeout: xyz` in
`flink-conf.yaml`. Flink will internally translate this back to
`fs.gs.http.connect-timeout`. There is no need to pass configuration parameters
using Hadoop's XML configuration files.
Review comment:
```suggestion
For example, gcs-connector has a `fs.gs.http.connect-timeout` configuration
key. If you want to change it, you need to set `gs.http.connect-timeout: xyz`
in `flink-conf.yaml`. Flink will internally translate this back to
`fs.gs.http.connect-timeout`. There is no need to pass configuration parameters
using Hadoop's XML configuration files.
```
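A concrete sketch of that example in `flink-conf.yaml` (the timeout value is
just a placeholder):
```yaml
# Flink translates this back to gcs-connector's fs.gs.http.connect-timeout
gs.http.connect-timeout: 30000
```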
##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
```
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI.
-You must include the following jars in Flink's `lib` directory to connect
Flink with gcs:
+### GCS File System plugin
-```xml
-<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop2-uber</artifactId>
- <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use it.
-<dependency>
- <groupId>com.google.cloud.bigdataoss</groupId>
- <artifactId>gcs-connector</artifactId>
- <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the
*gs://* scheme. It uses Google's
[gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector)
Hadoop library to access GCS. It also uses Google's
[google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage)
library to provide `RecoverableWriter` support.
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop
2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref
"docs/connectors/datastream/file_sink" >}}).
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the
`plugins` directory of your Flink distribution before starting Flink, i.e.
-Most operations on GCS require authentication. Please see [the documentation
on Google Cloud Storage
authentication](https://cloud.google.com/storage/docs/authentication) for more
information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
-You can use the following method for authentication
-* Configure via core-site.xml
- You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs
configuration
keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
by adding the configurations to your `flink-conf.yaml`.
- ```xml
- <configuration>
- <property>
- <name>google.cloud.auth.service.account.enable</name>
- <value>true</value>
- </property>
- <property>
- <name>google.cloud.auth.service.account.json.keyfile</name>
- <value><PATH TO GOOGLE AUTHENTICATION JSON></value>
- </property>
- </configuration>
- ```
+For example, Hadoop has a `fs.gs.http.connect-timeout` configuration key. If
you want to change it, you need to set `gs.http.connect-timeout: xyz` in
`flink-conf.yaml`. Flink will internally translate this back to
`fs.gs.http.connect-timeout`. There is no need to pass configuration parameters
using Hadoop's XML configuration files.
- You would need to add the following to `flink-conf.yaml`
+`flink-gs-fs-hadoop` can also be configured by setting the following options
in `flink-conf.yaml`:
- ```yaml
- flinkConfiguration:
- fs.hdfs.hadoopconf: <DIRECTORY PATH WHERE core-site.xml IS SAVED>
- ```
+| Key | Description |
+|-----|-------------|
+| gs.writer.temporary.bucket.name | If this property is not set, temporary blobs for in-progress writes via `RecoverableWriter` will be written to same bucket as the final file being written, prefixed with `.inprogress/`. <br><br>Set this property to choose a different bucket to hold the temporary blobs. It is recommended to choose a separate bucket in order to [assign it a TTL](https://cloud.google.com/storage/docs/lifecycle), to provide a mechanism to clean up orphaned blobs that can occur when restoring from check/savepoints.<br><br>If you do use a separate bucket with a TTL for temporary blobs, attempts to restart jobs from check/savepoints after the TTL interval expires may fail. |
Review comment:
Maybe swap the first and second paragraphs?
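Also, a short usage sketch might help readers here (the bucket name is just a
placeholder):
```yaml
# flink-conf.yaml: hold in-progress RecoverableWriter blobs in a dedicated bucket
gs.writer.temporary.bucket.name: my-gcs-temp-bucket
```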