galenwarren commented on a change in pull request #15599:
URL: https://github.com/apache/flink/pull/15599#discussion_r619844225



##########
File path: flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/GSFileSystemFactory.java
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.fs.gs;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.fs.FileSystem;
+import org.apache.flink.core.fs.FileSystemFactory;
+import org.apache.flink.runtime.util.HadoopConfigLoader;
+import org.apache.flink.util.Preconditions;
+
+import com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.Collections;
+
+/**
+ * Implementation of the Flink {@link org.apache.flink.core.fs.FileSystemFactory} interface for
+ * Google Storage.
+ */
+public class GSFileSystemFactory implements FileSystemFactory {
+
+    private static final String SCHEME = "gs";
+
+    private static final String HADOOP_CONFIG_PREFIX = "fs.gs.";
+
+    private static final String[] FLINK_CONFIG_PREFIXES = {"gs.", HADOOP_CONFIG_PREFIX};
+
+    private static final String[][] MIRRORED_CONFIG_KEYS = {};
+
+    private static final String FLINK_SHADING_PREFIX = "";
+
+    public static final ConfigOption<String> WRITER_TEMPORARY_BUCKET_NAME =
+            ConfigOptions.key("gs.writer.temporary.bucket.name")
+                    .stringType()
+                    .defaultValue(GSFileSystemOptions.DEFAULT_WRITER_TEMPORARY_BUCKET_NAME)
+                    .withDescription(
+                            "This option sets the bucket name used by the recoverable writer to store temporary files. "
+                                    + "If empty, temporary files are stored in the same bucket as the final file being written.");
+
+    public static final ConfigOption<String> WRITER_TEMPORARY_OBJECT_PREFIX =
+            ConfigOptions.key("gs.writer.temporary.object.prefix")
+                    .stringType()
+                    .defaultValue(GSFileSystemOptions.DEFAULT_WRITER_TEMPORARY_OBJECT_PREFIX)
+                    .withDescription(
+                            "This option sets the prefix used by the recoverable writer when writing temporary files. This prefix is applied to the "
+                                    + "final object name to form the base name for temporary files.");

Review comment:
   It's not really necessary, I suppose. The use case I was envisioning 
here was this: when you specify a dedicated bucket to hold in-progress 
temporary blobs, the ".inprogress" prefix is somewhat redundant. So this lets 
you set it to an empty string and have the files appear directly in the 
root of the bucket, i.e. the files would look like:
   
   ```/foo/bar/7b342499-6918-48f0-bcf9-11cf2bc18c51```
   
   ... instead of ...
   
   ```/.inprogress/foo/bar/7b342499-6918-48f0-bcf9-11cf2bc18c51```
   
   Admittedly, this isn't really that important to be able to do. So I'm fine 
with removing this option if that's what you'd prefer.
   
   Also, this reminds me of a change I'm planning to submit with the next 
batch, which is to have the generated temporary blob names include both the 
final bucket name and the final object name. Currently, the temporary blob 
names only include the final object name, but not the bucket name. So what is 
now:
   
   ```/.inprogress/foo/bar/7b342499-6918-48f0-bcf9-11cf2bc18c51```
   
   ... would become:
   
   ```/.inprogress/bucket_name/foo/bar/7b342499-6918-48f0-bcf9-11cf2bc18c51```
   
   This keeps temporary blobs properly separated in the event that a) a 
dedicated bucket for temporary blobs is used and b) a blob with the same name 
(i.e. /foo/bar) is written into two different buckets by a StreamingFileSink at 
the same time.
   
   (And it would technically still work *without* adding the bucket 
name, because the UUIDs would be distinct, but it seemed potentially 
confusing to commingle the temporary blobs for different writes in the same 
storage tree.)
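
   To make the naming variants above concrete, here is a minimal sketch of how such a temporary blob name could be composed. The class and method names (`TemporaryBlobNameSketch`, `temporaryBlobName`) are hypothetical and are not the code in this PR; the sketch also assumes the final object name has no leading slash and prepends one only to mirror the paths as written above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical illustration of the naming scheme discussed above; not the PR's code. */
class TemporaryBlobNameSketch {

    /**
     * Joins the optional prefix, the optional final bucket name, the final object name
     * and a UUID into one temporary blob name. Empty or null parts are skipped, so an
     * empty prefix places the blob directly under the root of the temporary bucket.
     */
    static String temporaryBlobName(
            String prefix, String finalBucketName, String finalObjectName, UUID id) {
        List<String> segments = new ArrayList<>();
        if (prefix != null && !prefix.isEmpty()) {
            segments.add(prefix);
        }
        if (finalBucketName != null && !finalBucketName.isEmpty()) {
            segments.add(finalBucketName);
        }
        // Assumption: finalObjectName is given without a leading slash, e.g. "foo/bar".
        segments.add(finalObjectName);
        segments.add(id.toString());
        return "/" + String.join("/", segments);
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("7b342499-6918-48f0-bcf9-11cf2bc18c51");
        // Empty prefix, no bucket name in the temporary blob name.
        System.out.println(temporaryBlobName("", null, "foo/bar", id));
        // Current behavior as described: prefix plus final object name.
        System.out.println(temporaryBlobName(".inprogress", null, "foo/bar", id));
        // Planned behavior as described: prefix, final bucket name, final object name.
        System.out.println(temporaryBlobName(".inprogress", "bucket_name", "foo/bar", id));
    }
}
```

   With the bucket name included, two writes of `foo/bar` into different buckets land in separate subtrees of the temporary bucket, rather than being distinguished only by their UUIDs.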
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

