galenwarren commented on a change in pull request #15599: URL: https://github.com/apache/flink/pull/15599#discussion_r619844225
##########
File path: flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/GSFileSystemFactory.java
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.fs.gs;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.fs.FileSystem;
+import org.apache.flink.core.fs.FileSystemFactory;
+import org.apache.flink.runtime.util.HadoopConfigLoader;
+import org.apache.flink.util.Preconditions;
+
+import com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.Collections;
+
+/**
+ * Implementation of the Flink {@link org.apache.flink.core.fs.FileSystemFactory} interface for
+ * Google Storage.
+ */
+public class GSFileSystemFactory implements FileSystemFactory {
+
+    private static final String SCHEME = "gs";
+
+    private static final String HADOOP_CONFIG_PREFIX = "fs.gs.";
+
+    private static final String[] FLINK_CONFIG_PREFIXES = {"gs.", HADOOP_CONFIG_PREFIX};
+
+    private static final String[][] MIRRORED_CONFIG_KEYS = {};
+
+    private static final String FLINK_SHADING_PREFIX = "";
+
+    public static final ConfigOption<String> WRITER_TEMPORARY_BUCKET_NAME =
+            ConfigOptions.key("gs.writer.temporary.bucket.name")
+                    .stringType()
+                    .defaultValue(GSFileSystemOptions.DEFAULT_WRITER_TEMPORARY_BUCKET_NAME)
+                    .withDescription(
+                            "This option sets the bucket name used by the recoverable writer to store temporary files. " + "If empty, temporary files are stored in the same bucket as the final file being written.");
+
+    public static final ConfigOption<String> WRITER_TEMPORARY_OBJECT_PREFIX =
+            ConfigOptions.key("gs.writer.temporary.object.prefix")
+                    .stringType()
+                    .defaultValue(GSFileSystemOptions.DEFAULT_WRITER_TEMPORARY_OBJECT_PREFIX)
+                    .withDescription(
+                            "This option sets the prefix used by the recoverable writer when writing temporary files. This prefix is applied to the " + "final object name to form the base name for temporary files.");

Review comment:
It's not really necessary, I suppose. The use case I was envisioning here was this: when you specify a dedicated bucket to hold in-progress temporary blobs, the ".inprogress" prefix is sort of redundant. So this lets you set it to an empty string and have the files just appear directly in the root of the bucket, i.e. the files would look like:

```
/foo/bar/7b342499-6918-48f0-bcf9-11cf2bc18c51
```

... instead of ...

```
/.inprogress/foo/bar/7b342499-6918-48f0-bcf9-11cf2bc18c51
```

Admittedly, this isn't really that important to be able to do. So I'm fine with removing this option if that's what you'd prefer.

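To make the effect of the prefix option concrete, here is a minimal sketch of how the two layouts above could be produced. The class and method names are hypothetical and are not taken from the PR; the leading slash simply mirrors the notation used in this comment.

```java
import java.util.UUID;

public class TemporaryBlobNameSketch {

    /**
     * Hypothetical helper (not from the PR): joins an optional prefix, the final
     * object name, and a unique id into a temporary blob name.
     */
    public static String temporaryBlobName(String prefix, String finalObjectName, UUID id) {
        StringBuilder name = new StringBuilder();
        // An empty prefix places temporary blobs directly under the bucket root.
        if (!prefix.isEmpty()) {
            name.append('/').append(prefix);
        }
        name.append('/').append(finalObjectName).append('/').append(id);
        return name.toString();
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("7b342499-6918-48f0-bcf9-11cf2bc18c51");
        // With the ".inprogress" prefix
        System.out.println(temporaryBlobName(".inprogress", "foo/bar", id));
        // With an empty prefix, e.g. when a dedicated temporary bucket is used
        System.out.println(temporaryBlobName("", "foo/bar", id));
    }
}
```

With a dedicated temporary bucket, the empty-prefix form loses nothing, since every blob in that bucket is already temporary.
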
Also, this reminds me of a change I'm planning to submit with the next batch, which is to have the generated temporary blob names include both the final bucket name and the final object name. Currently, the temporary blob names include only the final object name, not the bucket name. So what is now:

```
/.inprogress/foo/bar/7b342499-6918-48f0-bcf9-11cf2bc18c51
```

... would become:

```
/.inprogress/bucket_name/foo/bar/7b342499-6918-48f0-bcf9-11cf2bc18c51
```

This keeps temporary blobs properly separated in the event that a) a dedicated bucket for temporary blobs is used and b) a blob with the same name (i.e. /foo/bar) is written into two different buckets by a StreamingFileSink at the same time. (It would technically still work *without* adding the bucket name, because the UUIDs would be distinct, but it seemed potentially confusing to commingle the temporary blobs for different writes in the same storage tree.)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
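The bucket-qualified naming proposed in the comment above can be sketched roughly as follows. This is only an illustration under assumptions: the class name, method name, and the bucket names `bucket_a` and `bucket_b` are hypothetical and do not come from the PR.

```java
import java.util.UUID;

public class BucketQualifiedTemporaryNameSketch {

    /**
     * Hypothetical helper (not from the PR): like the current scheme, but the
     * final bucket name is inserted between the prefix and the final object name.
     */
    public static String temporaryBlobName(
            String prefix, String finalBucketName, String finalObjectName, UUID id) {
        StringBuilder name = new StringBuilder();
        if (!prefix.isEmpty()) {
            name.append('/').append(prefix);
        }
        name.append('/').append(finalBucketName);
        name.append('/').append(finalObjectName);
        name.append('/').append(id);
        return name.toString();
    }

    public static void main(String[] args) {
        // Two concurrent writes of the same object name into different buckets.
        String first = temporaryBlobName(".inprogress", "bucket_a", "foo/bar", UUID.randomUUID());
        String second = temporaryBlobName(".inprogress", "bucket_b", "foo/bar", UUID.randomUUID());
        // Each write gets its own subtree in the shared temporary bucket.
        System.out.println(first);
        System.out.println(second);
    }
}
```

Without the bucket segment, both writes would land under the same `/.inprogress/foo/bar/` subtree and be distinguished only by their UUIDs.
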
