galenwarren commented on a change in pull request #15599:
URL: https://github.com/apache/flink/pull/15599#discussion_r619674758



##########
File path: 
flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/utils/BlobUtils.java
##########
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.fs.gs.utils;
+
+import com.google.cloud.storage.BlobId;
+
+import java.net.URI;
+
+/** Utility functions related to blobs. */
+public class BlobUtils {
+
+    /** The maximum number of blobs that can be composed in a single operation. */
+    public static final int COMPOSE_MAX_BLOBS = 32;
+
+    /**
+     * Normalizes a blob id, ensuring that the generation is null.
+     *
+     * @param blobId The blob id
+     * @return The blob id with the generation set to null
+     */
+    public static BlobId normalizeBlobId(BlobId blobId) {
+        return BlobId.of(blobId.getBucket(), blobId.getName());
+    }

Review comment:
       A BlobId has three parts: a bucket name, an object name, and a 
generation. A null generation means "latest generation", so the normalization 
removes the explicit generation from a blob id by setting it to null. This is 
done so that we can compare "expected" temporary blob ids, which don't have 
generations specified, with blob ids returned from the API, which do have 
generations set. 
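
   To make the effect of the normalization concrete, here is a minimal sketch. `SimpleBlobId` is a hypothetical stand-in for `com.google.cloud.storage.BlobId` so that the example runs without the GCS client on the classpath; it assumes, as the comparison problem above implies, that equality takes the generation into account. The generation value used in the usage note is made up.

```java
import java.util.Objects;

/** Hypothetical stand-in for com.google.cloud.storage.BlobId (illustration only). */
final class SimpleBlobId {
    final String bucket;
    final String name;
    final Long generation; // null means "latest generation"

    SimpleBlobId(String bucket, String name, Long generation) {
        this.bucket = bucket;
        this.name = name;
        this.generation = generation;
    }

    /** Mirrors normalizeBlobId from the diff: keep bucket and name, drop the generation. */
    static SimpleBlobId normalize(SimpleBlobId id) {
        return new SimpleBlobId(id.bucket, id.name, null);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SimpleBlobId)) {
            return false;
        }
        SimpleBlobId other = (SimpleBlobId) o;
        // Assumption: the generation is part of equality, as it is part of the id.
        return bucket.equals(other.bucket)
                && name.equals(other.name)
                && Objects.equals(generation, other.generation);
    }

    @Override
    public int hashCode() {
        return Objects.hash(bucket, name, generation);
    }
}
```

   With this stand-in, an expected id such as `new SimpleBlobId("bucket", "tmp-object", null)` is not equal to an id that carries an explicit generation, but it is equal to the normalized form of that id.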
   
   This comparison is done in exactly one place, in 
```cleanupRecoverableState```, where we match the list of expected BlobIds 
from the recoverable writer state against the list of temporary blobs that 
were found in storage. That lets the function return true/false to indicate 
whether everything that was expected to be deleted was, in fact, deleted. 
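
   Roughly, the check amounts to a set-containment test. The sketch below is not the code from the PR: the class and method names are hypothetical, and normalized blob ids are represented as plain `bucket/name` strings to keep it self-contained.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, for illustration only. */
final class CleanupCheck {

    /**
     * Returns true if every expected (generation-free) blob key appears among
     * the keys of the temporary blobs that were found and deleted.
     */
    static boolean allExpectedDeleted(List<String> expectedKeys, List<String> deletedKeys) {
        Set<String> deleted = new HashSet<>(deletedKeys);
        return deleted.containsAll(expectedKeys);
    }
}
```

   The result is only meaningful if both lists use the same key form, which is why the ids returned from the API would need their generation stripped first.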
   
   Temporary blobs should never be overwritten -- the object name contains a 
UUID that is generated each time one is created -- so a temporary blob should 
only ever have a single generation. So, I figured it would be safe to compare 
them in this way.
   
   However, as you pointed out in a different comment, I probably shouldn't be 
deleting all of those blobs based on partial name match anyway, so I think the 
need for this normalization goes away. So I'll just plan to remove it.
   





