galenwarren commented on a change in pull request #15599: URL: https://github.com/apache/flink/pull/15599#discussion_r619674758
##########
File path: flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/utils/BlobUtils.java
##########
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.fs.gs.utils;
+
+import com.google.cloud.storage.BlobId;
+
+import java.net.URI;
+
+/** Utility functions related to blobs. */
+public class BlobUtils {
+
+    /** The maximum number of blobs that can be composed in a single operation. */
+    public static final int COMPOSE_MAX_BLOBS = 32;
+
+    /**
+     * Normalizes a blob id, ensuring that the generation is null.
+     *
+     * @param blobId The blob id
+     * @return The blob id with the generation set to null
+     */
+    public static BlobId normalizeBlobId(BlobId blobId) {
+        return BlobId.of(blobId.getBucket(), blobId.getName());
+    }

Review comment:
A BlobId has three parts -- a bucket name, an object name, and a generation. If the generation is null, that means "latest generation". So the normalization removes the explicit generation from a blob id by setting it to null.
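To illustrate the generation semantics, here is a minimal sketch rather than the actual PR code. `SimpleBlobId` is a hypothetical stand-in for `com.google.cloud.storage.BlobId` so that the example runs without the GCS client library, and it assumes that equality takes the generation into account. `normalize` mirrors `normalizeBlobId` from the diff above; the class name and the sample bucket, object name, and generation value are made up.

```java
import java.util.Objects;

final class BlobIdNormalizationSketch {

    /** Hypothetical stand-in for BlobId: bucket, object name, optional generation. */
    static final class SimpleBlobId {
        final String bucket;
        final String name;
        final Long generation; // null means "latest generation"

        SimpleBlobId(String bucket, String name, Long generation) {
            this.bucket = bucket;
            this.name = name;
            this.generation = generation;
        }

        // Assumption: two ids are equal only if all three parts match.
        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof SimpleBlobId)) {
                return false;
            }
            SimpleBlobId other = (SimpleBlobId) o;
            return bucket.equals(other.bucket)
                    && name.equals(other.name)
                    && Objects.equals(generation, other.generation);
        }

        @Override
        public int hashCode() {
            return Objects.hash(bucket, name, generation);
        }
    }

    /** Drops the explicit generation, mirroring normalizeBlobId in the diff. */
    static SimpleBlobId normalize(SimpleBlobId id) {
        return new SimpleBlobId(id.bucket, id.name, null);
    }

    public static void main(String[] args) {
        // "Expected" id, built without a generation.
        SimpleBlobId expected = new SimpleBlobId("my-bucket", "tmp/part-0", null);
        // Id as it might come back from a listing call, with a generation set.
        SimpleBlobId found = new SimpleBlobId("my-bucket", "tmp/part-0", 1234L);

        System.out.println("raw comparison: " + expected.equals(found));
        System.out.println("normalized comparison: " + expected.equals(normalize(found)));
    }
}
```

Without normalization the two ids differ only in the generation, which is enough to make the comparison fail.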
The normalization exists so that we can compare "expected" temporary blob ids, which don't have generations specified, with blob ids returned from the API, which do have generations set. This comparison is done in exactly one place, in `cleanupRecoverableState`, where we match the list of expected BlobIds from the recoverable writer state against the list of temporary blobs found in storage, so that the function can return true/false to indicate whether everything that was expected to be deleted was, in fact, deleted.

Temporary blobs should never be overwritten -- the object name contains a UUID that is generated each time one is created -- so a temporary blob should only ever have a single generation. So, I figured it would be safe to compare them in this way.

However, as you pointed out in a different comment, I probably shouldn't be deleting all of those blobs based on a partial name match anyway, so I think the need for this normalization goes away. I'll plan to remove it.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
