snazy commented on code in PR #3256: URL: https://github.com/apache/polaris/pull/3256#discussion_r2642395923
########## storage/files/api/src/main/java/org/apache/polaris/storage/files/api/FileOperations.java: ##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.polaris.storage.files.api;
+
+import jakarta.annotation.Nonnull;
+import java.util.stream.Stream;
+
+/**
+ * Object storage file operations, used to find files below a given prefix, to purge files, to
+ * identify referenced files, etc.
+ *
+ * <p>All functions of this interface prefer yielding incomplete results over throwing exceptions.
+ */
+public interface FileOperations {
+  /**
+   * Find files that match the given prefix and filter.
+   *
+   * <p>Whether existing but inaccessible files are included in the result depends on the object
+   * store.
+   *
+   * <p>Call sites should consider rate-limiting the scan operations, for example, by using Guava's
+   * {@code RateLimiter} via a {@code Stream.map(x -> { rateLimiter.acquire(); return x; })} step on
+   * the returned stream.
+   *
+   * @param prefix full object storage URI prefix, including scheme and bucket.
+   * @param filter file filter
+   * @return a stream of file specs with the {@link FileSpec#createdAtMillis()} and {@link
+   *     FileSpec#size()} attributes populated with the information provided by the object store.
+   *     The {@link FileSpec#fileType() file type} attribute is not populated; it may be {@link
+   *     FileSpec#guessTypeFromName() guessed}.
+   */
+  Stream<FileSpec> findFiles(@Nonnull String prefix, @Nonnull FileFilter filter);
+
+  /**
+   * Identifies all files referenced by the given table-metadata.
+   *
+   * <p>If "container" files, like the metadata, manifest-list, or manifest files, are not
+   * readable, the returned stream simply does not include them.
+   *
+   * <p>Rate-limiting the returned stream is recommended when identifying multiple tables and/or
+   * views. Rate-limiting a single invocation may not be as effective as expected.
+   *
+   * @param tableMetadataLocation Iceberg table-metadata location
+   * @param deduplicate if true, attempt to deduplicate files by their location, adding additional
+   *     heap pressure to the operation. Implementations may ignore this parameter or may not

Review Comment:
   There is already a currently hard-coded limit of 100,000 objects "currently considered for deduplication" - i.e., deduplication works on the last 100,000 objects seen. We can certainly think of a better way to implement that in a follow-up. I'm just not sure whether every caller should decide this; it might rather be a configuration option. That's why I phrased the docs for this parameter pretty vaguely - to give implementations some freedom.

   For the purge use cases, whether deduplication happens or not isn't that important - at worst, it's some more work being done, but nothing would break.
   Other use cases, for example counting the number of distinct data files, may require deduplication across potentially even billions of data files, which they could implement in a heap-friendly way by placing their own deduplication "around the stream" - think: `Stream<FileSpec> files = myDeduplicator(fileOperations.identifyIcebergTableFiles(metadata, false))`.
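   The "around the stream" deduplication could be sketched with plain JDK types. This is an illustrative sketch only, not Polaris API: `BoundedDeduplicator` is a hypothetical helper, it dedupes on plain `String` locations instead of `FileSpec`, and the 100,000-element window mirrors the hard-coded limit mentioned above. It keeps heap usage bounded the same way - only the most recently inserted locations are remembered:

   ```java
   import java.util.Collections;
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.Set;
   import java.util.function.Predicate;

   /** Windowed deduplication over a stream of file locations (illustrative sketch). */
   final class BoundedDeduplicator implements Predicate<String> {
     private final Set<String> seen;

     BoundedDeduplicator(int window) {
       // A LinkedHashMap-backed set that evicts the eldest entry once the
       // window is full, so heap pressure stays bounded regardless of how
       // many locations flow through the stream.
       this.seen =
           Collections.newSetFromMap(
               new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
                 @Override
                 protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                   return size() > window;
                 }
               });
     }

     @Override
     public boolean test(String location) {
       // Set.add() returns false for a location already inside the window,
       // so duplicates within the window are filtered out of the stream.
       return seen.add(location);
     }
   }
   ```

   Usage would then look like `fileLocations.filter(new BoundedDeduplicator(100_000))` on a sequential stream - the predicate is stateful, so it must not be used on a parallel stream.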
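   The rate-limiting step the javadoc recommends uses Guava's `RateLimiter`; to keep this sketch dependency-free, `SimpleRateLimiter` below is a hypothetical JDK-only stand-in (not part of Polaris or Guava) that shows the same `Stream.map(x -> { rateLimiter.acquire(); return x; })` pattern:

   ```java
   import java.util.concurrent.locks.LockSupport;

   /** Minimal JDK-only stand-in for a permits-per-second rate limiter (illustrative sketch). */
   final class SimpleRateLimiter {
     private final long intervalNanos;
     private long nextFreeAt = System.nanoTime();

     SimpleRateLimiter(double permitsPerSecond) {
       this.intervalNanos = (long) (1_000_000_000L / permitsPerSecond);
     }

     /** Blocks until the next permit is available, spacing callers intervalNanos apart. */
     synchronized void acquire() {
       long now;
       while ((now = System.nanoTime()) < nextFreeAt) {
         // Park until the next permit time; loop guards against early wakeups.
         LockSupport.parkNanos(nextFreeAt - now);
       }
       nextFreeAt = now + intervalNanos;
     }
   }
   ```

   A call site would then throttle a scan as the javadoc suggests, e.g. `findFiles(prefix, filter).map(x -> { rateLimiter.acquire(); return x; })`. Note that terminal operations such as `count()` may elide a size-preserving `map` step entirely, so a collecting or `forEach`-style terminal operation is the safer pairing.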
