snazy commented on code in PR #3256: URL: https://github.com/apache/polaris/pull/3256#discussion_r2642395923
########## storage/files/api/src/main/java/org/apache/polaris/storage/files/api/FileOperations.java: ##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.polaris.storage.files.api;
+
+import jakarta.annotation.Nonnull;
+import java.util.stream.Stream;
+
+/**
+ * Object storage file operations, used to find files below a given prefix, to purge files, to
+ * identify referenced files, etc.
+ *
+ * <p>All functions of this interface prefer yielding incomplete results over throwing exceptions.
+ */
+public interface FileOperations {
+  /**
+   * Find files that match the given prefix and filter.
+   *
+   * <p>Whether existing but inaccessible files are included in the result depends on the object
+   * store.
+   *
+   * <p>Call sites should consider rate-limiting the scan operations, for example, by using Guava's
+   * {@code RateLimiter} via a {@code Stream.map(x -> { rateLimiter.acquire(); return x; })} step on
+   * the returned stream.
+   *
+   * @param prefix full object storage URI prefix, including scheme and bucket.
+   * @param filter file filter
+   * @return a stream of file specs with the {@link FileSpec#createdAtMillis()} and {@link
+   *     FileSpec#size()} attributes populated with the information provided by the object store.
+   *     The {@link FileSpec#fileType() file type} attribute is not populated; it may be {@link
+   *     FileSpec#guessTypeFromName() guessed}.
+   */
+  Stream<FileSpec> findFiles(@Nonnull String prefix, @Nonnull FileFilter filter);
+
+  /**
+   * Identifies all files referenced by the given table-metadata.
+   *
+   * <p>If "container" files, like the metadata, manifest-list, or manifest files, are not
+   * readable, the returned stream simply does not include them.
+   *
+   * <p>Rate-limiting the returned stream is recommended when identifying multiple tables and/or
+   * views. Rate-limiting a single invocation may not be as effective as expected.
+   *
+   * @param tableMetadataLocation Iceberg table-metadata location
+   * @param deduplicate if true, attempt to deduplicate files by their location, adding additional
+   *     heap pressure to the operation. Implementations may ignore this parameter or may not

Review Comment:
   There is already a currently hard-coded limit of 100,000 objects "currently considered for deduplication" - i.e., deduplication works on the last 100,000 objects seen. We can certainly think of a better way to implement that in a follow-up. I'm just not sure whether every caller should decide this; it might rather be a configuration option. That's why I phrased the docs for this parameter pretty vaguely - to give implementations some freedom.

   For the purge use cases, whether deduplication happens or not isn't that important - at worst, it's some more work being done, but nothing would break.
   Other use cases, for example counting the number of distinct data files, may require deduplication across potentially even billions of data files, which they could implement in a heap-friendly way by placing their own deduplication "around the stream" - think: `Stream<FileSpec> files = myDeduplicator(fileOperations.identifyIcebergTableFiles(metadata, false))`.
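   The "around the stream" deduplication could be sketched with plain JDK types. This is an illustrative sketch only, not Polaris API: `BoundedDeduplicator` is a hypothetical helper, it dedupes on plain `String` locations instead of `FileSpec`, and the 100,000-element window mirrors the hard-coded limit mentioned above. It keeps heap usage bounded the same way - only the most recently inserted locations are remembered:

   ```java
   import java.util.Collections;
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.Set;
   import java.util.function.Predicate;

   /** Windowed deduplication over a stream of file locations (illustrative sketch). */
   final class BoundedDeduplicator implements Predicate<String> {
     private final Set<String> seen;

     BoundedDeduplicator(int window) {
       // A LinkedHashMap-backed set that evicts the eldest entry once the
       // window is full, so heap pressure stays bounded regardless of how
       // many locations flow through the stream.
       this.seen =
           Collections.newSetFromMap(
               new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
                 @Override
                 protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                   return size() > window;
                 }
               });
     }

     @Override
     public boolean test(String location) {
       // Set.add() returns false for a location already inside the window,
       // so duplicates within the window are filtered out of the stream.
       return seen.add(location);
     }
   }
   ```

   Usage would then look like `fileLocations.filter(new BoundedDeduplicator(100_000))` on a sequential stream - the predicate is stateful, so it must not be used on a parallel stream.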
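   The rate-limiting step the javadoc recommends uses Guava's `RateLimiter`; to keep this sketch dependency-free, `SimpleRateLimiter` below is a hypothetical JDK-only stand-in (not part of Polaris or Guava) that shows the same `Stream.map(x -> { rateLimiter.acquire(); return x; })` pattern:

   ```java
   import java.util.concurrent.locks.LockSupport;

   /** Minimal JDK-only stand-in for a permits-per-second rate limiter (illustrative sketch). */
   final class SimpleRateLimiter {
     private final long intervalNanos;
     private long nextFreeAt = System.nanoTime();

     SimpleRateLimiter(double permitsPerSecond) {
       this.intervalNanos = (long) (1_000_000_000L / permitsPerSecond);
     }

     /** Blocks until the next permit is available, spacing callers intervalNanos apart. */
     synchronized void acquire() {
       long now;
       while ((now = System.nanoTime()) < nextFreeAt) {
         // Park until the next permit time; loop guards against early wakeups.
         LockSupport.parkNanos(nextFreeAt - now);
       }
       nextFreeAt = now + intervalNanos;
     }
   }
   ```

   A call site would then throttle a scan as the javadoc suggests, e.g. `findFiles(prefix, filter).map(x -> { rateLimiter.acquire(); return x; })`. Note that terminal operations such as `count()` may elide a size-preserving `map` step entirely, so a collecting or `forEach`-style terminal operation is the safer pairing.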
