collado-mike commented on code in PR #312:
URL: https://github.com/apache/polaris/pull/312#discussion_r1790600534
##########
polaris-service/src/main/java/org/apache/polaris/service/PolarisApplication.java:
##########
@@ -211,6 +212,9 @@ public void run(PolarisApplicationConfig configuration,
Environment environment)
taskExecutor.addTaskHandler(
new ManifestFileCleanupTaskHandler(
fileIOSupplier, Executors.newVirtualThreadPerTaskExecutor()));
+ taskExecutor.addTaskHandler(
+ new TableContentCleanupTaskHandler(
+ fileIOSupplier, Executors.newVirtualThreadPerTaskExecutor()));
Review Comment:
It's worth considering whether these ought to share the same executor
service... I haven't put much thought into it, so... we can just leave a
comment for now if there's not a strong argument either way.
##########
polaris-core/src/main/java/org/apache/polaris/core/entity/AsyncTaskType.java:
##########
@@ -23,7 +23,8 @@
public enum AsyncTaskType {
ENTITY_CLEANUP_SCHEDULER(1),
- FILE_CLEANUP(2);
+ FILE_CLEANUP(2),
+ TABLE_CONTENT_CLEANUP(3);
Review Comment:
The `FILE_CLEANUP` type was poorly named. We should make it specific, as
only the `ManifestFileCleanupTaskHandler` deals with it. That said, the actual
serialized value is only the integer (see
https://github.com/collado-mike/polaris/blob/8cb6b44bf57dc597dab612d109d3eb534aef5715/polaris-core/src/main/java/org/apache/polaris/core/entity/AsyncTaskType.java#L34-L37
), so we can rename it and maintain backward compatibility.
With that in mind, I think we should rename `FILE_CLEANUP` ->
`MANIFEST_FILE_CLEANUP` and add _two_ enums for the new file types:
`METADATA_LOG_ENTRY_CLEANUP` and `STATISTICS_FILE_CLEANUP`. Your task handler
can look for instances of both types. I think that gives us flexibility in the
future if we need to handle the different file types differently.
##########
polaris-service/src/main/java/org/apache/polaris/service/task/TableContentCleanupTaskHandler.java:
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.polaris.service.task;
+
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Function;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.FileIO;
+import org.apache.polaris.core.entity.AsyncTaskType;
+import org.apache.polaris.core.entity.TaskEntity;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * {@link TaskHandler} responsible for deleting previous metadata and
statistics files of a table.
+ */
+public class TableContentCleanupTaskHandler implements TaskHandler {
Review Comment:
Oof, sorry for missing this comment - I don't think we should have merged
these two classes. At most, I think a common base class would be fine, but I'd
prefer to avoid overloading the same class and cluttering it with if/else
statements
##########
polaris-service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java:
##########
@@ -49,6 +51,7 @@ public class TableCleanupTaskHandler implements TaskHandler {
private final TaskExecutor taskExecutor;
private final MetaStoreManagerFactory metaStoreManagerFactory;
private final Function<TaskEntity, FileIO> fileIOSupplier;
+ private static final int BATCH_SIZE = 10;
Review Comment:
Let's use the `PolarisConfigurationStore` to control this value
##########
polaris-service/src/main/java/org/apache/polaris/service/task/TableContentCleanupTaskHandler.java:
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.polaris.service.task;
+
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Function;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.FileIO;
+import org.apache.polaris.core.entity.AsyncTaskType;
+import org.apache.polaris.core.entity.TaskEntity;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * {@link TaskHandler} responsible for deleting previous metadata and
statistics files of a table.
+ */
+public class TableContentCleanupTaskHandler implements TaskHandler {
Review Comment:
I'd rename this to something like `BatchFileCleanupTaskHandler` and make the
javadoc more generic. We don't really need to know what kind of files these
are, as we treat all of them the same (unlike the
`ManifestFileCleanupTaskHandler`, which has to read the manifests).
##########
polaris-service/src/main/java/org/apache/polaris/service/task/ManifestFileCleanupTaskHandler.java:
##########
@@ -68,58 +71,110 @@ public boolean canHandleTask(TaskEntity task) {
@Override
public boolean handleTask(TaskEntity task) {
ManifestCleanupTask cleanupTask = task.readData(ManifestCleanupTask.class);
- ManifestFile manifestFile =
decodeManifestData(cleanupTask.getManifestFileData());
TableIdentifier tableId = cleanupTask.getTableId();
try (FileIO authorizedFileIO = fileIOSupplier.apply(task)) {
-
- // if the file doesn't exist, we assume that another task execution was
successful, but failed
- // to drop the task entity. Log a warning and return success
- if (!TaskUtils.exists(manifestFile.path(), authorizedFileIO)) {
+ if (cleanupTask.getManifestFileData() != null) {
+ ManifestFile manifestFile =
decodeManifestData(cleanupTask.getManifestFileData());
+ return cleanUpManifestFile(manifestFile, authorizedFileIO, tableId);
+ } else if (cleanupTask.getMetadataFiles() != null) {
Review Comment:
I don't want to belabor this, but I don't want to overload this class with
logic for handling many different file types. As I mentioned, the
`AsyncTaskType.FILE_CLEANUP` enum is poorly named, making it sound like this
class can just handle any kind of file clean up.
In order to avoid bogging down this PR too much, can we just add a second
task type for the metadata files and predicate the logic here on the task type
rather than testing the presence of `getManifestFileData` or
`getMetadataFiles`? A future PR can refactor this code to use a common base
class for two handlers to avoid duplicating logic while maintaining a clear
separation of responsibilities.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]