[ https://issues.apache.org/jira/browse/HADOOP-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838974#comment-17838974 ]

ASF GitHub Bot commented on HADOOP-18679:
-----------------------------------------

steveloughran commented on code in PR #6726:
URL: https://github.com/apache/hadoop/pull/6726#discussion_r1572230146


##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/bulkdelete.md:
##########
@@ -161,124 +105,20 @@ store.hasPathCapability(path, "fs.capability.bulk.delete")
 
 ### Invocation through Reflection.
 
-The need for many Libraries to compile against very old versions of Hadoop
+The need for many libraries to compile against very old versions of Hadoop
 means that most of the cloud-first Filesystem API calls cannot be used except
 through reflection, and the more complicated the API and its data types are,
 the harder that reflection is to implement.
 
-To assist this, the class `org.apache.hadoop.fs.FileUtil` has two methods
+To assist this, the class `org.apache.hadoop.io.wrappedio.WrappedIO` has a few methods
 which are intended to provide simple access to the API, especially
 through reflection.
 
 ```java
-  /**
-   * Get the maximum number of objects/files to delete in a single request.
-   * @param fs filesystem
-   * @param path path to delete under.
-   * @return a number greater than or equal to zero.
-   * @throws UnsupportedOperationException bulk delete under that path is not supported.
-   * @throws IllegalArgumentException path not valid.
-   * @throws IOException problems resolving paths
-   */
+
  public static int bulkDeletePageSize(FileSystem fs, Path path) throws IOException;
   
-  /**
-   * Delete a list of files/objects.
-   * <ul>
-   *   <li>Files must be under the path provided in {@code base}.</li>
-   *   <li>The size of the list must be equal to or less than the page size.</li>
-   *   <li>Directories are not supported; the outcome of attempting to delete
-   *       directories is undefined (ignored; undetected, listed as failures...).</li>
-   *   <li>The operation is not atomic.</li>
-   *   <li>The operation is treated as idempotent: network failures may
-   *        trigger resubmission of the request -any new objects created under a
-   *        path in the list may then be deleted.</li>
-   *    <li>There is no guarantee that any parent directories exist after this call.
-   *    </li>
-   * </ul>
-   * @param fs filesystem
-   * @param base path to delete under.
-   * @param paths list of paths which must be absolute and under the base path.
-   * @return a list of all the paths which couldn't be deleted for a reason other than "not found" and any associated error message.
-   * @throws UnsupportedOperationException bulk delete under that path is not supported.
-   * @throws IOException IO problems including networking, authentication and more.
-   * @throws IllegalArgumentException if a path argument is invalid.
-   */
-  public static List<Map.Entry<Path, String>> bulkDelete(FileSystem fs, Path base, List<Path> paths)
-```
  public static List<Map.Entry<Path, String>> bulkDelete(FileSystem fs, Path base, List<Path> paths)
      throws IOException;
 
-## S3A Implementation
-
-The S3A client exports this API.

Review Comment:
   this needs to be covered, along with the default implementation "maps to delete(path, false)"



##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/DefaultBulkDeleteOperation.java:
##########
@@ -0,0 +1,84 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.fs;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.hadoop.util.functional.Tuples;
+
+import static java.util.Objects.requireNonNull;
+import static org.apache.hadoop.fs.BulkDeleteUtils.validateBulkDeletePaths;
+import static org.apache.hadoop.util.Preconditions.checkArgument;
+
+/**
+ * Default implementation of the {@link BulkDelete} interface.
+ */
+public class DefaultBulkDeleteOperation implements BulkDelete {
+
+    private final int pageSize;
+
+    private final Path basePath;
+
+    private final FileSystem fs;
+
+    public DefaultBulkDeleteOperation(int pageSize,
+                                      Path basePath,
+                                      FileSystem fs) {
+        checkArgument(pageSize == 1, "Page size must be equal to 1");
+        this.pageSize = pageSize;
+        this.basePath = requireNonNull(basePath);
+        this.fs = fs;
+    }
+
+    @Override
+    public int pageSize() {
+        return pageSize;
+    }
+
+    @Override
+    public Path basePath() {
+        return basePath;
+    }
+
+    @Override
+    public List<Map.Entry<Path, String>> bulkDelete(Collection<Path> paths)
+            throws IOException, IllegalArgumentException {
+        validateBulkDeletePaths(paths, pageSize, basePath);
+        List<Map.Entry<Path, String>> result = new ArrayList<>();
+        // this for loop doesn't make sense as pageSize must be 1.
+        for (Path path : paths) {

Review Comment:
   there's only ever going to be 1 entry here
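   A sketch of that simplification, relying on the `pageSize == 1` check in the constructor:

   ```java
   // with pageSize pinned to 1 by the constructor, no loop is needed:
   // an empty collection is a no-op, otherwise handle the single path
   if (!paths.isEmpty()) {
       Path path = paths.iterator().next();
       // ... existing single-path delete logic ...
   }
   ```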



##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/DefaultBulkDeleteOperation.java:
##########
@@ -0,0 +1,84 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.fs;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.hadoop.util.functional.Tuples;
+
+import static java.util.Objects.requireNonNull;
+import static org.apache.hadoop.fs.BulkDeleteUtils.validateBulkDeletePaths;
+import static org.apache.hadoop.util.Preconditions.checkArgument;
+
+/**
+ * Default implementation of the {@link BulkDelete} interface.
+ */
+public class DefaultBulkDeleteOperation implements BulkDelete {
+
+    private final int pageSize;
+
+    private final Path basePath;
+
+    private final FileSystem fs;
+
+    public DefaultBulkDeleteOperation(int pageSize,
+                                      Path basePath,
+                                      FileSystem fs) {
+        checkArgument(pageSize == 1, "Page size must be equal to 1");
+        this.pageSize = pageSize;
+        this.basePath = requireNonNull(basePath);
+        this.fs = fs;
+    }
+
+    @Override
+    public int pageSize() {
+        return pageSize;
+    }
+
+    @Override
+    public Path basePath() {
+        return basePath;
+    }
+
+    @Override
+    public List<Map.Entry<Path, String>> bulkDelete(Collection<Path> paths)
+            throws IOException, IllegalArgumentException {
+        validateBulkDeletePaths(paths, pageSize, basePath);
+        List<Map.Entry<Path, String>> result = new ArrayList<>();
+        // this for loop doesn't make sense as pageSize must be 1.
+        for (Path path : paths) {
+            try {
+                fs.delete(path, false);
+                // What to do if this returns false?
+                // I think we should add the path to the result list with value "Not Deleted".

Review Comment:
   good q. I'd say yes. Or actually do a getFileStatus() call and see what is there for
   * file doesn't exist (not an error)
   * path is a directory: add to result
   
   key is that file not found isn't escalated
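   A sketch of that suggested handling (illustrative, not the committed code; the failure message text is an assumption, and `Tuples.pair` is the helper already imported in this class):

   ```java
   // a false return from delete() triggers a getFileStatus() probe so that
   // "file not found" is swallowed while a directory is reported back
   boolean deleted = fs.delete(path, false);
   if (!deleted) {
       try {
           FileStatus status = fs.getFileStatus(path);
           if (status.isDirectory()) {
               // directories are unsupported: report, don't raise
               result.add(Tuples.pair(path, "path is a directory"));
           }
       } catch (FileNotFoundException e) {
           // the file doesn't exist: not an error for bulk delete
       }
   }
   ```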



##########
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractContractBulkDeleteTest.java:
##########
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *       http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ */
+
+package org.apache.hadoop.fs.contract;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Before;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.hadoop.fs.CommonPathCapabilities;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.wrappedio.WrappedIO;
+
+import static org.apache.hadoop.fs.contract.ContractTestUtils.skip;
+import static org.apache.hadoop.fs.contract.ContractTestUtils.touch;
+import static org.apache.hadoop.io.wrappedio.WrappedIO.bulkDelete;
+import static org.apache.hadoop.test.LambdaTestUtils.intercept;
+
+/**
+ * Contract tests for bulk delete operation.
+ */
+public abstract class AbstractContractBulkDeleteTest extends AbstractFSContractTestBase {
+
+    private static final Logger LOG =
+            LoggerFactory.getLogger(AbstractContractBulkDeleteTest.class);
+
+    protected int pageSize;
+
+    protected Path basePath;
+
+    protected FileSystem fs;
+
+    @Before
+    public void setUp() throws Exception {
+        fs = getFileSystem();
+        basePath = path(getClass().getName());
+        pageSize = WrappedIO.bulkDeletePageSize(getFileSystem(), basePath);
+        fs.mkdirs(basePath);
+    }
+
+    public Path getBasePath() {
+        return basePath;
+    }
+
+    /**
+     * Validate the page size for the bulk delete operation. Different stores can
+     * have different implementations of bulk delete, and thus different page sizes.
+     */
+    @Test
+    public void validatePageSize() throws Exception {
+        Assertions.assertThat(pageSize)
+                .describedAs("Page size should be 1 by default for all stores")
+                .isEqualTo(1);
+    }
+
+    @Test
+    public void testPathsSizeEqualsPageSizePrecondition() throws Exception {
+        List<Path> listOfPaths = createListOfPaths(pageSize, basePath);
+        // Bulk delete call should pass with no exception.
+        bulkDelete(getFileSystem(), basePath, listOfPaths);
+    }
+
+    @Test
+    public void testPathsSizeGreaterThanPageSizePrecondition() throws Exception {
+        List<Path> listOfPaths = createListOfPaths(pageSize + 1, basePath);
+        intercept(IllegalArgumentException.class,
+                () -> bulkDelete(getFileSystem(), basePath, listOfPaths));
+    }
+
+    @Test
+    public void testPathsSizeLessThanPageSizePrecondition() throws Exception {
+        List<Path> listOfPaths = createListOfPaths(pageSize - 1, basePath);
+        // Bulk delete call should pass with no exception.
+        bulkDelete(getFileSystem(), basePath, listOfPaths);
+    }
+
+    @Test
+    public void testBulkDeleteSuccessful() throws Exception {
+        List<Path> listOfPaths = createListOfPaths(pageSize, basePath);
+        for (Path path : listOfPaths) {
+            touch(fs, path);
+        }
+        FileStatus[] fileStatuses = fs.listStatus(basePath);
+        Assertions.assertThat(fileStatuses)
+                .describedAs("File count after create")
+                .hasSize(pageSize);
+        assertSuccessfulBulkDelete(
+            bulkDelete(getFileSystem(), basePath, listOfPaths));
+        FileStatus[] fileStatusesAfterDelete = fs.listStatus(basePath);
+        Assertions.assertThat(fileStatusesAfterDelete)
+                .describedAs("File statuses should be empty after delete")
+                .isEmpty();
+    }
+
+    @Test
+    public void validatePathCapabilityDeclared() throws Exception {
+        Assertions.assertThat(fs.hasPathCapability(basePath, CommonPathCapabilities.BULK_DELETE))
+                .describedAs("Path capability BULK_DELETE should be declared")
+                .isTrue();
+    }
+
+    @Test
+    public void testDeletePathsNotUnderBase() throws Exception {
+        List<Path> paths = new ArrayList<>();
+        Path pathNotUnderBase = path("not-under-base");
+        paths.add(pathNotUnderBase);
+        // Should fail as path is not under the base path.
+        intercept(IllegalArgumentException.class,
+                () -> bulkDelete(getFileSystem(), basePath, paths));
+    }
+
+    @Test
+    public void testDeletePathsNotAbsolute() throws Exception {
+        List<Path> paths = new ArrayList<>();
+        Path pathNotAbsolute = new Path("not-absolute");
+        paths.add(pathNotAbsolute);
+        // Should fail as path is not absolute.
+        intercept(IllegalArgumentException.class,
+                () -> bulkDelete(getFileSystem(), basePath, paths));
+    }
+
+    @Test
+    public void testDeletePathsNotExists() throws Exception {
+        List<Path> paths = new ArrayList<>();
+        Path pathNotExists = new Path(basePath, "not-exists");
+        paths.add(pathNotExists);
+        // bulk delete call doesn't verify if a path exists or not before deleting.
+        assertSuccessfulBulkDelete(bulkDelete(getFileSystem(), basePath, paths));
+    }
+
+    @Test
+    public void testDeletePathsDirectory() throws Exception {
+        List<Path> paths = new ArrayList<>();
+        Path dirPath = new Path(basePath, "dir");
+        fs.mkdirs(dirPath);
+        paths.add(dirPath);
+        Path filePath = new Path(dirPath, "file");
+        touch(fs, filePath);
+        paths.add(filePath);
+        pageSizePreconditionForTest(paths.size());
+        // Outcome is undefined, but the call shouldn't fail.
+        // In the case of S3, directories will still be present.
+        assertSuccessfulBulkDelete(bulkDelete(getFileSystem(), basePath, paths));
+    }
+
+    @Test
+    public void testDeleteEmptyDirectory() throws Exception {
+        List<Path> paths = new ArrayList<>();
+        Path emptyDirPath = new Path(basePath, "empty-dir");
+        fs.mkdirs(emptyDirPath);
+        paths.add(emptyDirPath);
+        // Should pass as empty directory.
+        assertSuccessfulBulkDelete(bulkDelete(getFileSystem(), basePath, paths));
+    }
+
+    @Test
+    public void testDeleteEmptyList() throws Exception {
+        List<Path> paths = new ArrayList<>();
+        // Empty list should pass.
+        assertSuccessfulBulkDelete(bulkDelete(getFileSystem(), basePath, paths));
+    }
+
+    @Test
+    public void testDeleteSamePathsMoreThanOnce() throws Exception {
+        List<Path> paths = new ArrayList<>();
+        Path path = new Path(basePath, "file");
+        touch(fs, path);
+        paths.add(path);
+        paths.add(path);
+        Path another = new Path(basePath, "another-file");
+        touch(fs, another);
+        paths.add(another);
+        pageSizePreconditionForTest(paths.size());

Review Comment:
   build the paths and this precondition before the touch; saves any network IO before skipping
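   i.e. a sketch of the reordering, using the same names as the test above:

   ```java
   // build the paths and check the page-size precondition first,
   // so a skipped test performs no network IO
   List<Path> paths = new ArrayList<>();
   Path path = new Path(basePath, "file");
   paths.add(path);
   paths.add(path);
   Path another = new Path(basePath, "another-file");
   paths.add(another);
   pageSizePreconditionForTest(paths.size());
   // only create the files once we know the test will run
   touch(fs, path);
   touch(fs, another);
   ```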



##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/bulkdelete.md:
##########
@@ -0,0 +1,124 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# <a name="BulkDelete"></a> interface `BulkDelete`
+
+<!

> Add API for bulk/paged object deletion
> --------------------------------------
>
>                 Key: HADOOP-18679
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18679
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.5
>            Reporter: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>
> iceberg and hbase could benefit from being able to give a list of individual
> files to delete: files which may be scattered round the bucket, for better
> read performance.
> Add a new optional interface for an object store which allows a caller to
> submit a list of paths to files to delete, where
> the expectation is
> * if a path is a file: delete
> * if a path is a dir, outcome undefined
> For s3 that'd let us build these into DeleteRequest objects, and submit, 
> without any probes first.
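
A caller's-eye sketch of the API as described above (the bucket URI, file names, and paging loop are illustrative assumptions; the `WrappedIO` entry points are those quoted earlier in the review):

```java
// Sketch: paging a list of object deletions through the proposed API.
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.wrappedio.WrappedIO;

public final class BulkDeleteExample {
  public static void main(String[] args) throws Exception {
    Path base = new Path("s3a://example-bucket/table/");   // assumed URI
    FileSystem fs = base.getFileSystem(new Configuration());
    // files scattered under the base path; a file is deleted, a dir is undefined
    List<Path> toDelete = Arrays.asList(
        new Path(base, "data/part-0001"),
        new Path(base, "data/part-0002"));
    int pageSize = WrappedIO.bulkDeletePageSize(fs, base);
    for (int i = 0; i < toDelete.size(); i += pageSize) {
      List<Path> page = toDelete.subList(i, Math.min(i + pageSize, toDelete.size()));
      // failures come back as (path, error) pairs rather than exceptions
      List<Map.Entry<Path, String>> failures = WrappedIO.bulkDelete(fs, base, page);
      failures.forEach(e -> System.err.println(e.getKey() + ": " + e.getValue()));
    }
  }
}
```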


