[ 
https://issues.apache.org/jira/browse/HADOOP-17833?focusedWorklogId=778886&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778886
 ]

ASF GitHub Bot logged work on HADOOP-17833:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Jun/22 07:09
            Start Date: 07/Jun/22 07:09
    Worklog Time Spent: 10m 
      Work Description: mehakmeet commented on code in PR #3289:
URL: https://github.com/apache/hadoop/pull/3289#discussion_r890808566


##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/statistics/PutTrackerStatistics.java:
##########
@@ -0,0 +1,29 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.statistics;
+
+/**
+ * Interface for put tracking.
+ * It is sublclassed by {@link BlockOutputStreamStatistics},

Review Comment:
   nit: subclassed



##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -1599,69 +1604,88 @@ public FSDataOutputStream create(Path f, FsPermission 
permission,
       boolean overwrite, int bufferSize, short replication, long blockSize,
       Progressable progress) throws IOException {
     final Path path = qualify(f);
+
     // the span will be picked up inside the output stream
     return trackDurationAndSpan(INVOCATION_CREATE, path, () ->
-        innerCreateFile(path, permission, overwrite, bufferSize, replication,
-            blockSize, progress));
+        innerCreateFile(path,
+            progress,
+            getActiveAuditSpan(),
+            overwrite
+                ? OPTIONS_CREATE_FILE_OVERWRITE
+                : OPTIONS_CREATE_FILE_NO_OVERWRITE));
   }
 
   /**
    * Create an FSDataOutputStream at the indicated Path with write-progress
    * reporting; in the active span.
    * Retry policy: retrying, translated on the getFileStatus() probe.
    * No data is uploaded to S3 in this call, so no retry issues related to 
that.
+   * The "performance" flag disables safety checks for the path being a file,
+   * parent directory existing, and doesn't attempt to delete
+   * dir markers, irrespective of FS settings.
+   * If true, this method call does no IO at all.
+   * @param flags flags for the operation
+   * @param performance fast creation over safety

Review Comment:
   nit: Params no longer used, need to remove from javadocs



##########
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/performance/ITestCreateFileCost.java:
##########
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.performance;
+
+import java.io.IOException;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Test;
+
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FSDataOutputStreamBuilder;
+import org.apache.hadoop.fs.FileAlreadyExistsException;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.contract.ContractTestUtils;
+import org.apache.hadoop.fs.s3a.S3AFileSystem;
+
+
+import static java.util.Objects.requireNonNull;
+import static org.apache.hadoop.fs.contract.ContractTestUtils.toChar;
+import static org.apache.hadoop.fs.s3a.Constants.FS_S3A_CREATE_HEADER;
+import static org.apache.hadoop.fs.s3a.Constants.FS_S3A_CREATE_PERFORMANCE;
+import static org.apache.hadoop.fs.s3a.Constants.XA_HEADER_PREFIX;
+import static org.apache.hadoop.fs.s3a.Statistic.OBJECT_BULK_DELETE_REQUEST;
+import static org.apache.hadoop.fs.s3a.Statistic.OBJECT_DELETE_REQUEST;
+import static 
org.apache.hadoop.fs.s3a.performance.OperationCost.CREATE_FILE_NO_OVERWRITE;
+import static 
org.apache.hadoop.fs.s3a.performance.OperationCost.CREATE_FILE_OVERWRITE;
+import static 
org.apache.hadoop.fs.s3a.performance.OperationCost.FILE_STATUS_DIR_PROBE;
+import static 
org.apache.hadoop.fs.s3a.performance.OperationCost.FILE_STATUS_FILE_PROBE;
+import static 
org.apache.hadoop.fs.s3a.performance.OperationCost.GET_FILE_STATUS_ON_DIR_MARKER;
+import static 
org.apache.hadoop.fs.s3a.performance.OperationCost.GET_FILE_STATUS_ON_FILE;
+import static 
org.apache.hadoop.fs.s3a.performance.OperationCost.HEAD_OPERATION;
+import static 
org.apache.hadoop.fs.s3a.performance.OperationCost.NO_HEAD_OR_LIST;
+
+/**
+ * Assert cost of createFile operations, especially
+ * with the FS_S3A_CREATE_PERFORMANCE option.
+ */
+@SuppressWarnings("resource")
+public class ITestCreateFileCost extends AbstractS3ACostTest {
+
+  /**
+   * Create with markers kept, always.
+   */
+  public ITestCreateFileCost() {
+    super(false);
+  }
+
+  @Test
+  public void testCreateNoOverwrite() throws Throwable {
+    describe("Test file creation with outoverwrite");

Review Comment:
   nit: "without overwrite"



##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/CommitConstants.java:
##########
@@ -222,13 +226,18 @@ private CommitConstants() {
   /**
    * Number of threads in committers for parallel operations on files
    * (upload, commit, abort, delete...): {@value}.
+   * Two thread pools this size are created, one for the outer
+   * task-level parallelism, and one for parallel execution
+   * within tasks (POSTs to commit individual uploads)
+   * If the value is negative, it is inverted and then multiplied
+   * by the number of cores in the CPU.
    */
   public static final String FS_S3A_COMMITTER_THREADS =
       "fs.s3a.committer.threads";
   /**
    * Default value for {@link #FS_S3A_COMMITTER_THREADS}: {@value}.
    */
-  public static final int DEFAULT_COMMITTER_THREADS = 8;
+  public static final int DEFAULT_COMMITTER_THREADS = -4;

Review Comment:
   Is it fine to create these big threadpools in machines with a high no.of 
cores, or is it something we should keep constant for all machines and change 
on-demand from the config itself?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 778886)
    Time Spent: 9h 20m  (was: 9h 10m)

> Improve Magic Committer Performance
> -----------------------------------
>
>                 Key: HADOOP-17833
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17833
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.3.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> Magic committer tasks can be slow because every file created with 
> overwrite=false triggers a HEAD (verify there's no file) and a LIST (that 
> there's no dir). And because of delayed manifestations, it may not behave as 
> expected.
> ParquetOutputFormat is one example of a library which does this.
> we could fix parquet to use overwrite=true, but (a) there may be surprises in 
> other uses (b) it'd still leave the list and (c) do nothing for other formats 
> call
> Proposed: createFile() under a magic path to skip all probes for file/dir at 
> end of path
> Only a single task attempt Will be writing to that directory and it should 
> know what it is doing. If there is conflicting file names and parts across 
> tasks that won't even get picked up at this point. Oh and none of the 
> committers ever check for this: you'll get the last file manifested (s3a) or 
> renamed (file)
> If we skip the checks we will save 2 HTTP requests/file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to