[
https://issues.apache.org/jira/browse/HADOOP-17833?focusedWorklogId=778886&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778886
]
ASF GitHub Bot logged work on HADOOP-17833:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 07/Jun/22 07:09
Start Date: 07/Jun/22 07:09
Worklog Time Spent: 10m
Work Description: mehakmeet commented on code in PR #3289:
URL: https://github.com/apache/hadoop/pull/3289#discussion_r890808566
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/statistics/PutTrackerStatistics.java:
##########
@@ -0,0 +1,29 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.statistics;
+
+/**
+ * Interface for put tracking.
+ * It is sublclassed by {@link BlockOutputStreamStatistics},
Review Comment:
nit: subclassed
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -1599,69 +1604,88 @@ public FSDataOutputStream create(Path f, FsPermission
permission,
boolean overwrite, int bufferSize, short replication, long blockSize,
Progressable progress) throws IOException {
final Path path = qualify(f);
+
// the span will be picked up inside the output stream
return trackDurationAndSpan(INVOCATION_CREATE, path, () ->
- innerCreateFile(path, permission, overwrite, bufferSize, replication,
- blockSize, progress));
+ innerCreateFile(path,
+ progress,
+ getActiveAuditSpan(),
+ overwrite
+ ? OPTIONS_CREATE_FILE_OVERWRITE
+ : OPTIONS_CREATE_FILE_NO_OVERWRITE));
}
/**
* Create an FSDataOutputStream at the indicated Path with write-progress
* reporting; in the active span.
* Retry policy: retrying, translated on the getFileStatus() probe.
* No data is uploaded to S3 in this call, so no retry issues related to
that.
+ * The "performance" flag disables safety checks for the path being a file,
+ * parent directory existing, and doesn't attempt to delete
+ * dir markers, irrespective of FS settings.
+ * If true, this method call does no IO at all.
+ * @param flags flags for the operation
+ * @param performance fast creation over safety
Review Comment:
nit: Params no longer used, need to remove from javadocs
##########
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/performance/ITestCreateFileCost.java:
##########
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.performance;
+
+import java.io.IOException;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Test;
+
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FSDataOutputStreamBuilder;
+import org.apache.hadoop.fs.FileAlreadyExistsException;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.contract.ContractTestUtils;
+import org.apache.hadoop.fs.s3a.S3AFileSystem;
+
+
+import static java.util.Objects.requireNonNull;
+import static org.apache.hadoop.fs.contract.ContractTestUtils.toChar;
+import static org.apache.hadoop.fs.s3a.Constants.FS_S3A_CREATE_HEADER;
+import static org.apache.hadoop.fs.s3a.Constants.FS_S3A_CREATE_PERFORMANCE;
+import static org.apache.hadoop.fs.s3a.Constants.XA_HEADER_PREFIX;
+import static org.apache.hadoop.fs.s3a.Statistic.OBJECT_BULK_DELETE_REQUEST;
+import static org.apache.hadoop.fs.s3a.Statistic.OBJECT_DELETE_REQUEST;
+import static
org.apache.hadoop.fs.s3a.performance.OperationCost.CREATE_FILE_NO_OVERWRITE;
+import static
org.apache.hadoop.fs.s3a.performance.OperationCost.CREATE_FILE_OVERWRITE;
+import static
org.apache.hadoop.fs.s3a.performance.OperationCost.FILE_STATUS_DIR_PROBE;
+import static
org.apache.hadoop.fs.s3a.performance.OperationCost.FILE_STATUS_FILE_PROBE;
+import static
org.apache.hadoop.fs.s3a.performance.OperationCost.GET_FILE_STATUS_ON_DIR_MARKER;
+import static
org.apache.hadoop.fs.s3a.performance.OperationCost.GET_FILE_STATUS_ON_FILE;
+import static
org.apache.hadoop.fs.s3a.performance.OperationCost.HEAD_OPERATION;
+import static
org.apache.hadoop.fs.s3a.performance.OperationCost.NO_HEAD_OR_LIST;
+
+/**
+ * Assert cost of createFile operations, especially
+ * with the FS_S3A_CREATE_PERFORMANCE option.
+ */
+@SuppressWarnings("resource")
+public class ITestCreateFileCost extends AbstractS3ACostTest {
+
+ /**
+ * Create with markers kept, always.
+ */
+ public ITestCreateFileCost() {
+ super(false);
+ }
+
+ @Test
+ public void testCreateNoOverwrite() throws Throwable {
+ describe("Test file creation with outoverwrite");
Review Comment:
nit: "without overwrite"
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/CommitConstants.java:
##########
@@ -222,13 +226,18 @@ private CommitConstants() {
/**
* Number of threads in committers for parallel operations on files
* (upload, commit, abort, delete...): {@value}.
+ * Two thread pools this size are created, one for the outer
+ * task-level parallelism, and one for parallel execution
+ * within tasks (POSTs to commit individual uploads)
+ * If the value is negative, it is inverted and then multiplied
+ * by the number of cores in the CPU.
*/
public static final String FS_S3A_COMMITTER_THREADS =
"fs.s3a.committer.threads";
/**
* Default value for {@link #FS_S3A_COMMITTER_THREADS}: {@value}.
*/
- public static final int DEFAULT_COMMITTER_THREADS = 8;
+ public static final int DEFAULT_COMMITTER_THREADS = -4;
Review Comment:
Is it fine to create these big threadpools in machines with a high no.of
cores, or is it something we should keep constant for all machines and change
on-demand from the config itself?
Issue Time Tracking
-------------------
Worklog Id: (was: 778886)
Time Spent: 9h 20m (was: 9h 10m)
> Improve Magic Committer Performance
> -----------------------------------
>
> Key: HADOOP-17833
> URL: https://issues.apache.org/jira/browse/HADOOP-17833
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 9h 20m
> Remaining Estimate: 0h
>
> Magic committer tasks can be slow because every file created with
> overwrite=false triggers a HEAD (verify there's no file) and a LIST (that
> there's no dir). And because of delayed manifestations, it may not behave as
> expected.
> ParquetOutputFormat is one example of a library which does this.
> we could fix parquet to use overwrite=true, but (a) there may be surprises in
> other uses (b) it'd still leave the list and (c) do nothing for other formats
> call
> Proposed: createFile() under a magic path to skip all probes for file/dir at
> end of path
> Only a single task attempt Will be writing to that directory and it should
> know what it is doing. If there is conflicting file names and parts across
> tasks that won't even get picked up at this point. Oh and none of the
> committers ever check for this: you'll get the last file manifested (s3a) or
> renamed (file)
> If we skip the checks we will save 2 HTTP requests/file.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]