ahmarsuhail commented on code in PR #7214:
URL: https://github.com/apache/hadoop/pull/7214#discussion_r1952602438


##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -1877,100 +1868,41 @@ private FSDataInputStream executeOpen(
     fileInformation.applyOptions(readContext);
     LOG.debug("Opening '{}'", readContext);
 
-    if (this.prefetchEnabled) {
-      Configuration configuration = getConf();
-      initLocalDirAllocatorIfNotInitialized(configuration);
-      return new FSDataInputStream(
-          new S3APrefetchingInputStream(
-              readContext.build(),
-              createObjectAttributes(path, fileStatus),
-              createInputStreamCallbacks(auditSpan),
-              inputStreamStats,
-              configuration,
-              directoryAllocator));
-    } else {
-      return new FSDataInputStream(
-          new S3AInputStream(
-              readContext.build(),
-              createObjectAttributes(path, fileStatus),
-              createInputStreamCallbacks(auditSpan),
-                  inputStreamStats,
-                  new SemaphoredDelegatingExecutor(
-                          boundedThreadPool,
-                          vectoredActiveRangeReads,
-                          true,
-                          inputStreamStats)));
-    }
-  }
-
-  /**
-   * Override point: create the callbacks for S3AInputStream.
-   * @return an implementation of the InputStreamCallbacks,
-   */
-  private S3AInputStream.InputStreamCallbacks createInputStreamCallbacks(
+    // what does the stream need
+    final StreamFactoryRequirements requirements =
+        getStore().factoryRequirements();
+
+    // calculate the permit count.
+    final int permitCount = requirements.streamThreads()
+        + requirements.vectoredIOContext().getVectoredActiveRangeReads();
+    // create an executor which is a subset of the
+    // bounded thread pool.
+    final SemaphoredDelegatingExecutor pool = new SemaphoredDelegatingExecutor(
+        boundedThreadPool,
+        permitCount,
+        true,
+        inputStreamStats);
+
+    // do not validate() the parameters as the store
+    // completes this.
+    ObjectReadParameters parameters = new ObjectReadParameters()

Review Comment:
   @steveloughran just realised: in our internal integration, we used to call `s3SeekableInputStreamFactory.createStream()` before the `extractOrFetchSimpleFileStatus()` call in this `executeOpen()` method.
   
   AAL has a metadata cache, so this ensured we didn't make repeated HEAD requests for the same key. That matters (though I'm not sure what the perf impact is) because Spark opens the same file multiple times in a task, once to read the footer and then again to read the column data, as sketched below. So by default S3A currently does at least two HEADs per file.
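   
   For illustration only (a simplified pseudo-flow, not actual Spark source; `fs` and `parquetFile` are assumed locals), the double-open pattern is roughly:
   
   ```java
   // Simplified illustration, not actual Spark source: each open() below
   // currently triggers its own HEAD request in S3A.
   try (FSDataInputStream footer = fs.open(parquetFile)) {
     // seek towards the tail and read the Parquet footer / metadata
   }
   try (FSDataInputStream columns = fs.open(parquetFile)) {
     // read the projected column chunks
   }
   ```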
   
   Now that stream initialisation happens after `extractOrFetchSimpleFileStatus()`, S3A issues the HEAD even though it isn't required, as the metadata is already in the AAL cache.
   
   We should discuss what we can do here (maybe wire S3A up to AAL's metadata cache regardless of which stream it's using?) and do it as a follow-up.
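   
   A very rough sketch of what that wiring could look like in `executeOpen()` (hypothetical only: `getStreamFactory()` and `cachedFileStatus()` are made-up names here, not existing S3A or AAL APIs):
   
   ```java
   // Hypothetical sketch, not a real API: consult the stream factory's
   // metadata cache before falling back to the HEAD request inside
   // extractOrFetchSimpleFileStatus(), so AAL's cached result is reused
   // across the footer open and the column-data open.
   S3AFileStatus fileStatus = getStore().getStreamFactory()
       .cachedFileStatus(path)       // hypothetical Optional<S3AFileStatus>
       .orElse(null);
   if (fileStatus == null) {
     fileStatus = extractOrFetchSimpleFileStatus(path, fileInformation);
   }
   ```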


