[ https://issues.apache.org/jira/browse/HADOOP-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930329#comment-17930329 ]
ASF GitHub Bot commented on HADOOP-19348:
-----------------------------------------
ahmarsuhail commented on code in PR #7334:
URL: https://github.com/apache/hadoop/pull/7334#discussion_r1969742806
##########
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/ITestS3AContractMultipartUploader.java:
##########
@@ -127,6 +128,12 @@ public void testMultipartUploadReverseOrderNonContiguousPartNumbers() throws Exc
@Override
public void testConcurrentUploads() throws Throwable {
assumeNotS3ExpressFileSystem(getFileSystem());
+    // Currently analytics accelerator does not support reading of files that have been overwritten.
+    // This is because the analytics accelerator library caches metadata and data, and when a file is
+    // overwritten, the old data continues to be used, until it is removed from the cache over
+    // time. This will be fixed in https://github.com/awslabs/analytics-accelerator-s3/issues/218.
+    skipIfAnalyticsAcceleratorEnabled(getContract().getConf(),
Review Comment:
this one has two concurrent uploads, upload1 and upload2, for the same path.
It completes upload1, then verifies its contents by reading the whole file,
and then completes upload2 and verifies again. But AAL still has upload1's
contents in the cache, so the verification for upload2 fails.
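The failure mode the reviewer describes can be illustrated with a minimal sketch. This is not the AAL implementation; it is a hypothetical read-through cache with no invalidation on overwrite, which is enough to reproduce the stale-read behaviour:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the staleness described above: a read-through
// cache keyed by object path keeps serving upload1's bytes after
// upload2 overwrites the same path, because nothing invalidates the
// cached entry on overwrite.
public class StaleCacheSketch {
    // simulated object store: path -> contents
    static final Map<String, String> store = new HashMap<>();
    // simulated AAL-style data cache (no invalidation on overwrite)
    static final Map<String, String> cache = new HashMap<>();

    static String read(String path) {
        // read-through: populate the cache on first access, then always hit it
        return cache.computeIfAbsent(path, store::get);
    }

    public static void main(String[] args) {
        store.put("/bucket/key", "upload1");   // complete upload1
        String first = read("/bucket/key");    // verify -> entry now cached
        store.put("/bucket/key", "upload2");   // complete upload2 (overwrite)
        String second = read("/bucket/key");   // still returns "upload1"
        System.out.println(first + " " + second); // prints: upload1 upload1
    }
}
```

Until the linked upstream issue is fixed, skipping the test when AAL is enabled (as the diff above does) is the pragmatic workaround.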
> S3A: Add initial support for analytics-accelerator-s3
> -----------------------------------------------------
>
> Key: HADOOP-19348
> URL: https://issues.apache.org/jira/browse/HADOOP-19348
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.4.2
> Reporter: Ahmar Suhail
> Priority: Major
> Labels: pull-request-available
>
> S3 recently released [Analytics Accelerator Library for Amazon
> S3|https://github.com/awslabs/analytics-accelerator-s3] as an Alpha release,
> which is an input stream, with an initial goal of improving performance for
> Apache Spark workloads on Parquet datasets.
> For example, it implements optimisations such as footer prefetching, and so
> avoids the multiple GETs S3AInputStream currently makes for the footer bytes
> and PageIndex structures.
> The library also tracks columns currently being read by a query using the
> parquet metadata, and then prefetches these bytes when parquet files with the
> same schema are opened.
> This ticket tracks the work required for the basic initial integration. There
> is still more work to be done, such as VectoredIO support, which we will
> identify and follow up on.
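To make the footer-prefetching motivation concrete: a Parquet file ends with its footer metadata, followed by a 4-byte little-endian footer length and the 4-byte magic "PAR1". A naive reader therefore issues one small GET for the trailing 8 bytes and a second GET for the footer itself; prefetching a larger tail range in a single request avoids the extra round trip. The sketch below only parses that trailing 8-byte record (it is an illustration of the file layout, not AAL code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustration of the Parquet tail layout that drives footer-read GETs:
// the last 8 bytes are a little-endian int footer length plus the
// magic bytes "PAR1".
public class FooterTailSketch {
    // Parse the footer length from the final 8 bytes of a Parquet file.
    static int footerLength(byte[] tail8) {
        ByteBuffer bb = ByteBuffer.wrap(tail8).order(ByteOrder.LITTLE_ENDIAN);
        int len = bb.getInt();                 // 4-byte footer length
        byte[] magic = new byte[4];
        bb.get(magic);                         // 4-byte magic
        if (!new String(magic).equals("PAR1")) {
            throw new IllegalArgumentException("not a Parquet tail");
        }
        return len;
    }

    public static void main(String[] args) {
        // fabricated tail: footer length 1234 followed by the magic
        byte[] tail = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
            .putInt(1234).put("PAR1".getBytes()).array();
        System.out.println(footerLength(tail)); // prints 1234
    }
}
```

Knowing the footer length up front lets a reader fetch the footer and PageIndex structures in one ranged request rather than several.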
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]