[jira] [Updated] (HADOOP-19348) Add support for analytics-accelerator-s3

Ahmar Suhail (Jira) Tue, 26 Nov 2024 07:36:40 -0800


     [ 
https://issues.apache.org/jira/browse/HADOOP-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ahmar Suhail updated HADOOP-19348:
----------------------------------
    Description: 
S3 recently released the [Analytics Accelerator Library for Amazon 
S3|https://github.com/awslabs/analytics-accelerator-s3] as an alpha release. 
The library provides an input stream, with the initial goal of improving 
performance for Apache Spark workloads on Parquet datasets.

For example, it implements optimisations such as footer prefetching, and so 
avoids the multiple GET requests that S3AInputStream currently makes for the 
footer bytes and PageIndex structures.

The library also uses the Parquet metadata to track the columns a query is 
currently reading, and then prefetches those columns' bytes when Parquet files 
with the same schema are opened.

This ticket tracks the work required for the basic initial integration. More 
work remains, such as VectoredIO support, which we will identify and follow 
up on.

  was:
S3 recently released [https://github.com/awslabs/analytics-accelerator-s3 
|https://github.com/awslabs/analytics-accelerator-s3,] as an Alpha release, 
which is an input stream, with an initial goal of improving performance for 
Apache Spark workloads on Parquet datasets. 

For example, it implements optimisations such as footer prefetching, and so 
avoids the multiple GETS S3AInputStream currently makes for the footer bytes 
and PageIndex structures.

The library also tracks columns currently being read by a query using the 
parquet metadata, and then prefetches these bytes when parquet files with the 
same schema are opened. 

This ticket tracks the work required for the basic initial integration. There 
is still more work to be done, such as VectoredIO support etc, which we will 
identify and follow up with. 


> Add support for analytics-accelerator-s3
> ----------------------------------------
>
>                 Key: HADOOP-19348
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19348
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Ahmar Suhail
>            Priority: Major
>
> S3 recently released the [Analytics Accelerator Library for Amazon 
> S3|https://github.com/awslabs/analytics-accelerator-s3] as an alpha release. 
> The library provides an input stream, with the initial goal of improving 
> performance for Apache Spark workloads on Parquet datasets.
> For example, it implements optimisations such as footer prefetching, and so 
> avoids the multiple GET requests that S3AInputStream currently makes for the 
> footer bytes and PageIndex structures.
> The library also uses the Parquet metadata to track the columns a query is 
> currently reading, and then prefetches those columns' bytes when Parquet files 
> with the same schema are opened.
> This ticket tracks the work required for the basic initial integration. More 
> work remains, such as VectoredIO support, which we will identify and follow 
> up on.
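
To illustrate why footer prefetching saves requests, here is a minimal sketch 
in Python. It is not the library's implementation. It relies only on the 
Parquet file layout (footer metadata, then a 4-byte little-endian footer 
length, then the magic bytes PAR1). The function names and the 256-byte 
prefetch size are hypothetical and chosen purely for the example.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_ranges_naive(data: bytes):
    """Byte ranges a simple reader requests: first the tail, then the footer."""
    size = len(data)
    tail = data[size - TAIL_SIZE:]
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    footer_start = size - TAIL_SIZE - footer_len
    # Two separate ranged reads, i.e. two GET requests against an object store.
    return [(size - TAIL_SIZE, size), (footer_start, size - TAIL_SIZE)]


def footer_range_prefetch(size: int, prefetch_bytes: int):
    """One range covering the last prefetch_bytes of the object (hypothetical)."""
    return (max(0, size - prefetch_bytes), size)


def covered(prefetched, ranges):
    """True if every requested range lies inside the prefetched range."""
    lo, hi = prefetched
    return all(lo <= start and end <= hi for start, end in ranges)


# Fake object: magic, 1000 data bytes, 100-byte footer, footer length, magic.
fake_footer = b"m" * 100
fake_object = (
    PARQUET_MAGIC
    + b"d" * 1000
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + PARQUET_MAGIC
)

naive = footer_ranges_naive(fake_object)
prefetched = footer_range_prefetch(len(fake_object), prefetch_bytes=256)
print(naive, prefetched, covered(prefetched, naive))
```

In this sketch a single ranged read of the object's tail covers both reads the 
naive path issues, provided the footer fits within the prefetched size.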



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
