[jira] [Updated] (HADOOP-17400) Optimize S3A for maximum performance in directory listings

Steve Loughran (Jira) Wed, 02 Dec 2020 02:46:05 -0800


     [ 
https://issues.apache.org/jira/browse/HADOOP-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Loughran updated HADOOP-17400:
------------------------------------
    Description: 
Make listing in applications as fast as we can get it, especially for query 
planning.

* All operations used in listing directories for query planning etc. to be 
optimized for their primary use: being passed directories (not files). Make 
that case faster, even at the expense of more remote IO when handed files or 
empty directories.
* Remove needless calls to S3 wherever possible (e.g. getFileStatus("/"), 
making bucket existence probes optional).
* Support/enable asynchronous IO where possible.
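
As a rough illustration of the directory-first trade-off in the first bullet, 
the sketch below lists a path with a LIST-style call first and only falls back 
to a HEAD-style probe when the listing is empty. It is a toy in-memory model 
under stated assumptions, not the S3A code: CountingStore, head, list and 
listStatus are hypothetical names, and a missing path returns an empty list 
where a real filesystem would raise an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class DirectoryFirstListing {

    /** Toy in-memory object store that counts the requests made against it. */
    static class CountingStore {
        private final TreeSet<String> keys = new TreeSet<>();
        int headRequests = 0;
        int listRequests = 0;

        void put(String key) {
            keys.add(key);
        }

        /** HEAD-style probe: does an object with exactly this key exist? */
        boolean head(String key) {
            headRequests++;
            return keys.contains(key);
        }

        /** LIST-style call: every key that starts with the prefix. */
        List<String> list(String prefix) {
            listRequests++;
            List<String> matches = new ArrayList<>();
            // Keys sharing a prefix are contiguous in sorted order.
            for (String key : keys.tailSet(prefix)) {
                if (!key.startsWith(prefix)) {
                    break;
                }
                matches.add(key);
            }
            return matches;
        }

        int totalRequests() {
            return headRequests + listRequests;
        }
    }

    /** Assume the path is a directory: LIST first, HEAD only if the LIST is empty. */
    static List<String> listStatus(CountingStore store, String path) {
        List<String> children = store.list(path + "/");
        if (!children.isEmpty()) {
            return children;              // directory with entries: 1 request
        }
        if (store.head(path)) {
            return List.of(path);         // file: 2 requests
        }
        return List.of();                 // nothing found: 2 requests
    }

    public static void main(String[] args) {
        CountingStore store = new CountingStore();
        store.put("table/part-0000");
        store.put("table/part-0001");
        store.put("file.csv");

        listStatus(store, "table");
        System.out.println("requests after listing a directory: " + store.totalRequests());
        listStatus(store, "file.csv");
        System.out.println("requests after listing a file: " + store.totalRequests());
    }
}
```

In this model a populated directory costs one request, while a file or an 
empty result costs two, which is the trade the bullet describes.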
 

Review higher-level APIs (glob status) and their uses in the FsShell, and 
optimize them by minimising the FS API calls they make, with the bonus goal 
of reducing the risk of 404 caching.

Work with downstream projects to move to the FS APIs which work best in this 
world, primarily the recursive listing operations and those which return 
RemoteIterator<FileStatus>, so that any asynchronous page fetching 
operations are useful.
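
To show why iterator-returning APIs make asynchronous page fetching useful, 
here is a minimal sketch of an iterator that requests the next page in the 
background while the caller consumes the current one. It uses only the Java 
standard library and a plain java.util.Iterator as a stand-in; it is not the 
S3A implementation, and PrefetchDemo, PrefetchingPageIterator and fetch are 
hypothetical names. An empty page is assumed to mark the end of the listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

class PrefetchDemo {

    /** Iterates over paged results, fetching page N+1 while page N is consumed. */
    static class PrefetchingPageIterator<T> implements Iterator<T> {
        private final IntFunction<List<T>> fetchPage; // hypothetical page source
        private Iterator<T> current = Collections.emptyIterator();
        private CompletableFuture<List<T>> pending;
        private int nextPageIndex = 0;
        private boolean exhausted = false;

        PrefetchingPageIterator(IntFunction<List<T>> fetchPage) {
            this.fetchPage = fetchPage;
            this.pending = requestPage();   // first page starts loading at once
        }

        private CompletableFuture<List<T>> requestPage() {
            final int index = nextPageIndex++;
            return CompletableFuture.supplyAsync(() -> fetchPage.apply(index));
        }

        @Override
        public boolean hasNext() {
            while (!current.hasNext() && !exhausted) {
                // Blocks only if the background fetch has not finished yet.
                List<T> page = pending.join();
                if (page.isEmpty()) {
                    exhausted = true;
                } else {
                    current = page.iterator();
                    pending = requestPage(); // overlap next fetch with processing
                }
            }
            return current.hasNext();
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return current.next();
        }
    }

    /** Hypothetical page source: three pages of two entries, then an empty page. */
    static List<String> fetch(int page) {
        if (page >= 3) {
            return List.of();
        }
        return List.of("file-" + (2 * page), "file-" + (2 * page + 1));
    }

    static List<String> listAll() {
        Iterator<String> it = new PrefetchingPageIterator<>(PrefetchDemo::fetch);
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(listAll());
    }
}
```

A caller that asks for a complete array of results cannot benefit from this 
overlap, because it has to wait for every page before it sees the first entry.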

  was:
Make listing in applications as fast as we can get it especially for query 
planning.

* All operations used in listing directories for query planning etc to be 
optimized for their primary use: being passed directories (not files) and so 
make that faster even at the expense of  more remote IO when handed files or 
empty directories.
* remove needless calls to S3 wherever possible (e.g. getFileStatus(/), making 
bucket existence probes optional)
* Support/enable Asynchronous IO where possible.
 

Review higher level APIs (glob status) and uses on the FsShell and optimize 
their use by minimising invocations or FS API calls, with bonus goal of 
reduce/minimize risk of 404 caching.

Work with downstream projects to move to FS APIs which work best in this world 
-primarily the recursive listing operations and those which return 
RemoteIterator<FileStatus> -and so make any asynchronous page fetching 
operations useful. 


> Optimize S3A for maximum performance in directory listings
> ----------------------------------------------------------
>
>                 Key: HADOOP-17400
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17400
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.3.0
>            Reporter: Steve Loughran
>            Assignee: Mukund Thakur
>            Priority: Major
>
> Make listing in applications as fast as we can get it, especially for query 
> planning.
> * All operations used in listing directories for query planning etc. to be 
> optimized for their primary use: being passed directories (not files). Make 
> that case faster, even at the expense of more remote IO when handed files or 
> empty directories.
> * Remove needless calls to S3 wherever possible (e.g. getFileStatus("/"), 
> making bucket existence probes optional).
> * Support/enable asynchronous IO where possible.
>  
> Review higher-level APIs (glob status) and their uses in the FsShell, and 
> optimize them by minimising the FS API calls they make, with the bonus goal 
> of reducing the risk of 404 caching.
> Work with downstream projects to move to the FS APIs which work best in this 
> world, primarily the recursive listing operations and those which return 
> RemoteIterator<FileStatus>, so that any asynchronous page fetching 
> operations are useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

