[jira] [Updated] (HADOOP-12444) Consider implementing lazy seek in S3AInputStream

Rajesh Balamohan (JIRA) Tue, 05 Apr 2016 07:17:52 -0700

     [ 
https://issues.apache.org/jira/browse/HADOOP-12444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rajesh Balamohan updated HADOOP-12444:
--------------------------------------
    Attachment: HADOOP-12444-006.patch

Changes in 006:
1. Modified the patch after applying HADOOP-12994. Removed 
validateTargetPosition; the patch now relies on validatePositionedReadArgs 
from FSInputStream.
2. Changed "nextReadPos > contentLength-1" to "nextReadPos >= contentLength" 
for readability.
3. Removed unneeded checks in seekInStream and reopen.
4. Simplified readFully. It now uses FSInputStream's readFully and updates 
stats. This should not cause issues now that lazySeek is implemented. Added a 
comment about this in the code.

Ran the S3A-related tests on a local system, together with the additional 
tests from HADOOP-12994. Please let me know if the results should be uploaded.

> Consider implementing lazy seek in S3AInputStream
> -------------------------------------------------
>
>                 Key: HADOOP-12444
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12444
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.7.1
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: HADOOP-12444-004.patch, HADOOP-12444-005.patch, 
> HADOOP-12444-006.patch, HADOOP-12444.1.patch, HADOOP-12444.2.patch, 
> HADOOP-12444.3.patch, HADOOP-12444.WIP.patch, hadoop-aws-test-reports.tar.gz
>
>
> - Currently, "read(long position, byte[] buffer, int offset, int length)" is 
> not implemented in S3AInputStream (unlike DFSInputStream). So, 
> "readFully(long position, byte[] buffer, int offset, int length)" in 
> S3AInputStream goes through the default implementation of seek(), read(), 
> seek() in FSInputStream. 
> - However, seek() in S3AInputStream re-opens the connection to S3 
> every time 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L115).
>   
> - It would be good to consider a lazy seek implementation to reduce 
> connection overhead to S3 (e.g., Presto implements lazy seek): 
> https://github.com/facebook/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java#L623)
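
The lazy seek idea described above can be sketched as follows. This is a minimal illustration only, not the code from the attached patches: the class LazySeekSketch, its fields, and getReopenCount() are hypothetical names, and an in-memory byte array stands in for the S3 object so that each "reopen" is just the creation of a new ByteArrayInputStream. The names nextReadPos and lazySeek are borrowed from the change notes, but their use here is an assumption about the approach, not a copy of it.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch of lazy seek; not the S3AInputStream implementation. */
final class LazySeekSketch extends InputStream {
    private final byte[] object;   // stands in for the remote object
    private InputStream wrapped;   // stands in for the open connection
    private long pos;              // position of the wrapped stream
    private long nextReadPos;      // position requested by the last seek()
    private int reopenCount;       // number of simulated connection opens

    LazySeekSketch(byte[] object) {
        this.object = object;
    }

    /** Only records the target; no connection work happens here. */
    public void seek(long target) throws IOException {
        if (target < 0) {
            throw new EOFException("Cannot seek to negative offset " + target);
        }
        nextReadPos = target;
    }

    /** Reopens only if the wrapped stream is not already at nextReadPos. */
    private void lazySeek() throws IOException {
        if (wrapped != null && nextReadPos == pos) {
            return;
        }
        if (wrapped != null) {
            wrapped.close();
        }
        int start = (int) nextReadPos; // caller guarantees nextReadPos < length
        wrapped = new ByteArrayInputStream(object, start, object.length - start);
        pos = nextReadPos;
        reopenCount++;
    }

    @Override
    public int read() throws IOException {
        if (nextReadPos >= object.length) {
            return -1; // end of object
        }
        lazySeek();
        int b = wrapped.read();
        if (b >= 0) {
            pos++;
            nextReadPos++;
        }
        return b;
    }

    /** Positioned read using the seek(), read(), seek() pattern. */
    public void readFully(long position, byte[] buffer, int offset, int length)
            throws IOException {
        long oldPos = nextReadPos;
        try {
            seek(position);
            for (int done = 0; done < length; done++) {
                int b = read();
                if (b < 0) {
                    throw new EOFException("End of object reached before reading fully");
                }
                buffer[offset + done] = (byte) b;
            }
        } finally {
            seek(oldPos); // restoring the position costs nothing until the next read
        }
    }

    int getReopenCount() {
        return reopenCount;
    }
}
```

In this sketch, two back-to-back calls such as readFully(0, buf, 0, 2) and readFully(2, buf, 0, 2) open the simulated connection once, because the second call starts exactly where the wrapped stream already is. With an eager seek() that reopens on every call, the same sequence would reopen on each of the four seeks.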



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
