[ 
https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327636#comment-15327636
 ] 

Steve Loughran commented on HADOOP-13203:
-----------------------------------------

I'm thinking of something more sophisticated, which I'm willing to do; it's 
just a phase III kind of problem (post Hadoop 2.8).

# we have the notion of a read block size, say 64KB. This block size should be 
consistent with the block sizes used in the AWS SDK/httpclient code
# we always read aligned with the block size.
# for a simple seek()/read() at position P, the read would be from P to the 
next block boundary > P.
# if the number of bytes to be read is known {{seek+read(bytes), read-fully, 
read-positioned, positioned read-fully}}, we'd read up to the next block 
boundary, or, if the full read would span a block, up to the next block 
boundary past the final read position.
# during a read, an EOF exception would trigger a new read (and/or the latest 
block size is tracked and managed in the S3AInputStream)
# whenever a seek/positioned read in a new location is needed, the data up to 
the end of the next block is read in.
# for forward seeks, where the data is in the current block, skip the bytes
# for forward seeks where the data is in a later block, read to the end of the 
current block, then read from the seek target to the end of its block.
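The block-boundary arithmetic behind the steps above can be sketched roughly as follows. This is a hypothetical helper, not the actual S3AInputStream code; the class and method names are invented for illustration:

```java
// Sketch of the block-aligned range calculation described above.
// All names here are hypothetical; this is not the actual S3AInputStream code.
public final class BlockAlignedRange {

  /**
   * Given a read starting at position pos for len bytes (len < 0 when the
   * length is unknown), return the exclusive end offset to request from S3:
   * the next block boundary at or after the last byte needed, capped at the
   * file length.
   */
  static long readEnd(long pos, long len, long blockSize, long fileLength) {
    // last byte the caller actually needs; a bare seek()/read() needs one byte
    long lastNeeded = (len > 0) ? pos + len : pos + 1;
    // round up to the next block boundary
    long end = ((lastNeeded + blockSize - 1) / blockSize) * blockSize;
    return Math.min(end, fileLength);
  }

  public static void main(String[] args) {
    long block = 64 * 1024;
    // simple seek()/read() at P=100: read to the end of the first block
    System.out.println(readEnd(100, -1, block, 1_000_000));        // 65536
    // read-fully spanning a boundary: read to the block past the final position
    System.out.println(readEnd(60_000, 10_000, block, 1_000_000)); // 131072
  }
}
```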

It means that for a forward scan through the file, the number of blocks read in 
a file is {{filelength/blocksize}}.
For backward seeks of any kind, the amount of data read is the remainder of the 
current block + the data read. 
For forward seeks, if the data is in the current block, the amount of data read 
is {{readLocation-currentLocation}}.
For forward seeks, if the data is not in the block, the cost of a seek equals 
that of a backward seek.
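To make those costs concrete, here is a toy calculation with made-up positions and the 64KB block size proposed above (these are illustrative numbers, not measurements):

```java
// Toy arithmetic for the seek costs described above; all positions are
// made-up examples, and the 64KB block size matches the one proposed earlier.
public final class SeekCostExample {
  public static void main(String[] args) {
    long block = 64 * 1024;  // 65536

    // Forward seek within the current block, e.g. 10_000 -> 30_000:
    // cost is readLocation - currentLocation bytes skipped.
    long inBlockCost = 30_000 - 10_000;
    System.out.println(inBlockCost);       // 20000

    // Backward seek from 70_000 (in block 1) to 5_000 (in block 0):
    // drain the rest of block 1, then read from 5_000 to the end of block 0.
    long drain = 2 * block - 70_000;       // 61072
    long newRead = block - 5_000;          // 60536
    System.out.println(drain + newRead);   // 121608
  }
}
```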

So: short-hop forward seeks and sequential reading are not very expensive; 
backwards and long-distance forward seeks have a predictable cost, one which is 
the same irrespective of the destination of the seek.

> S3a: Consider reducing the number of connection aborts by setting correct 
> length in s3 request
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13203
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13203
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13203-branch-2-001.patch, 
> HADOOP-13203-branch-2-002.patch, HADOOP-13203-branch-2-003.patch
>
>
> Currently the file's "contentLength" is set as the "requestedStreamLen" when 
> invoking S3AInputStream::reopen().  As part of lazySeek(), the stream 
> sometimes has to be closed and reopened, but frequently it was closed with 
> abort(), leaving the underlying HTTP connection unusable. This incurs a lot 
> of connection-establishment cost in some jobs.  It would be good to set the 
> correct value for the stream length to avoid connection aborts. 
> I will post the patch once the AWS tests pass on my machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
