[
https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327636#comment-15327636
]
Steve Loughran commented on HADOOP-13203:
-----------------------------------------
I'm thinking of something more sophisticated, which I'm willing to do; it's
just a phase III kind of problem (post Hadoop 2.8):
# we have the notion of a read block size, say 64KB. This block size should be
consistent with the block sizes used in the aws code/httpclient
# we always read aligned with the block size.
# for a simple seek()/read() at position P, the read would be from P to the
next block boundary > P.
# if the number of bytes to be read is known {{seek+read(bytes), read-fully,
read-positioned, positioned read-fully}}, we'd read up to the next block
boundary, or, if the full read would span a block, up to the next boundary
past the final read position.
# during a read, an EOF exception would trigger a new read (and/or the latest
block size is tracked and managed in the S3AInputStream).
# whenever a seek/positioned read in a new location is needed, the data up to
the end of the next block is read in.
# for forward seeks where the data is in the current block, skip the bytes
# for forward seeks where the data is in a later block, read to the end of the
current block, then read from the new location to the end of its block.
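The block-boundary arithmetic in the list above can be sketched roughly as
follows. This is plain Java; {{BLOCK_SIZE}} and the class/method names are my
own illustration, not anything that exists in S3AInputStream:

```java
// Sketch of the block-aligned read-range calculation described above.
// All names here are hypothetical; only the arithmetic matters.
public final class BlockAlignedRange {
  static final long BLOCK_SIZE = 64 * 1024; // the proposed 64KB read block

  /** End of the read for a simple seek()/read() at position p:
   *  the next block boundary strictly greater than p. */
  static long simpleReadEnd(long p) {
    return (p / BLOCK_SIZE + 1) * BLOCK_SIZE;
  }

  /** End of the read when the caller's length is known (read-fully etc.):
   *  the next block boundary at or past the final byte requested, which
   *  handles the case where the full read spans a block boundary. */
  static long knownLengthReadEnd(long p, long len) {
    long last = p + len; // first byte past the requested range
    long end = ((last + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    return Math.max(end, simpleReadEnd(p));
  }
}
```

So a 10000-byte positioned read starting at offset 60000 would fetch through
byte 131072, the boundary past its final byte, rather than stopping at 65536.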
It means that for a forward scan through the file, the number of blocks read
is {{file/blocksize}}.
For backward seeks of any kind, the amount of data read is the remainder of the
current block + the data read.
For forward seeks, if the data is in the current block, the amount of data is
{{readLocation-currentLocation}}.
For forward seeks, if the data is not in the block, the cost of a seek equals
that of a backward seek.
So: short-hop forward seeks and sequential reading are not very expensive;
backwards and long-distance forward seeks have a predictable cost, one which
is the same irrespective of the destination of the seek.
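That cost model can be made concrete with a small worked sketch (hypothetical
helper, not real S3A code; assumes the 64KB block size from above):

```java
// Illustrative byte-transfer cost of a seek under the block-aligned scheme
// described above. Class and method names are mine, for illustration only.
public final class SeekCost {
  static final long BLOCK = 64 * 1024; // assumed 64KB read block

  /** Bytes transferred for a seek from pos to target. */
  static long seekCost(long pos, long target) {
    long blockEnd = (pos / BLOCK + 1) * BLOCK;
    if (target >= pos && target < blockEnd) {
      // short forward hop: just skip the in-block bytes
      return target - pos;
    }
    // backward or long-distance forward seek: drain the remainder of the
    // current block, then read the destination block to its boundary
    long drain = blockEnd - pos;
    long newBlockEnd = (target / BLOCK + 1) * BLOCK;
    return drain + (newBlockEnd - target);
  }
}
```

For example, hopping forward from 1000 to 2000 inside one block costs 1000
bytes, while seeking back from 70000 to 1000 costs the 61072-byte remainder
of the current block plus a 64536-byte read of the destination block.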
> S3a: Consider reducing the number of connection aborts by setting correct
> length in s3 request
> ----------------------------------------------------------------------------------------------
>
> Key: HADOOP-13203
> URL: https://issues.apache.org/jira/browse/HADOOP-13203
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Minor
> Attachments: HADOOP-13203-branch-2-001.patch,
> HADOOP-13203-branch-2-002.patch, HADOOP-13203-branch-2-003.patch
>
>
> Currently the file's "contentLength" is set as the "requestedStreamLen" when
> invoking S3AInputStream::reopen(). As part of lazySeek(), the stream
> sometimes has to be closed and reopened, but often it was closed with
> abort(), leaving the internal http connection unusable. This incurs
> significant connection-establishment cost in some jobs. It would be good to
> set the correct value for the stream length to avoid connection aborts.
> I will post the patch once the aws tests pass on my machine.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)