[ 
https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313391#comment-15313391
 ] 

Rajesh Balamohan commented on HADOOP-13203:
-------------------------------------------

Thanks for the comments, @cnauroth. In Hive, there can be lots of random seeks. 
On a backward seek, the stream has to be closed and reopened. As part of 
closing the stream, it has to decide whether the connection can be reused or 
must be aborted. If it aborts the connection, the connection becomes unusable, 
and subsequent calls have to go through the expensive process of 
re-establishing it. 

For example, assume it is reading a 1 MB file and the current position is at 
512 KB. If it has to seek back to the 128 KB mark, it ends up closing the 
stream. In the earlier logic, since the file's {{contentLength}} was set as the 
{{requestedStreamLen}}, the check was computed as {{(length - pos > 
CLOSE_THRESHOLD)}} (length would be requestedStreamLen in this case). With 
half of the file still remaining, this evaluated to true and the connection was 
aborted. So apparently any backward seek would end up aborting the connection. 
The time taken to establish the connection was far greater than the time needed 
to read the small amount of data being requested. 
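
The check described above can be illustrated with a minimal sketch. This is 
not the actual S3AInputStream code: the class name, the method 
{{shouldAbort}}, the threshold value of 4096 bytes, and the 2 KB bounded 
request are all assumptions made for illustration. It only shows how the 
length used for the request changes the outcome of 
{{(length - pos > CLOSE_THRESHOLD)}}.

```java
// Illustrative sketch only; names and values are hypothetical,
// not taken from the Hadoop source or from the attached patches.
public class CloseDecisionSketch {
    // Assumed threshold in bytes; the real constant's value is not stated in this thread.
    static final long CLOSE_THRESHOLD = 4096;

    /**
     * Returns true when more than `threshold` bytes of the current request
     * are still unread, i.e. the case where the connection gets aborted
     * instead of being reused.
     */
    static boolean shouldAbort(long length, long pos, long threshold) {
        long remaining = length - pos;
        return remaining > threshold;
    }

    public static void main(String[] args) {
        final long kb = 1024;
        long contentLength = 1024 * kb; // 1 MB file
        long pos = 512 * kb;            // current position before the backward seek

        // Earlier logic: the request length equals the whole file length,
        // so 512 KB appear to remain and the connection is aborted.
        System.out.println(shouldAbort(contentLength, pos, CLOSE_THRESHOLD));

        // Hypothetical bounded request that ends 2 KB past the current position:
        // only 2 KB remain, so the connection can be kept for reuse.
        long boundedLength = pos + 2 * kb;
        System.out.println(shouldAbort(boundedLength, pos, CLOSE_THRESHOLD));
    }
}
```

With these assumed values, the first call reports an abort and the second does 
not, which is the effect the issue title refers to as setting the correct 
length in the request.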

> S3a: Consider reducing the number of connection aborts by setting correct 
> length in s3 request
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13203
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13203
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13203-branch-2-001.patch, 
> HADOOP-13203-branch-2-002.patch
>
>
> Currently the file's "contentLength" is set as the "requestedStreamLen" when 
> invoking S3AInputStream::reopen().  As a part of lazySeek(), the stream 
> sometimes has to be closed and reopened. But many times the stream was closed 
> with abort(), making the internal HTTP connection unusable. This incurs 
> a lot of connection establishment cost in some jobs.  It would be good to set 
> the correct value for the stream length to avoid connection aborts. 
> I will post the patch once the AWS tests pass on my machine.



