[
https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated HADOOP-13203:
--------------------------------------
Attachment: HADOOP-13203-branch-2-001.patch
Yes [~steve_l]. In workloads like Hive there are lots of random seeks, and the
internal connection frequently had to be aborted; with this patch it was a lot
cheaper to reuse the connection. The amount of data to request can be
determined by "Math.max(targetPos + readahead, (targetPos + length))".
From the unit-test perspective for AWS, the following issues were seen.
Test timeout failures:
- org.apache.hadoop.fs.s3a.scale.TestS3ADeleteManyFiles.testBulkRenameAndDelete
- org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp.largeFilesToRemote,
largeFilesFromRemote
Other failures:
- org.apache.hadoop.fs.contract.s3a.TestS3AContractRootDir (root directory
operation rejected); this is already tracked in another JIRA.
- org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance.testReadAheadDefault/testReadBigBlocksBigReadahead:
earlier these expected a single open, but now there can be multiple, because
requestedStreamLen would no longer be the file's length. At best we save a
single readahead call; for the rest, the stream has to be opened multiple
times. This is still acceptable compared with the connection
re-establishments in real workloads, where a completely random set of ranges
can be requested (e.g., Hive); see the sketch after this list. I have not
updated the patch to fix this failure; based on feedback, I can revise it.
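For reference, a hedged sketch of the drain-versus-abort trade-off: if only a
few bytes remain in the current ranged request, draining them lets the pooled
HTTP connection be reused, while abort() is cheaper now but forces a fresh
connection on the next reopen. The helper name, threshold parameter, and
buffer size are mine, not the patch's.
{code:java}
import java.io.IOException;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

// Sketch only: decide whether to drain-and-close or abort the stream.
static void closeOrAbort(S3ObjectInputStream in, long remaining,
                         long drainThreshold) throws IOException {
  if (remaining <= drainThreshold) {
    byte[] buf = new byte[8192];
    while (in.read(buf) >= 0) {
      // discard the drained bytes
    }
    in.close();   // connection can go back to the pool
  } else {
    in.abort();   // give up the connection; next read pays reconnection cost
  }
}
{code}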
> S3a: Consider reducing the number of connection aborts by setting correct
> length in s3 request
> ----------------------------------------------------------------------------------------------
>
> Key: HADOOP-13203
> URL: https://issues.apache.org/jira/browse/HADOOP-13203
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Rajesh Balamohan
> Priority: Minor
> Attachments: HADOOP-13203-branch-2-001.patch
>
>
> Currently the file's "contentLength" is set as the "requestedStreamLen" when
> invoking S3AInputStream::reopen(). As part of lazySeek(), the stream
> sometimes has to be closed and reopened, and quite often it is closed with
> abort(), which leaves the internal HTTP connection unusable. This incurs
> significant connection-establishment cost in some jobs. It would be good to
> set the correct value for the stream length to avoid connection aborts.
> I will post the patch once the AWS tests pass on my machine.
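To illustrate the fix the quoted description proposes, a hypothetical
reopen() could bound the ranged GET by the computed stream length instead of
the file's contentLength. The method shape and names are assumptions, not the
committed patch; note that GetObjectRequest.withRange takes an inclusive end
offset.
{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

// Sketch only: request [targetPos, requestedStreamLen) rather than the
// remainder of the whole object, so close() can drain cheaply afterwards.
static S3Object reopen(AmazonS3 client, String bucket, String key,
                       long targetPos, long requestedStreamLen) {
  GetObjectRequest request = new GetObjectRequest(bucket, key)
      .withRange(targetPos, requestedStreamLen - 1);  // end is inclusive
  return client.getObject(request);
}
{code}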