[jira] [Resolved] (HADOOP-14303) Review retry logic on all S3 SDK calls, implement where needed

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ https://issues.apache.org/jira/browse/HADOOP-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli resolved HADOOP-14303.
--
   Resolution: Duplicate
Fix Version/s: (was: 3.1.0)

> Review retry logic on all S3 SDK calls, implement where needed
> --
>
> Key: HADOOP-14303
> URL: https://issues.apache.org/jira/browse/HADOOP-14303
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>
> AWS S3, IAM, KMS, DDB etc. all throttle callers: the S3A code needs to handle 
> this without failing, since if it slows down its requests it can recover.
> 1. Look at all the places where we are calling S3 via the AWS SDK and make 
> sure we are retrying with some backoff & jitter policy, ideally something 
> unified. This must be more systematic than the case-by-case, 
> problem-by-problem strategy we are implicitly using.
> 2. Many of the AWS S3 SDK calls do implement retry (e.g. PUT/multipart PUT), 
> but we need to check the other parts of the process: login, initiate/complete 
> MPU, ...
> Related:
> HADOOP-13811 Failed to sanitize XML document destined for handler class
> HADOOP-13664 S3AInputStream to use a retry policy on read failures
> This stuff is all hard to test. A key need is to be able to differentiate 
> recoverable throttle & network failures from unrecoverable problems like: 
> auth, network config (e.g. bad endpoint), etc.
> This may be the opportunity to add a faulting subclass of the Amazon S3 
> client which can be configured in integration tests to fail at specific 
> points. Ryan Blue's mock S3 client does this in HADOOP-13786, but it is 100% 
> mock. I'm thinking of something with similar fault raising, but in front of 
> the real S3A client.
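
A minimal sketch of that faulting-subclass idea, assuming the AWS SDK v1 client classes: a client that fails the first few PUTs with a throttling error, then passes through to the real service. The class name, constructor, and fault-count parameter are all hypothetical, for illustration only; this is not the test code that was eventually written.

{code:java}
// Hypothetical fault-injecting client: fails the first N putObject() calls
// with a 503 "SlowDown" (what S3 throttling looks like), then delegates to
// the real client. Names here are illustrative, not actual S3A test code.
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.PutObjectResult;
import java.util.concurrent.atomic.AtomicInteger;

public class FaultingAmazonS3Client extends AmazonS3Client {
  private final AtomicInteger remainingFaults;

  public FaultingAmazonS3Client(AWSCredentials credentials, int faultCount) {
    super(credentials);
    this.remainingFaults = new AtomicInteger(faultCount);
  }

  @Override
  public PutObjectResult putObject(PutObjectRequest request) {
    // Inject a throttling failure for the first faultCount calls only.
    if (remainingFaults.getAndDecrement() > 0) {
      AmazonS3Exception e = new AmazonS3Exception("injected throttle");
      e.setStatusCode(503);
      e.setErrorCode("SlowDown");
      throw e;
    }
    return super.putObject(request);
  }
}
{code}

Because the failures surface as real AmazonS3Exceptions in front of a live bucket, the retry and error-translation paths under test are the same ones a genuinely throttled run would take.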






[jira] [Resolved] (HADOOP-14303) Review retry logic on all S3 SDK calls, implement where needed

2017-11-22 Thread Steve Loughran (JIRA)

 [ https://issues.apache.org/jira/browse/HADOOP-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-14303.
-
   Resolution: Fixed
Fix Version/s: 3.1.0

Fixed in HADOOP-13786, with:

* A Java 8 lambda API for invoking S3A operations with retry and error 
translation (a sketch of the style follows after this list)
* All methods calling the S3 client marked up with their (current) retry 
logic, to make clear what's happening and where you don't need to add retry 
code around retry code
* Metrics & stats to track retries
* Testing through fault injection
* What seems a good initial policy (S3ARetryPolicy). There is always scope for 
tuning there, especially "what to do about the 400 error code?" For now it is 
treated as retryable on all call types (idempotent and non-idempotent) in the 
hope it's transient; failing fast, or at least "failing medium", may be better 
though.
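
As a rough sketch of that lambda style combined with the backoff & jitter from the original description (the actual Invoker/S3ARetryPolicy code differs; the helper, names, and numbers below are invented for illustration):

{code:java}
// Illustrative retry helper: invoke an S3A operation, expressed as a lambda,
// with exponential backoff plus jitter. Real code would first translate SDK
// exceptions into IOExceptions and only retry the recoverable ones; this
// sketch retries every IOException up to maxAttempts.
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.Random;

public final class RetryDemo {

  @FunctionalInterface
  interface S3Operation<T> {
    T execute() throws IOException;
  }

  static <T> T retry(String action, int maxAttempts, long baseDelayMs,
      S3Operation<T> op) throws IOException {
    Random random = new Random();
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.execute();
      } catch (IOException e) {
        last = e;
        if (attempt == maxAttempts) {
          break;  // out of attempts; rethrow below
        }
        // Exponential backoff with jitter: base * 2^(attempt-1) + random.
        long sleep = (baseDelayMs << (attempt - 1))
            + random.nextInt((int) baseDelayMs);
        try {
          Thread.sleep(sleep);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw (IOException) new InterruptedIOException(action).initCause(ie);
        }
      }
    }
    throw last;
  }
}

// Usage (hypothetical call): the operation is any lambda throwing IOException,
// e.g. retry("PUT " + key, 5, 100, () -> writeHelper.putObject(putRequest));
{code}

The open policy question in the last bullet then reduces to a classification function: given the translated exception and whether the call is idempotent, decide between retrying, failing fast, and failing after a bounded number of attempts.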

> Review retry logic on all S3 SDK calls, implement where needed
> --
>
> Key: HADOOP-14303
> URL: https://issues.apache.org/jira/browse/HADOOP-14303
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Fix For: 3.1.0


