[
https://issues.apache.org/jira/browse/HADOOP-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076805#comment-16076805
]
Steve Loughran commented on HADOOP-13761:
-----------------------------------------
We need to implement retry logic in all AWS calls which bypass the transfer
manager, so that transient failures (503/throttle, connection timeout) get
retried. The core code is in the HADOOP-13786 branch; it just needs rollout to
the existing methods, plus policies to deal with S3Guard failures: when to fail,
when to retry. And, for DDB, when to fall back to the blobstore, which is a
different recovery strategy from the rest.
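
As a rough illustration of the fail / retry / fall back split described above,
here is a minimal Java sketch. It is not the code in the HADOOP-13786 branch:
RetrySketch, Action, ServiceException, classify and invoke are all hypothetical
names, and the rule that a non-transient metadata store failure falls back to
the blobstore is an assumption made only for the example.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/** Hypothetical sketch; none of these names come from the Hadoop code base. */
final class RetrySketch {

  /** What to do after a failed call. */
  enum Action { RETRY, FAIL, FALLBACK }

  /** Hypothetical exception carrying the HTTP status of a failed service call. */
  static final class ServiceException extends IOException {
    final int status;

    ServiceException(int status, String message) {
      super(message);
      this.status = status;
    }
  }

  /** Classify a failure; fromMetadataStore marks calls made against DDB rather than S3. */
  static Action classify(Exception e, boolean fromMetadataStore) {
    boolean transientFailure = e instanceof SocketTimeoutException
        || (e instanceof ServiceException && ((ServiceException) e).status == 503);
    if (transientFailure) {
      return Action.RETRY;
    }
    // Assumption: a non-transient metadata store failure falls back to the blobstore.
    return fromMetadataStore ? Action.FALLBACK : Action.FAIL;
  }

  /** Invoke the operation, retrying transient failures for at most maxAttempts attempts. */
  static <T> T invoke(Callable<T> operation, Callable<T> fallback,
      boolean fromMetadataStore, int maxAttempts, long sleepMillis) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return operation.call();
      } catch (Exception e) {
        Action action = classify(e, fromMetadataStore);
        if (action == Action.RETRY && attempt >= maxAttempts) {
          // Out of attempts: stop retrying and apply the terminal policy.
          action = fromMetadataStore ? Action.FALLBACK : Action.FAIL;
        }
        switch (action) {
          case RETRY:
            Thread.sleep(sleepMillis);
            break;
          case FALLBACK:
            return fallback.call();
          default:
            throw e;
        }
      }
    }
  }
}
```

A caller would wrap each direct AWS call in invoke(), passing a blobstore
lookup as the fallback only for metadata store operations.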
> S3Guard: implement retries
> ---------------------------
>
> Key: HADOOP-13761
> URL: https://issues.apache.org/jira/browse/HADOOP-13761
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Aaron Fabbri
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add
> retry logic.
> In HADOOP-13651, I added TODO comments in most of the places retry loops are
> needed, including:
> - open(path). If MetadataStore reflects recent create/move of file path, but
> we fail to read it from S3, retry.
> - delete(path). If deleteObject() on S3 fails, but MetadataStore shows the
> file exists, retry.
> - rename(src,dest). If source path is not visible in S3 yet, retry.
> - listFiles(). Skip for now. Not currently implemented in S3Guard. I will
> create a separate JIRA for this as it will likely require interface changes
> (i.e. prefix or subtree scan).
> We may miss some cases initially and we should do failure injection testing
> to make sure we're covered. Failure injection tests can be a separate JIRA
> to make this easier to review.
> We also need basic configuration parameters around retry policy. There
> should be a way to specify maximum retry duration, as some applications would
> prefer to receive an error eventually rather than wait indefinitely. We should
> also keep statistics on how often inconsistency is detected and we enter a
> retry loop.
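
For illustration only, the bounded retry duration and the statistics asked for
in the description could look roughly like the Java sketch below. BoundedRetry,
awaitConsistent and the two counters are hypothetical names, not existing
S3Guard configuration options or metrics; the check passed in stands for
something like "is the rename source visible in S3 yet".

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a retry loop bounded by a maximum duration. */
final class BoundedRetry {
  /** Number of times an inconsistency was detected and a retry loop entered. */
  final AtomicLong inconsistenciesDetected = new AtomicLong();
  /** Total number of retried checks. */
  final AtomicLong retries = new AtomicLong();

  private final long maxDurationMillis;
  private final long intervalMillis;

  BoundedRetry(long maxDurationMillis, long intervalMillis) {
    this.maxDurationMillis = maxDurationMillis;
    this.intervalMillis = intervalMillis;
  }

  /**
   * Re-run the check until it passes or the maximum duration has elapsed.
   * Returns false on timeout so the caller can raise an error instead of
   * waiting indefinitely.
   */
  boolean awaitConsistent(BooleanSupplier check) throws InterruptedException {
    if (check.getAsBoolean()) {
      return true;
    }
    inconsistenciesDetected.incrementAndGet();
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxDurationMillis);
    while (System.nanoTime() - deadline < 0) {
      Thread.sleep(intervalMillis);
      retries.incrementAndGet();
      if (check.getAsBoolean()) {
        return true;
      }
    }
    return false;
  }
}
```

A caller such as rename(src, dest) would throw an exception when
awaitConsistent() returns false, which gives applications the eventual error
rather than an unbounded wait.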
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)