[ https://issues.apache.org/jira/browse/HADOOP-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366433#comment-16366433 ]

Aaron Fabbri commented on HADOOP-13761:
---------------------------------------

[~ste...@apache.org] on your question about changing retry() to enclose 
lazySeek() instead of stream.read(): that is not sufficient under the 
current failure model (i.e. how I'm injecting failures).  I think the failure 
injection needs work.

{noformat}
2018-02-15 15:27:00,907 [JUnit-testOpenFailOnRead] ERROR s3a.AbstractS3ATestBase (ITestS3AInconsistency.java:testOpenFailOnRead(129)) - Error:
java.io.FileNotFoundException: read(b, 0, 4) on key test/ancestor/file-to-read-DELAY_LISTING_ME failed: injecting error 3/5 for test.
        at org.apache.hadoop.fs.s3a.InconsistentS3Object$InconsistentS3InputStream.readFailpoint(InconsistentS3Object.java:169)
        at org.apache.hadoop.fs.s3a.InconsistentS3Object$InconsistentS3InputStream.read(InconsistentS3Object.java:159)
        at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$read$56(S3AInputStream.java:426)
        at org.apache.hadoop.fs.s3a.S3AInputStream$$Lambda$31/1279551328.execute(Unknown Source)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$7(Invoker.java:260)
        at org.apache.hadoop.fs.s3a.Invoker$$Lambda$12/999989609.execute(Unknown Source)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:317)
        at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
        at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:231)
        at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:438)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.fs.s3a.ITestS3AInconsistency.testOpenFailOnRead(ITestS3AInconsistency.java:126)
{noformat}
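
For reference, here is roughly what enclosing both operations in one retry could 
look like in S3AInputStream.read().  This is a sketch only, simplified from my 
reading of the current code; field names like wrappedStream, nextReadPos and uri 
come from S3AInputStream as it stands today, and error handling is omitted:

{noformat}
// Sketch: the same retry policy wraps lazySeek() as well as the stream read,
// so a failure injected in either path is retried.
int bytesRead = invoker.retry("read", uri, true, () -> {
  lazySeek(nextReadPos, len);               // may reopen the stream; can fail
  return wrappedStream.read(buf, off, len);
});
{noformat}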

If we can agree on where to inject failures, I think I can come up with a good 
solution.

Maybe:

- We may need retries around both lazySeek() and stream.read()?
- The failure injection in InconsistentS3InputStream should also have a 
failpoint in skip(), which would have exposed the missing retry around 
lazySeek() (see the sketch after this list).
- AmazonS3Client.getObject() currently never fails; it only returns an 
InconsistentS3Object carrying the read/skip/etc. failpoints mentioned above.  
It seems getObject() itself needs to fail too, since, looking at the SDK code, 
I believe that is where the GET request is actually issued (also sketched 
below).
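
To make the second and third points concrete, here are rough sketches (not the 
actual patch; the failure-counter fields and the wrapping helper are 
assumptions, and the extends-vs-delegates shape of the stream class is 
simplified):

{noformat}
// Sketch of a skip() failpoint in InconsistentS3Object.InconsistentS3InputStream,
// mirroring the existing read failpoint. failCounter/maxFailures/key are
// assumed field names.
@Override
public long skip(long n) throws IOException {
  if (failCounter < maxFailures) {
    failCounter++;
    throw new FileNotFoundException("skip(" + n + ") on key " + key
        + " failed: injecting error " + failCounter + "/" + maxFailures);
  }
  return super.skip(n);
}

// Sketch of getObject() itself failing in the inconsistent client, since the
// SDK performs the actual GET request there. wrapWithFailpoints() is a
// hypothetical helper standing in for the existing success-path wrapping.
@Override
public S3Object getObject(GetObjectRequest request) {
  if (failCounter < maxFailures) {
    failCounter++;
    throw new AmazonS3Exception("getObject(" + request.getKey()
        + "): injecting error " + failCounter + "/" + maxFailures);
  }
  return wrapWithFailpoints(super.getObject(request));
}
{noformat}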

Does this sound right?

Also: my current approach of failing read five times and then succeeding 
(5 < 20, the retry max) is not going to expose all of the codepaths that can 
fail.  I need a loop that runs the test multiple times and either (1) increases 
the max failure count (or the failure offset) by one each iteration, the way 
traceroute uses increasing TTLs to probe router hops, using increasing failure 
counts to probe failure points, or (2) uses randomization to make it likely 
that different failure points are hit.

#1 actually seems more deterministic.
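
A minimal sketch of idea #1, assuming a hypothetical setFailureLimit() hook for 
configuring the injected failure count (not an existing API):

{noformat}
// Sketch of idea #1: probe failure points traceroute-style, raising the
// injected failure count by one on each iteration.
final int retryMax = 20;          // assumed retry limit from the test config
byte[] buffer = new byte[4];
for (int failures = 1; failures <= retryMax; failures++) {
  setFailureLimit(failures);      // hypothetical: fail the first N calls, then succeed
  try (FSDataInputStream in = fs.open(testPath)) {
    in.read(buffer, 0, buffer.length);   // should recover while N <= retry max
  }
}
{noformat}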



> S3Guard: implement retries for DDB failures and throttling; translate 
> exceptions
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-13761
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13761
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Aaron Fabbri
>            Assignee: Aaron Fabbri
>            Priority: Blocker
>         Attachments: HADOOP-13761-004-to-005.patch, HADOOP-13761-005.patch, 
> HADOOP-13761.001.patch, HADOOP-13761.002.patch, HADOOP-13761.003.patch, 
> HADOOP-13761.004.patch
>
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add 
> retry logic.
> In HADOOP-13651, I added TODO comments in most of the places retry loops are 
> needed, including:
> - open(path).  If MetadataStore reflects recent create/move of file path, but 
> we fail to read it from S3, retry.
> - delete(path).  If deleteObject() on S3 fails, but MetadataStore shows the 
> file exists, retry.
> - rename(src,dest).  If source path is not visible in S3 yet, retry.
> - listFiles(). Skip for now. Not currently implemented in S3Guard. I will 
> create a separate JIRA for this as it will likely require interface changes 
> (i.e. prefix or subtree scan).
> We may miss some cases initially and we should do failure injection testing 
> to make sure we're covered.  Failure injection tests can be a separate JIRA 
> to make this easier to review.
> We also need basic configuration parameters around retry policy.  There 
> should be a way to specify a maximum retry duration, as some applications 
> would prefer to receive an error eventually rather than wait indefinitely.  
> We should also keep statistics when inconsistency is detected and we enter a 
> retry loop.


