[ https://issues.apache.org/jira/browse/HADOOP-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366433#comment-16366433 ]
Aaron Fabbri commented on HADOOP-13761:
---------------------------------------

[~ste...@apache.org] on your question about changing the retry() to enclose lazySeek() instead of wrapping stream.read(): that is not sufficient with the current failure model (i.e. how I'm injecting failures). I think the failure injection needs work.

{noformat}
2018-02-15 15:27:00,907 [JUnit-testOpenFailOnRead] ERROR s3a.AbstractS3ATestBase (ITestS3AInconsistency.java:testOpenFailOnRead(129)) - Error:
java.io.FileNotFoundException: read(b, 0, 4) on key test/ancestor/file-to-read-DELAY_LISTING_ME failed: injecting error 3/5 for test.
	at org.apache.hadoop.fs.s3a.InconsistentS3Object$InconsistentS3InputStream.readFailpoint(InconsistentS3Object.java:169)
	at org.apache.hadoop.fs.s3a.InconsistentS3Object$InconsistentS3InputStream.read(InconsistentS3Object.java:159)
	at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$read$56(S3AInputStream.java:426)
	at org.apache.hadoop.fs.s3a.S3AInputStream$$Lambda$31/1279551328.execute(Unknown Source)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$7(Invoker.java:260)
	at org.apache.hadoop.fs.s3a.Invoker$$Lambda$12/999989609.execute(Unknown Source)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:317)
	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:231)
	at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:438)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.hadoop.fs.s3a.ITestS3AInconsistency.testOpenFailOnRead(ITestS3AInconsistency.java:126)
{noformat}

If we can agree on where to inject failures, I think I can come up with a good solution. Maybe:
- We need retries around both lazySeek() and stream.read()?
- The failure injection for InconsistentS3InputStream should also have a failpoint in skip(), which would have exposed the lack of retry in lazySeek().
- AmazonS3Client.getObject() currently does not fail, but returns an InconsistentS3Object with the read/skip/etc. failpoints mentioned above. It seems like getObject() itself needs to fail: looking at the SDK code, I believe it actually performs the GET request. Does that sound right?

Also: my current approach of failing a read 5 times and then succeeding (5 < 20, which is the retry max) is not going to expose all the codepaths that can fail. I need a loop that runs the test multiple times and either (1) increases the max failure count, or the failure offset, by one on each iteration (the way traceroute uses increasing TTLs to probe successive router hops, I'd use increasing failure counts to probe successive failure points), or (2) uses randomization to make it likely that different failure points are hit. #1 actually seems more deterministic.

> S3Guard: implement retries for DDB failures and throttling; translate
> exceptions
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-13761
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13761
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Aaron Fabbri
>            Assignee: Aaron Fabbri
>            Priority: Blocker
>         Attachments: HADOOP-13761-004-to-005.patch, HADOOP-13761-005.patch,
> HADOOP-13761.001.patch, HADOOP-13761.002.patch, HADOOP-13761.003.patch,
> HADOOP-13761.004.patch
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add
> retry logic.
> In HADOOP-13651, I added TODO comments in most of the places retry loops are
> needed, including:
> - open(path). If the MetadataStore reflects a recent create/move of the file path,
> but we fail to read it from S3, retry.
> - delete(path). If deleteObject() on S3 fails, but the MetadataStore shows the
> file exists, retry.
> - rename(src,dest). If the source path is not visible in S3 yet, retry.
> - listFiles(). Skip for now; not currently implemented in S3Guard. I will
> create a separate JIRA for this as it will likely require interface changes
> (i.e. prefix or subtree scan).
> We may miss some cases initially, so we should do failure-injection testing
> to make sure we're covered. Failure-injection tests can be a separate JIRA
> to make this easier to review.
> We also need basic configuration parameters around retry policy. There
> should be a way to specify a maximum retry duration, as some applications
> would prefer to receive an error eventually rather than wait indefinitely.
> We should also keep statistics when inconsistency is detected and we enter
> a retry loop.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
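The traceroute-style probe loop proposed in option #1 of the comment above could be sketched roughly as follows. This is a minimal, self-contained Java sketch under stated assumptions: FlakyRead, readWithRetry, and the constants are illustrative stand-ins, not the real S3A InconsistentS3InputStream or Invoker classes. It shows how raising the injected failure count by one per iteration drives the failure one step deeper into the retried read path, just as increasing TTLs expose successive router hops.

```java
// Hypothetical sketch of the "traceroute-style" failure probe: none of
// these names are real S3A classes; they stand in for the injected-failure
// stream and the Invoker.retry() wrapper discussed in the comment.
public class FailureProbeSketch {

    /** Stand-in for an injected-failure stream: fails N times, then succeeds. */
    static class FlakyRead {
        private int remainingFailures;
        FlakyRead(int failures) { this.remainingFailures = failures; }
        int read() throws java.io.IOException {
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new java.io.IOException("injected failure");
            }
            return 42;  // arbitrary payload byte once failpoints are exhausted
        }
    }

    /** Stand-in for Invoker.retry(): retries the whole read operation.
     *  The real Invoker consults a RetryPolicy and translates exceptions;
     *  here we simply retry up to maxRetries times. */
    static int readWithRetry(FlakyRead in, int maxRetries) {
        java.io.IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return in.read();
            } catch (java.io.IOException e) {
                last = e;  // real code would decide retriability here
            }
        }
        throw new RuntimeException("retries exhausted", last);
    }

    /** Probe loop: like traceroute's increasing TTL, raise the injected
     *  failure count by one each iteration so each run pushes the failure
     *  one step further before the retries finally succeed. */
    public static void main(String[] args) {
        final int maxRetries = 5;
        for (int failures = 0; failures <= maxRetries; failures++) {
            int b = readWithRetry(new FlakyRead(failures), maxRetries);
            System.out.println("failures=" + failures + " -> read " + b);
        }
    }
}
```

A randomized variant (option #2) would just draw the failure count from a random range per run; the deterministic loop above has the advantage that a test failure pinpoints exactly which injected-failure depth broke.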