[ https://issues.apache.org/jira/browse/HADOOP-16490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901028#comment-16901028 ]

Steve Loughran commented on HADOOP-16490:
-----------------------------------------

Going to worry about this a bit more

Should we *always* have an extended retry policy for rename() calls where the 
LIST has returned a file but we can't see it?

Rationale:

* the FNFE on directory rename is a recurring stack trace, breaking commits, 
distcp, copy-from-local, etc. A delayed-visibility issue here fails the entire 
operation
* if the cause is that LIST is correct and S3 is lagging, then retries here 
will give S3 time to catch up
* if the cause is that LIST is out of date and S3 is correct, then some 
retries will delay the operation but are otherwise harmless

We have less confidence in the correctness of raw LIST results than in S3Guard 
listings, so I don't think we should do anything for open() operations or for 
single-file renames. But we can be more forgiving of failures during directory 
rename: a big directory tree is already so slow to rename that spinning for 
one file to appear is not a major factor.
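The extended retry described above can be sketched as a small helper that re-probes a path which LIST reported but a read cannot yet see. This is an illustrative standalone sketch, not the actual S3A retry code: the helper name, the fixed attempt count, and the flat delay are all assumptions standing in for a separately configurable policy.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class RenameRetrySketch {

    // Hypothetical helper: retry a probe that may throw FileNotFoundException
    // while S3 catches up with what LIST reported. attempts/delayMs stand in
    // for a retry policy that would be configured independently of the
    // normal S3A retry settings.
    static <T> T retryOnFnfe(Callable<T> probe, int attempts, long delayMs)
            throws Exception {
        FileNotFoundException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return probe.call();
            } catch (FileNotFoundException e) {
                last = e;              // LIST said the file exists; wait for it
                Thread.sleep(delayMs); // flat delay; a real policy could back off
            }
        }
        throw last;                    // S3 never caught up: surface the FNFE
    }

    public static void main(String[] args) throws Exception {
        // Simulate delayed visibility: the first two probes miss,
        // the third one sees the file.
        AtomicInteger calls = new AtomicInteger();
        String result = retryOnFnfe(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new FileNotFoundException("not visible yet");
            }
            return "copied";
        }, 5, 10L);
        System.out.println(result + " after " + calls.get() + " probes");
    }
}
```

If LIST was out of date rather than S3 lagging, the loop exhausts its attempts and rethrows the original FNFE, which matches the "harmless but slower" case in the second bullet.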


> S3GuardExistsRetryPolicy handle FNFE eventual consistency better
> ----------------------------------------------------------------
>
>                 Key: HADOOP-16490
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16490
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>
> If S3Guard is encountering delayed consistency (FNFE from tombstone; failure 
> to open file) then 
> * it only retries with the same timings as everything else. We should make 
> it separately configurable
> * when an FNFE is finally thrown, rename() treats it as being caused by the 
> original source path missing, when in fact it's something else. Proposed: 
> somehow propagate the failure up differently, probably in the 
> S3AFileSystem.copyFile() code
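The second bullet of the quoted description, propagating the failure up differently, could look like wrapping the copy-phase FNFE before rethrowing so that rename() can tell "source listed but unreadable during copy" apart from "source path never existed". A minimal sketch; the exception class and the copyFile/openForRead names here are illustrative assumptions, not the actual S3AFileSystem code:

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class CopyFailureSketch {

    // Hypothetical exception: the file was in the directory listing but
    // vanished (or was never visible) when the copy tried to read it.
    static class RemoteFileMissingDuringCopy extends IOException {
        RemoteFileMissingDuringCopy(String path, FileNotFoundException cause) {
            super("File " + path + " was listed but not readable during copy",
                    cause);
        }
    }

    // Stand-in for the per-file copy step inside a directory rename.
    static void copyFile(String src) throws IOException {
        try {
            openForRead(src); // hypothetical read that may hit delayed visibility
        } catch (FileNotFoundException e) {
            // Rethrow with enough context that rename() does not blame
            // the original source path for the failure.
            throw new RemoteFileMissingDuringCopy(src, e);
        }
    }

    // Simulates S3 lagging behind LIST: every read misses.
    static void openForRead(String src) throws FileNotFoundException {
        throw new FileNotFoundException(src);
    }

    public static void main(String[] args) {
        try {
            copyFile("s3a://bucket/dir/file");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```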



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
