[ https://issues.apache.org/jira/browse/HADOOP-13447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417767#comment-15417767 ]

Chris Nauroth commented on HADOOP-13447:
----------------------------------------

[~fabbri], thank you for looking.  Great questions again.

bq. The downsides to this, as we mentioned in the design doc, is that the 
s3a-internal calls like getFileStatus() cannot utilize the MetadataStore.

My eventual (no pun intended!) plan was to evolve the interfaces and the 
separation of responsibilities between {{AbstractS3AccessPolicy}} and 
{{S3Store}} such that {{S3Store}} never makes its own internal metadata calls 
(like the internal {{getFileStatus}} calls you mentioned).  I didn't take that 
leap in this patch because it was already quite large.

Taking the example of {{create}}, this might look like the policy layer 
fetching the {{FileStatus}} and then passing it down to {{S3Store#create}}, so 
that {{S3Store#create}} doesn't have to do an internal S3 call to fetch it.  
For the "direct" policy, that would just be a sequence of 
{{S3Store#getFileStatus}} + {{S3Store#create}}.  For a caching policy, that 
could be a metadata store fetch instead.
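To sketch the shape I have in mind (everything below beyond the names 
{{S3Store}} and {{MetadataStore}} is hypothetical, simplified away from the 
real Hadoop types, and not the actual patch 001 API):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for FileStatus, just enough for the sketch.
class FileMeta {
    final String path;
    final boolean exists;
    FileMeta(String path, boolean exists) { this.path = path; this.exists = exists; }
}

interface S3Store {
    FileMeta getFileStatus(String path);
    // The status is passed in, so create() makes no internal S3 call of its own.
    void create(String path, FileMeta existing);
}

interface MetadataStore {
    FileMeta get(String path);  // may return null on a cache miss
}

// "Direct" policy: plain S3Store#getFileStatus + S3Store#create.
class DirectPolicy {
    private final S3Store store;
    DirectPolicy(S3Store store) { this.store = store; }
    void create(String path) {
        store.create(path, store.getFileStatus(path));
    }
}

// Caching policy: consult the MetadataStore first, fall back to S3.
class CachingPolicy {
    private final S3Store store;
    private final MetadataStore meta;
    CachingPolicy(S3Store store, MetadataStore meta) { this.store = store; this.meta = meta; }
    void create(String path) {
        FileMeta cached = meta.get(path);
        store.create(path, cached != null ? cached : store.getFileStatus(path));
    }
}

// Minimal in-memory store that counts status lookups, to show the difference.
class InMemoryStore implements S3Store {
    int statusCalls = 0;
    final Map<String, FileMeta> objects = new HashMap<>();
    public FileMeta getFileStatus(String path) {
        statusCalls++;
        FileMeta m = objects.get(path);
        return m != null ? m : new FileMeta(path, false);
    }
    public void create(String path, FileMeta existing) {
        objects.put(path, new FileMeta(path, true));
    }
}
```

The point of the sketch is only that the store layer stops deciding where 
metadata comes from; the policy layer does.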

For some operations, lazy vs. eager fetch of this internal metadata poses 
greater challenges.  Considering {{rename}}, there are multiple (I think 3 
now?) possible {{getFileStatus}} calls, but fetching them all eagerly in the 
policy would hurt performance if it turns out {{S3Store#rename}} doesn't 
really need them all.  Working out a lazy fetch strategy will bring some 
additional complexity into the code, so that's a risk.
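One possible shape for the lazy fetch (again, purely a hypothetical sketch, 
with fake stand-in types): wrap each status lookup in a memoizing supplier, 
so the rename implementation only pays for the lookups it actually 
dereferences.

```java
import java.util.function.Supplier;

// Memoizing supplier: the underlying fetch runs at most once, and only
// if someone actually calls get().
class Lazy<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private T value;
    private boolean done;
    Lazy(Supplier<T> delegate) { this.delegate = delegate; }
    public T get() {
        if (!done) { value = delegate.get(); done = true; }
        return value;
    }
}

class RenameDemo {
    static int s3Calls = 0;

    // Stands in for a real S3 HEAD request issued by getFileStatus.
    static String fetchStatus(String path) {
        s3Calls++;
        return "status:" + path;
    }

    // rename() receives lazy handles for source, dest, and dest parent;
    // in this demo it happens to need only the source status, so only
    // that one lookup ever hits "S3".
    static void rename(Supplier<String> src, Supplier<String> dst,
                       Supplier<String> dstParent) {
        src.get();
    }
}
```

The complexity cost shows up in reasoning about when each handle fires, which 
is exactly the risk mentioned above.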

bq. I feel like a cleaner mapping to the problem is to have the client 
(S3AFileSystem) contain a MetadataStore and/or some sort of policy object which 
specifies behavior.

I considered this but discarded it, because I thought it would introduce 
complicated control flow into many of the {{S3AFileSystem}} methods.  However, 
refactorings like this are always subjective, and it's entirely possible that I 
was wrong.  If you prefer, we can go that way and later revisit the larger 
split I proposed here in patch 001 if it's deemed necessary.  I'm happy to get 
us rolling either way.  Let me know your thoughts.

Really the only portion of patch 001 that I consider an absolute must is the 
{{S3ClientFactory}} work.  I think it's vital to the project that we have the 
ability to mock S3 interactions to simulate eventual consistency.
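Roughly why the factory matters (a hypothetical sketch, not the actual 
{{S3ClientFactory}} interface from patch 001): tests can inject a fake client 
whose listings deliberately lag behind writes, so stale-listing windows become 
deterministic instead of depending on real S3 timing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the real AWS SDK client and the factory.
interface S3Client {
    void putObject(String key);
    List<String> listKeys();
}

interface S3ClientFactory {
    S3Client createClient();
}

// Fake client: a newly written key stays invisible to listings for a
// fixed number of list calls, simulating eventual consistency.
class EventuallyConsistentClient implements S3Client {
    private final int lagLists;                      // stale list calls per key
    private final Map<String, Integer> visibleAfter = new HashMap<>();
    private int listCount = 0;

    EventuallyConsistentClient(int lagLists) { this.lagLists = lagLists; }

    public void putObject(String key) {
        // Key becomes visible only once listCount exceeds this threshold.
        visibleAfter.put(key, listCount + lagLists);
    }

    public List<String> listKeys() {
        listCount++;
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : visibleAfter.entrySet()) {
            if (listCount > e.getValue()) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

With the factory in place, production wires in the real SDK client while a 
test passes {{() -> new EventuallyConsistentClient(2)}} and asserts exactly 
how the filesystem behaves during the stale window.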

> S3Guard: Refactor S3AFileSystem to support introduction of separate metadata 
> repository and tests.
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13447
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13447
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HADOOP-13447-HADOOP-13446.001.patch
>
>
> The scope of this issue is to refactor the existing {{S3AFileSystem}} into 
> multiple coordinating classes.  The goal of this refactoring is to separate 
> the {{FileSystem}} API binding from the AWS SDK integration, make code 
> maintenance easier while we're making changes for S3Guard, and make it easier 
> to mock some implementation details so that tests can simulate eventual 
> consistency behavior in a deterministic way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
