[
https://issues.apache.org/jira/browse/HADOOP-13447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417767#comment-15417767
]
Chris Nauroth commented on HADOOP-13447:
----------------------------------------
[~fabbri], thank you for looking. Great questions again.
bq. The downsides to this, as we mentioned in the design doc, is that the
s3a-internal calls like getFileStatus() cannot utilize the MetadataStore.
My eventual (no pun intended!) plan was to evolve the interfaces
and separation of responsibilities for {{AbstractS3AccessPolicy}} and
{{S3Store}} such that {{S3Store}} never makes its own internal metadata calls
(like the internal {{getFileStatus}} calls you mentioned). I didn't take that
leap in this patch, because it was already quite large.
Taking the example of {{create}}, this might look like the policy layer
fetching the {{FileStatus}} and then passing it down to {{S3Store#create}}, so
that {{S3Store#create}} doesn't have to do an internal S3 call to fetch it.
For the "direct" policy, that would just be a sequence of
{{S3Store#getFileStatus}} + {{S3Store#create}}. For a caching policy, that
could be a metadata store fetch instead.
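A rough sketch of that {{create}} flow follows. It is illustrative only: every class here ({{SketchFileStatus}}, {{SketchS3Store}}, {{SketchAccessPolicy}}, {{DirectPolicy}}, {{CachingPolicy}}) is a hypothetical stand-in and not the code in patch 001, the store is an in-memory map rather than S3, and the fall-back to the store on a metadata store miss is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
final class SketchFileStatus {
    final String path;
    final boolean directory;

    SketchFileStatus(String path, boolean directory) {
        this.path = path;
        this.directory = directory;
    }
}

// Hypothetical stand-in for S3Store, backed by an in-memory map instead of S3.
final class SketchS3Store {
    private final Map<String, SketchFileStatus> objects = new HashMap<>();
    int statusCalls = 0; // counts simulated S3 metadata round trips

    Optional<SketchFileStatus> getFileStatus(String path) {
        statusCalls++;
        return Optional.ofNullable(objects.get(path));
    }

    // The caller supplies the status, so create makes no metadata call of its own.
    void create(String path, Optional<SketchFileStatus> existing, boolean overwrite) {
        if (existing.isPresent() && existing.get().directory) {
            throw new IllegalStateException(path + " is a directory");
        }
        if (existing.isPresent() && !overwrite) {
            throw new IllegalStateException(path + " already exists");
        }
        objects.put(path, new SketchFileStatus(path, false));
    }
}

// Hypothetical stand-in for AbstractS3AccessPolicy.
abstract class SketchAccessPolicy {
    protected final SketchS3Store store;

    SketchAccessPolicy(SketchS3Store store) {
        this.store = store;
    }

    // Each policy decides where the FileStatus comes from.
    abstract Optional<SketchFileStatus> lookup(String path);

    void create(String path, boolean overwrite) {
        store.create(path, lookup(path), overwrite);
    }
}

// "Direct" policy: getFileStatus followed by create, both against the store.
final class DirectPolicy extends SketchAccessPolicy {
    DirectPolicy(SketchS3Store store) {
        super(store);
    }

    @Override
    Optional<SketchFileStatus> lookup(String path) {
        return store.getFileStatus(path);
    }
}

// Caching policy: consult a metadata store first; fall back to the store on a miss
// (the fall-back is an assumption of this sketch).
final class CachingPolicy extends SketchAccessPolicy {
    private final Map<String, SketchFileStatus> metadataStore = new HashMap<>();

    CachingPolicy(SketchS3Store store) {
        super(store);
    }

    @Override
    Optional<SketchFileStatus> lookup(String path) {
        SketchFileStatus cached = metadataStore.get(path);
        return cached != null ? Optional.of(cached) : store.getFileStatus(path);
    }

    @Override
    void create(String path, boolean overwrite) {
        super.create(path, overwrite);
        metadataStore.put(path, new SketchFileStatus(path, false));
    }
}
```

In this shape the store's {{create}} contains no lookup logic at all, so swapping {{DirectPolicy}} for {{CachingPolicy}} changes where the status comes from without touching the store.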
For some operations, there are greater challenges related to lazy fetch vs.
eager fetch for these internal metadata operations. Considering {{rename}},
there are multiple (I think 3 now?) {{getFileStatus}} calls possible, but
fetching them all eagerly in the policy would harm performance if it turns out
{{S3Store#rename}} doesn't really need to use them all. Working out a lazy
fetch strategy will bring some additional complexity into the code, so that's a
risk.
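One possible shape for a lazy fetch, sketched under assumptions: the policy hands the store memoizing suppliers instead of resolved values, so a lookup runs only if the store asks for it, and at most once. {{Lazy}} and {{LazyRenameSketch}} are hypothetical names, the rename rules are a toy (fail if the source is missing or the destination exists), and only two lookups are modeled rather than the full set in the real {{rename}}.

```java
import java.util.function.Supplier;

// Memoizing supplier: runs the underlying fetch at most once, on first use.
// Not thread-safe; adequate for a single-threaded sketch.
final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> fetch;
    private boolean resolved;
    private T value;

    Lazy(Supplier<T> fetch) {
        this.fetch = fetch;
    }

    @Override
    public T get() {
        if (!resolved) {
            value = fetch.get();
            resolved = true;
        }
        return value;
    }
}

// Hypothetical policy + store pair, collapsed into one class for brevity.
final class LazyRenameSketch {
    int statusCalls = 0; // counts simulated metadata round trips

    // Toy lookup: any path under /data/ "exists".
    boolean exists(String path) {
        statusCalls++;
        return path.startsWith("/data/");
    }

    // Store side: consumes only the lookups this particular rename needs.
    boolean storeRename(Supplier<Boolean> srcExists, Supplier<Boolean> dstExists) {
        if (!srcExists.get()) {
            return false; // destination status is never fetched on this path
        }
        if (dstExists.get()) {
            return false; // toy rule: refuse to rename onto an existing path
        }
        return true;
    }

    // Policy side: wraps each lookup lazily instead of fetching eagerly.
    boolean rename(String src, String dst) {
        return storeRename(
            new Lazy<Boolean>(() -> exists(src)),
            new Lazy<Boolean>(() -> exists(dst)));
    }
}
```

The extra indirection is the complexity cost mentioned above: the store's signature now carries suppliers, and any failure inside a lookup surfaces at the point of use rather than up front in the policy.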
bq. I feel like a cleaner mapping to the problem is to have the client
(S3AFileSystem) contain a MetadataStore and/or some sort of policy object which
specifies behavior.
I considered this, but I discarded it, because I thought it would introduce
complicated control flow in a lot of the {{S3AFileSystem}} methods. However,
refactorings like this are always subjective, and it's entirely possible that I
was wrong. If you prefer, we can go that way and later revisit the larger
split I proposed here in patch 001 if it's deemed necessary. I'm happy to get
us rolling either way. Let me know your thoughts.
Really the only portion of patch 001 that I consider an absolute must is the
{{S3ClientFactory}} work. I think it's vital to the project that we have the
ability to mock S3 interactions to simulate eventual consistency.
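To illustrate why a client factory helps here, the sketch below shows a factory returning a wrapper that hides newly written keys from a fixed number of listings, which makes list-after-write inconsistency reproducible in a test. All names ({{SketchObjectClient}}, {{SketchClientFactory}}, {{InMemoryClient}}, {{DelayedVisibilityClient}}, {{InconsistentClientFactory}}) are hypothetical; this is not the {{S3ClientFactory}} interface from the patch, and it uses a tiny two-method client rather than the AWS SDK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical, minimal client surface; the real AWS SDK client is far larger.
interface SketchObjectClient {
    void putObject(String key);
    List<String> listKeys();
}

// Hypothetical factory seam: production code asks the factory for a client,
// so tests can substitute a misbehaving one.
interface SketchClientFactory {
    SketchObjectClient createClient();
}

// Always-consistent in-memory client used as the delegate.
final class InMemoryClient implements SketchObjectClient {
    private final TreeSet<String> keys = new TreeSet<>();

    @Override
    public void putObject(String key) {
        keys.add(key);
    }

    @Override
    public List<String> listKeys() {
        return new ArrayList<>(keys);
    }
}

// Hides each newly put key from the next N listings, deterministically.
final class DelayedVisibilityClient implements SketchObjectClient {
    private final SketchObjectClient delegate;
    private final int hiddenListings;
    private final Map<String, Integer> remainingHidden = new HashMap<>();

    DelayedVisibilityClient(SketchObjectClient delegate, int hiddenListings) {
        this.delegate = delegate;
        this.hiddenListings = hiddenListings;
    }

    @Override
    public void putObject(String key) {
        delegate.putObject(key);
        remainingHidden.put(key, hiddenListings);
    }

    @Override
    public List<String> listKeys() {
        List<String> visible = new ArrayList<>();
        for (String key : delegate.listKeys()) {
            int remaining = remainingHidden.getOrDefault(key, 0);
            if (remaining > 0) {
                remainingHidden.put(key, remaining - 1); // still "propagating"
            } else {
                visible.add(key);
            }
        }
        return visible;
    }
}

// Test-only factory that injects the delayed-visibility behavior.
final class InconsistentClientFactory implements SketchClientFactory {
    private final int hiddenListings;

    InconsistentClientFactory(int hiddenListings) {
        this.hiddenListings = hiddenListings;
    }

    @Override
    public SketchObjectClient createClient() {
        return new DelayedVisibilityClient(new InMemoryClient(), hiddenListings);
    }
}
```

Because the delay is counted in listings rather than wall-clock time, a test built on this kind of factory sees the same sequence of stale and fresh results on every run.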
> S3Guard: Refactor S3AFileSystem to support introduction of separate metadata
> repository and tests.
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-13447
> URL: https://issues.apache.org/jira/browse/HADOOP-13447
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: HADOOP-13447-HADOOP-13446.001.patch
>
>
> The scope of this issue is to refactor the existing {{S3AFileSystem}} into
> multiple coordinating classes. The goal of this refactoring is to separate
> the {{FileSystem}} API binding from the AWS SDK integration, make code
> maintenance easier while we're making changes for S3Guard, and make it easier
> to mock some implementation details so that tests can simulate eventual
> consistency behavior in a deterministic way.