[ 
https://issues.apache.org/jira/browse/HADOOP-13447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418098#comment-15418098
 ] 

Aaron Fabbri commented on HADOOP-13447:
---------------------------------------

Great discussion, I'm enjoying it (although an hour with a whiteboard would be 
even better).

{quote}
My eventual (no pun intended!) plan was going to be to evolve the interfaces 
and separation of responsibilities for AbstractS3AccessPolicy and S3Store such 
that S3Store never makes its own internal metadata calls (like the internal 
getFileStatus calls you mentioned). 
{quote}

Yeah, I was trying to come up with a way to do this.  My goals are clean / 
minimal code, and near-optimal performance (want upstream s3a to be very fast).

The crux of the problem seems that the top level logic of things like 
mkdirs()/rename() etc. need to call getFileStatus(), and those internal 
getFileStatus() calls are subject to the same source-of-truth and retry 
policies as the public API.  In this case, the only way I can think to really 
separate the top-level logic for FileSystem ops (e.g. mkdir) from the policy on 
MetadataStore, is to build some sort of execution plan in top level logic then 
pass to a policy layer to execute it.  You'd have three layers, roughly, 
FileSystem top-level, Raw S3 I/O, Policy / execution. This seems like it would 
be slower and way over-complicated (some steps in execution are conditional, 
you end up with a pseudo-language).  There is also the AOP approach that s3mper 
took, but I think we can do better since we can modify the upstream code, and 
our goals are a bit more ambitious (more than just listStatus(), plus ability 
to use MetadataStore as source of truth).

{quote}
 I thought it would introduce complicated control flow in a lot of the 
S3AFileSystem methods. However, refactorings like this are always subjective, 
and it's entirely possible that I was wrong. 
{quote}

This is true.  Seems like the classic performance vs. complexity tradeoff.  I 
don't think you are wrong at all, just a question of priorities.  I'm willing 
to sacrifice a little code separation for significant performance benefit, and 
possibly a more complete solution (e.g. if internal getFileStatus() call misses 
a file above mkdirs(path) because that file creation was eventually consistent).

{quote}
If you prefer, we can go that way
{quote}

That is my preference at this stage.  It was not as bad as I thought it would 
be when we prototyped it.  Also I like doing refactoring / generalization after 
seeing some concrete cases in code (i.e. start by complicating top level 
S3AFileSystem logic).

My preference would probably change if I could think of a clean way to handle 
the internal getFileStatus() calls.

{quote} and later revisit the larger split I proposed here in patch 001 if it's 
deemed necessary.{quote}

I think we do need to break up S3AFileSystem.  I'll be very supportive of 
future refactorization here, in general.

{quote}
Really the only portion of patch 001 that I consider an absolute must is the 
S3ClientFactory work. I think it's vital to the project that we have the 
ability to mock S3 interactions to simulate eventual consistency.
{quote}

Yes, I like this part of it.  Being able to mock the S3 service out would be 
awesome, and I didn't notice any real downside.

> S3Guard: Refactor S3AFileSystem to support introduction of separate metadata 
> repository and tests.
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13447
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13447
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HADOOP-13447-HADOOP-13446.001.patch, 
> HADOOP-13447-HADOOP-13446.002.patch
>
>
> The scope of this issue is to refactor the existing {{S3AFileSystem}} into 
> multiple coordinating classes.  The goal of this refactoring is to separate 
> the {{FileSystem}} API binding from the AWS SDK integration, make code 
> maintenance easier while we're making changes for S3Guard, and make it easier 
> to mock some implementation details so that tests can simulate eventual 
> consistency behavior in a deterministic way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to