[ 
https://issues.apache.org/jira/browse/HADOOP-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606497#comment-16606497
 ] 

Aaron Fabbri commented on HADOOP-15621:
---------------------------------------

Thanks for the v1 patch. You've done some nice work here, and I'm glad to see 
someone new mastering this part of the codebase. Looks pretty good overall.

I think we should consider a couple of changes, now that you've explored the details 
of the implementation. Let me know if you agree / disagree.
 # Don't do the "online" prune / delete here. We can do that in a later JIRA if 
we want. We have the prune CLI ("offline") and the fs.s3a.s3guard.cli.prune.age 
config today. I think that is good enough for now, and I want us to be able to 
performance-test this without that extra variable.
 # It seems simpler to do TTL filtering in the FS instead of in each metadata 
store.
 Pros:

 - All Metadata Stores behave the same.
 - Less code duplication (filtering logic implemented once in FS).
 - S3A would need logic to implement parts of TTL anyway (to deal with
 getFileStatus() not being authoritative if the last-updated timestamp in
 PathMetadata is older than the TTL). This could be done later as a better 
solution to HADOOP-14468.
 - Clearer MetadataStore API semantics (MS behavior not dependent on the 
external Configuration API).
 - Fewer config knobs. fs.s3a.metadatastore.authoritative.ttl: how long an 
entry in the MS is considered authoritative before it is refreshed.
 - Easier to test.
 - Future-proof for metadata caching. A FS can choose cache policy on a per-file
 basis, e.g. from coherency hints at open() or create() time. The FS controls 
it.

Cons:
 - Would need some convenience wrappers around the MetadataStore API in S3A 
(rough sketch after this list).
 - Would require changes to MetadataStore API (include last_updated field in 
PathMetadata,
 DirListingMetadata)
 - Would require changes to LocalMetadataStore (though it could be quite 
easy--just store the lastUpdated field on PathMetadata and DirListingMetadata. 
The Local MS can still have its own separate TTL value, which is used to limit 
memory usage; just keep the two separate).
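
To make that concrete, here is a rough sketch of the kind of read-side wrapper 
I have in mind for S3Guard.java (illustrative only; listChildrenWithTTL() and 
the getLastUpdated() accessor are assumptions, not the current API):
{noformat}
// Hypothetical helper in S3Guard.java -- a sketch, not the real API.
// Assumes DirListingMetadata gains a lastUpdated field (see cons above).
// (imports: java.io.IOException, org.apache.hadoop.fs.Path,
//  org.apache.hadoop.fs.s3a.s3guard.*)
public static DirListingMetadata listChildrenWithTTL(MetadataStore ms,
    Path path, long nowMillis, long ttlMillis) throws IOException {
  DirListingMetadata dlm = ms.listChildren(path);
  if (dlm != null && dlm.isAuthoritative()
      && nowMillis - dlm.getLastUpdated() > ttlMillis) {
    // Older than fs.s3a.metadatastore.authoritative.ttl: treat the listing
    // as non-authoritative so S3A re-lists against S3.
    dlm.setAuthoritative(false);
  }
  return dlm;
}
{noformat}
S3AFileSystem would call this instead of ms.listChildren() directly, so every 
MetadataStore gets the same TTL behavior without any per-store code.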

Other thoughts:
 Cool test cases, thanks. We should probably also add an integration test that 
exercises the FS and S3Guard together, e.g.:
{noformat}
set auth mode = true
configure s3a auth ttl = x seconds
s3afs.mkdir(test/)
s3afs.touch(test/file)
s3afs.listStatus(test)  // this should write full dir into MS with auth=true
assert is_authoritative(s3afs.getMetadataStore().listChildren(test))  // A

// fast forward time: via sleep(), s3afs.test_time_offset += 2x, or a fs.getTime() mock?

assert ! is_authoritative(s3afs.getMetadataStore().listChildren(test))  // B

{noformat}
Also maybe even do this next:
{noformat}
s3afs.listStatus(test)  // this should again write full dir into MS with 
auth=true
assert is_authoritative(s3afs.getMetadataStore().listChildren(test))  // C
{noformat}
 
 So, we verify the "refresh MS on TTL expiry" behavior: a cache refresh. We will 
have shown that TTL expiry clears the auth bit and makes listStatus() re-load a 
fresh listing into the MS with auth=true and a new TTL time.
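
For reference, a rough JUnit-style version of that test (a sketch only: 
TTL_MILLIS and the sleep-based fast forward are placeholders for whatever time 
mechanism we pick):
{noformat}
// Sketch; assumes auth mode is on and the proposed TTL knob is set to
// TTL_MILLIS in the test configuration.
@Test
public void testAuthListingExpiresAfterTtl() throws Exception {
  S3AFileSystem fs = getFileSystem();
  Path dir = path("test");
  fs.mkdirs(dir);
  ContractTestUtils.touch(fs, new Path(dir, "file"));

  fs.listStatus(dir);  // writes the full dir into the MS with auth=true
  assertTrue(fs.getMetadataStore().listChildren(dir).isAuthoritative());   // A

  // Fast forward past the TTL. Thread.sleep() is the crude option; an
  // injectable clock / time offset on the FS would be nicer.
  Thread.sleep(2 * TTL_MILLIS);

  assertFalse(fs.getMetadataStore().listChildren(dir).isAuthoritative());  // B

  fs.listStatus(dir);  // should re-load a fresh listing with auth=true
  assertTrue(fs.getMetadataStore().listChildren(dir).isAuthoritative());   // C
}
{noformat}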

Does that make sense?

Other thoughts:
{noformat}
   public DDBPathMetadata(FileStatus fileStatus, Tristate isEmptyDir,
<snip>
+    this.lastUpdated = getInitialLastUpdated();
{noformat}
Wondering if we can do this lazily. Or, just init to 0, and make the FS set it 
(in the putWithTTL() wrapper you'd add, e.g. in S3Guard.java)? Getting system 
time is cheaper than it used to be (vsyscalls), but still nice to avoid until 
necessary.
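
Something like the following is what I have in mind (again just a sketch; 
putWithTTL() and setLastUpdated() are assumed additions, not existing methods):
{noformat}
// In the DDBPathMetadata constructor: skip the clock read entirely.
this.lastUpdated = 0;   // 0 == "not stamped yet"

// Hypothetical wrapper in S3Guard.java: the FS stamps the time on write.
public static void putWithTTL(MetadataStore ms, PathMetadata pm)
    throws IOException {
  pm.setLastUpdated(System.currentTimeMillis());  // one clock read, FS side
  ms.put(pm);
}
{noformat}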
{noformat}
+  void checkIsEmptyDirectory(ItemCollection<QueryOutcome> items) {
{noformat}
Maybe call this setIsEmptyDirectory instead of check?
{noformat}
+    return Time.monotonicNow();
{noformat}
Reminder to make sure we are consistent throughout S3A, using the same clock 
source. Not sure we need monotonic here, but we should probably follow what the 
rest of the code uses; S3AFileStatus, for example, uses 
System.currentTimeMillis().
{noformat}
+        if(entry == null) {
{noformat}
Spacing nit: should be "if (entry == null)".

Overall impressive v1 patch. Thank you for being flexible and working with my 
code reviews.

> s3guard: implement time-based (TTL) expiry for DynamoDB Metadata Store
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-15621
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15621
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Aaron Fabbri
>            Assignee: Gabor Bota
>            Priority: Minor
>         Attachments: HADOOP-15621.001.patch
>
>
> Similar to HADOOP-13649, I think we should add a TTL (time to live) feature 
> to the Dynamo metadata store (MS) for S3Guard.
> Think of this as the "online algorithm" version of the CLI prune() function, 
> which is the "offline algorithm".
> Why: 
> 1. Self healing (soft state): since we do not implement transactions around 
> modification of the two systems (s3 and metadata store), certain failures can 
> lead to inconsistency between S3 and the metadata store (MS) state.  Having a 
> time to live (TTL) on each entry in S3Guard means that any inconsistencies 
> will be time bound.  Thus "wait and restart your job" becomes a valid, if 
> ugly, way to get around any issues with FS client failure leaving things in a 
> bad state.
> 2. We could make manual invocation of `hadoop s3guard prune ...` unnecessary, 
> depending on the implementation.
> 3. Makes it possible to fix the problem that dynamo MS prune() doesn't prune 
> directories due to the lack of true modification time.
> How:
> I think we need a new column in the dynamo table "entry last written time".  
> This is updated each time the entry is written to dynamo.
> After that we can either
> 1. Have the client simply ignore / elide any entries that are older than the 
> configured TTL.
> 2. Have the client delete entries older than the TTL.
> The issue with #2 is it will increase latency if done inline in the context 
> of an FS operation. We could mitigate this some by using an async helper 
> thread, or probabilistically doing it "some times" to amortize the expense of 
> deleting stale entries (allowing some batching as well).
> Caveats:
> - Clock synchronization as usual is a concern. Many clusters already keep 
> clocks close enough via NTP. We should at least document the requirement 
> along with the configuration knob that enables the feature.


