[ 
https://issues.apache.org/jira/browse/HADOOP-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595930#comment-16595930
 ] 

Aaron Fabbri commented on HADOOP-15621:
---------------------------------------

Hey, thank you for working on this.

{quote}

The current implementation uses {{mod_time}} field when using prune. It would 
be wise to use the same because this is an online version of prune. Thus, we 
don't need to add a new field to the item.

{quote}

mod_time is currently used to persist the same field in the FileStatus.  S3A 
*does* persist mod_time for files, just not directories.  So, I do not see how 
we can use mod time to express "table entry last written time" without breaking 
mod_time for the FileStatus.

There are two reasons to expire entries from the table:

(1) because stale entries waste space, and after 24 hours (prune time), we 
assume S3 is consistent.

(2) because we want the cache to be refreshed frequently.  That is, we want soft 
state (auto-healing over time) to make short-circuit listings from Dynamo 
(auth. mode) safer in case Dynamo and S3 go out of sync; in that case, once the 
TTL expires, the problem goes away because S3A fetches the listing again from 
S3 and writes back a fresh copy.

mod_time works for #1, but only for files; it does not work for #2.  We don't 
store mod_time for directories because of the way directories are emulated on 
S3.

Thinking about this more, "prune time" should be "time after which we believe 
S3 will be consistent", and TTL should be a shorter time: the max lifetime of 
an authoritative dir listing in Dynamo.

So, for example, if prune time = 24 hours and TTL = 1 second:

After 24 hours, entries are deleted from table.  S3 is consistent so they are 
not needed.

After 1 second, a directory is no longer considered authoritative.  We might 
also disable the short-circuit behavior on getFileStatus() after the dynamo 
entry is older than TTL.
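
To make the two thresholds concrete, here is a rough sketch of how an entry's 
age could be classified.  The class, enum, and parameter names are made up for 
illustration (they are not existing S3Guard APIs), and it assumes a per-entry 
"last written" timestamp in milliseconds is available.

```java
import java.time.Duration;

/** Hypothetical sketch; names are illustrative, not S3Guard APIs. */
final class EntryAgeSketch {
  enum State { FRESH, PAST_TTL, PRUNABLE }

  /**
   * Classify an entry by how long ago it was last written.
   * lastWrittenMs stands in for an assumed "entry last written time" column.
   */
  static State classify(long lastWrittenMs, long nowMs,
      Duration ttl, Duration pruneTime) {
    long ageMs = nowMs - lastWrittenMs;
    if (ageMs > pruneTime.toMillis()) {
      // Older than prune time: S3 is assumed consistent, row can be deleted.
      return State.PRUNABLE;
    }
    if (ageMs > ttl.toMillis()) {
      // Older than TTL: keep the row, but stop treating it as authoritative.
      return State.PAST_TTL;
    }
    return State.FRESH;
  }
}
```

With prune time = 24 hours and TTL = 1 second, an entry written 2 seconds ago 
would land in PAST_TTL: still in the table, but no longer trusted for 
short-circuiting.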

This implies that when a row in Dynamo is older than TTL:

(a) if it is a directory, we clear the auth bit before returning results to the 
FS client (S3A).

(b) if it is a file, we may want to check both S3 and Dynamo instead of 
skipping S3, which is the current behavior.

I think (b) could be followup work done later on.
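
A minimal sketch of (a), assuming the listing carries its own last-written 
timestamp.  DirListingSketch and withTtlApplied are hypothetical names, not 
the real S3Guard directory listing classes; the point is only that the auth 
bit is cleared on the way out while the stored row is left alone.

```java
/** Hypothetical sketch of rule (a); not an existing S3Guard class. */
final class DirListingSketch {
  final boolean authoritative;
  final long lastWrittenMs; // assumed "entry last written time", in ms

  DirListingSketch(boolean authoritative, long lastWrittenMs) {
    this.authoritative = authoritative;
    this.lastWrittenMs = lastWrittenMs;
  }

  /**
   * Return the listing to hand to the FS client. If the listing is older
   * than the TTL, a copy with the auth bit cleared is returned; otherwise
   * the listing is returned unchanged.
   */
  DirListingSketch withTtlApplied(long nowMs, long ttlMs) {
    boolean expired = nowMs - lastWrittenMs > ttlMs;
    if (authoritative && expired) {
      return new DirListingSketch(false, lastWrittenMs);
    }
    return this;
  }
}
```

Once the auth bit is cleared, the client would fall back to listing S3 and 
could write a fresh authoritative copy back to Dynamo.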

Let me know if this makes sense. 

> s3guard: implement time-based (TTL) expiry for DynamoDB Metadata Store
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-15621
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15621
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Aaron Fabbri
>            Assignee: Gabor Bota
>            Priority: Minor
>
> Similar to HADOOP-13649, I think we should add a TTL (time to live) feature 
> to the Dynamo metadata store (MS) for S3Guard.
> Think of this as the "online algorithm" version of the CLI prune() function, 
> which is the "offline algorithm".
> Why: 
> 1. Self healing (soft state): since we do not implement transactions around 
> modification of the two systems (s3 and metadata store), certain failures can 
> lead to inconsistency between S3 and the metadata store (MS) state.  Having a 
> time to live (TTL) on each entry in S3Guard means that any inconsistencies 
> will be time bound.  Thus "wait and restart your job" becomes a valid, if 
> ugly, way to get around any issues with FS client failure leaving things in a 
> bad state.
> 2. We could make manual invocation of `hadoop s3guard prune ...` unnecessary, 
> depending on the implementation.
> 3. Makes it possible to fix the problem that dynamo MS prune() doesn't prune 
> directories due to the lack of true modification time.
> How:
> I think we need a new column in the dynamo table "entry last written time".  
> This is updated each time the entry is written to dynamo.
> After that we can either
> 1. Have the client simply ignore / elide any entries that are older than the 
> configured TTL.
> 2. Have the client delete entries older than the TTL.
> The issue with #2 is it will increase latency if done inline in the context 
> of an FS operation. We could mitigate this some by using an async helper 
> thread, or probabilistically doing it "some times" to amortize the expense of 
> deleting stale entries (allowing some batching as well).
> Caveats:
> - Clock synchronization as usual is a concern. Many clusters already keep 
> clocks close enough via NTP. We should at least document the requirement 
> along with the configuration knob that enables the feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

