I have recently been looking at implementing object expiration in rgw. First,
a brief description of the feature:

S3 provides mechanisms to expire objects, and/or to transition them into a
different storage class. The feature works at the bucket level. Rules can be
set as to which objects will expire and/or be transitioned, and when. Objects
are selected by prefix; the configuration is not per-object. Time is set in
days (since object creation), and events are always rounded to the start of
the next day.

The rules can also work in conjunction with object versioning. When a
versioned object (a current object) expires, a delete marker is created.
Non-current versioned objects can be set to be removed after a specific
amount of time from the point where they became non-current.

As mentioned above, objects can be configured to transition to a different
storage class (e.g., Amazon Glacier). It is possible to configure an object
to be transitioned after a specific period, and to be completely removed
after a further period.

When reading an object's information, the response will specify when the
object is scheduled for removal. It is not yet clear to me whether the object
can still be accessed after that time, or whether it appears as gone
immediately (either when trying to access it, or when listing the bucket).

Rules cannot intersect: an object cannot be affected by more than one rule.
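
To make the rule model concrete, here is a rough sketch (in C++, since that's
what rgw is written in) of the per-rule data the S3 description above implies.
The type and field names are mine, not existing rgw or S3 structures:

  #include <string>

  // Hypothetical per-rule state implied by the S3 lifecycle feature.
  struct LifecycleRule {
    std::string id;                        // unique rule id within the bucket
    std::string prefix;                    // objects are selected by key prefix
    bool enabled = true;
    int expiration_days = 0;               // days after creation; 0 = no expiration
    int transition_days = 0;               // days after creation; 0 = no transition
    std::string transition_storage_class;  // target class, e.g. a Glacier-like tier
    int noncurrent_expiration_days = 0;    // versioned buckets: days after a
                                           // version becomes non-current
  };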

Swift provides a completely different object expiration system. In Swift,
expiration is set per object, with an explicit time at which it is to be
removed.

In accordance with previous work, I'll currently focus on an S3
implementation. We do not yet support object transition to a different
storage class, so either we implement that first, or our first lifecycle
implementation will not include it.

1. Lifecycle rules will be configured on the bucket instance info

We already hold the bucket instance info whenever we read an object, and it
is cached. Since rules affect objects by prefix, it will be quick and easy to
determine whether an object is covered by any lifecycle rule.
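
The per-request check is just a string comparison. A minimal sketch, reusing
the hypothetical LifecycleRule type above:

  #include <string>
  #include <vector>

  // Returns the rule covering this object, or nullptr if none does. Rules
  // cannot intersect, so the first match is the only possible match.
  const LifecycleRule* find_matching_rule(const std::vector<LifecycleRule>& rules,
                                          const std::string& object_name) {
    for (const auto& rule : rules) {
      if (rule.enabled &&
          object_name.compare(0, rule.prefix.size(), rule.prefix) == 0) {
        return &rule;
      }
    }
    return nullptr;
  }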

2. New bucket index objclass operation to list objects that need to be expired 
/ transitioned

The operation will get the existing rules as input, and will return the list
of objects that need to be handled. The request will be paged. Note that the
number of rules is constrained, so we only need to limit the number of
returned entries.
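
A rough sketch of what the op's input and output could look like; these are
not existing cls structures, just an illustration of the paging contract
(again reusing the LifecycleRule sketch from above):

  #include <cstdint>
  #include <string>
  #include <vector>

  // Hypothetical request: the bucket's rules plus paging controls.
  struct cls_rgw_lc_list_op {
    std::vector<LifecycleRule> rules;  // constrained in number by the S3 model
    std::string marker;                // resume point for the next page
    uint32_t max_entries = 1000;       // cap on entries per page
  };

  // Hypothetical reply: the objects due for handling, plus a truncation flag.
  struct cls_rgw_lc_list_ret {
    std::vector<std::string> entries;  // objects to expire / transition
    bool is_truncated = false;         // true if the caller should page again
  };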

3. Maintain a (sharded) list of bucket instances that have had lifecycle set on 
them

Whenever a new lifecycle rule is created on a bucket, update that list. It
will be kept as omap entries on objects in the log pool.
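
A sketch of how the sharding could look; the shard count, naming scheme, and
helper are assumptions rather than settled design:

  #include <functional>
  #include <string>

  static const int LC_NUM_SHARDS = 32;  // assumed shard count

  // Hash the bucket instance id to pick the log-pool object whose omap
  // will record it.
  std::string lc_shard_obj_name(const std::string& bucket_instance_id) {
    size_t shard = std::hash<std::string>{}(bucket_instance_id) % LC_NUM_SHARDS;
    return "lc." + std::to_string(shard);
  }

On rule creation, an omap_set() on the returned object (keyed by the bucket
instance id) would add the bucket to its shard.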

4. A new thread that will run daily to handle object expiration / transition

The (potentially more than one) thread will go over the lifecycle objects in
the log pool and try to take a lease on one; if successful, it will start
processing it (a rough sketch follows the list):
 - get list of buckets
 - for each bucket:
  - read rules
  - get list of objects affected by rules
  - for each object:
    - expire / transition
    - renew lease if needed
 - unlock log object
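
A sketch of the shape of that loop; every function below stands in for work
that doesn't exist yet, and is declared only to make the locking and
iteration structure explicit:

  #include <string>
  #include <vector>

  // Hypothetical helpers, all to be implemented.
  bool lock_shard(const std::string& shard_oid);   // take a lease (e.g. via cls lock)
  void renew_lease_if_needed(const std::string& shard_oid);
  void unlock_shard(const std::string& shard_oid);
  std::vector<std::string> list_buckets(const std::string& shard_oid);
  std::vector<LifecycleRule> read_rules(const std::string& bucket);
  std::vector<std::string> list_affected_objects(
      const std::string& bucket, const std::vector<LifecycleRule>& rules);
  void expire_or_transition(const std::string& bucket, const std::string& obj);

  void process_shard(const std::string& shard_oid) {
    if (!lock_shard(shard_oid)) {
      return;  // someone else owns this shard; move on to the next one
    }
    for (const auto& bucket : list_buckets(shard_oid)) {
      auto rules = read_rules(bucket);
      for (const auto& obj : list_affected_objects(bucket, rules)) {
        expire_or_transition(bucket, obj);
        renew_lease_if_needed(shard_oid);
      }
    }
    unlock_shard(shard_oid);
  }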

Note that this is racy: if a rule is removed after we read the rules, we're
still going to apply it. Reading through the Amazon API, they have similar
issues as far as I can tell. We can reduce the race window by verifying that
the rule is still in effect before removing each object. Since this
information should be cached, there is not much overhead.
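
A minimal sketch of that re-check, using the hypothetical helpers above; the
read_rules() call would normally be served from the cache:

  // Before acting on an object, confirm its rule still exists and is still
  // enabled; a rule deleted mid-pass is then skipped from that point on.
  void expire_if_rule_holds(const std::string& bucket, const std::string& obj,
                            const std::string& rule_id) {
    for (const auto& rule : read_rules(bucket)) {  // cached in the common case
      if (rule.id == rule_id && rule.enabled) {
        expire_or_transition(bucket, obj);
        return;
      }
    }
    // rule removed or disabled since the pass began: leave the object alone
  }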

5. When reading an object, check whether its bucket has a rule that affects
it. If so, reflect that in the response headers.
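
For reference, S3 exposes this through the x-amz-expiration response header,
e.g. x-amz-expiration: expiry-date="Fri, 21 Dec 2012 00:00:00 GMT",
rule-id="rule1". A sketch of composing the value (the helper is
hypothetical):

  #include <string>

  // Builds the value of the x-amz-expiration header from the (rounded)
  // expiry date and the id of the rule that scheduled it.
  std::string make_expiration_header(const std::string& expiry_date_rfc1123,
                                     const std::string& rule_id) {
    return "expiry-date=\"" + expiry_date_rfc1123 + "\", rule-id=\"" +
           rule_id + "\"";
  }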

6. Extend the RESTful API to support rule creation and removal, as well as
reading the list of rules per bucket (mirroring the S3 bucket lifecycle
subresource: PUT/GET/DELETE on /<bucket>?lifecycle).

7. (optional) Don't allow access to objects that have expired.

8. (optional) Don't list objects that have expired.

Not sure we need or want (7) and (8).


Yehuda