[ 
https://issues.apache.org/jira/browse/HDDS-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482345#comment-17482345
 ] 

Stephen O'Donnell commented on HDDS-6199:
-----------------------------------------

I said in the last community sync that I would write up my ideas around 
versioned buckets and how they might be able to provide a snapshot-like feature 
with less complexity:
h1. Versioned Buckets

AWS S3 has a concept of versioned buckets. When versioning is enabled on a 
bucket the following things change:

 * A delete of a key does not really delete it. Rather, it stores a delete 
marker internally.
 * Any put on a key automatically has a version ID attached to it.
 * If you put a key that already exists, it just adds a new version; the old 
one is retained.

If you attempt to get a key and don’t supply a version, the latest version will 
be returned. If you want an older version you can request it with the version.

I believe a standard listing on a bucket returns only the current version, but 
you can list all versions of all objects using the “versions” resource.

It is also possible to “really delete” a specific object version by issuing a 
delete on its key+version.
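
To make the semantics above concrete, here is a minimal in-memory sketch of how a versioned bucket behaves. It is a toy model only, not Ozone or S3 code: the class name, the method names and the integer version IDs are all assumptions made for illustration.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of S3-style versioning (hypothetical, for illustration)."""

    DELETE_MARKER = object()

    def __init__(self):
        # key name -> list of (version_id, value), oldest first
        self._versions = {}
        self._next_id = itertools.count(1)

    def put(self, key, value):
        # Every put gets a fresh version ID; older versions are retained.
        version_id = next(self._next_id)
        self._versions.setdefault(key, []).append((version_id, value))
        return version_id

    def delete(self, key):
        # A plain delete only records a delete marker as the newest version.
        return self.put(key, self.DELETE_MARKER)

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            # No version supplied: return the latest, unless it is a delete marker.
            if not versions or versions[-1][1] is self.DELETE_MARKER:
                raise KeyError(key)
            return versions[-1][1]
        for vid, value in versions:
            if vid == version_id and value is not self.DELETE_MARKER:
                return value
        raise KeyError((key, version_id))

    def delete_version(self, key, version_id):
        # "Really delete" one specific version of a key.
        kept = [v for v in self._versions.get(key, []) if v[0] != version_id]
        self._versions[key] = kept

    def list_current(self):
        # Standard listing: only keys whose newest version is not a delete marker.
        return sorted(k for k, vs in self._versions.items()
                      if vs and vs[-1][1] is not self.DELETE_MARKER)
```

After two puts and a delete on the same key, the key disappears from `list_current()`, but both older versions can still be fetched by version ID until `delete_version` removes them.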
h2. Versioned Buckets Look A Lot Like Snapshots

If you restrict the delete function, and then perhaps provide a way to remove 
all non-current versions of an object that are older than some retention 
period, a versioned bucket looks very much like a snapshot.

What is missing from AWS S3 is the ability to view the entire bucket as of 
some time in the past. I am also not sure if the version IDs are time-based or 
some internal value; if they are time-based, that may let you implement this 
via the listing.

The normal use case for a versioned bucket would be to access the most recent 
version of a key without specifying a version. Similarly, listing a bucket 
would normally want to list the most recent versions.

Considering that we use RocksDB in Ozone, there are a couple of ways I can 
think of implementing this.
h2. Single Table Approach

We create a composite key to store in RocksDB, together with a custom 
comparator. The composite key has two fields: the key name bytes and a 
timestamp. The comparator would ensure that, for a given key name, the newest 
version (biggest timestamp) is ordered first.

Using that and a scan with a prefix filter, you should be able to efficiently 
access the most recent version of any given key. CockroachDB uses a technique 
like this:

[https://cockroachlabs.com/blog/cockroachdb-on-rocksd/#blitzing-through-more-rocksdb-features]

[https://cockroachlabs.com/blog/cockroachdb-on-rocksd/#fast-scans]

Point lookups for a key + version would be fast too.

Finding the first version of a key older than a given time would also be fast 
via a scan, even with lots of key versions.
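
A rough sketch of the composite key idea follows. Note that it swaps in a different technique from the custom comparator described above: the timestamp is stored inverted, so that plain bytewise ordering already puts the newest version first. The function names, the NUL separator (which assumes key names never contain a NUL byte) and the 64-bit integer timestamp are all assumptions for illustration, and a sorted Python list stands in for a RocksDB iterator seek.

```python
import bisect
import struct

MAX_TS = 2**64 - 1  # assumed: timestamps fit in an unsigned 64-bit integer


def encode(key_name: str, ts: int) -> bytes:
    # Inverting the timestamp makes bytewise ordering put the newest version first.
    return key_name.encode("utf-8") + b"\x00" + struct.pack(">Q", MAX_TS - ts)


def decode(composite: bytes):
    # Layout: name bytes, one NUL separator byte, 8 bytes of inverted timestamp.
    name, inverted = composite[:-9], composite[-8:]
    return name.decode("utf-8"), MAX_TS - struct.unpack(">Q", inverted)[0]


def seek(sorted_keys, key_name, as_of=MAX_TS):
    """Return (name, ts) of the newest version with ts <= as_of, or None."""
    target = encode(key_name, as_of)
    i = bisect.bisect_left(sorted_keys, target)  # stand-in for an iterator seek
    prefix = key_name.encode("utf-8") + b"\x00"
    if i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        return decode(sorted_keys[i])
    return None
```

With this layout, `seek(keys, "a")` lands on the newest version of `a`, and `seek(keys, "a", as_of=t)` lands on the first version at or before `t`, in both cases with a single seek regardless of how many versions the key has.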

The big negative is that listings will need to iterate over all versions of 
all keys starting from whatever prefix is provided, and then filter out all but 
the most recent. This will make the normal listing of the current state slow.

If you want to look at a time in the past, you need to list objects starting 
from a prefix, skip any versions newer than the time you care about, keep only 
the latest version of each key from before that time, and filter out all older 
ones.

The listing that remains is a “point in time” view of the bucket. The time 
taken to provide this scan will be proportional to the size of the prefix range 
and the number of objects in the bucket. Each version of a key counts as an 
object, so 10 versions of 1 key would count as 10.
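
The filtering for a point-in-time listing could look roughly like the sketch below. It assumes the scan yields `(key_name, timestamp, is_delete_marker)` tuples ordered by key name and then newest first, and it assumes that a key whose newest visible version is a delete marker should be hidden. The function name and tuple layout are hypothetical.

```python
from itertools import groupby


def list_as_of(entries, prefix, as_of):
    """Return (key_name, timestamp) for each key visible at time as_of.

    entries: (key_name, timestamp, is_delete_marker) tuples, sorted by
    key name and then newest timestamp first.
    """
    result = []
    in_range = (e for e in entries if e[0].startswith(prefix))
    for name, versions in groupby(in_range, key=lambda e: e[0]):
        for _, ts, deleted in versions:
            if ts <= as_of:
                # First version at or before as_of is the one visible then.
                if not deleted:
                    result.append((name, ts))
                break  # all remaining versions of this key are older
    return result
```

Every version inside the prefix range is still visited once, which is why the cost grows with the number of versions and not just the number of keys.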
h2. Two Table Approach

Similar to above with the composite key, we store all keys in two tables, 
CURRENT and HISTORICAL.

When a new version is created, we delete the existing RocksDB entry for the key 
from CURRENT and add an entry for the new version. HISTORICAL gets the new 
entry too. This means HISTORICAL has everything, while CURRENT acts like an 
index for the most recent version of each key.

This makes puts and deletes slightly more expensive, but a list on the live 
version has the same performance as before, and this would be the common use 
case.

To list all versions of a key, you query only HISTORICAL.
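
As a sketch of the bookkeeping this implies, the toy model below uses two Python dicts in place of the two RocksDB tables. The class and method names are hypothetical, and the handling of deletes (a marker in HISTORICAL, removal from CURRENT) is an assumption rather than something settled above.

```python
class TwoTableBucket:
    """Toy model of the CURRENT / HISTORICAL split (hypothetical, for illustration)."""

    def __init__(self):
        self.current = {}     # key name -> timestamp of the live version
        self.historical = {}  # (key name, timestamp) -> value, None = delete marker

    def put(self, name, ts, value):
        # Two writes per put: HISTORICAL keeps everything, CURRENT is replaced.
        self.historical[(name, ts)] = value
        self.current[name] = ts

    def delete(self, name, ts):
        # Assumed behaviour: record a delete marker and drop the live entry.
        self.historical[(name, ts)] = None
        self.current.pop(name, None)

    def get(self, name):
        # Raises KeyError if the key has no live version.
        return self.historical[(name, self.current[name])]

    def list_current(self, prefix=""):
        # Never touches old versions, so it costs the same as a non-versioned list.
        return sorted(n for n in self.current if n.startswith(prefix))

    def versions(self, name):
        # A real implementation would do a range scan over HISTORICAL instead.
        return sorted((ts for n, ts in self.historical if n == name), reverse=True)
```

The extra write on every put and delete is the price for keeping `list_current` independent of how many old versions exist.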

If you consider snapshot use cases:

 * Get a historical version of a key with a known version -> point lookup in 
HISTORICAL
 * Get the latest version of a key before a given time -> single row scan 
starting at key:<time>
 * Get a list of key versions for a single key -> reasonably tight range scan.
 * Get the most current version of a key -> point lookup in CURRENT.
 * Listing of current files -> performs as well as non-versioned buckets.
 * Listing of all versions -> performs as well as possible, and proportional to 
number of objects times versions.
 * Listing at a point in time -> potentially expensive, but depends on the 
prefix range requested. AWS does not provide this feature directly (so far as I 
can tell).

Snapshot diffs could be created trivially by having a secondary index of 
timestamp -> key, which would let us efficiently see everything that has 
happened to a bucket between two points in time.
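
A minimal sketch of such a diff, assuming the secondary index can be read as a sorted list of `(timestamp, key_name)` pairs with integer timestamps. The function name is hypothetical, and a binary search over a Python list stands in for a range scan on the index.

```python
import bisect


def changed_keys(index, start, end):
    """Return the key names that changed in the interval (start, end].

    index: sorted list of (timestamp, key_name) tuples, integer timestamps.
    """
    # A one-element tuple (t,) sorts before every (t, name), so these two
    # searches bracket exactly the entries with start < timestamp <= end.
    lo = bisect.bisect_left(index, (start + 1,))
    hi = bisect.bisect_left(index, (end + 1,))
    return sorted({name for _, name in index[lo:hi]})
```

The cost depends only on the number of changes between the two points in time, not on the total size of the bucket.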

Questions:

 * How quickly can RocksDB provide rows for large scans to filter out all the 
old versions?
 * Could something like this work with filesystem buckets, or would it only 
work for object store buckets? Filesystem buckets might be more tricky, as we 
would need to version all keys under a directory and handle the entire 
directory being deleted, moved, etc. These are some of the things that have 
proven difficult (and resulted in bugs) in HDFS snapshots.

> Support for object store Snapshots in Ozone
> -------------------------------------------
>
>                 Key: HDDS-6199
>                 URL: https://issues.apache.org/jira/browse/HDDS-6199
>             Project: Apache Ozone
>          Issue Type: New Feature
>            Reporter: Neil Joshi
>            Assignee: Neil Joshi
>            Priority: Major
>         Attachments: ozone-snapshot-objectstore-design.pdf
>
>
> Support for object storage snapshots in ozone.  Snapshots taken at the bucket 
> level.  Supporting instantaneous snapshots, snapshot diffs, 1000s of 
> concurrent snapshots, admin control for snapshot CRUD, snapshots restored in 
> any order, among others (see requirements in design doc). 
> Use cases include data backup, disaster recovery and protection against user 
> errors.
>  
> google doc design doc (pdf in attachments),
> https://docs.google.com/document/d/18cicsI5085zQp8KQeYlj-gGgoQosNtXq/edit?usp=sharing&ouid=100641055930545452800&rtpof=true&sd=true



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
