[
https://issues.apache.org/jira/browse/HDDS-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482345#comment-17482345
]
Stephen O'Donnell commented on HDDS-6199:
-----------------------------------------
I said in the last community sync that I would write up my ideas around
versioned buckets and how they might be able to provide a snapshot-like feature
with less complexity:
h1. Versioned Buckets
AWS S3 has a concept of versioned buckets. When versioning is enabled on a
bucket the following things change:
* A delete of a key does not really delete it. Rather it stores a delete marker
internally.
* Any put on a key automatically has a version ID attached to it.
* If you put a key that already exists, it just adds a new version; the old one
is retained.
* If you attempt to get a key and don’t supply a version, the latest version
will be returned. If you want an older version you can request it with the
version.
* I believe a standard listing on a bucket returns only the current version, but
you can list all versions of all objects using the “versions” resource.
* It is also possible to “really delete” a specific object version by issuing a
delete on its key+version.
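The behaviour listed above can be sketched as a toy in-memory model. This is only an illustration of the semantics, not how S3 or Ozone implements them: the class name, the method names and the use of increasing integers as version IDs are all assumptions made for the example.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a versioned bucket (illustrative only)."""

    _DELETE_MARKER = object()  # hypothetical stand-in for a stored delete marker

    def __init__(self):
        self._versions = {}  # key name -> list of (version_id, value), oldest first
        self._next_id = itertools.count(1)  # assumption: integer version IDs

    def put(self, key, value):
        # Every put gets a new version ID; older versions are retained.
        version_id = next(self._next_id)
        self._versions.setdefault(key, []).append((version_id, value))
        return version_id

    def delete(self, key, version_id=None):
        if version_id is None:
            # A plain delete only records a delete marker as the newest version.
            return self.put(key, self._DELETE_MARKER)
        # A delete on key+version really removes that one version.
        self._versions[key] = [
            (v, val) for v, val in self._versions.get(key, []) if v != version_id
        ]
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            # No version supplied: return the latest, unless it is a delete marker.
            if not versions or versions[-1][1] is self._DELETE_MARKER:
                raise KeyError(key)
            return versions[-1][1]
        for v, val in versions:
            if v == version_id and val is not self._DELETE_MARKER:
                return val
        raise KeyError((key, version_id))

    def list_current(self):
        # A standard listing shows only keys whose newest version is live.
        return sorted(
            k
            for k, vs in self._versions.items()
            if vs and vs[-1][1] is not self._DELETE_MARKER
        )
```

After a plain delete the key disappears from `list_current()`, but `get(key, version_id)` on an earlier version still succeeds, which is the property the rest of this write-up relies on.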
h2. Versioned Buckets Look A Lot Like Snapshots
If you restrict the delete function, and then perhaps provide a way to remove
all non-current versions of an object that are older than some retention
period, a versioned bucket looks very much like a snapshot.
What is missing from AWS S3 is the ability to view the entire bucket as of
some time in the past. I am also not sure whether the version IDs are time
based or some internal value; if they are time based, that may let you
implement this via the listing.
The normal use case for a versioned bucket would be to access the most recent
version of a key without specifying a version. Similarly, listing a bucket
would normally want to list the most recent versions.
Considering that we use RocksDB in Ozone, there are a couple of ways I can
think of implementing this.
h3. Single Table Approach
We create a composite key to store in RocksDB and a custom comparator. The
composite key will consist of two fields: the key name bytes and an encoded
timestamp. The comparator would ensure that, for a given key name, the newest
version (biggest timestamp) is ordered first.
Using that and a scan with a prefix filter, you should be able to efficiently
access the most recent version of any given key. CockroachDB uses a technique
like this:
[https://cockroachlabs.com/blog/cockroachdb-on-rocksd/#blitzing-through-more-rocksdb-features]
[https://cockroachlabs.com/blog/cockroachdb-on-rocksd/#fast-scans]
Point lookups for a key + version would be fast too.
Finding the first version of a key older than a given time would also be fast
via a scan, even with lots of key versions.
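A minimal sketch of the ordering idea, using a sorted Python list in place of RocksDB. The negated timestamp in the tuple is an assumed stand-in for the custom comparator, and `SingleTable`, `composite_key` and the method names are hypothetical; the point is only that one seek lands on the version you want.

```python
import bisect


def composite_key(name, timestamp):
    # Negating the timestamp makes the biggest timestamp sort first within a
    # key name. This stands in for the custom comparator (an assumption).
    return (name, -timestamp)


class SingleTable:
    """Sorted in-memory stand-in for one RocksDB table (illustrative only)."""

    def __init__(self):
        self._keys = []    # composite keys, kept in sorted order
        self._values = {}  # composite key -> value

    def put(self, name, timestamp, value):
        ck = composite_key(name, timestamp)
        if ck not in self._values:
            bisect.insort(self._keys, ck)
        self._values[ck] = value

    def get_version(self, name, timestamp):
        # Point lookup for key + version.
        return self._values.get(composite_key(name, timestamp))

    def get_as_of(self, name, timestamp):
        # Seek to (name, timestamp). Because versions are ordered newest first,
        # the first entry at or after the seek position is the newest version
        # that is not newer than `timestamp`.
        i = bisect.bisect_left(self._keys, composite_key(name, timestamp))
        if i < len(self._keys) and self._keys[i][0] == name:
            ck = self._keys[i]
            return -ck[1], self._values[ck]
        return None

    def get_latest(self, name):
        # Seeking with an infinite timestamp lands on the newest version.
        return self.get_as_of(name, float("inf"))
```

Both `get_latest` and `get_as_of` touch a single entry after the seek, regardless of how many versions the key has.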
The big negative is that listings will need to iterate over all versions of
all keys starting from whatever prefix is provided, and then filter out all but
the most recent. This will make the normal listing of the current state slow.
If you want to look at a time in the past, you need to list objects starting
from a prefix, filter out any versions newer than the time you care about, and
for each key keep only the latest remaining version, discarding all older
ones.
The listing that remains is a “point in time” view of the bucket. The time
taken to provide this scan will depend on the prefix being listed and on the
number of objects in the bucket. Each version of a key counts as an object, so
10 versions of 1 key would count as 10.
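The filtering described above could look roughly like the following. It assumes the rows arrive ordered by key name and then newest version first, and that a delete marker is stored as a `None` value; `list_as_of` and the row layout are hypothetical. A real scan would seek to the prefix and stop once names no longer match, whereas this sketch simply filters.

```python
def list_as_of(rows, prefix, as_of):
    """Return a point-in-time listing.

    rows: iterable of (name, timestamp, value), sorted by name and then by
    newest timestamp first. A value of None is treated as a delete marker
    (an assumption for this sketch).
    """
    result = []
    last_seen = None
    for name, timestamp, value in rows:
        if not name.startswith(prefix):
            continue
        if timestamp > as_of:
            continue  # newer than the time we care about
        if name == last_seen:
            continue  # an older version of a key we have already resolved
        last_seen = name
        if value is not None:
            result.append((name, timestamp, value))
    return result
```

Every version under the prefix is visited even though at most one per key is returned, which is the cost the paragraph above describes.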
h3. Two Table Approach
As above, we use the composite key, but we store keys in two tables,
CURRENT and HISTORICAL.
When a new version is created, we delete the key's existing RocksDB entry from
CURRENT and add a new entry for the new version. HISTORICAL gets the new entry
too. This means HISTORICAL has everything, but CURRENT is like an index for the
most recent version of each key.
This makes puts and deletes slightly more expensive, but a list on the live
version has the same performance as before, and this would be the common use
case.
To list all versions of a key, you query only HISTORICAL.
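A rough sketch of the write path, with two dicts standing in for the two RocksDB tables. The class and method names are hypothetical, and writing a delete marker (`None`) into HISTORICAL on delete is an assumption carried over from the versioned-bucket semantics rather than something settled here. In a real implementation the two writes would presumably need to be applied atomically.

```python
class TwoTableBucket:
    """CURRENT/HISTORICAL stand-in using plain dicts (illustrative only)."""

    def __init__(self):
        self.current = {}     # key name -> (timestamp, value), newest only
        self.historical = {}  # (key name, timestamp) -> value, every version

    def put(self, name, timestamp, value):
        # Replace the entry in CURRENT and append the version to HISTORICAL.
        self.current[name] = (timestamp, value)
        self.historical[(name, timestamp)] = value

    def delete(self, name, timestamp):
        # Assumption: a delete drops the key from CURRENT and records a
        # delete marker (None) in HISTORICAL.
        self.current.pop(name, None)
        self.historical[(name, timestamp)] = None

    def list_current(self, prefix=""):
        # Touches one entry per live key, as in a non-versioned bucket.
        return sorted(n for n in self.current if n.startswith(prefix))

    def list_versions(self, name):
        # Newest first. A real table would use a range scan, not a full pass.
        return sorted(
            (ts for (n, ts) in self.historical if n == name), reverse=True
        )
```

The extra cost is visible in `put` and `delete`, which each touch both tables, while `list_current` never sees old versions.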
If you consider snapshot use cases:
* Get a historical version of a key with a known version -> point lookup in
HISTORICAL
* Get the latest version of a key before a given time -> single row scan
starting at key:<time>
* Get a list of key versions for a single key -> Reasonably tight range scan.
* Get the most current version of a key -> point lookup in CURRENT.
* Listing of current files -> performs as well as non-versioned buckets.
* Listing of all versions -> performs as well as possible, and proportional to
number of objects times versions.
* Listing at a point in time -> potentially expensive, but depends on the
prefix range requested. AWS does not provide this feature directly (so far as I
can tell).
Snapshot diffs could be created trivially by having a secondary index of
timestamp -> key, which would let us efficiently see everything that has
happened to a bucket between two points in time.
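As a sketch of how such an index might be queried, the following keeps (timestamp, key name) pairs in sorted order and answers a diff with a single range scan. `ChangeIndex` and its methods are hypothetical names, and the half-open interval [start, end) is an arbitrary choice for the example.

```python
import bisect


class ChangeIndex:
    """Secondary index of timestamp -> key name (illustrative only)."""

    def __init__(self):
        self._entries = []  # sorted list of (timestamp, key name)

    def record(self, timestamp, name):
        # Called for every put or delete on the bucket.
        bisect.insort(self._entries, (timestamp, name))

    def changed_between(self, start, end):
        # Key names with at least one change in [start, end).
        # A 1-tuple (t,) sorts before every (t, name), so it acts as a
        # lower bound for timestamp t.
        lo = bisect.bisect_left(self._entries, (start,))
        hi = bisect.bisect_left(self._entries, (end,))
        return sorted({name for _, name in self._entries[lo:hi]})
```

The cost of the query depends on the number of changes in the interval, not on the total number of keys or versions in the bucket.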
Questions:
* How quickly can RocksDB provide rows for large scans to filter out all the
old versions?
* Could something like this work with FileSystem buckets, or would it only
work for object store buckets? FileSystem buckets might be more tricky, as we
would need to version all keys under a directory and handle the entire
directory being deleted, moved etc. These are some of the things that have
proven difficult (and resulted in bugs) in HDFS snapshots.
> Support for object store Snapshots in Ozone
> -------------------------------------------
>
> Key: HDDS-6199
> URL: https://issues.apache.org/jira/browse/HDDS-6199
> Project: Apache Ozone
> Issue Type: New Feature
> Reporter: Neil Joshi
> Assignee: Neil Joshi
> Priority: Major
> Attachments: ozone-snapshot-objectstore-design.pdf
>
>
> Support for object storage snapshots in ozone. Snapshots taken at the bucket
> level. Supporting instantaneous snapshots, snapshot diffs, 1000s of
> concurrent snapshots, admin control for snapshot CRUD, snapshots restored in
> any order, among others (see requirements in design doc).
> Use cases include data backup, disaster recovery and protection against user
> errors.
>
> google doc design doc (pdf in attachments),
> https://docs.google.com/document/d/18cicsI5085zQp8KQeYlj-gGgoQosNtXq/edit?usp=sharing&ouid=100641055930545452800&rtpof=true&sd=true