[
https://issues.apache.org/jira/browse/HDDS-8342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841064#comment-17841064
]
Ritesh Shukla commented on HDDS-8342:
-------------------------------------
This is an excellent feature. I would prefer more focused, detailed designs in
markdown as a PR; that makes it a lot easier to give pointed comments and
discuss.
A few points:
# We do not need to make retention specific to S3; objects ingested via Hadoop
APIs should inherit the feature.
# The actual loop to scan objects for retention will need a more detailed
design. OM scales to billions of objects, so making sure the scanner
implementation is efficient will be important. It is also possible to defer the
actual walk of the objects to Recon and have Recon invoke an OM API to
revalidate the configuration for an object in a bucket. This way Recon can walk
its copy of OM's data, and even if that copy is stale, the final validation
will happen in OM. Just a thought off the top of my head.
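
The Recon-walks, OM-revalidates idea in the second point can be sketched as a
two-phase scan. This is only an illustration under stated assumptions, not
Ozone code: `KeyInfo`, `find_candidates` and `revalidate_and_delete` are
hypothetical names, keys are plain dictionary entries, and key age is counted
in whole days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyInfo:
    name: str
    created_day: int  # hypothetical: creation time as a day number


def is_expired(key, prefix, expire_after_days, today):
    """A key is eligible if it matches the prefix and is old enough."""
    return (key.name.startswith(prefix)
            and today - key.created_day >= expire_after_days)


def find_candidates(stale_snapshot, prefix, expire_after_days, today):
    """Phase 1 (Recon side, hypothetical): walk a possibly stale copy."""
    return [k.name for k in stale_snapshot.values()
            if is_expired(k, prefix, expire_after_days, today)]


def revalidate_and_delete(live_keys, candidates, prefix,
                          expire_after_days, today):
    """Phase 2 (OM side, hypothetical): recheck each candidate against
    the authoritative state before deleting it."""
    deleted = []
    for name in candidates:
        key = live_keys.get(name)
        if key is not None and is_expired(key, prefix,
                                          expire_after_days, today):
            del live_keys[name]
            deleted.append(name)
    return deleted
```

A key that was rewritten after the snapshot was taken shows up as a candidate
in phase 1 but is rejected in phase 2, so a stale copy can only cause wasted
checks, never a wrong deletion.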
> AWS S3 Lifecycle Configurations
> -------------------------------
>
> Key: HDDS-8342
> URL: https://issues.apache.org/jira/browse/HDDS-8342
> Project: Apache Ozone
> Issue Type: New Feature
> Components: OM, S3
> Reporter: Mohanad Elsafty
> Assignee: Mohanad Elsafty
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2023-03-31-12-42-46-971.png
>
>
> I had the need for a retention solution in my cluster (delete keys in
> specific paths after some time). The idea was very similar to AWS S3
> Lifecycle configurations (Expiration part).
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html]
> I made a design and already implemented most of it, and would like to
> contribute it back to the Apache Ozone community.
> h2. Here is what is included
> # Users should be able to create/remove/fetch lifecycle configurations for a
> specific S3 bucket.
> # The lifecycle configurations will be executed periodically.
> # Depending on the rules of the lifecycle configuration there could be
> different actions or even multiple actions.
> # At the moment only expiration is supported (keys get deleted).
> # Lifecycle configurations support all buckets, not only S3 buckets.
>
> h1. Design
> !image-2023-03-31-12-42-46-971.png!
>
> h2. Components
> # A lifecycle configuration (stored in the DB) consists of volumeName,
> bucketName and a list of rules.
> ** A rule contains a prefix (string), an Expiration and an optional Filter.
> ** Expiration contains either days (integer) or Date (long).
> ** Filter contains a prefix (string).
> # The S3G bucket endpoint needs a few updates to accept ?/lifecycle
> # ClientProtocol and all implementers provide (get, list, delete and
> create) operations for lifecycle configurations.
> # RetentionManager will run periodically.
> ** Fetches the list of lifecycle configurations with the help of OM
> ** Executes each lifecycle configuration on a specific bucket
> ** Lifecycle configurations will run in parallel (each one against a
> different bucket).
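
The components listed above can be pictured with a small data model. This is a
sketch under assumptions, not the proposed implementation: the class and method
names are illustrative, Date is assumed to be epoch milliseconds, and a Filter
prefix is assumed to take precedence over the rule prefix when both are
present.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

DAY_MS = 86_400_000  # milliseconds per day


@dataclass
class Expiration:
    days: Optional[int] = None
    date: Optional[int] = None  # assumption: epoch milliseconds

    def __post_init__(self):
        # "either days or Date": exactly one of the two must be set.
        if (self.days is None) == (self.date is None):
            raise ValueError("exactly one of days or date must be set")

    def is_due(self, created_ms: int, now_ms: int) -> bool:
        if self.days is not None:
            return now_ms - created_ms >= self.days * DAY_MS
        return now_ms >= self.date


@dataclass
class Filter:
    prefix: str


@dataclass
class Rule:
    expiration: Expiration
    prefix: str = ""
    filter: Optional[Filter] = None

    def matches(self, key: str, created_ms: int, now_ms: int) -> bool:
        # Assumption: the filter prefix wins over the rule prefix.
        prefix = self.filter.prefix if self.filter else self.prefix
        return key.startswith(prefix) and self.expiration.is_due(
            created_ms, now_ms)


@dataclass
class LifecycleConfiguration:
    volume_name: str
    bucket_name: str
    rules: List[Rule] = field(default_factory=list)

    def expired_keys(self, keys: Dict[str, int], now_ms: int) -> List[str]:
        """keys maps key name to creation time in epoch milliseconds."""
        return [name for name, created in keys.items()
                if any(r.matches(name, created, now_ms) for r in self.rules)]
```

A periodic job in the spirit of RetentionManager would call something like
`expired_keys` once per configured bucket and issue deletions for the result.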
> h2. Flow
> # Users PUT/GET/DELETE lifecycle configurations via S3Gateway.
> # The lifecycle configuration details will be sent to a handler to be
> processed.
> # The lifecycle configurations will be saved to/fetched from the DB.
> # RetentionManager will be running periodically in the Leader OM to execute
> these lifecycle configurations.
> # RetentionManager will be issuing deletions for eligible keys.
>
> h2. Not a complete solution
> The solution lacks some interesting features for example:
> * The filter doesn't support `AND` yet.
> * Only expiration is supported.
> * A CLI to manage lifecycle configurations for all the buckets (at the
> moment S3G is the only supported entry point).
> These kinds of features can be added in the future.
>
>
> *I made some decisions that must be discussed before contributing (Current
> design)*
> Lifecycle configurations will be stored in their own column family in the DB
> instead of being a field in the {*}OmBucketInfo{*}.
> I preferred the lifecycle configuration to have its own table for two reasons:
> # No need to modify the OmBucketInfo table.
> # Given the way the retention manager works, it will query only the
> buckets that have an attached lifecycle configuration. If the lifecycle
> configuration were a field in OmBucketInfo, it would have to query all the
> buckets and filter for the ones that have a LifecycleConfiguration.
> If the other way is preferred, then I will get rid of
> LifecycleConfigurationsManager & the new codec.
>
> To summarize this:
>
> ||A new table for lifecycle configurations||A new field in OmBucketInfo||
> |A new table|Existing table|
> |Efficient query|Less efficient|
> |A new manager (lifecycle manager)|No need|
> |A new codec |No need|
> |No need to alter existing design|Need to update the existing design|
> |Need to update bucket deletion: delete the linked lifecycle configuration when the bucket is deleted.|No need for updates|
> | |Needs updates to create, get, list and delete lifecycle configuration in the BucketManager.|
>
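
The "efficient query" and "bucket deletion" rows of the comparison can be made
concrete with a toy model. Plain dictionaries stand in for the DB tables here;
the function names and the "/volume/bucket" key layout are assumptions for
illustration, not the actual OM schema.

```python
def configs_via_bucket_field(bucket_table):
    """Option B: lifecycle stored as a field of the bucket record.
    Every bucket must be visited to find the configured ones."""
    visited = 0
    found = {}
    for key, info in bucket_table.items():
        visited += 1
        if info.get("lifecycle") is not None:
            found[key] = info["lifecycle"]
    return found, visited


def configs_via_own_table(lifecycle_table):
    """Option A: dedicated lifecycle table.
    Only buckets that have a configuration are visited."""
    return dict(lifecycle_table), len(lifecycle_table)


def delete_bucket(bucket_table, lifecycle_table, key):
    """With a dedicated table, bucket deletion must also remove the
    linked lifecycle configuration so that no orphan is left behind."""
    bucket_table.pop(key, None)
    lifecycle_table.pop(key, None)
```

With 1000 buckets of which 3 carry a configuration, the field-based lookup
visits 1000 records and the dedicated table visits 3; the price is the extra
cleanup step in `delete_bucket`.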
>
> h2. Plan for contribution
> The implementation is too large to review in one piece. I believe it needs
> to be split into a few merge requests for better review. Here is my suggested
> breakdown.
> # Basic building blocks (lifecycle configuration, rule, expiration, ...) and
> the related table (if needed).
> # ClientProtocol & OzoneManager new operations (create, get, list, delete)
> lifecycle configurations (protobuf messages as well)
> # S3G endpoints updates.
> # The retention manager.
> # All of them to be merged into a new branch (Let's call it X)
> # Then merge branch X into master.
>
> Please feel free to review the design and ask for more clarifications if
> needed.
>