abhishekrb19 opened a new issue, #14330: URL: https://github.com/apache/druid/issues/14330
### Motivation

While Druid's [rules](https://druid.apache.org/docs/latest/operations/rule-configuration.html) (load, drop, and broadcast rules) and [kill tasks](https://druid.apache.org/docs/latest/data-management/delete.html) are powerful, they can be complex to use and understand, especially in the context of retention. Druid users need to think about the lifecycle of segments (used/unused), map them to tiered replicants, and add the appropriate imperative rules in the correct order to the rule chain.

### Proposed changes

At a high level, users can define a storage policy for the hot tier (aka historical tier) and the deep storage. To that effect, introduce a storage policy API that translates user-defined policies to one or more load and drop rules under the hood.

#### New API

`/druid/coordinator/v1/storagePolicy/<dataSource>`

The API will accept two parameters in the create payload:
- `hot`: Defines how long to keep the data in the hot tier(s) (aka historical tiers)
- `retain`: Defines how long to retain the data before it's cleaned up permanently, including data from the deep storage and metadata store

#### Translation of storage policy to load & drop rules

A few use cases, along with the storage policy payloads and the corresponding internal load/drop rules, are shown below:

|Intent | Storage Policy | Load/Drop Rule |
|------ | ------ | --------- |
|Keep the most recent hour of data <br> in the hot tier and permanently delete <br> all data older than 30 days.|<pre>{<br> "hot": {<br> "type": "period",<br> "period": "PT1H"<br> },<br> "retain": {<br> "type": "period",<br> "period": "P30D"<br> }<br>}</pre>|<pre>[<br> {<br> "type": "loadByPeriod",<br> "period": "PT1H",<br> "tieredReplicants": {<br> "_default_tier": 1<br> }<br> },<br> {<br> "type": "dropBeforeByPeriod",<br> "period": "P30D"<br> },<br> {<br> "type": "loadForever",<br> "tieredReplicants": {<br> "_default_tier": 0<br> }<br> }<br>]</pre>|
| Drop all data older <br> than 30 days from the hot tier.| <pre>{<br> "hot": {<br> "type": "period",<br> "period": "P30D"<br> }<br>}</pre> |<pre>[<br> {<br> "type": "loadByPeriod",<br> "period": "P30D",<br> "tieredReplicants": {<br> "_default_tier": 1<br> }<br> },<br> {<br> "type": "loadForever",<br> "tieredReplicants": {<br> "_default_tier": 0<br> }<br> }<br>]</pre>|
| Delete all data older <br> than 60 days.| <pre>{<br> "retain": {<br> "type": "period",<br> "period": "P60D"<br> }<br>}</pre> |<pre>[<br> {<br> "type": "dropBeforeByPeriod",<br> "period": "P60D"<br> },<br> {<br> "type": "loadForever"<br> }<br>]</pre>|

#### Extensibility & Maintainability

Similar to the above period-based policies, we can add interval-based and custom tiered policies for more advanced users. For example:

a. Interval-based policy:
```json
{
  "hot": {
    "type": "intervals",
    "intervals": ["2020-01-01/2022-01-01", "2023-01-01/9999-01-01"]
  }
}
```

b. Custom-tiered policy:
```json
{
  "hot": {
    "type": "tiered",
    "tiers": {
      "hot1": {"type": "period", "period": "P60D"},
      "hot2": {"type": "period", "period": "P90D"}
    }
  },
  "retain": {"type": "period", "period": "P1Y"}
}
```

The API will need to translate user-defined storage policies to rules as we extend support to cover more complex use cases.

#### High-level implementation

The API implementation will support `POST`, `GET` and `DELETE` operations to create, retrieve and delete any configured storage policy per data source. Similar to the `rules` endpoint, this new endpoint should be on the coordinator and should return appropriate error/status codes to the user. The implementation of the API will:
- Validate ISO 8601 periods
- Validate interval strings
- Check that `hot.period` is not larger than `retain.period`
- Disallow `retain` if auto-kill configuration is disabled

### Rationale

The main benefit of the API is that it abstracts away the complex inner workings of load, drop and kill rules. It provides a declarative interface for thinking about retention, as many other systems do.
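To make the period-based translation in the table under "Translation of storage policy to load & drop rules" concrete, here is a minimal Python sketch that reproduces the three rows of that table. It is not the proposed coordinator implementation: the function name `policy_to_rules` is hypothetical, the tier name `_default_tier` is taken from the table's examples, and rejecting an empty policy is an assumption. Period validation and the `hot.period` versus `retain.period` check are left out.

```python
from typing import Any, Dict, List

# Tier name as used in the proposal's example rules.
DEFAULT_TIER = "_default_tier"


def policy_to_rules(policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Translate a period-based storage policy into an ordered rule chain.

    Hypothetical helper: it mirrors only the three example rows in the
    proposal and does not validate the ISO 8601 period strings.
    """
    hot = policy.get("hot")
    retain = policy.get("retain")

    # Assumption: a policy with neither key is rejected.
    if hot is None and retain is None:
        raise ValueError("policy must define 'hot' and/or 'retain'")

    # Only the 'period' policy type is covered by this sketch.
    for name, part in (("hot", hot), ("retain", retain)):
        if part is not None and part.get("type") != "period":
            raise ValueError(f"{name}: only 'period' policies are handled here")

    rules: List[Dict[str, Any]] = []

    if hot is not None:
        # Keep recent data loaded on the hot tier.
        rules.append({
            "type": "loadByPeriod",
            "period": hot["period"],
            "tieredReplicants": {DEFAULT_TIER: 1},
        })

    if retain is not None:
        # Mark data older than the retention period as unused.
        rules.append({"type": "dropBeforeByPeriod", "period": retain["period"]})

    # Catch-all rule that terminates the chain, following the table:
    # zero replicas when a hot policy exists, a bare loadForever otherwise.
    if hot is not None:
        rules.append({"type": "loadForever", "tieredReplicants": {DEFAULT_TIER: 0}})
    else:
        rules.append({"type": "loadForever"})

    return rules
```

Rule order matters because the coordinator applies the first rule in the chain that matches a segment, which is why the `loadByPeriod` rule comes before the drop rule and the catch-all comes last.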
### Operational impact

Since this API-only change leverages the existing load/drop rule functionality, nothing needs to be deprecated in the short term. If it makes sense to deprecate the rules API at some point because the new API is equally powerful, then we may consider that.

### Future work

In environments with multiple hot tiers, users must manually enumerate the tiers in the `tieredReplicants` if they use load rules. We can extend the storage policy API to automatically list all the tiers by default if they are not supplied.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
