[GitHub] [druid] abhishekrb19 opened a new issue, #14330: New storage / retention policy API in Druid

via GitHub Mon, 22 May 2023 21:50:38 -0700


abhishekrb19 opened a new issue, #14330:
URL: https://github.com/apache/druid/issues/14330


   ### Motivation
   
   While Druid's [rules](https://druid.apache.org/docs/latest/operations/rule-configuration.html) (load, drop, and broadcast rules) and [kill tasks](https://druid.apache.org/docs/latest/data-management/delete.html) are 
powerful, they can be complex to use and understand, especially in the context 
of retention. Druid users need to think about the lifecycle of segments 
(used/unused), map them to tiered replicants, and add the appropriate imperative 
rules to the rule chain in the correct order.
   
   ### Proposed changes
   At a high level, users can define a storage policy for the hot tier (aka 
historical tier) and deep storage. To that end, introduce a storage 
policy API that translates user-defined policies into one or more load and drop 
rules under the hood.
   
   #### New API `/druid/coordinator/v1/storagePolicy/<dataSource>`
   
   The API will accept two parameters in the create payload:
   - `hot`: Defines how long to keep the data in the hot tier(s) (aka 
historical tiers)
   - `retain`: Defines how long to retain the data before it's cleaned up 
permanently, including data from the deep storage and metadata store
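
As a rough illustration of how a client might address the proposed endpoint, the sketch below builds (but does not send) a request carrying the `hot` and `retain` parameters. The Coordinator address, the `wikipedia` data source name, and the helper `build_storage_policy_request` are assumptions made for this example; the endpoint itself is only a proposal in this issue.

```python
import json
import urllib.request

# Assumption: placeholder Coordinator address, not part of the proposal.
COORDINATOR = "http://localhost:8081"


def build_storage_policy_request(data_source, policy=None, method="POST"):
    """Build, without sending, a request for the proposed storagePolicy endpoint."""
    url = f"{COORDINATOR}/druid/coordinator/v1/storagePolicy/{data_source}"
    # Only a create (POST) request carries the policy payload.
    body = json.dumps(policy).encode("utf-8") if method == "POST" else None
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Hypothetical policy: one hour in the hot tier, 30 days of total retention.
policy = {
    "hot": {"type": "period", "period": "PT1H"},
    "retain": {"type": "period", "period": "P30D"},
}
request = build_storage_policy_request("wikipedia", policy)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it, which only makes sense once such an endpoint exists.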
   
   
   #### Translation of storage policy to load & drop rules
   A few use cases, along with the storage policy payloads and the corresponding 
internal load/drop rules, are shown below:
   
   |Intent | Storage Policy | Load/Drop Rule  |
   |------ | ------ | --------- |
   |Keep the most recent hour of data <br> in the hot tier and permanently delete <br> all data older than 30 days.|<pre>{<br>&nbsp;&nbsp;"hot": {<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "period",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "PT1H"<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;"retain": {<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "period",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P30D"<br>&nbsp;&nbsp;}<br>}</pre>|<pre>[<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadByPeriod",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "PT1H",<br>&nbsp;&nbsp;&nbsp;&nbsp;"tieredReplicants": {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"_default_tier": 1<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "dropBeforeByPeriod",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P30D"<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadForever",<br>&nbsp;&nbsp;&nbsp;&nbsp;"tieredReplicants": {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"_default_tier": 0<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;}<br>]</pre>|
   | Drop all data older <br> than 30 days from the hot tier.| <pre>{<br>&nbsp;&nbsp;"hot": {<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "period",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P30D"<br>&nbsp;&nbsp;}<br>}</pre> |<pre>[<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadByPeriod",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P30D",<br>&nbsp;&nbsp;&nbsp;&nbsp;"tieredReplicants": {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"_default_tier": 1<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadForever",<br>&nbsp;&nbsp;&nbsp;&nbsp;"tieredReplicants": {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"_default_tier": 0<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;}<br>]</pre>|
   | Delete all data older <br> than 60 days.| <pre>{<br>&nbsp;&nbsp;"retain": {<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "period",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P60D"<br>&nbsp;&nbsp;}<br>}</pre> |<pre>[<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "dropBeforeByPeriod",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P60D"<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadForever"<br>&nbsp;&nbsp;}<br>]</pre>|
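
The mapping in the table can be sketched as a small function. This is only an illustration that reproduces the three rows above for period-based policies; the function name `translate_policy` and its `tier` and `replicants` parameters are hypothetical, and the actual translation logic is not specified in this proposal.

```python
def translate_policy(policy, tier="_default_tier", replicants=1):
    """Translate a period-based storage policy into an ordered rule chain.

    Hypothetical helper mirroring the three table rows; other policy
    types (intervals, tiered) are not handled here.
    """
    hot = policy.get("hot")
    retain = policy.get("retain")
    rules = []
    if hot:
        # Keep data within the hot period loaded on the hot tier.
        rules.append({
            "type": "loadByPeriod",
            "period": hot["period"],
            "tieredReplicants": {tier: replicants},
        })
    if retain:
        # Drop data older than the retention period.
        rules.append({"type": "dropBeforeByPeriod", "period": retain["period"]})
    # Catch-all rule that closes the chain.
    catch_all = {"type": "loadForever"}
    if hot:
        # Zero replicants: remaining data is kept out of the hot tier.
        catch_all["tieredReplicants"] = {tier: 0}
    rules.append(catch_all)
    return rules


rules = translate_policy({
    "hot": {"type": "period", "period": "PT1H"},
    "retain": {"type": "period", "period": "P30D"},
})
print([rule["type"] for rule in rules])
```

Rule order matters in the chain, so the sketch always emits the load rule first, the drop rule second, and the catch-all last, as in the table.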
   
   
   #### Extensibility & Maintainability
   Similar to the period-based policies above, we can add interval-based and 
custom tiered policies for more advanced users. For example:
   a. Interval-based policy:
   ```json
   {
     "hot": {
       "type": "intervals",
       "intervals": ["2020-01-01/2022-01-01", "2023-01-01/9999-01-01"]
     }
   }
   ```
   b. Custom-tiered policy:
   ```json
   {
     "hot": {
       "type": "tiered",
       "tiers": {
         "hot1": {"type": "period", "period": "P60D"},
         "hot2": {"type": "period", "period": "P90D"}
       }
     },
     "retain": {"type": "period", "period": "P1Y"}
   }
   ```
   As support is extended to cover these more complex use cases, the API will 
need to translate the corresponding user-defined storage policies into rules as well.
   
   #### High-level implementation
   
   The API implementation will support `POST`, `GET` and `DELETE` operations to 
create, retrieve and delete any configured storage policy per data source. 
Similar to the `rules` endpoint, this new endpoint should be on the coordinator 
and should return appropriate error/status codes to the user. The 
implementation of the API will:
   - Validate ISO 8601 periods
   - Validate interval strings
   - Check that `hot.period` is not larger than `retain.period`
   - Disallow `retain` if auto-kill configuration is disabled
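
The four checks above could look roughly like the following sketch. Everything here is an assumption made for illustration: the function names, the simplified period pattern (no fractional values), the approximation of months and years used to compare periods, the restriction to `start/end` interval strings, and the `auto_kill_enabled` flag standing in for the auto-kill configuration. A real implementation would presumably live in the Coordinator and reuse its existing period and interval parsing.

```python
import re
from datetime import datetime

# Simplified ISO 8601 period pattern: PnYnMnWnDTnHnMnS, integers only.
_PERIOD = re.compile(
    r"^P(?!$)(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?"
    r"(?:T(?=\d)(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)
# Assumption: a year is 365 days and a month is 30 days, for comparison only.
_SECONDS = (365 * 86400, 30 * 86400, 7 * 86400, 86400, 3600, 60, 1)


def period_seconds(period):
    """Return the approximate length of an ISO 8601 period in seconds."""
    match = _PERIOD.match(period)
    if not match:
        raise ValueError(f"invalid ISO 8601 period: {period!r}")
    return sum(int(v or 0) * s for v, s in zip(match.groups(), _SECONDS))


def validate_interval(interval):
    """Accept only 'start/end' strings where start is before end."""
    start, sep, end = interval.partition("/")
    try:
        ok = bool(sep) and datetime.fromisoformat(start) < datetime.fromisoformat(end)
    except (ValueError, TypeError):
        ok = False
    if not ok:
        raise ValueError(f"invalid interval: {interval!r}")


def validate_policy(policy, auto_kill_enabled):
    """Raise ValueError if the policy fails any of the listed checks."""
    seconds = {}
    for key in ("hot", "retain"):
        spec = policy.get(key)
        if spec is None:
            continue
        if spec.get("type") == "period":
            seconds[key] = period_seconds(spec["period"])
        elif spec.get("type") == "intervals":
            for interval in spec["intervals"]:
                validate_interval(interval)
    if "retain" in policy and not auto_kill_enabled:
        raise ValueError("`retain` requires auto-kill to be enabled")
    if "hot" in seconds and "retain" in seconds and seconds["hot"] > seconds["retain"]:
        raise ValueError("`hot.period` cannot be larger than `retain.period`")


validate_policy(
    {"hot": {"type": "period", "period": "PT1H"},
     "retain": {"type": "period", "period": "P30D"}},
    auto_kill_enabled=True,
)
```

Other policy types, such as the `tiered` example, pass through this sketch unchecked.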
   
   ### Rationale
   
   The main benefit of the API is that it abstracts away the complex inner 
workings of load, drop and kill rules. It provides a declarative interface for 
reasoning about retention, similar to what many other systems offer.
   
   ### Operational impact
   
   Since this API-only change leverages the existing load/drop rule 
functionality, nothing needs to be deprecated in short order. If it makes sense 
to deprecate the rules API at some point because the new API is equally 
powerful, then we may consider that. 
   
   ### Future work
   
   In environments with multiple hot tiers, users must manually enumerate the 
tiers in `tieredReplicants` if they use load rules. We can extend the storage 
policy API to automatically include all tiers by default when none are supplied.
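
A minimal sketch of that defaulting behaviour, assuming the set of known tiers is available from somewhere (how the Coordinator would discover them is not covered here). The function name and the default of one replicant per tier are hypothetical.

```python
def default_tiered_replicants(known_tiers, supplied=None, replicants=1):
    """Use the supplied tier map if present, else enumerate every known tier."""
    if supplied:
        return dict(supplied)
    # Sort for a stable, predictable ordering of tier names.
    return {tier: replicants for tier in sorted(known_tiers)}


print(default_tiered_replicants(["hot2", "hot1"]))
print(default_tiered_replicants(["hot2", "hot1"], supplied={"hot1": 2}))
```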

