maytasm opened a new issue #11173:
URL: https://github.com/apache/druid/issues/11173


   ### Motivation
   
   We want to be able to control the size of Druid's metadata store in the face of high-frequency creation and deletion of datasources and their metadata (such as load/drop rules, auto-compaction config, etc.). These datasource entities can be very short-lived, being created and deleted in a short span of time, resulting in high datasource churn. We have to make sure that Druid's performance does not degrade as the number of churned (created and deleted) datasources increases. Currently, the metadata store grows as datasources are created, and it does not shrink even after those datasources are deleted.
   
   ### Proposed changes
   
   Druid currently only supports `druid.indexer.logs.kill.*` to clean up task logs in deep storage and entries in druid_tasks. We should add a similar capability for the other metadata tables that grow with the creation/deletion of datasources. These metadata tables include druid_audit, druid_supervisors, the compaction config in druid_config, druid_rules, and druid_datasources. These auto cleanups will run as part of a Coordinator duty. Each individual table cleanup can be enabled/disabled and configured separately depending on user needs. Each table cleanup will have the following properties:
   `druid.coordinator.kill.*.on` - enables/disables the cleanup for a particular metadata table
   `druid.coordinator.kill.*.period` - how often to run the cleanup duty
   `druid.coordinator.kill.*.durationToRetain` - how long to retain an entity, measured from its creation time
   
   Note that all of these configurations already exist for the task log cleanup.
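
   To illustrate, a Coordinator `runtime.properties` might then include entries like the following. This is a sketch only: the table-name suffixes (`rule`, `audit`) and the period/retention values are hypothetical examples, not finalized names or defaults.

```properties
# Hypothetical example: clean up rules for deleted datasources daily,
# retaining entries for 90 days from creation time
druid.coordinator.kill.rule.on=true
druid.coordinator.kill.rule.period=P1D
druid.coordinator.kill.rule.durationToRetain=P90D

# Hypothetical example: clean up audit records daily, retaining 30 days
druid.coordinator.kill.audit.on=true
druid.coordinator.kill.audit.period=P1D
druid.coordinator.kill.audit.durationToRetain=P30D
```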
   
   ### Rationale
   
   This feature would allow customers to sustain high churn of various entities in Druid (such as datasources, rules, configs, etc.) out of the box. Customers will not run into unexpectedly poor cluster performance when they have high churn of various entities (since we do not have any guardrails to prevent high churn from occurring). This would provide a better experience for customers using Druid and eliminate potential support tickets related to high-churn entities.
   
   The rationale for this design is to keep the UI/UX similar to the existing `druid.indexer.logs.kill.*` implementation for cleaning up task logs, so that operating and configuring the new features is coherent with the existing task log cleanup and easy to use.
   
   ### Operational impact
   
   Nothing is deprecated or removed by this change, so there is no migration path that cluster operators need to be aware of. All of the features introduced will be behind feature flags and hence, by default, will not impact rolling upgrades or rolling downgrades. With these changes, a Druid operator who knows they have high-frequency creation/deletion of datasources can set a few cluster properties to enable this behavior. Once enabled, users will not have to manage any external system, script, or job to clean up the metadata store. They will be able to sustain high churn of various entities in Druid (such as datasources, rules, configs, etc.) without any maintenance (everything will be managed automatically within Druid).
   
   ### Test plan (optional)
   
   A test was put together that loops through the following scenario: create datasource / set auto-compaction / post supervisor / post 3 rules / delete supervisor / delete auto-compaction / delete datasource (each iteration with a different datasource name). While running, the test also monitors the Druid metadata tables in Derby, the sys tables (probably the same as the metadata tables), and ZooKeeper metrics. The test is run for 500 iterations on a brand-new local cluster over about 13 hours, after which we can verify the growth of the metadata table sizes.
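   The churn loop above could be sketched as follows. The endpoint paths follow Druid's Coordinator/Overlord HTTP API, while the host/port values, the datasource naming scheme, and the assumption that the supervisor id matches the datasource name are placeholders to be adapted to the cluster under test (the actual request bodies are omitted):

```python
# Sketch of the churn test described above: one fresh datasource per
# iteration, created and deleted along with its supervisor, rules, and
# auto-compaction config.
COORDINATOR = "http://localhost:8081"  # assumed Coordinator address
OVERLORD = "http://localhost:8090"     # assumed Overlord address

def churn_endpoints(datasource: str) -> dict:
    """Return the HTTP endpoints touched by one churn iteration."""
    return {
        # POST load/drop rules for the datasource
        "rules": f"{COORDINATOR}/druid/coordinator/v1/rules/{datasource}",
        # POST an auto-compaction config (request body names the datasource)
        "compaction": f"{COORDINATOR}/druid/coordinator/v1/config/compaction",
        # POST a supervisor spec ingesting into the datasource
        "supervisor": f"{OVERLORD}/druid/indexer/v1/supervisor",
        # POST to terminate the supervisor (assumes supervisor id == datasource)
        "terminate": f"{OVERLORD}/druid/indexer/v1/supervisor/{datasource}/terminate",
        # DELETE to mark the datasource unused
        "datasource": f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}",
    }

def churn_plan(iterations: int = 500) -> list:
    """Build the create/delete sequence, one fresh datasource per iteration."""
    plan = []
    for i in range(iterations):
        ds = f"churn_test_{i}"  # different datasource name each time
        eps = churn_endpoints(ds)
        plan.append([
            ("POST", eps["supervisor"]),    # create datasource via supervisor
            ("POST", eps["compaction"]),    # set auto-compaction
            ("POST", eps["rules"]),         # post rules
            ("POST", eps["terminate"]),     # delete supervisor
            ("DELETE", eps["datasource"]),  # delete datasource
        ])
    return plan
```

   While such a loop runs, the metadata tables and ZooKeeper metrics would be sampled periodically to verify that, with the kill duties enabled, table sizes stay bounded instead of growing with the iteration count.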
   Some additional tests we could also run include:
   - Rolling churn with many datasources (i.e. keep 200 datasources at all times while churning through one at a time)
   - Churn multiple at a time (i.e. keep 200 datasources at all times while churning multiple at a time)
   - Churn of rules entities (missing from the earlier test)
   - Churn datasources without deleting the supervisor
   - Churn datasources without deleting the auto-compaction config
   - Churn supervisor/compaction/rules without deleting the datasource
   - Churn datasources without deleting the rules

