maytasm opened a new issue #11173: URL: https://github.com/apache/druid/issues/11173
### Motivation

We want to be able to control the size of Druid's metadata store in the face of high-frequency creation/deletion of datasources and their metadata (such as load/drop rules, auto-compaction config, etc.). These datasource entities can be very short lived, being created and deleted within a short span of time, resulting in high datasource churn. We have to make sure that Druid's performance does not degrade as the number of datasources churned (created and deleted) increases. Currently, the metadata store grows as datasources are created, even after those datasources are deleted.

### Proposed changes

Druid currently only supports `druid.indexer.logs.kill.*` to clean up task logs in deep storage and entries in `druid_tasks`. We should add a similar capability for the other metadata tables that grow with the creation/deletion of datasources. These tables include `druid_audit`, `druid_supervisors`, the compaction config in `druid_config`, `druid_rules`, and `druid_datasources`. These auto cleanups will run as part of a Coordinator duty. Each individual table cleanup can be enabled/disabled and configured separately depending on user needs. Each table cleanup will have the following properties:

- `druid.coordinator.kill.*.on` - enables/disables the cleanup for a particular metadata table
- `druid.coordinator.kill.*.period` - configures how often to run the cleanup duty
- `druid.coordinator.kill.*.durationToRetain` - configures how long an entity is retained, measured from its creation time

Note that all of these configurations already exist for the task log cleanup.

### Rationale

This feature would allow customers to have high churn of various entities in Druid (such as datasources, rules, configs, etc.) out of the box. Customers will not run into unexpectedly poor cluster performance when they have high churn of these entities (since we do not have any guardrails to prevent high churn from occurring).
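For illustration, enabling the proposed cleanup for the rules table might look like the snippet below. This is a sketch with assumed property names and example values; the final names would simply mirror the existing `druid.indexer.logs.kill.*` configs.

```properties
# Hypothetical example: enable automatic cleanup of old entries in druid_rules
druid.coordinator.kill.rule.on=true
# Run the cleanup duty once a day
druid.coordinator.kill.rule.period=P1D
# Retain rule entries for 90 days from their creation time
druid.coordinator.kill.rule.durationToRetain=P90D
```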
This would provide a better experience for customers using Druid and eliminate potential support tickets related to high-churn entities. The rationale for the solution is to keep the UI/UX similar to the existing `druid.indexer.logs.kill.*` task log cleanup, so that operating and using the new features is coherent with the existing task log cleanup and easy to pick up.

### Operational impact

Nothing is deprecated or removed by this change, so there is no migration path that cluster operators need to be aware of. All of the features introduced will be behind feature flags and hence, by default, will not impact rolling upgrades or rolling downgrades. With these changes, a Druid operator who knows they have high-frequency creation/deletion of datasources can set a few cluster properties to enable this behavior. Once it is enabled, users will not have to manage any external system/script/job to clean up the metadata store. Users will be able to have high churn of various entities in Druid (such as datasources, rules, configs, etc.) without any maintenance (everything will be managed automatically within Druid).

### Test plan (optional)

A test was put together that loops through create datasource / set auto-compaction / post supervisor / post 3 rules / delete supervisor / delete auto-compaction / delete datasource (all with different datasource names). While running this, the test also monitors the Druid metadata tables in Derby, the sys tables (probably the same as the metadata tables), and ZooKeeper metrics. The test will be run for 500 iterations on a brand-new local cluster for about 13 hours; we can then verify the growth of the metadata table sizes. Some additional tests we can also do include:

- Rolling churn with many datasources (i.e. have 200 datasources at all times while churning through one at a time)
- Churn multiple at a time (i.e.
have 200 datasources at all times while churning multiple at a time)
- Churn of rules entities (this was missing from the earlier test)
- Churn datasources without deleting the supervisor
- Churn datasources without deleting the compaction config
- Churn supervisor/compaction/rules without deleting the datasource
- Churn datasources without deleting rules
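To make the cleanup behavior above concrete, the sketch below shows how a cleanup duty could derive its deletion cutoff from `durationToRetain` and turn it into a delete statement. The class, method, and `created_date` column names are illustrative assumptions, not actual Druid code.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the core of a metadata-cleanup Coordinator duty.
// Names here are illustrative only, not actual Druid classes or schema.
public class MetadataCleanupSketch {
    // Entries created before the returned instant are eligible for deletion.
    public static Instant cutoffFor(Instant now, Duration durationToRetain) {
        return now.minus(durationToRetain);
    }

    // Builds the delete statement a duty might issue against a metadata table.
    // Assumes an ISO-formatted created_date column, so string comparison works.
    public static String deleteStatement(String table, Instant cutoff) {
        return "DELETE FROM " + table + " WHERE created_date < '" + cutoff + "'";
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2021-05-01T00:00:00Z");
        Instant cutoff = cutoffFor(now, Duration.ofDays(90));
        System.out.println(deleteStatement("druid_rules", cutoff));
        // prints: DELETE FROM druid_rules WHERE created_date < '2021-01-31T00:00:00Z'
    }
}
```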
