soullkk opened a new issue, #16486:
URL: https://github.com/apache/druid/issues/16486

   Overlord CPU usage stays at around 90%
   
   ### Affected Version
   
   28.0.1
   
   ### Description
   
   - Cluster size
   1 node
   - Configurations in use
   `druid.coordinator.asOverlord.enabled`=true
   `druid.coordinator.asOverlord.overlordService`=druid/overlord
   `druid.worker.capacity`=50
   - Steps to reproduce the problem
   Run in local mode and continuously submit a large number of tasks, more 
than `druid.worker.capacity` allows to run at once.
   - The error message or stack traces encountered. Providing more context, 
such as nearby log messages or even entire logs, can be helpful.
   
   The Overlord's CPU usage is very high:
   
![image](https://github.com/apache/druid/assets/55041925/35ced57f-d4b5-494e-93a9-ce962b159282)
   
   1. Failed to get storage slot for task
   ```
   coordinator_20240521223046551.zip:2024-05-21 22:30:43,377 WARN  
[forking-task-runner-7][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_static_resource_8db78ddde673b19_lhpeonok], cannot 
schedule.
   coordinator_20240521223046551.zip:2024-05-21 22:30:43,468 WARN  
[forking-task-runner-28][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_forwarding_res_raw__DEFAULT_9203ec7d04035af_hjgphojj],
 cannot schedule.
   coordinator_20240521223046551.zip:2024-05-21 22:30:43,558 WARN  
[forking-task-runner-40][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_ne_raw__DEFAULT_eec2a36b14271f8_caongljd],
 cannot schedule.
   coordinator_20240521223046551.zip:2024-05-21 22:30:43,649 WARN  
[forking-task-runner-31][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_queue_raw_1b3a42fff2170e2_cehoaoce], cannot schedule.
   coordinator_20240521223046551.zip:2024-05-21 22:30:44,114 WARN  
[forking-task-runner-26][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_queue_raw_1b3a42fff2170e2_dbfmcief], cannot schedule.
   coordinator_20240521223046551.zip:2024-05-21 22:30:44,230 WARN  
[forking-task-runner-42][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_details_1m__DEFAULT_664980334e7a24d_noekhila],
 cannot schedule.
   coordinator_20240521223046551.zip:2024-05-21 22:30:44,323 WARN  
[forking-task-runner-8][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_details_1h__DEFAULT_3e9c6f0640d02de_ajdkfkdn],
 cannot schedule.
   coordinator_20240521223127220.zip:2024-05-21 22:31:23,327 WARN  
[forking-task-runner-25][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_static_resource_97bc5759a9bf04c_mffbbkng], cannot 
schedule.
   coordinator_20240521223127220.zip:2024-05-21 22:31:23,414 WARN  
[forking-task-runner-36][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_forwarding_res_raw__DEFAULT_ed81832451d12e0_kfanbohg],
 cannot schedule.
   coordinator_20240521223127220.zip:2024-05-21 22:31:23,484 WARN  
[forking-task-runner-18][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_ne_raw__DEFAULT_f3c306d58e3e663_cfnihfel],
 cannot schedule.
   coordinator_20240521223127220.zip:2024-05-21 22:31:23,561 WARN  
[forking-task-runner-22][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_queue_raw_07f2a03569028fc_jekmllan], cannot schedule.
   coordinator_20240521223127220.zip:2024-05-21 22:31:23,634 WARN  
[forking-task-runner-46][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_queue_raw_07f2a03569028fc_khccbcbj], cannot schedule.
   coordinator_20240521223127220.zip:2024-05-21 22:31:23,704 WARN  
[forking-task-runner-17][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_details_1h__DEFAULT_c84ecf67861d7bf_chbdkpcm],
 cannot schedule.
   coordinator_20240521223127220.zip:2024-05-21 22:31:23,774 WARN  
[forking-task-runner-24][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_details_1m__DEFAULT_146aa05938239ab_fagkhdam],
 cannot schedule.
   coordinator_20240521223143913.zip:2024-05-21 22:31:42,314 WARN  
[forking-task-runner-33][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_static_resource_b3a662951c9f100_fmfbpcoi], cannot 
schedule.
   coordinator_20240521223143913.zip:2024-05-21 22:31:42,753 WARN  
[forking-task-runner-12][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_forwarding_res_raw__DEFAULT_340f95d089fbf49_lbgkjlej],
 cannot schedule.
   coordinator_20240521223143913.zip:2024-05-21 22:31:42,831 WARN  
[forking-task-runner-15][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_details_1m__DEFAULT_f000b6a33926bd2_mcihdgke],
 cannot schedule.
   coordinator_20240521223143913.zip:2024-05-21 22:31:42,931 WARN  
[forking-task-runner-43][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_ne_raw__DEFAULT_f270f123147f49a_gpnjeake],
 cannot schedule.
   coordinator_20240521223143913.zip:2024-05-21 22:31:43,005 WARN  
[forking-task-runner-47][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_queue_raw_53acc095f420570_pfdeeinh], cannot schedule.
   coordinator_20240521223143913.zip:2024-05-21 22:31:43,083 WARN  
[forking-task-runner-30][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_queue_raw_53acc095f420570_cdpnnkfm], cannot schedule.
   coordinator_20240521223143913.zip:2024-05-21 22:31:43,163 WARN  
[forking-task-runner-37][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_details_1h__DEFAULT_65ffb07f4133c55_mleikpbd],
 cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:26,801 WARN  
[forking-task-runner-45][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_static_resource_e3e8f4efff4f827_elphhhbd], cannot 
schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:28,091 WARN  
[forking-task-runner-39][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_forwarding_res_raw__DEFAULT_dac66b069b1a02b_dojecloo],
 cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:29,129 WARN  
[forking-task-runner-14][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_details_1h__DEFAULT_3dc4f408912b55e_bdanbceo],
 cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:36,020 WARN  
[forking-task-runner-23][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_ne_raw__DEFAULT_b8f70364e8d5cf6_iomgeelh],
 cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:39,199 WARN  
[forking-task-runner-48][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_tcam_acl_1_min_stats_0b3636a13461911_cekmibak], cannot 
schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:43,918 WARN  
[forking-task-runner-3][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_queue_raw_7ded0c7dfe27fb4_hmbkhojk], cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:44,693 WARN  
[forking-task-runner-6][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_queue_raw_7ded0c7dfe27fb4_ipkhibog], cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:47,109 WARN  
[forking-task-runner-7][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_forwarding_res_1_min_stats_cbbfc5bb9d459d2_pggfobha], 
cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:52,942 WARN  
[forking-task-runner-28][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_details_1m__DEFAULT_9bdc8b8a9320226_eajohkap],
 cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:53,374 WARN  
[forking-task-runner-40][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_prot_1m__DEFAULT_1833c7ffe2051ce_kopebkdo],
 cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:56,099 WARN  
[forking-task-runner-31][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_port_raw__DEFAULT_e4e03c13983f9b4_bohejebd],
 cannot schedule.
   coordinator_20240521223328096.zip:2024-05-21 22:32:59,644 WARN  
[forking-task-runner-26][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_tcam_acl_raw__DEFAULT_37a8b4b45b5372b_ldnckfdh],
 cannot schedule.
   coordinator_20240521223401836.zip:2024-05-21 22:33:54,166 WARN  
[forking-task-runner-22][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_prot_1m__DEFAULT_1833c7ffe2051ce_mlhehmlb],
 cannot schedule.
   coordinator_20240521223401836.zip:2024-05-21 22:33:54,293 WARN  
[forking-task-runner-46][ROOT][org.apache.druid.indexing.overlord.BaseRestorableTaskRunner]
 Failed to get storage slot for task 
[index_kafka_fi_dc_kpi_tcam_acl_1_min_stats_0b3636a13461911_cldiagkn], cannot 
schedule.
   ```
   2. Asking taskRunner to clean up tasks
   ```
   coordinator_20240521223046551.zip:2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,457 tasks.
   coordinator_20240521223052553.zip:2024-05-21 22:30:51,260 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,464 tasks.
   coordinator_20240521223127220.zip:2024-05-21 22:31:23,855 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,464 tasks.
   coordinator_20240521223134683.zip:2024-05-21 22:31:30,659 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,471 tasks.
   coordinator_20240521223143913.zip:2024-05-21 22:31:43,246 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,471 tasks.
   coordinator_20240521223205310.zip:2024-05-21 22:31:50,991 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,478 tasks.
   coordinator_20240521223328096.zip:2024-05-21 22:33:00,050 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,478 tasks.
   coordinator_20240521223401836.zip:2024-05-21 22:33:55,596 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,490 tasks.
   coordinator_20240521223439044.zip:2024-05-21 22:34:34,775 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,498 tasks.
   coordinator_20240521223445053.zip:2024-05-21 22:34:43,946 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,506 tasks.
   coordinator_20240521223519225.zip:2024-05-21 22:35:01,414 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,506 tasks.
   coordinator_20240521223526864.zip:2024-05-21 22:35:24,589 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,506 tasks.
   coordinator_20240521223539653.zip:2024-05-21 22:35:38,440 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,518 tasks.
   coordinator_20240521223553389.zip:2024-05-21 22:35:48,787 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,521 tasks.
   coordinator_20240521223621517.zip:2024-05-21 22:36:19,519 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,532 tasks.
   coordinator_20240521223627719.zip:2024-05-21 22:36:27,110 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,543 tasks.
   coordinator_20240521223647856.zip:2024-05-21 22:36:44,155 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,545 tasks.
   coordinator_20240521223712723.zip:2024-05-21 22:36:53,064 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,553 tasks.
   coordinator_20240521223836555.zip:2024-05-21 22:38:23,722 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,553 tasks.
   coordinator_20240521223859542.zip:2024-05-21 22:38:56,796 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,561 tasks.
   ```
   
   3. ForkingTaskRunner shuts down tasks
   ```
   2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.TaskQueue] Asking 
taskRunner to clean up 115,457 tasks.
   2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_static_resource_ba0d344ae3f0aa4_kooomhnf] 
because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_microburst_stats__DEFAULT_4d70410d6b24cfc_hjpnjlic]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_if_1m__DEFAULT_51d7fdd9fd7c45c_dhbmgfgb]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_nta_flow_host_1m_479a1348c46e557_dgdlgang] because: 
[Task is not in knownTaskIds]
   2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_microburst_stats__DEFAULT_09340abebb67177_moailndc]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_tcam_acl_raw__DEFAULT_9c66f4c0df21cc9_mhejcdpd]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,436 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_abn_evt_detail_c4d6b09f14602a5_mlhlklkb] 
because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_queue_raw_470306ebae6ee2c_comhpeni] because: 
[Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_if_1m__DEFAULT_c78ce11bf0bd9a9_acjnhnac]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_forwarding_res_raw__DEFAULT_a0e0c351804f18c_debdjgll]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_queue_raw_b6ab5354e455235_obncnoad] because: 
[Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_microburst_stats__DEFAULT_f6b82910a084175_ceeeifgi]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_microburst_stats__DEFAULT_4b1779aff7adb72_hpnfjhef]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_prot_1h__DEFAULT_35b238a352e511c_pkojkikb]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_microburst_stats__DEFAULT_93fd41ecdd1b047_picbgepo]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_microburst_stats__DEFAULT_e5ecbd6bb526139_efnhahcl]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_ne_raw__DEFAULT_9aee19a2619e579_pfgiaglb]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_static_resource_2c3dd81f86c1f38_emopajnn] 
because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_vni_1m__DEFAULT_8a1649ba73a743a_fmjfepmn]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[compact_ODAEDATASET__DEFAULT_fi_nta_flow_conv_1h__DEFAULT_ophefeoc_2024-05-21T09:57:35.037Z]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_microburst_stats__DEFAULT_48470e1b674d47b_ofipfpdl]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_queue_raw_c6b997e36106b3e_lfkkkcdo] because: 
[Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_abn_evt_detail_a5794c4cbee7f6b_aeicodgb] 
because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_queue_raw_54fea11cbabcc2e_ijjmpcka] because: 
[Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_ne_raw__DEFAULT_2b337433bb4cd6e_egjeigmf]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_tcam_acl_raw__DEFAULT_8e3cf50df463184_lilkmmie]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_microburst_stats__DEFAULT_4274878bf3c5e30_dhjfkpof]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_dc_kpi_port_raw__DEFAULT_4e2c33aa9d3a012_iejebgdj]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_nta_flow_host_1h_top10000_5030b705112b3c3_kjkajbdd] 
because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown 
[index_kafka_ODAEDATASET__DEFAULT_fi_nta_flow_conv_1m__DEFAULT_141e3cc5c76093b_agcaomcl]
 because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_tcam_acl_1_min_stats_ed6593efe85c36d_ofenihhl] 
because: [Task is not in knownTaskIds]
   2024-05-21 22:30:44,437 INFO  
[TaskQueue-Manager][ROOT][org.apache.druid.indexing.overlord.ForkingTaskRunner] 
Shutdown [index_kafka_fi_dc_kpi_issue_summary_08dd10ea5f4ca96_djdebklg] 
because: [Task is not in knownTaskIds]
   ```
   - Any debugging that you have already done
   
   ForkingTaskRunner: 
   When no task slot is available, entries keep being added to `tasks` but are 
never removed.
   ```
     public ListenableFuture<TaskStatus> run(final Task task)
     {
       synchronized (tasks) {
         tasks.computeIfAbsent(
             task.getId(), k ->
             new ForkingTaskRunnerWorkItem(
               task,
               exec.submit(
                 new Callable<TaskStatus>() {
                   @Override
                   public TaskStatus call()
                   {
                     final TaskStorageDirTracker.StorageSlot storageSlot;
                     try {
                       storageSlot = getTracker().pickStorageSlot(task.getId());
                     }
                     catch (RuntimeException e) {
                       LOG.warn(e, "Failed to get storage slot for task [%s], cannot schedule.", task.getId());
                       return TaskStatus.failure(
                           task.getId(),
                           StringUtils.format("Failed to get storage slot due to error [%s]", e.getMessage())
                       );
                     }
   ```
   TaskQueue:
   `runnerTaskFutures` is built from `ForkingTaskRunner.tasks`, and its key set 
differs significantly from `knownTaskIds`. A large number of entries in 
`ForkingTaskRunner.tasks` belong to tasks that failed because no task slot was 
available, and those entries never get cleaned up.
   ```
     void manageInternal()
     {
       Set<String> knownTaskIds = new HashSet<>();
       Map<String, ListenableFuture<TaskStatus>> runnerTaskFutures = new HashMap<>();
   
       giant.lock();
   
       try {
         manageInternalCritical(knownTaskIds, runnerTaskFutures);
       }
       finally {
         giant.unlock();
       }
   
       manageInternalPostCritical(knownTaskIds, runnerTaskFutures);
     }
   
     private void manageInternalPostCritical(
         final Set<String> knownTaskIds,
         final Map<String, ListenableFuture<TaskStatus>> runnerTaskFutures
     )
     {
       // Kill tasks that shouldn't be running
       final Set<String> tasksToKill = Sets.difference(runnerTaskFutures.keySet(), knownTaskIds);
       if (!tasksToKill.isEmpty()) {
         log.info("Asking taskRunner to clean up %,d tasks.", tasksToKill.size());
   
         // On large installations running several thousands of tasks,
         // concatenating the list of known task ids can be computationally expensive.
         final boolean logKnownTaskIds = log.isDebugEnabled();
         final String reason = logKnownTaskIds
                 ? StringUtils.format("Task is not in knownTaskIds[%s]", knownTaskIds)
                 : "Task is not in knownTaskIds";
   
         for (final String taskId : tasksToKill) {
           try {
             taskRunner.shutdown(taskId, reason);
           }
           catch (Exception e) {
             log.warn(e, "TaskRunner failed to clean up task: %s", taskId);
           }
         }
       }
     }
   
   ```
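To illustrate the mismatch described above, here is a minimal toy model, not Druid code. Every name in it (`StaleEntrySketch`, `submitWithoutFreeSlot`, `staleCountAfter`) is hypothetical, and it assumes the queue stops tracking a task ID once the task has reported failure while nothing removes the runner-side entry. Under that assumption, the set difference computed on each management pass grows by one for every failed submission.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: every name here is hypothetical, none of this is Druid code.
final class StaleEntrySketch {
    // Stands in for the runner-side map keyed by task ID.
    private final Map<String, String> runnerTasks = new ConcurrentHashMap<>();
    // Stands in for the IDs the queue still tracks as active.
    private final Set<String> knownTaskIds = new HashSet<>();

    // Assumed sequence: the queue registers the task, the runner records it,
    // slot allocation fails, and the queue drops the ID after seeing the failure.
    // Nothing removes the runner-side entry.
    void submitWithoutFreeSlot(String taskId) {
        knownTaskIds.add(taskId);
        runnerTasks.put(taskId, "FAILED");
        knownTaskIds.remove(taskId);
    }

    // Same shape as the set difference computed in the management pass.
    Set<String> tasksToKill() {
        Set<String> stale = new HashSet<>(runnerTasks.keySet());
        stale.removeAll(knownTaskIds);
        return stale;
    }

    // Runs the given number of failed submissions and counts leftover entries.
    static int staleCountAfter(int failedSubmissions) {
        StaleEntrySketch sketch = new StaleEntrySketch();
        for (int i = 0; i < failedSubmissions; i++) {
            sketch.submitWithoutFreeSlot("task-" + i);
        }
        return sketch.tasksToKill().size();
    }

    public static void main(String[] args) {
        System.out.println("stale entries: " + staleCountAfter(1000));
    }
}
```

In this model the number of stale entries never shrinks, which is consistent with the steadily growing count in the "Asking taskRunner to clean up" log lines.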
   
   I think we should fix it this way: add `tasks.remove(task.getId())` to the 
catch block so that the entry is removed when task slot allocation fails.
   ```
     public ListenableFuture<TaskStatus> run(final Task task)
     {
       synchronized (tasks) {
         tasks.computeIfAbsent(
             task.getId(), k ->
             new ForkingTaskRunnerWorkItem(
               task,
               exec.submit(
                 new Callable<TaskStatus>() {
                   @Override
                   public TaskStatus call()
                   {
                     final TaskStorageDirTracker.StorageSlot storageSlot;
                     try {
                       storageSlot = getTracker().pickStorageSlot(task.getId());
                     }
                     catch (RuntimeException e) {
                       tasks.remove(task.getId());
                       LOG.warn(e, "Failed to get storage slot for task [%s], cannot schedule.", task.getId());
                       return TaskStatus.failure(
                           task.getId(),
                           StringUtils.format("Failed to get storage slot due to error [%s]", e.getMessage())
                       );
                     }
   ```
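The idea behind the proposed change can be sketched in isolation as follows. This is a simplified stand-in, not the actual `ForkingTaskRunner`: `RemoveOnFailureSketch`, `pickSlot`, and `entryRemainsAfterRun` are hypothetical names, task statuses are plain strings, and the map is assumed to be a `ConcurrentHashMap`. For brevity, the success path leaves the entry in place.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified stand-in for the runner; all names are hypothetical.
final class RemoveOnFailureSketch {
    private final ConcurrentHashMap<String, Future<String>> tasks = new ConcurrentHashMap<>();
    private final ExecutorService exec = Executors.newSingleThreadExecutor();

    // Hypothetical slot allocation that throws when no slot is free.
    private static void pickSlot(boolean slotsAvailable) {
        if (!slotsAvailable) {
            throw new IllegalStateException("no storage slot available");
        }
    }

    Future<String> run(String taskId, boolean slotsAvailable) {
        return tasks.computeIfAbsent(taskId, id -> {
            Callable<String> work = () -> {
                try {
                    pickSlot(slotsAvailable);
                } catch (RuntimeException e) {
                    // The proposed change: drop the entry before reporting failure.
                    tasks.remove(id);
                    return "FAILED";
                }
                return "SUCCESS";
            };
            return exec.submit(work);
        });
    }

    // Runs one task to completion and reports whether its entry is still in the map.
    static boolean entryRemainsAfterRun(boolean slotsAvailable) {
        RemoveOnFailureSketch runner = new RemoveOnFailureSketch();
        try {
            runner.run("task-1", slotsAvailable).get();
            return runner.tasks.containsKey("task-1");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            runner.exec.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("entry remains after failed allocation: " + entryRemainsAfterRun(false));
    }
}
```

With the removal in the catch block, a failed allocation leaves no entry behind, so a later management pass would have nothing stale to shut down for that task.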
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

