josh-pritchard-fcx opened a new issue #10185: URL: https://github.com/apache/druid/issues/10185
### Affected Version

0.1.8

### Description

At this point I am nearly positive I have encountered a new bug. I have Kafka tasks that run through their duration and pause, but the middleManager never publishes a success back to the overlord. They just pause and eventually fail, with no word back from the middleManager at any point for 30 minutes.

This ONLY happens when kill tasks are also active. I even moved the kill tasks to a different middleManager tier to make sure they were not somehow conflicting on the same hardware. Just now I had some tasks stuck for over 20 minutes, but as soon as I cleared a couple of running kill tasks they went right through.

No part of my cluster is CPU bound during this. ZooKeeper, the Overlord, the Coordinator, and the middleManagers are all practically idle. The metadata store runs in Postgres on quite a large instance; I have no issues querying tables and its CPU usage is very low.

Something is obviously conflicting somewhere, but I have no idea what. In the case I just observed, the kill tasks were not even for the same datasources as the Kafka tasks. I need to run kill tasks because we have a lot of unused segments to clean up, but I can't use them if they make my realtime tasks fail.

This is a common log that I get right before shutdown.
Note that the task is actually in the **runnerTaskFutures** list:

_[2020-07-14T12:32:11,070] [INFO] [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [**index_kafka_Aggregation_5c27aea6cb975ae_pbhkdjjd**] because: [task is not in runnerTaskFutures[[index_kafka_AlertHistory_1a7b03058a83204_bmnhdfdn, index_kafka_Continuous_5833cf81059eaca_lfiajhpf, index_kafka_Test_9700e5e16605098_khikomij, index_kafka_TaskHistory_b58a78bb2e4f710_jipffpkl, index_kafka_Aggregation_63291934c698c0c_dlljoall, index_BE9412DC-ED97-40C8-B3C5-AE57B74311B6_Transitional_jgllifaf_2020-07-14T12:31:58.181Z, **index_kafka_Aggregation_5c27aea6cb975ae_pbhkdjjd**, index_kafka_AlertHistory_9ead9501e1ba063_kfedfngk, index_kafka_Transitional_5c05e5092fc6c9c_dfplnanb]]]_

Other times the **completionTimeout** is what kills the tasks.
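For anyone who wants to check whether their own overlord logs show the same mismatch, here is a minimal sketch of a scanner for it. It assumes every such line has the same shape as the single RemoteTaskRunner line quoted in this issue; the regular expression and the function name `check_shutdown_line` are illustrative and not part of Druid. The sample line is abbreviated from the one in this issue, with the markdown emphasis removed.

```python
import re

# Assumed line shape, inferred only from the one log line quoted in this issue:
#   ... Shutdown [<task id>] because: [task is not in runnerTaskFutures[[<id>, <id>, ...]]]
SHUTDOWN_RE = re.compile(
    r"Shutdown \[(?P<task>[^\]]+)\] because: "
    r"\[task is not in runnerTaskFutures\[\[(?P<known>.*?)\]\]\]"
)


def check_shutdown_line(line):
    """Return (task_id, listed) for a matching shutdown line, else None.

    `listed` is True when the task being shut down also appears in the
    runnerTaskFutures list printed on the same line, which is the
    contradiction described in this issue.
    """
    match = SHUTDOWN_RE.search(line)
    if match is None:
        return None
    task = match.group("task")
    known = [item.strip() for item in match.group("known").split(",")]
    return task, task in known


if __name__ == "__main__":
    # Abbreviated copy of the quoted log line (task list shortened).
    sample = (
        "[2020-07-14T12:32:11,070] [INFO] [TaskQueue-Manager] "
        "org.apache.druid.indexing.overlord.RemoteTaskRunner - "
        "Shutdown [index_kafka_Aggregation_5c27aea6cb975ae_pbhkdjjd] because: "
        "[task is not in runnerTaskFutures[["
        "index_kafka_Test_9700e5e16605098_khikomij, "
        "index_kafka_Aggregation_5c27aea6cb975ae_pbhkdjjd]]]"
    )
    print(check_shutdown_line(sample))
```

Running this over the overlord log and counting the lines where `listed` is True would show how often a task is shut down even though it appears in the list printed on the same line.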
