josh-pritchard-fcx opened a new issue #10185:
URL: https://github.com/apache/druid/issues/10185


   ### Affected Version
   
   0.1.8
   
   ### Description
   
   At this point I am nearly positive I have encountered a new bug. I have 
Kafka tasks that run through their duration and pause, but the middleManager 
never publishes a success back to the Overlord. They just pause and eventually 
fail, with no word back from the middleManager at any point for 30 minutes. 
This ONLY happens when kill tasks are also active. I even moved the kill tasks 
to a different middleManager tier to make sure they were not conflicting 
on the same hardware somehow. Just now I had some tasks stuck for over 20 
minutes, but as soon as I cleared a couple of running kill tasks they went 
right through. No part of my cluster is CPU bound during this: ZooKeeper, 
Overlord, Coordinator, and middleManagers are all practically idle. The 
metadata store runs in PostgreSQL on quite a large instance; I have no issues 
querying tables and its CPU usage is very low. Something is obviously 
conflicting somewhere, but I have no idea what. In the case I just observed, 
the kill tasks were not even for the same datasources as the Kafka tasks. I 
need to run kill tasks because we have a lot of unused segments to clean up, 
but I can't use them if they make my realtime tasks fail.
   
   This is a common log line that I get right before shutdown. Note that the 
task is actually in the **runnerTaskFutures** list, even though the message 
says it is not:
   _[2020-07-14T12:32:11,070] [INFO] [TaskQueue-Manager] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown 
[**index_kafka_Aggregation_5c27aea6cb975ae_pbhkdjjd**] because: [task is not in 
runnerTaskFutures[[index_kafka_AlertHistory_1a7b03058a83204_bmnhdfdn, 
index_kafka_Continuous_5833cf81059eaca_lfiajhpf, 
index_kafka_Test_9700e5e16605098_khikomij, 
index_kafka_TaskHistory_b58a78bb2e4f710_jipffpkl, 
index_kafka_Aggregation_63291934c698c0c_dlljoall, 
index_BE9412DC-ED97-40C8-B3C5-AE57B74311B6_Transitional_jgllifaf_2020-07-14T12:31:58.181Z,
 **index_kafka_Aggregation_5c27aea6cb975ae_pbhkdjjd**, 
index_kafka_AlertHistory_9ead9501e1ba063_kfedfngk, 
index_kafka_Transitional_5c05e5092fc6c9c_dfplnanb]]]_
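
   For anyone who wants to check their own Overlord logs for the same 
contradiction, here is a rough sketch. It assumes the message has exactly the 
shape quoted above; the function name `check_shutdown_line` and the regular 
expression are my own and are not part of Druid. It reports whether the task 
that was shut down also appears in the list printed in the same message.

```python
import re

# Assumed message shape, taken from the log line quoted above:
#   Shutdown [<task id>] because: [task is not in runnerTaskFutures[[<id>, <id>, ...]]]
SHUTDOWN_RE = re.compile(
    r"Shutdown \[(?P<task>[^\]]+)\] because: "
    r"\[task is not in runnerTaskFutures\[\[(?P<futures>[^\]]*)\]\]\]"
)


def check_shutdown_line(line):
    """Return (task_id, listed) for a matching shutdown message, else None.

    listed is True when the task that was shut down also appears in the
    runnerTaskFutures list printed in the same message.
    """
    # Collapse whitespace so a message that was hard-wrapped between words
    # (for example by a mail client) still matches.
    flat = " ".join(line.split())
    match = SHUTDOWN_RE.search(flat)
    if match is None:
        return None
    task = match.group("task")
    futures = [t.strip() for t in match.group("futures").split(",") if t.strip()]
    return task, task in futures
```

   A result of `(task_id, True)` is the case described here: the task is shut 
down for not being in the list while being printed in that very list.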
   
   Other times the **completionTimeout** is what kills the tasks.
   

