pdeva commented on issue #6166: MAJOR issue: Unrecoverable error in KIS 0.12.2 and 0.12.1
URL: https://github.com/apache/incubator-druid/issues/6166#issuecomment-412628004

So I just tried increasing the size of the faulty MM instance to an absolutely absurd number of CPUs, and that seems to have done the trick.

Here is what I think is happening:

1. When the node is started, it tries to start all the KI tasks.
2. Since the tasks have been down for a while, they saturate all CPU trying to read and process data from Kafka.
3. Due to the CPU saturation, they cannot respond to the overlord's status checks.
4. Since the overlord's status check fails, it kills the tasks and respawns them, thus creating an endless loop.

In fact, even with the mighty powerful CPU box, the status check from the overlord failed a few times (but it seems to have passed before timing out completely).

The solution would be to:

1. Increase the overlord timeout for individual status pings to tasks (not sure if there is an existing setting for this).
2. Increase the total time and number of retries before giving up.

As for why this occurs in the first place, the issue seems to be related to #6117 and #5340. Connection problems across nodes are a huge issue with Druid right now.
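
To make the two proposed knobs concrete, here is a rough sketch of a status check with a per-ping timeout and a retry budget. This is not Druid code: `check_with_retries`, `make_busy_task`, and all parameter names are hypothetical, and the "task" is simulated by a list of response latencies rather than a real HTTP call.

```python
import time
from typing import Callable, Iterable


def check_with_retries(
    ping: Callable[[float], bool],
    per_ping_timeout: float,
    max_retries: int,
    backoff: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as one ping succeeds; False once the retry budget is spent.

    Hypothetical helper: `ping` takes a timeout in seconds and either returns
    True or raises TimeoutError.
    """
    for attempt in range(max_retries):
        try:
            if ping(per_ping_timeout):
                return True
        except TimeoutError:
            pass  # a slow task is not necessarily a dead task; try again
        if attempt < max_retries - 1:
            # Linear backoff gives a CPU-saturated task time to catch up.
            sleep(backoff * (attempt + 1))
    return False


def make_busy_task(latencies: Iterable[float]) -> Callable[[float], bool]:
    """Simulate a task whose response time shrinks as it catches up (assumed behavior)."""
    remaining = iter(latencies)

    def ping(timeout: float) -> bool:
        latency = next(remaining)
        if latency > timeout:
            raise TimeoutError(f"no response within {timeout}s")
        return True

    return ping


if __name__ == "__main__":
    # Same simulated task, two retry budgets: a short budget gives up while
    # the task is still catching up, a longer one lets it pass.
    print(check_with_retries(make_busy_task([5.0, 4.0, 1.0]), 2.0, max_retries=2))
    print(check_with_retries(make_busy_task([5.0, 4.0, 1.0]), 2.0, max_retries=3))
```

In this toy model, either raising `per_ping_timeout` or raising `max_retries` lets the check pass for a task that is slow but alive, which is the behavior the two suggestions above are after. The trade-off is that a genuinely dead task takes longer to detect.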