pdeva commented on issue #6166: MAJOR issue: Unrecoverable error in KIS 0.12.2
and 0.12.1
URL:
https://github.com/apache/incubator-druid/issues/6166#issuecomment-412628004
So I just tried increasing the size of the faulty MM instance to an
absolutely absurd number of CPUs, and that seems to have done the trick.
Here is what I think is happening:
1. When the node is started, it tries to start all the KI tasks.
2. Since the tasks have been down for a while, they saturate all CPUs trying
to read and process data from Kafka.
3. Due to CPU saturation, they cannot respond to the overlord's status checks.
4. Since the overlord's status checks fail, it kills the tasks and respawns
them, thus creating an endless loop.
In fact, even on the much more powerful box, the overlord's status check
failed a few times (though it seems to have passed before timing out
completely).
The solution would be to:
1. Increase the overlord's timeout for individual status pings to tasks (not
sure if there is an existing setting for this).
2. Increase the total time and number of retries before giving up.
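
To make the two knobs concrete, here is a rough Python sketch of a status check with a per-ping timeout and a retry budget. It is not Druid code: `check_with_retries`, `make_busy_task`, and the latency numbers are all made up for illustration, and the "task" only simulates how long each status response would take while it is busy catching up.

```python
import time
from typing import Callable, Iterable


def check_with_retries(
    ping: Callable[[float], bool],
    timeout_s: float,
    max_retries: int,
    backoff_s: float = 0.0,
) -> bool:
    """Return True if any ping succeeds; False means the caller gives up.

    Hypothetical helper: `ping(timeout_s)` must return True when the task
    answered within `timeout_s` seconds.
    """
    attempts = max_retries + 1  # the first try plus the retries
    for attempt in range(attempts):
        if ping(timeout_s):
            return True
        if attempt < attempts - 1:
            # Linear backoff gives a saturated task time to catch up.
            time.sleep(backoff_s * (attempt + 1))
    return False


def make_busy_task(response_times_s: Iterable[float]) -> Callable[[float], bool]:
    """Simulated task: the i-th status response takes response_times_s[i] seconds.

    Once the list is exhausted the task is treated as unresponsive.
    """
    times = iter(response_times_s)

    def ping(timeout_s: float) -> bool:
        latency = next(times, float("inf"))
        return latency <= timeout_s

    return ping


# Made-up latencies: slow right after restart, faster as the task catches up.
catching_up = [30.0, 25.0, 12.0, 4.0, 1.0]

# Tight settings: every attempt times out, so the task would be killed.
killed = not check_with_retries(make_busy_task(catching_up), timeout_s=10, max_retries=2)

# More retries: the task survives long enough to answer.
survives = check_with_retries(make_busy_task(catching_up), timeout_s=10, max_retries=4)
```

With the tight settings the check gives up while the task is still catching up, which is the kill-and-respawn loop described above. Raising either the per-ping timeout or the retry count lets the same task pass.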
This issue seems to be related to #6117 and #5340 in terms of why it occurs
in the first place. Connection problems across nodes are a huge issue with
Druid right now.