pdeva commented on issue #6166: MAJOR issue: Unrecoverable error in KIS 0.12.2 
and 0.12.1
URL: 
https://github.com/apache/incubator-druid/issues/6166#issuecomment-412628004
 
 
   I just tried resizing the faulty MM (MiddleManager) instance to an 
absurdly large number of CPUs, and that seems to have done the trick.
   
   Here is what I think is happening:
   
   1. When the node starts, it tries to start all the KI tasks.
   2. Since the tasks have been down for a while, they saturate all CPUs 
trying to read and process the backlog of data from Kafka.
   3. Due to the CPU saturation, they cannot respond to the overlord's status 
checks.
   4. Since the overlord's status checks fail, it kills the tasks and respawns 
them, creating an endless loop.
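
   The loop described in these four steps can be illustrated with a toy simulation. This is only a sketch of the hypothesis, not Druid's actual scheduling or supervision logic: the function name `simulate`, the tick-based model, and the assumption that a killed task loses all of its catch-up progress are invented for illustration.

```python
from typing import Optional


def simulate(cpus: int, tasks: int, backlog: float,
             checks_before_kill: int, max_ticks: int = 1000) -> Optional[int]:
    """Toy model of the suspected kill/respawn loop (hypothetical, not Druid code).

    Each task needs `backlog` CPU-ticks of work to catch up. While more tasks
    are catching up than there are CPUs, the node counts as saturated and every
    status check fails. After `checks_before_kill` consecutive failures the
    task is killed and respawned with its full backlog (assumed: no progress
    survives the kill). Returns the tick at which all tasks have caught up,
    or None if that never happens within `max_ticks`.
    """
    remaining = [backlog] * tasks
    failed = [0] * tasks
    for tick in range(1, max_ticks + 1):
        busy = [i for i, r in enumerate(remaining) if r > 0]
        if not busy:
            return tick - 1
        saturated = len(busy) > cpus
        share = min(1.0, cpus / len(busy))  # CPU available to each busy task
        for i in busy:
            remaining[i] = max(0.0, remaining[i] - share)
            if saturated:
                failed[i] += 1
                if failed[i] >= checks_before_kill:
                    remaining[i] = backlog  # killed and respawned
                    failed[i] = 0
            else:
                failed[i] = 0
    return None


# Small box: tasks are killed before they can catch up, so the loop never ends.
print(simulate(cpus=2, tasks=8, backlog=10, checks_before_kill=3))
# Oversized box: no saturation, so status checks pass and the tasks catch up.
print(simulate(cpus=16, tasks=8, backlog=10, checks_before_kill=3))
```

   In this model the same small box also recovers if the overlord tolerates many more failed checks before killing a task, since the tasks then get enough time to work through their backlog.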
   
   In fact, even on the much more powerful box, the overlord's status check 
failed a few times (though it seems to have passed before timing out 
completely).
   
   The solution would be to:
   1. Increase the overlord's timeout for individual status pings to tasks 
(not sure if there is an existing setting for this).
   2. Increase the total time and number of retries before giving up.
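
   The effect of those two knobs can be sketched as a generic retry loop. This is a minimal illustration, not the overlord's implementation: `check_with_retries`, `make_task`, and the latency numbers are all hypothetical, and the mapping to any real Druid setting is not established here.

```python
import time
from typing import Callable, Iterable


def check_with_retries(ping: Callable[[float], bool], timeout_s: float,
                       retries: int, backoff_s: float = 0.0) -> bool:
    """Return True once a ping succeeds; False after 1 + `retries` failures.

    `ping` is a hypothetical callable that takes the per-attempt timeout and
    either returns True or raises TimeoutError.
    """
    for attempt in range(retries + 1):
        try:
            if ping(timeout_s):
                return True
        except TimeoutError:
            pass  # treat a timeout like any other failed attempt
        if attempt < retries and backoff_s > 0:
            time.sleep(backoff_s)
    return False


def make_task(latencies: Iterable[float]) -> Callable[[float], bool]:
    """Fake task whose n-th status response takes latencies[n] seconds."""
    it = iter(latencies)

    def ping(timeout_s: float) -> bool:
        latency = next(it)
        if latency > timeout_s:
            raise TimeoutError(f"no response within {timeout_s}s")
        return True

    return ping


# A CPU-starved task that answers in 15s fails every 10s ping ...
print(check_with_retries(make_task([15, 15, 15]), timeout_s=10, retries=2))
# ... but passes with a longer per-ping timeout,
print(check_with_retries(make_task([15, 15, 15]), timeout_s=20, retries=2))
# or with enough retries to outlast the initial catch-up burst.
print(check_with_retries(make_task([15, 15, 5]), timeout_s=10, retries=2))
```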
   
   The root cause of this issue seems to be related to #6117 and #5340. 
Connection problems across nodes are a huge issue with Druid right now.
   
   

