[GitHub] pdeva commented on issue #6166: MAJOR issue: Unrecoverable error in KIS 0.12.2 and 0.12.1

2018-08-13 Thread GitBox
pdeva commented on issue #6166: MAJOR issue: Unrecoverable error in KIS 0.12.2 
and 0.12.1
URL: 
https://github.com/apache/incubator-druid/issues/6166#issuecomment-412628004
 
 
   so i just tried resizing the faulty MM instance to an absolutely absurd 
number of cpus, and that seems to have done the trick.
   
   Here is what i think is happening:
   
   1. When the node is started, it tries to start all the KI tasks.
   2. Since the tasks have been down for a while, they saturate all cpu trying 
to read and process data from kafka.
   3. Due to cpu saturation, they cannot respond to the overlord's status checks.
   4. Since the overlord's status checks fail, it kills the tasks and respawns 
them, thus creating an endless loop.
   
   in fact, even on the much more powerful cpu box, the overlord's status check 
failed a few times (but it seems to have passed before the overlord gave up 
completely).
   
   the solution would be to:
   1. increase overlord timeout for individual status pings to tasks (not sure 
if there is an existing setting for this)
   2. increase the total time and number of retries before giving up.
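
The two changes above can be sketched roughly as follows. This is an illustrative Python sketch, not Druid code: `check_status`, `probe`, `timeout_s`, `retries` and `backoff_s` are hypothetical names and do not correspond to any existing Druid setting or API. It only shows how a longer per-ping timeout and more retries give a CPU-saturated task more chances to answer before it is declared dead.

```python
import time
from typing import Callable


def check_status(
    probe: Callable[[float], bool],
    timeout_s: float,
    retries: int,
    backoff_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as one status ping succeeds.

    `probe` is a hypothetical callable that pings a task with the given
    per-ping timeout and raises TimeoutError if the task does not answer.
    The task is only considered dead after `retries + 1` failed attempts.
    """
    for attempt in range(retries + 1):
        try:
            if probe(timeout_s):
                return True
        except TimeoutError:
            pass  # task was too busy to answer this ping
        if attempt < retries:
            # exponential backoff between attempts (an assumption, not Druid behavior)
            sleep(backoff_s * (2 ** attempt))
    return False


def make_slow_task(failures_before_success: int) -> Callable[[float], bool]:
    """Fake task that times out a fixed number of times, then answers."""
    calls = {"n": 0}

    def probe(timeout_s: float) -> bool:
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise TimeoutError("task too busy to answer")
        return True

    return probe
```

With a fake task that misses its first three pings, a policy of 2 retries gives up (and would trigger a kill and respawn), while a policy of 5 retries lets the task survive.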
   
   the root cause of this issue seems to be related to #6117 and #5340. 
connection problems across nodes are a huge issue with druid right now.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[GitHub] pdeva commented on issue #6166: MAJOR issue: Unrecoverable error in KIS 0.12.2 and 0.12.1

2018-08-13 Thread GitBox
pdeva commented on issue #6166: MAJOR issue: Unrecoverable error in KIS 0.12.2 
and 0.12.1
URL: 
https://github.com/apache/incubator-druid/issues/6166#issuecomment-412605649
 
 
   @gianm here is a video of the overlord console:
   https://www.screencast.com/t/z4eXgbaGe5fy
   
   as you can see, the tasks don't really produce any output at all, which is 
what makes it really tough to debug this issue. (ignore the dripstat agent 
output in the logs in the video, it doesn't affect the tasks)
   they just start and shut down.
   
   the video shows that one replica seems to be working, but we had to take 
down the other replica since it kept on spawning new tasks (which fail) almost 
every minute.
   
   the coordinator (+overlord) logs show that it tries to spawn the tasks, but 
it cannot connect to them, so it seems to kill them. (coordinator logs are in 
the users-group thread i linked to).
   
   

