splunk-tschetter commented on issue #8139:
URL: https://github.com/apache/druid/issues/8139#issuecomment-801706838


   As @rgidwani-splunk mentions, we ran into this with druid-0.18.  In our case, it seems to be some sort of race condition that shows up when we assign many partitions to a single task, and it also seems to be the result of quite a large cocktail of issues.  For the bug that pdeva reported, the log line only has a single partition, so I'm not sure whether what pdeva experienced and what we experienced are really the same thing, but I'll go into what I've been able to piece together.
   
   I'll start by saying that this issue and error resulted in the ingestion not making any progress (all tasks would fail because they could not commit offsets).  We worked around it by setting the number of tasks equal to the number of Kafka partitions that we have; this has allowed our tasks to make forward progress without hindrance.
   
   After looking at the logs from the overlord and tasks that failed, it 
appears that what happens is
   
   1) At the end of the time of the task (1 hour), the overlord pauses, gets 
the current offsets and sets end offsets for the tasks in the replica sets.  
   2) Each task accepts the intended end offsets and starts processing to try 
to achieve them
   3) After 30 minutes, the tasks still haven't caught up and committed a full 
set of segments.  They get killed.
   4) In the meantime, the "new hour" tasks never get to commit anything 
because the previous tasks haven't actually finished committing their own set 
of segments yet (i.e. when the new tasks go to commit, the offsets sitting in 
the DB for the most up-to-date segment are still the ones from the previous 
tasks that were trying to work their way up to the final offsets)
   5) Because the previous tasks were killed without ever catching up, the 
system gets into a bad state where the supervisor is firing up new tasks based 
on the offsets that the previous tasks should've worked their way up to, but 
the previous tasks are not around to complete their job.
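   The sequence above can be modeled with a small sketch (illustrative only, not Druid source): Druid publishes a task's segments transactionally, accepting the publish only if the start offsets the task reports match what the metadata store currently holds.

   ```python
   def try_publish(stored_offsets, task_start_offsets, task_end_offsets):
       """Return the metadata-store offsets after a publish attempt."""
       if task_start_offsets != stored_offsets:
           # Mismatch: publish rejected, task fails, metadata unchanged.
           return stored_offsets
       # Match: publish accepted, metadata advances to the task's end offsets.
       return task_end_offsets

   # Old-hour tasks last committed offset 1000 on partition 0, were told to
   # reach 1500, and were killed before getting there, so the store stays
   # at {0: 1000}.
   stored = {0: 1000}

   # The supervisor starts new-hour tasks from the intended target (1500),
   # so their publishes are rejected and no progress is ever made:
   stored = try_publish(stored, {0: 1500}, {0: 2000})
   assert stored == {0: 1000}
   ```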
   
   Now, the cocktail that actually produces this situation is a mix of tuning 
parameters.  Specifically, 
   a) the tasks where we ran into this are committing local persists very 
rapidly such that they are essentially always throttled on waiting for the next 
persist to hit disk.  This means that they actually make rather slow progress 
through the data.  
   b) On top of that, with 8 partitions per task, there was more of a chance 
for differences in the rate of processing of data between the tasks to cause a 
significant backlog that needs to be caught up
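   The persist throttling in (a) is governed by the supervisor's `tuningConfig`; a hedged sketch of the knobs involved (values are illustrative, not our production settings):

   ```json
   {
     "tuningConfig": {
       "type": "kafka",
       "maxRowsInMemory": 150000,
       "intermediatePersistPeriod": "PT10M",
       "maxPendingPersists": 0
     }
   }
   ```

   Very frequent intermediate persists (small `maxRowsInMemory` or a short `intermediatePersistPeriod`, with no pending persists allowed) mean ingestion blocks on disk between persists, which is what slows the catch-up described above.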
   
   By switching to just one partition per task, we are guaranteed that at least 
one of the tasks is actually fully caught up with the final offset, which means 
that our metadata is in order, even though things are still relatively poorly 
configured for the given data that we are ingesting.
   
   It seems like the only recourse for getting back to a good state in this 
case might be for the supervisor to somehow detect that it's not making 
progress and adjust the offsets that it uses.  A possibly meaningful heuristic: 
no actively running tasks, plus a mismatch between the offsets committed for 
the ingestion and the offsets that it gives to new tasks.
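   That heuristic could be sketched as follows (a hypothetical helper, not anything that exists in Druid today):

   ```python
   def supervisor_is_stuck(running_tasks, committed_offsets, new_task_start_offsets):
       """Detect the wedged state described above: no tasks are running,
       yet the offsets committed in metadata disagree with the offsets the
       supervisor would hand to new tasks, so no new task can ever publish
       successfully."""
       return len(running_tasks) == 0 and committed_offsets != new_task_start_offsets
   ```

   When this fires, the supervisor could fall back to the committed offsets for the next round of tasks (today an operator has to intervene, e.g. via the supervisor reset API).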
   
   Additionally, I have not looked to see if this is still an issue in the 
latest version.
   
   
   

