gianm commented on issue #9011:
URL: https://github.com/apache/druid/issues/9011#issuecomment-1152832935

   Looking at the three reports here (from @didip, @applike-ss, and @dene14), 
it seems to me that the issues are probably related but different.
   
   The original report by @dene14 has a really confusing log file. There is a 
message that a call to 
`http://prod-druid-overlord-0.prod-druid-overlord-headless.prod-druid.svc.cluster.local:8090/druid/indexer/v1/action`
 failed. But the stack trace that follows looks like a failed _startup_ (note 
the `Lifecycle.start`), so it should have appeared at the beginning of the log. 
And, at any rate, the stack trace is about a Coordinator API (note the 
`LookupReferencesManager.fetchLookupsForTier`: that's hitting the Coordinator), 
not an Overlord API. I wonder if the log got truncated or reordered somehow. 
The Kinesis metric timestamps also do not match the log timestamps, so it's 
hard to correlate the two. It's been a long time since this report was filed, 
so I guess the logs and metrics needed to debug it are long gone, which is 
unfortunate.
   
   @applike-ss I'm interested in more information from task log files, if you 
have it. Kinesis metrics would be useful too. If they're in a different time 
zone from the log files, please let us know.
   
   @didip Are you saying what happens is something like this?
   
   1. The Coordinator becomes unresponsive for some reason.
   2. Some time later, a new Kafka or Kinesis task starts up.
   3. The task can't finish starting up because the Coordinator is unresponsive.
   4. Ingestion falls behind, because the tasks can't start up.
   
   Or are you saying that there is an issue where tasks that are already 
running suddenly _become unresponsive_ if the Coordinator has a problem?
   
   Basically, I'm asking because I would expect the first case to happen: new 
processes (including tasks) can't start up if the Coordinator is unavailable 
and basic security is enabled. This is because they need to sync the user 
database from the Coordinator.
   
   But I wouldn't expect the second case to happen. If the Coordinator is 
unresponsive for some time while a task is running, it shouldn't prevent the 
task from making progress, unless the task is at a point where it needs to 
access some Coordinator API. At that point, the task should basically sit there 
and wait for the Coordinator to come back.
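
   To make the expected behavior in the second case concrete, here is a minimal 
sketch of a caller that blocks and retries until the Coordinator answers again. 
This is not Druid's actual code: `CoordinatorUnavailable`, `call_with_wait`, and 
the backoff values are hypothetical names and numbers chosen only for 
illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CoordinatorUnavailable(Exception):
    """Hypothetical error raised when the Coordinator cannot be reached."""


def call_with_wait(
    call: Callable[[], T],
    retry_delay_seconds: float = 1.0,
    max_delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Block until `call` succeeds, backing off between attempts.

    The task makes no progress on this call while the Coordinator is down,
    but it does not fail: it resumes as soon as one attempt succeeds.
    """
    delay = retry_delay_seconds
    while True:
        try:
            return call()
        except CoordinatorUnavailable:
            # Coordinator is unresponsive: wait, then try again with a
            # longer (capped) delay.
            sleep(delay)
            delay = min(delay * 2, max_delay_seconds)
```

   Only the code path that needs the Coordinator is blocked here. Work that 
does not go through a call like this would keep running, which is why a 
Coordinator outage by itself should not make a running task unresponsive.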


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

