gianm commented on issue #9011: URL: https://github.com/apache/druid/issues/9011#issuecomment-1152832935
Looking at the 3 comments about issues here (from @didip, @applike-ss, & @dene14) it seems to me that the issues are probably related but different.

The original report by @dene14 has a really confusing log file. There is a message that a call to `http://prod-druid-overlord-0.prod-druid-overlord-headless.prod-druid.svc.cluster.local:8090/druid/indexer/v1/action` failed. But the following stack trace looks like a failed _startup_ (note the `Lifecycle.start`). It should have come at the beginning of the log. And, at any rate, the stack trace is about a Coordinator API (note the `LookupReferencesManager.fetchLookupsForTier`: that's hitting the Coordinator) not an Overlord API. I wonder if the log got chopped up or went out-of-order somehow. The Kinesis metric timestamps also do not match the log timestamps, so it's hard to correlate these. It's been a long time since this report was filed, so I guess all the stuff required to debug it is long gone. This is unfortunate.

@applike-ss I'm interested in more information from task log files, if you have it. Kinesis metrics would be useful too. If they're in a different time zone from the log files, please let us know.

@didip Are you saying what happens is something like this?

1. The Coordinator becomes unresponsive for some reason.
2. Some time later, a new Kafka or Kinesis task starts up.
3. The task can't finish starting up because the Coordinator is unresponsive.
4. Ingestion falls behind, because the tasks can't start up.

Or are you saying that there is an issue where tasks that started up suddenly _become unresponsive_ if the Coordinator has a problem?

Basically, I'm asking because I would expect the first case to happen: new processes (including tasks) can't start up if the Coordinator is unavailable and basic security is enabled. This is because they need to sync the user database from the Coordinator. But, I wouldn't expect the second case to happen.
If the Coordinator is unresponsive for some time while a task is running, it shouldn't prevent the task from making progress, unless the task is at a point where it needs to access some Coordinator API. Then, the task should basically sit there and wait for the Coordinator to come back.
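
To make the "sit there and wait" expectation concrete, here is a minimal sketch of a wait-with-backoff loop. This is not Druid's actual code: the function name `wait_until_available`, its parameters, and the delay values are all hypothetical, chosen only to illustrate a caller that blocks until a dependency such as the Coordinator answers again instead of failing.

```python
import time
from typing import Callable


def wait_until_available(
    is_available: Callable[[], bool],
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until is_available() returns True.

    Hypothetical helper, not part of Druid. Returns the number of
    failed checks that happened before the dependency came back.
    """
    delay = initial_delay
    failures = 0
    while not is_available():
        failures += 1
        sleep(delay)                       # wait before checking again
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return failures


# Simulate a dependency that is down for three checks, then recovers.
responses = iter([False, False, False, True])
delays = []  # records requested sleep durations instead of really sleeping
failed_checks = wait_until_available(lambda: next(responses), sleep=delays.append)
```

In this simulation the caller never gives up: it records three failed checks with growing pauses between them and then proceeds, which is the behavior I would expect from a running task that hits a Coordinator API while the Coordinator is temporarily unresponsive.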
