[GitHub] [druid] gianm commented on issue #9011: OrderedPartitionableRecord buffer full, storing iterator and retrying

GitBox Fri, 10 Jun 2022 18:54:54 -0700


gianm commented on issue #9011:
URL: https://github.com/apache/druid/issues/9011#issuecomment-1152832935

Looking at the three reports here (from @didip, @applike-ss, and @dene14),
it seems to me that the issues are probably related but distinct.

The original report by @dene14 has a really confusing log file. There is a
message that a call to
`http://prod-druid-overlord-0.prod-druid-overlord-headless.prod-druid.svc.cluster.local:8090/druid/indexer/v1/action`
failed. But the stack trace that follows looks like a failed _startup_ (note
the `Lifecycle.start`), which should have come at the beginning of the log.
And, at any rate, the stack trace is about a Coordinator API (note the
`LookupReferencesManager.fetchLookupsForTier`: that's hitting the Coordinator),
not an Overlord API. I wonder if the log got chopped up or went out of order
somehow. The Kinesis metric timestamps also do not match the log timestamps, so
it's hard to correlate the two. It's been a long time since this report was
filed, so I guess everything needed to debug it is long gone, which is
unfortunate.

@applike-ss I'm interested in more information from the task log files, if you
have them. Kinesis metrics would be useful too. If their timestamps are in a
different time zone from the log files, please let us know.

@didip Are you saying what happens is something like this?

1. The Coordinator becomes unresponsive for some reason.
2. Some time later, a new Kafka or Kinesis task starts up.
3. The task can't finish starting up because the Coordinator is unresponsive.
4. Ingestion falls behind, because the tasks can't start up.

Or are you saying that there is an issue where tasks that have already started
up suddenly _become unresponsive_ if the Coordinator has a problem?

Basically, I'm asking because I would expect the first case to happen: new
processes (including tasks) can't start up if the Coordinator is unavailable
and basic security is enabled. This is because they need to sync the user
database from the Coordinator.

But I wouldn't expect the second case to happen. If the Coordinator is
unresponsive for some time while a task is running, it shouldn't prevent the
task from making progress, unless the task is at a point where it needs to
access some Coordinator API. Then, the task should basically sit there and wait
for the Coordinator to come back.
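
As a rough illustration of that "sit there and wait" behavior, here is a
minimal sketch of a blocking retry loop with capped exponential backoff. It is
not Druid's actual code: `call_with_retry`, `fetch`, and the delay values are
hypothetical names and numbers chosen for the example.

```python
import time


def call_with_retry(fetch, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Block until fetch() succeeds, backing off between failed attempts.

    `fetch` stands in for any call to a remote service (hypothetical here).
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    delay = base_delay
    while True:
        try:
            return fetch()
        except ConnectionError:
            # The remote service is unavailable: wait, then try again.
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
```

The caller makes no progress past this call until the remote service answers,
but everything it did before reaching the call is unaffected, which matches the
distinction between the two cases above.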

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
