[GitHub] [druid] didip commented on issue #9011: OrderedPartitionableRecord buffer full, storing iterator and retrying

GitBox Fri, 10 Jun 2022 19:27:33 -0700


didip commented on issue #9011:
URL: https://github.com/apache/druid/issues/9011#issuecomment-1152837767


   Hi Gian!
   
   First of all, I want to clarify that my problem revolves around batch
   ingestion, but it's almost exactly the same stack trace.
   
   > Or are you saying that there is an issue where tasks that started up
   > suddenly *become unresponsive* if the Coordinator has a problem?
   
   From the perspective of batch ingestion, it's actually both. 97% of the
   time, a Peon starts up fine. But when you have one to two thousand
   Peons, a few of them will fail on boot when the Coordinator becomes
   unresponsive.
   Unfortunately, that's enough to bomb a few subtasks, which in turn bombs
   the entire index_parallel task.
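   
   To make the failure mode concrete, here is a minimal Python sketch of
   the kind of boot-time retry that would let a process ride out a short
   Coordinator outage instead of failing immediately. This is not Druid
   code: `call_with_retry`, its parameters, and the use of
   `ConnectionError` are all hypothetical names chosen for illustration.

```python
import time


def call_with_retry(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially.

    fetch is a hypothetical stand-in for a startup call that needs the
    Coordinator. ConnectionError is re-raised after max_attempts failures.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller fail the task
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the backoff
```

   With the defaults above, a call that keeps failing is attempted five
   times with waits of 1, 2, 4, and 8 seconds in between before the error
   is raised to the caller.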
   
   I see that you already replied to me on Slack. We can continue the
   conversation there and write up the solutions here later.
   
   On Fri, Jun 10, 2022 at 6:54 PM Gian Merlino ***@***.***>
   wrote:
   
   > Looking at the 3 comments about issues here (from @didip
   > <https://github.com/didip>, @applike-ss <https://github.com/applike-ss>,
   > & @dene14 <https://github.com/dene14>) it seems to me that the issues are
   > probably related but different.
   >
   > The original report by @dene14 <https://github.com/dene14> has a really
   > confusing log file. There is a message that a call to
   > http://prod-druid-overlord-0.prod-druid-overlord-headless.prod-druid.svc.cluster.local:8090/druid/indexer/v1/action
   > failed. But the following stack trace looks like a failed *startup* (note
   > the Lifecycle.start). It should have come at the beginning of the log.
   > And, at any rate, the stack trace is about a Coordinator API (note the
   > LookupReferencesManager.fetchLookupsForTier: that's hitting the
   > Coordinator) not an Overlord API. I wonder if the log got chopped up or
   > went out-of-order somehow. The Kinesis metric timestamps also do not match
   > the log timestamps, so it's hard to correlate these. It's been a long time
   > since this report was filed, so I guess all the stuff required to debug it
   > is long gone. This is unfortunate.
   >
   > @applike-ss <https://github.com/applike-ss> I'm interested in more
   > information from task log files, if you have it. Kinesis metrics would be
   > useful too. If they're in a different time zone from the log files, please
   > let us know.
   >
   > @didip <https://github.com/didip> Are you saying what happens is
   > something like this?
   >
   >    1. The Coordinator becomes unresponsive for some reason.
   >    2. Some time later, a new Kafka or Kinesis task starts up.
   >    3. The task can't finish starting up because the Coordinator is
   >    unresponsive.
   >    4. Ingestion falls behind, because the tasks can't start up.
   >
   > Or are you saying that there is an issue where tasks that started up
   > suddenly *become unresponsive* if the Coordinator has a problem?
   >
   > Basically, I'm asking because I would expect the first case to happen: new
   > processes (including tasks) can't start up if the Coordinator is
   > unavailable and basic security is enabled. This is because they need to
   > sync the user database from the Coordinator.
   >
   > But, I wouldn't expect the second case to happen. If the Coordinator is
   > unresponsive for some time while a task is running, it shouldn't prevent
   > the task from making progress, unless the task is at a point where it needs
   > to access some Coordinator API. Then, the task should basically sit there
   > and wait for the Coordinator to come back.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

