OurNewestMember commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1320806594
The coordinator is worth focusing on. Why?
- segments may not be dropped (marking segments used/unused is a coordinator
duty, although the historical executes the actual drop...and the coordinator
can affect the health of historicals through, eg, the load/drop workload it
assigns, including segment balancing...and so on, back and forth)
- inconsistent query results on the broker (obviously affected by broker
performance itself, but the broker's metadata handling can be intensive and
relies on the coordinator)
- overall historical load: can depend heavily on the coordinator (eg,
loading/dropping segments; even coordinator trouble -> poor ingestion ->
suboptimal segments -> more query workload, etc)...and could prevent proper
segment unloading
  - same as "segments may not be dropped" above, but from the historical's
point of view
- the coordinator touches the metadata store, which is an effective way for
something like ingest (eg, heavy ingests, compaction, etc) to stall the
coordinator (eg, heavy ingest can badly fragment an RDBMS)
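One hedged way to check that last point: if the metadata store is PostgreSQL, dead-tuple counts on the druid_* tables hint at fragmentation from heavy ingest churn. A minimal sketch (the query uses standard PostgreSQL statistics views, but the threshold and helper are illustrative assumptions, not Druid tooling):

```python
# Sketch: flag possibly bloated Druid metadata tables from PostgreSQL's
# pg_stat_user_tables statistics. The 20% dead-tuple threshold is an
# illustrative assumption -- tune to taste.

BLOAT_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname LIKE 'druid_%'
"""

def flag_bloated(rows, dead_ratio=0.2):
    """rows: (table, live_tuples, dead_tuples) tuples, as returned by
    BLOAT_QUERY. Return tables whose dead-tuple ratio exceeds dead_ratio."""
    flagged = []
    for table, live, dead in rows:
        total = live + dead
        if total and dead / total > dead_ratio:
            flagged.append(table)
    return flagged
```

Run BLOAT_QUERY with whatever PostgreSQL client you already use and feed the rows to the helper; a persistently high ratio on druid_segments would point at exactly the ingest-driven fragmentation described above.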
So a "problem on the historical" also appears to be a very good candidate here.
However, the "inconsistencies" (in/across time...and in space: on different
historicals and pods, with different segments, different queries, different
browser refreshes -> possibly calls to different brokers, etc) demand more
commonality between the failures than a "persistent set of coincidences" would
explain (of course "persistent coincidences" aren't impossible). So I'd look at
the coordinator as the relevant commonality.
(...And of course the coordinator can be affected by other cluster activity,
like heavy ingest destabilizing an overlord running on the same hardware, or
the metadata store shared with the coordinator...all things are connected.)
The point of mentioning all of this is that an upgrade may not fix a problem
like this one. (It could actually make it worse -- that sometimes happens, like
potentially around 2021-12 [a new feature's side effect causing much higher
resource requirements for a recently enhanced in-memory column info, IIRC] and
maybe also around 2022-10 [a massive increase in heap requirements for
streaming and batch indexing]...upgrade problems are pretty understandable in a
large, complex system.) I'm not saying you shouldn't upgrade -- just that,
regardless, the system could be running too close to some limits for your
needs. If so, the info above is about examining wherever that gap between
desired and actual performance may live.
Some questions worth mulling over...
How many segments are in the cluster? (best to break down by
used/unused...because that affects coordinator workload, plus the possibly
highly relevant workload of the overlord if it shares resources for
computation/network/state/etc)
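For the used-segment side of that count, here is a sketch against Druid's SQL API (the /druid/v2/sql endpoint and the sys.segments table are standard Druid; the broker address is an assumption). Note that sys.segments only lists used segments -- unused ones have to be counted in the metadata store (eg druid_segments WHERE used = false):

```python
import json
from urllib import request

# Sketch: count used segments per datasource via Druid's SQL API.
# BROKER is an assumed address -- point it at your own broker/router.

BROKER = "http://localhost:8082"

QUERY = """
SELECT "datasource", "is_available", COUNT(*) AS "n"
FROM sys.segments
GROUP BY "datasource", "is_available"
"""

def summarize(rows):
    """Collapse result rows into {datasource: total used-segment count}."""
    totals = {}
    for r in rows:
        totals[r["datasource"]] = totals.get(r["datasource"], 0) + r["n"]
    return totals

def fetch_segment_counts(broker=BROKER):
    req = request.Request(
        broker + "/druid/v2/sql",
        data=json.dumps({"query": QUERY}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return summarize(json.load(resp))
```

A very large total, or a large unused count in the metadata store, is exactly the kind of thing that inflates coordinator (and overlord) workload.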
How smooth are the ingest workload (demand) and its performance (actual)?
(Also consider compaction, even kill tasks, etc.)
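One way to put rough numbers on that smoothness is the overlord's completed-task listing (GET /druid/indexer/v1/completeTasks). The field names below match that API's JSON but are worth verifying against your Druid version; the helper itself is just an illustration:

```python
# Sketch: gauge ingest smoothness from the overlord's completed-task
# list. Each task object carries (among others) "statusCode" and
# "duration" (milliseconds); a rising failure rate or a blown-out
# longest duration is the kind of roughness worth chasing.

def ingest_health(tasks):
    """tasks: list of dicts with 'statusCode' and 'duration' (ms).
    Returns (failure_rate, max_duration_ms)."""
    if not tasks:
        return 0.0, 0
    failed = sum(1 for t in tasks if t.get("statusCode") != "SUCCESS")
    longest = max(t.get("duration", 0) for t in tasks)
    return failed / len(tasks), longest
```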
Any general observations related to stability and performance? (eg, dying
processes, failed ingest tasks, slow publish times, indexing error messages
about retries/errors in HTTP calls, ongoing logs/alerts about throttled
segment balancing, etc)
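On the stability side, the coordinator's load-queue API (GET /druid/coordinator/v1/loadqueue?simple) is a concrete place to look for the throttled or stuck balancing mentioned above. A sketch with an illustrative threshold:

```python
# Sketch: spot historicals with deep segment load/drop queues from the
# coordinator's loadqueue?simple response, which maps each server to
# counters like "segmentsToLoad" and "segmentsToDrop". A queue that
# stays deep across polls is one concrete sign of the
# coordinator/historical back-and-forth described earlier.
# The threshold of 100 pending operations is an assumption, not a
# Druid default.

def backlogged_servers(loadqueue, threshold=100):
    """loadqueue: {server: {'segmentsToLoad': n, 'segmentsToDrop': n}}.
    Return servers with more than `threshold` pending operations."""
    return [
        server
        for server, q in loadqueue.items()
        if q.get("segmentsToLoad", 0) + q.get("segmentsToDrop", 0) > threshold
    ]
```

Polling this every minute or so, and watching whether the same servers stay flagged, separates a transient rebalance from a coordinator that can't keep up.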
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]