OurNewestMember commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1320806594

   The coordinator is worth focusing on.  Why?
   - segments may not be dropped: marking segments used/unused is a coordinator
duty, and although the historical actually executes the drop, the coordinator
can affect the health of the historical through, e.g., its load/drop workload
(including segment balancing)...and so on, back and forth
   - inconsistent query results on the broker (obviously impacted by broker
performance itself, but the broker's metadata handling can be intensive and
relies on the coordinator)
   - overall historical load: can depend heavily on the coordinator (e.g.,
loading/dropping segments; even a struggling coordinator -> poor ingestion ->
suboptimal segments -> more query workload, etc.)...and it could prevent proper
segment unloading
       - same as "segments may not be dropped" above, but from the
historical's point of view
   - it touches the metadata store, which is an effective way for something
like ingestion (e.g., heavy ingests, compaction, etc.) to stall the
coordinator (e.g., heavy ingest can badly fragment an RDBMS)
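   One quick way to see whether the coordinator's load/drop workload is backed
up is to check its load queue.  A minimal sketch, assuming a coordinator
reachable at the placeholder host below and using the coordinator's
`/druid/coordinator/v1/loadqueue` API:

```python
# Sketch: gauge segment load/drop backlog per historical via the
# coordinator's load-queue API.  COORDINATOR is a placeholder host:port.
import json
import urllib.request

COORDINATOR = "http://coordinator:8081"  # placeholder


def load_queue_url(base):
    # "?simple" asks for per-server counts/sizes instead of full segment lists
    return base + "/druid/coordinator/v1/loadqueue?simple"


def fetch_load_queue(base=COORDINATOR):
    # Returns a dict keyed by server, with segmentsToLoad/segmentsToDrop info
    with urllib.request.urlopen(load_queue_url(base)) as resp:
        return json.load(resp)
```

   A persistently deep load queue (or segments that never leave the drop
queue) would point at exactly the coordinator/historical back-and-forth
described above.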
   
   So a "problem on the historical" also appears to be a very good candidate
here.  However, the "inconsistencies" (in/across time...and in space: on
different historicals and pods, with different segments, different queries,
different browser refreshes -> possibly calls to different brokers, etc.)
demand more commonality between the failures than a "persistent set of
coincidences" as the explanation (though "persistent coincidences" aren't
impossible).  So I'd look at the coordinator as a relevant commonality.
(And of course the coordinator can be affected by other cluster activity,
like heavy ingest destabilizing an overlord running on the same hardware, or
the metadata store shared with the coordinator...all things are connected.)
   
   The point of mentioning all of this is that an upgrade may not fix a
problem like this one.  (It could even make it worse -- that happens
sometimes, like potentially around 2021-12 [a new-feature side effect causing
much higher resource requirements for recently enhanced in-memory column
info, IIRC] and maybe also around 2022-10 [a massive increase in heap
requirements for streaming and batch indexing]...upgrade problems are pretty
understandable in a large, complex system.)  Not saying you shouldn't
upgrade -- just that, regardless, the system could be running too close to
some limits for your needs.  If so, the info above is about examining
wherever that gap between desired and actual performance may live.
   
   Some questions worth mulling over...
   
   How many segments are in the cluster?  (Best to break down by
used/unused, because that affects coordinator workload -- plus the possibly
highly relevant workload of the overlord, if it shares resources for
computation/network/state/etc.)
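   For the segment-count question, a small sketch using Druid SQL (assuming
SQL is enabled on a broker at the placeholder host below; note that truly
unused segments don't appear in sys.segments, so those counts have to come
from the metadata store's druid_segments table):

```python
# Sketch: count published segments per datasource via Druid SQL.
# BROKER is a placeholder host:port.
import json
import urllib.request

BROKER = "http://broker:8082"  # placeholder

# sys.segments covers published/realtime segments; for unused segments,
# query the metadata store directly, e.g.:
#   SELECT used, COUNT(*) FROM druid_segments GROUP BY used;
SEGMENT_COUNT_SQL = """
SELECT "datasource", COUNT(*) AS segments, SUM("size") AS total_bytes
FROM sys.segments
GROUP BY 1
ORDER BY 2 DESC
"""


def sql_payload(query):
    # Druid SQL over HTTP takes a JSON body with a "query" field
    return json.dumps({"query": query}).encode("utf-8")


def run_sql(query, base=BROKER):
    req = urllib.request.Request(
        base + "/druid/v2/sql",
        data=sql_payload(query),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

   Datasources with very high segment counts (lots of small segments) are a
common driver of coordinator/overlord workload.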
   
   How smooth are ingest workload (demand) and performance (actual)?  (Also
consider compaction, even kill tasks, etc.)
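   For the ingest-smoothness question, the same Druid SQL endpoint exposes a
sys.tasks table; a sketch (again with a placeholder broker host) that
summarizes recent tasks by type and status:

```python
# Sketch: group recent tasks by type and status to spot failing ingest,
# compaction, or kill tasks at a glance.  BROKER is a placeholder host:port.
import json
import urllib.request

BROKER = "http://broker:8082"  # placeholder

TASK_SUMMARY_SQL = """
SELECT "type", "status", COUNT(*) AS tasks
FROM sys.tasks
GROUP BY 1, 2
ORDER BY 1, 2
"""


def task_summary(base=BROKER):
    body = json.dumps({"query": TASK_SUMMARY_SQL}).encode("utf-8")
    req = urllib.request.Request(
        base + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # list of rows like {"type": ..., "status": ..., "tasks": ...}
        return json.load(resp)
```

   A steady trickle of FAILED tasks, or task durations that swing wildly,
is the kind of unsmooth ingest that feeds back into the coordinator story
above.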
   
   Any general observations related to stability and performance?  (e.g.,
dying processes, failed ingest tasks, slow publish times, indexing error
messages about retries/errors in HTTP calls, ongoing logs/alerts about
throttled segment balancing, etc.)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

