acherla opened a new issue, #13692:
URL: https://github.com/apache/druid/issues/13692

   Hi Druid team,
   We manage a cluster with more than 3,000 peons (116 middle managers) and 
more than 200 historical nodes (4-5 million segments), with two masters 
configured as both coordinator and overlord.  We run more than 400 streaming 
ingestion jobs into Druid, and we noticed that many of our peon tasks get 
stuck at "Coordinator handoff scheduled - Still waiting for handoff for X 
segments".  This handoff can take anywhere from a few minutes to 30+ minutes, 
depending on the configured completionTimeout period (the job fails, but it 
is still marked as success in the logs).  
   
   To optimize our Druid cluster, we modified the following parameters.  
This improved the performance of the coordinator/overlord, but we still see 
handoff times of 20-30 minutes at times.
   1. Decreased percentOfSegmentsToConsiderPerMove from 70 percent to 10 
percent.  
   2. Modified our tasks to create fewer, larger segments to bring down the 
overall segment count. 
   3. Increased the number of pendingSegment threads on the 
coordinator/overlord from 1 to 10 to handle handing off multiple segments.
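
For reference, change 1 can be sketched as a coordinator dynamic configuration update. This is a minimal sketch, not our exact tooling: the helper name `build_dynamic_config` and the starting `maxSegmentsToMove` value are assumptions made for illustration, and the script only builds and prints the payload that would be POSTed to the coordinator's dynamic config endpoint rather than sending it.

```python
import json

# Coordinator dynamic configuration endpoint, as described in the Druid API docs.
DYNAMIC_CONFIG_PATH = "/druid/coordinator/v1/config"


def build_dynamic_config(current: dict, percent_to_consider: int) -> dict:
    """Return a copy of the dynamic config with the balancing percentage changed.

    Hypothetical helper for illustration; it is not part of Druid.
    """
    if not 0 < percent_to_consider <= 100:
        raise ValueError("percent_to_consider must be in the range (0, 100]")
    updated = dict(current)  # shallow copy so the caller's config is left untouched
    updated["percentOfSegmentsToConsiderPerMove"] = percent_to_consider
    return updated


# Assumed starting values; only the 70 -> 10 change comes from the report above.
current_config = {"percentOfSegmentsToConsiderPerMove": 70, "maxSegmentsToMove": 100}
payload = build_dynamic_config(current_config, 10)

# Print the request that would be sent instead of calling a live cluster.
print("POST", DYNAMIC_CONFIG_PATH)
print(json.dumps(payload, indent=2))
```

Copying the current config before changing one field matters because the dynamic config is posted as a whole document, so sending only the changed key could reset the other settings to their defaults.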
   
   Some of the optimizations in 25.0.0, such as the batch segment allocation 
parameter, should help with this, but I'm wondering whether there are 
strategies for releases prior to 25.0.0 that can improve the segment handoff 
time on the coordinator.  
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

