jasonk000 commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1003452368


   @harinirajendran I recommend you try running the overlord with these 
three PRs applied to your build: #12096, #12097, and #12099, and let us know 
how it goes.
   
   I reviewed your analysis and some of the code, and took a profile on our 
cluster here.
   
   You are correct that during task rollover the overlord gets busy processing 
`RunNotice` notices. On our system, with the above fixes applied, TaskQueue 
(in purple) accounts for only a fraction of the time:
   
![image](https://user-images.githubusercontent.com/3196528/147839068-c1d37146-9f91-44bf-9f7e-ba9f7ba3758b.png)
   
   I can identify two code paths where `RunNotice` hits the TaskQueue:
   
   - `SeekableStreamSupervisor.RunNotice::handle -> 
SeekableStreamSupervisor::runInternal -> 
SeekableStreamSupervisor::createNewTasks -> 
SeekableStreamSupervisor::createTasksForGroup -> TaskQueue::add`
   - `SeekableStreamSupervisor.RunNotice::handle -> 
SeekableStreamSupervisor::runInternal -> 
SeekableStreamSupervisor::checkPendingCompletionTasks -> 
SeekableStreamSupervisor::killTasksInGroup -> 
SeekableStreamSupervisor::killTask -> TaskQueue::shutdown`
   
   Both of these paths take the lock in TaskQueue; the PRs listed above 
have improved the scalability of TaskQueue on our system.
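   
   To illustrate why a single lock matters here, below is a minimal sketch 
of a coarse-locked queue. It is not Druid's `TaskQueue`: the class 
`CoarseLockQueueSketch`, its methods, and the sleep timings are all 
hypothetical, chosen only to show how an `add` call (like the first path) 
and a `shutdown` call (like the second path) stall whenever any other 
thread holds the same lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a coarse-locked task queue; NOT Druid's TaskQueue.
public class CoarseLockQueueSketch {
    // One lock guards every operation, so all callers serialize on it.
    private final ReentrantLock giant = new ReentrantLock();
    private final List<String> tasks = new ArrayList<>();

    // Analogous to the path ending in TaskQueue::add.
    public void add(String taskId) {
        giant.lock();
        try {
            tasks.add(taskId);
        } finally {
            giant.unlock();
        }
    }

    // Analogous to the path ending in TaskQueue::shutdown.
    // Returns true if the task was present and removed.
    public boolean shutdown(String taskId) {
        giant.lock();
        try {
            return tasks.remove(taskId);
        } finally {
            giant.unlock();
        }
    }

    public int size() {
        giant.lock();
        try {
            return tasks.size();
        } finally {
            giant.unlock();
        }
    }

    // Simulates any slow operation performed while holding the lock.
    public void slowWorkUnderLock(long millis) throws InterruptedException {
        giant.lock();
        try {
            Thread.sleep(millis);
        } finally {
            giant.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        CoarseLockQueueSketch q = new CoarseLockQueueSketch();
        Thread holder = new Thread(() -> {
            try {
                q.slowWorkUnderLock(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        holder.start();
        Thread.sleep(50); // give the holder time to acquire the lock

        long start = System.nanoTime();
        q.add("task-1"); // blocks until the holder releases the lock
        long waitedMs = (System.nanoTime() - start) / 1_000_000;
        holder.join();

        System.out.println("add waited ~" + waitedMs + " ms, size=" + q.size());
    }
}
```

   In this sketch, the time `add` spends waiting is time the calling thread 
(in the overlord's case, the one handling notices) cannot spend on anything 
else, which is why shortening or narrowing the critical section can help.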
   
   It would also help if you could share which metadata task storage engine 
you are using (SQL vs. heap).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


