jasonk000 commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1041876818


   @harinirajendran I went looking for this, and I agree with your analysis. 
Stack traces showed that the majority of wall-clock time in the KafkaSupervisor 
thread was spent waiting on SQL queries executed as part of the RunNotice. I 
backported the changes in https://github.com/apache/druid/pull/12018 to our 
environment and they worked perfectly. A class histogram showed ~500 
CheckpointNotice tasks sitting idle and ~2500 RunNotice tasks.
   
   There are two ways to confirm this is happening whenever you see a slow 
checkpoint. Replace `$pid` and `$supervisorname` as appropriate.
   
   1. Check the number of instances of the class 
`SeekableStreamSupervisor$RunNotice`:
   ```
   jcmd $pid GC.class_histogram | grep SeekableStreamSupervisor
   ```
   2. Look for the supervisor thread performing RunNotice calls:
   ```
   jstack $pid | grep -A60 KafkaSupervisor-$supervisorname\"
   ```
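   
   If you want just the numbers from the first check, something like the 
following might help. It is only a rough sketch: `notice_counts` is a name I 
made up, and it assumes the usual histogram column layout (num, #instances, 
#bytes, class name) and that the notice classes are inner classes of 
`SeekableStreamSupervisor` with names ending in `Notice`.
   ```shell
   # Hypothetical helper (not part of Druid or the JDK): reads the output of
   # `jcmd <pid> GC.class_histogram` on stdin and prints "<NoticeClass> <count>"
   # for each SeekableStreamSupervisor$...Notice inner class it finds.
   # Assumes columns are: num, #instances, #bytes, class name.
   notice_counts() {
     awk '$4 ~ /SeekableStreamSupervisor\$[A-Za-z]*Notice/ {
            name = $4
            sub(/.*\$/, "", name)   # keep only the inner class name
            print name, $2          # $2 is the #instances column
          }'
   }

   # Usage (replace $pid as above):
   #   jcmd "$pid" GC.class_histogram | notice_counts
   ```
   A RunNotice count in the thousands, like the ~2500 I saw, would point to 
the same backlog.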
   
   Thank you, @gianm. Your solution was simple and worked perfectly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


