wuguowei1994 commented on PR #18750:
URL: https://github.com/apache/druid/pull/18750#issuecomment-3615163249

   
   > In fact, the negative lag could even be a feature to identify if some 
tasks are particularly slow in returning their offsets. 😛 , and we could 
probably have alerts set up if the negative lag goes below a specific threshold.
   > 
   
   @kfaraz 
   Thanks for the clarification — that makes sense. In our case, though, we’ve 
noticed that negative lag in our large cluster can sometimes persist for over 
five minutes.
   
   We’ve talked about this internally, and if the negative lag is only brief (for 
example, under a minute), adjusting the alert thresholds would absolutely work 
for us. But when it lasts longer, it tends to indicate something worth 
investigating.
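   
   To make that distinction concrete, here is a rough sketch of the kind of 
duration-based check we have in mind. It is illustrative only: the function 
name `persistent_negative_lag`, the `(timestamp, lag)` sample format, and the 
60-second default are assumptions for this example, not something Druid 
provides today.

```python
from typing import Iterable, List, Tuple

# Hypothetical sample format: (unix timestamp in seconds, reported lag).
Sample = Tuple[float, int]


def persistent_negative_lag(
    samples: Iterable[Sample], max_negative_seconds: float = 60.0
) -> List[Tuple[float, float]]:
    """Return (start, end) spans where lag stayed negative longer than the limit.

    Samples must be sorted by timestamp. Short negative blips are ignored.
    """
    spans: List[Tuple[float, float]] = []
    start = None  # timestamp of the first negative sample in the current run
    last = None   # timestamp of the latest negative sample in the current run
    for ts, lag in samples:
        if lag < 0:
            if start is None:
                start = ts
            last = ts
        else:
            # A non-negative sample ends the run; keep it only if it was long.
            if start is not None and last - start > max_negative_seconds:
                spans.append((start, last))
            start = None
            last = None
    # Close a run that is still open at the end of the data.
    if start is not None and last - start > max_negative_seconds:
        spans.append((start, last))
    return spans
```

   With something like this, a single negative sample would be ignored, while a 
run of negative samples spanning more than a minute would be reported for 
follow-up.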
   
   **We’ve also seen a few situations where negative lag actually pointed to 
issues in the upstream Kafka cluster, so that’s part of why we’re a bit 
cautious here. If we keep the current Druid behavior and treat negative lag as 
normal consumption, there’s a chance we might overlook real problems.**
   
   So overall, having clear and reliable metrics to signal the health of the 
cluster would be really helpful for us.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

