wuguowei1994 commented on PR #18750: URL: https://github.com/apache/druid/pull/18750#issuecomment-3615163249
> In fact, the negative lag could even be a feature to identify if some tasks are particularly slow in returning their offsets 😛, and we could probably have alerts set up if the negative lag goes below a specific threshold.

@kfaraz Thanks for the clarification, that makes sense.

In our case, though, we’ve noticed that negative lag in our large cluster can sometimes persist for over five minutes. We’ve talked about this internally, and if it only happens occasionally (for example, under a minute), adjusting the alert thresholds would absolutely work for us. But when it lasts longer, it tends to indicate something worth investigating.

**We’ve also seen a few situations where negative lag actually pointed to issues in the upstream Kafka cluster, so that’s part of why we’re a bit cautious here. If we keep the current Druid behavior and treat negative lag as normal consumption, there’s a chance we might overlook real problems.**

So overall, having clear and reliable metrics to signal the health of the cluster would be really helpful for us.
