abhishekagarwal87 commented on issue #18090: URL: https://github.com/apache/druid/issues/18090#issuecomment-2975354502
So there are two paths that get impacted:

- Query path: when processing on the pod stops abruptly, do the queries for this pod fail, or do they get stuck?
- Ingestion path: the lag increase in this case is probably a false alarm. The lag may increase on that pod, but it will stay under control on the replica pod. My guess is that we end up reporting higher lag even if the replica is processing fast enough.

Right now, by default, the broker distributes requests in equal proportion across the data nodes. There is a setting that changes this so that the broker picks the data node with the fewest in-flight requests. Though probably the right way to deal with this is for the broker to detect a high failure rate on a replica and blacklist it for some time. That solves the problem for general failure scenarios.

For this particular planned activity, where the node is marked not ready, we could probably do something that triggers the graceful termination code in the pod. I am not sure how, though. It really depends on what control k8s offers.
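To make the blacklisting idea a bit more concrete, here is a rough sketch in Python of what such a selection policy could look like. This is not Druid's broker code: `ReplicaSelector`, its parameters, and the thresholds are all hypothetical names and values chosen for illustration. It combines the two ideas above: pick the replica with the fewest in-flight requests, and skip a replica for a while once its recent failure rate crosses a threshold.

```python
import time
from collections import deque


class ReplicaSelector:
    """Hypothetical broker-side selector; NOT Druid's actual implementation."""

    def __init__(self, servers, window=20, max_failure_rate=0.5,
                 blacklist_seconds=30.0, clock=time.monotonic):
        self.clock = clock
        self.window = window                        # outcomes remembered per server
        self.max_failure_rate = max_failure_rate    # assumed threshold
        self.blacklist_seconds = blacklist_seconds  # assumed cool-off period
        self.in_flight = {s: 0 for s in servers}
        # Recent outcomes per server: True = success, False = failure.
        self.outcomes = {s: deque(maxlen=window) for s in servers}
        self.blacklisted_until = {s: 0.0 for s in servers}

    def pick(self):
        """Return the non-blacklisted server with the fewest in-flight requests."""
        now = self.clock()
        healthy = [s for s in self.in_flight if self.blacklisted_until[s] <= now]
        # If every replica is blacklisted, fall back to all of them
        # rather than failing the query outright.
        candidates = healthy or list(self.in_flight)
        server = min(candidates, key=lambda s: self.in_flight[s])
        self.in_flight[server] += 1
        return server

    def report(self, server, ok):
        """Record the outcome of a request previously handed out by pick()."""
        self.in_flight[server] -= 1
        history = self.outcomes[server]
        history.append(ok)
        # Only judge a server once a full window of outcomes is available.
        if len(history) == self.window:
            failure_rate = history.count(False) / len(history)
            if failure_rate >= self.max_failure_rate:
                self.blacklisted_until[server] = self.clock() + self.blacklist_seconds
                history.clear()  # start fresh after the cool-off
```

A caller would invoke `pick()` before sending a request and `report(server, ok)` once it completes or fails. Falling back to all replicas when every one is blacklisted is a design choice in this sketch, so that a cluster-wide blip does not turn into a total outage.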
