Re: [I] Major Service degradation despite replication when the pod is terminated abruptly due to node not ready issues (druid)

via GitHub Mon, 16 Jun 2025 00:12:38 -0700


abhishekagarwal87 commented on issue #18090:
URL: https://github.com/apache/druid/issues/18090#issuecomment-2975354502


   There are two paths that get impacted:
   Query path - when processing on the pod stops abruptly, do the queries 
routed to this pod fail, or do they get stuck? 
   Ingestion path - the lag increase in this case is probably a false alarm. 
The lag may increase on that pod, but it will stay under control on the 
replica pod. My guess is that we end up reporting higher lag even if the 
replica is processing fast enough. 
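
   To make the ingestion guess concrete: if lag is tracked per replica and the 
reported number follows the slowest one, a single stalled pod raises the alarm 
even while its replica keeps up. The sketch below is only an illustration with 
made-up pod names and numbers, not Druid's actual lag computation. It assumes a 
hypothetical helper, `effective_lag`, that reports each partition's lag from 
the most caught-up replica.

```python
def effective_lag(replica_lags):
    """Per-partition lag, taken from the most caught-up replica.

    replica_lags maps a replica id to a dict of partition -> lag (hypothetical shape).
    """
    partitions = set()
    for lags in replica_lags.values():
        partitions.update(lags)
    return {
        p: min(lags[p] for lags in replica_lags.values() if p in lags)
        for p in sorted(partitions)
    }


# Made-up numbers: pod-a has stalled, pod-b is keeping up.
lags = {
    "pod-a": {0: 50_000, 1: 42_000},
    "pod-b": {0: 120, 1: 95},
}
worst_case = {p: max(l[p] for l in lags.values()) for p in (0, 1)}
best_case = effective_lag(lags)
```

   Alerting on `worst_case` would fire here, while `best_case` reflects that 
the data is still being consumed by the healthy replica.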
   
   Right now, by default, the broker distributes requests in equal proportion 
across the data nodes. There is a setting that changes this so that the 
broker picks the data node with the fewest in-flight requests. 
Though, probably the right way to deal with this is for the broker to detect a 
high failure rate on a replica and blacklist it for some time. That solves 
the problem for general failure scenarios. For this particular planned activity, 
when the node is marked not ready, we could probably do something that triggers 
the graceful termination code in the pod. 
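
   The two ideas above (prefer the node with the fewest in-flight requests, 
and temporarily blacklist a replica with a high failure rate) can be sketched 
together. The setting mentioned is presumably `druid.broker.balancer.type` set 
to `connectionCount` instead of the default `random`; that is my reading and 
worth checking against the configuration reference. The class below is a 
hypothetical illustration, not broker code: `ReplicaSelector`, its window size, 
threshold, and cooldown are all assumptions.

```python
import time
from collections import deque


class ReplicaSelector:
    """Pick the least-loaded server, skipping recently failing ones (sketch)."""

    def __init__(self, servers, window=20, max_failure_rate=0.5,
                 cooldown_s=30.0, clock=time.monotonic):
        self.clock = clock
        self.max_failure_rate = max_failure_rate
        self.cooldown_s = cooldown_s
        self.in_flight = {s: 0 for s in servers}
        # Sliding window of recent outcomes per server (True = success).
        self.outcomes = {s: deque(maxlen=window) for s in servers}
        self.blocked_until = {s: float("-inf") for s in servers}

    def pick(self):
        now = self.clock()
        healthy = [s for s in self.in_flight if self.blocked_until[s] <= now]
        # If every server is blacklisted, fall back to all of them.
        candidates = healthy or list(self.in_flight)
        server = min(candidates, key=lambda s: self.in_flight[s])
        self.in_flight[server] += 1
        return server

    def report(self, server, ok):
        """Record the outcome of a request previously handed out by pick()."""
        self.in_flight[server] -= 1
        results = self.outcomes[server]
        results.append(ok)
        window_full = len(results) == results.maxlen
        failure_rate = results.count(False) / len(results)
        if window_full and failure_rate >= self.max_failure_rate:
            # Blacklist for a cooldown period, then start with a clean window.
            self.blocked_until[server] = self.clock() + self.cooldown_s
            results.clear()
```

   Waiting for a full window before blacklisting avoids ejecting a replica 
over a single failed query; the cooldown lets it rejoin once the pod has 
recovered or been replaced.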
   
   I am not sure how, though. It really depends on what controls k8s offers. 
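
   For reference, when Kubernetes terminates a pod normally it sends SIGTERM 
to the container and waits for the termination grace period before killing it. 
Whether that signal can be delivered at all when the node is not ready is the 
open question here. Assuming it can, the process-side half might look roughly 
like the sketch below. `Drainer` and `install` are hypothetical names, not 
anything that exists in Druid.

```python
import signal
import threading


class Drainer:
    """Stop accepting new work on SIGTERM and wait for in-flight work (sketch)."""

    def __init__(self):
        self.draining = threading.Event()
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0

    def try_begin(self):
        """Return True if a new request may start, False once draining."""
        with self._lock:
            if self.draining.is_set():
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Mark one request finished and wake anyone waiting for idle."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def on_sigterm(self, signum=None, frame=None):
        # Only flips a flag; the actual waiting happens in wait_idle().
        self.draining.set()

    def wait_idle(self, timeout):
        """Block until no requests are in flight; False if the timeout expires."""
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)


def install(drainer):
    # Must be called from the main thread, as required by signal.signal().
    signal.signal(signal.SIGTERM, drainer.on_sigterm)
```

   The timeout passed to `wait_idle` would need to stay below the pod's 
termination grace period, otherwise the process is killed mid-drain anyway.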
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


