[I] Major Service degradation despite replication when the pod is terminated abruptly due to node not ready issues (druid)

via GitHub Fri, 06 Jun 2025 12:44:05 -0700


rbankar7 opened a new issue, #18090:
URL: https://github.com/apache/druid/issues/18090


   ## Affected versions
   - All users of the Kubernetes extension are probably affected by this
   
   ## Description
   - In the Druid Kubernetes extension, data nodes have an announcement 
lifecycle stage that is executed before termination.
   - This stage unannounces the pod while it is in the Terminating state.
   - As a result, the other Druid master nodes and brokers become aware of 
the termination and stop assigning tasks/segments or routing queries to that 
pod.
   - With a node-not-ready issue, processing on the pod stops abruptly, so no 
"unannouncement" is made to the other nodes.
   - Master nodes and brokers/routers continue to detect the pod. This results 
in high query latency and a monotonically increasing loadQueue count or, for 
indexers, monotonically increasing ingestion lag.
   - This continues until the issue is fixed (for indexers) or the retention 
period has passed (for historicals).
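
The difference between the two termination paths above can be shown with a toy model. This is only a sketch under stated assumptions: `DiscoverySketch` and its methods are hypothetical names invented for illustration and are not classes or APIs from Druid or its Kubernetes extension. The set stands in for the view of data nodes that master nodes and brokers build from announcements.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical names, not Druid classes.
final class DiscoverySketch {
    // Stand-in for the set of pods that masters and brokers believe are alive.
    private final Set<String> announced = new TreeSet<>();

    void announce(String pod) {
        announced.add(pod);
    }

    // Graceful termination: the lifecycle stop stage runs and unannounces the pod.
    void terminateGracefully(String pod) {
        announced.remove(pod);
    }

    // Node not ready: the process stops before any stop stage can run,
    // so nothing removes the announcement.
    void failAbruptly(String pod) {
        // intentionally no unannouncement
    }

    // Pods that would still receive segments, tasks, or queries.
    Set<String> routablePods() {
        return new TreeSet<>(announced);
    }

    public static void main(String[] args) {
        DiscoverySketch view = new DiscoverySketch();
        view.announce("historical-0");
        view.announce("historical-1");
        view.terminateGracefully("historical-0");
        view.failAbruptly("historical-1");
        // historical-1 is gone but is still listed as routable.
        System.out.println(view.routablePods()); // prints [historical-1]
    }
}
```

In this model the abruptly failed pod stays in the routable set indefinitely, which corresponds to the growing loadQueue count and ingestion lag described above.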
   
   ## Reproducing the issue
   - On a test cluster, we deployed Druid with the final unannouncement of the 
node deliberately disabled.
   - We also added a sleep after stop() and increased the 
gracefulTerminationPeriod, which kept the pod in the Terminating state for a 
long time.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


