gianm opened a new issue #5709: Broker resiliency to misbehaving historical 
nodes
URL: https://github.com/apache/incubator-druid/issues/5709
 
 
   Sometimes we see 'zombie' nodes that are nominally responsive but have
underlying problems. This can be due to bad disks, bad configuration, or
any number of other causes. Due to the vicissitudes of life, we cannot
necessarily predict all of these in advance. So two things would be useful as
general mitigations:
   
   1. An ability for the broker to retry failed queries on a different data
node, on the grounds that perhaps another node will succeed.
   2. An ability for the broker to blacklist data nodes that fail too often 
relative to other nodes.
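
To make (1) concrete, here is a rough sketch of a bounded retry across nodes. Everything in it is an assumption for illustration: `query_with_retry`, `QueryFailed`, the `send` callback, and the `max_attempts` default are hypothetical names, not Druid APIs, and it assumes the broker already knows which nodes can serve the same data.

```python
from typing import Callable, Sequence


class QueryFailed(Exception):
    """Raised by the hypothetical send() when a data node fails a query."""


def query_with_retry(
    replicas: Sequence[str],
    send: Callable[[str], str],
    max_attempts: int = 2,  # assumed small bound so a doomed query is not retried everywhere
) -> str:
    """Try the query on up to max_attempts distinct nodes, in order."""
    last_error = None
    for node in list(replicas)[:max_attempts]:
        try:
            return send(node)
        except QueryFailed as e:
            last_error = e  # this node failed; move on to the next candidate
    raise QueryFailed(f"all attempts failed: {last_error}")


def flaky_send(node: str) -> str:
    # Stand-in for a real RPC: the first node is the 'zombie'.
    if node == "historical-1":
        raise QueryFailed("disk error")
    return f"result from {node}"


print(query_with_retry(["historical-1", "historical-2"], flaky_send))
# prints "result from historical-2"
```

Each node is tried at most once, so the extra work per query is bounded by `max_attempts`.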
   
   You want (1) to not be too aggressive -- it could lead to doing too much 
work on a query that is doomed to failure anyway (maybe something's wrong with 
the query). You also want (2) to not be too aggressive -- it's senseless to 
blacklist half the cluster, for example.
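
One possible way to keep (2) from being too aggressive is to compare each node's failure rate against the cluster median and cap how many nodes can be blacklisted at once. This is only a sketch under assumed numbers: `pick_blacklist`, the 3x ratio, the 5% floor, and the 25% cap are all made up for illustration.

```python
from statistics import median
from typing import Dict, List, Tuple


def pick_blacklist(
    stats: Dict[str, Tuple[int, int]],  # node -> (failures, total queries)
    ratio: float = 3.0,                 # assumed: "too often" means 3x the median failure rate
    min_rate: float = 0.05,             # assumed floor so a healthy cluster blacklists nobody
    max_fraction: float = 0.25,         # assumed cap on the share of nodes blacklisted
) -> List[str]:
    """Return the worst nodes relative to their peers, never more than the cap."""
    rates = {n: f / t for n, (f, t) in stats.items() if t > 0}
    if not rates:
        return []
    threshold = max(min_rate, ratio * median(rates.values()))
    # Worst offenders first, so the cap keeps the nodes that fail most.
    suspects = sorted(
        (n for n, r in rates.items() if r > threshold),
        key=lambda n: rates[n],
        reverse=True,
    )
    cap = int(len(stats) * max_fraction)
    return suspects[:cap]


# Nine healthy nodes and one zombie.
stats = {f"h{i}": (1, 100) for i in range(10)}
stats["h9"] = (50, 100)
print(pick_blacklist(stats))  # prints ['h9']
```

Because the threshold is relative to the median, a failure that hits most of the cluster (which more likely points at the query than at the nodes) raises the threshold instead of blacklisting everyone.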
   
   You also want the list from (2) to be exposed via API somehow, since folks 
might want to build automation that takes those nodes out of service, raises 
alerts about them, replaces them automatically, etc.
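
As a sketch of what exposing that list could look like, the broker might serialize it as JSON for automation to poll. The function name and the field names (`blacklistedNodes`, `host`, `failureRate`) are hypothetical; no such Druid endpoint or payload is implied.

```python
import json
from typing import Dict, List


def blacklist_payload(blacklisted: List[str], failure_rates: Dict[str, float]) -> str:
    """Serialize the blacklist as JSON, one entry per node, sorted by host."""
    body = {
        "blacklistedNodes": [
            # failureRate is null when no rate is known for the node.
            {"host": n, "failureRate": failure_rates.get(n)}
            for n in sorted(blacklisted)
        ]
    }
    return json.dumps(body, sort_keys=True)


print(blacklist_payload(["historical-7"], {"historical-7": 0.5}))
```

Including the failure rate alongside each host gives alerting or replacement tooling something to threshold on, rather than a bare list of names.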

