gianm opened a new issue #5709: Broker resiliency to misbehaving historical nodes
URL: https://github.com/apache/incubator-druid/issues/5709

Sometimes we see 'zombie' nodes that are nominally responsive but are having underlying problems. This can be due to bad disks, bad configuration, or any number of other causes. Due to the vicissitudes of life, we cannot necessarily predict all of these in advance.

So two things would be useful as general mitigations:

1. An ability for the broker to retry queries to data nodes that fail, on the grounds that perhaps another node will succeed.
2. An ability for the broker to blacklist data nodes that fail too often relative to other nodes.

You want (1) to not be too aggressive -- it could lead to doing too much work on a query that is doomed to failure anyway (maybe something's wrong with the query). You also want (2) to not be too aggressive -- it's senseless to blacklist half the cluster, for example.

You also want the list from (2) to be exposed via API somehow, since folks might want to build automation that takes those nodes out of service, raises alerts about them, replaces them automatically, etc.
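To make the two mitigations concrete, here is a minimal sketch in Python. It is illustrative only and not Druid code: `NodeHealthTracker`, `query_with_retry`, and every parameter (`min_samples`, `ratio`, `floor`, `max_blacklist_fraction`, `max_attempts`) are hypothetical names chosen for this example. It assumes that "fails too often relative to other nodes" means a failure rate well above the cluster median, and it caps both the number of retries and the fraction of the cluster that can be blacklisted.

```python
import statistics


class NodeHealthTracker:
    """Hypothetical per-node failure tracker (not part of Druid)."""

    def __init__(self, min_samples=5, ratio=3.0, floor=0.2,
                 max_blacklist_fraction=0.25):
        self.min_samples = min_samples    # ignore nodes with too few queries
        self.ratio = ratio                # multiple of the median failure rate
        self.floor = floor                # minimum failure rate to blacklist
        self.max_blacklist_fraction = max_blacklist_fraction
        self.stats = {}                   # node -> [failures, total]

    def record(self, node, ok):
        entry = self.stats.setdefault(node, [0, 0])
        entry[1] += 1
        if not ok:
            entry[0] += 1

    def failure_rate(self, node):
        failures, total = self.stats.get(node, (0, 0))
        return failures / total if total else 0.0

    def blacklist(self):
        """Nodes failing far more often than the median node, worst first."""
        eligible = [n for n, (_, total) in self.stats.items()
                    if total >= self.min_samples]
        if not eligible:
            return []
        median = statistics.median(self.failure_rate(n) for n in eligible)
        threshold = max(median * self.ratio, self.floor)
        bad = sorted((n for n in eligible if self.failure_rate(n) > threshold),
                     key=self.failure_rate, reverse=True)
        # Never blacklist more than a fixed fraction of the known nodes.
        cap = int(self.max_blacklist_fraction * len(self.stats))
        return bad[:cap]


def query_with_retry(replicas, run_query, tracker, max_attempts=2):
    """Try at most max_attempts replicas, preferring non-blacklisted ones."""
    blacklisted = set(tracker.blacklist())
    # Fall back to all replicas if every one of them is blacklisted.
    candidates = [n for n in replicas if n not in blacklisted] or list(replicas)
    last_error = None
    for node in candidates[:max_attempts]:
        try:
            result = run_query(node)
        except Exception as exc:
            tracker.record(node, ok=False)
            last_error = exc
        else:
            tracker.record(node, ok=True)
            return result
    raise RuntimeError("query failed on all attempted nodes") from last_error
```

The value returned by `blacklist()` is what an API endpoint could expose for alerting or automated replacement. Comparing against the median means that a query which fails everywhere (for example, because the query itself is broken) raises every node's failure rate together and blacklists no one.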
