peferron commented on issue #5709: Broker resiliency to misbehaving historical nodes
URL: https://github.com/apache/incubator-druid/issues/5709#issuecomment-414035827

I'm thinking about working on this. Some help scoping the first PR would be welcome. Keeping it small would be preferred since I haven't contributed to Druid before.

Basic idea: make brokers cycle through the list of available servers when retrying. Should support historical replicas, KIS task replicas, etc.

Out of scope:

- Making brokers report per-server failure metrics via API or emitter. We'll need that to build automation that can identify failing servers and trigger recovery steps, but it looks like it could be done in a separate PR.
- Sharing a blacklist across queries within a broker. It doesn't seem absolutely necessary, and it also looks like it could be done in a separate PR, perhaps by adding a new `ServerSelectorStrategy` that takes failure metrics into account when sorting the list.
- Responding with partial results. The goal here is to increase resiliency while still returning full results.

If the basic idea looks good, I'll come back with a more detailed proposal about the changes to make, and hopefully get some feedback on that as well before working on the code and submitting the PR.
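
To make the basic idea concrete, here is a rough sketch of cycling through candidate servers on retry. It is only an illustration under assumptions: `ReplicaFailover`, `queryWithFailover`, the `candidates` list and the `query` callback are hypothetical names, not existing Druid classes or methods, and the real change would have to fit the broker's actual query path.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the Druid codebase.
public class ReplicaFailover {

    /**
     * Tries candidate servers in order until one of them answers.
     *
     * candidates  stands in for the servers (historical replicas, task
     *             replicas, ...) that can serve the same data
     * query       stands in for the remote call; it throws on failure
     * maxAttempts upper bound on the total number of calls made
     */
    public static <T> T queryWithFailover(List<String> candidates,
                                          Function<String, T> query,
                                          int maxAttempts) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate servers");
        }
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Attempt N goes to server N modulo the list size, so a retry
            // moves on to the next server instead of hitting the same one.
            String server = candidates.get(attempt % candidates.size());
            try {
                return query.apply(server);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next server
            }
        }
        throw new IllegalStateException(
                "all " + maxAttempts + " attempts failed", last);
    }
}
```

The state here lives only for the duration of one call, which matches the scope above: no blacklist shared across queries, no failure metrics, and either a full result or an error, never a partial result.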
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
