peferron commented on issue #5709: Broker resiliency to misbehaving historical nodes
URL: https://github.com/apache/incubator-druid/issues/5709#issuecomment-414035827
 
 
   I'm thinking about working on this and would welcome some help scoping the 
first PR. I'd prefer to keep it small, since I haven't contributed to Druid 
before.
   
   Basic idea: make brokers cycle through the list of available servers when 
retrying. This should support historical replicas, KIS task replicas, etc.
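
To make the basic idea concrete, here is a minimal sketch of retrying against successive replicas. It is not Druid code: the class `ReplicaRetry`, the method `queryWithFailover`, and the server names are hypothetical, and a real broker would work with Druid's own server and query abstractions rather than plain strings and functions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; none of these names exist in Druid.
final class ReplicaRetry {

    /**
     * Tries each replica in list order and returns the first successful result.
     * A replica that throws is skipped for the rest of this call only; nothing
     * is remembered across calls.
     */
    static <S, R> R queryWithFailover(List<S> replicas, Function<S, R> query) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("no replicas available");
        }
        RuntimeException lastFailure = null;
        for (S server : replicas) {
            try {
                return query.apply(server);
            } catch (RuntimeException e) {
                // Remember the failure and move on to the next replica.
                lastFailure = e;
            }
        }
        throw new IllegalStateException(
            "all " + replicas.size() + " replicas failed", lastFailure);
    }

    public static void main(String[] args) {
        List<String> replicas =
            Arrays.asList("historical-1", "historical-2", "historical-3");
        // Simulate a misbehaving first replica.
        String result = queryWithFailover(replicas, server -> {
            if (server.equals("historical-1")) {
                throw new RuntimeException("connection refused");
            }
            return "rows from " + server;
        });
        System.out.println(result);
    }
}
```

In this sketch the query only fails once every replica has been tried, which matches the goal of returning full results rather than partial ones.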
   
   Out of scope:
   
   - Making brokers report per-server failure metrics via API or emitter. We'll 
need that to build automation that can identify failing servers and trigger 
recovery steps, but it looks like it could be done in a separate PR.
   - Sharing a blacklist across queries within a broker. It doesn't seem 
absolutely necessary, and it also looks like it could be done in a separate PR, 
perhaps by adding a new `ServerSelectorStrategy` that takes failure metrics 
into account when sorting the list.
   - Responding with partial results. The goal here is to increase resiliency 
while still returning full results.
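
For the second out-of-scope item, a failure-aware ordering could look roughly like the sketch below. It deliberately does not implement Druid's `ServerSelectorStrategy` interface, whose exact signature is not given here; `FailureAwareOrdering` and its methods are hypothetical names, and servers are represented as plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not an implementation of ServerSelectorStrategy.
final class FailureAwareOrdering {

    // Failure counts shared across queries within one broker process.
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    /** Records one failed request against the given server. */
    void recordFailure(String server) {
        failures.computeIfAbsent(server, s -> new AtomicInteger()).incrementAndGet();
    }

    /** Returns how many failures have been recorded for the given server. */
    int failureCount(String server) {
        AtomicInteger count = failures.get(server);
        return count == null ? 0 : count.get();
    }

    /**
     * Returns a copy of the candidates sorted by ascending failure count.
     * List.sort is stable, so servers with equal counts keep their input order.
     */
    List<String> order(List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(this::failureCount));
        return sorted;
    }
}
```

A real version would probably also need to decay or reset the counts over time so that a recovered server is not penalized indefinitely; that policy is left out of this sketch.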
   
   If the basic idea looks good, I'll come back with a more detailed proposal 
about the changes to make, and hopefully get some feedback on that as well 
before working on the code and submitting the PR.
   

