peferron commented on issue #5709: Broker resiliency to misbehaving historical nodes
URL: https://github.com/apache/incubator-druid/issues/5709#issuecomment-414190292

Thanks for the feedback @gianm.

Don't these issues already exist today when `druid.broker.retryPolicy.numTries > 1`? The idea would be to respect this parameter, but in a smarter way, by trying a different replica (if available) on each try.

You're right that it could create more load in some situations, though. For example, a query that would have hit the same bad historical twice and failed immediately (in a scenario where failures are immediate) may now hit a good historical on the 2nd try and return successfully, at the expense of increased cluster load. If you set `numTries > 1` that's probably what you want, but we could put this new behavior behind an opt-in configuration parameter to be safe.

I'm not sure how the partially processed results that you described are handled today when `druid.broker.retryPolicy.numTries > 1`. I need to spend a bit more time with the code.

BTW, looking at `RetryQueryRunner`, it seems like we're trying `numTries + 1` times total (one initial try + `numTries` retries), contrary to what the docs and the name of the parameter indicate. If there's indeed a bug there, then everyone may already have retries enabled by default. I need to run a few unit tests to make sure I'm not missing anything there, though.
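
To make the proposal a bit more concrete, here is a rough Python sketch of the behavior described above. It is not Druid code and does not reflect how `RetryQueryRunner` is implemented: `query_with_replica_retry`, `run_query`, `QueryFailed`, and the replica names are all hypothetical, and the round-robin replica choice is only an assumption for illustration. It shows two things: each try goes to a different replica when one is available, and `num_tries` is treated as the total number of attempts rather than the number of retries after an initial try.

```python
from typing import Callable, List, Sequence


class QueryFailed(Exception):
    """Raised when every attempt has failed (hypothetical error type)."""


def query_with_replica_retry(
    replicas: Sequence[str],
    run_query: Callable[[str], str],
    num_tries: int,
) -> str:
    """Run a query at most `num_tries` times in total, rotating replicas.

    `replicas` lists the nodes that serve the same segment, and
    `run_query` is a stand-in for sending the query to one of them.
    """
    if num_tries < 1:
        raise ValueError("num_tries must be at least 1")
    if not replicas:
        raise ValueError("at least one replica is required")

    errors: List[Exception] = []
    # range(num_tries) gives exactly num_tries attempts in total,
    # not one initial try plus num_tries retries.
    for attempt in range(num_tries):
        # Move on to the next replica on each try; wrap around if
        # there are fewer replicas than tries.
        replica = replicas[attempt % len(replicas)]
        try:
            return run_query(replica)
        except Exception as exc:  # illustration only: treat any error as retryable
            errors.append(exc)
    raise QueryFailed(f"all {num_tries} attempts failed: {errors}")
```

With `num_tries=2` and replicas `["bad", "good"]`, the first try hits the bad node and the second try succeeds on the good one, which is the "more load, but a successful result" trade-off mentioned above. With a single replica, the sketch falls back to retrying the same node.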
