gianm commented on issue #5709: Broker resiliency to misbehaving historical nodes
URL: https://github.com/apache/incubator-druid/issues/5709#issuecomment-414175186

Hi @peferron,

That scope sounds useful for an initial patch. I think the biggest risk is that queries that are doomed to fail, possibly because they exceed resource limits, will be retried too often and double or triple the load on the cluster (depending on how many retries are allowed). Some suggestions to mitigate that:

- Check the error code (if there is one) and don't retry on codes like RESOURCE_LIMIT_EXCEEDED, UNAUTHORIZED, or QUERY_TIMEOUT. (The last one because the overall query timeout has probably passed by then anyway.)
- Don't retry more than X subqueries per query.

Another thing to think about is that results can be partially retrieved (and partially processed) before the query fails midway through. In that case it's probably not possible to recover by retrying only the failed subquery, since subquery results have already been mixed into the overall query results. The query may need to be retried from scratch.
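
To make the suggestions above concrete, here is a minimal sketch of how the retry decision might look. It is not Druid's actual code: the class `SubqueryRetryPolicy`, the method `shouldRetry`, the `partialResultsMerged` flag, and the use of plain strings for error codes are all assumptions made for illustration, and the per-query cap stands in for the "X" above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical per-query retry policy (illustration only, not a Druid API).
 * One instance is assumed to be created for each top-level query.
 */
final class SubqueryRetryPolicy {
    // Error codes named in the comment as not worth retrying.
    // Representing them as plain strings is an assumption.
    private static final Set<String> NON_RETRYABLE = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList(
            "RESOURCE_LIMIT_EXCEEDED", "UNAUTHORIZED", "QUERY_TIMEOUT")));

    private final int maxRetriesPerQuery; // the "X" from the suggestion
    private int retriesUsed = 0;

    SubqueryRetryPolicy(int maxRetriesPerQuery) {
        this.maxRetriesPerQuery = maxRetriesPerQuery;
    }

    /**
     * @param errorCode            error code of the failed subquery, or null if none
     * @param partialResultsMerged true if some of the subquery's results were
     *                             already mixed into the overall query results
     * @return true if the subquery may be retried; consumes one unit of budget
     */
    synchronized boolean shouldRetry(String errorCode, boolean partialResultsMerged) {
        if (errorCode != null && NON_RETRYABLE.contains(errorCode)) {
            return false; // doomed to fail again; retrying only adds load
        }
        if (partialResultsMerged) {
            return false; // cannot recover in place; needs a retry from scratch
        }
        if (retriesUsed >= maxRetriesPerQuery) {
            return false; // per-query retry budget exhausted
        }
        retriesUsed++;
        return true;
    }
}
```

Sharing one budget across all subqueries of a query bounds the extra load a single bad query can generate, whereas a per-subquery limit would let that load grow with the number of subqueries.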
