Hi,

I have a couple of PRs ([2], [3]) and an associated issue ([1]), and I was wondering if someone has had a chance to take a look?

In Druid clusters with a large number of historicals, there are usually multiple replicas of each segment, and if one server misbehaves, an entire query can fail even though it could have been completed on other servers. One common way this happens is that a globally cached lookup fails to load on some nodes, for example when a misconfigured node cannot authenticate to a database. The goal is to transparently avoid distributing the query to nodes that will cause it to fail when alternatives are available.

There is already an issue, listed in [4], that talks about filtering generally misbehaving nodes, so this is a special case. One aspect of this issue is different: lookups are specific to a query, so server selection has to be aware of the query. There may be other cases where query-aware selection helps, such as giving certain historicals affinity to take advantage of caching.

The change to server selection, in [2], is relatively minimal. The PR in [3] introduces a solution that can be installed as an extension, with only the small change of adding the Query parameter to the pick interface. It adds the filtering at the ServerSelectorStrategy level and then delegates to an existing ServerSelectorStrategy once the servers have been filtered. When selected, this approach increased server selection time by only a few milliseconds (~8ms to ~11ms) when querying a 13,000-segment datasource with queries that finished on the order of a few seconds, so it does not introduce much overhead.

Alternatively, a filter could be introduced directly as a first-class citizen in Druid, because this seems to be a problem that exists on multiple levels, perhaps by adding a Filter in TierSelectorStrategy before the ServerSelectorStrategy is invoked. I've also been working on the general case, and a first-class filter might simplify the wiring needed to layer multiple filters.

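To make the filter-then-delegate idea concrete, here is a minimal sketch. It is not the code from [2] or [3]: Server, Query, SelectorStrategy, FirstServerStrategy and LookupFilteringStrategy are simplified, hypothetical stand-ins for the real Druid types, and the fallback to the unfiltered candidates when no server has the required lookups is my assumption about what "when alternatives are available" should mean.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for a historical: a name plus the lookups it has loaded.
final class Server {
    final String name;
    final Set<String> loadedLookups;

    Server(String name, Set<String> loadedLookups) {
        this.name = name;
        this.loadedLookups = loadedLookups;
    }
}

// Hypothetical stand-in for a query: only the lookups it needs are modeled.
final class Query {
    final Set<String> requiredLookups;

    Query(Set<String> requiredLookups) {
        this.requiredLookups = requiredLookups;
    }
}

// Simplified selection interface that receives the query (illustrative only).
interface SelectorStrategy {
    Optional<Server> pick(Query query, List<Server> candidates);
}

// Trivial delegate used for the example: picks the first candidate.
final class FirstServerStrategy implements SelectorStrategy {
    @Override
    public Optional<Server> pick(Query query, List<Server> candidates) {
        return candidates.stream().findFirst();
    }
}

// Filters out servers missing a required lookup, then delegates the choice.
final class LookupFilteringStrategy implements SelectorStrategy {
    private final SelectorStrategy delegate;

    LookupFilteringStrategy(SelectorStrategy delegate) {
        this.delegate = delegate;
    }

    @Override
    public Optional<Server> pick(Query query, List<Server> candidates) {
        List<Server> usable = new ArrayList<>();
        for (Server server : candidates) {
            if (server.loadedLookups.containsAll(query.requiredLookups)) {
                usable.add(server);
            }
        }
        // Assumption: if no server qualifies, fall back to the unfiltered list
        // so selection is never stricter than it is today.
        return delegate.pick(query, usable.isEmpty() ? candidates : usable);
    }
}
```

Because the filter only narrows the candidate list, any existing strategy can sit behind it unchanged, which is what keeps the core change down to passing the query through.
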
In either case, I think the cost of allowing queries to influence server selection is minimal, whether that happens directly before calling the ServerSelectorStrategy or as an optional server selector strategy. Does anybody have an opinion about either adding the parameter to the pick method as in [2] or adding a filter that can consider the query? The remaining functionality could then be added as an extension, so the changes to core Druid would be minimal.

Thanks for reading!
Keefe

[1] Failed Query due to missing lookup on some servers
    https://github.com/apache/druid/issues/10294
[2] allow server selection to be aware of query
    https://github.com/apache/druid/pull/10428
[3] ServerSelectorStrategy to filter servers with missing required lookups
    https://github.com/apache/druid/pull/10427
[4] Broker resiliency to misbehaving historical node
    https://github.com/apache/druid/issues/5709