Hi, 

I have a couple of PRs [2][3] and an associated issue [1], and I was wondering 
whether someone has had a chance to take a look.

In Druid clusters with a large number of historicals, there are usually 
multiple replicas of each segment. If one server misbehaves, an entire query 
can fail even though it could have been completed on another replica. One 
common cause is a globally cached lookup failing on some nodes, for example 
when a misconfigured node cannot authenticate to a database. The goal is to 
transparently avoid distributing the query to nodes that will cause it to fail 
when alternatives are available. 

There is already an issue, listed in [4], about filtering generally 
misbehaving nodes, so this is a special case of it. One aspect is different: 
lookups are specific to a query, so server selection needs to be aware of the 
query. There may be other ways selection could benefit from this, such as 
having affinity for certain historicals to take advantage of caching. The 
change to server selection, which is in [2], is relatively minimal.

The PR in [3] introduces a solution that can be installed as an extension; the 
only core change it needs is the small one that adds the Query parameter to 
the pick interface. It adds the filtering at the ServerSelectorStrategy level 
and then delegates to an existing ServerSelectorStrategy once the servers have 
been filtered. With this approach enabled, server selection time increased by 
only a few milliseconds (~8ms to ~11ms) when querying a 13,000-segment 
datasource with queries that finished on the order of a few seconds, so it 
does not introduce much overhead. 
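
To make the filter-then-delegate idea concrete, here is a minimal sketch. It 
is not the code from [3]: Query, Server, and this ServerSelectorStrategy are 
simplified stand-ins rather than Druid's real classes or signatures, and the 
fallback to the unfiltered list when no replica has every lookup is an 
assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Simplified stand-ins for illustration; not Druid's real classes or signatures.
final class Query {
    final Set<String> requiredLookups;

    Query(Set<String> requiredLookups) {
        this.requiredLookups = requiredLookups;
    }
}

final class Server {
    final String name;
    final Set<String> loadedLookups;

    Server(String name, Set<String> loadedLookups) {
        this.name = name;
        this.loadedLookups = loadedLookups;
    }
}

interface ServerSelectorStrategy {
    Optional<Server> pick(Query query, List<Server> candidates);
}

// Stand-in for an existing strategy: take the first candidate.
final class FirstServerStrategy implements ServerSelectorStrategy {
    @Override
    public Optional<Server> pick(Query query, List<Server> candidates) {
        return candidates.stream().findFirst();
    }
}

// Drops servers missing a lookup the query needs, then delegates.
final class LookupFilteringStrategy implements ServerSelectorStrategy {
    private final ServerSelectorStrategy delegate;

    LookupFilteringStrategy(ServerSelectorStrategy delegate) {
        this.delegate = delegate;
    }

    @Override
    public Optional<Server> pick(Query query, List<Server> candidates) {
        List<Server> usable = new ArrayList<>();
        for (Server server : candidates) {
            if (server.loadedLookups.containsAll(query.requiredLookups)) {
                usable.add(server);
            }
        }
        // Assumption: if no replica has every lookup, keep the original
        // candidates so the query is still attempted.
        return delegate.pick(query, usable.isEmpty() ? candidates : usable);
    }
}
```

Because the wrapper only narrows the candidate list, any existing strategy can 
sit behind it unchanged.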

Alternatively, a filter could be introduced directly as a first-class citizen 
in Druid, because this seems to be a problem that exists on multiple levels, 
perhaps by adding a Filter in TierSelectorStrategy before the 
ServerSelectorStrategy is invoked. I've also been working on the general case, 
and a first-class filter might simplify the wiring needed to layer multiple 
filters. 
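
As a rough illustration of what layering filters could look like, here is a 
small sketch. ServerFilter and FilterChain are hypothetical names, not 
existing Druid types, and skipping a filter that would remove every remaining 
candidate is an assumed policy, chosen so that a query is never left with no 
server to run on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical interface, not part of Druid: narrows candidates for a query.
interface ServerFilter<Q, S> {
    List<S> filter(Q query, List<S> candidates);

    // Builds a filter from a per-server test.
    static <Q, S> ServerFilter<Q, S> of(BiPredicate<Q, S> keep) {
        return (query, candidates) -> {
            List<S> kept = new ArrayList<>();
            for (S server : candidates) {
                if (keep.test(query, server)) {
                    kept.add(server);
                }
            }
            return kept;
        };
    }
}

final class FilterChain {
    private FilterChain() {
    }

    // Applies each filter in order. Assumption: a filter that would leave
    // no candidates is skipped instead of emptying the list.
    static <Q, S> List<S> apply(Q query, List<S> candidates,
                                List<ServerFilter<Q, S>> filters) {
        List<S> current = candidates;
        for (ServerFilter<Q, S> filter : filters) {
            List<S> narrowed = filter.filter(query, current);
            if (!narrowed.isEmpty()) {
                current = narrowed;
            }
        }
        return current;
    }
}
```

The surviving candidates would then be handed to the ServerSelectorStrategy as 
usual.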

In either case, I think the cost of allowing queries to influence server 
selection is minimal, whether that happens directly before calling the 
ServerSelectorStrategy or in an optional server selector strategy. 

Does anybody have an opinion about either adding the parameter to the pick 
method as in [2] or adding a filter that can consider the query? The remaining 
functionality could then be added as an extension, so the changes to core Druid 
would be minimal. 

Thanks for reading!

Keefe


[1] Failed Query due to missing lookup on some servers 
    https://github.com/apache/druid/issues/10294 

[2] allow server selection to be aware of query
    https://github.com/apache/druid/pull/10428

[3] ServerSelectorStrategy to filter servers with missing required lookups
    https://github.com/apache/druid/pull/10427

[4] Broker resiliency to misbehaving historical node
    https://github.com/apache/druid/issues/5709
