kroeders opened a new issue #10294: URL: https://github.com/apache/druid/issues/10294
### Motivation

Given a query with lookups, a single server with a missing lookup can cause query execution to fail. When the broker distributes a query to historical and realtime servers, the query fails as a whole if any one of those servers does not have the lookup. Lookups can fail to load for a number of reasons, such as missing firewall rules, missing drivers, or slow loading times for large, frequently updated lookups. These queries could be served if the broker considered lookup status when selecting servers for querying.

To reproduce this issue, load the druid-lookups-cached-global extension and create a database-backed lookup. Launch an additional historical without the database driver, and the lookup will fail to load on that historical. Queries using the lookup will fail altogether because of the one historical without the lookup.

### Proposed Changes

The proposal is to modify the broker to track the lookup status on historical and realtime servers and avoid routing queries to servers where relevant lookups are not loaded. This can be done by making server selection aware of the query and excluding servers without required lookups.

#### Tracking Lookup Status in Broker

The coordinator is responsible for tracking lookups and ensuring they are updated on query servers, so it has the information on which version has been successfully loaded on each node. This is available through the nodeStatus API. The broker can periodically poll the coordinator’s nodeStatus API and maintain a local cache of lookup status on each query server. Alternatively, the broker could poll the internal listener API on the query servers, but this repeats work that the coordinator already does. Other transport mechanisms such as ZooKeeper could also be used, or the coordinator could push the information to the brokers.

#### Avoiding Query Servers without Lookup

CachingClusteredClient is responsible for determining which servers fulfil a query.
The process is to retrieve the set of segment-to-server mappings relevant to the query and then use a strategy to select servers for each segment. Server selection is not aware of the query. When filtering segments in TierSelectorStrategy before applying the ServerSelector strategy, the query could be taken into account to avoid query servers without required lookups. Default methods can be added to avoid breaking existing implementations. Alternatively, the pick methods on the ServerSelector interfaces could be extended with a Query parameter to avoid servers without relevant lookups. Because this is an exceptional case, the servers could also be filtered in CachingClusteredClient before selection. Another alternative would be to handle the exception from the historical or realtime server and retry the query for those segments.

#### Extracting Lookups from Queries

Lookups specified as functions in SQL become virtual columns with a lookup expression, or the right-hand join source for join queries. A new query runner could be added to extract the lookups, compare them with the lookups loaded on each server, and store the resulting blocklist of servers in the query context.
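
As a rough illustration of the filtering step described under "Avoiding Query Servers without Lookup", the following is a minimal sketch and not existing Druid code. The class `LookupAwareServerFilter`, the method `eligibleServers`, and the server and lookup names are all hypothetical. Servers are represented as plain strings rather than Druid's own server types, and the per-server lookup status is assumed to have already been fetched (for example from the coordinator) into a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper, not part of Druid: filters the candidate servers
 * for a segment down to those that have every lookup the query needs.
 */
final class LookupAwareServerFilter {

    private LookupAwareServerFilter() {
    }

    /**
     * @param candidates            servers holding a replica of the segment
     * @param requiredLookups       lookup names extracted from the query
     * @param loadedLookupsByServer locally cached lookup status per server
     * @return candidates that have all required lookups loaded, in input order
     */
    static List<String> eligibleServers(
            List<String> candidates,
            Set<String> requiredLookups,
            Map<String, Set<String>> loadedLookupsByServer) {
        List<String> eligible = new ArrayList<>();
        for (String server : candidates) {
            // Assumption: a server with no cached status is treated as
            // having no lookups loaded.
            Set<String> loaded = loadedLookupsByServer.getOrDefault(server, Set.of());
            if (loaded.containsAll(requiredLookups)) {
                eligible.add(server);
            }
        }
        return eligible;
    }
}
```

In this sketch a query that uses no lookups keeps every candidate, so behaviour is unchanged for the common case. What to do when the filtered list is empty is left open; one option would be to fall back to the unfiltered candidates so that the existing error still surfaces.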
