gianm opened a new issue #8713: Improper result-level cache ETag handling for union datasources URL: https://github.com/apache/incubator-druid/issues/8713 Result-level caching and union datasources do not play well together: it is possible for one of the underlying datasources to be improperly ignored. Here is how result-level caching was designed to work. It is trying to piggyback off the system that implements the standard `ETag` / `If-None-Match` protocol. Note that there's a QueryRunner stack on the broker that, among other runners, goes **ResultLevelCachingQueryRunner** → **UnionQueryRunner** → **CachingClusteredClient**. 1. ResultLevelCachingQueryRunner computes a cache key for the query using the toolChest's cache strategy. These cache keys do not generally include the datasource or interval. This sounds bad, but is actually fine, as we'll see. 2. ResultLevelCachingQueryRunner fetches results from the cache for that key. Part of the cached value is an ETag based on the totality of segments involved in the query (it is a SHA1 of all of the segment identifiers). The idea is that the ETag includes all the datasource and interval information, so the cache key doesn't have to. It was computed by CachingClusteredClient the last time the query ran (see step 4). 3. That ETag is set as `If-None-Match` in the query request context, and the query is passed down the QueryRunnerStack. 4. CachingClusteredClient, in its `run` method, computes the ETag for the query and saves it in the "ETag" field in the response context. 5. CachingClusteredClient also checks if the ETag matches the `If-None-Match` query request context parameter. If so, it returns an empty sequence, and will not actually execute the query. 6. ResultLevelCachingQueryRunner, after getting the sequence from downstream runners (but before evaluating it), inspects the query response context. If an ETag is set there, and it's the same as the ETag from the cache, then it knows the CachingClusteredClient noticed a match, and it returns the old cached results. If the ETag is set to something else, it instead returns the sequence from the downstream runners. There is a problem here when union datasources come into play. UnionQueryRunner works by splitting up union datasources into N subqueries, then running them all and merging the results together. Crucially, it sits above CachingClusteredClient in the QueryRunner stack, and all subqueries _share the same response context_. Meaning that when CachingClusteredClient sets ETags in the response context, it's actually doing it for each subquery separately, and they're all clobbering each other. That sounds bad enough! But there is another, more subtle problem as well. The ETag setting only happens when CachingClusteredClient's `run` method gets called. But, when a union datasource is split up, it transforms an _eager_ `run` call into a _lazy_ `run` call. This means the ETag actually doesn't get set until you start iterating the sequence, which hasn't happened yet by the time step (6) happens above. So the ETag is always null at this point, even if it will get set later on (probably incorrectly). The effect of all this on ResultLevelCachingQueryRunner is that when one of the subquery ETags matches the currently-cached result, that subquery's results will be ignored (because of step 5) but other subquery results will be fetched and merged as normal and the ResultLevelCachingQueryRunner will think it had a cache miss.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
