Re: RFR: 8304885: Reuse stale data to improve DNS resolver resiliency [v3]

Sergey Bylokhov Thu, 20 Apr 2023 21:57:40 -0700

> I would like to get preliminary feedback about the provided patch.
> 
> Discussion on net-dev@ 
> https://mail.openjdk.org/pipermail/net-dev/2023-March/020682.html
> 
> One of the main issues I am trying to solve is how the cache handles 
> intermittent DNS server outages caused by overloading or network problems.
> 
> At the moment this cache can be configured by the application using the 
> following two properties:
>    (1) "networkaddress.cache.ttl" (30 sec) - cache policy for successful lookups
>    (2) "networkaddress.cache.negative.ttl" (10 sec) - cache policy for negative lookups
> 
> The default timeout for positive responses is good enough to "have recent 
> dns-records" and to "minimize the number of requests to the DNS server".
> 
> But the cache for the negative responses is problematic. This is a problem I 
> would like to solve. Caching the negative response means that for **10** 
> seconds the application will not be able to connect to the server.
> 
> Possible solutions:
>   1. Decreasing timeout "for the negative responses": unfortunately more 
> requests to the server at the moment of "DNS-outage" cause even more issues, 
> since this is not the right moment to load the network/server more.
>   2. Increasing timeout "for the positive responses": this will decrease the 
> chance to get an error, but the cache will start to use stale data longer.
>   3. This proposal: ignore the negative response and continue to use the 
> result of the last "successful lookup" until some additional timeout 
> expires.
> 
> The idea is to split the notion of the TTL and the timeout used for the 
> cache. When the TTL for the record expires, we should request new data 
> from the server. If this request succeeds we update the record; if it 
> fails we continue to use the cached data until the next sync.
> 
> For example, if the new property "networkaddress.cache.extended.ttl" is set 
> to 10 minutes, then we will cache a positive response for 10 minutes but will 
> try to sync it every 30 seconds. If the new property is not set then, as 
> before, we will cache a positive response for 30 seconds and then cache the 
> negative response for 10 seconds.
> 
> 
> RFC 8767 Serving Stale Data to Improve DNS Resiliency:
> https://www.rfc-editor.org/rfc/rfc8767
> 
> Comments about current and other possible implementations:
>  * The code was intentionally moved to the separate ValidAddresses class, to 
> make it clear that the default configuration, when the new property is not 
> set, is not changed much.
>  * The refresh timeout includes the time spent on the server lookup. So if we 
> have to refresh every 2 seconds but spend 3 seconds in the lookup, then we 
> will request the server on each lookup. Another implementation may spend 3 
> seconds on the lookup and then additionally use the cached value for 2 seconds.
>  * The extended timeout is a kind of "maximum stale timer" from the RFC 
> above, but it starts counting not from the moment the record expired, but 
> from the moment we added it to the cache. Another possible implementation may 
> start counting from the moment the TTL expired, so the value would be used 
> for ttl + extended in total.
>  * The extended timeout has a hard deadline which never changes during 
> execution. For example, if it is set to 1 day, then by the end of the day we 
> must make a successful lookup to recache the value; otherwise we will start 
> to use the "negative" cache. Other implementations may shift the expiration 
> time on every successful sync.
> 
> Any thoughts about other possibilities?
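
The behaviour described in the quoted text (re-sync every TTL, keep the last 
good answer when a refresh fails, give up at a hard deadline counted from 
insertion) can be sketched roughly as below. This is only an illustration 
under stated assumptions, not the code from the patch: the names Lookup and 
StaleTolerantEntry, the injected clock, and the use of plain strings for 
addresses are all hypothetical and do not correspond to the ValidAddresses 
class mentioned above.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.LongSupplier;

/** Hypothetical resolver hook; throws when the DNS lookup fails. */
interface Lookup {
    List<String> resolve(String host) throws Exception;
}

/** Sketch of one cache entry that keeps serving stale data after a failed refresh. */
final class StaleTolerantEntry {
    private final String host;
    private final Lookup lookup;
    private final LongSupplier clockSeconds; // injected so the sketch can be tested
    private final long ttlSeconds;           // how often we try to re-sync
    private final long hardDeadline;         // fixed at insertion, never moved

    private List<String> addresses;
    private long refreshAt;

    StaleTolerantEntry(String host, List<String> initial, Lookup lookup,
                       LongSupplier clockSeconds, long ttlSeconds, long extendedTtlSeconds) {
        this.host = host;
        this.addresses = initial;
        this.lookup = lookup;
        this.clockSeconds = clockSeconds;
        this.ttlSeconds = ttlSeconds;
        long now = clockSeconds.getAsLong();
        this.refreshAt = now + ttlSeconds;
        this.hardDeadline = now + extendedTtlSeconds;
    }

    /** Returns the cached addresses, or empty once the hard deadline has passed. */
    Optional<List<String>> get() {
        long now = clockSeconds.getAsLong();
        if (now >= hardDeadline) {
            return Optional.empty(); // caller falls back to a normal lookup / negative cache
        }
        if (now >= refreshAt) {
            try {
                addresses = lookup.resolve(host); // successful sync replaces the record
            } catch (Exception e) {
                // Failed sync: keep the stale addresses until the next attempt.
            }
            // Counted from before the lookup, so lookup time is part of the interval.
            refreshAt = now + ttlSeconds;
        }
        return Optional.of(addresses);
    }
}
```

With ttlSeconds = 30 and extendedTtlSeconds = 600, a lookup that fails at 
second 30 still returns the original addresses, a later successful lookup 
replaces them, and from second 600 onward the entry reports nothing.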


Sergey Bylokhov has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains seven additional 
commits since the last revision:

 - Merge remote-tracking branch 'upstream/master' into JDK-8304885
 - Use "maximum stale timer" instead of the extended timeout, and bump it on 
each successful lookup
 - the suggested cap is 7 days
 - simplify
 - comments
 - Merge remote-tracking branch 'upstream/master' into JDK-8304885
 - 8304885: Reuse stale data to improve DNS resolver resiliency
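
The second commit in the list replaces the fixed extended timeout with a 
stale timer that is bumped on each successful lookup. The difference between 
the two deadline policies can be sketched as follows. This is a rough 
illustration, not code from the pull request: the class and method names are 
hypothetical, the sliding deadline is assumed to be counted from the last 
successful lookup, and the message does not say what the 7-day cap applies 
to, so it is passed in as a plain upper bound on the stale timer.

```java
import java.time.Duration;

/** Hypothetical helper contrasting the two deadline policies. */
final class StaleDeadline {
    private StaleDeadline() {}

    /** Fixed policy: the deadline is computed once, when the record is added. */
    static long fixedDeadline(long insertedAtSeconds, Duration extended) {
        return insertedAtSeconds + extended.getSeconds();
    }

    /**
     * Sliding policy: the deadline is recomputed after every successful lookup.
     * The cap is an assumption of this sketch (an upper bound on the stale timer).
     */
    static long slidingDeadline(long lastSuccessSeconds, Duration maxStale, Duration cap) {
        Duration effective = maxStale.compareTo(cap) > 0 ? cap : maxStale;
        return lastSuccessSeconds + effective.getSeconds();
    }
}
```

Under the fixed policy a record added at second 0 with a 1-day timeout stops 
being usable at second 86400 no matter how many refreshes succeed. Under the 
sliding policy every successful refresh pushes the deadline forward again.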

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13285/files
  - new: https://git.openjdk.org/jdk/pull/13285/files/c9b0d79b..4db0216c

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13285&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13285&range=01-02

  Stats: 10481 lines in 234 files changed: 2916 ins; 7145 del; 420 mod
  Patch: https://git.openjdk.org/jdk/pull/13285.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13285/head:pull/13285

PR: https://git.openjdk.org/jdk/pull/13285
