On Mon, 3 Apr 2023 22:57:36 GMT, Sergey Bylokhov <[email protected]> wrote:
>> I would like to get preliminary feedback about the provided patch. >> >> Discussion on net-dev@ >> https://mail.openjdk.org/pipermail/net-dev/2023-March/020682.html >> >> One of the main issue I try to solve is how the cache handle the >> intermittent DNS server outages due to overloading or network connection. >> >> At the moment this cache can be configured by the application using the >> following two properties: >> (1) "networkaddress.cache.ttl"(30 sec) - cache policy for successful >> lookups >> (2) "networkaddress.cache.negative.ttl"(10 sec) - cache policy for >> negative lookups >> >> The default timeout for positive responses is good enough to "have recent >> dns-records" and to "minimize the number of requests to the DNS server". >> >> But the cache for the negative responses is problematic. This is a problem I >> would like to solve. Caching the negative response means that for **10** >> seconds the application will not be able to connect to the server. >> >> Possible solutions: >> 1. Decreasing timeout "for the negative responses": unfortunately more >> requests to the server at the moment of "DNS-outage" cause even more issues, >> since this is not the right moment to load the network/server more. >> 2. Increasing timeout "for the positive responses": this will decrease the >> chance to get an error, but the cache will start to use stale data longer. >> 3. This proposal: it would be good to ignore the negative response and >> continue to use the result of the last "successful lookup" until some >> additional timeout. >> >> The idea is to split the notion of the TTL and the timeout used for the >> cache. When TTL for the record will expire we should request the new data >> from the server. If this request goes fine we will update the record, if it >> fails we will continue to use the cached date until the next sync. >> >> For example, if the new property "networkaddress.cache.extended.ttl" is set >> to 10 minutes, then we will cache a positive response for 10 minutes but >> will try to sync it every 30 seconds. If the new property is not set then as >> before we will cache positive for 30 seconds and then cache the negative >> response for 10 seconds. >> >> >> RFC 8767 Serving Stale Data to Improve DNS Resiliency: >> https://www.rfc-editor.org/rfc/rfc8767 >> >> Comments about current and other possible implementations: >> * The code intentionally moved to the separate ValidAddresses class, just >> to make clear that the default configuration, when the new property is not >> set is not changed much. >> * The refresh timeout includes the time spent on the server lookup. So if >> we have to refresh every 2 seconds, but in lookup, we spend 3 seconds then >> we will request the server on each lookup. Another implementation may spend >> 3 seconds on lookup and then additional use the cached value for two seconds. >> * The extended timeout is a kind of "maximum stale timer" from the RFC >> above, but it starts counting not from the moment the record expired, but >> from the moment we added it to the cache. Another possible implementation >> may start counting from the moment the TTL expired, so the overall usage of >> the value will be sum ttl+extended. >> * The extended timeout has a hard deadline which is never changed during >> execution, for example, if it sets for 1 day, then at the end of the day we >> should make a successful lookup to recache the value otherwise we will start >> to use the "negative" cache. Other implementations may shift the expiration >> time on every successful sync. >> >> Any thoughts about other possibilities? > > Sergey Bylokhov has updated the pull request incrementally with one > additional commit since the last revision: > > Use "maximum stale timer" instead of the extended timeout, and bump it on > each successful lookup This looks like a nice improvement. You'll need to update the docs in all places where `networkaddress.cache.ttl` is mentioned, that is: - https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/java.base/share/classes/java/net/doc-files/net-properties.html#L263 - https://github.com/openjdk/jdk/blob/c6bd489cc8d30fb6eec865b3dab1cf861e25c8d7/src/java.base/share/classes/java/net/InetAddress.java#L200 - https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/java.base/share/conf/security/java.security#L358 Plus, as Alan suggested, a CSR and a release note will also be necessary. src/java.base/share/classes/java/net/InetAddress.java line 980: > 978: * are removed from the expirySet and cache. > 979: */ > 980: public boolean expired(long now) { This method's name does not reflect what this method does. It may be worth splitting. src/java.base/share/classes/java/net/InetAddress.java line 1026: > 1024: public InetAddress[] get() { > 1025: long now = System.nanoTime(); > 1026: if ((refreshTime - now) < 0L && lookupLock.tryLock()) { I like it how you only block one thread while refreshing and let other threads use the stale value. src/java.base/share/classes/sun/net/InetAddressCachePolicy.java line 126: > 124: tmp = getProperty(cacheStalePolicyProp, > 125: cacheStalePolicyPropFallback); > 126: if (tmp != null && tmp > cachePolicy) { Why do you require cacheStalePolicy > cachePolicy? In your implementation, stale timer starts running after the base timer expires, so lower values still make sense. ------------- PR Review: https://git.openjdk.org/jdk/pull/13285#pullrequestreview-1373340498 PR Review Comment: https://git.openjdk.org/jdk/pull/13285#discussion_r1158790019 PR Review Comment: https://git.openjdk.org/jdk/pull/13285#discussion_r1158822139 PR Review Comment: https://git.openjdk.org/jdk/pull/13285#discussion_r1158824436
