It's a good topic to bring up.

Have you tried the JFR support for method timing and tracing events that JEP 520 introduced in JDK 25? I'm wondering if -XX:StartFlightRecording:jdk.MethodTrace#filter=java.net.InetAddress::getByName records events that could help here.

If new events are introduced then I could imagine them having "NameService" rather than "Dns" in the name, as the JDK doesn't use DNS directly (except in the JNDI-DNS provider); it uses whatever name service is configured on the system.

-Alan


On 13/11/2025 11:07, [email protected] wrote:
Hello,
I would like to start a discussion on introducing new JFR events for DNS lookups. While many lookups are DNS in cloud-native environments, the JDK uses the configured name service, so the event naming and semantics should not imply DNS-only behavior. I’m seeking feedback on scope, naming, and payload fields.
Motivation

  * High-frequency, latency-sensitive lookups are critical for service
    discovery.
  * Current gaps:
      o Cannot distinguish cache hits vs. network lookups
      o Hard to trace lookup latency and diagnose timeouts/failures
      o Concurrent libraries may cause redundant lookups
  * Value:
      o End-to-end observability: lookup → socket connect → data transfer
      o Troubleshooting: identify timeouts, resolution failures
      o Performance: evaluate cache policies, detect hotspot names
      o Security: audit external domains accessed

*Proposed event (initial draft)*
*Event name:* jdk.DnsLookup
*When:* Emitted around DNS hostname resolution call boundaries, including:

  * Actual network DNS queries (when cache is disabled or cache miss
    occurs)
  * Cache hits (when result is retrieved from DNS cache)
  * Stale data usage (when expired but still valid cached data is used)
  * Background DNS cache refresh operations

*Key fields (feedback welcome):*

  * host (String): The hostname being resolved
  * result (String): Comma-separated list of resolved IP addresses, or
    error message if lookup failed
  * success (boolean): Whether the DNS lookup was successful
  * cached (boolean): Whether the result was retrieved from cache
    (true) or from actual DNS network query (false). This helps
    distinguish between three use cases:
      o Actual network queries (cached=false) - represents real DNS
        network traffic
      o Cache hits (cached=true, stale=false) - repeated lookups using
        fresh cached data
      o Stale data usage (cached=true, stale=true) - application
        continues with expired but still valid cached data when DNS
        refresh fails
  * ttl (long, seconds): Time to live in seconds. Values:
      o 0: Not cached
      o -1: Cached forever
      o > 0: Actual remaining TTL if cached
  * stale (boolean): Whether stale cached data was used (only valid
    when cached=true). Helps identify semi-error scenarios where DNS
    errors occur but application continues using stale cached records
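
For illustration only, the fields listed above could be modelled as an application-level event with the public jdk.jfr API. This is a rough sketch and is not taken from the PR: the class names, the event name example.NameLookup and the timedLookup helper are all hypothetical. Note that code outside the JDK cannot see the InetAddress cache, so cached, stale and ttl are left at their defaults here; only an in-JDK event could fill them in.

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Timespan;

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.stream.Collectors;

public class LookupEventSketch {

    // Hypothetical event name; this is not an event defined by the JDK.
    @Name("example.NameLookup")
    @Label("Name Lookup")
    @Category({"Example", "Network"})
    public static class LookupEvent extends Event {
        @Label("Host") String host;
        @Label("Result") String result;
        @Label("Success") boolean success;
        // The three fields below cannot be observed from application code,
        // so this sketch never sets them.
        @Label("Cached") boolean cached;
        @Label("Stale") boolean stale;
        @Label("TTL") @Timespan(Timespan.SECONDS) long ttl;
    }

    // Wraps a lookup in begin()/end() so the event duration covers the call.
    static LookupEvent timedLookup(String host) {
        LookupEvent event = new LookupEvent();
        event.host = host;
        event.begin();
        try {
            InetAddress[] addresses = InetAddress.getAllByName(host);
            event.result = Arrays.stream(addresses)
                    .map(InetAddress::getHostAddress)
                    .collect(Collectors.joining(","));
            event.success = true;
        } catch (UnknownHostException e) {
            event.result = e.getMessage();
            event.success = false;
        }
        event.end();
        event.commit(); // writes nothing unless a recording has the event enabled
        return event;
    }

    public static void main(String[] args) {
        LookupEvent e = timedLookup("localhost");
        System.out.println(e.host + " -> " + e.result + " success=" + e.success);
    }
}
```

Running this with -XX:StartFlightRecording should make the events appear in the recording under the hypothetical name used above.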

*Event name:* jdk.DnsCacheStatistics
*When:* Periodic event emitted at configurable intervals (default: 5 seconds in default.jfc, 1 second in profile.jfc). This is a statistics event similar to jdk.ExceptionStatistics, providing aggregate metrics about the DNS cache state.
*Key fields (feedback welcome):*

  * cacheSize (long): Current number of entries in the DNS cache.
    Useful for monitoring cache growth and understanding cache
    utilization patterns.
  * staleEntries (long): Number of stale entries currently in the
    cache (entries that have expired but are still within the stale
    period). Helps identify how many entries are using stale data,
    which is important for understanding cache behavior in scenarios
    where DNS refresh fails.
  * entriesRemoved (long): Number of entries that have been removed
    during cache cleanup operations. This metric tracks cache eviction
    and helps understand cache churn patterns, which is particularly
    useful in Kubernetes and cloud-native environments where DNS
    entries may change frequently.
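
As a similar sketch, a periodic statistics event with these three fields could be registered through FlightRecorder.addPeriodicEvent. Again this is an assumption-laden illustration rather than the PR's implementation: the event name, the class names and the AtomicLong counters are hypothetical stand-ins for whatever the real cache would expose.

```java
import jdk.jfr.Event;
import jdk.jfr.FlightRecorder;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Period;
import jdk.jfr.StackTrace;

import java.util.concurrent.atomic.AtomicLong;

public class CacheStatisticsSketch {

    // Hypothetical event name; this is not an event defined by the JDK.
    @Name("example.NameCacheStatistics")
    @Label("Name Cache Statistics")
    @Period("5 s")      // default period; a .jfc settings file can override it
    @StackTrace(false)  // a stack trace is not useful for a periodic sample
    public static class CacheStatisticsEvent extends Event {
        @Label("Cache Size") long cacheSize;
        @Label("Stale Entries") long staleEntries;
        @Label("Entries Removed") long entriesRemoved;
    }

    // Hypothetical counters standing in for the real cache's bookkeeping.
    static final AtomicLong CACHE_SIZE = new AtomicLong();
    static final AtomicLong STALE_ENTRIES = new AtomicLong();
    static final AtomicLong ENTRIES_REMOVED = new AtomicLong();

    // Copies the current counter values into a new, uncommitted event.
    static CacheStatisticsEvent snapshot() {
        CacheStatisticsEvent event = new CacheStatisticsEvent();
        event.cacheSize = CACHE_SIZE.get();
        event.staleEntries = STALE_ENTRIES.get();
        event.entriesRemoved = ENTRIES_REMOVED.get();
        return event;
    }

    // Registers a hook that JFR invokes once per period while the event is enabled.
    static Runnable register() {
        Runnable hook = () -> snapshot().commit();
        FlightRecorder.addPeriodicEvent(CacheStatisticsEvent.class, hook);
        return hook; // pass to FlightRecorder.removePeriodicEvent to unregister
    }

    public static void main(String[] args) {
        CACHE_SIZE.set(3);
        register();
        CacheStatisticsEvent e = snapshot();
        System.out.println("cacheSize=" + e.cacheSize);
    }
}
```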

*Use cases:*

  * Monitoring DNS cache size growth over time
  * Identifying cache cleanup frequency and patterns
  * Understanding stale data usage in production environments
  * Troubleshooting DNS-related performance issues in microservices
    architectures
  * Observing cache behavior during DNS server failures or network
    partitions

Prototype/PR

  * A preliminary PR is available for context and discussion:
      o https://git.openjdk.org/jdk/pull/28110
  * I will update the design/implementation per feedback from this thread.

Thanks in advance for your feedback!
Best regards,
NeayGuyCoding
