Hello,
I would like to start a discussion on introducing new JFR events for DNS 
lookups. While many lookups are DNS in cloud-native environments, the JDK uses 
the configured name service, so the event naming and semantics should not imply 
DNS-only behavior. I’m seeking feedback on scope, naming, and payload fields.
Motivation


• High-frequency, latency-sensitive lookups are critical for service discovery.

• Current gaps:
    • Cannot distinguish cache hits vs. network lookups

    • Hard to trace lookup latency and diagnose timeouts/failures

    • Multiple libraries resolving the same name concurrently may cause redundant lookups

• Value:
    • End-to-end observability: lookup → socket connect → data transfer

    • Troubleshooting: identify timeouts, resolution failures

    • Performance: evaluate cache policies, detect hotspot names

    • Security: audit external domains accessed


Proposed event (initial draft)
Event name: jdk.DnsLookup
When: Emitted around DNS hostname resolution call boundaries, including:


• Actual network DNS queries (when cache is disabled or cache miss occurs)

• Cache hits (when result is retrieved from DNS cache)

• Stale data usage (when expired cached data that is still within the stale period is used)

• Background DNS cache refresh operations


Key fields (feedback welcome):


• host (String): The hostname being resolved

• result (String): Comma-separated list of resolved IP addresses, or error 
message if lookup failed

• success (boolean): Whether the DNS lookup was successful

• cached (boolean): Whether the result was retrieved from cache (true) or from 
an actual DNS network query (false). Combined with the stale field, this 
distinguishes three cases:
    • Actual network queries (cached=false) - represents real DNS network 
traffic

    • Cache hits (cached=true, stale=false) - repeated lookups using fresh 
cached data

    • Stale data usage (cached=true, stale=true) - application continues with 
expired cached data that is still within the stale period when DNS refresh fails

• ttl (long, seconds): Remaining time to live of the cached entry. Values:
    • 0: Not cached; -1: Cached forever

    • > 0: Actual remaining TTL if cached

• stale (boolean): Whether stale cached data was used (only meaningful when 
cached=true). Helps identify semi-error scenarios where DNS errors occur but 
the application continues using stale cached records.
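
To make the field list above easier to discuss, here is a minimal application-level sketch of an event with the same payload, written against the public jdk.jfr API. It is not the code from the PR: the event name example.DnsLookup, the class names, and the describe() helper are all hypothetical, and a real JDK event would be defined inside the JDK rather than as a user event. The wrapper cannot see the internal InetAddress cache, so cached, stale and ttl stay at their default values here.

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Timespan;

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical stand-in for the proposed jdk.DnsLookup event (not the PR's code).
@Name("example.DnsLookup")
@Label("DNS Lookup (sketch)")
@Category({"Example", "Network"})
class DnsLookupSketchEvent extends Event {
    @Label("Host") String host;
    @Label("Result") String result;      // comma-separated addresses or error message
    @Label("Success") boolean success;
    @Label("Cached") boolean cached;
    @Label("Stale") boolean stale;
    @Label("TTL") @Timespan(Timespan.SECONDS) long ttl;
}

final class DnsLookupSketch {
    // Maps the (cached, stale) pair onto the three cases described in the field list.
    static String describe(boolean cached, boolean stale) {
        if (!cached) {
            return "NETWORK_QUERY";
        }
        return stale ? "STALE_HIT" : "CACHE_HIT";
    }

    // Wraps a lookup in the sketch event; commit() is a no-op when JFR is not recording.
    static InetAddress[] lookup(String host) throws UnknownHostException {
        DnsLookupSketchEvent event = new DnsLookupSketchEvent();
        event.host = host;
        event.begin();
        try {
            InetAddress[] addresses = InetAddress.getAllByName(host);
            event.success = true;
            event.result = Arrays.stream(addresses)
                    .map(InetAddress::getHostAddress)
                    .collect(Collectors.joining(","));
            return addresses;
        } catch (UnknownHostException e) {
            event.success = false;
            event.result = e.getMessage();
            throw e;
        } finally {
            // cached, stale and ttl are not observable from outside InetAddress,
            // so this sketch leaves them at false/false/0.
            event.commit();
        }
    }
}
```

An in-JDK implementation would be able to fill cached, stale and ttl from the cache entry itself, which is the main reason to place the event inside the JDK rather than in application code.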


Event name: jdk.DnsCacheStatistics
When: Periodic event emitted at configurable intervals (default: 5 seconds in 
default.jfc, 1 second in profile.jfc). This is a statistics event similar to 
jdk.ExceptionStatistics, providing aggregate metrics about the DNS cache state.
Key fields (feedback welcome):


• cacheSize (long): Current number of entries in the DNS cache. Useful for 
monitoring cache growth and understanding cache utilization patterns.

• staleEntries (long): Number of stale entries currently in the cache (entries 
that have expired but are still within the stale period). Helps identify how 
many entries are using stale data, which is important for understanding cache 
behavior in scenarios where DNS refresh fails.

• entriesRemoved (long): Number of entries that have been removed during cache 
cleanup operations. This metric tracks cache eviction and helps understand 
cache churn patterns, which is particularly useful in Kubernetes and 
cloud-native environments where DNS entries may change frequently.
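
For illustration, the periodic statistics event could look roughly like the following user-level sketch, which uses @Period and FlightRecorder.addPeriodicEvent from the public jdk.jfr API. The event name example.DnsCacheStatistics, the class names, and the three AtomicLong counters are assumptions made for this example; the real values would come from the InetAddress cache, which application code cannot inspect.

```java
import jdk.jfr.Event;
import jdk.jfr.FlightRecorder;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Period;

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the proposed jdk.DnsCacheStatistics event.
@Name("example.DnsCacheStatistics")
@Label("DNS Cache Statistics (sketch)")
@Period("5 s") // default period; a recording's settings can override it
class DnsCacheStatisticsSketchEvent extends Event {
    @Label("Cache Size") long cacheSize;
    @Label("Stale Entries") long staleEntries;
    @Label("Entries Removed") long entriesRemoved;
}

final class DnsCacheStatsSketch {
    // Hypothetical counters that a cache implementation would maintain.
    static final AtomicLong CACHE_SIZE = new AtomicLong();
    static final AtomicLong STALE_ENTRIES = new AtomicLong();
    static final AtomicLong ENTRIES_REMOVED = new AtomicLong();

    // Kept in a field so the same Runnable instance can be removed later.
    private static final Runnable HOOK = DnsCacheStatsSketch::emit;

    // Copies the current counter values into a new, uncommitted event.
    static DnsCacheStatisticsSketchEvent snapshot() {
        DnsCacheStatisticsSketchEvent event = new DnsCacheStatisticsSketchEvent();
        event.cacheSize = CACHE_SIZE.get();
        event.staleEntries = STALE_ENTRIES.get();
        event.entriesRemoved = ENTRIES_REMOVED.get();
        return event;
    }

    // Invoked by JFR once per period while the event is enabled in a recording.
    static void emit() {
        snapshot().commit();
    }

    static void register() {
        FlightRecorder.addPeriodicEvent(DnsCacheStatisticsSketchEvent.class, HOOK);
    }

    static boolean unregister() {
        return FlightRecorder.removePeriodicEvent(HOOK);
    }
}
```

Calling DnsCacheStatsSketch.register() once at startup is enough; JFR then runs the hook at the configured period whenever a recording has the event enabled.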


Use cases:


• Monitoring DNS cache size growth over time

• Identifying cache cleanup frequency and patterns

• Understanding stale data usage in production environments

• Troubleshooting DNS-related performance issues in microservices architectures

• Observing cache behavior during DNS server failures or network partitions


Prototype/PR


• A preliminary PR is available for context and discussion:
    • https://git.openjdk.org/jdk/pull/28110

• I will update the design/implementation per feedback from this thread.


Thanks in advance for your feedback!
Best regards,
NeayGuyCoding
