Pedro Mázala created FLINK-34201:
------------------------------------
Summary: Datadog name resolution fails and is not retried, causing
metrics to not get exported
Key: FLINK-34201
URL: https://issues.apache.org/jira/browse/FLINK-34201
Project: Flink
Issue Type: Bug
Components: Runtime / Metrics
Affects Versions: 1.17.2
Environment:
{code:java}
metrics.reporters: dghttp
metrics.reporter.dghttp.factory.class: org.apache.flink.metrics.datadog.DatadogHttpReporterFactory
metrics.reporter.dghttp.apikey: {{ required "A valid .Values.ddApiKey entry required!" .Values.ddApiKey }}
metrics.reporter.dghttp.dataCenter: US
metrics.reporter.dghttp.maxMetricsPerRequest: "500"
metrics.reporter.dghttp.useLogicalIdentifier: "true"
{code}
Reporter: Pedro Mázala
When node restarts happen on k8s, some deployments fail to report metrics to
Datadog.
At first, I thought it could be related to a timeout, so I capped each request
at 500 metrics (maxMetricsPerRequest). But then I found this exception:
{code:java}
java.lang.IllegalStateException: Failed contacting Datadog to validate API key
    at org.apache.flink.metrics.datadog.DatadogHttpClient.validateApiKey(DatadogHttpClient.java:106) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at org.apache.flink.metrics.datadog.DatadogHttpClient.<init>(DatadogHttpClient.java:86) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at org.apache.flink.metrics.datadog.DatadogHttpReporter.<init>(DatadogHttpReporter.java:75) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at org.apache.flink.metrics.datadog.DatadogHttpReporterFactory.createMetricReporter(DatadogHttpReporterFactory.java:59) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.metrics.ReporterSetup.loadViaFactory(ReporterSetup.java:418) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.metrics.ReporterSetup.loadViaFactory(ReporterSetup.java:408) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.metrics.ReporterSetup.loadReporter(ReporterSetup.java:372) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.metrics.ReporterSetup.setupReporters(ReporterSetup.java:326) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.metrics.ReporterSetup.fromConfiguration(ReporterSetup.java:207) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManagerRunnerServices(TaskManagerRunner.java:224) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.start(TaskManagerRunner.java:293) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:486) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$5(TaskManagerRunner.java:530) ~[flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:530) [flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:510) [flink-dist-1.17.2.jar:1.17.2]
    at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:468) [flink-dist-1.17.2.jar:1.17.2]
Caused by: java.net.UnknownHostException: app.datadoghq.com: Temporary failure in name resolution
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
    at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(Unknown Source) ~[?:?]
    at java.net.InetAddress.getAddressesFromNameService(Unknown Source) ~[?:?]
    at java.net.InetAddress$NameServiceAddresses.get(Unknown Source) ~[?:?]
    at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
    at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
    at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
    at okhttp3.Dns.lambda$static$0(Dns.java:39) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:135) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:84) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.java:187) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.java:108) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.java:88) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.connection.Transmitter.newExchange(Transmitter.java:169) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:41) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:94) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:88) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:229) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at okhttp3.RealCall.execute(RealCall.java:81) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
    at org.apache.flink.metrics.datadog.DatadogHttpClient.validateApiKey(DatadogHttpClient.java:101) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
{code}
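The trace shows validateApiKey being called from the DatadogHttpClient
constructor while the TaskManager sets up its reporters, so a single transient
UnknownHostException at startup is enough to make reporter creation fail.
One possible mitigation is to retry the validation with a backoff. The sketch
below is only an illustration under assumptions: the class RetryingValidation,
the helper retryWithBackoff, the attempt count and the delays are all
hypothetical and not part of Flink, and the lambda merely stands in for the
real validation call.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Flink: retries an action on DNS failures.
final class RetryingValidation {

    /**
     * Runs the action up to maxAttempts times. Only UnknownHostException is
     * retried; the wait doubles after each failed attempt.
     */
    public static <T> T retryWithBackoff(Callable<T> action, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs); // wait before the next attempt
                    delayMs *= 2;          // exponential backoff
                }
            }
        }
        // Same message as in the trace above, with the last DNS error as cause.
        throw new IllegalStateException(
                "Failed contacting Datadog to validate API key after " + maxAttempts + " attempts",
                last);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for the real validation: fails twice with a DNS error, then succeeds.
        Callable<Boolean> flakyValidation = () -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException(
                        "app.datadoghq.com: Temporary failure in name resolution");
            }
            return Boolean.TRUE;
        };
        boolean ok = retryWithBackoff(flakyValidation, 5, 10L);
        System.out.println("validated=" + ok + " after " + calls[0] + " attempts");
    }
}
```

Whether such a retry belongs in the reporter constructor, or whether validation
should be deferred to the first report, is an open design question; the sketch
only shows the retry loop itself.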