[ 
https://issues.apache.org/jira/browse/FLINK-34201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899618#comment-17899618
 ] 

Josh Tan edited comment on FLINK-34201 at 11/20/24 2:36 AM:
------------------------------------------------------------

Facing the same issue during Kubernetes node restarts.


was (Author: JIRAUSER303018):
Facing the same issue. It seems to happen occasionally during new deployments, 
when Datadog hostname resolution fails.

> Datadog name resolution fails and is not retried, causing metrics not to be 
> exported
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-34201
>                 URL: https://issues.apache.org/jira/browse/FLINK-34201
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.17.2
>         Environment: {code:java}
>     metrics.reporters: dghttp
>     metrics.reporter.dghttp.factory.class: org.apache.flink.metrics.datadog.DatadogHttpReporterFactory
>     metrics.reporter.dghttp.apikey: {{ required "A valid .Values.ddApiKey entry required!" .Values.ddApiKey }}
>     metrics.reporter.dghttp.dataCenter: US
>     metrics.reporter.dghttp.maxMetricsPerRequest: "500"
>     metrics.reporter.dghttp.useLogicalIdentifier: "true"
> {code}
>            Reporter: Pedro Mázala
>            Priority: Minor
>
> When node restarts happen on k8s, some deployments fail to report metrics to 
> Datadog.
> At first, I thought it could be related to a timeout, so I added a cap of 
> 500 metrics per request. But then I found this exception:
> {code:java}
> java.lang.IllegalStateException: Failed contacting Datadog to validate API key
>       at org.apache.flink.metrics.datadog.DatadogHttpClient.validateApiKey(DatadogHttpClient.java:106) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at org.apache.flink.metrics.datadog.DatadogHttpClient.<init>(DatadogHttpClient.java:86) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at org.apache.flink.metrics.datadog.DatadogHttpReporter.<init>(DatadogHttpReporter.java:75) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at org.apache.flink.metrics.datadog.DatadogHttpReporterFactory.createMetricReporter(DatadogHttpReporterFactory.java:59) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.metrics.ReporterSetup.loadViaFactory(ReporterSetup.java:418) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.metrics.ReporterSetup.loadViaFactory(ReporterSetup.java:408) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.metrics.ReporterSetup.loadReporter(ReporterSetup.java:372) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.metrics.ReporterSetup.setupReporters(ReporterSetup.java:326) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.metrics.ReporterSetup.fromConfiguration(ReporterSetup.java:207) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManagerRunnerServices(TaskManagerRunner.java:224) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.start(TaskManagerRunner.java:293) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:486) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$5(TaskManagerRunner.java:530) ~[flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:530) [flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:510) [flink-dist-1.17.2.jar:1.17.2]
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:468) [flink-dist-1.17.2.jar:1.17.2]
> Caused by: java.net.UnknownHostException: app.datadoghq.com: Temporary failure in name resolution
>       at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
>       at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(Unknown Source) ~[?:?]
>       at java.net.InetAddress.getAddressesFromNameService(Unknown Source) ~[?:?]
>       at java.net.InetAddress$NameServiceAddresses.get(Unknown Source) ~[?:?]
>       at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
>       at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
>       at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
>       at okhttp3.Dns.lambda$static$0(Dns.java:39) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:135) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:84) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.java:187) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.java:108) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.java:88) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.connection.Transmitter.newExchange(Transmitter.java:169) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:41) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:94) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:88) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:229) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at okhttp3.RealCall.execute(RealCall.java:81) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
>       at org.apache.flink.metrics.datadog.DatadogHttpClient.validateApiKey(DatadogHttpClient.java:101) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
> {code}
> There is no retry mechanism 
> [here|https://github.com/apache/flink/blob/f9f9299f6e25080c6f869b46ec0bdc5e3e19e00d/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L98-L108]
> {code:java}
>     private void validateApiKey() {
>         Request r = new Request.Builder().url(validateUrl).get().build();
>         try (Response response = client.newCall(r).execute()) {
>             if (!response.isSuccessful()) {
>                 throw new IllegalArgumentException(String.format("API key: %s is invalid", apiKey));
>             }
>         } catch (IOException e) {
>             throw new IllegalStateException("Failed contacting Datadog to validate API key", e);
>         }
>     }
> {code}
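
For illustration only, here is a minimal sketch of what a retry around the validation call could look like. It is not taken from Flink: the class RetrySketch, the helper callWithRetry, and the attempt count and backoff values are all hypothetical, and the HTTP call is replaced by a stand-in that fails twice with UnknownHostException before succeeding.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

/** Hypothetical sketch, not Flink code: retry an I/O call with exponential backoff. */
final class RetrySketch {

    /**
     * Runs {@code call} up to {@code maxAttempts} times, retrying only on IOException
     * (which includes UnknownHostException) and doubling the wait between attempts.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMillis = initialBackoffMillis;
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis); // wait before the next attempt
                    backoffMillis *= 2;
                }
            }
        }
        // Same message as the quoted validateApiKey(), raised only once retries are exhausted.
        throw new IllegalStateException(
                "Failed contacting Datadog to validate API key", lastFailure);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for the real HTTP request: the first two attempts fail on DNS.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException(
                        "app.datadoghq.com: Temporary failure in name resolution");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Only IOException is retried here, so an invalid API key (the IllegalArgumentException in the quoted method) would still fail immediately. Whether blocking reporter startup during the backoff is acceptable is a design question for the actual fix.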



--
This message was sent by Atlassian Jira
(v8.20.10#820010)