[ 
https://issues.apache.org/jira/browse/FLINK-32242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu DESPRIEE updated FLINK-32242:
-------------------------------------
    Description: 
We're running a relatively small Flink cluster (7 task managers, 8 cores) and 
are using Datadog for telemetry.

The numbers for outgoing traffic across Kafka producers, task activity, 
and host system metrics didn't add up. After investigation, we discovered that 
the unexplained traffic was generated by the DatadogHttpReporter.

We switched the reporter to an implementation using the Java DogStatsD client 
(reporting to a Datadog agent on each host).

Here is the outgoing traffic measured at a NAT gateway between the cluster and 
the outside world, before and after this change (all other things being 
equal):

!image-2023-06-01-17-56-50-809.png!

We're talking about 850 MB in 5 min, so roughly 10 GB/h of overhead here. That 
kind of traffic is not free on AWS...
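
For reference, the extrapolation behind that figure can be checked with a short sketch. It assumes the 5-minute window is representative of steady-state traffic and uses decimal units (1 GB = 1000 MB); the function name is only illustrative.

```python
# Back-of-the-envelope extrapolation of the traffic figure quoted above.
MB_PER_WINDOW = 850    # from the report: about 850 MB observed
WINDOW_MINUTES = 5     # over a 5-minute window


def hourly_gb(mb_per_window: float, window_minutes: float) -> float:
    """Extrapolate a per-window volume to GB per hour (1 GB = 1000 MB)."""
    windows_per_hour = 60 / window_minutes
    return mb_per_window * windows_per_hour / 1000


gb_per_hour = hourly_gb(MB_PER_WINDOW, WINDOW_MINUTES)
gb_per_day = gb_per_hour * 24  # assumes the rate stays constant all day

print(f"{gb_per_hour:.1f} GB/h, {gb_per_day:.0f} GB/day")
```

Multiplying the daily volume by the applicable per-GB data transfer price gives the cost; no price is assumed here.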

Here is the change in {{flink.taskmanager.Status.JVM.CPU.Load}} (over the 
whole cluster):

!image-2023-06-01-17-54-45-900.png!

Reporting telemetry in JSON over HTTP has a *HUGE* overhead.
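
To give a rough feel for the encoding difference, here is a minimal sketch that serializes one gauge both ways. It assumes the HTTP body is shaped like Datadog's v1 "series" submission format and uses the DogStatsD datagram format for the other; the host, tags, timestamp and value are made-up sample data. It does not reproduce the measurements above, and it ignores HTTP headers, TLS, and the fact that a DogStatsD datagram only travels to the local agent.

```python
import json

# Sample gauge; the metric name comes from the report, everything else is hypothetical.
metric = "flink.taskmanager.Status.JVM.CPU.Load"
value = 0.42
timestamp = 1685635200
host = "tm-1"
tags = ["job:demo", "tm_id:tm-1"]

# JSON body in the shape of a "series" submission (assumed format for an HTTP reporter).
json_body = json.dumps({
    "series": [{
        "metric": metric,
        "points": [[timestamp, value]],
        "type": "gauge",
        "host": host,
        "tags": tags,
    }]
}).encode("utf-8")

# DogStatsD datagram: <name>:<value>|<type>|#<tag1>,<tag2>
datagram = f"{metric}:{value}|g|#{','.join(tags)}".encode("utf-8")

print(f"JSON body: {len(json_body)} bytes, DogStatsD datagram: {len(datagram)} bytes")
```

The per-metric byte count is only part of the picture: with the agent-based setup, the raw metric stream stays on the host instead of crossing the NAT gateway.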

So I would strongly advocate deprecating this reporter and recommending that 
users switch to a DogStatsD-based implementation. One exists 
(https://github.com/aroch/flink-metrics-dogstatsd, not tested). On our side, 
we developed our own, which we can share if requested.

 

 


> Datadog HTTP Reporter produces a huge outgoing traffic and CPU overhead
> -----------------------------------------------------------------------
>
>                 Key: FLINK-32242
>                 URL: https://issues.apache.org/jira/browse/FLINK-32242
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.15.2
>         Environment: Flink 1.15.2, AWS EMR.
>            Reporter: Mathieu DESPRIEE
>            Priority: Minor
>         Attachments: image-2023-06-01-17-54-45-900.png, 
> image-2023-06-01-17-56-50-809.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
