[
https://issues.apache.org/jira/browse/FLINK-32242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mathieu DESPRIEE updated FLINK-32242:
-------------------------------------
Description:
We're running a relatively small Flink cluster (7 task managers * 8 cores) and
are using Datadog for telemetry.
The numbers for outgoing traffic, between Kafka producers, task activity,
and host system metrics, didn't add up. After investigation, we discovered that
the unaccounted-for traffic was generated by the DatadogHttpReporter.
We switched the reporter to an implementation using the Java DogStatsD client
(reporting to a Datadog agent on each host).
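For readers unfamiliar with the protocol, the following is a minimal Python sketch of what a DogStatsD-style report looks like on the wire: one short plain-text datagram per metric, sent over UDP to an agent on the same host. It is not the implementation described in this report (which uses the Java client). The function names, the tag, and the agent address 127.0.0.1:8125 (the usual DogStatsD default) are assumptions for illustration.

```python
import socket


def format_gauge(name: str, value: float, tags: dict) -> bytes:
    """Encode one gauge as a DogStatsD datagram: name:value|g|#k:v,k:v"""
    datagram = f"{name}:{value}|g"
    if tags:
        # Sort tags so the output is deterministic.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        datagram += f"|#{tag_str}"
    return datagram.encode("utf-8")


def send_gauge(name, value, tags, agent=("127.0.0.1", 8125)) -> int:
    """Fire-and-forget UDP send to the local agent; returns bytes sent."""
    payload = format_gauge(name, value, tags)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, agent)
```

Because the datagram only travels to the agent on the same host, this step itself generates no traffic through the NAT gateway; the agent is then responsible for forwarding to Datadog.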
Here are some numbers of outgoing traffic taken at a NAT gateway, between the
cluster and the outside world. Before/after this change (all other things being
equal):
!image-2023-06-01-17-56-50-809.png!
We're talking about 850 MB in 5 minutes, so roughly 10 GB/h of overhead here.
That kind of traffic is not free on AWS...
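The extrapolation from the 5-minute sample to an hourly figure can be checked with a few lines of Python. This is only a sketch: it assumes decimal units (1 GB = 1000 MB) and that the rate stays constant, and the function name is made up for the example.

```python
def hourly_gb(megabytes: float, minutes: float) -> float:
    """Extrapolate a traffic sample to GB per hour (1 GB = 1000 MB)."""
    return megabytes / minutes * 60 / 1000


per_hour = hourly_gb(850, 5)   # 850 MB per 5 min -> 10.2 GB/h
per_day = per_hour * 24        # daily volume if the rate is sustained
print(f"{per_hour:.1f} GB/h, {per_day:.1f} GB/day")
```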
Here is the change on {{flink.taskmanager.Status.JVM.CPU.Load}} (over the
whole cluster):
!image-2023-06-01-17-54-45-900.png!
Reporting telemetry as JSON over HTTP has a *HUGE* overhead.
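As a rough illustration of the per-metric size difference, the Python sketch below encodes the same gauge once as a JSON series body and once as a DogStatsD text datagram, then compares the byte counts. The JSON layout is an assumed approximation of a series submission, and the tags, host name, timestamp, and value are hypothetical. It ignores HTTP headers, TLS, batching, and compression, so it does not reproduce the measurements in this report.

```python
import json


def json_series_body(name, value, ts, host, tags):
    """One gauge in the shape of a JSON series submission (assumed layout)."""
    series = {"metric": name, "points": [[ts, value]], "type": "gauge",
              "host": host, "tags": tags}
    return json.dumps({"series": [series]}).encode("utf-8")


def statsd_datagram(name, value, tags):
    """The same gauge as a DogStatsD text datagram."""
    return f"{name}:{value}|g|#{','.join(tags)}".encode("utf-8")


name = "flink.taskmanager.Status.JVM.CPU.Load"
tags = ["job:example", "tm:tm-1"]  # hypothetical tags
body = json_series_body(name, 0.42, 1685635010, "host-1", tags)
dgram = statsd_datagram(name, 0.42, tags)
print(len(body), len(dgram))  # JSON body size vs. datagram size, in bytes
```

The JSON body repeats field names and structure for every metric, while the datagram carries little beyond the name, value, and tags. The JSON encoding work on the task managers is also a plausible contributor to the CPU difference, though the sketch does not measure that.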
So I would strongly advocate deprecating this reporter and recommending that
users switch to a DogStatsD-based implementation. One already exists
([https://github.com/aroch/flink-metrics-dogstatsd], not tested). On our side,
we developed our own, which we can share if requested.
> Datadog HTTP Reporter produces a huge outgoing traffic and CPU overhead
> -----------------------------------------------------------------------
>
> Key: FLINK-32242
> URL: https://issues.apache.org/jira/browse/FLINK-32242
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics
> Affects Versions: 1.15.2
> Environment: Flink 1.15.2, AWS EMR.
> Reporter: Mathieu DESPRIEE
> Priority: Minor
> Attachments: image-2023-06-01-17-54-45-900.png,
> image-2023-06-01-17-56-50-809.png
--
This message was sent by Atlassian Jira
(v8.20.10#820010)