Mathieu DESPRIEE created FLINK-32242:
----------------------------------------
Summary: Datadog HTTP Reporter produces a huge outgoing traffic
and CPU overhead
Key: FLINK-32242
URL: https://issues.apache.org/jira/browse/FLINK-32242
Project: Flink
Issue Type: Bug
Components: Runtime / Metrics
Affects Versions: 1.15.2
Environment: Flink 1.15.2, AWS EMR.
Reporter: Mathieu DESPRIEE
Attachments: image-2023-06-01-17-42-56-305.png,
image-2023-06-01-17-54-45-900.png, image-2023-06-01-17-56-50-809.png
We're running a relatively small Flink cluster (7 task managers, 8 cores) and
are using Datadog for telemetry.
The outgoing traffic numbers didn't add up across Kafka producers, task
activity, and host system metrics. After investigation, we discovered that
the unexplained traffic was generated by the DatadogHttpReporter.
We switched the reporter to an implementation using the Java DogStatsD client
(reporting to a Datadog agent on each host).
Here is the outgoing traffic measured at a NAT gateway between the cluster and
the outside world, before and after this change (all other things being
equal):
!image-2023-06-01-17-56-50-809.png!
We're talking about 850 MB in 5 min, so roughly 10 GB/h of overhead here. That
kind of traffic is not free on AWS...
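As a quick check of that extrapolation, here is a minimal sketch in Java. The class and method names are hypothetical, and it assumes decimal units (1 GB = 1000 MB) and a constant rate over the hour:

```java
// Hypothetical helper, not part of Flink: extrapolates a traffic sample to an hourly rate.
class TrafficEstimate {

    // Converts "megabytes observed in a window of N minutes" to gigabytes per hour,
    // assuming a constant rate and decimal units (1 GB = 1000 MB).
    static double gbPerHour(double megabytes, double windowMinutes) {
        double mbPerHour = megabytes / windowMinutes * 60.0;
        return mbPerHour / 1000.0;
    }

    public static void main(String[] args) {
        // 850 MB in 5 minutes is the figure read off the NAT gateway graph.
        System.out.println(gbPerHour(850, 5) + " GB/h");
    }
}
```

With the figures above this gives about 10.2 GB/h, which is where the "roughly 10 GB/h" comes from.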
Here is the change in `flink.taskmanager.Status.JVM.CPU.Load` (over the whole
cluster):
!image-2023-06-01-17-54-45-900.png!
Reporting telemetry as JSON over HTTP has a *HUGE* overhead.
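To give a feel for the encoding side of that overhead, here is a rough sketch that compares the byte size of a single gauge sample written as a Datadog-style JSON series entry and as a DogStatsD datagram line. It is illustrative only: the class name, host name, tag, timestamp, and value are hypothetical, and it does not account for HTTP headers, TLS, batching, or compression, so it does not reproduce the traffic measured above.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: compares the encoded size of one gauge sample in two wire formats.
class PayloadSize {

    // One gauge wrapped in a Datadog-style JSON "series" request body.
    static String jsonSeries(String metric, double value, long epochSeconds,
                             String host, String tag) {
        return "{\"series\":[{\"metric\":\"" + metric + "\","
                + "\"points\":[[" + epochSeconds + "," + value + "]],"
                + "\"type\":\"gauge\","
                + "\"host\":\"" + host + "\","
                + "\"tags\":[\"" + tag + "\"]}]}";
    }

    // The same gauge as a DogStatsD datagram: <name>:<value>|g|#<tag>
    static String dogStatsdLine(String metric, double value, String tag) {
        return metric + ":" + value + "|g|#" + tag;
    }

    // Size of the string once encoded as UTF-8 bytes.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String metric = "flink.taskmanager.Status.JVM.CPU.Load";
        // Hypothetical sample values, chosen only for the comparison.
        String json = jsonSeries(metric, 0.25, 1685634000L, "host-1", "tm_id:1");
        String line = dogStatsdLine(metric, 0.25, "tm_id:1");
        System.out.println("JSON:      " + utf8Length(json) + " bytes");
        System.out.println("DogStatsD: " + utf8Length(line) + " bytes");
    }
}
```

The per-sample difference is only part of the picture. The HTTP reporter also pays for request framing and a direct connection to the remote API, whereas the DogStatsD client sends to an agent on the same host.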
So I would strongly advocate deprecating this reporter and recommending that
users switch to a DogStatsD-based implementation. One exists
([https://github.com/aroch/flink-metrics-dogstatsd], not tested). On our side,
we developed our own, which we can share if requested.