[PR] A new Control Group Metric for Samza [samza]

via GitHub Mon, 03 Jun 2024 12:30:37 -0700


li-afaris opened a new pull request, #1699:
URL: https://github.com/apache/samza/pull/1699

# Introduction

Hadoop clusters can restrict CPU usage for Samza applications by using
Control Groups (Cgroups).
Before CPU enforcement is enabled on Hadoop clusters, application owners must
have a way of knowing when their application is being throttled by Cgroups.
This PR adds a new Cgroup metric that tells application owners whether a
container's CPU usage is being throttled by control groups and whether the
application needs to request additional resources.

# Implementation

The Linux kernel reports when applications within a Cgroup have been
throttled by writing values to a file named cpu.stat. cpu.stat includes two
fields relevant here: nr_periods and nr_throttled. nr_periods represents the
number of enforcement periods that have elapsed. nr_throttled represents the
number of times the group has been throttled. We can treat these fields as a
ratio, nr_throttled / nr_periods, that shows how often the applications have
been throttled over a number of enforcement periods. The proposal is to have
the running container locate the cpu.stat file by reading property values
from Hadoop's YARN config.
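
As a rough illustration of the ratio described above, here is a minimal Java sketch. It is not the code in this PR: the class name `CpuStatSketch` and both method names are hypothetical, and it assumes cpu.stat holds one whitespace-separated `key value` pair per line. Locating the file through the YARN config is left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Samza: parses cpu.stat-style lines.
final class CpuStatSketch {

  /** Parses "key value" lines into a map; lines with another shape are skipped. */
  public static Map<String, Long> parse(List<String> lines) {
    Map<String, Long> fields = new HashMap<>();
    for (String line : lines) {
      String[] parts = line.trim().split("\\s+");
      if (parts.length == 2) {
        fields.put(parts[0], Long.parseLong(parts[1]));
      }
    }
    return fields;
  }

  /** Returns nr_throttled / nr_periods, or 0.0 when no periods have elapsed. */
  public static double throttleRatio(Map<String, Long> fields) {
    long periods = fields.getOrDefault("nr_periods", 0L);
    long throttled = fields.getOrDefault("nr_throttled", 0L);
    return periods == 0 ? 0.0 : (double) throttled / periods;
  }
}
```

For example, a file containing `nr_periods 200` and `nr_throttled 50` would give a ratio of 0.25 under this sketch.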

## Implementation details

* To limit high cardinality in the metrics storage layer, instead of using
the Hadoop YARN container ID, the metric will emit the Samza container ID as
the hostname (e.g., Container 3). This is already supported by the existing
metrics framework within Samza.
* The container will emit a float value between 0 and 1 as a gauge
metric. A value of 0 means the Cgroup was not throttled for that period of
time. A value of 1 means the Cgroup was unable to complete any work as it was
persistently throttled.
* To stay consistent with existing metrics, a negative value will be emitted
if an exception is thrown when reading the cpu.stat file. Exceptions when
reading cpu.stat will be logged to the container logs.
* This implementation will be specific to Samza on Hadoop. No metric will
be emitted from applications using Samza as an embedded library. The reasoning
is that the application itself should emit this metric, not the embedded
library.
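
The gauge behaviour in the list above could look roughly like the following Java sketch. Everything in it is an assumption for illustration: `ThrottleGaugeSketch` and `StatSource` are hypothetical names, the value is computed from the change in the counters since the previous sample (one reading of "for that period of time"), and -1.0 stands in for the negative value emitted on failure. The wiring into Samza's metrics framework is not shown.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sampler, not part of Samza.
final class ThrottleGaugeSketch {

  /** Hypothetical source of cpu.stat lines; a real one might wrap Files.readAllLines. */
  interface StatSource {
    List<String> readLines() throws IOException;
  }

  private final StatSource source;
  private long lastPeriods = 0;
  private long lastThrottled = 0;

  public ThrottleGaugeSketch(StatSource source) {
    this.source = source;
  }

  /** Returns the throttled fraction since the previous call, or -1.0 on failure. */
  public double sample() {
    try {
      long periods = 0;
      long throttled = 0;
      for (String line : source.readLines()) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
          continue;
        }
        if (parts[0].equals("nr_periods")) {
          periods = Long.parseLong(parts[1]);
        } else if (parts[0].equals("nr_throttled")) {
          throttled = Long.parseLong(parts[1]);
        }
      }
      long deltaPeriods = periods - lastPeriods;
      long deltaThrottled = throttled - lastThrottled;
      lastPeriods = periods;
      lastThrottled = throttled;
      if (deltaPeriods <= 0) {
        return 0.0; // no enforcement periods elapsed since the last sample
      }
      // Clamp to [0, 1] so the gauge stays within the documented range.
      return Math.min(1.0, Math.max(0.0, (double) deltaThrottled / deltaPeriods));
    } catch (IOException | NumberFormatException e) {
      // Log the failure and emit a negative value.
      System.err.println("Failed to read cpu.stat: " + e);
      return -1.0;
    }
  }
}
```

Using the delta between samples rather than the lifetime totals keeps an old burst of throttling from masking current behaviour; whether the PR does this is not stated in the description.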

## Considered Alternatives

I’m unaware of alternatives, but reading values from cpu.stat is a pattern
that appears in the runc project. runc is the low-level container runtime
used by containerd, which in turn is used by both Docker and Kubernetes.

The metric needs to be emitted from the Samza container itself. Using a
system daemon or sidecar application complicates deployments & creates data
consistency issues when the sidecar process isn’t running.

# External references

* Linux [kernel documentation](https://github.com/torvalds/linux/blob/2bfcfd584ff5ccc8bb7acde19b42570414bf880b/Documentation/scheduler/sched-bwc.rst?plain=1#L131-L132) on the cpu.stat file
* cpu.stat references from the Open Container Initiative [runc project](https://github.com/search?q=repo%3Aopencontainers%2Frunc%20cpu.stat&type=code)

