li-afaris opened a new pull request, #1699:
URL: https://github.com/apache/samza/pull/1699

   # Introduction
   
   Hadoop clusters can restrict CPU usage for Samza applications by using 
control groups (Cgroups).
   Before CPU enforcement is enabled on Hadoop clusters, application owners 
need a way to know when their application is being throttled by Cgroups.  
This PR adds a new Cgroup metric that tells application owners whether a 
container's CPU usage is being throttled by control groups and whether the 
application needs to request additional resources.
   
   # Implementation
   
   The Linux kernel reports when applications within a Cgroup have been 
throttled by writing values to a file named cpu.stat.  cpu.stat contains two 
fields of interest, nr_periods and nr_throttled.  nr_periods is the number of 
enforcement periods that have elapsed.  nr_throttled is the number of times 
the group has been throttled.  Dividing nr_throttled by nr_periods gives a 
ratio that shows how often the applications were throttled over a number of 
enforcement periods.  The proposal is to have the running container locate the 
cpu.stat file by reading property values from Hadoop's YARN config.
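
The ratio described above can be sketched as follows. This is a minimal illustration rather than the code in this PR: the class and method names (`CpuStatRatio`, `parse`, `throttledRatio`) are hypothetical, and it assumes cpu.stat consists of whitespace-separated `name value` lines, as described in the kernel documentation linked under External references.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Samza: computes nr_throttled / nr_periods.
public class CpuStatRatio {

    /** Parses "name value" lines into a map; lines that do not match are skipped. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> fields = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue;
            }
            try {
                fields.put(parts[0], Long.parseLong(parts[1]));
            } catch (NumberFormatException e) {
                // Ignore lines whose value is not an integer.
            }
        }
        return fields;
    }

    /** Returns the throttled fraction, or 0.0 when no enforcement period has elapsed. */
    static double throttledRatio(List<String> lines) {
        Map<String, Long> fields = parse(lines);
        long periods = fields.getOrDefault("nr_periods", 0L);
        long throttled = fields.getOrDefault("nr_throttled", 0L);
        if (periods <= 0) {
            return 0.0; // Avoid dividing by zero before the first period completes.
        }
        return (double) throttled / periods;
    }

    public static void main(String[] args) {
        // Made-up sample contents, for illustration only.
        List<String> sample = Arrays.asList(
                "nr_periods 200",
                "nr_throttled 50",
                "throttled_time 123456789");
        System.out.println(throttledRatio(sample)); // prints 0.25
    }
}
```

Note that the kernel counters are cumulative, so a ratio computed this way covers the lifetime of the Cgroup. A per-interval value would need the difference between two successive readings; which of the two the PR uses is not stated here.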
   
   ## Implementation details
   
   * To limit cardinality in the metrics storage layer, the metric will emit 
the Samza container ID as the hostname (e.g., Container 3) instead of the 
Hadoop YARN container ID.  This is already supported by the existing 
metrics framework within Samza.
   * The container will emit a float value between zero and 1 as a gauge 
metric.  A zero value means the Cgroup was not throttled for that period of 
time.  A value of 1 means the Cgroup was unable to complete any work as it was 
persistently throttled.
   * To stay consistent with existing metrics, a negative value will be emitted 
if an exception is thrown when reading the cpu.stat file.  Exceptions when 
reading cpu.stat will be logged to the container logs.
   * This implementation will be specific to Samza on Hadoop.  No metric will 
be emitted from applications using Samza as an embedded library.  The reasoning 
is that the application itself, not the embedded library, should emit this 
metric.
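
The gauge behavior in the list above (a value between 0 and 1, and a negative value when cpu.stat cannot be read) might look roughly like this. It is a sketch under stated assumptions, not the PR's implementation: `ThrottleGaugeSketch`, `sample`, and the sentinel `-1.0` are hypothetical, the caller is assumed to already know the cpu.stat path, and a real container would log through its logging framework rather than `System.err`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not part of Samza: one gauge sample per call.
public class ThrottleGaugeSketch {

    // Assumed sentinel; the PR only says "a negative value".
    static final double ERROR_VALUE = -1.0;

    /** Returns a value in [0, 1], or ERROR_VALUE if cpu.stat cannot be read or parsed. */
    static double sample(Path cpuStat) {
        try {
            long periods = 0;
            long throttled = 0;
            for (String line : Files.readAllLines(cpuStat)) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length != 2) {
                    continue;
                }
                if (parts[0].equals("nr_periods")) {
                    periods = Long.parseLong(parts[1]);
                } else if (parts[0].equals("nr_throttled")) {
                    throttled = Long.parseLong(parts[1]);
                }
            }
            if (periods <= 0) {
                return 0.0; // No enforcement period has elapsed yet.
            }
            // Clamp so the gauge never leaves the documented [0, 1] range.
            return Math.min(1.0, Math.max(0.0, (double) throttled / periods));
        } catch (IOException | NumberFormatException e) {
            // Stand-in for writing the exception to the container logs.
            System.err.println("Failed to read " + cpuStat + ": " + e);
            return ERROR_VALUE;
        }
    }
}
```

Returning a sentinel instead of rethrowing keeps the metric reporting even on hosts where the Cgroup file is missing, and the negative value makes such hosts easy to spot.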
   
   ## Considered Alternatives
   
   I’m unaware of alternatives, but reading values from cpu.stat is a pattern 
that appears in the runc project.  runc is the low-level runtime underlying 
containerd, which is used by both Docker and Kubernetes.
   
   The metric needs to be emitted from the Samza container itself.  Using a 
system daemon or sidecar application complicates deployments and creates data 
consistency issues when the sidecar process isn’t running.
   
   
   # External references
   
   * Linux [kernel 
documentation](https://github.com/torvalds/linux/blob/2bfcfd584ff5ccc8bb7acde19b42570414bf880b/Documentation/scheduler/sched-bwc.rst?plain=1#L131-L132)
 on the cpu.stat file
   * cpu.stat references from the Open Container Initiative [runc 
project](https://github.com/search?q=repo%3Aopencontainers%2Frunc%20cpu.stat&type=code).
   