On Fri, 26 Mar 2021 11:05:56 GMT, Severin Gehwolf <sgehw...@openjdk.org> wrote:

>> Does each getter call result in parsing /proc, or do things aggregated over 
>> several calls or hooks?
>> 
>> Do you have any data how expensive the invocations are? 
>> 
>> You could for example try to measure it by temporary making the events 
>> durational, and fetch the values between begin() and end(), and perhaps show 
>> a 'jfr print --events Container* recording.jfr' printout. 
>> 
>> If possible, it would be interesting to get some idea about the startup cost 
>> as well
>> 
>> If not too much overhead, I think it would be nice to skip the "flag" in the 
>> .jfcs, and always record the events in a container environment.
>> 
>> I know there is a way to test JFR using Docker, maybe @mseledts could 
>> provide information? Some sanity tests would be good to have.
>
>> Does each getter call result in parsing /proc, or do things aggregated over 
>> several calls or hooks?
> 
> From the looks of it, the event emitting code uses the `Metrics.java` 
> interface to retrieve the info. Each call to a method exposed by `Metrics` 
> results in file IO on some cgroup (v1 or v2) interface file(s) in 
> `/sys/fs/...`. I don't see any aggregation being done.
> 
> On the hotspot side, we implemented some caching for frequent calls 
> (JDK-8232207, JDK-8227006), but we didn't do that yet for the Java side since 
> there wasn't any need (so far). If calls are becoming frequent with this it 
> should be reconsidered.
> 
> So +1 on getting some data on what the perf penalty of this is.

Thanks to all for chiming in!

I have added the tests to 
`test/hotspot/jtreg/containers/docker/TestJFREvents.java` where there already 
were some templates for the container event data.

As for the performance - as expected, extracting the data from `/proc` is not 
exactly cheap. On my test c5.4xlarge instance, the average wall-clock time to 
generate the usage/throttling events (one instance of each) is ~15ms.
I would argue that 15ms per 30s (the default emission period for those events) 
is acceptable to start with.
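For reference, the shape of that measurement can be sketched roughly like this 
(the file path, fallback, and iteration count here are illustrative only, not 
the actual benchmark setup):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupReadTiming {

    // Average wall-clock milliseconds per read of the given file.
    static double avgReadMillis(Path file, int iterations) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Files.readAllLines(file); // each call is real file IO, no caching
        }
        return (System.nanoTime() - start) / 1_000_000.0 / iterations;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative path: one cgroup v2 interface file; fall back to a
        // temp file so the sketch still runs on hosts without cgroup v2.
        Path file = Path.of("/sys/fs/cgroup/memory.stat");
        if (!Files.isReadable(file)) {
            file = Files.createTempFile("fake-cgroup", ".stat");
            Files.writeString(file, "usage_usec 0\n");
        }
        System.out.printf("avg per read: %.3f ms%n", avgReadMillis(file, 100));
    }
}
```

A single interface file read is only part of the cost; the real events parse 
several such files, which is where the ~15ms figure comes from.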

Caching the parsed cgroups data would help only if the emission period is 
shorter than the cache TTL. Making matters worse, (almost) each container 
event type requires data from a different cgroups control file, so the data 
would not be shared between the event type instances even if cached. 
Realistically, caching benefits would become visible only for sub-second 
emission periods.
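To illustrate the point, a TTL-based cache in the metrics layer would look 
roughly like this (a hypothetical sketch, not the actual `Metrics` 
implementation; with a 30s emission period and any realistic sub-30s TTL, 
every emission would still miss the cache):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: caches one parsed value per control-file key,
// re-reading the underlying file once the TTL has expired.
public class TtlCache<V> {
    private record Entry<V>(V value, long expiresAtNanos) {}

    private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlNanos;

    public TtlCache(long ttlMillis) {
        this.ttlNanos = ttlMillis * 1_000_000L;
    }

    public V get(String key, Supplier<V> loader) {
        long now = System.nanoTime();
        Entry<V> e = entries.get(key);
        if (e == null || now >= e.expiresAtNanos) {
            // cache miss or stale entry: re-read the control file
            e = new Entry<>(loader.get(), now + ttlNanos);
            entries.put(key, e);
        }
        return e.value();
    }
}
```

Since each event type would use a different key (a different control file), 
the entries would not be shared across event types - the cache only collapses 
repeated reads of the *same* file within one TTL window.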

If the caching is still required I would suggest having a follow up ticket just 
for that - it will require setting up some benchmarks to justify the changes 
that would need to be done in the metrics implementation.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3126
