On Wed, 31 Jan 2024 15:33:01 GMT, Jan Kratochvil <jkratoch...@openjdk.org> 
wrote:

>> It looks like this patch needs a bit more discussion. Generally, it would be 
>> good to have the functionality, but I'm unsure about the proposed 
>> implementation.
>> 
>> Walking the directory hierarchy (**and** interface files) on every call of 
>> `physical_memory()` (or `active_processor_count()` for CPU) seems 
>> concerning. Since it seems unlikely for cgroup interface file paths changing 
>> during runtime, walking the hierarchy for every look-up seems overkill. Note 
>> that the prime users of this feature are in-container use, where most 
>> runtimes have the limit settings at the leaf. A few more comments below:
>> 
>> - After this patch, AFAIUI for a cgroup path like `/foo/bar/baz/memory.max` 
>> the following files are being looked at for the memory limit (for example 
>> for the interface file `memory.max`):
>>   ```
>>    /foo/bar/baz/memory.max
>>    /foo/bar/memory.max
>>    /foo/memory.max
>>   ```
>>   This used to be just one file at the leaf, `/foo/bar/baz/memory.max` 
>> (prior this patch). So this change has an effect on file IO. `N` files per 
>> metric to look at where it used to be one file (i.e. constant). Note that 
>> switches like `UseDynamicNumberOfCompilerThreads` make this particularly 
>> worrying.  I wonder if it would be sufficient to "lock" into the cgroup path 
>> where the lowest  limits are set in the hierarchy and only use the interface 
>> files at that path of a specific controller. This would mean adjusting 
>> `CgroupsV2Subsystem` similar to `CgroupsV1Subsystem` to have unique 
>> controller paths (i.e. `cpu` and `memory` controller paths could potentially 
>> be different). But the code would already be mostly there for this. The idea 
>> would be to do the initial walk of the tree at JVM startup, and from then 
>> on, only look at the path with a limit set (if any) for subsequent calls. 
>> That is `controller->subsystem_path()` would return the correct path at 
>> runtime when initialization is done
 . Thoughts?
>> - There seems to be an inconsistency between cgroups v1 (no recursive 
>> look-up) and cgroups v2  (recursive look-up). Why? I think we should employ 
>> the same semantics to both  versions (and drop the hierarchical work-around 
>> - [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338) - for cg 
>> v1).
>> - There also seems to be an inconsistency between metrics being iterated. 
>> `memory.max` and `memory.swap.max`  and `memory.swap.current` are being 
>> iterated, others, like CPU quotas (`cpu.max` or
>>   `cpu.weight`) not. Why? Both limits could potentially be one path up the 
>> hierarchy, meaning that cp...
>
>> Walking the directory hierarchy (**and** interface files) on every call of 
>> `physical_memory()` (or `active_processor_count()` for CPU) seems concerning.
> 
> I have therefore implemented a hierarchical `memory.stat` extension for 
> cgroup2 which already exists for cgroup1 - if the patch is agreed upon I will 
> submit it to Linux kernel: 
> [cgroup2-hierarchical_memory_limit.patch.txt](https://github.com/openjdk/jdk/files/14113452/cgroup2-hierarchical_memory_limit.patch.txt)
> 
>> Since it seems unlikely for cgroup interface file paths changing during 
>> runtime, walking the hierarchy for every look-up seems overkill. Note that 
>> the prime users of this feature are in-container use, where most runtimes 
>> have the limit settings at the leaf.
> 
> While I understand the most common change of the limits is at the leaf 
> technically any parent group limit can change. Which is also what this patch 
> is implementing. For the performance issue I have implemented the Linux 
> kernel extension above.
> 
>> `N` files per metric to look at where it used to be one file (i.e. constant).
> 
> The problem is any of the files can change. But to fix the performance I have 
> implemented the Linux kernel patch above.
> 
>> I wonder if it would be sufficient to "lock" into the cgroup path where the 
>> lowest  limits are set in the hierarchy and only use the interface files at 
>> that path of a specific controller.
> 
> That mostly disables the functionality of this patch.
> 
> 
>>     * There seems to be an inconsistency between cgroups v1 (no recursive 
>> look-up) and cgroups v2  (recursive look-up). Why? I think we should employ 
>> the same semantics to both  versions (and drop the hierarchical work-around 
>> - [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338) - for cg 
>> v1).
> 
> I did not much expect anyone still uses cgroup1. Plus customer was also 
> interested in cgroup2. AFAIK a last OS using cgroup1 was RHEL-7 which will 
> have EOL in a few months. But then I tried to implement cgroup1 but there it 
> sort of already worked thanks for your 
> [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338). I just fixed 
> there to prefer the hierarchical limit over the leaf limit (and the testcase 
> does test it now) as that is what is used by Linux kernel.
> 
> And then my Linux kernel patch implements the same `memory.stat` way even for 
> cgroup2.
> 
>>     * There also seems to be an inconsistency between metrics being 
>> iterated. `memory.max` and `memory.swap.max`  and `memory.swap.current` are 
>> being iterated, others, like CPU quotas (`cpu.max` or...

> @jankratochvil The "use hierarchical limit" was a work-around that bites us 
> now.

What is the current problem? I see one problem of the cgroup1 hierarchical 
implementation in OpenJDK which is fixed by this patch incl. a testcase.


> So instead we should attempt to unify cgv1 and cgv2 by walking the hierarchy

When you are concerned about the performance I have rather implemented the 
hierarchical interface also for Linux kernel.
  https://lore.kernel.org/linux-mm/zcvlhoz4vbex9...@host1.jankratochvil.net/T/
It has been welcomed by SuSE:
  
https://lore.kernel.org/linux-mm/8f8d5a2d-dde3-42e5-9988-fab042666...@redhat.com/T/#m1238300cf818badebed0183f1b5ab798bf1f2e9f


> and find the place in the hierarchy where the interface files - with actual 
> limits imposed - are to be found.

That is being done if the patch above is not found on the running kernel.


> Since this patch attempts to solve a similar goal to what the old work-around 
> solved for cg v1, we should re-think how to solve it properly. My suggestion 
> would be to go with walking the hierarchy for both: cg v1 and cg v2.

I hope I have fixed the same problem in a more performance wise way.


> As to the walking the hierarchy on every invocation problem, I don't think 
> fixing it by relying on a kernel patch is the way to go.

Why not to support both ways. Time flies like wind.


> It'll take time for the ecosystem to pick those up and existing cg v2 systems 
> won't be fixed (consider RHEL 8 or RHEL 9 systems and their derivatives).

Particuarly for RHEL it depends whether any of the paying customers is 
interested in the performance wise solution (I doubt so as the performance 
difference is small). Backport for a hotfix delivery for the specific customer 
will be done in a few days. Included in the next RHEL y-release it will be in a 
few months (and also released then in CentOS).


> Therefore, the idea would be to do the iteration **once** when the 
> `OSContainer` code initializes.

Done that in this patch. I think it could also occasionally re-read the 
hierarchical structure but that hasn't been implemented.

I expect I would remove the "memory.max.effective" dependency from this patch 
for the OpenJDK commit as the kernel patch has not been approved yet and so the 
kernel interface ("memory.max.effective") still may change.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17198#issuecomment-1954420583

Reply via email to