Re: Help with systemd/cgroup task limits in koji

2023-02-23 Thread Than Ngo


On 23.02.23 at 20:05, Kevin Fenzi wrote:
> On Thu, Feb 23, 2023 at 11:11:49AM +0100, Florian Weimer wrote:
>> * Giuseppe Scrivano:
>>
>>> Florian Weimer writes:
>>>> It could be an old kernel bug:
>>>>
>>>>   Task exit is signaled before task resource deallocation, leading to
>>>>   bogus EAGAIN errors
>>>>
>>>> There have been recent namespace optimizations which introduce a similar
>>>> pattern there.  While they improve throughput in many cases, continuous
>>>> allocation and deallocation can now fail, even though the program logic
>>>> ensures that resources are never exceeded.
>>>>
>>>> Giuseppe, any suggestions how to debug this?
>>>
>>> the only optimization I am aware of that could cause a similar issue is
>>> the delayed IPC namespace cleanup.  That would cause the IPC namespace
>>> creation to fail though, not posix_spawn.
>>>
>>> If you believe the failure can be related to reaching the pids limit for
>>> the cgroup, could you please check the actual limit inside the
>>> container?  You could check the value of /sys/fs/cgroup/pids.max inside
>>> the container (assuming cgroupv2 and a cgroup namespace for the container).
>>>
>>> Please let me know if that helps.
>>
>> (replying for the benefit of the list)
>
> Than: could you try some chromium builds with a cat of that value at
> various points? (ie, prep, build, etc?)


Hi Kevin

I tried a chromium build with a cat of that value. The value of
/sys/fs/cgroup/pids.max is *max* at %prep, %setup and %build.

The chromium build failed again with errors:

Error: spawn /usr/bin/node-19 EAGAIN
    at Process.ChildProcess._handle.onexit (node:internal/child_process:285:19)
    at onErrorNT (node:internal/child_process:483:16)
    at processTicksAndRejections (node:internal/process/task_queues:82:21)

[!] Error: unfinished hook action(s) on exit

https://kojipkgs.fedoraproject.org//work/tasks/2362/97912362/build.log
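
With pids.max reading max at every stage, the cgroup visible inside the
container does not by itself explain the EAGAIN. fork(2) documents several
other sources of that error: RLIMIT_NPROC, kernel.threads-max,
kernel.pid_max, and the pids controller of an enclosing cgroup. A minimal
Python sketch, assuming a Linux build environment with /proc mounted, that
records those values next to the cgroup one (the function names are
illustrative and not part of mock or koji):

```python
import resource
from pathlib import Path


def parse_limit(text):
    """Turn a kernel limit string into an int, or None for 'max' (unlimited)."""
    text = text.strip()
    return None if text == "max" else int(text)


def read_limit(path):
    """Read one limit file; return 'unavailable' if it cannot be read or parsed."""
    try:
        return parse_limit(Path(path).read_text())
    except (OSError, ValueError):
        return "unavailable"


def fork_limits(cgroup_dir="/sys/fs/cgroup"):
    """Collect the limits that fork(2) lists as possible EAGAIN causes."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "rlimit_nproc_soft": None if soft == resource.RLIM_INFINITY else soft,
        "rlimit_nproc_hard": None if hard == resource.RLIM_INFINITY else hard,
        "threads-max": read_limit("/proc/sys/kernel/threads-max"),
        "pid_max": read_limit("/proc/sys/kernel/pid_max"),
        "pids.max": read_limit(Path(cgroup_dir) / "pids.max"),
    }


if __name__ == "__main__":
    for name, value in fork_limits().items():
        print(f"{name}: {'unlimited' if value is None else value}")
```

Inside a cgroup namespace, the pids.max entry only reflects the cgroup the
container can see; limits set on ancestor cgroups on the host are not
visible from there.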

Than
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Help with systemd/cgroup task limits in koji

2023-02-23 Thread Kevin Fenzi
On Thu, Feb 23, 2023 at 11:11:49AM +0100, Florian Weimer wrote:
> * Giuseppe Scrivano:
> 
> > Florian Weimer  writes:
> >> It could be an old kernel bug:
> >>
> >>   Task exit is signaled before task resource deallocation, leading to
> >>   bogus EAGAIN errors
> >>   
> >>
> >> There have been recent namespace optimizations which introduce a similar
> >> pattern there.  While they improve throughput in many cases, continuous
> >> allocation and deallocation can now fail, even though the program logic
> >> ensures that resources are never exceeded.
> >>
> >> Giuseppe, any suggestions how to debug this?
> >
> > the only optimization I am aware of that could cause a similar issue is
> > the delayed IPC namespace cleanup.  That would cause the IPC namespace
> > creation to fail though, not posix_spawn.
> >
> > If you believe the failure can be related to reaching the pids limit for
> > the cgroup, could you please check the actual limit inside the
> > container?  You could check the value of /sys/fs/cgroup/pids.max inside
> > the container (assuming cgroupv2 and a cgroup namespace for the container).
> >
> > Please let me know if that helps.
> 
> (replying for the benefit of the list)

Than: could you try some chromium builds with a cat of that value at
various points? (ie, prep, build, etc?)
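
A sketch of what such a check could look like, assuming cgroup v2 and a
cgroup namespace as described above; the helper name and the stage labels
are made up for illustration. Reading pids.current alongside pids.max shows
how close the build gets to any limit at that point:

```python
from pathlib import Path


def pids_snapshot(stage, cgroup_dir="/sys/fs/cgroup"):
    """Format pids.current and pids.max for one build stage as a log line."""
    values = {}
    for name in ("pids.current", "pids.max"):
        try:
            values[name] = (Path(cgroup_dir) / name).read_text().strip()
        except OSError:
            # The file is missing, e.g. cgroup v2 is not mounted at this path.
            values[name] = "unavailable"
    return (
        f"[{stage}] pids.current={values['pids.current']} "
        f"pids.max={values['pids.max']}"
    )


if __name__ == "__main__":
    print(pids_snapshot("%build"))
```

Calling it at the start of each spec section would leave one line per stage
in build.log.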

kevin


Re: Help with systemd/cgroup task limits in koji

2023-02-23 Thread Florian Weimer
* Giuseppe Scrivano:

> Florian Weimer  writes:
>> It could be an old kernel bug:
>>
>>   Task exit is signaled before task resource deallocation, leading to
>>   bogus EAGAIN errors
>>   
>>
>> There have been recent namespace optimizations which introduce a similar
>> pattern there.  While they improve throughput in many cases, continuous
>> allocation and deallocation can now fail, even though the program logic
>> ensures that resources are never exceeded.
>>
>> Giuseppe, any suggestions how to debug this?
>
> the only optimization I am aware of that could cause a similar issue is
> the delayed IPC namespace cleanup.  That would cause the IPC namespace
> creation to fail though, not posix_spawn.
>
> If you believe the failure can be related to reaching the pids limit for
> the cgroup, could you please check the actual limit inside the
> container?  You could check the value of /sys/fs/cgroup/pids.max inside
> the container (assuming cgroupv2 and a cgroup namespace for the container).
>
> Please let me know if that helps.

(replying for the benefit of the list)


Re: Help with systemd/cgroup task limits in koji

2023-02-22 Thread Than Ngo


On 21.02.23 at 07:41, Florian Weimer wrote:
> * Kevin Fenzi:
>
>> Greetings.
>>
>> We are running into some annoying limits on koji builds of chromium.
>>
>> First, since a long time ago, the koji.service file we are using has:
>>
>> TasksMax=infinity
>>
>> But yet, chromium was failing, seemingly hitting a task limit.
>> "ninja: fatal: posix_spawn: Resource temporarily unavailable"
>> in the build and:
>> "kernel: cgroup: fork rejected by pids controller in
>> /machine.slice/machine-7d12b2e6dcfb4230b04d2c2c0b499171.scope/payload"
>> on the builder.
>>
>> Investigation and some help from folks in the #devel room
>> (many thanks glb!)
>> Showed that the systemd-nspawn container mock started has:
>>
>> systemctl show systemd-nspawn@0b3f01a2a8e345a389b30c477812c471
>> TasksMax=16384
>>
>> So, I put in place a:
>> /etc/systemd/system/systemd-nspawn@.service.d/override.conf
>> with:
>>
>> [Service]
>> TasksMax=infinity
>>
>> and that seemed to be used for the mock systemd-nspawn containers.
>>
>> However, builds with lots of CPUs are now failing later with:
>>
>> Error: spawn /usr/bin/node-18 EAGAIN
>>     at Process.ChildProcess._handle.onexit (node:internal/child_process:283:19)
>>     at onErrorNT (node:internal/child_process:476:16)
>>     at processTicksAndRejections (node:internal/process/task_queues:82:21)
>> [!] Error: unfinished hook action(s) on exit:
>>
>> Is there yet another layer here that has another limit?
>>
>> Is there anything here I can set that says "infinity all the way down"?
>>
>> Assistance welcome. I can file a systemd bug, but I am not sure
>> this is a bug more than a lack of documentation.
>
> It could be an old kernel bug:
>
>   Task exit is signaled before task resource deallocation, leading to
>   bogus EAGAIN errors
>
> There have been recent namespace optimizations which introduce a similar
> pattern there.  While they improve throughput in many cases, continuous
> allocation and deallocation can now fail, even though the program logic
> ensures that resources are never exceeded.


I am not sure it's an old kernel bug, because the koji builders running
the chromium builds are on kernel-6.1.7-200.fc37.aarch64.


Than


Re: Help with systemd/cgroup task limits in koji

2023-02-20 Thread Florian Weimer
* Kevin Fenzi:

> Greetings.
>
> We are running into some annoying limits on koji builds of chromium.
>
> First, since a long time ago, the koji.service file we are using has:
>
> TasksMax=infinity
>
> But yet, chromium was failing, seemingly hitting a task limit.
> "ninja: fatal: posix_spawn: Resource temporarily unavailable"
> in the build and:
> "kernel: cgroup: fork rejected by pids controller in
> /machine.slice/machine-7d12b2e6dcfb4230b04d2c2c0b499171.scope/payload"
> on the builder.
>
> Investigation and some help from folks in the #devel room
> (many thanks glb!)
> Showed that the systemd-nspawn container mock started has:
>
> systemctl show systemd-nspawn@0b3f01a2a8e345a389b30c477812c471
> TasksMax=16384
>
> So, I put in place a:
> /etc/systemd/system/systemd-nspawn@.service.d/override.conf
> with:
>
> [Service]
> TasksMax=infinity
>
> and that seemed to be used for the mock systemd-nspawn containers.
>
> However, builds with lots of CPUs are now failing later with:
>
> Error: spawn /usr/bin/node-18 EAGAIN
>     at Process.ChildProcess._handle.onexit (node:internal/child_process:283:19)
>     at onErrorNT (node:internal/child_process:476:16)
>     at processTicksAndRejections (node:internal/process/task_queues:82:21)
> [!] Error: unfinished hook action(s) on exit:
>
> Is there yet another layer here that has another limit?
>
> Is there anything here I can set that says "infinity all the way down"?
>
> Assistance welcome. I can file a systemd bug, but I am not sure
> this is a bug more than a lack of documentation.
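
On the question of further layers: pids limits are hierarchical in cgroup
v2, so a fork in the payload cgroup is also counted against every ancestor
cgroup. A minimal sketch, assuming it runs on the builder host (outside the
container's cgroup namespace) with cgroup v2 mounted at /sys/fs/cgroup,
that lists pids.max for a cgroup and each of its ancestors; the function
name is hypothetical:

```python
from pathlib import Path


def pids_limits_up_the_tree(cgroup_path, root="/sys/fs/cgroup"):
    """Return (cgroup, pids.max) pairs from the given cgroup up to the root."""
    root = Path(root)
    current = root / cgroup_path.lstrip("/")
    results = []
    while True:
        try:
            limit = (current / "pids.max").read_text().strip()
        except OSError:
            # No pids.max here: root cgroup, or pids controller not enabled.
            limit = "n/a"
        rel = "/" if current == root else "/" + str(current.relative_to(root))
        results.append((rel, limit))
        if current == root:
            return results
        current = current.parent


if __name__ == "__main__":
    # Cgroup path taken from the kernel message quoted above.
    scope = "/machine.slice/machine-7d12b2e6dcfb4230b04d2c2c0b499171.scope/payload"
    for cgroup, limit in pids_limits_up_the_tree(scope):
        print(f"{limit:>10}  {cgroup}")
```

Any level that prints a number rather than max is a candidate for the
remaining limit.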

It could be an old kernel bug:

  Task exit is signaled before task resource deallocation, leading to
  bogus EAGAIN errors
  

There have been recent namespace optimizations which introduce a similar
pattern there.  While they improve throughput in many cases, continuous
allocation and deallocation can now fail, even though the program logic
ensures that resources are never exceeded.

Giuseppe, any suggestions how to debug this?

Thanks,
Florian