Marek Kozlowski wrote:
> Have you ever read your own code (or quick fixes) written >5 years
> ago if you'd forgotten to place comments? ;-)

I often say, "I miss my younger brain."  Back then I could remember
all of the details.  These days I write notes to my future self.  My
future self who will read them and not remember having written them.

These days if I find something like what you have found and find it
completely undocumented then I know that I must remove it as part of a
cleanup back to the mainstream.  If a problem were to appear again
then now it is in my recent cache memory.  I will then be able to deal
with it.  And if that requires modifying the configuration then I will
leave a comment with it about why it is there and other details.

> No, such configurable limits are great. My question was different. I suppose
> that many many years ago, many versions ago I had some problem with this
> server and I tried to solve it or apply a quick fix by incrementing the
> limit. Unfortunately I don't remember the problem. I don't even know if it
> could reappear if I set it to the default. Can anyone guess the potential
> problem given the solution? ;-))

I think that if you were increasing the fork_attempts value then, as
Wietse noted, your system was overloaded.  And the root cause of the
problem was elsewhere in the system.  But to answer your question
specifically it helps to understand fork() and why it might fail.

NetBSD is closest to the original and documents it most simply.
Every kernel is different.  But similar.

    https://man.netbsd.org/fork.2

    ERRORS

       fork() will fail and no child process will be created if:

       [EAGAIN]   The system-imposed limit on the total number of processes
                  under execution would be exceeded.  This limit is
                  configuration-dependent; or the limit RLIMIT_NPROC on the
                  total number of processes under execution by this user id
                  would be exceeded.

       [ENOMEM]   There is insufficient swap space for the new process.
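
As a purely illustrative sketch (plain C, not Postfix source), this is
roughly what those two failure cases look like to the program calling
fork():

    /* Illustration only: report the two documented fork() failures. */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (pid == -1) {
            if (errno == EAGAIN)
                fprintf(stderr, "fork: process table or RLIMIT_NPROC limit hit\n");
            else if (errno == ENOMEM)
                fprintf(stderr, "fork: not enough memory/swap for a child\n");
            else
                perror("fork");
            return 1;
        }
        if (pid == 0)
            _exit(0);       /* child: do nothing and exit */
        return 0;           /* parent */
    }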

If the kernel is out of process slots then fork() fails.  The shell
usually says "cannot fork" in that case.  Too many processes are
running on the system at the same time.  Fork-bombs can create this
situation.  [[ When out of process slots it is sometimes possible to run
exactly one command by replacing the current shell with it using
"exec" to overlay the new process on the same PID as the shell that
invokes it. ]]

If increasing the number of fork attempts improved things then it was
unlikely a fork-bomb but probably something else running some
thousands of processes "for a while" that drained out "after a bit".
And so a retry of fork() then happens to work later, when a process
slot has become available for it.
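
Roughly what such a retry looks like, as a sketch (this is not the
actual Postfix code; the five attempts and the one-second sleep are
made-up illustration values):

    /* Sketch: retry fork() only when the failure looks transient. */
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    pid_t fork_with_retries(int attempts)
    {
        for (int i = 0; i < attempts; i++) {
            pid_t pid = fork();

            if (pid != -1)
                return pid;                  /* success: 0 in child, >0 in parent */
            if (errno != EAGAIN && errno != ENOMEM)
                break;                       /* not a transient failure */
            sleep(1);                        /* wait for the load to drain out */
        }
        return -1;                           /* still failing: give up */
    }

    int main(void)
    {
        pid_t pid = fork_with_retries(5);    /* 5 attempts: arbitrary */

        if (pid == -1)
            return 1;                        /* still failing after retries */
        if (pid == 0)
            _exit(0);                        /* child */
        return 0;                            /* parent */
    }

If the load really does drain out "after a bit" then a later iteration
succeeds; if it does not, the caller still sees the failure.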

If this is a web server, and if it is running CGI or FPM or Apache or
many other things that react to Internet activity by starting
processes, and if that process creation is not limited to a reasonable
maximum number of processes, then it is possible for Internet activity
to force the system into process stress by making it start godzillians
of processes.  That is one example.  There are many other
possibilities too.  And a high default maximum of 256 might be
completely unsuitable if the actual maximum of those processes that
the system can handle before running out of memory is 12.  Don't
guess.  Do the math and figure it out.  Then test it to verify.
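
To make that math concrete with purely hypothetical numbers: say the
machine has about 4 GB of RAM to spare for those worker processes and
each worker peaks at around 300 MB.

    4096 MB available / 300 MB per worker  =  about 13 workers

On that machine a worker limit of 256 (an Apache MaxRequestWorkers or
PHP-FPM pm.max_children style setting) would be wildly too high;
roughly a dozen would be the honest ceiling.  Your numbers will be
different, which is exactly why it has to be measured rather than
guessed.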

If there is insufficient memory then the fork() fails.  This might
happen if all of the system memory available for processes has been
consumed.  Again, if increasing the fork retry limit helped, then
it means that something transiently consumed all memory "for a while"
and then that drained out "after a bit" and subsequently allowed a
later retry to succeed.

Note that different kernels handle these cases very differently.
Linux notably uses memory overcommit.  Which means fork() won't fail
due to out of memory.  If the system is out of memory the fork()
succeeds, but then later, when the memory is actually written to and
the system does not have memory available, the OOM (Out-of-Memory)
killer is invoked to kill processes until enough memory becomes
available again.  It's a completely different memory model.  And it is
harder to be as robust on servers as with the traditional model.
Because the OOM killer might kill
something necessary.  And then it would need to be restarted.  Better
to avoid it getting killed at all.  This is a big discussion topic
just by itself.
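
As an aside, if you want to see which overcommit policy a given Linux
machine is using, you can read the vm.overcommit_memory setting.  A
small sketch (Linux-only, since that /proc path does not exist
elsewhere):

    /* Sketch: print Linux's vm.overcommit_memory policy.
     * 0 = heuristic overcommit (the default), 1 = always overcommit,
     * 2 = strict accounting, closer to the traditional model. */
    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("/proc/sys/vm/overcommit_memory", "r");
        int mode;

        if (fp == NULL) {
            perror("/proc/sys/vm/overcommit_memory");
            return 1;
        }
        if (fscanf(fp, "%d", &mode) == 1)
            printf("vm.overcommit_memory = %d\n", mode);
        fclose(fp);
        return 0;
    }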

Personally I would never see increasing fork_attempts as a suitable
solution for either of the main reasons fork() might fail.  Instead I
would want to understand what is causing the resource stress.  If it
is too many processes from something XYZ then limit those in XYZ to
below the process slot limit of the system.  If the problem is virtual
memory then I would increase the amount of virtual memory available on
the system.  Or again limit the number of XYZ processes that are
consuming memory.  Or if it is a single process then limit the amount
of memory that single process can consume.
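
If the limiting cannot be done in the application's own configuration
then setrlimit() is the usual per-process mechanism (the shell's
ulimit builtin exposes the same limits).  A sketch with made-up
numbers, assuming a platform that has RLIMIT_AS and RLIMIT_NPROC
(Linux and the BSDs do):

    /* Sketch: cap this process (and anything it execs) with setrlimit().
     * The 200 MB address-space cap and 64-process cap are arbitrary
     * illustration values, not recommendations. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit vmem  = { .rlim_cur = 200UL * 1024 * 1024,
                                .rlim_max = 200UL * 1024 * 1024 };
        struct rlimit nproc = { .rlim_cur = 64, .rlim_max = 64 };

        if (setrlimit(RLIMIT_AS, &vmem) != 0)       /* virtual memory cap */
            perror("setrlimit(RLIMIT_AS)");
        if (setrlimit(RLIMIT_NPROC, &nproc) != 0)   /* per-uid process cap */
            perror("setrlimit(RLIMIT_NPROC)");

        /* ... then exec the real workload so it inherits these limits ... */
        return 0;
    }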

If either of these is a problem on your system then it is better to
get an understanding of the system resources.  Then make sure there
are enough resources available.  Memory.  CPU.  Storage.  Whatever.

For a simple single system I like using Munin to monitor the trending
state of the system.  Munin is one of many system monitoring systems
which allow looking at what happened on a system after the fact.  And
also what is happening now on the system.  There are many popular
monitoring tools.  Try several.  Find one you like.

If you are monitoring your system and nothing bad is happening then
leave the defaults in place.  Be happy.  If your system is
periodically spiking into problems then address those problems to
prevent them rather than working around them by increasing the number
of retries of other things.

Note that while Postfix has retries on fork() failures almost nothing
else on the system does that.  Which means that if it is in a state
where fork() is failing then many other random things on the system
will also be failing as they will be unprotected.

Bob
