On 2019-08-05 17:34, Ian Jackson wrote:
> With current code the options are:
> A. Things run in series but with concatenated output and no individual
>    status; vs.
> B. Things run in parallel, giving load spikes and possible concurrency
>    bugs.
> I can see few people who would choose (B).
> People who don't care much about paying attention to broken cron
> stuff, or people who wouldn't know how to fix it, are better served by
> (A). It provides a better experience.
> Knowledgeable people will not have too much trouble interpreting
> combined output, and maybe have external monitoring arrangements
> anyway. Conversely, heisenbugs and load spikes are still undesirable.
> So they should also choose (A).
> IOW reliability and proper operation is more important than separated
> logging and status reporting.
If we are in agreement that concurrency must happen with proper locking
and must not depend on accidental linearization, then identifying those
concurrency bugs is actually a worthwhile goal in order to achieve
reliability, is it not? I thought you would be the first to acknowledge
that bugs are worth fixing rather than sweeping them under the rug. We
already identified that parallelism between the various stages is
undesirable. With a systemd timer you can declare conflicts as well as
an ordering (linearization) if needed, as sketched below.
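A minimal sketch of what I mean, with hypothetical unit names and
payloads (whether Conflicts= or plain ordering dependencies fit better
depends on the jobs involved):

  # report-foo.service (hypothetical)
  [Unit]
  Description=Daily foo report
  # Conflicts= stops backup-bar.service if it is still running when this
  # unit starts (and vice versa); After= orders the two when both are
  # queued in the same transaction.
  Conflicts=backup-bar.service
  After=backup-bar.service

  [Service]
  Type=oneshot
  ExecStart=/usr/local/sbin/report-foo

  # report-foo.timer (hypothetical)
  [Unit]
  Description=Run the daily foo report

  [Timer]
  OnCalendar=daily
  Persistent=true

  [Install]
  WantedBy=timers.target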
I also question the "knowledgeable people will not have too much
trouble" part. Export state at the most granular level possible and
there is no guesswork required (see the sketch below). I have no doubt
that my co-workers can interpret combined output, but I want their
lives to be as easy as possible.
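To illustrate what per-unit state buys you (again with a hypothetical
unit name): once every job is its own unit, its outcome, logs and
schedule can be queried directly instead of being dug out of
concatenated output:

  # Last outcome and exit status of a single job:
  systemctl show -p Result,ExecMainStatus,ActiveState report-foo.service

  # Its own log stream, separate from every other job:
  journalctl -u report-foo.service --since yesterday

  # When it last ran and when it will run next:
  systemctl list-timers report-foo.timer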
Similarly, I wonder what that external monitoring would look like in
this case, apart from injecting fake jobs around every run-parts
invocation. Replacing run-parts with something monitoring-aware? Then
why not take the tool that already exists (systemd)? A sketch follows.
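systemd already has a hook for exactly this; a minimal sketch, where the
notification unit and its payload are hypothetical:

  # Drop-in for a job unit: start a notification unit when the job fails.
  # /etc/systemd/system/report-foo.service.d/monitoring.conf (hypothetical)
  [Unit]
  OnFailure=notify-admin@%n.service

  # Templated notification unit; %i expands to the name of the failed unit.
  # /etc/systemd/system/notify-admin@.service (hypothetical)
  [Unit]
  Description=Notify the admin about a failure of %i

  [Service]
  Type=oneshot
  ExecStart=/usr/local/sbin/notify-admin %i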
And finally, the load spikes: upthread it was mentioned that
RandomizedDelaySec exists. Generally this should be sufficient to even
out such effects. I understand that there is a case where you run a lot
of unrelated VMs that you cannot control. In other cases, like laptops
and desktops, it is very likely much more efficient to generate the
load spike, complete the task as fast as possible, and return to the
low-power state of (effectively) waiting for input. I suspect that the
conflict between the two could be dealt with by encouraging liberal use
of DefaultTimerAccuracySec at the system level (see the sketch below).
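Roughly what I have in mind; the values are placeholders, not
recommendations:

  # Per-timer: spread starts over a window to avoid synchronized spikes
  # across many machines (e.g. a fleet of unrelated VMs).
  # /etc/systemd/system/report-foo.timer.d/spread.conf (hypothetical)
  [Timer]
  RandomizedDelaySec=1h

  # System-wide: allow timers to be coalesced so a laptop wakes up once,
  # does the work and goes back to idle.
  # /etc/systemd/system.conf.d/accuracy.conf
  [Manager]
  DefaultTimerAccuracySec=30min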
I understand that Debian inherently does not distinguish between the
two cases. I'd still expect a Cloud/Compute provider to offer default
images anyway, and those could be preconfigured appropriately.
I apologize that I think of this in terms of systemd primitives. But the
tool was written for a reason and a lot of thought went into it.
Kind regards
Philipp Kern