Re: give cron a sensible default max load_avg for batch jobs
On 2015-11-14 Sat 05:57 AM, Todd C. Miller wrote:
> The question no one seems to be asking here is "who actually runs
> batch".  Anyone?

I do, on small servers with an average uptime(1) load of ~0.2
Re: give cron a sensible default max load_avg for batch jobs
On 2015-11-14 09:54, li...@wrant.com wrote:
>>> I think 8 is way too high.  Isn't the point of batch to run things
>>> when the machine is mostly idle?
>>
>> The problem is (and we've had this discussion several times before at
>> least in misc@), that the system load doesn't really tell us that.
>
> What's the proper way to calculate amount of work a system can do, for
> (then) figuring CPU idle time threshold?
>
> Does this not also include the work load (type) being done and imply
> capability to manage the work load distribution?

The problem is that there is no proper way, at least not *one* proper
way, to do that.  It all depends on your particular situation.

Cron has no way of knowing whether the job it is just about to fire off
will take a hundred milliseconds or a hundred hours to run, or what kind
of resources it will consume.

The reason to use loadavg as an indicator of system activity is that
while it measures neither high CPU activity nor high I/O activity
directly, it is actually a pretty good hint as to whether the system is
"busy", for a very fuzzy definition of busy.

The problem is that the value isn't absolute; it is relative to the
configuration and load profile of each system.  If my single-processor
system shows a load of 14, I can say with some certainty that it's quite
busy.  If my 12-core dual-Opteron server shows 14, it's hardly even
breathing heavily, *even* if its activity is mostly CPU-bound.  If it
shows 100+, then as a sysadmin I'd start looking for explanations.

Remember also that running a job via batch is generally a very kind way
to start heavy tasks, because cron runs the job niced.  So when there is
a lot of CPU-bound activity, the job may never get more than a few time
slices here and there, while if mostly I/O is going on it may run next
to unnoticed even if it's got lots of CPU-bound work to do.
So the problem isn't even that big, since the system's own scheduler is
pretty good at handling various system loads once the jobs have actually
begun life as processes.

The one time when it is especially unsuitable to run an extra batch job
is when we are memory starved and swapping, or close to having to swap.
And as it happens, load_avg conveniently starts skyrocketing as we begin
to swap.  So, if we break it down, load_avg is really not such a bad
metric to use in this particular case.  It is "just" that the default
limit is set way too low by today's standards.

With that said, I'm looking at other ways to determine system workload.
Maybe there's a set of metrics that gives us a more accurate snapshot of
the system's current state, and that can be averaged over time like
load_avg so as to avoid temporary spikes giving a faulty impression of
the system's activity.

But that calculation must also be as unaffected as possible by the
system's "dimensions".  A system with fast I/O can of course handle more
of it before becoming saturated; likewise with CPU speed, number of
cores, and system memory.

So the best aspect, in my mind, to start looking at is whether the
system does a lot of *waiting* to get its jobs done: waiting for disk or
network I/O, processes waiting to run, lock contention waiting to clear,
things like that.  If nobody is waiting for anybody, then by all means
go ahead and run one more job!  It's not going to harm anything, as long
as it doesn't consume all of the idling resources for itself.

The problem can be as complex as we want to make it, or as simple, if it
isn't a problem in practice.

>> My particular problem, and the reason I suggested this patch in the
>> first place, is that I often see loads of 20-30-50 or even way more,
>> without there even being a problem.  The machine is very responsive,
>> and everything works great - there are just a lot of processes running
>> or waiting for an opportunity to run.
> That's not the general case on 'single/dual' (or less than "your chosen
> higher than 4 number" of) CPU systems, and when running fewer processes
> that are more CPU intensive.  In these cases it may also be easier to
> know what's happening on the system.  Selecting the offloaded period
> (automatically?) where you don't have direct control requires more
> understanding than average load numbers (suggestion only).  Or a
> different approach to task running (e.g. service-oriented nodes
> assisting general worker ones).
>
> Better to use a statistical approach per machine (counters) while
> factoring in processing capability and duty saturation cycle (human
> assessment).  Or simply the users' circadian cycle, and not care much,
> as machines just work while people rest, with potential for overlap
> between multiple machines for the same role/task.

Good points.  But taking the human circadian cycle into account, that
is, working when the human is not and vice versa, can easily be
accommodated already, by using at(1).

>> Since the system load essentially is a decaying average of the number
>> of runnable or running processes, it is not in any way connected to
>> actual processor workload as in instructions executed, just to the
>> fact that there is much *potentially* going on in the system.
Re: give cron a sensible default max load_avg for batch jobs
On 2015-11-14 13:57, Todd C. Miller wrote:
> The question no one seems to be asking here is "who actually runs
> batch".  Anyone?

I gave kind of an answer to that in my original posting. :-)

At least I run batch and at, and I do it *all the time*.  There is imho
no more convenient way of firing off a background job than using batch;
it's a hidden gem in the unix toolbox.  And using at makes it super easy
to schedule tasks at times when it is more convenient to run them than
"now".  And if you have output, you get it in a mail when the job is
done.  Very spiffy!

Regards,
/Benny
Re: give cron a sensible default max load_avg for batch jobs
The question no one seems to be asking here is "who actually runs
batch".  Anyone?

 - todd
Re: give cron a sensible default max load_avg for batch jobs
>>>> This patch changes the default setting to 1.5 *
>>>> (number_of_cpus_in_system) instead, which I find better matches modern
>>>> behaviour.
>>>
>>> A larger number is sensible in this position.
>>>
>>> I would propose 8.  I don't agree with a calculation like that; the
>>> amount of work a system can do should not be calculated like that.
>>
>> I think 8 is way too high.  Isn't the point of batch to run things
>> when the machine is mostly idle?
>
> The problem is (and we've had this discussion several times before at
> least in misc@), that the system load doesn't really tell us that.

What's the proper way to calculate the amount of work a system can do,
for (then) figuring a CPU idle time threshold?

Does this not also include the work load (type) being done, and imply a
capability to manage the work load distribution?

> It *may* be the case that the system is under lots of work, but it may
> also be the case that there are many processes just blocking waiting
> for some resource and that the system is essentially idling.
>
> My particular problem, and the reason I suggested this patch in the
> first place, is that I often see loads of 20-30-50 or even way more,
> without there even being a problem.  The machine is very responsive,
> and everything works great - there are just a lot of processes running
> or waiting for an opportunity to run.

That's not the general case on 'single/dual' (or less than "your chosen
higher than 4 number" of) CPU systems, and when running fewer processes
that are more CPU intensive.  In these cases it may also be easier to
know what's happening on the system.  Selecting the offloaded period
(automatically?) where you don't have direct control requires more
understanding than average load numbers (suggestion only).  Or a
different approach to task running (e.g. service-oriented nodes
assisting general worker ones).
Better to use a statistical approach per machine (counters), while
factoring in processing capability and duty saturation cycle (human
assessment).  Or simply the users' circadian cycle, and not care much,
as machines just work while people rest, with potential for overlap
between multiple machines for the same role/task.

> Since the system load essentially is a decaying average of the number
> of runnable or running processes, it is not in any way connected to
> actual processor workload as in instructions executed, just to the
> fact that there is much *potentially* going on in the system.

Obviously, this explains why the average load figure is not 'the' proper
way to quantify how busy the processor is; such a method gains little
adequacy without a tuning knob, and that only after assessment of other
factors.  CPU count does correlate, but is not solely deterministic, and
imagine the mess from twisting a knob without understanding what it does
(sane limits, sane defaults).

> That's also why I suggested to base the default on a value relative to
> the number of cores - it made sense from my practical point of view.
> But I understand where Theo's coming from on this.

Please comment on an (improved?) method to estimate processor offloaded
periods that reduces the average-load guess work, or simply on a
practical approach to solving the problem of finding offloaded periods
(threshold) without pushing edge case changes.
Re: give cron a sensible default max load_avg for batch jobs
On 2015 Nov 13 (Fri) at 20:28:01 -0700 (-0700), Todd C. Miller wrote:
:On Fri, 13 Nov 2015 16:45:44 -0700, Theo de Raadt wrote:
:
:> > This patch changes the default setting to 1.5 *
:> > (number_of_cpus_in_system) instead, which I find better matches modern
:> > behaviour.
:>
:> A larger number is sensible in this position.
:>
:> I would propose 8.  I don't agree with a calculation like that; the
:> amount of work a system can do should not be calculated like that.
:
:I think 8 is way too high.  Isn't the point of batch to run things
:when the machine is mostly idle?
:
: - todd

My laptop currently has chrome open with no javascript pages, a torrent
client that is fully paused, and is running cvsync.  My load is at
1.65. :/

I think 8 is much better, imho.

--
It was one of those perfect summer days -- the sun was shining, a
breeze was blowing, the birds were singing, and the lawn mower was
broken ...
		-- James Dent
Re: give cron a sensible default max load_avg for batch jobs
On 2015-11-14 04:28, Todd C. Miller wrote:
> On Fri, 13 Nov 2015 16:45:44 -0700, Theo de Raadt wrote:
>
>>> This patch changes the default setting to 1.5 *
>>> (number_of_cpus_in_system) instead, which I find better matches modern
>>> behaviour.
>>
>> A larger number is sensible in this position.
>>
>> I would propose 8.  I don't agree with a calculation like that; the
>> amount of work a system can do should not be calculated like that.
>
> I think 8 is way too high.  Isn't the point of batch to run things
> when the machine is mostly idle?

The problem is (and we've had this discussion several times before, at
least in misc@) that the system load doesn't really tell us that.

It *may* be the case that the system is under lots of work, but it may
also be the case that there are many processes just blocked waiting for
some resource and that the system is essentially idling.

My particular problem, and the reason I suggested this patch in the
first place, is that I often see loads of 20-30-50 or even way more,
without there even being a problem.  The machine is very responsive, and
everything works great - there are just a lot of processes running or
waiting for an opportunity to run.

Since the system load essentially is a decaying average of the number of
runnable or running processes, it is not in any way connected to actual
processor workload as in instructions executed, just to the fact that
there is much *potentially* going on in the system.

For example, I run a couple of Hadoop clusters (not on OpenBSD,
unfortunately), and with cluster nodes containing dual 6-core
hyper-threading Xeon processors there are 24 "cores" that can be tasked
with calculations.  If they are all doing something, the system load
will be at least 24 - but there would be no problem whatsoever with
doing more things on the server, especially since the map/reduce tasks
run with lowered priority.  Each core's individual share of the load
would be about 1.
That's also why I suggested basing the default on a value relative to
the number of cores - it made sense from my practical point of view.
But I understand where Theo's coming from on this.

Regards,
/Benny

--
 internetlabbet.se    / work: +46 8 551 124 80      /  "Words must
    Benny Lofgren    / mobile: +46 70 718 11 90    /    be weighed,
                    / fax: +46 8 551 124 89       /     not counted."
                   / email: benny -at- internetlabbet.se
Re: give cron a sensible default max load_avg for batch jobs
On Fri, 13 Nov 2015 16:45:44 -0700, Theo de Raadt wrote:
>> This patch changes the default setting to 1.5 *
>> (number_of_cpus_in_system) instead, which I find better matches modern
>> behaviour.
>
> A larger number is sensible in this position.
>
> I would propose 8.  I don't agree with a calculation like that; the
> amount of work a system can do should not be calculated like that.

I think 8 is way too high.  Isn't the point of batch to run things
when the machine is mostly idle?

 - todd
Re: give cron a sensible default max load_avg for batch jobs
On 2015-11-14 00:45, Theo de Raadt wrote:
>> This patch changes the default setting to 1.5 *
>> (number_of_cpus_in_system) instead, which I find better matches modern
>> behaviour.
>
> A larger number is sensible in this position.
>
> I would propose 8.  I don't agree with a calculation like that; the
> amount of work a system can do should not be calculated like that.

Fair enough!  I agree that 8 will probably fit most cases.  It makes
for a simpler patch, too. :-)

(I retained the decimal point in 8.0 in the man page, as an indicator
that it is not an integer value.)

Regards,
/Benny

Index: config.h
===================================================================
RCS file: /cvs/src/usr.sbin/cron/config.h,v
retrieving revision 1.23
diff -u -p -u -r1.23 config.h
--- config.h	23 Oct 2015 18:42:55 -0000	1.23
+++ config.h	14 Nov 2015 00:32:21 -0000
@@ -40,7 +40,7 @@
 #define MAILARG _PATH_SENDMAIL		/*-*/
 
 /* maximum load at which batch jobs will still run */
-#define BATCH_MAXLOAD	1.5		/*-*/
+#define BATCH_MAXLOAD	8.0		/*-*/
 
 /* Define this to run crontab setgid instead of
  * setuid root.  Group access will be used to read
Index: cron.8
===================================================================
RCS file: /cvs/src/usr.sbin/cron/cron.8,v
retrieving revision 1.34
diff -u -p -u -r1.34 cron.8
--- cron.8	12 Nov 2015 21:14:01 -0000	1.34
+++ cron.8	14 Nov 2015 00:32:21 -0000
@@ -116,7 +116,7 @@ If the current load average is greater t
 .Ar load_avg ,
 .Xr batch 1
 jobs will not be run.
-The default value is 1.5.
+The default value is 8.0.
 To allow
 .Xr batch 1
 jobs to run regardless of the load, a value of 0.0 may be used.
Re: give cron a sensible default max load_avg for batch jobs
> This patch changes the default setting to 1.5 *
> (number_of_cpus_in_system) instead, which I find better matches modern
> behaviour.

A larger number is sensible in this position.

I would propose 8.  I don't agree with a calculation like that; the
amount of work a system can do should not be calculated like that.