On Sunday 19 October 2008 14:01:49 Denys Vlasenko wrote:
> > > > If one runs Linux on such a system with HZ=100, poll() with timeout
> > > > of 10ms is basically a "one timer tick timeout". I don't know how
> > > > precise it can be.
> >
> > Sleeps in the linux kernel can never return early unless interrupted, but
> > they can return arbitrarily late if the system is busy doing something
> > else. (This horrifies the realtime people, but that's a can of worms.)
> >
> > When your sleep expires,
>
> How can the kernel know that your sleep expired, on hardware where the only
> "timer source" is a timer interrupt at 100 Hz?
If the timer granularity isn't good enough, sleeps round up. (Rounding down
would be bad.) So it'll give you a longer sleep than you requested, rather
than returning early.
I remember reading a linux kernel mailing list thread on this, but alas
searching for it after the fact is nontrivial. (I miss kernel-traffic...)
> Let's say a program did a poll(10) at exactly .00950000 seconds,
> i.e. when 95% of the time between timer interrupts has passed.
> The kernel has no way of knowing that!
And so it rounds the sleep up. The delay value says "do not return before
this, unless interrupted and returning -EINTR or similar". The scheduler can
already _delay_ the return by an arbitrary amount of time because the system
is loaded or your executable pages got swapped out. If the sleep could
return _early_ as well, then it wouldn't mean anything at all.
> Should poll expire on next timer interrupt (which will be early),
> or on the one after it (which will very often result in 20ms timeouts)?
It should never return early, unless interrupted. That's the rule. If your
system's timer granularity sucks, then delays will be longer than you asked
for.
> > > > To be exact, can kernel ensure that it won't be significantly
> > > > _smaller_ than 10ms?
> >
> > Yes. It can. That's the whole POINT of sleeps.
>
> This is possible if you have some sort of high-res timer,
> not just timer ticks.
Which is why kernel/time.c:timespec_to_jiffies() starts with this comment:
> /*
> * The TICK_NSEC - 1 rounds up the value to the next resolution.
Other functions like timespec_trunc() round down, for use with the returned
value when you ask "what time is it". But when you do a sleep, it rounds up.
> So, lacking these hardware capabilities, the kernel has to be very
> pessimistic and on every poll(10) assume the worst - that the timer
> interrupt interval is about to expire - and therefore the wakeup has to
> be delayed for TWO timer interrupt intervals in order to not time out
> before 10ms.
> This decision will result in:
>
> while (1) {
> poll(&pfd, 1, 10);
> write(1, ".", 1);
> }
>
> printing a dot every 20 ms, not 10. Surprise!
Yes. Exactly. We agree. So why is it a bad thing here if vi waits a little
longer for a character to come in?
> I am paranoid. I assume some kernels won't do that.
Those other kernels are not only not linux, they're also not posix compliant.
Here's what SUSv3 has to say about poll():
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html
> If none of the defined events have occurred on any selected file
> descriptor, poll() shall wait at least timeout milliseconds for an event to
> occur on any of the selected file descriptors. If the value of timeout is
> 0, poll() shall return immediately. If the value of timeout is -1, poll()
> shall block until a requested event occurs or until the call is
> interrupted.
With "at least" meaning "rounds up".
> I am presuming some kernels (call them broken if you wish) will wake up
> poll(10) on the next timer interrupt, making it possible that poll(10)
> returns earlier than 10ms.
Those kernels, if they exist, explicitly violate the Single Unix Standard
version 3, the Open Group Base specifications edition 6, and Posix.
> > > > What if poll() was called soon before our 10ms
> > > > timeslice was going to expire anyway?
> >
> > Why do you believe time slices are 10ms? Where did you get that from?
> > Even the O(1) scheduler in 2.4 did timeslice aggregation, not sure about
> > Linus's original scheduler from 1992...
>
> I believe there is hardware that has no timer source other than the
> timer interrupt. On such hardware, the kernel can't make finer-grained
> scheduling decisions.
That's a bit of an oversimplification. If it switches tasks due to a serial
interrupt, and the serial interrupt comes in halfway between two timer ticks,
it's making a finer grained scheduling decision. What you mean is it can't
measure timeouts more accurately than that, which is why all the standards
tell it to round up.
> > > Imagine a high-end system with HZ=1000 and poll with timeout of 1ms.
> > > Now poll() happens to be called just 1µs before the end of the
> > > timeslice. If the kernel wouldn't wait for *at least* the specified
> > > time, the behavior would be undefined and such a timeout would be
> > > useless.
> >
> > Time slices DON'T WORK THAT WAY ANYMORE. They HAVEN'T FOR YEARS.
> > Jiffies are a resource allocation thing, that's all. Scheduling
> > decisions don't wait for time slices to end, it's a cheap O(1) action so
> > it can be done on return from every system call and most interrupts.
>
> Ok, a process returns from a system call.
> Without a high-res timer source (like the cycle counter in x86), how would
> you know that it "ran long enough and we can evict it, giving the CPU to
> someone else"? Since it got its CPU, did it run 1us? 10us? 999us? No way
> to know that till you get the next timer interrupt! On what data would you
> base your finer-grained scheduling decisions?
I'm not the one basing it, and I pointed you at a bunch of LWN articles on
this subject. I learned it by arguing with people about it on LKML back
around 2003 (a quick google turned up
http://www.uwsg.iu.edu/hypermail/linux/kernel/0308.0/1659.html which doesn't
look like the right thread... Maybe
http://www.uwsg.iu.edu/hypermail/linux/kernel/0308.1/1485.html ... No, it's
looking like it was scattered over a bunch of threads and not easy to link
to.)
> > Processes get scheduled as soon as there's a processor to run them on
> > that isn't busy with something else.
>
> This case is not interesting. I am thinking about the case where
> num_runnable_processes >> num_cpus. In this case, processes can't
> be scheduled "as soon as there's a processor to run them on".
> There are no free CPUs. You have to evict someone.
Which is why they evict "non-interactive" processes in favor of "interactive"
processes, and one of the two links I gave was to scheduler people trying to
figure out how to automatically detect interactivity, which involved defining
it. Off in scheduler-land, this is an old, well-trodden argument. Going
round robin over timeslices was what DESQview did; Linux is a little better
about it.
Rob
_______________________________________________
busybox mailing list
[email protected]
http://busybox.net/cgi-bin/mailman/listinfo/busybox