On Saturday 18 October 2008 16:06:22 Ralf Friedl wrote:
> Denys Vlasenko wrote:
> > I think scheduler is at play here.
> > Imagine a low-end system (say ~100MHz cpu) with rather poor capabilities
> > wrt measuring time. No microsecond clock, just timer interrupts.

I don't seem to have gotten the original of this message, but I repeat: timer 
ticks have nothing to do with it.  This is a race between two entirely 
in-kernel events, neither of which involves the scheduler until the event is 
already _over_.

> > If one runs Linux on such a system with HZ=100, poll() with timeout of
> > 10ms is basically a "one timer tick timeout". I don't know how precise it
> > can be.

Sleeps in the linux kernel can never return early unless interrupted, but they 
can return arbitrarily late if the system is busy doing something else.  
(This horrifies the realtime people, but that's a can of worms.)

When your sleep expires, your process gets taken off the wait queue it's on 
and put back in the scheduler queue.  Your process _cannot_ be scheduled 
before this happens, but when the scheduler actually has a spare time slice 
to hand it is a separate issue.  (And that second decision depends on process 
priority and which scheduler variant you're using and whether or not you've 
invoked the funky realtime options and...)  This was the original O(1) stuff, 
the way they _got_ O(1) was by not looking at inactive processes at all when 
making a scheduler decision but instead moving processes between different 
queues depending on its current status, so if your process is waiting it's on 
a wait queue and not on a scheduler queue.  (It gets moved to a separate 
linked list, and then moved back when its timer expires.)

I believe current kernels use CFS (the "Completely Fair Scheduler"), which was 
Ingo Molnar following up on the work Con Kolivas did to improve Ingo's 
original O(1) scheduler, which replaced Linus's original scheduler back 
during 2.4.

If you _really_ care about this, go to http://lwn.net/Kernel/Index and scroll 
down to the "Scheduler" topic.  The dozen or so articles in there are a good 
primer on the issue, although they don't cover things like this ACM article 
on CFS:
  http://portal.acm.org/citation.cfm?id=1400097.1400102

> > To be exact, can kernel ensure that it won't be significantly 
> > _smaller_ than 10ms?

Yes.  It can.  That's the whole POINT of sleeps.

Now what _can_ return early from a sleep is being interrupted by a signal, in 
which case the system call returns -EINTR, as for all system calls, which is 
why sigaction(SA_RESTART) or a wrapper is a good idea.  Busybox has a 
safe_poll() wrapper (which may itself be buggy, but that's a separate issue).

That said, just about the only signal that's likely to interrupt a sleep in a 
way that returns EINTR rather than killing the process is SIGSTOP.  I suppose 
you could write a program to SIGSTOP and SIGCONT vi in a tight loop to try to 
beat bad behavior out of it, with the current safe_poll() wrapper doing a 
strange decrement you'd have to send 20 signals in under 10ms in order to 
beat bad behavior out of it.  You'd have to write it in C though, even on my 
Core 2 duo laptop a shell script probably couldn't do it fast enough.  (Yes, 
I can come up with a pathological case to break most things, but this is 
really not an interesting case.)

> > What if poll() was called soon before our 10ms 
> > timeslice was going to expire anyway?

Why do you believe time slices are 10ms?  Where did you get that from?  Even 
the O(1) scheduler in 2.4 did timeslice aggregation, not sure about Linus's 
original scheduler from 1992...

> > When the timeout will expire? Next 
> > timeslice? Then it can be smaller than 10ms. Two timeslices from now?
> > This means almost always poll(10) will do ~20ms waits.
>
> Imagine a high-end system with HT=1000 and poll with timeout of 1ms. Now
> poll() happens to be called just 1µs before the end of the timeslice. If
> the kernel wouldn't wait for *at least* the specified time, the behavior
> would be undefined and such a timeout would be useless.

Time slices DON'T WORK THAT WAY ANYMORE.  They HAVEN'T FOR YEARS.  Jiffies are 
a resource allocation thing, that's all.  Scheduling decisions don't wait for 
time slices to end, it's a cheap O(1) action so it can be done on return from 
every system call and most interrupts.  (I believe that was part of Robert 
Love's original low latency work predating even the O(1) scheduler, but it's 
been so many years I'd have to look it up.  I'm fairly sure this predated the 
kernel preemption stuff...)

Processes get scheduled as soon as there's a processor to run them on that 
isn't busy with something else.  When they wait they get a priority boost 
(notice how "priority" and "niceness" are different fields in top), so they 
only wait until the end of a time slice for other processes of the same 
priority.  Having another process with the exact same priority as vi is 
somewhat potluck due to there being _lots_ priority levels the dynamic 
priority adjustment goes through, but also means that either the other 
process is also massively I/O bound and woke up at the exact same instant we 
did (lottery situation) so it's almost never actually running right now, or 
else it means that VI has become CPU bound and interactivity ain't an issue 
no more anyway.

I _also_ note that modern Linux system calls such as ppoll() have a 
_nanoseconds_ field because miliseconds wasn't considered accurate enough.  I 
also point out that arm is the platform on which Thomas Gleixner first 
implemented the high resolution timers infrastructure in the first place, so 
modern embedded hardware is surprisingly well represented here.

I also point out that the scheduler still isn't involved the poll() system 
call's decision to return data or to return expiration.  It's a separate 
kernel wait queue, so this entire discussion is off to one side of the real 
issue...

Rob
_______________________________________________
busybox mailing list
[email protected]
http://busybox.net/cgi-bin/mailman/listinfo/busybox

Reply via email to