Here's the situation and a confusing update. The system is -- naturally -- complex. [If it were simple, we'd have solved the problem two weeks ago.]

The system uses shared memory, reserved with an append="mem=xx" line in /etc/lilo.conf, managed by a device driver (which I wrote) that runs a MIL-STD-1553B MUXBUS interface card as a bus controller. A variety of data structures stored in that shared memory are watched by a monitoring program I wrote, which displays the data in an xterm window using ncurses. That's the easy part. [BTW, the MUXBUS device driver has been fairly thoroughly tested and has never failed to dispatch messages on its regular schedule.]

The monitoring program uses another device driver I wrote for pacing. The theory is simple: if a user task wants to do something at, say, 4 Hz, it reads (in a while loop) from the device /dev/FRT4. This device simply blocks any and all reads until the next strike of the (simulated) 4 Hz clock, at which time it releases the block and the tasks are freed to continue doing what they wanted to do. I chose this approach rather than an RT-Linux FIFO because there is a variable -- and essentially unknown -- number of tasks running at any one time.

The rate-driver logic is simple enough. The driver keeps (IIRC) six wait queues (see Rubini's book for more info), one for each rate. When a read request comes in, the task is immediately queued onto the appropriate wait queue [using interruptible_sleep_on()], blocking the task -- a standard technique in a Linux device driver. When the 4 Hz clock strikes, the task dispatcher unblocks any and all tasks on the 4 Hz queue [by calling wake_up_interruptible()], and each read then returns a one-byte read status [just to keep the library functions happy, even though no data is actually transmitted]. And it works.
I've watched three or four monitoring tasks track changes in message contents for several hours. Originally, the frequency source for the task dispatcher was a real-time Linux periodic task running at the message-slot rate of 1280 Hz. But I became concerned about calling a normal Linux kernel function from within a real-time task, so I converted it to use a standard Linux timer queue, called every jiffy (100 Hz). I was sure that this would cure the problem, but it didn't: the programs run happily for a while and then, all within the same 4 Hz interval, they inexplicably hang. They are uninterruptible, accumulate no CPU time (as shown by top), and are not (again per top) zombie tasks. The programs do not hang at the same time relative to program start on each run, but I've never seen them fail within the first few hours.

I keep thinking that there must be a race condition somewhere, but I've never been able to figure out what it might be. No device driver is hung: the MUXBUS driver is still pumping out messages at exactly the right rate, and I can start several more monitoring tasks, which immediately function correctly -- for a few hours, until they too hang. I also shut down the source of an asynchronous 1 Hz interrupt that was handled by a (literally) do-nothing ISR [which I wrote] that simply executed a return. Again, this did not help. And everything is reproducible, not only from run to run but between two different computers (with different motherboards and video cards, too). The systems are a P133 and a P233MMX running Red Hat 5.2 with real-time Linux 0.9J (the Zentropix distribution).

Any and all thoughts, suggestions, and comments are welcome. I can provide code samples on request. If you reply to me personally, please use [EMAIL PROTECTED] since we're shut down until the next millennium. Happy (and safe) holidays to all.
Norm

> -----Original Message-----
>
> > Date: Wed, 22 Dec 1999 03:34:43 +0100 (CET)
> > From: root <[EMAIL PROTECTED]>
> > To: Norm Dresner <[EMAIL PROTECTED]>
> > cc: [EMAIL PROTECTED]
> > Subject: Re: [rtl] Is this asking for trouble or is the bug elsewhere?
> >
> > On Sun, 19 Dec 1999, Norm Dresner wrote:
> >
> > > there's no static logic failure in the code. The logic in the device
> > > driver is quite simple: when the appropriate read request is received
> > > and the minor device number decoded (which encodes the rate), a call
> > > is made to interruptible_sleep_on(). That's all. When the block is
> > > broken later, the code returns a 1 to signal that one byte has been
> > > transferred -- even though no data was really moved -- and that's all
> > > of the processing in the read routine.
> >
> > About which device driver are you talking here (RT FIFO?)? I can not
> > comment on wake_up_interruptible, but why don't you just use rtf_put
> > from RT kernel code? This should wake up the user-space task blocking
> > on read.
> >
> > --
> > Tomek

--- [rtl] ---
To unsubscribe:
echo "unsubscribe rtl" | mail [EMAIL PROTECTED]
OR echo "unsubscribe rtl <Your_email>" | mail [EMAIL PROTECTED]
----
For more information on Real-Time Linux see: http://www.rtlinux.org/~rtlinux/