Here's the situation and a confusing update. The system is -- naturally -- complex. [If it were simple, we'd have solved the problem two weeks ago.]

The system uses shared memory, reserved with an append="mem=xx" line in /etc/lilo.conf, managed by a device driver (which I wrote) that runs a MIL-STD-1553B MUXBUS interface card as a bus controller. A variety of data structures stored in that shared memory are watched by a monitoring program I wrote, which displays the data in an xterm window using ncurses. That's the easy part. [BTW, the MUXBUS device driver has been fairly thoroughly tested and has never failed to dispatch messages on its regular schedule.]

The monitoring program uses another device driver I wrote for pacing. The theory is simple: if a user task wants to do something at, say, 4 Hz, it reads (in a while loop) from the device /dev/FRT4. This device simply blocks any and all reads until the next strike of the (simulated) 4 Hz clock, at which time it releases the block and the tasks are freed to continue doing what they wanted to do. I chose this approach rather than an RT-Linux FIFO because there is a variable -- and essentially unknown -- number of tasks running at any one time.

The rate-driver logic is simple enough. The driver keeps (IIRC) six wait queues (see Rubini's book for more info), one for each rate. When a read request comes in, the task is immediately queued onto the appropriate wait queue [using interruptible_sleep_on()], blocking the task -- a standard technique in a Linux device driver. When the 4 Hz clock strikes, the task dispatcher unblocks any and all tasks on the 4 Hz queue [by calling wake_up_interruptible()], and each read then returns a one-byte read status [just to keep the library functions happy, even though no data is actually transmitted]. And it works.
I've watched three or four monitoring tasks track changes in message contents for several hours. Originally, the frequency source for the task dispatcher was a real-time Linux periodic task running at the message-slot rate of 1280 Hz. But I became concerned about calling a normal Linux kernel function from within a real-time task, so I converted it to use a standard Linux timer queue, called every jiffy (100 Hz). I was sure that this would cure the problem, but it didn't: the programs run happily for a while and then, all within the same 4 Hz interval, they inexplicably hang. They are uninterruptible, accumulate no CPU time (as shown by top), and are not (again per top) zombie tasks. The programs do not hang at the same time relative to program start on each run, but I've never seen them fail within the first few hours.

I keep thinking that there must be a race condition somewhere, but I've never been able to figure out what it might be. No device driver is hung: the MUXBUS driver is still pumping out messages at exactly the right rate, and I can start several more monitoring tasks, which immediately function correctly -- for a few hours, until they too hang. I also shut down the source of an asynchronous 1 Hz interrupt that was handled by a (literally) do-nothing ISR [which I wrote] that simply executed a return. Again, this did not help. And everything is reproducible, not only from run to run but between two different computers (with different motherboards and video cards, too). The systems are a P133 and a P233MMX running Red Hat 5.2 with real-time Linux 0.9J (the Zentropix distribution).

Any and all thoughts, suggestions, and comments are welcome. I can provide code samples on request. If you reply to me personally, please use [EMAIL PROTECTED] since we're shut down until the next millennium. Happy (and safe) holidays to all.
Norm

> -----Original Message-----
>
> > Date: Wed, 22 Dec 1999 03:34:43 +0100 (CET)
> > From: root <[EMAIL PROTECTED]>
> > To: Norm Dresner <[EMAIL PROTECTED]>
> > cc: [EMAIL PROTECTED]
> > Subject: Re: [rtl] Is this asking for trouble or is the bug elsewhere?
> >
> > On Sun, 19 Dec 1999, Norm Dresner wrote:
> >
> > > there's no static logic failure in the code. The logic in the device
> > > driver is quite simple: when the appropriate read request is received
> > > and the minor device number decoded (which encodes the rate), a call
> > > is made to interruptible_sleep_on(). That's all. When the block is
> > > broken later, the code returns a 1 to signal that one byte has been
> > > transferred -- even though no data was really moved -- and that's all
> > > of the processing in the read routine.
> >
> > About which device driver are you talking here (RT FIFO?)? I can not
> > comment on wake_up_interruptible, but why don't you just use rtf_put
> > from RT kernel code? This should wake up the user-space task blocking
> > on read.
> >
> > --
> > Tomek

--- [rtl] ---
To unsubscribe:
echo "unsubscribe rtl" | mail [EMAIL PROTECTED]
OR echo "unsubscribe rtl <Your_email>" | mail [EMAIL PROTECTED]
----
For more information on Real-Time Linux see: http://www.rtlinux.org/~rtlinux/