If you are using 2.17.U then one possibility is to configure your modules and drivers to not use queue locking at all.  In the Config file put in lines like:

qlock module sscf 0
qlock driver atm 0

Then manage whatever exclusion you think you need yourself.

Another possibility is to run the message processing off of the service procedure instead of the put procedure.

I don't know why 2.16 did not exhibit the same problem with spin locks instead of semaphores.  Semaphores were introduced so that STREAMS put/service procedures could be entered with no spin locks held.  The kernel is getting increasingly fussy about what you can do while holding a spin lock.

The qlock option is intended to help out drivers whose internal model does not fit well with LiS's use of locking.  The idea is that you can defeat the whole mechanism and handle the exclusion yourself, taking advantage of your knowledge of your driver's special cases and design.  It might make sense, for example, for your modules to take a stream-oriented lock, do the processing, and release the lock prior to a putnext or qreply, given that you know it is safe to reenter your driver at that point.  If you aren't calling any kernel routines that are fussy about spin locks then you could use a spin lock; otherwise, a semaphore.  This is but one example.  You could construct others based upon special circumstances and particular driver behavior.

The stream head always has qlock 1 (each side of the queue locked independently), but all the driver and module queues below can utilize other options.

-- Dave

At 01:02 PM 9/8/2004, dan_gora wrote:
Hi Dave.

I am experiencing some problems with LiS 2.17.2 and LiS 2.17.U and I
would like your input.

The problem that I am experiencing is that, on SMP machines, with my
ATM driver and two protocol modules (SSCOP and SSCF), if I am
receiving data at the same time as I perform an ioctl to the driver,
LiS locks up on the first protocol module's read queue in putnext().
A picture will make it easier to explain what is going on:


1)    app does ioctl, for every putnext on the way down the stream
      this thread will grab the write queue's semaphore.  (The ioctl
      message's path is indicated by the single arrow)

           |
           v
   -------------------------------------------
   |  stream head wrq      stream head rdq   |
   -------------------------------------------
           |
           v
   -------------------------------------------
   |     sscf wrq             sscf rdq       |
   -------------------------------------------
           |
           v
   -------------------------------------------
   |     sscop wrq            sscop rdq      |
   -------------------------------------------
           |        //    ^        ^^
           v       vv    /         ||
   -------------------------------------------
   |     atm   wrq             atm  rdq      |
   -------------------------------------------
2)    ATM driver processes ioctl.

3)    During the time that the ioctl is down in the ATM driver's
      ioctl() routine being processed, on another processor, a POLL
      message comes into the ATM driver and is sent up to SSCOP. (The
      POLL message's path is indicated in my crappy ascii drawing by
      the double arrow).  When the ATM driver puts the POLL message
      up to SSCOP, it grabs the SSCOP read queue semaphore and calls
      sscoprput().

4)    SSCOP responds to the POLL by creating a STAT message.  This
      message is sent to the atm driver via a qreply() call from
      sscoprput().  Since the ioctl is being processed in the atm
      driver and the ioctl has the ATM write queue semaphore, this
      STAT message blocks waiting for the ATM driver write queue
      semaphore to be released.

5)    The ATM driver tries to reply to the ioctl by sending an
      M_IOCACK message back up to the application.  The ATM driver
      does a qreply() which tries to grab the SSCOP read queue's
      semaphore.  However, since the POLL message sent from the ATM
      driver already holds the semaphore, the M_IOCACK message
      cannot, and so it blocks too.

6)    Here we are in a deadly embrace and neither thread can proceed.
      In LiS 2.17.2 you can break out by sending an interrupt signal
      to the process (ctl-c).  Since the thread is blocked in
      _down_interruptable(), the signal causes the down to fail, but
      at least the thread can proceed.  This is how I debugged the
      problem, using a combination of the LiS lock tracing and KGDB.

So, it appears to me that this is a generic problem.  If you have two
messages going in opposite directions at the same time that are
qreply()'d in put routines, you are going to end up in this deadly
embrace on SMP machines.  For me, this calls into question the whole
locking strategy used.

One curious thing is that this problem does not occur with LiS
2.16.18.  I noticed that between 2.16.18 and 2.17.2 the queue locks
were changed from spin locks (2.16.18) to semaphores (2.17.2), but
this alone doesn't really explain why you would not end up in the
same situation, just really stuck, since spin locks would not be
interruptible.

The other thing is that in 2.17.U the machine just locks up entirely,
which implies to me that it is not getting stuck in
down_interruptable() but in something else.  I have not spent a lot
of time on 2.17.U because it is much more difficult to debug a
machine that is completely locked up than one that can be partially
recovered and have kgdb used on it.

So, I guess the question is, what to do here?  I am wondering if it
is possible to release the module/driver's lock as soon as the
message leaves that module/driver.  For example, in the scenario
above, why does the ioctl thread need to continue to hold the atm
driver's write queue lock when it has done a qreply()?  Why does it
need to continue to hold any of the locks on the write side of the
stream, for that matter?  Ioctls seem especially treacherous, since
an ioctl will end up holding every queue lock in the stream by the
time the M_IOCACK/NAK arrives back at the stream head read queue.
Does it have to be like this?

I'll continue looking into this, but would like everyone's opinion
about this issue and what may be possible to do about it.

Any questions or comments, please let me know.

thanks-
Dan

_______________________________________________
Linux-streams mailing list
[EMAIL PROTECTED]
http://gsyc.escet.urjc.es/mailman/listinfo/linux-streams
