[nwam-dev] [Bug 8759] nwamd locking up when sending a message

[email protected] Fri, 8 May 2009 04:28:31 -0700 (PDT)

http://defect.opensolaris.org/bz/show_bug.cgi?id=8759



amaguire <alan.maguire at sun.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alan.maguire at sun.com
             Status|NEW                         |ACCEPTED




--- Comment #4 from amaguire <alan.maguire at sun.com>  2009-05-08 04:28:24 ---
(In reply to comment #3)
> A quick look at nwam_event_send() suggests one avenue for investigation.  The
> event mutex is held across a bunch of blocking calls including msgsnd() which
> is in the stack of thread 1.  Even if this turns out not to be the root cause
> for this bug another bug should be written to fix this.  Mutexes should not
> be held like this.

This is certainly a problem, but it is unlikely to be the cause in this case:
no other threads appear to be holding the event mutex, and the thread is
actually blocked in the msgsnd() call itself. The likely cause here is one of
the documented reasons for msgsnd(2) blocking, i.e.

         o    The number of bytes already on the queue  is  equal
              to msg_qbytes. See Intro(2).

         o    The total number of messages  on  the  queue  would
              exceed  the  maximum  allowed  by  the  system. See
              NOTES.

I suspect the reason here is that a lot of old message queues haven't been
cleaned up, and as a consequence either we've filled one up to msg_qbytes with
messages, or exceeded the total number of messages. ipcs -qa shows 19 of them.
I don't think we see a syslog message here since the default resource control
action is to deny but not to log. It might make sense to change this on the
off chance it recurs, i.e.

# rctladm -e syslog process.max-msg-qbytes
# rctladm -e syslog process.max-msg-messages
# rctladm -u
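
To check whether a particular queue is actually near either limit, its
counters can be read with msgctl(2) and IPC_STAT. What follows is only a
rough sketch, not nwamd code: the sketch_queue_stats structure and the
sketch_queue_stats_get() function are names made up for illustration.
It reports the number of queued messages, the msg_qbytes limit, and when
(and by which pid) the queue was last read.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <time.h>

/* Hypothetical snapshot of the fields relevant to this bug. */
struct sketch_queue_stats {
	unsigned long	qnum;		/* messages currently on the queue */
	unsigned long	qbytes;		/* msg_qbytes limit for the queue */
	time_t		last_recv;	/* time of last msgrcv(), 0 if none */
	pid_t		last_recv_pid;	/* pid of last msgrcv() caller */
};

/*
 * Fill *out from msgctl(IPC_STAT).  Returns 0 on success, -1 on error
 * (errno is left as set by msgctl()).
 */
static int
sketch_queue_stats_get(int msqid, struct sketch_queue_stats *out)
{
	struct msqid_ds ds;

	if (msgctl(msqid, IPC_STAT, &ds) == -1)
		return (-1);

	out->qnum = (unsigned long)ds.msg_qnum;
	out->qbytes = (unsigned long)ds.msg_qbytes;
	out->last_recv = ds.msg_rtime;
	out->last_recv_pid = ds.msg_lrpid;
	return (0);
}
```

A queue with a non-zero qnum and a last_recv time that is either zero or
far in the past would be consistent with a client that has stopped
listening.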

I can't reproduce this at my end yet, but I've found a bug in the
"is this a dead event queue?" logic - unfortunately it's one
that would suggest more enthusiastic reaping of old event queues
rather than less (the latter of which would lead to 19 of them 
hanging around).

Here's a way to test. Run "nwamadm interact" and after a little
while, hit Ctrl-C. Check for an nwam_event_msgs.<pid> file in 
/etc/svc/volatile/nwam, noting the pid. Refresh nwamd a large number 
of times (>20). Has that nwam_event_msgs.<pid> file disappeared?
It should have - this is what I'm seeing at least.

Assuming that a full queue is indeed the cause, we could change the
msgsnd() call to specify IPC_NOWAIT, so that it returns immediately
(failing with EAGAIN) if the message can't be sent. The conditions
that cause this are likely indicators that the queue is no longer
being listened to on the client side (a full queue of sent messages
that were never read), so such a failure should probably result in
the event queue being reaped. It'd be good to verify the root cause
before proceeding with this change though.
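
A minimal sketch of what that non-blocking send might look like, under
stated assumptions: the message layout, the result codes and the
function names below are all hypothetical and are not taken from
nwam_event_send(). Only msgsnd(2), msgctl(2), IPC_NOWAIT and IPC_RMID
are real interfaces.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Hypothetical message layout; the real nwam event message differs. */
struct sketch_event_msg {
	long	mtype;		/* must be > 0 for msgsnd() */
	char	mtext[64];
};

/* Hypothetical result codes for the sketch. */
enum { SKETCH_ERROR = -1, SKETCH_SENT = 0, SKETCH_QUEUE_FULL = 1 };

/* Send without blocking; report a full queue separately from errors. */
static int
sketch_event_send(int msqid, const char *text)
{
	struct sketch_event_msg msg;

	(void) memset(&msg, 0, sizeof (msg));
	msg.mtype = 1;
	(void) strncpy(msg.mtext, text, sizeof (msg.mtext) - 1);

	/* msgsz counts only mtext, not the leading mtype field. */
	if (msgsnd(msqid, &msg, sizeof (msg.mtext), IPC_NOWAIT) == 0)
		return (SKETCH_SENT);
	if (errno == EAGAIN)
		return (SKETCH_QUEUE_FULL);	/* would have blocked */
	return (SKETCH_ERROR);
}

/* As above, but remove the queue if it turns out to be full. */
static int
sketch_event_send_or_reap(int msqid, const char *text)
{
	int rc = sketch_event_send(msqid, text);

	if (rc == SKETCH_QUEUE_FULL)
		(void) msgctl(msqid, IPC_RMID, NULL);
	return (rc);
}
```

Reaping on the first EAGAIN is the simplest policy; a real change might
prefer to tolerate a few failures first, so that a client which is merely
slow does not lose its queue.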

-- 
Configure bugmail: http://defect.opensolaris.org/bz/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
You are the assignee for the bug.
