 > I see hangs killing opensm related to a bug in user_mad.c.  The problem
 > appears to be:
 > 
 > ib_umad_close()
 >      downgrade_write(&file->port->mutex)
 >      ib_unregister_mad_agent(...)
 >      up_read(&file->port->mutex)
 > 
 > ib_unregister_mad_agent() flushes any outstanding MADs, resulting in calls to
 > send_handler() and recv_handler(), both of which call queue_packet():
 > 
 > queue_packet()
 >      down_read(&file->port->mutex)
 >      ...
 >      up_read(&file->port->mutex)

This should be fine (and comes from an earlier set of changes to fix
deadlocks): ib_umad_close() does a downgrade_write() before calling
ib_unregister_mad_agent(), so it only holds the mutex with a read
lock, which means that queue_packet() should be able to take another
read lock.

Unless there's something that prevents one thread from taking a read
lock twice?  What kernel are you seeing these problems with?

 > Does anyone know the reasoning for holding the mutex around
 > ib_unregister_mad_agent()?

It's to keep things serialized against a port disappearing because a
device is being removed.  But looking at things, I think we can
probably rejigger the locking to make things simpler, and avoid the
use of downgrade_write(), which the -rt people don't like.

 - R.