> I see hangs killing opensm related to a bug in user_mad.c.  The problem
> appears to be:
>
> ib_umad_close()
>     downgrade_write(&file->port->mutex)
>     ib_unregister_mad_agent(...)
>     up_read(&file->port->mutex)
>
> ib_unregister_mad_agent() flushes any outstanding MADs, resulting in calls to
> send_handler() and recv_handler(), both of which call queue_packet():
>
> queue_packet()
>     down_read(&file->port->mutex)
>     ...
>     up_read(&file->port->mutex)

This should be fine (and comes from an earlier set of changes to fix
deadlocks): ib_umad_close() does a downgrade_write() before calling
ib_unregister_mad_agent(), so it holds the mutex only as a reader, which
means that queue_packet() should be able to take another read lock.
Unless there's something that prevents one thread from taking a read
lock twice?

What kernel are you seeing these problems with?

> Does anyone know the reasoning for holding the mutex around
> ib_unregister_mad_agent()?

It's to keep things serialized against a port disappearing because a
device is being removed.  But looking at things, I think we can probably
rejigger the locking to make things simpler, and avoid the use of
downgrade_write(), which the -rt people don't like.

 - R.
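
One way a nested read lock could still hang is sketched below. It is a toy, single-threaded model, not the kernel's struct rw_semaphore, and every toy_* name is hypothetical. It assumes the semaphore makes new readers wait behind a queued writer, which is an assumption here and not something this thread confirms. Under that assumption the nested down_read() in queue_packet() is granted when nobody else wants the lock, but would sleep once another thread has queued for write while ib_umad_close() still holds its read lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a writer-fair reader/writer semaphore (hypothetical). */
struct toy_rwsem {
    int  readers;          /* current read holders */
    bool writer;           /* true while a writer holds the lock */
    int  writers_waiting;  /* writers queued behind the holders */
};

/* Returns true if the read lock is granted, false if the caller would sleep. */
static bool toy_down_read(struct toy_rwsem *s)
{
    /* Assumed fairness rule: a queued writer blocks new readers. */
    if (s->writer || s->writers_waiting > 0)
        return false;
    s->readers++;
    return true;
}

/* Returns true if the write lock is granted, false if the caller queues. */
static bool toy_down_write(struct toy_rwsem *s)
{
    if (s->writer || s->readers > 0) {
        s->writers_waiting++;
        return false;
    }
    s->writer = true;
    return true;
}

/* Turn a held write lock into a held read lock. */
static void toy_downgrade_write(struct toy_rwsem *s)
{
    assert(s->writer);
    s->writer = false;
    s->readers = 1;
}

static void toy_up_read(struct toy_rwsem *s)
{
    assert(s->readers > 0);
    s->readers--;
}

/* Close path with no other lock users: the nested read is granted. */
static bool nested_read_without_writer(void)
{
    struct toy_rwsem s = {0};
    toy_down_write(&s);            /* ib_umad_close() takes the lock */
    toy_downgrade_write(&s);       /* ...and downgrades before unregistering */
    bool ok = toy_down_read(&s);   /* queue_packet() from the flush */
    if (ok)
        toy_up_read(&s);
    toy_up_read(&s);
    return ok;
}

/* Same path, but another thread queues for write in between. */
static bool nested_read_with_queued_writer(void)
{
    struct toy_rwsem s = {0};
    toy_down_write(&s);
    toy_downgrade_write(&s);
    toy_down_write(&s);            /* a second thread; it queues */
    return toy_down_read(&s);      /* queue_packet() would sleep here */
}
```

In this model nested_read_without_writer() reports that the lock is granted and nested_read_with_queued_writer() reports that it is not. In the second case the close path waits on the flush, the flush waits on the queued writer, and the writer waits on the close path, so none of them can make progress.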
