> mlx4_create_qp and mlx4_destroy_qp are not atomic WRT each other. If one
> thread is destroying a QP while another is creating a qp, there is a race
> hole. The destroying thread can lose its timeslice after it has deleted
> the QP from kernel space, but before it has cleared it from userspace
> store (mlx4_clear_qp).
> If the other thread creates a qp during this break, it gets the same QP
> base number and overwrites the destroyed QPs entry with mlx4_store_qp().

Yes, looks like a real bug.

> 2. Create a mutex for this purpose, and use it to force the create and
> destroy qp operations to be atomic WRT the ibv_cmd_xxx_qp operations and
> the store/clear qp operations.

This looks like the best solution. I wonder if we should just add this
synchronization in libibverbs rather than individual drivers? I notice
that libcxgb3 seems to have the same bug AFAICS. But maybe it's better to
just keep the simple rule that driver libraries are responsible for
locking their own data structures.

 - R.
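
For illustration, the mutex approach from option 2 could look roughly like
the sketch below. This is not the libmlx4 code. Every demo_* name is
hypothetical, the "kernel" is simulated by an array that hands out the
lowest free QP number (so a destroyed number is reused right away), and
the table stands in for the userspace store that mlx4_store_qp() and
mlx4_clear_qp() maintain. The only point is that the kernel-side step and
the table update happen under one lock in both paths.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_QPS 4	/* hypothetical table size, for the sketch only */

struct demo_qp {
	uint32_t qpn;
};

/* Stand-in for kernel state: which QP numbers are currently allocated. */
static int demo_kernel_in_use[DEMO_MAX_QPS];

/* Stand-in for the userspace qpn -> qp store. */
static struct demo_qp *demo_qp_table[DEMO_MAX_QPS];

/* One mutex covering both the "kernel" call and the table update. */
static pthread_mutex_t demo_qp_table_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Simulated create command: hands out the lowest free QP number. */
static int demo_kernel_create(uint32_t *qpn)
{
	uint32_t i;

	for (i = 0; i < DEMO_MAX_QPS; i++) {
		if (!demo_kernel_in_use[i]) {
			demo_kernel_in_use[i] = 1;
			*qpn = i;
			return 0;
		}
	}
	return -1;
}

/* Simulated destroy command: the number becomes reusable immediately. */
static void demo_kernel_destroy(uint32_t qpn)
{
	demo_kernel_in_use[qpn] = 0;
}

int demo_create_qp(struct demo_qp *qp)
{
	int ret;

	pthread_mutex_lock(&demo_qp_table_mutex);
	ret = demo_kernel_create(&qp->qpn);
	if (ret == 0) {
		/* With the lock held, a reused slot must already be clear. */
		assert(demo_qp_table[qp->qpn] == NULL);
		demo_qp_table[qp->qpn] = qp;
	}
	pthread_mutex_unlock(&demo_qp_table_mutex);
	return ret;
}

void demo_destroy_qp(struct demo_qp *qp)
{
	pthread_mutex_lock(&demo_qp_table_mutex);
	demo_kernel_destroy(qp->qpn);
	/* No create can run between the destroy and this clear. */
	demo_qp_table[qp->qpn] = NULL;
	pthread_mutex_unlock(&demo_qp_table_mutex);
}

struct demo_qp *demo_find_qp(uint32_t qpn)
{
	struct demo_qp *qp = NULL;

	pthread_mutex_lock(&demo_qp_table_mutex);
	if (qpn < DEMO_MAX_QPS)
		qp = demo_qp_table[qpn];
	pthread_mutex_unlock(&demo_qp_table_mutex);
	return qp;
}
```

Without the lock, a create that reuses the number could store its entry
before the destroying thread runs its clear, and the clear would then wipe
out the new QP's entry. Holding the mutex across the blocking command does
serialize all creates and destroys, which seems an acceptable cost since
neither is a fast-path operation.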
