> Again, how can a rank - even if it is yet to be attached to a MPI job
> - get the same fabric endpoint address from its OFI provider as some
> other rank in the system? Is this spawn test doing something crazy
> like attach-detach-attach-detach-etc and a previous address is not
> being removed properly before the next (same) address is inserted
> again?

I have no idea what MPI spawn is doing other than inserting the same address more than once. I was hoping to live quite happily with that ignorance. :)

> I guess I don't understand the intricacies of this MPI spawn problem,
> and it's difficult for me to believe the statement "It is apparently
> non-trivial for the apps to avoid duplicate insertions" without this
> understanding. But, to me, this seems like applications/middleware
> just shouldn't be inserting a fabric endpoint address twice ... at
> least for HPC/MPI anyway. But maybe this duplicate insert scenario can
> still happen in a data center environment?

I asked what it would take for MPI to avoid the duplicate insertion. The response was for it to store a list of inserted addresses mapped to an fi_addr and do a lookup of each address prior to inserting it into an AV. This spawned (ha) my non-trivial comment. See:

https://github.com/ofiwg/libfabric/pull/3931

Dmitry (copied) may be able to provide greater details on the problem.

- Sean
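For illustration only, here is a rough sketch of the bookkeeping described above: a list of already-inserted raw addresses, each mapped to the handle it was given, checked before every insertion. Every name in it (addr_handle_t, av_cache, av_cache_insert_once, fake_av_insert) is hypothetical and is not taken from libfabric or from any MPI implementation. addr_handle_t stands in for fi_addr_t, and the av_insert_fn hook stands in for whatever wrapper would actually call fi_av_insert. The lookup is a plain linear scan; a real implementation would presumably want a hash table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: stand-in for libfabric's fi_addr_t. */
typedef uint64_t addr_handle_t;

/* Hypothetical hook that performs the real AV insertion (for example a
 * thin wrapper around fi_av_insert). Returns 0 on success. */
typedef int (*av_insert_fn)(void *av, const void *raw, size_t len,
                            addr_handle_t *out);

struct av_cache_entry {
    void *raw;             /* private copy of the raw endpoint address */
    size_t len;            /* length of the raw address in bytes */
    addr_handle_t handle;  /* handle returned by the first insertion */
};

struct av_cache {
    struct av_cache_entry *entries;
    size_t count;
    size_t capacity;
};

/* Linear search; returns 1 and sets *out if the address is already known. */
static int av_cache_lookup(const struct av_cache *c, const void *raw,
                           size_t len, addr_handle_t *out)
{
    for (size_t i = 0; i < c->count; i++) {
        if (c->entries[i].len == len &&
            memcmp(c->entries[i].raw, raw, len) == 0) {
            *out = c->entries[i].handle;
            return 1;
        }
    }
    return 0;
}

/* Insert only if the address has not been seen; otherwise reuse the
 * handle recorded earlier. Returns 0 on success, nonzero on failure. */
static int av_cache_insert_once(struct av_cache *c, void *av,
                                av_insert_fn insert, const void *raw,
                                size_t len, addr_handle_t *out)
{
    assert(c && insert && raw && out && len > 0);

    if (av_cache_lookup(c, raw, len, out))
        return 0;  /* duplicate: no second insertion */

    if (c->count == c->capacity) {
        size_t cap = c->capacity ? c->capacity * 2 : 8;
        struct av_cache_entry *grown =
            realloc(c->entries, cap * sizeof(*grown));
        if (!grown)
            return -1;
        c->entries = grown;
        c->capacity = cap;
    }

    void *copy = malloc(len);
    if (!copy)
        return -1;
    memcpy(copy, raw, len);

    int rc = insert(av, raw, len, out);
    if (rc != 0) {
        free(copy);
        return rc;
    }

    c->entries[c->count].raw = copy;
    c->entries[c->count].len = len;
    c->entries[c->count].handle = *out;
    c->count++;
    return 0;
}

static void av_cache_free(struct av_cache *c)
{
    for (size_t i = 0; i < c->count; i++)
        free(c->entries[i].raw);
    free(c->entries);
    c->entries = NULL;
    c->count = 0;
    c->capacity = 0;
}

/* Stand-in backend for experimentation: treats `av` as a call counter
 * and hands out sequential handles. Not a real provider. */
static int fake_av_insert(void *av, const void *raw, size_t len,
                          addr_handle_t *out)
{
    (void)raw;
    (void)len;
    size_t *calls = av;
    *out = (addr_handle_t)(*calls)++;
    return 0;
}
```

A caller would zero-initialize a struct av_cache, route every insertion through av_cache_insert_once, and call av_cache_free at teardown. Even in this toy form the cache has to copy each raw address, keep the list in sync with any removals from the AV, and pay for a lookup on every insert, which gives some sense of why the extra state was described as non-trivial.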
