> Again, how can a rank - even if it is yet to be attached to an MPI job
> - get the same fabric endpoint address from its OFI provider as some
> other rank in the system? Is this spawn test doing something crazy
> like attach-detach-attach-detach-etc and a previous address is not
> being removed properly before the next (same) address is inserted
> again?

I have no idea what MPI spawn is doing other than inserting the same address 
more than once.  I was hoping to live quite happily with that ignorance.  :)


> I guess I don't understand the intricacies of this MPI spawn problem,
> and it's difficult for me to believe the statement "It is apparently
> non-trivial for the apps to avoid duplicate insertions" without this
> understanding. But, to me, this seems like applications/middleware
> just shouldn't be inserting a fabric endpoint address twice ... at
> least for HPC/MPI anyway. But maybe this duplicate insert scenario can
> still happen in a data center environment?

I asked what it would take for MPI to avoid the duplicate insertion.  The
response was that MPI would have to keep a table mapping every inserted
address to its fi_addr and look up each address in that table before
inserting it into an AV.  This spawned (ha) my non-trivial comment.
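
For illustration, the bookkeeping described above might look roughly like the
sketch below. It is only a sketch under stated assumptions, not what MPI or
libfabric actually does: peer_handle_t stands in for fi_addr_t,
mock_av_insert() stands in for the real AV insertion call, and MAX_ADDR_LEN,
MAX_PEERS, and the addr_cache names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ADDR_LEN 64    /* assumed upper bound on a raw endpoint address */
#define MAX_PEERS    1024  /* assumed upper bound on cached peers */

typedef uint64_t peer_handle_t;  /* hypothetical stand-in for fi_addr_t */

struct addr_cache_entry {
    unsigned char raw[MAX_ADDR_LEN];  /* raw address bytes as inserted */
    size_t len;
    peer_handle_t handle;             /* handle returned by the insert */
};

struct addr_cache {
    struct addr_cache_entry entries[MAX_PEERS];
    size_t count;
    peer_handle_t next_handle;  /* used only by the stand-in insert below */
};

/* Stand-in for the real AV insertion: hands out sequential handles. */
static peer_handle_t mock_av_insert(struct addr_cache *c)
{
    return c->next_handle++;
}

/* Return the handle for addr, inserting into the AV only if it is unseen. */
static peer_handle_t addr_cache_get(struct addr_cache *c,
                                    const void *addr, size_t len)
{
    assert(len <= MAX_ADDR_LEN);

    /* Look the address up first so a duplicate is never inserted. */
    for (size_t i = 0; i < c->count; i++) {
        if (c->entries[i].len == len &&
            memcmp(c->entries[i].raw, addr, len) == 0)
            return c->entries[i].handle;
    }

    /* Not seen before: insert it and remember the mapping. */
    assert(c->count < MAX_PEERS);
    struct addr_cache_entry *e = &c->entries[c->count++];
    memcpy(e->raw, addr, len);
    e->len = len;
    e->handle = mock_av_insert(c);
    return e->handle;
}
```

A zero-initialized struct addr_cache is an empty cache. The linear scan keeps
the sketch short, but it costs O(n) per lookup; at scale the caller would
likely need a hash table keyed on the raw address, which is part of why this
reads as non-trivial.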

See:
https://github.com/ofiwg/libfabric/pull/3931

Dmitry (copied) may be able to provide greater details on the problem.

- Sean
_______________________________________________
ofiwg mailing list
[email protected]
http://lists.openfabrics.org/mailman/listinfo/ofiwg
