The failures are related to MPI spawn tests. This happens with Intel MPI, but I suspect MPICH or other MPIs may have similar problems with this test.
> -----Original Message-----
> From: Blocksome, Michael
> Sent: Tuesday, March 20, 2018 11:29 AM
> To: Hefty, Sean <[email protected]>; [email protected]
> Subject: RE: inserting duplicate addresses into an AV
>
> Which application, or which MPI, is inserting duplicate addresses? I
> don't see how MPI could be doing this. At least the MPI
> implementations I'm familiar with use PMI1, PMI2, or PMIx to exchange
> addresses at job startup into a distributed key-value store, and then
> after a barrier each MPI rank initializes its av with all these unique
> addresses. For a duplicate address to happen multiple MPI ranks would
> have to get the *same* local address from the OFI provider - how would
> that happen?
>
> Some providers, like bgq, can stuff all the fabric address information
> within the 64 bits of fi_addr_t, which basically makes the
> fi_av_insert() call a noop in FI_AV_MAP mode. So if this duplicate
> address problem happened on bgq it would still "just work" from the
> provider's perspective. Now MPI (or whatever is using the provider)
> might get messed up because of it, but the fabric communication
> operations would still work.
>
> Mike
>
> -----Original Message-----
> From: ofiwg [mailto:[email protected]] On Behalf Of Hefty, Sean
> Sent: Tuesday, March 20, 2018 11:54 AM
> To: [email protected]
> Subject: [ofiwg] inserting duplicate addresses into an AV
>
> MPI is hitting into an issue that is the result of inserting the same
> address into an AV more than once. There is no defined behavior for
> what a provider should do in this case. At least one provider allows
> the duplicate insertion, and at least one fails the call... and
> neither work with MPI when this occurs. :/
>
> There are a couple of problems trying to define this. In the case of
> the provider that fails the call, the failure is detected when
> attempting to insert the same address into a hash table. However, not
> all providers are easily able to detect duplicates. Forcing them to
> do so _may_ require the provider to perform a linear search over the
> AV looking for a duplicate for every address that is inserted. At
> scale, this is a significant overhead.
>
> Even if the decision is made to force detecting duplicates (maybe even
> making this an AV option), there's the question of how a provider
> should respond. Should it insert the address twice -- creating a new
> fi_addr for it, discard the duplicate -- and return the existing
> fi_addr, or generate an error. And does it matter if AV_TABLE or MAP
> is used?
>
> We need to know what applications need here, and how difficult it will
> be for providers to detect duplicates. It is apparently non-trivial
> for the apps to avoid duplicate insertions.
>
> - Sean
>
> _______________________________________________
> ofiwg mailing list
> [email protected]
> http://lists.openfabrics.org/mailman/listinfo/ofiwg
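
To make the three possible responses in Sean's message concrete (insert the address twice and hand back a new index, discard the duplicate and hand back the existing index, or fail the call), here is a minimal sketch of a toy address vector that detects duplicates with a hash table instead of a linear search. This is not libfabric code and not how any provider implements fi_av_insert(): every toy_* name, the fixed 8-byte address, the table sizes, and the error codes are assumptions made up for illustration, and the returned index only loosely stands in for an FI_AV_TABLE-style fi_addr.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ADDR_LEN    8    /* fixed raw address size (assumption) */
#define TOY_AV_CAPACITY 64   /* max entries in the toy AV */
#define TOY_AV_BUCKETS  128  /* power of two, larger than capacity */

#define TOY_ERR_DUP  (-1)    /* hypothetical error codes */
#define TOY_ERR_FULL (-2)

enum toy_dup_policy {
    TOY_DUP_ALLOW,  /* insert again, hand back a new index */
    TOY_DUP_REUSE,  /* discard the duplicate, hand back the existing index */
    TOY_DUP_ERROR   /* fail the call */
};

struct toy_av {
    enum toy_dup_policy policy;
    int count;
    uint8_t addrs[TOY_AV_CAPACITY][TOY_ADDR_LEN];
    int buckets[TOY_AV_BUCKETS];  /* index of first copy, or -1 if empty */
};

static void toy_av_init(struct toy_av *av, enum toy_dup_policy policy)
{
    av->policy = policy;
    av->count = 0;
    for (int i = 0; i < TOY_AV_BUCKETS; i++)
        av->buckets[i] = -1;
}

/* FNV-1a over the raw address bytes, reduced to a bucket number. */
static uint32_t toy_bucket(const uint8_t *addr)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < TOY_ADDR_LEN; i++) {
        h ^= addr[i];
        h *= 16777619u;
    }
    return h & (TOY_AV_BUCKETS - 1);
}

/*
 * Linear probing. Stops at the bucket holding addr or at the first empty
 * bucket, stores that bucket in *slot, and returns the stored index
 * (-1 when addr is not present). The table is never full, so this ends.
 */
static int toy_av_find(const struct toy_av *av, const uint8_t *addr,
                       uint32_t *slot)
{
    uint32_t s = toy_bucket(addr);
    while (av->buckets[s] != -1) {
        if (memcmp(av->addrs[av->buckets[s]], addr, TOY_ADDR_LEN) == 0)
            break;
        s = (s + 1) & (TOY_AV_BUCKETS - 1);
    }
    *slot = s;
    return av->buckets[s];
}

/* Returns 0 and sets *index_out on success, or a TOY_ERR_* code. */
static int toy_av_insert(struct toy_av *av, const uint8_t *addr,
                         int *index_out)
{
    uint32_t slot;
    int existing;

    assert(av && addr && index_out);
    existing = toy_av_find(av, addr, &slot);
    if (existing >= 0 && av->policy == TOY_DUP_ERROR)
        return TOY_ERR_DUP;
    if (existing >= 0 && av->policy == TOY_DUP_REUSE) {
        *index_out = existing;
        return 0;
    }
    if (av->count == TOY_AV_CAPACITY)
        return TOY_ERR_FULL;

    memcpy(av->addrs[av->count], addr, TOY_ADDR_LEN);
    if (existing < 0)
        av->buckets[slot] = av->count;  /* table tracks the first copy only */
    *index_out = av->count++;
    return 0;
}
```

With a table like this, each insert costs an expected constant number of probes rather than a scan of the whole AV, at the price of extra memory per AV. Whether a given provider can afford that, and which of the three policies applications actually need, is the open question in the thread.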
