Re: [OMPI devel] DDT and spawn issue?

George Bosilca Wed, 15 Jul 2009 15:57:25 -0400

Actually I don't think this will help. I looked on MTT and there are no errors related to this (logically all reductions should have failed) ... and MTT is supposed to run on several platforms. What happens inside is really strange, but as we make the same mistake when we look up the op as when we store it, this works in most cases. Moreover, even with the op corrected we still see segfaults, and it looks more and more like some memory overwrite problem... Before the commit we even tested it on a SiCortex machine (which is clearly a different architecture than x86_64) and this didn't trigger any errors either.

Regarding the latency issue, there is not much to say about it. The platform we tested on is clearly older than what other people test on, but that is all there is to it. The two versions (before and after the data-type move) have the same latency, so there is no reason to focus on the latency number.


  george.


On Jul 15, 2009, at 12:18 , Jeff Squyres wrote:

Perhaps we should add a requirement for testing on 2-3 different systems before long-term (or "big change") branches like this come to the trunk? I say this because it seems like at least some of these problems were based on bad luck -- i.e., the stuff worked on the platform that it was being tested and developed on, even though there are bugs left. Having fallen victim to this myself many times ("worked for me on Cisco machines! I dunno why it's failing for you... :-("), I think we all recognize the value of just running the same code on someone else's systems -- it has a good tendency to turn up issues that don't show up on yours. I'm not trying to say that every little trunk commit needs to be validated -- but "big" changes like this could certainly benefit from multiple validations.

Cisco is very willing to be a 2nd platform for testing for stuff that we can run without too much trouble, especially via MTT (e.g., I already have the right kind of networks to test, etc.).

BTW, is anyone going to comment about the latency issue that I asked about?

(in case you can't tell, I'm moderately displeased about how this whole branch came to the trunk... :-\ )

On Jul 15, 2009, at 12:04 PM, Rainer Keller wrote:

Hi Jeff,
Ralph and Edgar forwarded an email about this.
We (George and myself) are currently looking into this.

With the changes we have, I can get IBM/spawn to work "sometimes", i.e.,
sometimes it segfaults.

Thanks,
Rainer




On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote:
> I [very briefly] read about the DDT spawn issues, so I went to look at
> ompi/op/op.c.  I notice that there's a new comment above the op
> datatype<-->op map construction area that says:
>
>      /* XXX TODO */
>
> svn blame says:
>
>   21641   rusraink     /* XXX TODO */
>
> r21641 is the big merge from the past weekend where the DDT split came
> in.
>
> Has this area been looked at and the comment is out of date? Or does
> it need to be updated with new mappings? (I honestly have not looked
> any farther than this -- the new comment caught my eye)

--
------------------------------------------------------------------------
Rainer Keller, PhD                  Tel: +1 (865) 241-6293
Oak Ridge National Lab          Fax: +1 (865) 241-4811
PO Box 2008 MS 6164           Email: [email protected]
Oak Ridge, TN 37831-2008    AIM/Skype: rusraink
--
Jeff Squyres
Cisco Systems

_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
