Actually I don't think this will help. I looked on MTT and there are no errors related to this (logically all reductions should have failed) ... and MTT is supposed to run on several platforms. What happens inside is really strange, but since we make the same mistake when we look up the op as when we store it, this works in most cases. Moreover, even with the op corrected we still see segfaults, and it looks more and more like a memory overwrite problem... Before the commit we even tested it on a SiCortex machine (which is clearly a different architecture from x86_64) and this didn't trigger any errors either.

Regarding the latency issue, there is not much to say about it. The platform we tested on is clearly older than what other people test on, but that is all there is to it. The two versions (before and after the datatype move) have the same latency, so there is no reason to focus on the latency number.

  george.


On Jul 15, 2009, at 12:18 , Jeff Squyres wrote:

Perhaps we should add a requirement for testing on 2-3 different systems before long-term (or "big change") branches like this come to the trunk? I say this because it seems like at least some of these problems were based on bad luck -- i.e., the stuff worked on the platform that it was being tested and developed on, even though there are bugs left. Having fallen victim to this myself many times ("worked for me on Cisco machines! I dunno why it's failing for you... :-("), I think we all recognize the value of just running the same code on someone else's systems -- it has a good tendency to turn up issues that don't show up on yours. I'm not trying to say that every little trunk commit needs to be validated -- but "big" changes like this could certainly benefit from multiple validations.

Cisco is very willing to be a 2nd platform for testing for stuff that we can run without too much trouble, especially via MTT (e.g., I already have the right kind of networks to test, etc.).

BTW, is anyone going to comment about the latency issue that I asked about?

(in case you can't tell, I'm moderately displeased about how this whole branch came to the trunk... :-\ )



On Jul 15, 2009, at 12:04 PM, Rainer Keller wrote:

Hi Jeff,
Ralph and Edgar forwarded an email about this.
We (George and myself) are currently looking into this.

With the changes we have I can get IBM/spawn to work "sometimes", i.e.
sometimes it segfaults.

Thanks,
Rainer




On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote:
> I [very briefly] read about the DDT spawn issues, so I went to look at
> ompi/op/op.c.  I notice that there's a new comment above the op
> datatype<-->op map construction area that says:
>
>      /* XXX TODO */
>
> svn blame says:
>
>   21641   rusraink     /* XXX TODO */
>
> r21641 is the big merge from the past weekend where the DDT split came
> in.
>
> Has this area been looked at and the comment is out of date? Or does
> it need to be updated with new mappings? (I honestly have not looked
> any farther than this -- the new comment caught my eye)

--
------------------------------------------------------------------------
Rainer Keller, PhD                  Tel: +1 (865) 241-6293
Oak Ridge National Lab          Fax: +1 (865) 241-4811
PO Box 2008 MS 6164           Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008    AIM/Skype: rusraink





--
Jeff Squyres
Cisco Systems

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
