Actually I don't think this will help. I looked on MTT and there are
no errors related to this (logically all reductions should have
failed) ... and MTT is supposed to run on several platforms. What
happens inside is really strange, but as we make the same mistake when
we look up the op as when we store it, this works in most cases.
Moreover, even with the op corrected we still see segfaults, and it
looks more and more like some memory overwrite problem... Before the
commit we even tested it on a SiCortex machine (which is clearly a
different architecture than x86_64) and this didn't trigger any
errors either.
Regarding the latency issue, there is not much to say about it. The
platform we tested on is clearly older than what other people test on,
but that is all there is to it. The two versions (before and after the
datatype move) have the same latency, so there is no reason to focus
on the latency number.
george.
On Jul 15, 2009, at 12:18 , Jeff Squyres wrote:
Perhaps we should add a requirement for testing on 2-3 different
systems before long-term (or "big change") branches like this come
to the trunk? I say this because it seems like at least some of
these problems were based on bad luck -- i.e., the stuff worked on
the platform that it was being tested and developed on, even though
there are bugs left. Having fallen victim to this myself many times
("worked for me on Cisco machines! I dunno why it's failing for
you... :-("), I think we all recognize the value of just running the
same code on someone else's systems -- it has a good tendency to
turn up issues that don't show up on yours. I'm not trying to say
that every little trunk commit needs to be validated -- but "big"
changes like this could certainly benefit from multiple validations.
Cisco is very willing to be a 2nd platform for testing for stuff
that we can run without too much trouble, especially via MTT (e.g.,
I already have the right kind of networks to test, etc.).
BTW, is anyone going to comment about the latency issue that I asked
about?
(in case you can't tell, I'm moderately displeased about how this
whole branch came to the trunk... :-\ )
On Jul 15, 2009, at 12:04 PM, Rainer Keller wrote:
Hi Jeff,
Ralph and Edgar forwarded an email about this.
We (George and myself) are currently looking into this.
With the changes we have, I can get IBM/spawn to work "sometimes", i.e.,
sometimes it segfaults.
Thanks,
Rainer
On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote:
> I [very briefly] read about the DDT spawn issues, so I went to look
> at ompi/op/op.c. I notice that there's a new comment above the op
> datatype<-->op map construction area that says:
>
> /* XXX TODO */
>
> svn blame says:
>
> 21641 rusraink /* XXX TODO */
>
> r21641 is the big merge from the past weekend where the DDT split
> came in.
>
> Has this area been looked at and the comment is out of date? Or
> does it need to be updated with new mappings? (I honestly have not
> looked any farther than this -- the new comment caught my eye)
--
------------------------------------------------------------------------
Rainer Keller, PhD Tel: +1 (865) 241-6293
Oak Ridge National Lab Fax: +1 (865) 241-4811
PO Box 2008 MS 6164 Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008 AIM/Skype: rusraink
--
Jeff Squyres
Cisco Systems