On May 5, 2009, at 6:01 PM, Eugene Loh wrote:
> You and Terry saw something that was occurring about 0.01% of the time
> during MPI_Init during add_procs. That does not seem to be what we are
> seeing here.
Right -- that's what I'm saying.  It's different from the MPI_INIT
errors.

But we have seen failures in 1.3.1 and 1.3.2 that look like the one
here.  They occur more like 1% of the time, and can occur during
MPI_Init *OR* later during a collective call.  What we're looking at
here seems to be related.  E.g., see:

    http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
> Good to see that we're agreeing.
Yes, I agree that this is not a new error, but it is worth fixing.

Cisco's MTT didn't run last night because there was no new trunk
tarball.  I'll check Cisco's MTT tomorrow morning and see whether there
are any sm failures of this new flavor, and how frequently they're
happening.
--
Jeff Squyres
Cisco Systems