If it would help in tracking this problem to give someone access to
Sif, I can probably make that happen. Just let me know.
Cheers,
Josh
On May 5, 2009, at 8:08 PM, Eugene Loh wrote:
Jeff Squyres wrote:
On May 5, 2009, at 6:01 PM, Eugene Loh wrote:
You and Terry saw something that was occurring about 0.01% of the time during MPI_Init during add_procs. That does not seem to be what we are seeing here.
Right -- that's what I'm saying. It's different than the MPI_INIT errors.
I was trying to say that there are two kinds of MPI_Init errors. One, which you and Terry have seen, is in add_procs and shows up about 0.01% of the time. The other is not in add_procs and occurs more like 1% of the time. I'm not really sure what "1%" means; it isn't always 1%. But the times I've seen it, it has been in MTT runs in which there are dozens of failures among thousands of runs.
But we have seen failures in 1.3.1 and 1.3.2 that look like the one here. They occur more like 1% of the time and can occur during MPI_Init *OR* later during a collective call. What we're looking at here seems to be related. E.g., see
http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
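For reference, the pattern being described shows up even in trivial programs. A minimal sketch (hypothetical, not the actual MTT test case) of the kind of run that hits it, either inside MPI_Init itself or in the first collective over the sm BTL, would look something like this:

/* Hypothetical minimal reproducer sketch for the intermittent sm failures
 * discussed above; not the actual MTT test. Compile with mpicc and run
 * many times on one node -- the error is intermittent, not deterministic. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;

    /* Failure mode 1: the error can occur during startup (MPI_Init). */
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Failure mode 2: or later, in the first collective call. */
    if (rank == 0) value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d of %d got %d\n", rank, size, value);
    MPI_Finalize();
    return 0;
}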
Good to see that we're agreeing.
Yes, I agree that this is not a new error, but it is worth fixing. Cisco's MTT didn't run last night because there was no new trunk tarball. I'll check Cisco's MTT tomorrow morning and see whether there are any sm failures of this new flavor, and how frequently they're happening.
I just took a stroll down memory lane, and these errors seem to be harder to find than I thought. But I got some:
http://www.open-mpi.org/mtt/index.php?do_redir=1030 (IU, v1.3.1)
Ah, and http://www.open-mpi.org/mtt/index.php?do_redir=1031 (IU_Sif, v1.3, January: 4/9700 failures)
I'm not sure what to key in on to find these particular errors.
Yeah, worth fixing.