If it would help in tracking this problem to give someone access to Sif, I can probably make that happen. Just let me know.

Cheers,
Josh

On May 5, 2009, at 8:08 PM, Eugene Loh wrote:

Jeff Squyres wrote:

On May 5, 2009, at 6:01 PM, Eugene Loh wrote:

You and Terry saw something that was occurring about 0.01% of the time during MPI_Init, in add_procs. That does not seem to be what we are seeing here.

Right -- that's what I'm saying. It's different than the MPI_INIT errors.

I was trying to say that there are two kinds of MPI_Init errors. One, which you and Terry have seen, is in add_procs and shows up about 0.01% of the time. The other is not in add_procs and occurs more like 1% of the time. I'm not really sure what "1%" means; it isn't always 1%. But the times I've seen it have been in MTT runs in which there are dozens of failures among thousands of runs.

But we have seen failures in 1.3.1 and 1.3.2 that look like the one here. They occur more like 1% of the time and can occur during MPI_Init *OR* later during a collective call. What we're looking at here seems to be related. E.g., see http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
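
For reference, a minimal test of the kind that hits both of those failure points would look roughly like the sketch below. This is my own illustration, not one of the actual MTT tests; the assumption is simply that any sm-enabled job that calls MPI_Init and then an early collective exercises both places where the intermittent failures have been reported.

    /* Minimal sketch (not taken from MTT): calls MPI_Init and then an
     * early collective, the two spots where these intermittent sm
     * failures have been reported. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);            /* ~1% failures reported here  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);       /* ...or here, in a collective */
        if (rank == 0) {
            printf("ok\n");
        }
        MPI_Finalize();
        return 0;
    }

Running something like that repeatedly (thousands of times, as MTT does) is what surfaces a failure rate on the order of 1%.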

Good to see that we're agreeing.

Yes, I agree that this is not a new error, but it is worth fixing. Cisco's MTT didn't run last night because there was no new trunk tarball. I'll check Cisco's MTT tomorrow morning and see if there are any sm failures of this new flavor, and how frequently they're happening.

I just took a stroll down memory lane, and these errors seem to be harder to find than I thought. But I got some: http://www.open-mpi.org/mtt/index.php?do_redir=1030 (IU, v1.3.1)

Ah, and http://www.open-mpi.org/mtt/index.php?do_redir=1031 (IU_Sif, v1.3, January: 4/9700 failures)

I'm not sure what to key in on to find these particular errors.

Yeah, worth fixing.