Developers,
I was about to test 1.3.3rc2 when I saw that 1.3.3 had also escaped. I
tried it, and voila! It solves the issue I reported in May, below.
Thanks for all the work that went into this.
- Bryan
--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico
Bryan Lally wrote:
Developers,
This is my first post to the openmpi developers list. I think I've run
across a race condition in your latest release. Since my demonstrator
is somewhat large and cumbersome, I'd like to know if you already know
about this issue before we start the process of providing code and details.
Basics: Open MPI 1.3.2, Fedora 9, two quad-core x86_64 CPUs in one machine.
Symptoms: our code hangs, always in the same vicinity, usually at the
same place, 10-25% of the time. Sometimes more often, sometimes less.
Our code has run reliably with many MPI implementations for years. We
haven't added anything recently that is a likely culprit. While we have
our own issues, this doesn't feel like one of ours.
We see that there is new code in the shared memory transport between
1.3.1 and 1.3.2. Our code doesn't hang with 1.3.1 (or 1.2.9), only
with 1.3.2.
If we switch to TCP for transport (with mpirun --mca btl tcp,self ...)
we don't see any hangs. Running with --mca btl sm,self results in hangs.
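For reference, the two invocations differ only in BTL selection. A sketch of the commands (the program name and process count are placeholders, not our actual command line):

```shell
# Hangs 10-25% of the time: shared-memory BTL plus loopback
mpirun --mca btl sm,self -np 4 ./our_code

# No hangs observed: TCP BTL instead of shared memory
mpirun --mca btl tcp,self -np 4 ./our_code
```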
If we sprinkle a few (three) calls to MPI_Barrier in the vicinity of the
problem, we no longer see hangs.
We demonstrate this with 4 processes. When we attach a debugger to the
hung processes, we see that the hang results from an MPI_Allreduce. All
processes have made the same call to MPI_Allreduce. The processes are
all in opal_progress, called (with intervening calls) by MPI_Allreduce.
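The backtraces were gathered by attaching to the hung ranks. A minimal sketch of that step (the process name is a placeholder; <pid> is whatever ps reports for each rank):

```shell
# Find the four hung MPI ranks
ps -ef | grep our_code

# Attach to one rank and dump backtraces for all threads;
# each rank sits in opal_progress, reached from MPI_Allreduce
gdb -p <pid> -batch -ex 'thread apply all bt'
```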
My question is, have you seen anything like this before? If not, what
do we do next?
Thanks.
- Bryan