Re: [OMPI devel] Hang in collectives involving shared memory

Bryan Lally Tue, 16 Jun 2009 15:39:33 -0400

Ashley Pittman wrote:

Whilst the fact that it appears to only happen on your machine implies
it's not a general problem with OpenMPI the fact that it happens in the
same location/rep count every time does swing the blame back the other
way.

This sounds a _lot_ like the problem I was seeing, my initial message is appended here. If it's the same thing, then it's not only on the big machines here that Ralph was talking about, but on very vanilla Fedora 7 and 9 boxes.

I was able to hang Ralph's reproducer on an 8 core Dell, Fedora 9, kernel 2.6.27(.4-78.2.53.fc9.x86_64).


I don't think it's just the one machine and its configuration.

        - Bryan

--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico

Developers,

This is my first post to the openmpi developers list. I think I've run across a race condition in your latest release. Since my demonstrator is somewhat large and cumbersome, I'd like to know if you already know about this issue before we start the process of providing code and details.


Basics: openmpi 1.3.2, Fedora 9, 2 x86_64 quad-core cpus in one machine.

Symptoms: our code hangs, always in the same vicinity, usually at the same place, 10-25% of the time. Sometimes more often, sometimes less.

Our code has run reliably with many MPI implementations for years. We haven't added anything recently that is a likely culprit. While we have our own issues, this doesn't feel like one of ours.

We see that there is new code in the shared memory transport between 1.3.1 and 1.3.2. Our code doesn't hang with 1.3.1 (nor 1.2.9). Only with 1.3.2.

If we switch to tcp for transport (with mpirun --mca btl tcp,self ...) we don't see any hangs. Running using --mca btl sm,self results in hangs.
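
For anyone wanting to put numbers on the sm-versus-tcp comparison, here is a rough sketch of a harness that runs a program repeatedly under each transport and counts a run as hung if it exceeds a timeout. It is an illustration only, not something from our code: the helper names, the trial count, the 120 second timeout, and taking the program path from the command line are all assumptions. Only the --mca btl values and the 4 processes come from this report.

```python
import subprocess
import sys


def build_mpirun_cmd(btl, nprocs, program):
    """Build an mpirun command line that selects the given BTL components."""
    return ["mpirun", "-np", str(nprocs), "--mca", "btl", btl, program]


def hang_rate(cmd, trials, timeout_s):
    """Run cmd `trials` times; return the fraction of runs exceeding timeout_s.

    A run that times out is counted as a hang. subprocess.run kills the
    child it started when the timeout expires.
    """
    hangs = 0
    for _ in range(trials):
        try:
            subprocess.run(cmd, timeout=timeout_s,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs / trials


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: hang_rate.py /path/to/mpi_program")
    program = sys.argv[1]  # hypothetical: the MPI executable under test
    for btl in ("sm,self", "tcp,self"):
        cmd = build_mpirun_cmd(btl, 4, program)
        # Assumed values: 40 trials, 120 s per run; tune to the real runtime.
        print(btl, hang_rate(cmd, trials=40, timeout_s=120))
```

The timeout has to be comfortably longer than a normal run, otherwise slow runs are miscounted as hangs. Ranks left behind after mpirun is killed may need to be cleaned up by hand between trials.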

If we sprinkle a few calls (3) to MPI_Barrier in the vicinity of the problem, we no longer see hangs.

We demonstrate this with 4 processes. When we attach a debugger to the hung processes, we see that the hang results from an MPI_Allreduce. All processes have made the same call to MPI_Allreduce. The processes are all in opal_progress, called (with intervening calls) by MPI_Allreduce.

My question is, have you seen anything like this before? If not, what do we do next?


Thanks.

    - Bryan
