Eugene Loh wrote:
Ralph Castain wrote:

Hi Bryan

I have seen similar issues on LANL clusters when message sizes were fairly large. How big are your buffers when you call Allreduce? Can you send us your Allreduce call params (e.g., the reduce operation, datatype, num elements)?

If you don't want to send that to the list, you can send it to me at LANL.

I haven't seen any updates on this. Please tell me Bryan sent info to Ralph at LANL and Ralph nailed this one. Please! :^)

Eugene,

I've got mostly good news ...

Ralph sent me a platform file and a corresponding .conf file. I built ompi from openmpi-1.3.3a1r21223.tar.gz with these files. I've been running my normal tests and have been unable to hang a job yet. I've run enough that I don't expect to see a problem.

So we're up and running, but with some extra voodoo in the platform files. This is on a totally vanilla Fedora 9 installation (other than a couple of Fortran compilers, though we're not using the Fortran interface to MPI), running on a Dell workstation with two quad-core CPUs - vanilla hardware, too. MPI isn't working out of the box.

From a user's perspective, configure should be setting the right defaults on such a setup. But the core code seems to be working - I'm giving it a good hammering.

The Allreduce calls in question were doing a logical OR on 1 integer from each process - it was an error check. Hence the buffers (on the application side) were 4 bytes each. There were only 4 processes involved.
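For reference, a minimal sketch of that kind of call is below. The variable names and the error-flag convention are my own illustration, not Bryan's actual code; it just shows a 1-element MPI_INT / MPI_LOR Allreduce of the sort described above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int local_error = 0;   /* nonzero if this rank detected a problem */
        int any_error   = 0;   /* logical OR of local_error across ranks  */

        MPI_Init(&argc, &argv);

        /* One int per rank, reduced with logical OR: a 4-byte send buffer
           per process, matching the error-check pattern described above. */
        MPI_Allreduce(&local_error, &any_error, 1, MPI_INT, MPI_LOR,
                      MPI_COMM_WORLD);

        if (any_error)
            fprintf(stderr, "some rank reported an error\n");

        MPI_Finalize();
        return 0;
    }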

        - Bryan

--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico
