Eugene Loh wrote:
Ralph Castain wrote:
Hi Bryan
I have seen similar issues on LANL clusters when message sizes were
fairly large. How big are your buffers when you call Allreduce? Can
you send us your Allreduce call params (e.g., the reduce operation,
datatype, num elements)?
If you don't want to send that to the list, you can send it to me at
LANL.
I haven't seen any updates on this. Please tell me Bryan sent info to
Ralph at LANL and Ralph nailed this one. Please! :^)
Eugene,
I've got mostly good news ...
Ralph sent me a platform file and a corresponding .conf file. I built
ompi from openmpi-1.3.3a1r21223.tar.gz, with these files. I've been
running my normal tests and have been unable to hang a job yet. I've
run enough that I don't expect to see a problem.
So we're up and running, but with some extra voodoo in the platform
files. This is on a totally vanilla Fedora 9 installation (other than a
couple of Fortran compilers, but we're not using the Fortran interface
to MPI), running on a Dell workstation with 2 quad-core CPUs - vanilla
hardware, too. MPI isn't working out of the box.
From a user's perspective, configure should be setting the right
defaults on such a setup. But the core code seems to be working - I'm
giving it a good hammering.
The Allreduces in question were doing a logical OR on 1 integer from
each process - it was an error check. Hence the buffers (on the
application side) were 4 bytes. There were only 4 processes involved.
- Bryan
--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico