Hi Ralph,

Thanks for your reply!

> One thing you might want to try: add this to your mpirun cmd line:
> 
> --display-allocation
> 
> This will tell you how many slots we think we've been given on your
> cluster.

I tried that with 1.8.2rc4; this is what I get:

======================   ALLOCATED NODES   ======================
        node2: slots=48 max_slots=48 slots_inuse=0 state=UNKNOWN
=================================================================

I forgot to mention previously that mpirun uses all cores when running
on localhost; the 32-process cap only appears when running on another
host (via --hostfile hosts). I'm attaching a snapshot of the most
recent run. The job was invoked by:
The job was invoked by:

/usr/local/openmpi-1.8.2rc4/bin/mpirun -np 48 --hostfile hosts
  --display-allocation ./test.py > test.std 2> test.ste
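
In case it helps narrow this down, here is what I could try next -- just
a sketch, assuming the 1.8 binding/mapping flags behave as described in
the mpirun man page:

/usr/local/openmpi-1.8.2rc4/bin/mpirun -np 48 --hostfile hosts \
  --report-bindings --bind-to none ./test.py > test.std 2> test.ste

/usr/local/openmpi-1.8.2rc4/bin/mpirun -np 48 --hostfile hosts \
  --report-bindings --map-by core ./test.py > test.std 2> test.ste

The first should take binding out of the picture entirely; the second
maps one process per core. Let me know if the --report-bindings output
from either would be useful.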

test.ste contains the hwloc error I mentioned in my previous post:

****************************************************************************
* hwloc has encountered what looks like an error from the operating system.
*
* object (L3 cpuset 0x000003f0) intersection without inclusion!
* Error occurred in topology.c line 760
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
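
If fresh topology data would help, I can regenerate it on node2 with
something along these lines (the script name is taken from the hwloc
error text above, so treat the exact invocation as a best guess):

hwloc-gather-topology.sh /tmp/cluster   # should produce /tmp/cluster.output and /tmp/cluster.tar.bz2
lstopo                                  # quick human-readable dump of what hwloc detects

Happy to rerun either and send the output if that is more convenient
than the files attached to my first mail.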

Hope this helps,
Andrej


> On Aug 21, 2014, at 12:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > Starting early in the 1.7 series, we began to bind procs by default
> > to cores when -np <= 2, and to sockets if np > 2. Is it possible
> > this is what you are seeing?
> > 
> > 
> > On Aug 21, 2014, at 12:45 PM, Andrej Prsa <aprs...@gmail.com> wrote:
> > 
> >> Dear devels,
> >> 
> >> I have been trying out 1.8.2rcs recently and found a show-stopping
> >> problem on our cluster. Running any job with any number of
> >> processors larger than 32 will always employ only 32 cores per
> >> node (our nodes have 48 cores). We are seeing identical behavior
> >> with 1.8.2rc4, 1.8.2rc2, and 1.8.1. Running identical programs
> >> shows no such issues with version 1.6.5, where all 48 cores per
> >> node are working. Although our system runs torque/maui, the
> >> problem is also evident when running mpirun directly.
> >> 
> >> I am attaching the hwloc topology in case that helps -- I am aware
> >> of a buggy BIOS that trips up hwloc, but I don't know whether that
> >> might be an issue. I am happy to help debug if you can provide me
> >> with guidance.
> >> 
> >> Thanks,
> >> Andrej
> >> <cluster.output><cluster.tar.bz2>
> > 
> 
