Jeff,
Only the processes of the program whose process 0 succeeded in publishing the
name have srv=1 and then call MPI_Comm_accept.
The processes of the program whose process 0 failed to publish the name
have srv=0 and then call MPI_Comm_connect.
It worked like this with Open MPI 1.4.1.
Is it different in 1.5.1?
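For reference, a minimal C sketch of the pattern described above (this is an assumed reconstruction, not Bernard's actual code; the service name "testcase" is hypothetical):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, srv = 0;
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm inter;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Let errors return instead of aborting, so a failed publish
           can be detected. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        if (rank == 0) {
            MPI_Open_port(MPI_INFO_NULL, port);
            if (MPI_Publish_name("testcase", MPI_INFO_NULL, port) == MPI_SUCCESS)
                srv = 1;                /* we are the server side */
            else
                MPI_Lookup_name("testcase", MPI_INFO_NULL, port);
        }
        /* All processes must take the same branch, hence the broadcast. */
        MPI_Bcast(&srv, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (srv)
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        else
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

        MPI_Finalize();
        return 0;
    }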
Jeff,
The deadlock is not in MPI_Comm_accept and MPI_Comm_connect, but earlier,
in MPI_Publish_name and MPI_Lookup_name.
So the broadcast of srv is not involved in the deadlock.
Best
Bernard
I get the same deadlock with the Open MPI tests pubsub, accept, and connect
with version 1.5.1.
The accept and connect tests are OK with Open MPI 1.4.1.
I think there is a bug in version 1.5.1.
Best
Bernard
On 6 January 2011 21:10, Gilbert Grosdidier wrote:
> Hi Jeff,
>
> Where is the lstopo command located on SuSE Linux, please?
> And/or hwloc-bind, which seems related to it?
I was able to get hwloc to install quite easily on SuSE:
download, configure, and make.
Configure it to install to /usr/local/bin.
On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
> > lstopo
> Machine (35GB)
>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#8)
>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>       PU L#2 (P#1)
Hi Jeff,
Thanks for taking care of this.
Here is what I got on a worker node:
> mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
0x0001
Is this what is expected, please? Or should I try yet another command?
Thanks, Regards, Gilbert
On Jan 7, 2011, at 5:27 AM, John Hearns wrote:
> Actually, the topic of hyperthreading is interesting, and we should
> discuss it, please.
> Hyperthreading is supposedly implemented better and 'properly' on
> Nehalem - I would be interested to see some genuine
> performance measurements with hyperthreading.
Can you run with np=8?
Yes, here it is:
> mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
0x0001
0x0002
0x0004
0x0008
0x0010
0x0020
0x0040
0x0080
Gilbert.
The FW version looks OK. But it may be a driver issue as well. I guess that an
OFED 1.4.x or 1.5.x driver should be OK.
To check the driver version, you can run the ofed_info command.
Regards,
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
You're calling bcast with root=0, so whatever value rank 0 has for srv,
everyone will have after the bcast. Plus, I didn't see in your code where *srv
was ever set to 0.
In my runs, rank 0 is usually the one that publishes first. Everyone then gets
the lookup properly, and then the bcast sends rank 0's srv value to everyone.
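As a reminder of the semantics being described (a minimal sketch; the variable name srv follows the thread):

    int srv = 0;   /* set by rank 0 before the call */
    /* After this returns, every rank in MPI_COMM_WORLD holds the value
       that rank 0 (the root) passed in. */
    MPI_Bcast(&srv, 1, MPI_INT, 0, MPI_COMM_WORLD);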
On 1/7/2011 6:49 AM, Jeff Squyres wrote:
My understanding is that hyperthreading can only be activated/deactivated at
boot time -- once the core resources are allocated to hyperthreads, they can't
be changed while running.
Whether disabling the hyperthreads or simply telling Linux not to schedule on
them makes any practical difference is another question.
+1
AFAIR (and I stopped being an IB vendor a long time ago, so I might be wrong),
the _resize_cq function being there or not is not an issue of the underlying
HCA; it's a function of what version of OFED you're running.
Well, bummer -- there goes my theory. According to the hwloc info you posted
earlier, this shows that OMPI is binding to the 1st hyperthread on each core;
*not* to both hyperthreads on a single core. :-\
It would still be slightly interesting to see if there's any difference when
you run with --bind-to-core.
I'll very soon give Hyperthreading a try with our app,
and keep you posted about the improvements, if any.
Our current cluster is made of 4-core dual-socket Nehalem nodes.
Cheers, Gilbert.
srv = 0 is set in my main program.
I call Bcast because all the processes must call MPI_Comm_accept
(collective) or must call MPI_Comm_connect (collective).
Anyway, I also get a deadlock with your lookup program.
This is what I do:
ompi-server -r URIfile
mpirun -np 1 -ompi-server file:URIfile
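For context, a typical two-sided session with a standalone ompi-server might look like this (the program names below are placeholders, not from the original message):

    ompi-server -r URIfile
    mpirun -np 1 -ompi-server file:URIfile ./server_prog
    mpirun -np 1 -ompi-server file:URIfile ./client_prog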
Hello Pavel,
Here is the output of the ofed_info command :
==
OFED-1.4.1
libibverbs:
git://git.openfabrics.org/ofed_1_4/libibverbs.git ofed_1_4
commit b00dc7d2f79e0660ac40160607c9c4937a895433
libmthca:
git://git.kernel.org/pub/scm/libs/infiniban
I'm still testing the Slurm integration, which seems to work fine so
far. However, I just upgraded another cluster to Open MPI 1.5 and
Slurm 2.1.15, but this machine has no InfiniBand.
If I salloc the nodes and mpirun the command, it seems to run and complete fine;
however, if I srun the command I get:
On Jan 7, 2011, at 10:41 AM, Bernard Secher - SFME/LGLS wrote:
> srv = 0 is set in my main program
> I call Bcast because all the processes must call MPI_Comm_accept (collective)
> or must call MPI_Comm_connect (collective)
Ah -- I see. I thought this was a test program where some processes were
accepting and others connecting.
On Jan 7, 2011, at 11:16 AM, Jeff Squyres wrote:
> Ok, I can replicate the hang in publish now. I'll file a bug report.
Filed here:
https://svn.open-mpi.org/trac/ompi/ticket/2681
Thanks for your persistence!
--
Jeff Squyres
jsquy...@cisco.com
Unfortunately, I was unable to spot any striking difference in performance
when using --bind-to-core.
Sorry. Any other suggestion?
Regards, Gilbert.
Ralph Castain wrote:
> Afraid not - though you could alias your program name to be "nice --10 prog"
Is there an OMPI wish list? If so, can we please add to it "a method
to tell mpirun what nice values to use when it starts programs on
nodes"? Minimally, something like this:
--nice 12
Gilbert Grosdidier wrote:
Any other suggestion?
Can any more information be extracted from profiling? Here is where I
think things left off:
Eugene Loh wrote:
Gilbert Grosdidier wrote:
#               [time]   [calls]   <%mpi>   <%wall>
# MPI_Waitall
Hello,
When I run this code:
program testcase
   use mpi
   implicit none
   integer :: rank, lsize, rsize, code
   integer :: intercomm
   call MPI_INIT(code)
   call MPI_COMM_GET_PARENT(intercomm, code)
   if (intercomm == MPI_COMM_NULL) then
      ! No parent, so this is the initial process: spawn a child copy.
      ! The original message is truncated here; the remaining arguments
      ! and the rest of the program are an assumed reconstruction.
      call MPI_COMM_SPAWN("./testcase", MPI_ARGV_NULL, 1, MPI_INFO_NULL, &
                          0, MPI_COMM_SELF, intercomm, MPI_ERRCODES_IGNORE, code)
   end if
   call MPI_COMM_RANK(intercomm, rank, code)
   call MPI_COMM_SIZE(intercomm, lsize, code)
   call MPI_COMM_REMOTE_SIZE(intercomm, rsize, code)
   call MPI_FINALIZE(code)
end program testcase
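A typical way to build and launch this (assuming the source file is named testcase.f90 and the environment allows spawning):

    mpif90 -o testcase testcase.f90
    mpirun -np 1 ./testcase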