Hi folks,

Recently I have been seeing a hang with trunk when I specify a
particular binding using a rankfile or "-map-by slot".

It can be reproduced with a rankfile that binds a process across a
socket boundary. For example, on node05, which has 2 sockets with
4 cores each, rank 1 is bound across sockets 0 and 1 as shown
below. The run then hangs in the middle of communication.

[mishima@manage trial]$ cat rankfile1
rank 0=node05 slot=0-1
rank 1=node05 slot=3-4
rank 2=node05 slot=6-7

[mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings demos/myprog
[node05.cluster:02342] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:02342] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
[node05.cluster:02342] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
Hello world from process 2 of 3
Hello world from process 1 of 3
<< hang here! >>
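
To make the layout above easier to check, here is a minimal Python sketch
that reads rankfile lines and reports which sockets each rank touches. It
is only an illustration: the helper name sockets_for_rankfile is made up,
it handles only the plain "slot=a-b" form used in rankfile1, and it assumes
cores are numbered contiguously per socket (4 cores per socket on node05,
which matches the -report-bindings output above).

```python
import re

# Hypothetical helper, not part of Open MPI: maps each rank in a rankfile
# to the sockets covered by its slot range.
def sockets_for_rankfile(text, cores_per_socket):
    # Only the simple "rank N=host slot=a-b" (or "slot=a") form is handled.
    pattern = re.compile(r"rank\s+(\d+)=(\S+)\s+slot=(\d+)(?:-(\d+))?$")
    result = {}
    for line in text.splitlines():
        m = pattern.match(line.strip())
        if not m:
            continue
        rank = int(m.group(1))
        first = int(m.group(3))
        last = int(m.group(4)) if m.group(4) else first
        # Assumption: core c belongs to socket c // cores_per_socket.
        result[rank] = sorted(
            {core // cores_per_socket for core in range(first, last + 1)}
        )
    return result

RANKFILE = """\
rank 0=node05 slot=0-1
rank 1=node05 slot=3-4
rank 2=node05 slot=6-7
"""

spans = sockets_for_rankfile(RANKFILE, cores_per_socket=4)
for rank, sockets in spans.items():
    note = "crosses a socket boundary" if len(sockets) > 1 else "single socket"
    print(f"rank {rank}: sockets {sockets} ({note})")
```

With rankfile1, only rank 1 is reported as crossing a socket boundary,
which is the rank bound to [./././B][B/././.] above.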

If I disable coll_ml or use the 1.8 series, it works, so I suspect
the coll_ml component is involved. Unfortunately, I have no idea how
to fix this problem. Could somebody please look into it?

[mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings -mca coll_ml_priority 0 demos/myprog
[node05.cluster:02382] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:02382] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
[node05.cluster:02382] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
Hello world from process 2 of 3
Hello world from process 0 of 3
Hello world from process 1 of 3

In addition, when I use a host with 12 cores, "-map-by slot:pe=4" causes
the same problem, because rank 1 again ends up bound across both sockets.
[mishima@manage trial]$ mpirun -np 3 -map-by slot:pe=4 -report-bindings demos/myprog
[manage.cluster:02557] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./.][./././././.]
[manage.cluster:02557] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [././././B/B][B/B/./././.]
[manage.cluster:02557] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././B/B/B/B]
Hello world from process 1 of 3
Hello world from process 2 of 3
<< hang here! >>
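
For anyone who wants to spot such bindings quickly, the bracketed map at
the end of each -report-bindings line can be scanned directly. The sketch
below is a rough illustration, not an Open MPI tool: the function name
sockets_in_mask is hypothetical, and it assumes each [...] group is one
socket and that a "B" marks a bound core, as in the output above.

```python
import re

# Hypothetical helper: returns the indices of the sockets that contain at
# least one bound core ("B") in a -report-bindings map string.
def sockets_in_mask(mask):
    # Assumption: each bracketed group corresponds to one socket.
    groups = re.findall(r"\[([^\]]*)\]", mask)
    return [i for i, group in enumerate(groups) if "B" in group]

# Binding maps copied from the 12-core run above.
masks = {
    0: "[B/B/B/B/./.][./././././.]",
    1: "[././././B/B][B/B/./././.]",
    2: "[./././././.][././B/B/B/B]",
}

# Ranks whose cores are spread over more than one socket.
crossing = [rank for rank, mask in masks.items()
            if len(sockets_in_mask(mask)) > 1]
print("ranks bound across sockets:", crossing)
```

Here too only rank 1 spans both sockets, the same pattern as in the
rankfile case.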

Regards,
Tetsuya Mishima
