Hi Ralph,

By the way, something seems to be wrong with your latest rmaps_rank_file.c.
I get the error below. I'm trying to find the problem, but you could
probably find it more quickly...

[mishima@manage trial]$ cat rankfile
rank 0=node05 slot=0-1
rank 1=node05 slot=3-4
rank 2=node05 slot=6-7
[mishima@manage trial]$ mpirun -np 3 -rf rankfile -report-bindings demos/myprog
--------------------------------------------------------------------------
Error, invalid syntax in the rankfile (rankfile)
syntax must be the fallowing
rank i=host_i slot=string
Examples of proper syntax include:
    rank 1=host1 slot=1:0,1
    rank 0=host2 slot=0:*
    rank 2=host4 slot=1-2
    rank 3=host3 slot=0:1;1:0-2
--------------------------------------------------------------------------
[manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file
rmaps_rank_file.c at line 483
[manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file
rmaps_rank_file.c at line 149
[manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 287

Regards,
Tetsuya Mishima
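
For what it's worth, rewriting the rankfile in the socket:core form from the
error message's own examples could be a useful cross-check. This is only a
sketch: node05 and its 2-sockets-by-4-cores layout come from my setup, but the
exact slot choices below are hypothetical.

```shell
# Hypothetical rankfile using the "rank i=host_i slot=socket:core" form
# listed among the error message's examples. node05 and the 2x4 core
# layout are from the original report; the slot choices are illustrative.
cat > rankfile.socket <<'EOF'
rank 0=node05 slot=0:0-1
rank 1=node05 slot=0:3,1:0
rank 2=node05 slot=1:2-3
EOF
cat rankfile.socket
```

If the parser accepts this form but rejects the plain "slot=0-1" range form,
that would narrow the regression down to the range-without-socket parsing path.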

> My guess is that the coll/ml component may have problems with binding a
> single process across multiple cores like that - it might be that we'll
> have to have it check for that condition and disqualify itself. It is a
> particularly bad binding pattern, though, as shared memory gets
> completely messed up when you split that way.
>
>
> On Jun 19, 2014, at 3:57 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> > Hi folks,
> >
> > Recently I have been seeing a hang with trunk when I specify a
> > particular binding by use of a rankfile or "-map-by slot".
> >
> > This can be reproduced with a rankfile that allocates a process
> > across a socket boundary. For example, on node05, which has 2 sockets
> > with 4 cores each, rank 1 is allocated across sockets 0 and 1 as shown
> > below. Then it hangs in the middle of communication.
> >
> > [mishima@manage trial]$ cat rankfile1
> > rank 0=node05 slot=0-1
> > rank 1=node05 slot=3-4
> > rank 2=node05 slot=6-7
> >
> > [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings demos/myprog
> > [node05.cluster:02342] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
> > [node05.cluster:02342] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
> > [node05.cluster:02342] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
> > Hello world from process 2 of 3
> > Hello world from process 1 of 3
> > << hang here! >>
> >
> > If I disable coll_ml or use the 1.8 series, it works, so I guess the
> > problem is related to the coll_ml component. Unfortunately, I have no
> > idea how to fix it myself, so could somebody please look into the
> > issue?
> >
> > [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings -mca coll_ml_priority 0 demos/myprog
> > [node05.cluster:02382] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
> > [node05.cluster:02382] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
> > [node05.cluster:02382] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
> > Hello world from process 2 of 3
> > Hello world from process 0 of 3
> > Hello world from process 1 of 3
> >
> > In addition, when I use a host with 12 cores, "-map-by slot" causes
> > the same problem:
> > [mishima@manage trial]$ mpirun -np 3 -map-by slot:pe=4 -report-bindings demos/myprog
> > [manage.cluster:02557] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./.][./././././.]
> > [manage.cluster:02557] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [././././B/B][B/B/./././.]
> > [manage.cluster:02557] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././B/B/B/B]
> > Hello world from process 1 of 3
> > Hello world from process 2 of 3
> > << hang here! >>
> >
> > Regards,
> > Tetsuya Mishima
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15032.php
