Thanks, Ralph. I'll check it next Monday.

Tetsuya

> Should be fixed with r32058
>
>
> On Jun 20, 2014, at 4:13 AM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Ralph,
> >
> > By the way, something is wrong with your latest rmaps_rank_file.c.
> > I've got the error below. I'm trying to find the problem, but you
> > could probably find it more quickly...
> >
> > [mishima@manage trial]$ cat rankfile
> > rank 0=node05 slot=0-1
> > rank 1=node05 slot=3-4
> > rank 2=node05 slot=6-7
> > [mishima@manage trial]$ mpirun -np 3 -rf rankfile -report-bindings demos/myprog
> >
> > --------------------------------------------------------------------------
> > Error, invalid syntax in the rankfile (rankfile)
> > syntax must be the fallowing
> > rank i=host_i slot=string
> > Examples of proper syntax include:
> >    rank 1=host1 slot=1:0,1
> >    rank 0=host2 slot=0:*
> >    rank 2=host4 slot=1-2
> >    rank 3=host3 slot=0:1;1:0-2
> >
> > --------------------------------------------------------------------------
> > [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 483
> > [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 149
> > [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 287
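(For reference: given the syntax examples printed above, the same placement
could also be written in the explicit socket:core notation. The sketch below
assumes the core numbering shown by -report-bindings further down, i.e.
cores 0-3 on socket 0 and 4-7 on socket 1; whether the parser actually
rejects the plain "slot=0-1" form is exactly what needs checking in
rmaps_rank_file.c.)

rank 0=node05 slot=0:0-1
rank 1=node05 slot=0:3;1:0
rank 2=node05 slot=1:2-3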
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> My guess is that the coll/ml component may have problems with binding a
> >> single process across multiple cores like that - it might be that we'll
> >> have to have it check for that condition and disqualify itself. It is a
> >> particularly bad binding pattern, though, as shared memory gets
> >> completely messed up when you split that way.
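As a rough illustration of that disqualification idea, here is a minimal
standalone sketch (not the actual coll/ml code) that uses plain hwloc to
test whether the calling process's binding fits inside a single socket; a
component's query function could return an error on a negative result so
the selection logic falls back to another coll module:

#include <hwloc.h>
#include <stdio.h>

/* Hypothetical helper, not OMPI's actual check: return 1 if this
 * process's CPU binding is contained within a single socket. */
static int bound_within_one_socket(hwloc_topology_t topo)
{
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    int ok = 0;

    if (0 == hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS)) {
        /* Smallest topology object whose cpuset contains the binding;
         * if the binding spans sockets, this lies above socket level. */
        hwloc_obj_t obj = hwloc_get_obj_covering_cpuset(topo, set);
        /* Walk upward; we find a socket only if one contains us. */
        while (NULL != obj && HWLOC_OBJ_SOCKET != obj->type) {
            obj = obj->parent;
        }
        ok = (NULL != obj);
    }
    hwloc_bitmap_free(set);
    return ok;
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    /* A real component would return an error here instead of printing. */
    printf("binding contained in one socket: %s\n",
           bound_within_one_socket(topo) ? "yes" : "no");
    hwloc_topology_destroy(topo);
    return 0;
}

Compiled with "gcc check_binding.c -lhwloc" and launched under the rankfile
below, rank 1 should be the one reporting "no".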
> >>
> >>
> >> On Jun 19, 2014, at 3:57 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>> Hi folks,
> >>>
> >>> Recently I have been seeing a hang with trunk when I specify a
> >>> particular binding using a rankfile or "-map-by slot".
> >>>
> >>> It can be reproduced with a rankfile that allocates a process across
> >>> a socket boundary. For example, on node05, which has 2 sockets with 4
> >>> cores each, rank 1 is allocated across sockets 0 and 1 as shown
> >>> below. It then hangs in the middle of communication.
> >>>
> >>> [mishima@manage trial]$ cat rankfile1
> >>> rank 0=node05 slot=0-1
> >>> rank 1=node05 slot=3-4
> >>> rank 2=node05 slot=6-7
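(By contrast, a rankfile that keeps every rank inside one socket on the
same 2-socket/4-core node would look like the sketch below; with cores 0-3
on socket 0 and 4-7 on socket 1, as the -report-bindings output confirms,
no rank crosses the boundary.)

rank 0=node05 slot=0-1
rank 1=node05 slot=2-3
rank 2=node05 slot=4-5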
> >>>
> >>> [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings demos/myprog
> >>> [node05.cluster:02342] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
> >>> [node05.cluster:02342] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
> >>> [node05.cluster:02342] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
> >>> Hello world from process 2 of 3
> >>> Hello world from process 1 of 3
> >>> << hang here! >>
> >>>
> >>> If I disable coll_ml or use the 1.8 series, it works, which suggests
> >>> the coll_ml component is involved. Unfortunately, I have no idea how
> >>> to fix this problem, so I would appreciate it if somebody could
> >>> resolve the issue.
> >>>
> >>> [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings -mca coll_ml_priority 0 demos/myprog
> >>> [node05.cluster:02382] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
> >>> [node05.cluster:02382] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
> >>> [node05.cluster:02382] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
> >>> Hello world from process 2 of 3
> >>> Hello world from process 0 of 3
> >>> Hello world from process 1 of 3
> >>>
> >>> In addition, when I use a host with 12 cores, "-map-by slot" causes
> >>> the same problem:
> >>> [mishima@manage trial]$ mpirun -np 3 -map-by slot:pe=4 -report-bindings demos/myprog
> >>> [manage.cluster:02557] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./.][./././././.]
> >>> [manage.cluster:02557] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [././././B/B][B/B/./././.]
> >>> [manage.cluster:02557] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././B/B/B/B]
> >>> Hello world from process 1 of 3
> >>> Hello world from process 2 of 3
> >>> << hang here! >>
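(Presumably the coll_ml workaround shown earlier applies to this case as
well; an untested sketch:

[mishima@manage trial]$ mpirun -np 3 -map-by slot:pe=4 -mca coll_ml_priority 0 -report-bindings demos/myprog)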
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>