Should be fixed with r32058
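
For anyone following along: the "check for that condition and disqualify itself" idea discussed below boils down to asking hwloc whether the process's CPU binding intersects more than one socket. Here is a minimal sketch of that test in C. It is illustrative only, not the committed r32058 change, and it assumes hwloc 1.x naming (HWLOC_OBJ_SOCKET was later renamed HWLOC_OBJ_PACKAGE in hwloc 2.0); the helper name binding_spans_multiple_sockets is made up for this example.

    #include <stdbool.h>
    #include <hwloc.h>

    /* Illustrative sketch only: report whether the calling process's
     * CPU binding intersects more than one socket. */
    static bool binding_spans_multiple_sockets(void)
    {
        hwloc_topology_t topo;
        hwloc_bitmap_t set;
        bool spans = false;

        if (0 > hwloc_topology_init(&topo)) {
            return false;
        }
        if (0 > hwloc_topology_load(topo)) {
            hwloc_topology_destroy(topo);
            return false;
        }

        set = hwloc_bitmap_alloc();
        if (NULL != set &&
            0 == hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS)) {
            int nsock = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
            int hits = 0, i;
            for (i = 0; i < nsock; i++) {
                hwloc_obj_t sock =
                    hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, i);
                /* Count sockets whose cores overlap our binding. */
                if (NULL != sock &&
                    hwloc_bitmap_intersects(sock->cpuset, set)) {
                    hits++;
                }
            }
            spans = (hits > 1);
        }

        if (NULL != set) {
            hwloc_bitmap_free(set);
        }
        hwloc_topology_destroy(topo);
        return spans;
    }

A coll component that sees true here could lower its priority or fail its query so that another component is selected, which is what "disqualify itself" means in this context. The workarounds shown in the thread also still apply: run with -mca coll_ml_priority 0, or write the rankfile so each rank stays inside a single socket (e.g. slot=0-1 / slot=2-3 / slot=4-5 on the 2-socket, 4-cores-per-socket node05), which avoids the cross-socket pattern entirely.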

On Jun 20, 2014, at 4:13 AM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> By the way, something is wrong with your latest rmaps_rank_file.c.
> I've got the error below. I'm trying to find the problem, but you
> could probably find it more quickly...
> 
> [mishima@manage trial]$ cat rankfile
> rank 0=node05 slot=0-1
> rank 1=node05 slot=3-4
> rank 2=node05 slot=6-7
> [mishima@manage trial]$ mpirun -np 3 -rf rankfile -report-bindings demos/myprog
> --------------------------------------------------------------------------
> Error, invalid syntax in the rankfile (rankfile)
> syntax must be the fallowing
> rank i=host_i slot=string
> Examples of proper syntax include:
>    rank 1=host1 slot=1:0,1
>    rank 0=host2 slot=0:*
>    rank 2=host4 slot=1-2
>    rank 3=host3 slot=0:1;1:0-2
> --------------------------------------------------------------------------
> [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 483
> [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 149
> [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 287
> 
> Regards,
> Tetsuya Mishima
> 
>> My guess is that the coll/ml component may have problems with binding a
>> single process across multiple cores like that - it might be that we'll
>> have to have it check for that condition and disqualify itself. It is a
>> particularly bad binding pattern, though, as shared memory gets
>> completely messed up when you split that way.
>> 
>> 
>> On Jun 19, 2014, at 3:57 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> Hi folks,
>>> 
>>> Recently I have been seeing a hang with the trunk when I specify a
>>> particular binding via a rankfile or "-map-by slot".
>>> 
>>> This can be reproduced with a rankfile that allocates a process
>>> across a socket boundary. For example, on node05, which has 2 sockets
>>> with 4 cores each, rank 1 is allocated across sockets 0 and 1 as
>>> shown below, and then it hangs in the middle of communication.
>>> 
>>> [mishima@manage trial]$ cat rankfile1
>>> rank 0=node05 slot=0-1
>>> rank 1=node05 slot=3-4
>>> rank 2=node05 slot=6-7
>>> 
>>> [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings demos/myprog
>>> [node05.cluster:02342] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:02342] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
>>> [node05.cluster:02342] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>> Hello world from process 2 of 3
>>> Hello world from process 1 of 3
>>> << hang here! >>
>>> 
>>> If I disable coll_ml or use the 1.8 series, it works, which suggests
>>> the coll_ml component is involved. Unfortunately, I have no idea how
>>> to fix this problem, so could somebody please resolve the issue?
>>> 
>>> [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings -mca coll_ml_priority 0 demos/myprog
>>> [node05.cluster:02382] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:02382] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
>>> [node05.cluster:02382] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>> Hello world from process 2 of 3
>>> Hello world from process 0 of 3
>>> Hello world from process 1 of 3
>>> 
>>> In addition, when I use a host with 12 cores, "-map-by slot" causes
>>> the same problem.
>>> [mishima@manage trial]$ mpirun -np 3 -map-by slot:pe=4 -report-bindings demos/myprog
>>> [manage.cluster:02557] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./.][./././././.]
>>> [manage.cluster:02557] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [././././B/B][B/B/./././.]
>>> [manage.cluster:02557] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././B/B/B/B]
>>> Hello world from process 1 of 3
>>> Hello world from process 2 of 3
>>> << hang here! >>
>>> 
>>> Regards,
>>> Tetsuya Mishima
>>> 
