Thanks Ralph. I'll check it next Monday.
Tetsuya

> Should be fixed with r32058
>
> On Jun 20, 2014, at 4:13 AM, tmish...@jcity.maeda.co.jp wrote:
>
> > Hi Ralph,
> >
> > By the way, something is wrong with your latest rmaps_rank_file.c.
> > I've got the error below. I'm trying to find the problem, but you
> > could find it more quickly...
> >
> > [mishima@manage trial]$ cat rankfile
> > rank 0=node05 slot=0-1
> > rank 1=node05 slot=3-4
> > rank 2=node05 slot=6-7
> > [mishima@manage trial]$ mpirun -np 3 -rf rankfile -report-bindings demos/myprog
> > --------------------------------------------------------------------------
> > Error, invalid syntax in the rankfile (rankfile)
> > syntax must be the fallowing
> > rank i=host_i slot=string
> > Examples of proper syntax include:
> > rank 1=host1 slot=1:0,1
> > rank 0=host2 slot=0:*
> > rank 2=host4 slot=1-2
> > rank 3=host3 slot=0:1;1:0-2
> > --------------------------------------------------------------------------
> > [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 483
> > [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 149
> > [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 287
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> My guess is that the coll/ml component may have problems with binding a
> >> single process across multiple cores like that - it might be that we'll
> >> have to have it check for that condition and disqualify itself. It is a
> >> particularly bad binding pattern, though, as shared memory gets
> >> completely messed up when you split that way.
> >>
> >>
> >> On Jun 19, 2014, at 3:57 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>> Hi folks,
> >>>
> >>> Recently I have been seeing a hang with trunk when I specify a
> >>> particular binding by use of a rankfile or "-map-by slot".
> >>>
> >>> This can be reproduced by a rankfile which allocates a process
> >>> across the socket boundary. For example, on node05, which has 2
> >>> sockets with 4 cores each, rank 1 is allocated across sockets 0 and 1
> >>> as shown below. Then it hangs in the middle of communication.
> >>>
> >>> [mishima@manage trial]$ cat rankfile1
> >>> rank 0=node05 slot=0-1
> >>> rank 1=node05 slot=3-4
> >>> rank 2=node05 slot=6-7
> >>>
> >>> [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings demos/myprog
> >>> [node05.cluster:02342] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
> >>> [node05.cluster:02342] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
> >>> [node05.cluster:02342] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
> >>> Hello world from process 2 of 3
> >>> Hello world from process 1 of 3
> >>> << hang here! >>
> >>>
> >>> If I disable coll_ml or use the 1.8 series, it works, which suggests
> >>> it is affected by the coll_ml component, I guess. But, unfortunately,
> >>> I have no idea how to fix this problem, so could somebody please
> >>> resolve the issue.
> >>>
> >>> [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings -mca coll_ml_priority 0 demos/myprog
> >>> [node05.cluster:02382] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
> >>> [node05.cluster:02382] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.]
> >>> [node05.cluster:02382] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
> >>> Hello world from process 2 of 3
> >>> Hello world from process 0 of 3
> >>> Hello world from process 1 of 3
> >>>
> >>> In addition, when I use a host with 12 cores, "-map-by slot" causes
> >>> the same problem.
> >>>
> >>> [mishima@manage trial]$ mpirun -np 3 -map-by slot:pe=4 -report-bindings demos/myprog
> >>> [manage.cluster:02557] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./.][./././././.]
> >>> [manage.cluster:02557] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [././././B/B][B/B/./././.]
> >>> [manage.cluster:02557] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././B/B/B/B]
> >>> Hello world from process 1 of 3
> >>> Hello world from process 2 of 3
> >>> << hang here! >>
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
> >>
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15032.php
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15039.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15040.php