Just an idea: maybe it is worth providing a brand-new command-line option to mpirun. This option would accept a filename and support a combined syntax for the machinefile/hostfile (to define allocations) and the rankfile (to define placement).
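As a rough illustration of the combined-file idea (a sketch only, not Open MPI code), the Python below reads text in the clusterfile layout proposed in this message and splits it back into hostfile-style lines plus a list of rank placements. The function name parse_clusterfile, the sample data, and the duplicate-rank check are assumptions made for the example; the only Open MPI syntax it relies on is the hostfile form "host slots=N max_slots=M".

```python
def parse_clusterfile(text):
    """Split clusterfile text into hostfile lines and rank placements.

    Hypothetical helper: it handles only the small subset of YAML used in
    the proposal (unindented "host:" sections with indented "key : value"
    fields), so it needs no YAML library.
    """
    hosts = {}       # host name -> {field name: raw value}
    current = None   # fields of the host section being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so "1@0:*" stays inside the value.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not raw[0].isspace():
            current = hosts.setdefault(key, {})   # new "hostX:" section
        elif current is None:
            raise ValueError("field before any host: " + line)
        else:
            current[key] = value

    hostfile, placements, seen = [], [], {}
    for host, fields in hosts.items():
        entry = host
        if "slots" in fields:
            entry += " slots=%d" % int(fields["slots"])
        if "maxslots" in fields:
            entry += " max_slots=%d" % int(fields["maxslots"])
        hostfile.append(entry)
        for item in fields.get("ranks", "").split(","):
            item = item.strip()
            if not item:
                continue
            # rankid[@socket:core]; the part after "@" is kept verbatim.
            rank_text, _, where = item.partition("@")
            rank = int(rank_text)
            if rank in seen:
                # Assumption: a rank may be placed on one host only.
                raise ValueError("rank %d listed on both %s and %s"
                                 % (rank, seen[rank], host))
            seen[rank] = host
            placements.append((rank, host, where or None))
    placements.sort(key=lambda p: p[0])
    return hostfile, placements


# Made-up sample in the proposed layout.
SAMPLE = """\
hostX:
    slots : 4
    maxslots : 4
    ranks : 0, 16, 22
hostY:
    slots : 8
    maxslots : 8
    ranks : 1@0:*, 3@2-3, 4@0:1, 5
"""

if __name__ == "__main__":
    hostfile, placements = parse_clusterfile(SAMPLE)
    print("\n".join(hostfile))
    for rank, host, where in placements:
        print(rank, host, where)
```

The placements are returned as (rank, host, placement) tuples rather than as rankfile text, because how a rank with no socket:core part should be written into a rankfile is left open here.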
YAML syntax (http://www.yaml.org/start.html) can be used to describe the file primitives, for example:

$ mpirun -clusterfile /path/to/clusterfile
$ cat clusterfile
hostX:
    slots    : int
    maxslots : int
    ranks    : rankid[@socket:core]

example of clusterfile
===============
hostX:
    slots    : 4
    maxslots : 4
    ranks    : 1,16,22
hostY:
    slots    : 8
    maxslots : 8
    ranks    : 1@0:*, 3@2-3, 4@0:1, 5

By doing so, we keep backwards compatibility: after reading the clusterfile, the code should process the *hostfile* and *rankfile* parts as it does today.

What do you think?

Mike

On Mon, Jun 22, 2009 at 1:30 PM, Terry Dontje <terry.don...@sun.com> wrote:

> Let us think about this some more. We'll try and reply later today.
>
> --td
>
> Ralph Castain wrote:
>
>> Had a chance to think about how this might be done, and looked at it for
>> awhile after getting home. I -think- I found a way to do it, but there are a
>> couple of caveats:
>>
>> 1. Len's point about oversubscribing without warning would definitely hold
>> true - this would positively be a "user beware" option
>>
>> 2. there could be no RM-provided allocation, hostfile, or -host options
>> specified. Basically, I would be adding the "read rankfile" option to the
>> end of the current allocation determination procedure
>>
>> I would still allow more procs than shown in the rankfile (mapping the
>> rest bynode on the nodes specified in the rankfile - can't do byslot because
>> I don't know how many slots are on each node), which means the only change
>> in behavior would be the forced bynode mapping of unspecified procs.
>>
>> So use of this option will entail some risks and a slight difference in
>> behavior, but would relieve you from the burden of having to provide a
>> hostfile. I'm not personally convinced it is worth the risk and probable
>> user complaints of "it didn't work", but since we don't use this option, I
>> don't have a strong opinion on the matter.
>>
>> Let's just avoid going back-and-forth over wanting it, or how it should be
>> implemented - let's get it all ironed out, and then implement it once, like
>> we finally did at the end with the whole hostfile thing.
>>
>> Let me know if you want me to do this - it obviously isn't at the top of
>> my priority list, but still could be done in the next few weeks.
>>
>> Ralph
>>
>>
>> On Jun 21, 2009, at 9:00 AM, Lenny Verkhovsky wrote:
>>
>>> Sorry for the delay in response, I totally agree with Ralph that it's not
>>> as easy as it seems,
>>> 1. rankfile mapper uses already allocated machines ( by
>>> scheduler or hostfile ), by using rankfile as a hostfile we can run into
>>> problem where trying to use unallocated nodes, what can hang the run.
>>> 2. we can't define in rankfile number of slots on each machine, which
>>> means oversubscribing can take place without any warning.
>>> 3. I personally dont see any problem using hostfile, even if it has
>>> redundant info, hostfile and rankfile belong to different layers in the
>>> system and solve different problems. The original hostfile ( if I recall
>>> correctly ) could bind rank to the node, but the syntax wasn't very flexible
>>> and clear.
>>> Lenny.
>>>
>>> On Sun, Jun 21, 2009 at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Let me suggest a two-step process, then:
>>>
>>> 1. let's change the error message as this is easily done and thus
>>> can be done now
>>>
>>> 2. I can look at how to eat the rankfile as a hostfile. This may
>>> not even be possible - the problem is that the entire system is
>>> predicated on certain ordering due to our framework architecture.
>>> So we get an allocation, and then do a mapping against that
>>> allocation, filtering the allocation through hostfiles, -host,
>>> and other options.
>>>
>>> By the time we reach the rankfile mapper, we have already
>>> determined that we don't have an allocation and have to abort.
>>> It is the rankfile mapper itself that looks for the -rankfile
>>> option, so the system can have no knowledge that someone has
>>> specified that option before that point - and thus, even if I
>>> could parse the rankfile, I don't know it was given!
>>>
>>> What will take time is to figure out a way to either:
>>>
>>> (a) allow us to run the mapper even though we don't have any
>>> nodes we know about, and allow the mapper to insert the nodes
>>> itself - without causing non-rankfile uses to break (which could
>>> be a major feat); or
>>>
>>> (b) have the overall system check for the rankfile option and
>>> pass it as a hostfile as well, assuming that a hostfile wasn't
>>> also given, no RM-based allocation exists, etc. - which breaks
>>> our abstraction rules and also opens a possible can of worms.
>>>
>>> Either way, I also then have to teach the hostfile parser how to
>>> realize it is a rankfile format and convert the info in it into
>>> what we expected to receive from a hostfile - another non-trivial
>>> problem.
>>>
>>> I'm willing to give it a try - just trying to make clear why my
>>> response was negative. It isn't as simple as it sounds...which is
>>> why Len and I didn't pursue it when this was originally developed.
>>>
>>> Ralph
>>>
>>>
>>> On Sun, Jun 21, 2009 at 5:28 AM, Terry Dontje <terry.don...@sun.com> wrote:
>>>
>>> Being a part of these discussions I can understand your
>>> reticence to reopen this discussion. However, I think this
>>> is a major usability issue with this feature which actually
>>> is fairly important in order to get things to run performant.
>>> Which IMO is important.
>>>
>>> That being said I think there are one of two things that
>>> could be done to mitigate the issue.
>>>
>>> 1. To eliminate the element of surprise by changing mpirun
>>> to eat rankfile without the hostfile.
>>> 2. To change the error message to something understandable
>>> by the user such that they
>>> know they might be missing the hostfile option.
>>>
>>> Again I understand this topic is frustrating and there are
>>> some boundaries with the design that make these two option
>>> orthogonal to each other but I really believe we need to make
>>> the rankfile option something that is easily usable by our users.
>>>
>>>
>>> --td
>>>
>>> Ralph Castain wrote:
>>>
>>> Having gone around in circles on hostfile-related issues
>>> for over five years now, I honestly have little
>>> motivation to re-open the entire discussion again. It
>>> doesn't seem to be that daunting a requirement for those
>>> who are using it, so I'm inclined to just leave well
>>> enough alone.
>>>
>>> :-)
>>>
>>>
>>> On Fri, Jun 19, 2009 at 2:21 PM, Eugene Loh <eugene....@sun.com> wrote:
>>>
>>> Ralph Castain wrote:
>>>
>>> The two files have a slightly different format
>>>
>>> Agreed.
>>>
>>> and completely different meaning.
>>>
>>> Somewhat agreed. They're both related to mapping processes onto a
>>> cluster.
>>>
>>> The hostfile specifies how many slots are on a node. The rankfile
>>> specifies a rank and what node/slot it is to be mapped onto.
>>>
>>> Agreed.
>>>
>>> Rankfiles can use relative node indexing and refer to nodes
>>> received from a resource manager - i.e., without any hostfile.
>>>
>>> This is the main part I'm concerned about.
>>> E.g.,
>>>
>>> % cat rankfile
>>> rank 0=node0 slot=0
>>> rank 1=node1 slot=0
>>> % mpirun -np 2 -rf rankfile ./a.out
>>>
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host node1 that was not allocated or
>>> oversubscribed it's slots:
>>>
>>> --------------------------------------------------------------------------
>>> [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 107
>>> [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
>>> [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
>>> [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016
>>> % mpirun -np 2 -host node0,node1 -rf rankfile ./a.out
>>> 0 on node0
>>> 1 on node1
>>> done
>>>
>>> It seems to me that the rankfile has sufficient information to
>>> express what I want it to do. But mpirun won't accept this. To
>>> fix this, I have to, e.g., supply/maintain/specify redundant
>>> information in a hostfile or host list.
>>>
>>> So the files are intentionally quite different. Trying to combine
>>> them would be rather ugly.
>>>
>>> Right. And my issue is that I'm forced to use both when I only
>>> want rankfile functionality.
>>>
>>> On Thu, Jun 18, 2009 at 1:52 PM, Eugene Loh <eugene....@sun.com> wrote:
>>>
>>> In order to use "mpirun --rankfile", I also need to specify
>>> hosts/hostlist. But that information is redundant with what
>>> I provide in the rankfile. So, from a user's point of view,
>>> this strikes me as broken. Yes? Should I file a ticket, or
>>> am I missing something here about this functionality?
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel