Just an idea: maybe it is worth providing a brand new command-line option
to mpirun. This option would accept a filename and support a combined
syntax for machinefile/hostfile (to define allocations) and rankfile (to
define placement).

YAML syntax could be used to describe the file primitives
(http://www.yaml.org/start.html).

for example:


$ mpirun -clusterfile /path/to/clusterfile
$ cat clusterfile
hostX:
       slots    : int
       maxslots : int
       ranks    : rankid[@socket:core]


example of clusterfile
===============

hostX:
       slots    : 4
       maxslots : 4
       ranks    : 1,16,22

hostY:
       slots    : 8
       maxslots : 8
       ranks    : 1@0:*, 3@2-3, 4@0:1, 5


By doing so, we keep backwards compatibility: after reading the clusterfile,
the code should process the *hostfile* and *rankfile* parts as it does today.
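
To make the idea concrete, here is a rough Python sketch of how a clusterfile
could be split back into a hostfile part and a rankfile part. It is only an
illustration under assumptions: the clusterfile format is just the proposal
above, the function names (parse_clusterfile, parse_rank, split_clusterfile)
are hypothetical, the parser handles only the flat two-level layout shown
here rather than real YAML, and the "slots=/max_slots=" hostfile lines are
what I would expect the existing hostfile parser to be fed. The sample uses
rank ids that are unique across hosts.

```python
SAMPLE = """
hostX:
       slots    : 4
       maxslots : 4
       ranks    : 0,16,22

hostY:
       slots    : 8
       maxslots : 8
       ranks    : 1@0:*, 3@2-3, 4@0:1, 5
"""


def parse_rank(item):
    """Split 'rankid[@placement]' into (rank, placement or None)."""
    rank, sep, place = item.strip().partition("@")
    return int(rank), (place.strip() if sep else None)


def parse_clusterfile(text):
    """Parse the proposed layout: unindented 'host:' lines, indented 'key : value' lines."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        # Split at the first colon only, so placements like '0:*' stay intact.
        key, _, value = raw.partition(":")
        key, value = key.strip(), value.strip()
        if not raw[0].isspace():
            # An unindented line such as 'hostX:' opens a new host entry.
            current = hosts.setdefault(key, {})
        elif current is None:
            raise ValueError(f"attribute before any host: {raw!r}")
        elif key in ("slots", "maxslots"):
            current[key] = int(value)
        elif key == "ranks":
            current[key] = [parse_rank(i) for i in value.split(",") if i.strip()]
        else:
            raise ValueError(f"unknown key: {key!r}")
    return hosts


def split_clusterfile(hosts):
    """Return (hostfile_lines, placements) so each part can be handled as today."""
    hostfile_lines, placements = [], []
    for host, attrs in hosts.items():
        parts = [host]
        if "slots" in attrs:
            parts.append(f"slots={attrs['slots']}")
        if "maxslots" in attrs:
            parts.append(f"max_slots={attrs['maxslots']}")
        hostfile_lines.append(" ".join(parts))
        for rank, place in attrs.get("ranks", []):
            placements.append((rank, host, place))
    # Sort by rank id only; placement may be None and is not comparable.
    return hostfile_lines, sorted(placements, key=lambda p: p[0])


if __name__ == "__main__":
    hostfile_lines, placements = split_clusterfile(parse_clusterfile(SAMPLE))
    print("\n".join(hostfile_lines))
    for rank, host, place in placements:
        print(rank, host, place)
```

The allocation part comes out as ordinary hostfile lines, and the placement
part as (rank, host, placement) tuples that a rankfile-style mapper could
consume. Ranks listed without "@..." carry no placement, so how they get
bound is left open here.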

what do you think?
Mike



On Mon, Jun 22, 2009 at 1:30 PM, Terry Dontje <terry.don...@sun.com> wrote:

> Let us think about this some more.  We'll try and reply later today.
>
> --td
>
> Ralph Castain wrote:
>
>> Had a chance to think about how this might be done, and looked at it for
>> awhile after getting home. I -think- I found a way to do it, but there are a
>> couple of caveats:
>>
>> 1. Len's point about oversubscribing without warning would definitely hold
>> true - this would positively be a "user beware" option
>>
>> 2. there could be no RM-provided allocation, hostfile, or -host options
>> specified. Basically, I would be adding the "read rankfile" option to the
>> end of the current allocation determination procedure
>>
>> I would still allow more procs than shown in the rankfile (mapping the
>> rest bynode on the nodes specified in the rankfile - can't do byslot because
>> I don't know how many slots are on each node), which means the only change
>> in behavior would be the forced bynode mapping of unspecified procs.
>>
>> So use of this option will entail some risks and a slight difference in
>> behavior, but would relieve you from the burden of having to provide a
>> hostfile. I'm not personally convinced it is worth the risk and probable
>> user complaints of "it didn't work", but since we don't use this option, I
>> don't have a strong opinion on the matter.
>>
>> Let's just avoid going back-and-forth over wanting it, or how it should be
>> implemented - let's get it all ironed out, and then implement it once, like
>> we finally did at the end with the whole hostfile thing.
>>
>> Let me know if you want me to do this - it obviously isn't at the top of
>> my priority list, but still could be done in the next few weeks.
>>
>> Ralph
>>
>>
>> On Jun 21, 2009, at 9:00 AM, Lenny Verkhovsky wrote:
>>
>>> Sorry for the delay in response. I totally agree with Ralph that it's
>>> not as easy as it seems:
>>> 1. The rankfile mapper uses already-allocated machines (by scheduler or
>>> hostfile). By using the rankfile as a hostfile, we can run into a
>>> problem where we try to use unallocated nodes, which can hang the run.
>>> 2. We can't define the number of slots on each machine in the rankfile,
>>> which means oversubscribing can take place without any warning.
>>> 3. I personally don't see any problem using a hostfile, even if it has
>>> redundant info. Hostfile and rankfile belong to different layers in the
>>> system and solve different problems. The original hostfile (if I recall
>>> correctly) could bind a rank to a node, but the syntax wasn't very
>>> flexible or clear.
>>> Lenny.
>>>
>>> On Sun, Jun 21, 2009 at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>    Let me suggest a two-step process, then:
>>>
>>>    1. let's change the error message as this is easily done and thus
>>>    can be done now
>>>
>>>    2. I can look at how to eat the rankfile as a hostfile. This may
>>>    not even be possible - the problem is that the entire system is
>>>    predicated on certain ordering due to our framework architecture.
>>>    So we get an allocation, and then do a mapping against that
>>>    allocation, filtering the allocation through hostfiles, -host,
>>>    and other options.
>>>
>>>    By the time we reach the rankfile mapper, we have already
>>>    determined that we don't have an allocation and have to abort. It
>>>    is the rankfile mapper itself that looks for the -rankfile
>>>    option, so the system can have no knowledge that someone has
>>>    specified that option before that point - and thus, even if I
>>>    could parse the rankfile, I don't know it was given!
>>>
>>>    What will take time is to figure out a way to either:
>>>
>>>    (a) allow us to run the mapper even though we don't have any
>>>    nodes we know about, and allow the mapper to insert the nodes
>>>    itself - without causing non-rankfile uses to break (which could
>>>    be a major feat); or
>>>
>>>    (b) have the overall system check for the rankfile option and
>>>    pass it as a hostfile as well, assuming that a hostfile wasn't
>>>    also given, no RM-based allocation exists, etc. - which breaks
>>>    our abstraction rules and also opens a possible can of worms.
>>>
>>>    Either way, I also then have to teach the hostfile parser how to
>>>    realize it is a rankfile format and convert the info in it into
>>>    what we expected to receive from a hostfile - another non-trivial
>>>    problem.
>>>
>>>    I'm willing to give it a try - just trying to make clear why my
>>>    response was negative. It isn't as simple as it sounds...which is
>>>    why Len and I didn't pursue it when this was originally developed.
>>>
>>>    Ralph
>>>
>>>
>>>    On Sun, Jun 21, 2009 at 5:28 AM, Terry Dontje
>>>    <terry.don...@sun.com> wrote:
>>>
>>>        Having been a part of these discussions, I can understand your
>>>        reticence to reopen this one.  However, I think this is a
>>>        major usability issue with a feature that is fairly
>>>        important for getting things to run with good performance,
>>>        which IMO is important.
>>>
>>>        That being said, I think one of two things could be done to
>>>        mitigate the issue:
>>>
>>>        1.  Eliminate the element of surprise by changing mpirun
>>>        to eat a rankfile without the hostfile.
>>>        2.  Change the error message to something understandable
>>>        by users, so that they know they might be missing the
>>>        hostfile option.
>>>
>>>        Again, I understand this topic is frustrating and there are
>>>        some boundaries with the design that make these two options
>>>        orthogonal to each other, but I really believe we need to make
>>>        the rankfile option something that is easily usable by our users.
>>>
>>>
>>>        --td
>>>
>>>        Ralph Castain wrote:
>>>
>>>            Having gone around in circles on hostfile-related issues
>>>            for over five years now, I honestly have little
>>>            motivation to re-open the entire discussion again. It
>>>            doesn't seem to be that daunting a requirement for those
>>>            who are using it, so I'm inclined to just leave well
>>>            enough alone.
>>>
>>>            :-)
>>>
>>>
>>>            On Fri, Jun 19, 2009 at 2:21 PM, Eugene Loh
>>>            <eugene....@sun.com> wrote:
>>>
>>>               Ralph Castain wrote:
>>>
>>>                   The two files have a slightly different format
>>>
>>>               Agreed.
>>>
>>>                   and completely different meaning.
>>>
>>>               Somewhat agreed.  They're both related to mapping
>>>               processes onto a cluster.
>>>
>>>                   The hostfile specifies how many slots are on a
>>>                   node. The rankfile specifies a rank and what
>>>                   node/slot it is to be mapped onto.
>>>
>>>               Agreed.
>>>
>>>                   Rankfiles can use relative node indexing and refer
>>>                   to nodes received from a resource manager - i.e.,
>>>                   without any hostfile.
>>>
>>>               This is the main part I'm concerned about.  E.g.,
>>>
>>>               % cat rankfile
>>>               rank 0=node0 slot=0
>>>               rank 1=node1 slot=0
>>>               % mpirun -np 2 -rf rankfile ./a.out
>>>
>>>               --------------------------------------------------------------------------
>>>               Rankfile claimed host node1 that was not allocated or
>>>               oversubscribed it's slots:
>>>
>>>               --------------------------------------------------------------------------
>>>               [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 107
>>>               [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
>>>               [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
>>>               [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016
>>>               % mpirun -np 2 -host node0,node1 -rf rankfile ./a.out
>>>               0 on node0
>>>               1 on node1
>>>               done
>>>
>>>               It seems to me that the rankfile has sufficient
>>>               information to express what I want it to do.  But
>>>               mpirun won't accept this.  To fix this, I have to,
>>>               e.g., supply/maintain/specify redundant information
>>>               in a hostfile or host list.
>>>
>>>                   So the files are intentionally quite different.
>>>                   Trying to combine them would be rather ugly.
>>>
>>>               Right.  And my issue is that I'm forced to use both
>>>               when I only want rankfile functionality.
>>>
>>>                   On Thu, Jun 18, 2009 at 1:52 PM, Eugene Loh
>>>                   <eugene....@sun.com> wrote:
>>>
>>>                       In order to use "mpirun --rankfile", I also
>>>                       need to specify hosts/hostlist.  But that
>>>                       information is redundant with what I provide
>>>                       in the rankfile.  So, from a user's point of
>>>                       view, this strikes me as broken.  Yes?  Should
>>>                       I file a ticket, or am I missing something
>>>                       here about this functionality?
>>>
>>>
>>>               _______________________________________________
>>>               devel mailing list
>>>               de...@open-mpi.org
>>>
>>>               http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
