Having been part of these discussions, I can understand your reluctance to reopen them. However, I think this is a major usability issue, and the rankfile feature is, IMO, fairly important for getting jobs to perform well.

That being said, I think one of two things could be done to mitigate the issue.

1. Eliminate the element of surprise by changing mpirun to accept a rankfile without a hostfile.
2. Change the error message to something the user can understand, so that they know they might be missing the hostfile option.
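
Until either change lands, the redundancy can be worked around on the user side. The following is only a rough sketch, not anything mpirun does today: `hosts_from_rankfile` and `mpirun_command` are hypothetical helper names, and the line pattern is assumed from the simple "rank N=host slot=S" form quoted below rather than from the full rankfile grammar. It pulls the host names out of a rankfile and repeats them through -host, so the list does not have to be maintained by hand.

```python
import re

# Matches lines such as "rank 0=node0 slot=0". This pattern is an assumption
# based on the rankfile quoted in this thread, not the full rankfile grammar.
_RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)\s*$")


def hosts_from_rankfile(text):
    """Return the distinct host names named in a rankfile, in first-seen order."""
    hosts = []
    for line in text.splitlines():
        match = _RANK_LINE.match(line)
        if match is None:
            continue  # skip blank or unrecognized lines
        host = match.group(2)
        # Relative entries such as "+n0" refer to nodes handed out by a
        # resource manager, so there is no host name to extract from them.
        if host.startswith("+"):
            continue
        if host not in hosts:
            hosts.append(host)
    return hosts


def mpirun_command(rankfile_path, rankfile_text, program, np):
    """Build an mpirun argument list that repeats the rankfile's hosts via -host."""
    hosts = hosts_from_rankfile(rankfile_text)
    cmd = ["mpirun", "-np", str(np)]
    if hosts:
        cmd += ["-host", ",".join(hosts)]
    cmd += ["-rf", rankfile_path, program]
    return cmd
```

For the two-line rankfile quoted below, this would build the same command line that succeeded there: mpirun -np 2 -host node0,node1 -rf rankfile ./a.out.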

Again, I understand this topic is frustrating, and there are some boundaries in the design that make these two options orthogonal to each other, but I really believe we need to make the rankfile option easily usable by our users.


--td

Ralph Castain wrote:
Having gone around in circles on hostfile-related issues for over five years now, I honestly have little motivation to re-open the entire discussion again. It doesn't seem to be that daunting a requirement for those who are using it, so I'm inclined to just leave well enough alone.

:-)


On Fri, Jun 19, 2009 at 2:21 PM, Eugene Loh <eugene....@sun.com> wrote:

    Ralph Castain wrote:
    The two files have a slightly different format
    Agreed.
    and completely different meaning.
    Somewhat agreed.  They're both related to mapping processes onto a
    cluster.

    The hostfile specifies how many slots are on a node. The rankfile
    specifies a rank and what node/slot it is to be mapped onto.
    Agreed.

    Rankfiles can use relative node indexing and refer to nodes
    received from a resource manager - i.e., without any hostfile.
    This is the main part I'm concerned about.  E.g.,

    % cat rankfile
    rank 0=node0 slot=0
    rank 1=node1 slot=0
    % mpirun -np 2 -rf rankfile ./a.out
    --------------------------------------------------------------------------
    Rankfile claimed host node1 that was not allocated or
    oversubscribed it's slots:

    --------------------------------------------------------------------------
    [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
    rmaps_rank_file.c at line 107
    [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
    base/rmaps_base_map_job.c at line 86
    [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
    base/plm_base_launch_support.c at line 86
    [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
    plm_rsh_module.c at line 1016
    % mpirun -np 2 -host node0,node1 -rf rankfile ./a.out
    0 on node0
    1 on node1
    done

    It seems to me that the rankfile has sufficient information to
    express what I want it to do.  But mpirun won't accept this.  To
    fix this, I have to, e.g., supply/maintain/specify redundant
    information in a hostfile or host list.

    So the files are intentionally quite different. Trying to combine
    them would be rather ugly.
    Right.  And my issue is that I'm forced to use both when I only
    want rankfile functionality.

    On Thu, Jun 18, 2009 at 1:52 PM, Eugene Loh <eugene....@sun.com> wrote:

        In order to use "mpirun --rankfile", I also need to specify
        hosts/hostlist.  But that information is redundant with what
        I provide in the rankfile.  So, from a user's point of view,
        this strikes me as broken.  Yes?  Should I file a ticket, or
        am I missing something here about this functionality?


    _______________________________________________
    devel mailing list
    de...@open-mpi.org
    http://www.open-mpi.org/mailman/listinfo.cgi/devel


