Yo all

More thoughts on hostfile usage - I'm sure you are all sitting on pins-and-needles awaiting more discussion on this exciting topic!
I'm continuing to try and work through the use-cases here so we can get this fixed. It continues to be an issue for users on the list, as well as our own developers. The problem is that we use "hostfile" and "-host" for dual purposes, which means there is an opening for confusion over what should happen. Let's consider two major use-cases.

Use-case 1: hostfile and/or -host, no managed environment

I believe there is an expected and consistent behavior for the case where we are not in a managed environment, but the user specifies a hostfile and/or -host. In these cases, we use the hostfile (if provided) to completely describe the available hosts, and any -host is used to specify which hosts in that hostfile are to be used for the initial application.

At issue, however, is what happens with comm_spawn - is the child job restricted to the -host list, or is it free to use any of the hosts in the hostfile? I have heard it both ways from users, so I believe we are free to decide here. Does anyone have an opinion? Do we need an option to indicate that all child jobs are restricted to the specified -host list?

Use-case 2: managed environment, hostfile and/or -host provided

You will find a lengthy discourse in Ticket #1018 about how to deal with this use-case - it is messy, with multiple definitions running around. I believe we have hit upon a reasonable path forward in that discussion regarding how to parse a node list from this use-case. However, it left open the question of who has access to the resulting node list.

As I tried to indicate in my prior note, the question revolves again around comm_spawn: does the child job have access to all nodes in the original allocation; those nodes in the original allocation that are also listed in a hostfile; those nodes in the original allocation that are also in the -host list; or...?

Obviously, as someone primarily focused on the RTE, I couldn't possibly care less which of these modes you select.
However, I *do* need to know how you want Open MPI to operate so I can build the system to meet those requirements.

I hope this - in combination with the prior note - will help you to understand the question. Any direction would be appreciated, as we are kinda stuck until I know how you want the system to behave.

Oh yeah - in case you were wondering, prior MPIs like LA-MPI and LAM-MPI avoided these issues (e.g., by ignoring hostfiles in use-case 2). So we are kind of charting new territory here - I think our users will be fine either way if we just tell them "this is how it works".

Thanks
Ralph
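
P.S. For anyone who finds the comm_spawn options easier to compare side by side, here is a rough Python sketch of the three candidate node lists from use-case 2. It is purely illustrative and is not how the RTE computes anything: the function names, the policy labels, and the node names are all made up for this note, and it assumes the allocation, hostfile, and -host list have already been parsed into plain lists of node names.

```python
def restrict(nodes, allowed):
    """Keep the nodes (in their original order) that also appear in `allowed`.

    `allowed=None` means that source was not given, so nothing is filtered.
    """
    if allowed is None:
        return list(nodes)
    allowed_set = set(allowed)
    return [n for n in nodes if n in allowed_set]


def candidate_child_hosts(allocation, hostfile=None, dash_host=None):
    """Return the node list a comm_spawn'd child job could see under each
    of the candidate policies discussed above (labels are hypothetical)."""
    return {
        # Child may use every node in the original allocation.
        "allocation": list(allocation),
        # Child is limited to allocated nodes also listed in the hostfile.
        "allocation_and_hostfile": restrict(allocation, hostfile),
        # Child is limited to allocated nodes also named via -host.
        "allocation_and_dash_host": restrict(allocation, dash_host),
    }


if __name__ == "__main__":
    # Hypothetical inputs: a 4-node allocation, a hostfile naming three
    # of those nodes, and a -host list naming one of them.
    allocation = ["n0", "n1", "n2", "n3"]
    hostfile = ["n0", "n1", "n2"]
    dash_host = ["n1"]
    for policy, hosts in candidate_child_hosts(
        allocation, hostfile, dash_host
    ).items():
        print(policy, hosts)
```

The same sketch covers use-case 1 if you pass the hostfile contents as the "allocation": the open question there is just whether the child sees the first list or the last one.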