On 8/30/2011 7:34 PM, Ralph Castain wrote:
On Aug 29, 2011, at 11:18 PM, Eugene Loh wrote:
Maybe someone can help me from having to think too hard.

Let's say I want to max my system limits.  I can say this:

    % mpirun --mca opal_set_max_sys_limits 1 ...

Cool.

Meanwhile, if I do this:

    % setenv OMPI_MCA_opal_set_max_sys_limits 1
    % mpirun ...

remote processes don't see the setting.  (Local processes and ompi_info are
fine.)

I looked at the 1.5 code, and mpirun is reaping all OMPI_ params from the
environ and adding them to the app. So it should be getting set.

I then ran "mpirun -n 1 printenv" on a slurm machine, and verified that indeed
that param was in the environment. Ditto when I told it to use the rsh launcher.

Bug?  Naively, this looks "wrong."  At least disturbing, in any case.
This is with v1.5.

Okay, so one answer is implicit in your reply: you are expecting the same result I am. So, if the behavior is not as I expect but as I describe, it's a bug candidate. (As opposed to, "The problem you're describing is how it's supposed to work; it's no problem at all.")

Now, regarding "mpirun -n 1 printenv", I agree that the environment variable is getting set. Even on a remote node. That suggests that things are fine, but it turns out they are not. The problem is -- and I'm afraid I don't understand the details -- that it's set "too late." I imagine a timeline like this:

A)  orted starts
B)  orted calls opal_util_init_sys_limits()
C)  daemonize a child process
D)  child process execs target process
E)  target process starts up

Looking at the environment, I don't see the variable set in B, which is the only place the variable does any good. Like you, I do see it in E, which is interesting but doesn't help the user.
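
To make the "too late" idea concrete, here is a toy model of the timeline in plain sh. It is only a sketch of the sequence as I imagine it, not Open MPI code: daemon_decision and run_timeline are made-up names, and the assumption that the variable reaches only the child's environment is exactly the hypothesis under discussion, not something I have confirmed in the source.

```shell
#!/bin/sh
# Toy model of steps A-E. Hypothetical names; not Open MPI code.

# Stand-in for step B: the daemon decides, from its OWN environment,
# whether to raise the system limits.
daemon_decision() {
    if [ "${OMPI_MCA_opal_set_max_sys_limits:-0}" = "1" ]; then
        echo raise
    else
        echo leave
    fi
}

run_timeline() {
    (
        # A: the daemon starts without the variable in its environment
        #    (assumption: this is what happens on the remote node).
        unset OMPI_MCA_opal_set_max_sys_limits

        # B: the limit decision is made now, before any child exists.
        decision=$(daemon_decision)

        # C/D: the variable is added only to the child's environment.
        # E: the child reports what it sees.
        child_sees=$(OMPI_MCA_opal_set_max_sys_limits=1 \
            sh -c 'echo "${OMPI_MCA_opal_set_max_sys_limits:-unset}"')

        echo "B: $decision; E: child sees $child_sees"
    )
}

run_timeline
```

The child sees the variable, which matches what printenv shows, yet the decision at B was already made without it.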

Your experiment was reasonable, but the problem is odd. I suggest the following to see the problem. Set the variable in your environment. Then use mpirun to launch a remote process. Then:

1)  In the remote orted, inside opal_util_init_sys_limits(), check for the variable in your environment.

And/or:

2)  Make the remotely launched process something like this:

    #!/bin/csh
    limit descriptors

and see if the descriptor limit got bumped up from what it otherwise should be.
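
If csh is not handy on the remote node, a plain sh probe along the same lines might look like the sketch below. The function name report_limits is made up for illustration, and "ulimit -n" is a common shell extension (bash, dash, ksh) rather than strict POSIX, so treat this as an assumption about your shell. It prints the variable next to the limits so that both can be compared in one run.

```shell
#!/bin/sh
# Probe to run as the remotely launched process: reports whether the MCA
# environment variable is visible and what the descriptor limits are.
# report_limits is a hypothetical helper name.
report_limits() {
    # Which node are we on?
    echo "host=$(uname -n)"
    # Is the variable visible to this (target) process?
    echo "OMPI_MCA_opal_set_max_sys_limits=${OMPI_MCA_opal_set_max_sys_limits:-<unset>}"
    # Soft and hard limits on open file descriptors.
    echo "soft_nofile=$(ulimit -S -n)"
    echo "hard_nofile=$(ulimit -H -n)"
}

report_limits
```

Run it once with the variable set via setenv and once with the parameter given on the mpirun command line. If the variable shows up both times but soft_nofile differs, that would be consistent with the variable arriving after the limits were set.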

In contrast, if you set the MCA parameter on your mpirun command line, the environment variable *does* get set, even in the environment of the orted when it calls opal_util_init_sys_limits().

I can poke at this more tomorrow, but I suspect with one "aha!" you'll figure it out a lot faster than I can. :^(
