It is a long-standing problem that, due to a bug in Sun GridEngine (it sets the stack size limit equal to the address space limit), using qrsh from within OpenMPI fails if a large amount of memory is requested but the stack size is not explicitly set to a reasonably small value.
The best solution would be for SGE not to touch the stack size limit at all and to leave it at INFINITY. However, I have verified that simply reducing the stack size limit in the file orte/mca/plm/rsh/plm_rsh_module.c, function ssh_child(), before execv'ing qrsh circumvents the problem. Just after exec_path is set by strdup(...), I inserted the following lines (the file must also include <sys/resource.h> and <string.h>):

    {
        /* Work around the SGE bug: cap RLIMIT_STACK at 10,000,000 bytes,
           but only when the launcher about to be exec'd is qrsh. */
        const rlim_t cap = 10000000;
        struct rlimit rlim;
        size_t l = strlen(exec_path);

        if (l > 5 && !strcmp("/qrsh", exec_path + (l - 5)) &&
            getrlimit(RLIMIT_STACK, &rlim) == 0) {
            if (rlim.rlim_max > cap) rlim.rlim_max = cap;
            if (rlim.rlim_cur > cap) rlim.rlim_cur = cap;
            setrlimit(RLIMIT_STACK, &rlim);
        }
    }

It looks quick-and-dirty, and it certainly is, but it solves a severe problem that many users have with OpenMPI and SGE. Feel free to use this information as you like. Note that MPI worker jobs possibly spawned on "distant" nodes do not suffer from the reduced stack size limit; only the qrsh command itself is affected.

Is this (still) of interest?

+---------------------------------+----------------------------------+
| Prof. Christoph van Wüllen      | Tele-Phone (+49) (0)631 205 2749 |
| TU Kaiserslautern, FB Chemie    | Tele-Fax (+49) (0)631 205 2750   |
| Erwin-Schrödinger-Str.          |                                  |
| D-67663 Kaiserslautern, Germany | vanwul...@chemie.uni-kl.de       |
|                                                                    |
| HomePage: http://www.chemie.uni-kl.de/vanwullen                    |
+---------------------------------+----------------------------------+