Greg,

Thanks for tracking this down!
Tim


Greg Watson wrote:
Hi all,

To recap: the problem was that if orted was launched from Eclipse (on OS X), subsequent attempts to run a program (using mpirun or whatever) returned immediately. If orted was launched from anywhere else (Java, the command line, etc.), it worked fine.

Turning on daemon logging showed that the program was aborting immediately because the execv() of the ssh command to the remote machine was failing with errno=14 (EFAULT). Clearly there was some environment difference, and after much checking it became apparent that the Eclipse-launched orted did not have $(OMPI_INSTALL) in its PATH. The orte_pls_rsh_launch() function checks whether you're launching onto the local machine or a remote one. For local machines (as in this case), it calls opal_path_findv() to find the local path of orted. Unfortunately, because $(OMPI_INSTALL) is not in the local PATH, this lookup fails and returns NULL. The NULL is then passed as the first argument to execv(), which fails with EFAULT.
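To make the failure mode concrete, here is a minimal sketch of what a PATH search has to do. It is not opal_path_findv() itself; find_in_path() is a hypothetical helper that tries each directory in a colon-separated list and returns NULL when none of them contains the program, which is the case hit when $(OMPI_INSTALL) is missing from PATH.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the Open MPI API): search a colon-separated
 * directory list for an executable named `name`.
 * Returns a malloc'd full path that the caller must free(), or NULL if
 * no directory contains an executable with that name.
 * Note: strtok_r() skips empty entries, so "::" is not treated as ".".
 */
static char *find_in_path(const char *name, const char *path_list)
{
    if (name == NULL || path_list == NULL) {
        return NULL;
    }
    char *copy = strdup(path_list);   /* strtok_r modifies its input */
    if (copy == NULL) {
        return NULL;
    }
    char *result = NULL;
    char *saveptr = NULL;
    for (char *dir = strtok_r(copy, ":", &saveptr); dir != NULL;
         dir = strtok_r(NULL, ":", &saveptr)) {
        size_t len = strlen(dir) + 1 + strlen(name) + 1;
        char *candidate = malloc(len);
        if (candidate == NULL) {
            break;
        }
        snprintf(candidate, len, "%s/%s", dir, name);
        if (access(candidate, X_OK) == 0) {
            result = candidate;       /* first executable match wins */
            break;
        }
        free(candidate);
    }
    free(copy);
    return result;                    /* NULL when nothing was found */
}
```

With a PATH that lacks the install directory, a lookup for orted returns NULL, and any caller that hands that result straight to execv() ends up in the situation described above.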

The problem is easily reproduced by removing $(OMPI_INSTALL) from your PATH, running $(OMPI_INSTALL)/orted, and then trying to run something with mpirun.

Why did it work from the command line? On OS X, the shell gets its PATH from ~/.bash_profile, etc. (which in this case included OMPI_INSTALL), but applications launched from the window system get their environment from the loginwindow app, which reads environment variables from ~/.MacOSX/environment.plist (which didn't include OMPI_INSTALL). I suspect, but haven't tried, that launching Eclipse from the command line would have worked.

I'm not sure why there is logic to look up the path again for local launches, since it should be the same as the path in the component. It should certainly check for a NULL return, though.
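The NULL check could look something like the sketch below. checked_execv() is a hypothetical wrapper, not existing Open MPI code; it assumes that reporting "not found" (ENOENT) is more useful in the daemon log than the misleading EFAULT produced by passing NULL to execv().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (not the Open MPI API): refuse to call execv()
 * with a NULL program path. Passing NULL through is what produced
 * errno=14 (EFAULT); failing early with ENOENT and a message points at
 * the real cause, i.e. the program was not found on PATH.
 */
static int checked_execv(const char *path, char *const argv[])
{
    if (path == NULL) {
        fprintf(stderr, "launch failed: %s not found in PATH\n",
                (argv != NULL && argv[0] != NULL) ? argv[0] : "program");
        errno = ENOENT;
        return -1;
    }
    execv(path, argv);   /* returns only on failure; execv sets errno */
    return -1;
}
```

A caller would pass the result of its PATH lookup directly to checked_execv() and log errno on a -1 return, rather than testing for NULL at every call site.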

Greg


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
