Hi all,

I already asked this question on the Torque mailing list and found
several reports of similar issues on the web, but no definitive solution.
When we run our MPI programs via Torque/Maui, jobs fail seemingly at
random, in roughly 50-70% of cases, with the following error message:

[node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file 
ras_tm_module.c at line 142
[node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file 
ras_tm_module.c at line 82
[node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file 
base/ras_base_allocate.c at line 149
[node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file 
base/plm_base_launch_support.c at line 99
[node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file 
plm_tm_module.c at line 194
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

I compiled hwloc 1.9 with --with-libpci, Torque with --enable-cpuset, and
Open MPI with --with-tm, so from the docs I expected Torque and Open MPI
to communicate seamlessly. Resubmitting the exact same job usually
succeeds on the next attempt or the one after that. Adding a sleep to the
job script to work around a possible race condition did not help.
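
For reference, a stripped-down version of our job script looks roughly
like this (the resource request and program name are placeholders, not
our real values). The lines before mpirun are a sanity check I plan to
add on the next runs, since, as far as I understand, Open MPI's tm RAS
component reads the node file that Torque generates for the job:

  #!/bin/bash
  #PBS -N mpi_test
  #PBS -l nodes=2:ppn=8
  #PBS -l walltime=01:00:00

  cd "$PBS_O_WORKDIR"

  # Record the state of the Torque-generated node file at launch time.
  # If it is missing or unreadable here, I would expect exactly the
  # "File open failure" shown above.
  echo "PBS_NODEFILE=$PBS_NODEFILE"
  ls -l "$PBS_NODEFILE"
  cat "$PBS_NODEFILE"

  # With tm support built in, mpirun takes the allocation from Torque,
  # so no -np or hostfile is given here.
  mpirun ./my_mpi_program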

Any ideas?

Thanks,
Andrej
