[OMPI devel] Orted problem

2007-08-22 Thread Carlos Segura
Hi, I am having a problem with the latest version of Open MPI.
In some executions (roughly 1 in 100) the following message is printed:
 [tegasaste:01617] [NO-NAME] ORTE_ERROR_LOG: File read failure in file
util/universe_setup_file_io.c at line 123
It seems as if it tries to read the universe file and finds nothing in it.
If I look at the file afterwards, it contains the correct information. It looks
as though the file has been created but not yet filled when the read is executed.
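
If that is what is happening, it is a classic writer/reader race: the file
already exists, but the writer has not filled it yet when the reader opens it.
Just as a sketch of one common way to close that window (this is not the
Open MPI code, and the function name below is made up), the writer could
populate a temporary file and rename() it into place, so a reader either sees
the complete file or no file at all:

    /* Sketch only: write the contents to a temporary file, then rename(2) it
     * into place.  rename() is atomic within a filesystem, so a concurrent
     * reader never observes a partially written file. */
    #include <stdio.h>

    static int write_file_atomically(const char *path, const char *contents)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        FILE *fp = fopen(tmp, "w");
        if (NULL == fp) {
            return -1;
        }
        fputs(contents, fp);
        if (0 != fclose(fp)) {          /* flush buffers and catch write errors */
            return -1;
        }
        return rename(tmp, path);       /* atomically replace the target file */
    }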

The output of ompi_info command:

Open MPI: 1.2.3
   Open MPI SVN revision: r15136
Open RTE: 1.2.3
   Open RTE SVN revision: r15136
OPAL: 1.2.3
   OPAL SVN revision: r15136
  Prefix: /soft/openmpi1.2.3
 Configured architecture: i686-pc-linux-gnu
   Configured by: csegura
   Configured on: Wed Aug 22 04:25:19 WEST 2007
  Configure host: tegasaste
Built by: csegura
Built on: Wed Aug 22 04:38:34 WEST 2007
  Built host: tegasaste
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
 Fortran90 bindings size: small
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
  Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
 C profiling: yes
   C++ profiling: yes
 Fortran77 profiling: yes
 Fortran90 profiling: yes
  C++ exceptions: no
  Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
 libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
   MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.3)
  MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.3)
   MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.3)
   MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.3)
   MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.3)
 MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.3)
 MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.3)
   MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
   MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.3)
MCA coll: self (MCA v1.0, API v1.0, Component v1.2.3)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.3)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.3)
  MCA io: romio (MCA v1.0, API v1.0, Component v1.2.3)
   MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.3)
   MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.3)
 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.3)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.3)
 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.3)
  MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.3)
 MCA btl: openib (MCA v1.0, API v1.0.1, Component v1.2.3)
 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.3)
 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.3)
 MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.3)
 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.3)
  MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.3)
  MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.3)
  MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.3)
 MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.3)
 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.3)
 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.3)
 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.3)
 MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.3)
  MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.3)
  MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.3)
 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
 MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.3)
 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.3)
 MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.3)
 MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.3)
 MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.3)
 MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.3)
 MCA rds: resfile (MCA 

Re: [OMPI devel] orted problem

2006-07-05 Thread Ralph H Castain
This has been around for a very long time (at least a year, if memory serves
correctly). The problem is that the system "hangs" while trying to flush the
I/O buffers through the RML because it loses its connection to the head node
process (for 1.x, that's basically mpirun) - but the "flush" procedure
doesn't give up.

What's needed is some tuning of the entire I/O-RML system so that we can
time out properly and, when we receive that error, exit instead of retrying. I
thought someone was going to take a shot at that a while back (at least six
months ago), but I don't recall it actually happening - too many higher
priorities.
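
Roughly, the kind of thing I mean (just a sketch - the two helpers here are
hypothetical placeholders, not real ORTE/OPAL symbols):

    /* Sketch of a bounded flush: keep progressing until the buffers drain,
     * but give up on an RML error or after a deadline instead of retrying
     * forever.  iof_have_pending() and iof_progress_once() are placeholders. */
    #include <stdbool.h>
    #include <time.h>

    extern bool iof_have_pending(void);   /* any unflushed output left? */
    extern int  iof_progress_once(void);  /* returns < 0 on an RML error */

    static int iof_flush_with_timeout(int timeout_sec)
    {
        time_t deadline = time(NULL) + timeout_sec;

        while (iof_have_pending()) {
            if (iof_progress_once() < 0) {
                return -1;    /* lost the connection to the HNP: stop flushing */
            }
            if (time(NULL) > deadline) {
                return -1;    /* timed out: exit instead of retrying */
            }
        }
        return 0;
    }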

Ralph




Re: [OMPI devel] orted problem

2006-07-04 Thread Josh Hursey
I have been noticing this for a while (at least two months) as well,
along with stale session directories. I filed a bug yesterday, #177:

  https://svn.open-mpi.org/trac/ompi/ticket/177
I'll add this stack trace to it. I want to take a closer look  
tomorrow to see what's really going on here.


When I left it yesterday, I found that if you Ctrl-C the running
mpirun and the orteds hang, then if you send another signal to
mpirun, mpirun will sometimes die from SIGPIPE. This is a race
condition due to the orteds leaving, but we should be masking that
signal, or doing something other than dying.
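
Masking it would be something like this (just a sketch, not the actual mpirun
code):

    /* Sketch: ignore SIGPIPE so a write to a dead orted's socket returns
     * EPIPE instead of killing mpirun. */
    #include <signal.h>
    #include <string.h>

    static void ignore_sigpipe(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = SIG_IGN;          /* drop the signal... */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGPIPE, &sa, NULL);    /* ...so writes fail with EPIPE instead */
    }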


So I think there is more than one race in this code, and it will need
some serious looking at.


--Josh

On Jul 4, 2006, at 12:38 PM, George Bosilca wrote:


Starting a few days ago, I have noticed that more and more orteds are
left over after my runs. Usually, if the job runs to completion, they
disappear. But if I kill the job or it segfaults, they don't. I
attached to one of them and got the following stack:

#0  0x9001f7a8 in select ()
#1  0x00375d34 in select_dispatch (arg=0x39ec6c, tv=0xbfffe664) at ../../../ompi-trunk/opal/event/select.c:202
#2  0x00373b70 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:485
#3  0x00237ee0 in orte_iof_base_flush () at ../../../../ompi-trunk/orte/mca/iof/base/iof_base_flush.c:111
#4  0x004cbb38 in orte_pls_fork_wait_proc (pid=9045, status=9, cbdata=0x50c250) at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:175
#5  0x002111f0 in do_waitall (options=0) at ../../ompi-trunk/orte/runtime/orte_wait.c:500
#6  0x00210ac8 in orte_wait_signal_callback (fd=20, event=8, arg=0x26f3f8) at ../../ompi-trunk/orte/runtime/orte_wait.c:366
#7  0x003737f8 in opal_event_process_active () at ../../../ompi-trunk/opal/event/event.c:428
#8  0x00373ce8 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:513
#9  0x00368714 in opal_progress () at ../../ompi-trunk/opal/runtime/opal_progress.c:259
#10 0x004cdf48 in opal_condition_wait (c=0x4cf0f0, m=0x4cf0b0) at ../../../../../ompi-trunk/opal/threads/condition.h:81
#11 0x004cde60 in orte_pls_fork_finalize () at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:764
#12 0x002417d0 in orte_pls_base_finalize () at ../../../../ompi-trunk/orte/mca/pls/base/pls_base_close.c:42
#13 0x000ddf58 in orte_rmgr_urm_finalize () at ../../../../../ompi-trunk/orte/mca/rmgr/urm/rmgr_urm.c:521
#14 0x00254ec0 in orte_rmgr_base_close () at ../../../../ompi-trunk/orte/mca/rmgr/base/rmgr_base_close.c:39
#15 0x0020e574 in orte_system_finalize () at ../../ompi-trunk/orte/runtime/orte_system_finalize.c:65
#16 0x0020899c in orte_finalize () at ../../ompi-trunk/orte/runtime/orte_finalize.c:42
#17 0x2ac8 in main (argc=19, argv=0xb17c) at ../../../../ompi-trunk/orte/tools/orted/orted.c:377

Somehow, it waits for pid 9045. But that was one of the children, and
it got the SIGKILL signal (I checked with strace). I wonder if we
have a race condition somewhere in the wait_signal code.
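
For example (purely a sketch, not the real pls_fork code, and the helper name
is made up), the finalize path could re-check the child with waitpid(WNOHANG)
before blocking, so an exit that raced with the signal callback is not waited
on forever:

    /* Sketch: poll the child non-blockingly; if it has already exited (or was
     * already reaped elsewhere), there is nothing left to wait for. */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <errno.h>
    #include <stdbool.h>

    static bool child_already_gone(pid_t pid)
    {
        int status;
        pid_t rc = waitpid(pid, &status, WNOHANG);
        /* rc > 0: we just reaped it; rc < 0 with ECHILD: it was reaped before */
        return rc > 0 || (rc < 0 && ECHILD == errno);
    }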

Hope that helps,
   george.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/