Strange that your code didn't generate any symbols - is that a MOSIX thing? 
Have you tried adding opal_output() statements to your code (so the output goes 
to a dedicated diagnostic channel) to see where the segfault is occurring?
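
For example, a few trace statements in the suspect code path might look like 
the sketch below. This is illustrative only - the function name and messages 
are hypothetical, not taken from your module; opal_output() itself is the 
real Open MPI utility, and stream id 0 simply writes to stderr:

```c
/* Illustrative sketch: trace entry/exit of the suspect code path.
 * opal_output(0, ...) writes to stderr; a private verbose stream can
 * instead be created with opal_output_open() and controlled at runtime. */
#include "opal/util/output.h"

static int mca_btl_mosix_add_procs_traced(void)
{
    opal_output(0, "mosix btl: entering add_procs");
    /* ... your existing add_procs logic goes here ... */
    opal_output(0, "mosix btl: add_procs completed");
    return 0;
}
```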

It looks like you are getting through orte_init. You could add -mca 
grpcomm_base_verbose 5 to see whether you are getting into/through the modex - 
if so, then you are probably failing in add_procs.
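
Concretely, adapting the run from the log below, the invocation would look 
something like this (the binary name and rank count are taken from your 
original command; the extra verbosity flag is the only addition):

```
mpirun -n 4 -mca btl self,mosix -mca grpcomm_base_verbose 5 ft.S.4
```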


On Apr 25, 2012, at 5:05 AM, Alex Margolin wrote:

> Hi,
> 
> I'm getting a segv error off my build of the trunk. I know that my BTL module 
> is responsible ("-mca btl self,tcp" works, "-mca btl self,mosix" fails). 
> Smaller/simpler test applications pass, but NPB doesn't. Can anyone suggest 
> how to proceed with debugging this? My attempts so far include some debug 
> printouts and GDB, whose output appears below... What can I do next?
> 
> I'd appreciate any input,
> Alex
> 
> alex@singularity:~/huji/benchmarks/mpi/npb$ mpirun --debug-daemons -d -n 4 
> xterm -l -e gdb ft.S.4
> [singularity:07557] procdir: 
> /tmp/openmpi-sessions-alex@singularity_0/44228/0/0
> [singularity:07557] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/0
> [singularity:07557] top: openmpi-sessions-alex@singularity_0
> [singularity:07557] tmp: /tmp
> [singularity:07557] [[44228,0],0] hostfile: checking hostfile 
> /home/alex/huji/ompi/etc/openmpi-default-hostfile for nodes
> [singularity:07557] [[44228,0],0] hostfile: filtering nodes through hostfile 
> /home/alex/huji/ompi/etc/openmpi-default-hostfile
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing 
> Command: ORTE_DAEMON_ADD_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received add_local_procs
>  MPIR_being_debugged = 0
>  MPIR_debug_state = 1
>  MPIR_partial_attach_ok = 1
>  MPIR_i_am_starter = 0
>  MPIR_forward_output = 0
>  MPIR_proctable_size = 4
>  MPIR_proctable:
>    (i, host, exe, pid) = (0, singularity, /usr/bin/xterm, 7558)
>    (i, host, exe, pid) = (1, singularity, /usr/bin/xterm, 7559)
>    (i, host, exe, pid) = (2, singularity, /usr/bin/xterm, 7560)
>    (i, host, exe, pid) = (3, singularity, /usr/bin/xterm, 7561)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [singularity:07592] procdir: 
> /tmp/openmpi-sessions-alex@singularity_0/44228/1/3
> [singularity:07592] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
> [singularity:07592] top: openmpi-sessions-alex@singularity_0
> [singularity:07592] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing 
> Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local 
> proc [[44228,1],3]
> [singularity:07592] [[44228,1],3] decode:nidmap decoding nodemap
> [singularity:07592] [[44228,1],3] decode:nidmap decoding 1 nodes
> [singularity:07592] [[44228,1],3] node[0].name singularity daemon 0
> [singularity:07594] procdir: 
> /tmp/openmpi-sessions-alex@singularity_0/44228/1/1
> [singularity:07594] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
> [singularity:07594] top: openmpi-sessions-alex@singularity_0
> [singularity:07594] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing 
> Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local 
> proc [[44228,1],1]
> [singularity:07594] [[44228,1],1] decode:nidmap decoding nodemap
> [singularity:07594] [[44228,1],1] decode:nidmap decoding 1 nodes
> [singularity:07594] [[44228,1],1] node[0].name singularity daemon 0
> [singularity:07596] procdir: 
> /tmp/openmpi-sessions-alex@singularity_0/44228/1/0
> [singularity:07596] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
> [singularity:07596] top: openmpi-sessions-alex@singularity_0
> [singularity:07596] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing 
> Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local 
> proc [[44228,1],0]
> [singularity:07596] [[44228,1],0] decode:nidmap decoding nodemap
> [singularity:07596] [[44228,1],0] decode:nidmap decoding 1 nodes
> [singularity:07596] [[44228,1],0] node[0].name singularity daemon 0
> [singularity:07598] procdir: 
> /tmp/openmpi-sessions-alex@singularity_0/44228/1/2
> [singularity:07598] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
> [singularity:07598] top: openmpi-sessions-alex@singularity_0
> [singularity:07598] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing 
> Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local 
> proc [[44228,1],2]
> [singularity:07598] [[44228,1],2] decode:nidmap decoding nodemap
> [singularity:07598] [[44228,1],2] decode:nidmap decoding 1 nodes
> [singularity:07598] [[44228,1],2] node[0].name singularity daemon 0
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing 
> Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
> [singularity:07557] [[44228,0],0] orted:comm:message_local_procs delivering 
> message to job [44228,1] tag 30
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing 
> Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
> [singularity:07557] [[44228,0],0] orted:comm:message_local_procs delivering 
> message to job [44228,1] tag 30
> [singularity:07557] [[44228,0],0]:errmgr_default_hnp.c(418) updating exit 
> status to 1
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing 
> Command: ORTE_DAEMON_EXIT_CMD
> [singularity:07557] [[44228,0],0] orted_cmd: received exit cmd
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] [[44228,0],0] orted_cmd: all routes and children gone - 
> exiting
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 2 with PID 7560 on
> node singularity exiting improperly. There are three reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> 
> You can avoid this message by specifying -quiet on the mpirun command line.
> 
> --------------------------------------------------------------------------
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> exiting with status 1
> alex@singularity:~/huji/benchmarks/mpi/npb$ grep SIGSEGV *
> Xterm.log.singularity.2012.04.24.20.38.03.6992:During startup program 
> terminated with signal SIGSEGV, Segmentation fault.
> Xterm.log.singularity.2012.04.25.13.55.01.7560:During startup program 
> terminated with signal SIGSEGV, Segmentation fault.
> alex@singularity:~/huji/benchmarks/mpi/npb$ cat 
> Xterm.log.singularity.2012.04.25.13.55.01.7560
> GNU gdb (Ubuntu/Linaro 7.3-0ubuntu2) 7.3-2011.08
> Copyright (C) 2011 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://bugs.launchpad.net/gdb-linaro/>...
> Reading symbols from 
> /home/alex/huji/benchmarks/mpi/NPB3.3.1/NPB3.3-MPI/bin/ft.S.4...(no debugging 
> symbols found)...done.
> (gdb) r
> Starting program: 
> /home/alex/huji/benchmarks/mpi/NPB3.3.1/NPB3.3-MPI/bin/ft.S.4
> warning: Error disabling address space randomization: Function not implemented
> During startup program terminated with signal SIGSEGV, Segmentation fault.
> (gdb) l
> No symbol table is loaded.  Use the "file" command.
> (gdb) bt
> No stack.
> (gdb) alex@singularity:~/huji/benchmarks/mpi/npb$
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

