Strange that your code didn't generate any symbols - is that a mosix thing? Have you tried just adding opal_output (so it goes to a special diagnostic output channel) statements in your code to see where the segfault is occurring?
It looks like you are getting thru orte_init. You could add -mca grpcomm_base_verbose 5 to see if you are getting in/thru the modex - if so, then you are probably failing in add_procs.

On Apr 25, 2012, at 5:05 AM, Alex Margolin wrote:

> Hi,
>
> I'm getting a segv error off my build of the trunk. I know that my BTL module is responsible ("-mca btl self,tcp" works, "-mca btl self,mosix" fails). Smaller/simpler test applications pass, NPB doesn't. Can anyone suggest how to proceed with debugging this? My attempts include some debug printouts and GDB, which appears below... What can I do next?
>
> I'll appreciate any input,
> Alex
>
> alex@singularity:~/huji/benchmarks/mpi/npb$ mpirun --debug-daemons -d -n 4 xterm -l -e gdb ft.S.4
> [singularity:07557] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/0/0
> [singularity:07557] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/0
> [singularity:07557] top: openmpi-sessions-alex@singularity_0
> [singularity:07557] tmp: /tmp
> [singularity:07557] [[44228,0],0] hostfile: checking hostfile /home/alex/huji/ompi/etc/openmpi-default-hostfile for nodes
> [singularity:07557] [[44228,0],0] hostfile: filtering nodes through hostfile /home/alex/huji/ompi/etc/openmpi-default-hostfile
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received add_local_procs
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_forward_output = 0
> MPIR_proctable_size = 4
> MPIR_proctable:
> (i, host, exe, pid) = (0, singularity, /usr/bin/xterm, 7558)
> (i, host, exe, pid) = (1, singularity, /usr/bin/xterm, 7559)
> (i, host, exe, pid) = (2, singularity, /usr/bin/xterm, 7560)
> (i, host, exe, pid) = (3, singularity, /usr/bin/xterm, 7561)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [singularity:07592] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1/3
> [singularity:07592] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
> [singularity:07592] top: openmpi-sessions-alex@singularity_0
> [singularity:07592] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],3]
> [singularity:07592] [[44228,1],3] decode:nidmap decoding nodemap
> [singularity:07592] [[44228,1],3] decode:nidmap decoding 1 nodes
> [singularity:07592] [[44228,1],3] node[0].name singularity daemon 0
> [singularity:07594] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1/1
> [singularity:07594] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
> [singularity:07594] top: openmpi-sessions-alex@singularity_0
> [singularity:07594] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],1]
> [singularity:07594] [[44228,1],1] decode:nidmap decoding nodemap
> [singularity:07594] [[44228,1],1] decode:nidmap decoding 1 nodes
> [singularity:07594] [[44228,1],1] node[0].name singularity daemon 0
> [singularity:07596] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1/0
> [singularity:07596] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
> [singularity:07596] top: openmpi-sessions-alex@singularity_0
> [singularity:07596] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],0]
> [singularity:07596] [[44228,1],0] decode:nidmap decoding nodemap
> [singularity:07596] [[44228,1],0] decode:nidmap decoding 1 nodes
> [singularity:07596] [[44228,1],0] node[0].name singularity daemon 0
> [singularity:07598] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1/2
> [singularity:07598] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
> [singularity:07598] top: openmpi-sessions-alex@singularity_0
> [singularity:07598] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],2]
> [singularity:07598] [[44228,1],2] decode:nidmap decoding nodemap
> [singularity:07598] [[44228,1],2] decode:nidmap decoding 1 nodes
> [singularity:07598] [[44228,1],2] node[0].name singularity daemon 0
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
> [singularity:07557] [[44228,0],0] orted:comm:message_local_procs delivering message to job [44228,1] tag 30
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
> [singularity:07557] [[44228,0],0] orted:comm:message_local_procs delivering message to job [44228,1] tag 30
> [singularity:07557] [[44228,0],0]:errmgr_default_hnp.c(418) updating exit status to 1
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD
> [singularity:07557] [[44228,0],0] orted_cmd: received exit cmd
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] [[44228,0],0] orted_cmd: all routes and children gone - exiting
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 2 with PID 7560 on node singularity exiting improperly. There are three reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"
>
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter orte_create_session_dirs is set to false. In this case, the run-time cannot detect that the abort call was an abnormal termination. Hence, the only error message you will receive is this one.
>
> This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
>
> You can avoid this message by specifying -quiet on the mpirun command line.
> --------------------------------------------------------------------------
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> exiting with status 1
> alex@singularity:~/huji/benchmarks/mpi/npb$ grep SIGSEGV *
> Xterm.log.singularity.2012.04.24.20.38.03.6992:During startup program terminated with signal SIGSEGV, Segmentation fault.
> Xterm.log.singularity.2012.04.25.13.55.01.7560:During startup program terminated with signal SIGSEGV, Segmentation fault.
> alex@singularity:~/huji/benchmarks/mpi/npb$ cat Xterm.log.singularity.2012.04.25.13.55.01.7560
> GNU gdb (Ubuntu/Linaro 7.3-0ubuntu2) 7.3-2011.08
> Copyright (C) 2011 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://bugs.launchpad.net/gdb-linaro/>...
> Reading symbols from /home/alex/huji/benchmarks/mpi/NPB3.3.1/NPB3.3-MPI/bin/ft.S.4...(no debugging symbols found)...done.
> (gdb) r
> Starting program: /home/alex/huji/benchmarks/mpi/NPB3.3.1/NPB3.3-MPI/bin/ft.S.4
> warning: Error disabling address space randomization: Function not implemented
> During startup program terminated with signal SIGSEGV, Segmentation fault.
> (gdb) l
> No symbol table is loaded. Use the "file" command.
> (gdb) bt
> No stack.
> (gdb) alex@singularity:~/huji/benchmarks/mpi/npb$
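For what it's worth, the verbosity suggestion above would look something like this on the command line (same working directory and binary as in the transcript; the exact invocation is just a sketch):

```shell
# Re-run with grpcomm verbosity to see whether the modex is entered/completed
# before the failing BTL is exercised.
mpirun -n 4 -mca btl self,mosix -mca grpcomm_base_verbose 5 ./ft.S.4
```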