On Tuesday, June 28, 2016 at 12:00:02 -0400, devel-requ...@open-mpi.org wrote:
I've opened https://github.com/open-mpi/ompi/issues/1826 for tracking the issue.

Thanks,

> Send devel mailing list submissions to
> de...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://www.open-mpi.org/mailman/listinfo.cgi/devel
> or, via email, send a message with subject or body 'help' to
> devel-requ...@open-mpi.org
>
> You can reach the person managing the list at
> devel-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of devel digest..."
>
>
> Today's Topics:
>
> 1. Master: Segfault seen while running imb tests
> (Potnuri Bharat Teja)
> 2. Re: Master: Segfault seen while running imb tests
> (Jeff Squyres (jsquyres))
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 28 Jun 2016 13:03:31 +0530
> From: Potnuri Bharat Teja <bha...@chelsio.com>
> To: de...@open-mpi.org
> Subject: [OMPI devel] Master: Segfault seen while running imb tests
> Message-ID: <20160628073330.gc11...@t5fpga-b1.asicdesigners.com>
> Content-Type: text/plain; charset=us-ascii
>
> Hi All,
> I am seeing the following segfault with openmpi-master.
>
>
> [root@maneybhanjang ~]# /usr/mpi/gcc/openmpi-2.0-dev/bin/mpirun
> --allow-run-as-root --hostfile /root/mpd.hosts -np 8 --prefix
> /usr/mpi/gcc/openmpi-2.0-dev/ --map-by node --display-allocation
> --oversubscribe --mca btl openib,sm,self
> /usr/mpi/gcc/openmpi-2.0-dev/tests/IMB/IMB-MPI1
>
> ====================== ALLOCATED NODES ======================
> maneybhanjang: flags=0x01 slots=8 max_slots=0 slots_inuse=0 state=UP
> 10.193.184.162: flags=0x03 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> [maneybhanjang:28532] *** Process received signal ***
> [maneybhanjang:28532] Signal: Segmentation fault (11)
> [maneybhanjang:28532] Signal code: Invalid permissions (2)
> [maneybhanjang:28532] Failing at address: 0x106ca70
> [maneybhanjang:28532] [ 0] /lib64/libpthread.so.0[0x3aea40f710]
> [maneybhanjang:28532] [ 1] [0x106ca70]
> [maneybhanjang:28532] *** End of error message ***
> [tonglu:02068] *** Process received signal ***
> [tonglu:02068] Signal: Segmentation fault (11)
> [tonglu:02068] Signal code: Invalid permissions (2)
> [tonglu:02068] Failing at address: 0x2478500
> [tonglu:02068] [ 0] /lib64/libpthread.so.0[0x3ef5c0f710]
> [tonglu:02068] [ 1] [0x2478500]
> [tonglu:02068] *** End of error message ***
> bash: line 1: 2068 Segmentation fault (core
> dumped) /usr/mpi/gcc/openmpi-2.0-dev/bin/orted
> --hnp-topo-sig 0N:2S:0L3:4L2:8L1:8C:8H:x86_64 -mca ess
> "env" -mca ess_base_jobid "3921674240" -mca
> ess_base_vpid 1 -mca ess_base_num_procs "2" -mca
> orte_hnp_uri
>
> "3921674240.0;usock;tcp://10.193.184.161,102.1.1.161,102.2.2.161:43160"
> --mca btl "openib,sm,self" -mca plm "rsh" -mca
> rmaps_base_mapping_policy "node" -mca orte_display_alloc
> "1" -mca rmaps_base_oversubscribe "1"
> Segmentation fault (core dumped)
> [root@maneybhanjang ~]# dmesg
> mpirun[28532]: segfault at 106ca70 ip 000000000106ca70 sp 00007fffc00a7f28 error 15
>
> Segfault is seen on the other peer too.
> [root@tonglu ~]# dmesg
> orted[2068]: segfault at 2478500 ip 0000000002478500 sp 00007fff521c2e68 error 15
>
> gdb on coredump points me to orted/pmix/pmix_server_gen.c:80
> Following is the Back trace.
> [root@maneybhanjang ~]# gdb /usr/mpi/gcc/openmpi-2.0-dev/bin/mpirun core.28532
> Program terminated with signal 11, Segmentation fault.
> #0 0x000000000106ca70 in ?? ()
> Missing separate debuginfos, use: debuginfo-install
> glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64
> libudev-147-2.57.el6.x86_64
> (gdb) bt
> #0 0x000000000106ca70 in ?? ()
> #1 0x00002b217f7a43aa in _client_conn (sd=-1, args=4, cbdata=0x2b2188022260) at orted/pmix/pmix_server_gen.c:80
> #2 0x00002b217fad5a7c in event_process_active_single_queue (base=0xfcc730, flags=1) at event.c:1370
> #3 event_process_active (base=0xfcc730, flags=1) at event.c:1440
> #4 opal_libevent2022_event_base_loop (base=0xfcc730, flags=1) at event.c:1644
> #5 0x00000000004014d3 in orterun (argc=16, argv=0x7fffc00a81e8) at orterun.c:192
> #6 0x0000000000400f04 in main (argc=16, argv=0x7fffc00a81e8) at main.c:13
> (gdb) frame
> #0 0x000000000106ca70 in ?? ()
> (gdb) up
> #1 0x00002b217f7a43aa in _client_conn (sd=-1, args=4, cbdata=0x2b2188022260) at orted/pmix/pmix_server_gen.c:80
> 80 cd->cbfunc(OPAL_SUCCESS, cd->cbdata);
>
>
> Here is the backtrace of peer machine, pointing to same line:
>
> [root@tonglu ~]# gdb /usr/mpi/gcc/openmpi-2.0-dev/bin/orted core.2068
> Program terminated with signal 11, Segmentation fault.
> #0 0x0000000002478500 in ?? ()
> Missing separate debuginfos, use: debuginfo-install
> glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64
> libudev-147-2.57.el6.x86_64 numactl-2.0.9-2.el6.x86_64
> (gdb) bt
> #0 0x0000000002478500 in ?? ()
> #1 0x00002af4511433ba in _client_conn (sd=-1, args=4, cbdata=0x2af458022260) at orted/pmix/pmix_server_gen.c:80
> #2 0x00002af451474cac in event_process_active_single_queue (base=0x2408e90, flags=1) at event.c:1370
> #3 event_process_active (base=0x2408e90, flags=1) at event.c:1440
> #4 opal_libevent2022_event_base_loop (base=0x2408e90, flags=1) at event.c:1644
> #5 0x00002af451123c57 in orte_daemon (argc=33, argv=0x7fff521c33d8) at orted/orted_main.c:859
> #6 0x000000000040081a in main (argc=33, argv=0x7fff521c33d8) at orted.c:60
> (gdb) frame
> #0 0x0000000002478500 in ?? ()
> (gdb) up
> #1 0x00002af4511433ba in _client_conn (sd=-1, args=4, cbdata=0x2af458022260) at orted/pmix/pmix_server_gen.c:80
> 80 cd->cbfunc(OPAL_SUCCESS, cd->cbdata);
>
> I am using the tot of openmpi-master :
> commit 5795682aa56ce8f22e518462b22cfee49d407216
> Merge: 5d32282 1bb7788
> Author: Joshua Ladd <jladd.m...@gmail.com>
> Date: Mon Jun 27 12:59:20 2016 -0400
> Merge pull request #1817 from shamisp/topic/oshmem_init
> OSHMEM: Removing erroneous initialization check
>
> I am happy to provide any further information and would appreciate any
> suggestions regarding the issue.
>
> Thanks,
> Bharat.
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 28 Jun 2016 13:16:47 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Developers List <de...@open-mpi.org>
> Subject: Re: [OMPI devel] Master: Segfault seen while running imb tests
> Message-ID: <40334c0f-5512-4eca-8a00-bafdecee1...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> This looks like a segv in mpirun itself -- can you file an issue on github so
> that we can track this?
>
> Thanks.
>
>
> > On Jun 28, 2016, at 3:33 AM, Potnuri Bharat Teja <bha...@chelsio.com> wrote:
> >
> > Hi All,
> > I am seeing the following segfault with openmpi-master.
> > [...]
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2016/06/19137.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> https://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ------------------------------
>
> End of devel Digest, Vol 3283, Issue 1
> **************************************