Here's the backtrace:

(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x00007fac6b8d8921 in mca_bml_base_get (bml_btl=0x239a130, des=0x220e880) at ../../../../ompi/mca/bml/bml.h:326
#2  0x00007fac6b8db767 in mca_spml_yoda_get (src_addr=0x601500, size=4, dst_addr=0x7fff3b00b370, src=1) at spml_yoda.c:1091
#3  0x00007fac6f1ea56d in shmem_int_g (addr=0x601500, pe=1) at shmem_g.c:47
#4  0x0000000000400bc7 in main ()
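Frame #0 at address 0x0 means the process called through a function pointer that was never set: mca_bml_base_get() at bml.h:326 dispatches into the selected transport, and whatever hook it invoked was NULL for this peer. A self-contained toy model of that dispatch shape, with a guard that fails cleanly instead of jumping to address zero (all type and field names below are illustrative, not the real Open MPI structures):

#include <stdio.h>
#include <stddef.h>

/* Toy model of a BML/BTL-style dispatch table: the transport module
 * exposes an optional "get" hook that may legitimately be NULL.
 * Calling it unchecked reproduces the frame-#0-at-0x0 crash above. */
typedef struct {
    int (*btl_get)(const void *src, void *dst, size_t len);
} toy_btl_module_t;

static int checked_get(toy_btl_module_t *btl,
                       const void *src, void *dst, size_t len)
{
    if (NULL == btl->btl_get) {
        /* fail gracefully rather than call through NULL */
        fprintf(stderr, "transport has no remote-get hook\n");
        return -1;
    }
    return btl->btl_get(src, dst, len);
}

int main(void)
{
    toy_btl_module_t m = { .btl_get = NULL };  /* hook never filled in */
    int value = 0;
    if (checked_get(&m, NULL, &value, sizeof value) != 0) {
        puts("returned an error instead of segfaulting");
    }
    return 0;
}

The same unchecked-dispatch shape is one plausible reading of the CM PML crash Brian reports further down, judging only from the identical "Failing at address: (nil)" signature.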
On Aug 14, 2013, at 3:12 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Hmmm...well, it works fine as long as the procs are on the same node.
> However, if they are on different nodes, it segfaults:
>
> [rhc@bend002 shmem]$ shmemrun -npernode 1 ./test_shmem
> running on bend001
> running on bend002
> [bend001:06590] *** Process received signal ***
> [bend001:06590] Signal: Segmentation fault (11)
> [bend001:06590] Signal code: Address not mapped (1)
> [bend001:06590] Failing at address: (nil)
> [bend001:06590] [ 0] /lib64/libpthread.so.0() [0x307d40f500]
> [bend001:06590] *** End of error message ***
> [bend002][[62090,1],1][btl_tcp_frag.c:219:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> shmemrun noticed that process rank 0 with PID 6590 on node bend001 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> I would have thought it should work in that situation - yes?
>
>
> On Aug 14, 2013, at 2:52 PM, Joshua Ladd <josh...@mellanox.com> wrote:
>
>> The following simple test code will exercise:
>>
>> start_pes()
>>
>> shmalloc()
>>
>> shmem_int_get()
>>
>> shmem_int_put()
>>
>> shmem_barrier_all()
>>
>> To compile:
>>
>> shmemcc test_shmem.c -o test_shmem
>>
>> To launch:
>>
>> shmemrun -np 2 test_shmem
>>
>> or, for those who prefer to launch with SLURM:
>>
>> srun -n 2 test_shmem
>>
>> Josh
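The test_shmem.c attachment itself is not reproduced in the archive; a minimal sketch of a program exercising those five calls, using the classic SHMEM API the thread relies on, might look like the following (the peer choice, values, and buffer sizes are illustrative, not the attachment's contents):

#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);                       /* initialize the SHMEM runtime */
    int me   = _my_pe();
    int npes = _num_pes();

    /* symmetric heap allocation: same address on every PE */
    int *src = (int *)shmalloc(sizeof(int));
    int *dst = (int *)shmalloc(sizeof(int));
    *src = 100 + me;
    *dst = -1;
    shmem_barrier_all();                /* everyone has initialized */

    int peer = (me + 1) % npes;
    shmem_int_get(dst, src, 1, peer);   /* pull peer's value */
    shmem_int_put(src, dst, 1, peer);   /* push it back to the peer */
    shmem_barrier_all();                /* all transfers complete */

    printf("PE %d of %d: got %d from PE %d\n", me, npes, *dst, peer);
    shfree(src);
    shfree(dst);
    return 0;
}

Built and launched exactly as Josh describes: shmemcc test_shmem.c -o test_shmem, then shmemrun -np 2 test_shmem.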
>>
>> -----Original Message-----
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Wednesday, August 14, 2013 5:32 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
>>
>> Can you point me to a test program that would exercise it? I'd like to give it a try first.
>>
>> I'm okay with on by default as it builds its own separate library, and with the RFC.
>>
>> On Aug 14, 2013, at 2:03 PM, "Barrett, Brian W" <bwba...@sandia.gov> wrote:
>>
>>> Josh -
>>>
>>> In general, I don't have a strong opinion on whether OpenSHMEM is on by default or not. It might cause unexpected behavior for some users (like on Crays, where one should really use Cray's SHMEM), but maybe it's better on other platforms.
>>>
>>> I also would have no objection to the RFC, provided the segfaults I found get resolved.
>>>
>>> Brian
>>>
>>> On 8/14/13 2:08 PM, "Joshua Ladd" <josh...@mellanox.com> wrote:
>>>
>>>> Ralph and Brian,
>>>>
>>>> Thanks a bunch for taking the time to review this. It is extremely helpful. Let me comment on the building of OSHMEM and solicit some feedback from you guys (along with the rest of the community). Originally we had planned to enable OSHMEM to build only if the '--with-oshmem' flag was passed at configure time. However, (unbeknownst to me) this behavior was changed, and now OSHMEM is built by default, i.e. yes, Ralph, this is the intended behavior now. I am wondering if this is such a good idea. Do folks have a strong opinion on this one way or the other? From my perspective I can see arguments for both sides of the coin.
>>>>
>>>> Other than cleaning up warnings and resolving the segfault that Brian observed, are we on a good course to getting this upstream? Is it reasonable to file an RFC for three weeks out?
>>>>
>>>> Josh
>>>>
>>>> -----Original Message-----
>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Barrett, Brian W
>>>> Sent: Sunday, August 11, 2013 1:42 PM
>>>> To: Open MPI Developers
>>>> Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
>>>>
>>>> Ralph -
>>>>
>>>> I think those warnings are just because of when they last synced with the trunk; it looks like they haven't updated in the last week, when those (and some usnic fixes) went in.
>>>>
>>>> More concerning are the --enable-picky issues and the disabling of SHMEM in the right places.
>>>>
>>>> Brian
>>>>
>>>> On 8/11/13 11:24 AM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>
>>>>> Turning off enable_picky, I get it to compile with the following warnings:
>>>>>
>>>>> pget_elements_x_f.c:70: warning: no previous prototype for 'ompi_get_elements_x_f'
>>>>> pstatus_set_elements_x_f.c:70: warning: no previous prototype for 'ompi_status_set_elements_x_f'
>>>>> ptype_get_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_extent_x_f'
>>>>> ptype_get_true_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_true_extent_x_f'
>>>>> ptype_size_x_f.c:69: warning: no previous prototype for 'ompi_type_size_x_f'
>>>>>
>>>>> I also found that OpenSHMEM is still building by default. Is that intended? I thought you were only going to build it if --with-shmem (or whatever the option is) was given.
>>>>>
>>>>> Looks like some cleanup is required.
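For reference, the "no previous prototype" warnings above are GCC's -Wmissing-prototypes diagnostic: a non-static, externally visible function is defined with no earlier declaration in scope. A generic sketch of the usual fixes (the file and function names here are made up for illustration, not the actual ompi_*_x_f sources):

/* elements_x.h -- declare the function where both callers and the
 * defining file can see it */
int example_elements_helper(int count);

/* elements_x.c */
#include "elements_x.h"   /* including the declaring header silences
                           * -Wmissing-prototypes for the definition */

int example_elements_helper(int count)
{
    return count * 2;
}

/* Alternatively, a function used only within one file can be made
 * static, which also satisfies the warning:
 *
 *   static int local_helper(int count) { return count * 2; }
 */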
>>>>>
>>>>> On Aug 10, 2013, at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> FWIW, I couldn't get it to build - this is on a simple Xeon-based system under CentOS 6.2:
>>>>>>
>>>>>> cc1: warnings being treated as errors
>>>>>> spml_yoda_getreq.c: In function 'mca_spml_yoda_get_completion':
>>>>>> spml_yoda_getreq.c:98: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>> spml_yoda_getreq.c:98: error: signed and unsigned type in conditional expression
>>>>>> cc1: warnings being treated as errors
>>>>>> spml_yoda_putreq.c: In function 'mca_spml_yoda_put_completion':
>>>>>> spml_yoda_putreq.c:81: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>> spml_yoda_putreq.c:81: error: signed and unsigned type in conditional expression
>>>>>> make[2]: *** [spml_yoda_getreq.lo] Error 1
>>>>>> make[2]: *** Waiting for unfinished jobs....
>>>>>> make[2]: *** [spml_yoda_putreq.lo] Error 1
>>>>>> cc1: warnings being treated as errors
>>>>>> spml_yoda.c: In function 'mca_spml_yoda_put_internal':
>>>>>> spml_yoda.c:725: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>> spml_yoda.c:725: error: signed and unsigned type in conditional expression
>>>>>> spml_yoda.c: In function 'mca_spml_yoda_get':
>>>>>> spml_yoda.c:1107: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>> spml_yoda.c:1107: error: signed and unsigned type in conditional expression
>>>>>> make[2]: *** [spml_yoda.lo] Error 1
>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>
>>>>>> Only configure arguments:
>>>>>>
>>>>>> enable_picky=yes
>>>>>> enable_debug=yes
>>>>>>
>>>>>> gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
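All four errors above share one shape: a uint32_t counter is handed to opal_atomic_add_32(), whose first parameter is volatile int32_t * per the compiler note, and -Werror (from enable_picky) promotes the signedness warning to an error. A self-contained sketch of the mismatch and the usual fix (the stand-in function and struct below are illustrative, not the real spml_yoda request types):

#include <stdint.h>
#include <stdio.h>

/* Stand-in with the same signature shape as opal_atomic_add_32(),
 * minus the actual atomicity, just to show the type mismatch. */
static int32_t fake_atomic_add_32(volatile int32_t *addr, int delta)
{
    *addr += delta;
    return *addr;
}

/* Illustrative request type; the real spml_yoda code keeps a similar
 * outstanding-fragment counter. */
typedef struct {
    int32_t n_active_gets;   /* fix: declare as int32_t, not uint32_t,
                              * so no cast is needed at the call site */
} toy_request_t;

int main(void)
{
    toy_request_t req = { .n_active_gets = 1 };

    /* With a uint32_t field, this call draws
     *   "pointer targets ... differ in signedness"
     * under -Werror; with int32_t it compiles cleanly. */
    fake_atomic_add_32(&req.n_active_gets, -1);

    printf("outstanding gets: %d\n", (int)req.n_active_gets);
    return 0;
}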
>>>>>>
>>>>>> On Aug 10, 2013, at 7:21 PM, "Barrett, Brian W" <bwba...@sandia.gov> wrote:
>>>>>>
>>>>>>> On 8/6/13 10:30 AM, "Joshua Ladd" <josh...@mellanox.com> wrote:
>>>>>>>
>>>>>>>> Dear OMPI Community,
>>>>>>>>
>>>>>>>> Please find on Bitbucket the latest round of OSHMEM changes based on community feedback. Please git and test at your leisure.
>>>>>>>>
>>>>>>>> https://bitbucket.org/jladd_math/mlnx-oshmem.git
>>>>>>>
>>>>>>> Josh -
>>>>>>>
>>>>>>> In general, I think everything looks OK. However, the "right" thing doesn't happen if the CM PML is used (at least, when using the Portals 4 MTL). When configured with:
>>>>>>>
>>>>>>> ./configure --enable-mca-no-build=pml-ob1,pml-bfo,pml-v,btl,bml,mpool
>>>>>>>
>>>>>>> the build segfaults trying to run a SHMEM program:
>>>>>>>
>>>>>>> mpirun -np 2 ./bcast
>>>>>>> [shannon:90397] *** Process received signal ***
>>>>>>> [shannon:90397] Signal: Segmentation fault (11)
>>>>>>> [shannon:90397] Signal code: Address not mapped (1)
>>>>>>> [shannon:90397] Failing at address: (nil)
>>>>>>> [shannon:90398] *** Process received signal ***
>>>>>>> [shannon:90398] Signal: Segmentation fault (11)
>>>>>>> [shannon:90398] Signal code: Address not mapped (1)
>>>>>>> [shannon:90398] Failing at address: (nil)
>>>>>>> [shannon:90397] [ 0] /lib64/libpthread.so.0() [0x38b7a0f4a0]
>>>>>>> [shannon:90397] *** End of error message ***
>>>>>>> [shannon:90398] [ 0] /lib64/libpthread.so.0() [0x38b7a0f4a0]
>>>>>>> [shannon:90398] *** End of error message ***
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 1 with PID 90398 on node shannon exited on signal 11 (Segmentation fault).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> --
>>>>>>> Brian W. Barrett
>>>>>>> Scalable System Software Group
>>>>>>> Sandia National Laboratories
>>
>> <test_shmem.c>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel