Maybe this is a stupid question, but in this case (I believe this goes all the 
way back to our initial discussion on OSHMEM), how does one fall back onto 
send/recv semantics when the call is made at the SHMEM level to do a put? If a 
BTL doesn't support RDMA, then it doesn't seem reasonable to expect OSHMEM to 
support it through YODA. It seems more reasonable to check whether bml_get is 
NULL and, if it is, to disqualify YODA and hence SHMEM. How can you support 
put/get SHMEM semantics without an RDMA-equipped BTL? Does it even make sense 
to try to emulate that behavior? I know 
the SHMEM developers have been going round in circles on this, so any insight 
you could provide would be greatly appreciated.

Josh

-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, August 15, 2013 11:55 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2

I see the problem. Yoda calls bml_get directly without first checking whether 
the bml_btl supports RDMA operations. If you only have the tcp btl, then RDMA 
isn't supported, the bml_get function is NULL, and you segfault.

What you need to do is check for RDMA and fall back to message-based transfers 
if it isn't available. I believe that's what our current PMLs do: you can't 
assume that RDMA (or any other capability) is present.
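
To make the check concrete, here is a minimal, self-contained sketch of the 
pattern. It does not use the real BML/BTL types or calls: endpoint_t, 
checked_get, fake_rdma_get and msg_based_get are hypothetical names invented 
for illustration, and the "transfers" are plain memcpy calls standing in for 
real network traffic. The only point is that the function pointer is tested 
before it is called, and a message-based path is taken when it is NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a transport endpoint. This is NOT the real
 * mca_bml_base_btl_t; it only models "an optional RDMA get hook". */
typedef struct endpoint endpoint_t;
typedef int (*rdma_get_fn)(endpoint_t *ep, void *dst, const void *src,
                           size_t len);

struct endpoint {
    rdma_get_fn rdma_get; /* NULL when the transport has no RDMA get */
    int used_fallback;    /* set to 1 when the message-based path ran */
};

/* Pretend RDMA get: a memcpy stands in for the one-sided transfer. */
int fake_rdma_get(endpoint_t *ep, void *dst, const void *src, size_t len)
{
    (void)ep;
    memcpy(dst, src, len);
    return 0;
}

/* Pretend message-based get: a real transport would send a request and
 * wait for the reply; here a memcpy stands in for that exchange. */
int msg_based_get(endpoint_t *ep, void *dst, const void *src, size_t len)
{
    ep->used_fallback = 1;
    memcpy(dst, src, len);
    return 0;
}

/* Check for RDMA support first; never call through a NULL pointer. */
int checked_get(endpoint_t *ep, void *dst, const void *src, size_t len)
{
    assert(ep != NULL);
    if (ep->rdma_get != NULL) {
        return ep->rdma_get(ep, dst, src, len);
    }
    return msg_based_get(ep, dst, src, len);
}
```

With an endpoint whose rdma_get is NULL (the tcp-only case in this sketch), 
checked_get takes the message-based path instead of jumping to address 0.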


On Aug 14, 2013, at 4:02 PM, Joshua Ladd <josh...@mellanox.com> wrote:

> Thanks, Ralph. We'll have a look. Admittedly, we've done little testing with 
> the tcp BTL. I was under the impression that the yoda interface was capable 
> of working with all BTLs; it seems we need more testing. It certainly works 
> with the SM and OpenIB BTLs. 
> 
> Josh
> 
> -----Original Message-----
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, August 14, 2013 6:13 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
> 
> Here's the backtrace:
> 
> (gdb) where
> #0  0x0000000000000000 in ?? ()
> #1  0x00007fac6b8d8921 in mca_bml_base_get (bml_btl=0x239a130, des=0x220e880) at ../../../../ompi/mca/bml/bml.h:326
> #2  0x00007fac6b8db767 in mca_spml_yoda_get (src_addr=0x601500, size=4, dst_addr=0x7fff3b00b370, src=1) at spml_yoda.c:1091
> #3  0x00007fac6f1ea56d in shmem_int_g (addr=0x601500, pe=1) at shmem_g.c:47
> #4  0x0000000000400bc7 in main ()
> 
> On Aug 14, 2013, at 3:12 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> Hmmm...well, it works fine as long as the procs are on the same node. 
>> However, if they are on different nodes, it segfaults:
>> 
>> [rhc@bend002 shmem]$ shmemrun -npernode 1 ./test_shmem
>> running on bend001
>> running on bend002
>> [bend001:06590] *** Process received signal ***
>> [bend001:06590] Signal: Segmentation fault (11)
>> [bend001:06590] Signal code: Address not mapped (1)
>> [bend001:06590] Failing at address: (nil)
>> [bend001:06590] [ 0] /lib64/libpthread.so.0() [0x307d40f500]
>> [bend001:06590] *** End of error message ***
>> [bend002][[62090,1],1][btl_tcp_frag.c:219:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> --------------------------------------------------------------------------
>> shmemrun noticed that process rank 0 with PID 6590 on node bend001 exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> 
>> I would have thought it should work in that situation - yes?
>> 
>> 
>> On Aug 14, 2013, at 2:52 PM, Joshua Ladd <josh...@mellanox.com> wrote:
>> 
>>> The attached simple test code exercises the following calls:
>>> 
>>> start_pes()
>>> 
>>> shmalloc()
>>> 
>>> shmem_int_get()
>>> 
>>> shmem_int_put()
>>> 
>>> shmem_barrier_all()
>>> 
>>> To compile:
>>> 
>>> shmemcc test_shmem.c -o test_shmem
>>> 
>>> To launch:
>>> 
>>> shmemrun -np 2  test_shmem
>>> 
>>> or for those who prefer to launch with SLURM
>>> 
>>> srun -n 2 test_shmem
>>> 
>>> Josh
>>> 
>>> 
>>> -----Original Message-----
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Wednesday, August 14, 2013 5:32 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
>>> 
>>> Can you point me to a test program that would exercise it? I'd like to give 
>>> it a try first.
>>> 
>>> I'm okay with it being on by default, since it builds its own separate 
>>> library, and I'm okay with the RFC.
>>> 
>>> On Aug 14, 2013, at 2:03 PM, "Barrett, Brian W" <bwba...@sandia.gov> wrote:
>>> 
>>>> Josh -
>>>> 
>>>> In general, I don't have a strong opinion on whether OpenSHMEM is 
>>>> on by default or not.  It might cause unexpected behavior for some 
>>>> users (like on Crays, where one should really use Cray's SHMEM), 
>>>> but maybe it's better on other platforms.
>>>> 
>>>> I also would have no objection to the RFC, provided the segfaults I 
>>>> found get resolved.
>>>> 
>>>> Brian
>>>> 
>>>> On 8/14/13 2:08 PM, "Joshua Ladd" <josh...@mellanox.com> wrote:
>>>> 
>>>>> Ralph and Brian,
>>>>> 
>>>>> Thanks a bunch for taking the time to review this. It is extremely 
>>>>> helpful. Let me comment on the building of OSHMEM and solicit some 
>>>>> feedback from you guys (along with the rest of the community). 
>>>>> Originally we had planned to build OSHMEM only if the 
>>>>> '--with-oshmem' flag was passed at configure time. However, 
>>>>> (unbeknownst to me) this behavior was changed and OSHMEM is now built 
>>>>> by default; i.e., yes, Ralph, this is the intended behavior now. I am 
>>>>> wondering if this is such a good idea. Do folks have a strong opinion 
>>>>> on this one way or the other? From my perspective, I can see arguments 
>>>>> for both sides.
>>>>> 
>>>>> Other than cleaning up warnings and resolving the segfault that 
>>>>> Brian observed, are we on a good course to getting this upstream? 
>>>>> Is it reasonable to file an RFC for three weeks out?
>>>>> 
>>>>> Josh
>>>>> 
>>>>> -----Original Message-----
>>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Barrett, Brian W
>>>>> Sent: Sunday, August 11, 2013 1:42 PM
>>>>> To: Open MPI Developers
>>>>> Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
>>>>> 
>>>>> Ralph -
>>>>> 
>>>>> I think those warnings are just because of when they last synced 
>>>>> with the trunk; it looks like they haven't updated in the last 
>>>>> week, when those (and some usnic fixes) went in.
>>>>> 
>>>>> More concerning is the --enable-picky stuff and the disabling of 
>>>>> SHMEM in the right places.
>>>>> 
>>>>> Brian
>>>>> 
>>>>> On 8/11/13 11:24 AM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>> 
>>>>>> Turning off the enable_picky, I get it to compile with the following warnings:
>>>>>> 
>>>>>> pget_elements_x_f.c:70: warning: no previous prototype for 'ompi_get_elements_x_f'
>>>>>> pstatus_set_elements_x_f.c:70: warning: no previous prototype for 'ompi_status_set_elements_x_f'
>>>>>> ptype_get_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_extent_x_f'
>>>>>> ptype_get_true_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_true_extent_x_f'
>>>>>> ptype_size_x_f.c:69: warning: no previous prototype for 'ompi_type_size_x_f'
>>>>>> 
>>>>>> I also found that OpenShmem is still building by default. Is that 
>>>>>> intended? I thought you were only going to build if --with-shmem 
>>>>>> (or whatever option) was given.
>>>>>> 
>>>>>> Looks like some cleanup is required
>>>>>> 
>>>>>> On Aug 10, 2013, at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> 
>>>>>>> FWIW, I couldn't get it to build - this is on a simple 
>>>>>>> Xeon-based system under CentOS 6.2:
>>>>>>> 
>>>>>>> cc1: warnings being treated as errors
>>>>>>> spml_yoda_getreq.c: In function 'mca_spml_yoda_get_completion':
>>>>>>> spml_yoda_getreq.c:98: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>>> spml_yoda_getreq.c:98: error: signed and unsigned type in conditional expression
>>>>>>> cc1: warnings being treated as errors
>>>>>>> spml_yoda_putreq.c: In function 'mca_spml_yoda_put_completion':
>>>>>>> spml_yoda_putreq.c:81: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>>> spml_yoda_putreq.c:81: error: signed and unsigned type in conditional expression
>>>>>>> make[2]: *** [spml_yoda_getreq.lo] Error 1
>>>>>>> make[2]: *** Waiting for unfinished jobs....
>>>>>>> make[2]: *** [spml_yoda_putreq.lo] Error 1
>>>>>>> cc1: warnings being treated as errors
>>>>>>> spml_yoda.c: In function 'mca_spml_yoda_put_internal':
>>>>>>> spml_yoda.c:725: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>>> spml_yoda.c:725: error: signed and unsigned type in conditional expression
>>>>>>> spml_yoda.c: In function 'mca_spml_yoda_get':
>>>>>>> spml_yoda.c:1107: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>>> spml_yoda.c:1107: error: signed and unsigned type in conditional expression
>>>>>>> make[2]: *** [spml_yoda.lo] Error 1
>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>> 
>>>>>>> Only configure arguments:
>>>>>>> 
>>>>>>> enable_picky=yes
>>>>>>> enable_debug=yes
>>>>>>> 
>>>>>>> 
>>>>>>> gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Aug 10, 2013, at 7:21 PM, "Barrett, Brian W" <bwba...@sandia.gov> wrote:
>>>>>>> 
>>>>>>>> On 8/6/13 10:30 AM, "Joshua Ladd" <josh...@mellanox.com> wrote:
>>>>>>>> 
>>>>>>>>> Dear OMPI Community,
>>>>>>>>> 
>>>>>>>>> Please find on Bitbucket the latest round of OSHMEM changes 
>>>>>>>>> based on community feedback. Please git and test at your leisure.
>>>>>>>>> 
>>>>>>>>> https://bitbucket.org/jladd_math/mlnx-oshmem.git
>>>>>>>> 
>>>>>>>> Josh -
>>>>>>>> 
>>>>>>>> In general, I think everything looks ok.  However, the "right" 
>>>>>>>> thing doesn't happen if the CM PML is used (at least, when 
>>>>>>>> using the Portals 4 MTL).  When configured with:
>>>>>>>> 
>>>>>>>> ./configure --enable-mca-no-build=pml-ob1,pml-bfo,pml-v,btl,bml,mpool
>>>>>>>> 
>>>>>>>> The build segfaults trying to run a SHMEM program:
>>>>>>>> 
>>>>>>>> mpirun -np 2 ./bcast
>>>>>>>> [shannon:90397] *** Process received signal ***
>>>>>>>> [shannon:90397] Signal: Segmentation fault (11)
>>>>>>>> [shannon:90397] Signal code: Address not mapped (1)
>>>>>>>> [shannon:90397] Failing at address: (nil)
>>>>>>>> [shannon:90398] *** Process received signal ***
>>>>>>>> [shannon:90398] Signal: Segmentation fault (11)
>>>>>>>> [shannon:90398] Signal code: Address not mapped (1)
>>>>>>>> [shannon:90398] Failing at address: (nil)
>>>>>>>> [shannon:90397] [ 0] /lib64/libpthread.so.0() [0x38b7a0f4a0]
>>>>>>>> [shannon:90397] *** End of error message ***
>>>>>>>> [shannon:90398] [ 0] /lib64/libpthread.so.0() [0x38b7a0f4a0]
>>>>>>>> [shannon:90398] *** End of error message ***
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that process rank 1 with PID 90398 on node shannon exited on signal 11 (Segmentation fault).
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Brian
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Brian W. Barrett
>>>>>>>> Scalable System Software Group
>>>>>>>> Sandia National Laboratories
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> de...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Brian W. Barrett
>>>>> Scalable System Software Group
>>>>> Sandia National Laboratories
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Brian W. Barrett
>>>> Scalable System Software Group
>>>> Sandia National Laboratories
>>>> 
>>>> 
>>>> 
>>> 
>>> <test_shmem.c>
>> 
> 
