Hi all

Apologies for another 101-level question, but here it is:

A new BTL layer I am implementing hangs in MPI_Send(). Please keep in mind
that at this stage, I am simply desperate to make MPI data move through
this fabric in any way possible, so I have thrown all good programming
practice out of the window and in the process might have added bugs.

The test code basically has a single call to MPI_Send() with 8 bytes of
data, the smallest amount the HCA can DMA. I have a very simple
mca_btl_component_progress() function that returns 0 if called before
mca_btl_endpoint_send() and 1 if called after; I use a static
variable to keep track of whether endpoint_send() has been called.
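
In case it helps, here is roughly what that looks like, stripped down
("mybtl" is just a placeholder for my component's name, and the real
send path of course also posts the 8-byte DMA to the HCA before setting
the flag):

static volatile int send_posted = 0;

/* Set from my send path once the descriptor has been handed to the HCA. */
static void mark_send_posted(void)
{
    send_posted = 1;
}

/* My component progress function: returns 0 if called before a send has
 * been posted and 1 on every call after that. */
static int mca_btl_mybtl_component_progress(void)
{
    return send_posted ? 1 : 0;
}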

With this, the MPI process hangs with the following stack:

(gdb) bt
#0  0x00007f7518c60b7d in poll () from /lib64/libc.so.6
#1  0x00007f75183e79f6 in poll_dispatch (base=0x19cf480, tv=0x7f75177efe80) at poll.c:165
#2  0x00007f75183df690 in opal_libevent2022_event_base_loop (base=0x19cf480, flags=1) at event.c:1630
#3  0x00007f75183613d4 in progress_engine (obj=0x19cedd8) at runtime/opal_progress_threads.c:105
#4  0x00007f7518f3ddf5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7518c6b1ad in clone () from /lib64/libc.so.6

I am using code from the master branch for this work.

Obviously I am not doing the progress handling right, and I don't even
understand how it should work, as the TCP BTL does not provide a
component progress function at all.

Any relevant pointers on how this should be done would be highly appreciated.

Thanks
Durga


The surgeon general advises you to eat right, exercise regularly and quit
ageing.
