On May 19, 2011, at 5:00 PM, Joshua Randall wrote:

> Phil,
> 
> Yes, I have done these tests with MM_IMM_ACK="1" and
> OMX_FATAL_ERRORS="0" set for both server and client.
> 
> Since my last message I have tried setting (in mx.h):
>> #define BMX_DB_MASK (BMX_DB_ALL)
> 
> And running pvfs2-ping with PVFS2_DEBUGMASK="all" on both the remote
> client and with the client on the same host as the server.  Comparing
> the two outputs, I see the first lines that differ are:
> 
> client-server same host:
>> [D 20:35:21.445292] bmi_mx: entering bmx_connection_handlers.
>> [D 20:35:21.445308] bmi_mx: exiting  bmx_connection_handlers.
> 
> 
> client-server different hosts:
>> [D 20:42:13.754440] bmi_mx: entering bmx_connection_handlers.
>> [D 20:42:13.754463] bmi_mx: bmx_handle_icon_req returned for mx://
>> renton:0:3 with Success.
>> [D 20:42:13.754480] bmi_mx: bmx_handle_icon_req tx match=
>> 0xc000000100000100 length= 0.
>> [D 20:42:13.754505] bmi_mx: bmx_handle_conn_req returned TX match
>> 0xc000000100000100 with Success.
>> [D 20:42:13.754515] bmi_mx: CONN_REQ sent to mx://renton:0:3.
>> [D 20:42:13.754522] bmi_mx: entering bmx_peer_decref.
>> [D 20:42:13.754530] bmi_mx: exiting  bmx_peer_decref.
>> [D 20:42:13.754537] bmi_mx: entering bmx_ctx_init.
>> [D 20:42:13.754545] bmi_mx: exiting  bmx_ctx_init.
>> [D 20:42:13.754552] bmi_mx: exiting  bmx_connection_handlers.

The process called mx_iconnect() to the peer. The connect completed, returning 
the peer's MX endpoint address. The process then packed a CONN_REQ message and 
sent it to the peer with mx_isend() using the new MX endpoint address.

In the first case (client and server on the same host), the mx_iconnect() to 
itself does not seem to complete. It may be an issue in Open-MX. A simple test 
would be to compile and run the following program against Open-MX:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include "myriexpress.h"

int main(int argc, char *argv[])
{
    uint32_t ep_id, sid, result = 0;
    uint64_t nic_id;
    mx_return_t ret;
    mx_endpoint_t ep;
    mx_endpoint_addr_t epa;
    mx_request_t request;
    mx_status_t status;

    mx_init();

    /* use the first NIC, open endpoint 0, filter = 0, no params */
    ret = mx_open_endpoint(0, 0, 0, NULL, 0, &ep);
    if (ret) {
        printf("mx_open_endpoint() returned %s\n", mx_strerror(ret));
        exit(1);
    }

    ret = mx_get_endpoint_addr(ep, &epa);
    if (ret) {
        printf("mx_get_endpoint_addr() returned %s\n", mx_strerror(ret));
        exit(1);
    }

    ret = mx_decompose_endpoint_addr2(epa, &nic_id, &ep_id, &sid);
    if (ret) {
        printf("mx_decompose_endpoint_addr2() returned %s\n", mx_strerror(ret));
        exit(1);
    }

    /* connect back to our own NIC, endpoint 0, filter 0 */
    ret = mx_iconnect(ep, nic_id, 0, 0, 0, NULL, &request);
    if (ret) {
        printf("mx_iconnect() returned %s\n", mx_strerror(ret));
        exit(1);
    }

    do {
        ret = mx_test(ep, &request, &status, &result);
        if (result)
            printf("iconnect completed with status %s\n",
                   mx_strstatus(status.code));
    } while (!result);

    mx_fini();
    return 0;
}

> 
> 
> then, later on, with the client-server on different hosts, it gets:
>> 
>> [D 20:42:13.780057] bmi_mx: CONN_ACK from mx://renton:0:3 id= 3.
>> [D 20:42:13.780070] bmi_mx: setting mx://renton:0:3's state to READY.
> ...
> 
> With the client-server on the same host, there are no CONN_REQ or
> CONN_ACK messages.
> 
> I've run pvfs2-ping in gdb with a breakpoint for
> bmx_connection_handlers and I see that when it goes into
> bmx_handle_icon_req() on the same host as the server, the call to
> "mx_test_any(bmi_mx->bmx_ep, match, mask, &status, &result);" simply
> returns a 0 in result:
> 
> client-server same host:
>> 2560            bmx_handle_icon_req();
>> (gdb) step
>> bmx_handle_icon_req () at src/io/bmi/bmi_mx/mx.c:2218
>> 2218            uint32_t        result  = 0;
>> (gdb) step
>> 2221                    uint64_t        match   = (uint64_t)
>> BMX_MSG_ICON_REQ << BMX_MSG_SHIFT;
>> (gdb) step
>> 2222                    uint64_t        mask    = BMX_MASK_MSG;
>> (gdb) step
>> 2225                    mx_test_any(bmi_mx->bmx_ep, match, mask,
>> &status, &result);
>> (gdb) step
>> 2226                    if (result) {
>> (gdb) print result
>> $1 = 0
>> (gdb) print status
>> $2 = {code = 540697956, source = {stuff = {4210425200352911656,
>> 6072343580357116704}}, match_info = 4833952, msg_length = 7084832,
>> xfer_length = 0, context = 0x6c1b60}
> 
> client-server different hosts:
>> 2560            bmx_handle_icon_req();
>> (gdb) step
>> bmx_handle_icon_req () at src/io/bmi/bmi_mx/mx.c:2218
>> 2218            uint32_t        result  = 0;
>> (gdb) step
>> 2234                            debug(BMX_DB_CONN, "%s returned for
>> %s with %s", __func__,
>> (gdb) step
>> 2225                    mx_test_any(bmi_mx->bmx_ep, match, mask,
>> &status, &result);
>> (gdb) step
>> 2226                    if (result) {
>> (gdb) print result
>> $2 = 1
>> (gdb) print status
>> $3 = {code = MX_STATUS_SUCCESS, source = {stuff = {8235576,
>> 6207088827828273152}}, match_info = 12682136550675316736, msg_length
>> = 0, xfer_length = 0, context = 0x6bc910}
> 
> 
> I traced this into the open-mx library, using the debug library, and
> stepping through omx__test_any_common(ep, match_info, match_mask,
> status) on the client running on each of the configurations (same or
> different host as the server).
> 
> They do basically the same thing until libopen-mx/omx_test.c:276 where
> there is a test that req match_info matches match_info:
>>      if (likely((req->generic.status.match_info & match_mask) ==
>> match_info)) {
> 
> 
> with client-server on the same host, it is false:
>> (gdb) print match_info
>> $6 = 12682136550675316736
>> (gdb) print (req->generic.status.match_info & match_mask)
>> $7 = 0
>> (gdb) print req->generic.status.match_info
>> $8 = 8395144
>> (gdb) print ((req->generic.status.match_info & match_mask) ==
>> match_info)
>> $9 = 0
> 
> however, with client-server on different hosts, it is true:
>> (gdb) print match_info
>> $5 = 12682136550675316736
>> (gdb) print (req->generic.status.match_info & match_mask)
>> $6 = 12682136550675316736
>> (gdb) print req->generic.status.match_info
>> $7 = 12682136550675316736
>> (gdb) print ((req->generic.status.match_info & match_mask) ==
>> match_info)
>> $8 = 1
> 
> 
> I don't know either bmi_mx or open-mx well enough to have much of an
> idea of what is going on here.
> 
> Josh.

Thanks, Josh.

bmi_mx does not handle self-communication any differently. It tries to connect 
to itself using the normal code path for connecting to others.

My guess from the above is that Open-MX is not setting/saving the match_info 
for self-communications. It may not be setting other fields as well, but 
Brice's team will need to look into that.

Scott
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
