Hi Josh,

I'm probably retreading old ground here, but just to make sure: are you testing this with the MX_IMM_ACK=1 environment variable set for all of the PVFS processes (pvfs2-server, pvfs2-client, pvfs2-ping, etc.)?
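For example, something along these lines on every node before starting each process (the config file path below is just a placeholder, not your actual setup):

export MX_IMM_ACK=1
pvfs2-server /path/to/pvfs2-fs.conf

# and likewise for the client-side tools
MX_IMM_ACK=1 /usr/local/bin/pvfs2-ping -m /ggeu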
Thanks for tracking down that bug!

thanks,
-Phil

On 05/19/2011 03:06 PM, Joshua Randall wrote:
I set aside my attempt to get PVFS2 working over Open-MX about 9 months ago, and just came back to it today hoping that the issue would have been addressed in the intervening time. Unfortunately, after doing a fresh install of PVFS from orangefs.org (2.8.4-orangefs) and Open-MX 1.4.0, I still have exactly the same issues as described in this thread.

To recap: I could not get PVFS2 to work with Open-MX at all, although Open-MX itself appears to work fine (tested using omx_perf). PVFS2's own BMI test procedure also works (the "pingpong" test in orangefs/test/io/bmi/), in both expected and unexpected mode. The failure mode was: with pvfs2-server already running, I attempt to connect to it using pvfs2-ping, which causes pvfs2-server to segfault immediately.

I have since tracked down the bug and offered a patch to Open-MX (https://gforge.inria.fr/tracker/index.php?func=detail&aid=12705&group_id=889&atid=3614). The Open-MX developers have confirmed it and say they will do a 1.4.1 release to address it. Anyone who is struggling to use PVFS2 with Open-MX should upgrade to 1.4.1 once it comes out.

After patching this bug, pvfs2-server no longer segfaults, and PVFS2 appears to be able to connect to remote servers using Open-MX. I have 3 machines on the 10GbE network that I am using for Open-MX, and if I configure two of them as servers and use the third as a client, everything appears to work just fine. I can cp files in, and the output of pvfs2-ping shows everything working OK:

jrandall@begbie:~$ /usr/local/bin/pvfs2-ping -m /ggeu

(1) Parsing tab file...

(2) Initializing system interface...

(3) Initializing each file system found in tab file: /etc/pvfs2tab...

   PVFS2 servers: mx://renton:0:3
   Storage name: pvfs2-fs
   Local mount point: /ggeu
   /ggeu: Ok

(4) Searching for /ggeu in pvfstab...

   PVFS2 servers: mx://renton:0:3
   Storage name: pvfs2-fs
   Local mount point: /ggeu

   meta servers:
   mx://renton:0:3
   mx://tommy:0:3

   data servers:
   mx://renton:0:3
   mx://tommy:0:3

(5) Verifying that all servers are responding...

   meta servers:
   mx://renton:0:3 Ok
   mx://tommy:0:3 Ok

   data servers:
   mx://renton:0:3 Ok
   mx://tommy:0:3 Ok

(6) Verifying that fsid 561142527 is acceptable to all servers...

   Ok; all servers understand fs_id 561142527

(7) Verifying that root handle is owned by one server...

   Root handle: 1048576
   Ok; root handle is owned by exactly one server.

=============================================================

The PVFS2 filesystem at /ggeu appears to be correctly configured.

However, for some reason, if I run the same pvfs2-ping from one of the two machines that are running pvfs2-server (or indeed if I configure all three machines as servers), it fails to communicate with the pvfs2-server running on the same host as pvfs2-ping:

jrandall@renton:~$ /usr/local/bin/pvfs2-ping -m /ggeu

(1) Parsing tab file...

(2) Initializing system interface...

(3) Initializing each file system found in tab file: /etc/fstab...
   PVFS2 servers: mx://renton:0:3
   Storage name: pvfs2-fs
   Local mount point: /ggeu

[E 19:51:04.782160] Warning: msgpair failed to mx://renton:0:3, will retry: Network dropped connection on reset
[E 19:51:27.832042] Warning: msgpair failed to mx://renton:0:3, will retry: Network dropped connection on reset
[E 19:51:50.752020] Warning: msgpair failed to mx://renton:0:3, will retry: Network dropped connection on reset
[E 19:52:13.782025] Warning: msgpair failed to mx://renton:0:3, will retry: Network dropped connection on reset
[E 19:52:36.752016] Warning: msgpair failed to mx://renton:0:3, will retry: Network dropped connection on reset
[E 19:52:59.762033] Warning: msgpair failed to mx://renton:0:3, will retry: Network dropped connection on reset
[E 19:52:59.762119] *** msgpairarray_completion_fn: msgpair to server [UNKNOWN] failed: Network dropped connection on reset
[E 19:52:59.762128] *** Out of retries.
   /ggeu: FAILURE!
Failure: could not initialze at least one of the target file systems.

(4) Searching for /ggeu in pvfstab...

[E 19:52:59.762217] Error: /ggeu/ resides on a PVFS2 file system that has not yet been initialized.
Failure: could not find filesystem for /ggeu in pvfs2tab
/etc/fstab Entry 0: /ggeu

It appears that self communication over Open-MX is not working with PVFS2, although I have tested self communication directly under Open-MX by running "omx_loopback_test -S", and it works fine:

jrandall@renton:~$ /opt/open-mx/bin/tests/omx_loopback_test -S
Using board #0 name 'eth3' hostname 'renton:0'
tiny (13 bytes) latency 74 us
small (95 bytes) latency 209 us
medium (13274 bytes) latency 27070 us
large (9327485 bytes) latency 19873032 us

When I run the pvfs2-ping program with PVFS2_DEBUGMASK="all", I get:

...
(3) Initializing each file system found in tab file: /etc/fstab...

   PVFS2 servers: mx://renton:0:3
   Storage name: pvfs2-fs
   Local mount point: /ggeu

[D 20:01:42.142232] [SM Frame PUSH]: (0x26a5bf0) frame: 0x26a5cf0
[D 20:01:42.142312] [SM Locating]: (0x26a5bf0) op-id: 19
[D 20:01:42.142321] client_op_state_get_machine 19
[D 20:01:42.142333] [SM push_state]: (0x26a5bf0) op-id: 19 stk-ptr: 0 base-frm: 0
[D 20:01:42.142341] [SM Locating]: (0x26a5bf0) located: server_get_config_nested_sm:setup_msgpair
[D 20:01:42.142349] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:01:42.142358] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:01:42.142367] PINT_client_state_machine_post smcb 0x26a5bf0, op: PVFS_SYS_FS_ADD
[D 20:01:42.142376] [SM Entering]: (0x26a5bf0) server_get_config_nested_sm:setup_msgpair (status: 0)
[D 20:01:42.142402] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:01:42.142411] get_config state: server_get_config_setup_msgpair
[D 20:01:42.142420] [SM Frame PUSH]: (0x26a5bf0) frame: 0x26a5e30
[D 20:01:42.142427] [SM Exiting]: (0x26a5bf0) server_get_config_nested_sm:setup_msgpair (error code: 0), (action: COMPLETE)
[D 20:01:42.142437] [SM push_state]: (0x26a5bf0) op-id: 19 stk-ptr: 1 base-frm: 0
[D 20:01:42.142444] [SM Entering]: (0x26a5bf0) msgpairarray_sm:init (status: 0)
[D 20:01:42.142452] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 1
[D 20:01:42.142459] Executing msgpairarray_init...
[D 20:01:42.142466] base frame:1 frame count:2
[D 20:01:42.142473] (0x26a5bf0) msgpairarray state: init (1 msgpair(s))
[D 20:01:42.142480] [SM Exiting]: (0x26a5bf0) msgpairarray_sm:init (error code: 0), (action: COMPLETE)
[D 20:01:42.142488] [SM Entering]: (0x26a5bf0) msgpairarray_sm:post (status: 0)
[D 20:01:42.142496] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 1
[D 20:01:42.142503] msgpairarray_post: sm 0x26a5bf0 2 total message(s) with 2 incomplete
[D 20:01:42.142513] PINT_encode
[D 20:01:42.142521] Executing lebf_encode_req...
[D 20:01:42.142528] req->op:13
[D 20:01:42.142535] encode_common
[D 20:01:42.142542] maxsize:508 initializing_sizes:0
[D 20:01:42.142552] lebf_encode_req
[D 20:01:42.142560] PINT_encode_calc_max_size
[D 20:01:42.142639] msgpairarray_post: sm 0x26a5bf0 msgpair 0: posting recv
[D 20:01:42.142651] BMI_post_recv: addr: 3, offset: 0x7f21becc1010, size: 65568, tag: 1
[D 20:01:42.142680] msgpairarray_post: sm 0x26a5bf0 msgpair 0: posting send
[D 20:01:42.142691] BMI_post_sendunexpected_list: addr: 3, count: 1, total_size: 28, tag: 1
[D 20:01:42.142698] element 0: offset: 0x7f21be8c0010, size: 28
[D 20:01:42.142709] [SM Exiting]: (0x26a5bf0) msgpairarray_sm:post (error code: 0), (action: DEFERRED)
[D 20:01:42.142718] Posted PVFS_SYS_FS_ADD (6) (waiting for test)(-1073741839)
[D 20:01:42.142740] PINT_client_state_machine_test id 6
[D 20:01:42.152764] PINT_client_state_machine_test id 6
[D 20:01:42.163226] PINT_client_state_machine_test id 6
[D 20:01:42.183226] PINT_client_state_machine_test id 6
[D 20:01:42.203226] PINT_client_state_machine_test id 6
[D 20:01:42.223222] PINT_client_state_machine_test id 6
[D 20:01:42.243219] PINT_client_state_machine_test id 6
[D 20:01:42.263226] PINT_client_state_machine_test id 6
[D 20:01:42.283283] PINT_client_state_machine_test id 6
[D 20:01:42.303224] PINT_client_state_machine_test id 6
[D 20:01:42.323255] PINT_client_state_machine_test id 6
[D 20:01:42.343225] PINT_client_state_machine_test id 6
[D 20:01:42.363230] PINT_client_state_machine_test id 6
[D 20:01:42.383231] PINT_client_state_machine_test id 6
[D 20:01:42.403223] PINT_client_state_machine_test id 6
[D 20:01:42.423231] PINT_client_state_machine_test id 6
[D 20:01:42.443239] PINT_client_state_machine_test id 6
[D 20:01:42.463228] PINT_client_state_machine_test id 6
[D 20:01:42.483252] PINT_client_state_machine_test id 6
[D 20:01:42.503232] PINT_client_state_machine_test id 6
[D 20:01:42.523215] PINT_client_state_machine_test id 6
[D 20:01:42.543219] PINT_client_state_machine_test id 6
[D 20:01:42.563270] PINT_client_state_machine_test id 6
[D 20:01:42.583225] PINT_client_state_machine_test id 6
[D 20:01:42.603232] PINT_client_state_machine_test id 6
[D 20:01:42.623227] PINT_client_state_machine_test id 6
[D 20:01:42.643225] PINT_client_state_machine_test id 6
[D 20:01:42.663229] PINT_client_state_machine_test id 6
[D 20:01:42.683218] PINT_client_state_machine_test id 6
[D 20:01:42.703220] PINT_client_state_machine_test id 6
[D 20:01:42.723250] PINT_client_state_machine_test id 6
[D 20:01:42.743231] PINT_client_state_machine_test id 6
[D 20:01:42.763221] PINT_client_state_machine_test id 6
[D 20:01:42.783219] PINT_client_state_machine_test id 6
[D 20:01:42.803250] PINT_client_state_machine_test id 6
[D 20:01:42.823220] PINT_client_state_machine_test id 6
[D 20:01:42.843222] PINT_client_state_machine_test id 6
[D 20:01:42.863222] PINT_client_state_machine_test id 6
[D 20:01:42.883226] PINT_client_state_machine_test id 6
[D 20:01:42.903224] PINT_client_state_machine_test id 6
[D 20:01:42.923224] PINT_client_state_machine_test id 6
[D 20:01:42.943266] PINT_client_state_machine_test id 6
[D 20:01:42.963235] PINT_client_state_machine_test id 6
[D 20:01:42.983254] PINT_client_state_machine_test id 6
[D 20:01:43.003237] PINT_client_state_machine_test id 6
[D 20:01:43.023236] PINT_client_state_machine_test id 6
[D 20:01:43.043235] PINT_client_state_machine_test id 6
[D 20:01:43.063228] PINT_client_state_machine_test id 6
[D 20:01:43.083224] PINT_client_state_machine_test id 6
[D 20:01:43.103224] PINT_client_state_machine_test id 6
[D 20:01:43.123244] PINT_client_state_machine_test id 6
[D 20:01:43.123377] sm control op 300
[D 20:01:43.123392] [SM Entering]: (0x2695cd0) client_job_timer_sm:do_work (status: 0)
[D 20:01:43.123405] [SM frame get]: (0x2695cd0) op-id: 300 index: 0 base-frm: 0
[D 20:01:43.123426] [SM Exiting]: (0x2695cd0) client_job_timer_sm:do_work (error code: 0), (action: DEFERRED)
[D 20:01:43.123436] PINT_client_state_machine_test id 6
[D 20:01:43.143231] PINT_client_state_machine_test id 6
[D 20:01:43.163240] PINT_client_state_machine_test id 6
...

This is repeated many times until finally:

[D 20:03:57.650959] PINT_client_state_machine_test id 6
[D 20:03:57.670740] PINT_client_state_machine_test id 6
[D 20:03:57.690766] PINT_client_state_machine_test id 6
[D 20:03:57.710740] sm control op 19
[D 20:03:57.710998] [SM Entering]: (0x26a5bf0) msgpairarray_sm:complete (status: 1)
[D 20:03:57.738986] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 1
[D 20:03:57.739066] msgpairarray_complete: sm 0x26a5bf0 status_user_tag 1 msgarray_count 1
[D 20:03:57.739077] msgpairarray: 1 operations remain
[D 20:03:57.739084] [SM Exiting]: (0x26a5bf0) msgpairarray_sm:complete (error code: -1073742003), (action: DEFERRED)
[D 20:03:57.739093] sm control op 19
[D 20:03:57.739102] [SM Entering]: (0x26a5bf0) msgpairarray_sm:complete (status: 0)
[D 20:03:57.739109] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 1
[D 20:03:57.739117] msgpairarray_complete: sm 0x26a5bf0 status_user_tag 0 msgarray_count 1
[D 20:03:57.739125] msgpairarray: all operations complete
[D 20:03:57.739132] [SM Exiting]: (0x26a5bf0) msgpairarray_sm:complete (error code: 190), (action: COMPLETE)
[D 20:03:57.739140] [SM Entering]: (0x26a5bf0) msgpairarray_sm:completion_fn (status: 0)
[D 20:03:57.739150] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 1
[D 20:03:57.739163] (0x26a5bf0) msgpairarray state: completion_fn
[D 20:03:57.739171] Executing msgpairarray_completion_fn..
[E 20:03:57.739194] Warning: msgpair failed to mx://renton:0:3, will retry: Network dropped connection on reset
[E 20:03:57.739204] *** msgpairarray_completion_fn: msgpair to server [UNKNOWN] failed: Network dropped connection on reset
[E 20:03:57.739213] *** Out of retries.
[D 20:03:57.739220] [SM Exiting]: (0x26a5bf0) msgpairarray_sm:completion_fn (error code: -1073742003), (action: COMPLETE)
[D 20:03:57.739227] [SM Entering]: (0x26a5bf0) msgpairarray_sm:done (status: 0)
[D 20:03:57.739236] [SM Frame POP]: (0x26a5bf0) frame: 0x26a5e30
[D 20:03:57.739243] [SM Exiting]: (0x26a5bf0) msgpairarray_sm:done (error code: -1073742003), (action: COMPLETE)
[D 20:03:57.739250] [SM pop_state]: (0x26a5bf0) op-id: 19 stk-ptr: 2 base-frm: 1
[D 20:03:57.739259] [SM Entering]: (0x26a5bf0) server_get_config_nested_sm:cleanup (status: 0)
[D 20:03:57.739267] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:03:57.739275] [SM Exiting]: (0x26a5bf0) server_get_config_nested_sm:cleanup (error code: -1073742003), (action: COMPLETE)
[D 20:03:57.739283] [SM pop_state]: (0x26a5bf0) op-id: 19 stk-ptr: 1 base-frm: 0
[D 20:03:57.739291] [SM Entering]: (0x26a5bf0) fs_add_sm:parent_cleanup (status: 0)
[D 20:03:57.739344] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:03:57.739356] [SM Exiting]: (0x26a5bf0) fs_add_sm:parent_cleanup (error code: -1073742003), (action: COMPLETE)
[D 20:03:57.739364] [SM Entering]: (0x26a5bf0) fs_add_sm:final_cleanup (status: 0)
[D 20:03:57.739371] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:03:57.739378] [SM Exiting]: (0x26a5bf0) fs_add_sm:final_cleanup (error code: -1073742003), (action: TERMINATE)
[D 20:03:57.739386] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:03:57.739393] client_state_machine_terminate smcb 0x26a5bf0
[D 20:03:57.739400] client_state_machine_terminate smcb 0x26a5bf0 completing
[D 20:03:57.739408] add smcb 0x26a5bf0 to completion list
[D 20:03:57.739416] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:03:57.739426] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:03:57.739437] PINT_sys_release: id 6
[D 20:03:57.739447] [SM frame get]: (0x26a5bf0) op-id: 19 index: 0 base-frm: 0
[D 20:03:57.739455] PINT_smcb_free: smcb:0x26a5bf0
[D 20:03:57.739462] PINT_smcb_free: frame:0x26a5cf0 task-id:0
   /ggeu: FAILURE!

I have also run pvfs2-server with all logging enabled, and I don't see any attempt to connect when pvfs2-ping is run on the same machine as the server.

Any ideas?

Josh.

On 28 Aug 2010, at 21:00, Scott Atchley wrote:

Josh,

Nice digging. Let me ping the Open-MX developer and see whether Open-MX maintains session ids the same way MX does. We made some changes in the last release or two.

Scott

On Aug 28, 2010, at 3:01 PM, [email protected] wrote:

FWIW, I've run the server in gdb to see what is causing the seg fault when a client tries to connect remotely:

[D 08/28 19:40] bmi_mx: CONN_REQ from mx://begbie:0:0.
[D 08/28 19:40] bmi_mx: bmx_unexpected_recv rx match= 0xc000000100000100 length= 16.
[D 08/28 19:40] bmi_mx: bmx_handle_conn_req returned RX match 0xc000000100000100 with Success.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff06c0910 (LWP 3026)]
bmx_handle_conn_req () at src/io/bmi/bmi_mx/mx.c:2403
2403            } else if (sid != peer->mxp_sid) { /* reconnecting peer */
(gdb) bt
#0  bmx_handle_conn_req () at src/io/bmi/bmi_mx/mx.c:2403
#1  bmx_connection_handlers () at src/io/bmi/bmi_mx/mx.c:2561
#2  0x0000000000476102 in BMI_mx_testunexpected (incount=196610, outcount=0xb8f31b2a007fc118, ui=0x7ffff06bfe38, max_idle_time=-261357980) at src/io/bmi/bmi_mx/mx.c:2820
#3  0x00000000004549b2 in BMI_testunexpected (incount=<value optimised out>, outcount=<value optimised out>, info_array=<value optimised out>, max_idle_time_ms=0) at src/io/bmi/bmi.c:1000
#4  0x000000000044d5c0 in bmi_thread_function (ptr=<value optimised out>) at src/io/job/thread-mgr.c:182
#5  0x00007ffff7292a04 in start_thread () from /lib/libpthread.so.0
#6  0x00007ffff6bcdd4d in clone () from /lib/libc.so.6
#7  0x0000000000000000 in ?? ()
Line 2403 of mx.c is:

    } else if (sid != peer->mxp_sid) { /* reconnecting peer */

So I examined those variables in gdb:

(gdb) print sid
$1 = 3102939946
(gdb) print peer
$2 = (struct bmx_peer *) 0x5fb7569100030002
(gdb) print peer->mxp_sid
Cannot access memory at address 0x5fb756910003002a

I looked back to where peer gets set, at lines 2358 to 2362:

    {
        void *peerp = &peer;
        mx_get_endpoint_addr_context(status.source, &peerp);
        peer = (struct bmx_peer *) peerp;
    }

I checked status.source, which seems to be ok:

(gdb) print status.source
$3 = {stuff = {8415496, 13327025589530378520}}

I set a breakpoint at line 2361 and ran it again:

Breakpoint 1, bmx_handle_conn_req () at src/io/bmi/bmi_mx/mx.c:2361
2361            peer = (struct bmx_peer *) peerp;
(gdb) print peerp
$2 = (void *) 0xcfa4b15e00030001
(gdb) step
2363            if (peer == NULL) { /* new peer */
(gdb) print peer
$3 = (struct bmx_peer *) 0x0
(gdb) print peer->mxp_sid
Cannot access memory at address 0x28

It seems that mx_get_endpoint_addr_context() is not returning the expected structure?

Josh.
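For readers following the thread, here is a minimal, self-contained sketch of the connection-request path described above. It is not the actual bmi_mx code or the Open-MX fix: the stand-in types, the stubbed mx_get_endpoint_addr_context() and main() exist only so the example compiles without MX installed; just the retrieve-context / NULL-check / session-id-compare pattern mirrors the code quoted from src/io/bmi/bmi_mx/mx.c.

/* Sketch only -- NOT the real bmi_mx or Open-MX code.  The stand-in types
 * and the stubbed mx_get_endpoint_addr_context() are here just so this
 * compiles on its own. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t stuff[2]; } mx_endpoint_addr_t;   /* stand-in */

struct bmx_peer {
    uint32_t mxp_sid;           /* session id remembered for this peer */
    /* ... the real struct has more fields ... */
};

/* Stand-in for the real MX call: it hands back whatever context pointer was
 * previously attached to this endpoint address (NULL for a never-seen peer). */
static int mx_get_endpoint_addr_context(mx_endpoint_addr_t addr, void **context)
{
    (void)addr;
    *context = NULL;            /* pretend this is a brand-new peer */
    return 0;
}

/* Simplified shape of the branch that segfaulted at mx.c:2403. */
static void handle_conn_req(mx_endpoint_addr_t source, uint32_t sid)
{
    void *peerp = NULL;
    struct bmx_peer *peer = NULL;

    mx_get_endpoint_addr_context(source, &peerp);
    peer = (struct bmx_peer *) peerp;

    if (peer == NULL) {
        /* new peer: allocate a bmx_peer and remember the session id */
        printf("new peer, sid=%u\n", (unsigned) sid);
    } else if (sid != peer->mxp_sid) {
        /* reconnecting peer: this dereference is what crashed in the
         * backtrace above, because the context came back as a garbage
         * pointer (0x5fb7569100030002) rather than NULL or a valid peer */
        printf("peer reconnected, new sid=%u\n", (unsigned) sid);
    } else {
        printf("known peer, same session\n");
    }
}

int main(void)
{
    mx_endpoint_addr_t src = { { 0, 0 } };
    handle_conn_req(src, 3102939946u);   /* the sid value from the gdb session */
    return 0;
}

In other words, the dereference at line 2403 is only safe if the context attached to the endpoint address is either NULL or a pointer to a previously allocated bmx_peer, which is why a bad value coming back from mx_get_endpoint_addr_context() on the Open-MX side produces an immediate segfault in pvfs2-server.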
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
