Can you try repeating your experiment with the OMX_FATAL_ERRORS environment variable set to 0?

With OMX_FATAL_ERRORS=0 and MX_IMM_ACK=1, the servers now start and connect to each other. Perhaps the FAQ entry that discusses MX_IMM_ACK should also mention that OMX_FATAL_ERRORS=0 is needed (or perhaps it can be set programmatically via the API?). In any case, thanks for the help in getting the servers started!
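
For reference, this is roughly how I am setting the two variables for every command now. It is only a minimal sketch: the wrapper name run_with_omx_env is my own placeholder and not part of PVFS2 or Open-MX, and the variable values are simply the ones that worked for me above.

```shell
# Hypothetical helper (the name is a placeholder): run any command with
# both variables set in its environment, without touching the calling shell.
run_with_omx_env() {
    env MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 "$@"
}

# Example (commented out, since it needs a PVFS2 installation):
#   run_with_omx_env pvfs2-ping -m /ggeu
```

When going through sudo, I put the VAR=value pairs after sudo instead, as in the commands further down, because sudo may reset the environment of the command it runs.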

However, I still cannot get the filesystem to actually work. With the 3 servers running, I tried pvfs2-ping, but all I get are connection errors.

Depending on which of the 3 servers I set in /etc/pvfs2tab, I get different errors with pvfs2-ping.

With the server set to the same host as I am running pvfs2-ping on (in this case begbie), all three servers keep running without recognizing any connection request, and I get this output:

begbie:~$ sudo MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 pvfs2-ping -m /ggeu

(1) Parsing tab file...

(2) Initializing system interface...

(3) Initializing each file system found in tab file: /etc/pvfs2tab...

   PVFS2 servers: mx://begbie:0:3
   Storage name: pvfs2-fs
   Local mount point: /ggeu
[E 15:36:10.272413] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
[E 15:36:33.203054] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
[E 15:36:56.162436] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
[E 15:37:19.112399] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
[E 15:37:42.092670] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
[E 15:38:05.052442] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
[E 15:38:05.052484] *** msgpairarray_completion_fn: msgpair to server [UNKNOWN] failed: Network dropped connection on reset
[E 15:38:05.052499] *** Out of retries.
   /ggeu: FAILURE!

Failure: could not initialze at least one of the target file systems.

(4) Searching for /ggeu in pvfstab...
[E 15:38:05.052535] Error: /ggeu/ resides on a PVFS2 file system that has not yet been initialized.
Failure: could not find filesystem for /ggeu in pvfs2tab /etc/pvfs2tab
Entry 0: /ggeu

If I set the host in pvfs2tab to one of the other two hosts, the server on that host immediately crashes with a segmentation fault, and the pvfs2-ping output looks like this:

begbie:~$ sudo MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 pvfs2-ping -m /ggeu

(1) Parsing tab file...

(2) Initializing system interface...

(3) Initializing each file system found in tab file: /etc/pvfs2tab...

   PVFS2 servers: mx://renton:0:3
   Storage name: pvfs2-fs
   Local mount point: /ggeu
[E 15:39:37.761174] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 4.
[E 15:39:37.761355] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 5.
[E 15:39:37.771212] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
[E 15:40:07.031576] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 37.
[E 15:40:07.031600] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 38.
[E 15:40:07.041632] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
[E 15:40:37.312411] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 68.
[E 15:40:37.312457] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 69.
[E 15:40:37.322401] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
[E 15:41:07.631773] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 100.
[E 15:41:07.631822] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 101.
[E 15:41:07.641782] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
[E 15:41:37.941766] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 132.
[E 15:41:37.941808] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 133.
[E 15:41:37.951778] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
[E 15:42:07.271684] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 164.
[E 15:42:07.271729] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 165.
[E 15:42:07.281753] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
[E 15:42:07.281766] *** msgpairarray_completion_fn: msgpair to server [UNKNOWN] failed: Operation cancelled (possibly due to timeout)
[E 15:42:07.281781] *** Out of retries.
   /ggeu: FAILURE!

Failure: could not initialze at least one of the target file systems.

(4) Searching for /ggeu in pvfstab...
[E 15:42:07.281824] Error: /ggeu/ resides on a PVFS2 file system that has not yet been initialized.
Failure: could not find filesystem for /ggeu in pvfs2tab /etc/pvfs2tab
Entry 0: /ggeu


And the pvfs2-server output before the seg fault looks like:
[P 08/25 15:39] Start times (hr:min:sec): 15:39:06.049 15:39:04.999 15:39:03.968 15:39:02.919 15:39:01.829 15:39:00.779
[P 08/25 15:39] Intervals (hr:min:sec) : 00:00:01.090 00:00:01.050 00:00:01.031 00:00:01.049 00:00:01.090 00:00:01.050
[P 08/25 15:39] -------------------------------------------------------------------------------------------------------------
[P 08/25 15:39] bytes read : 0 0 0 0 0 0
[P 08/25 15:39] bytes written : 0 0 0 0 0 0
[P 08/25 15:39] metadata reads : 0 0 0 0 0 0
[P 08/25 15:39] metadata writes : 0 0 0 0 0 0
[P 08/25 15:39] metadata dspace ops : 0 0 0 0 0 0
[P 08/25 15:39] metadata keyval ops : 2 2 2 2 2 2
[P 08/25 15:39] request scheduler : 0 0 0 0 0 0
[D 08/25 15:39] [SM Exiting]: (0xc8f140) perf_update_sm:do_work (error code: 0), (action: DEFERRED)
[D 08/25 15:39] [SM Entering]: (0xc904b0) job_timer_sm:do_work (status: 0)
[D 08/25 15:39] [SM Exiting]: (0xc904b0) job_timer_sm:do_work (error code: 0), (action: DEFERRED)
[D 08/25 15:39] bmi_mx: CONN_REQ from mx://begbie:0:0.
[D 08/25 15:39] bmi_mx: bmx_unexpected_recv rx match= 0xc000000100000100 length= 16.
[D 08/25 15:39] bmi_mx: bmx_handle_conn_req returned RX match 0xc000000100000100 with Success.
[E 08/25 15:39] PVFS2 server: signal 11, faulty address is (nil), from 0x475818
[E 08/25 15:39] [bt] pvfs2-server [0x475818]
[E 08/25 15:39] [bt] pvfs2-server [0x475818]
[E 08/25 15:39] [bt] pvfs2-server [0x476102]
[E 08/25 15:39] [bt] pvfs2-server(BMI_testunexpected+0x392) [0x4549b2]
[E 08/25 15:39] [bt] pvfs2-server [0x44d5c0]
[E 08/25 15:39] [bt] /lib/libpthread.so.0 [0x7fe7ed8a6a04]
[E 08/25 15:39] [bt] /lib/libc.so.6(clone+0x6d) [0x7fe7ed1e1d4d]
Segmentation fault
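
In case it helps with debugging, most of the frames in the backtrace above are bare addresses inside pvfs2-server, so they could probably be resolved to function names with addr2line. This is only a sketch under assumptions: the helper name extract_bt_addrs, the log file name, and the binary path are placeholders of mine, and the result is only useful if pvfs2-server was built with debug symbols.

```shell
# Hypothetical helper (not part of PVFS2): read server log text on stdin and
# print each distinct address from the "[bt] pvfs2-server ... [0x...]" frames,
# in the order they first appear. Frames from libpthread/libc are skipped.
extract_bt_addrs() {
    grep -o 'pvfs2-server[^[]*\[0x[0-9a-f]*\]' \
        | grep -o '0x[0-9a-f]*\]' \
        | tr -d ']' \
        | awk '!seen[$0]++'
}

# Example (commented out; server.log and the binary path are placeholders):
#   extract_bt_addrs < server.log | xargs addr2line -f -e /path/to/pvfs2-server
```

addr2line -f prints the function name together with the file and line for each address, and -e selects the executable to look the addresses up in.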


_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
