> Can you try repeating your experiment with the OMX_FATAL_ERRORS
> environment variable set to 0?
With OMX_FATAL_ERRORS=0 and MX_IMM_ACK=1, the servers now start and
connect to each other. It may be worth mentioning OMX_FATAL_ERRORS=0
in the FAQ entry that discusses MX_IMM_ACK (or perhaps it can be set
programmatically via the API?). In any case, thanks for the help in
getting the servers started!
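In case it helps anyone else, here is a minimal sketch of setting both
variables for a single command. The helper name with_omx_env is
hypothetical and not part of PVFS2 or Open-MX, and I am assuming that
both variables only need to be present in the environment of the
process being started. It does not answer whether the same thing can
be done through the API.

```shell
#!/bin/sh
# with_omx_env is a hypothetical helper, not part of PVFS2 or Open-MX.
# It runs the given command with the two settings discussed above,
# without changing the environment of the calling shell.
with_omx_env() {
    env MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 "$@"
}

# Show the values as the child process sees them.
with_omx_env sh -c 'echo "MX_IMM_ACK=$MX_IMM_ACK OMX_FATAL_ERRORS=$OMX_FATAL_ERRORS"'
# prints: MX_IMM_ACK=1 OMX_FATAL_ERRORS=0
```

Since sudo typically resets the environment, the variables have to be
passed after sudo rather than before it, as in the pvfs2-ping commands
below.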
However, I still cannot get the filesystem to actually work. With the
3 servers running, I tried pvfs2-ping, but all I get are connection
errors.
Depending on which of the 3 servers I set in pvfs2tab, I get different
errors with pvfs2-ping.
With the server set to the same host as I am running pvfs2-ping on (in
this case begbie), all three servers keep running without recognizing
any connection request, and I get this output:
begbie:~$ sudo MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 pvfs2-ping -m /ggeu
(1) Parsing tab file...
(2) Initializing system interface...
(3) Initializing each file system found in tab file: /etc/pvfs2tab...
PVFS2 servers: mx://begbie:0:3
Storage name: pvfs2-fs
Local mount point: /ggeu
[E 15:36:10.272413] Warning: msgpair failed to mx://begbie:0:3, will
retry: Network dropped connection on reset
[E 15:36:33.203054] Warning: msgpair failed to mx://begbie:0:3, will
retry: Network dropped connection on reset
[E 15:36:56.162436] Warning: msgpair failed to mx://begbie:0:3, will
retry: Network dropped connection on reset
[E 15:37:19.112399] Warning: msgpair failed to mx://begbie:0:3, will
retry: Network dropped connection on reset
[E 15:37:42.092670] Warning: msgpair failed to mx://begbie:0:3, will
retry: Network dropped connection on reset
[E 15:38:05.052442] Warning: msgpair failed to mx://begbie:0:3, will
retry: Network dropped connection on reset
[E 15:38:05.052484] *** msgpairarray_completion_fn: msgpair to
server [UNKNOWN] failed: Network dropped connection on reset
[E 15:38:05.052499] *** Out of retries.
/ggeu: FAILURE!
Failure: could not initialze at least one of the target file systems.
(4) Searching for /ggeu in pvfstab...
[E 15:38:05.052535] Error: /ggeu/ resides on a PVFS2 file system
that has not yet been initialized.
Failure: could not find filesystem for /ggeu in pvfs2tab /etc/pvfs2tab
Entry 0: /ggeu
If I set the host in pvfs2tab to one of the other two hosts, the
server on that host immediately crashes with a segmentation fault, and
the pvfs2-ping output looks like this:
begbie:~$ sudo MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 pvfs2-ping -m /ggeu
(1) Parsing tab file...
(2) Initializing system interface...
(3) Initializing each file system found in tab file: /etc/pvfs2tab...
PVFS2 servers: mx://renton:0:3
Storage name: pvfs2-fs
Local mount point: /ggeu
[E 15:39:37.761174] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 4.
[E 15:39:37.761355] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 5.
[E 15:39:37.771212] Warning: msgpair failed to mx://renton:0:3, will
retry: Operation cancelled (possibly due to timeout)
[E 15:40:07.031576] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 37.
[E 15:40:07.031600] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 38.
[E 15:40:07.041632] Warning: msgpair failed to mx://renton:0:3, will
retry: Operation cancelled (possibly due to timeout)
[E 15:40:37.312411] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 68.
[E 15:40:37.312457] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 69.
[E 15:40:37.322401] Warning: msgpair failed to mx://renton:0:3, will
retry: Operation cancelled (possibly due to timeout)
[E 15:41:07.631773] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 100.
[E 15:41:07.631822] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 101.
[E 15:41:07.641782] Warning: msgpair failed to mx://renton:0:3, will
retry: Operation cancelled (possibly due to timeout)
[E 15:41:37.941766] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 132.
[E 15:41:37.941808] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 133.
[E 15:41:37.951778] Warning: msgpair failed to mx://renton:0:3, will
retry: Operation cancelled (possibly due to timeout)
[E 15:42:07.271684] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 164.
[E 15:42:07.271729] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 165.
[E 15:42:07.281753] Warning: msgpair failed to mx://renton:0:3, will
retry: Operation cancelled (possibly due to timeout)
[E 15:42:07.281766] *** msgpairarray_completion_fn: msgpair to
server [UNKNOWN] failed: Operation cancelled (possibly due to timeout)
[E 15:42:07.281781] *** Out of retries.
/ggeu: FAILURE!
Failure: could not initialze at least one of the target file systems.
(4) Searching for /ggeu in pvfstab...
[E 15:42:07.281824] Error: /ggeu/ resides on a PVFS2 file system
that has not yet been initialized.
Failure: could not find filesystem for /ggeu in pvfs2tab /etc/pvfs2tab
Entry 0: /ggeu
The pvfs2-server output just before the segmentation fault looks like this:
[P 08/25 15:39] Start times (hr:min:sec): 15:39:06.049
15:39:04.999 15:39:03.968 15:39:02.919 15:39:01.829 15:39:00.779
[P 08/25 15:39] Intervals (hr:min:sec) : 00:00:01.090
00:00:01.050 00:00:01.031 00:00:01.049 00:00:01.090 00:00:01.050
[P 08/25 15:39]
-------------------------------------------------------------------------------------------------------------
[P 08/25 15:39] bytes read : 0
0 0 0 0 0
[P 08/25 15:39] bytes written : 0
0 0 0 0 0
[P 08/25 15:39] metadata reads : 0
0 0 0 0 0
[P 08/25 15:39] metadata writes : 0
0 0 0 0 0
[P 08/25 15:39] metadata dspace ops : 0
0 0 0 0 0
[P 08/25 15:39] metadata keyval ops : 2
2 2 2 2 2
[P 08/25 15:39] request scheduler : 0
0 0 0 0 0
[D 08/25 15:39] [SM Exiting]: (0xc8f140) perf_update_sm:do_work
(error code: 0), (action: DEFERRED)
[D 08/25 15:39] [SM Entering]: (0xc904b0) job_timer_sm:do_work
(status: 0)
[D 08/25 15:39] [SM Exiting]: (0xc904b0) job_timer_sm:do_work (error
code: 0), (action: DEFERRED)
[D 08/25 15:39] bmi_mx: CONN_REQ from mx://begbie:0:0.
[D 08/25 15:39] bmi_mx: bmx_unexpected_recv rx match=
0xc000000100000100 length= 16.
[D 08/25 15:39] bmi_mx: bmx_handle_conn_req returned RX match
0xc000000100000100 with Success.
[E 08/25 15:39] PVFS2 server: signal 11, faulty address is (nil),
from 0x475818
[E 08/25 15:39] [bt] pvfs2-server [0x475818]
[E 08/25 15:39] [bt] pvfs2-server [0x475818]
[E 08/25 15:39] [bt] pvfs2-server [0x476102]
[E 08/25 15:39] [bt] pvfs2-server(BMI_testunexpected+0x392) [0x4549b2]
[E 08/25 15:39] [bt] pvfs2-server [0x44d5c0]
[E 08/25 15:39] [bt] /lib/libpthread.so.0 [0x7fe7ed8a6a04]
[E 08/25 15:39] [bt] /lib/libc.so.6(clone+0x6d) [0x7fe7ed1e1d4d]
Segmentation fault
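
Most of the pvfs2-server frames above are bare addresses. As a rough
sketch, they could be pulled out of the log and handed to addr2line,
assuming the binary was built with debug symbols. The function name
bt_addresses, the log file name and the path to the binary are all
placeholders of mine, not anything PVFS2 provides.

```shell
#!/bin/sh
# bt_addresses is a hypothetical helper: it reads server log text on
# stdin and prints the address of every "[bt]" frame that belongs to
# pvfs2-server, one per line. Frames from shared libraries are skipped.
bt_addresses() {
    sed -n 's/.*\[bt\] *pvfs2-server[^[]*\[\(0x[0-9a-fA-F]*\)\].*/\1/p'
}

# Possible usage (file name and binary path are assumptions):
#   bt_addresses < server.log | addr2line -f -e /path/to/pvfs2-server
```

addr2line reads addresses from standard input when none are given on
the command line, so the pipe above should work as written if the
symbols are available.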