I modified the header file, recompiled, and ran it again -- here is the
relevant portion of the debug output:
[D 08/24 20:06] Passing mx://renton:0:3 as BMI listen address.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] Server using shm key hint: 1937657261
[D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 11
[D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 12
[D 08/24 20:06] dbpf_thread_initialize: initialized
[D 08/24 20:06] dbpf_thread_function started
[D 08/24 20:06] [SYNC_COALESCE]: dbpf_sync_context_init for context 0 called
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] bmi_mx: Setting peer mx://begbie:0:3 to BMX_PEER_WAIT.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] bmi_mx: Setting peer mx://tommy:0:3 to BMX_PEER_WAIT.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
OMX: Completing iconnect request: Remote Endpoint is Closed
I don't really understand what is supposed to happen here -- the other two machines are
not running a pvfs2 server at the moment because all three of them have this error and
close before the others can be started. Surely what should happen is some kind of
polling loop waiting for the other servers to be ready? That seems to be what is implied
by going into the "BMX_PEER_WAIT" state, but it seems to be having a problem
maintaining that state for some reason.
Josh.
This output is from renton. It tries to connect to begbie and tommy, but they
do not have an open MX endpoint. The connect fails and PVFS2 gives up.
I have not experimented much with multiple servers. Perhaps someone else can
chime in as to whether there should be a specific order to bringing up servers
(e.g. in Lustre the metadata server must come up before the storage servers).
Another possibility is that PVFS2 retries the connection when using sockets
but does not when using MX. Can anyone verify this?
Lastly, I expected to see some more messages from bmi_mx. Is BMX_DB_CONN set in
the BMX_DB_MASK?
Scott
PVFS does sit in a loop and wait for other servers to come up. It
doesn't matter what order they are started as long as they all
eventually start. My suspicion would be that the open-mx library might
be calling exit() or abort() when it encounters an error, causing the
server to quit before it gets a chance to retry communication.
What version of open-mx are you using?
-Phil
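
For reference, the retry behaviour described above (keep trying each peer
until it comes up, in any order) can be sketched roughly as follows. This is
only an illustration in Python, not the PVFS2 or bmi_mx code: PeerNotReady,
connect_with_retry and make_flaky_connect are hypothetical names, and the
"connect" call is simulated rather than a real MX call.

```python
import time


class PeerNotReady(Exception):
    """Hypothetical error: the remote peer has no open endpoint yet."""


def connect_with_retry(connect, peer, retries=5, delay=0.0):
    """Call connect(peer) until it succeeds or the retries run out.

    Returns the number of attempts used. Re-raises PeerNotReady if the
    last attempt also fails.
    """
    for attempt in range(1, retries + 1):
        try:
            connect(peer)
            return attempt
        except PeerNotReady:
            if attempt == retries:
                raise
            # The peer may simply not have been started yet; wait and retry.
            time.sleep(delay)


def make_flaky_connect(failures_before_ready):
    """Simulate peers that open their endpoint only after N failed attempts."""
    remaining = dict(failures_before_ready)

    def connect(peer):
        if remaining.get(peer, 0) > 0:
            remaining[peer] -= 1
            raise PeerNotReady(peer)

    return connect
```

With a loop like this, a failed connect is an ordinary error that the caller
can recover from. If the underlying library calls exit() or abort() on that
failure instead of returning an error, the loop never gets a chance to run
again, which would match the symptom in the log.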