On 08/24/2010 07:30 PM, Phil Carns wrote:
I modified the header file, recompiled, and ran it again -- here is
the relevant portion of the debug output:
[D 08/24 20:06] Passing mx://renton:0:3 as BMI listen address.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] Server using shm key hint: 1937657261
[D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 11
[D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 12
[D 08/24 20:06] dbpf_thread_initialize: initialized
[D 08/24 20:06] dbpf_thread_function started
[D 08/24 20:06] [SYNC_COALESCE]: dbpf_sync_context_init for context 0 called
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] bmi_mx: Setting peer mx://begbie:0:3 to BMX_PEER_WAIT.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] bmi_mx: Setting peer mx://tommy:0:3 to BMX_PEER_WAIT.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
OMX: Completing iconnect request: Remote Endpoint is Closed
I don't really understand what is supposed to happen here -- the
other two machines are not running a pvfs2 server at the moment
because all three of them have this error and close before the
others can be started. Surely what should happen is some kind of
polling loop waiting for the other servers to be ready? That seems
to be what is implied by going into the "BMX_PEER_WAIT" state, but
it seems to be having a problem maintaining that state for some
reason.
Josh.
This output is from renton. It tries to connect to begbie and tommy,
but they do not have an open MX endpoint. The connect fails and
PVFS2 gives up.
I have not experimented much with multiple servers. Perhaps someone
else can chime in as to whether there should be a specific order for
bringing up servers (e.g. in Lustre the metadata server must come up
before the storage servers).
Another possibility is that PVFS2 retries when using socket
connections but does not retry with MX. Can anyone verify this?
Lastly, I expected to see more messages from bmi_mx. Is
BMX_DB_CONN set in the BMX_DB_MASK?
Scott
PVFS does sit in a loop and wait for other servers to come up. It
doesn't matter what order they are started as long as they all
eventually start. My suspicion would be that the open-mx library
might be calling exit() or abort() when it encounters an error,
causing the server to quit before it gets a chance to retry
communication.
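
To illustrate the suspicion, here is a rough Python sketch (not PVFS2 or
Open-MX code; every name in it is made up for the example). A retry loop
only helps if a failed connect reports an error back to the caller. If
the library terminates the process instead, the loop never gets a second
attempt:

```python
# Hypothetical sketch: none of these names exist in PVFS2 or Open-MX.
import time


class PeerNotReady(Exception):
    """Raised by the stand-in connect function while the peer endpoint is closed."""


def wait_for_peer(connect, retries=5, delay=0.0):
    """Call connect() until it succeeds; return the number of attempts used."""
    for attempt in range(1, retries + 1):
        try:
            connect()
            return attempt
        except PeerNotReady:
            time.sleep(delay)  # peer not up yet; wait and try again
    raise TimeoutError("peer never became ready")


def make_peer(ready_after, fatal=False):
    """Build a fake connect() that fails until its ready_after-th call."""
    calls = {"n": 0}

    def connect():
        calls["n"] += 1
        if calls["n"] < ready_after:
            if fatal:
                # Stand-in for a library that exits the process on error.
                raise SystemExit("Remote Endpoint is Closed")
            raise PeerNotReady()

    return connect
```

With make_peer(3) the loop succeeds on the third attempt. With
make_peer(3, fatal=True) the first failure ends the process before any
retry, which would match the behavior in the log above if the library
really is exiting.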
What version of open-mx are you using?
This FAQ entry looks helpful in this regard:
http://open-mx.gforge.inria.fr/FAQ/#running-errors
Can you try repeating your experiment with the OMX_FATAL_ERRORS
environment variable set to 0?
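
In case it helps, a minimal sketch of one way to do that from a Python
wrapper. The variable name comes from the FAQ entry above; the helper
names and the server command line are placeholders I made up, so adjust
them to your installation:

```python
import os
import subprocess


def env_with_nonfatal_omx(base=None):
    """Return a copy of the environment with OMX_FATAL_ERRORS set to "0"."""
    env = dict(os.environ if base is None else base)
    env["OMX_FATAL_ERRORS"] = "0"
    return env


def launch_server(cmd):
    """Run cmd with the modified environment and return its exit code."""
    return subprocess.run(cmd, env=env_with_nonfatal_omx()).returncode


# Example (placeholder command line, not taken from this thread):
# launch_server(["pvfs2-server", "/path/to/fs.conf"])
```

Exporting the variable in the shell before starting the server should
have the same effect.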
-Phil
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users