On 08/24/2010 07:30 PM, Phil Carns wrote:

I modified the header file, recompiled, and ran it again -- here is the relevant portion of the debug output:

[D 08/24 20:06] Passing mx://renton:0:3 as BMI listen address.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] Server using shm key hint: 1937657261
[D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 11
[D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 12
[D 08/24 20:06] dbpf_thread_initialize: initialized
[D 08/24 20:06] dbpf_thread_function started
[D 08/24 20:06] [SYNC_COALESCE]: dbpf_sync_context_init for context 0 called
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] bmi_mx: Setting peer mx://begbie:0:3 to BMX_PEER_WAIT.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] bmi_mx: Setting peer mx://tommy:0:3 to BMX_PEER_WAIT.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
OMX: Completing iconnect request: Remote Endpoint is Closed
I don't really understand what is supposed to happen here. The other two machines are not running a pvfs2 server at the moment, because all three of them hit this error and exit before the others can be started. Surely what should happen is some kind of polling loop that waits for the other servers to be ready? That seems to be what going into the "BMX_PEER_WAIT" state implies, but for some reason it does not manage to stay in that state.
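
Roughly, this is the behaviour I was expecting -- just a sketch with made-up names to illustrate what I mean, not actual PVFS2/BMI code:

/*
 * Sketch only, invented names -- not actual PVFS2/BMI code.
 * A peer whose endpoint is not up yet stays in a WAIT state and
 * is retried later, rather than the whole server exiting.
 */
#include <stdio.h>
#include <unistd.h>

enum peer_state { PEER_WAIT, PEER_READY };

struct peer {
    const char *addr;
    enum peer_state state;
};

/* Stand-in for a real connect attempt: 0 on success, -1 if the
 * remote endpoint is not open yet.  Here it "succeeds" after a
 * few tries, as if the other servers came up in the meantime. */
static int try_connect(struct peer *p)
{
    static int attempts = 0;
    (void)p;
    return (++attempts > 4) ? 0 : -1;
}

static void wait_for_peers(struct peer *peers, int n)
{
    int remaining = n;

    while (remaining > 0) {
        for (int i = 0; i < n; i++) {
            if (peers[i].state != PEER_WAIT)
                continue;
            if (try_connect(&peers[i]) == 0) {
                printf("peer %s is up\n", peers[i].addr);
                peers[i].state = PEER_READY;
                remaining--;
            }
            /* on failure the peer just stays in PEER_WAIT */
        }
        sleep(1);   /* poll again instead of giving up */
    }
}

int main(void)
{
    struct peer peers[] = {
        { "mx://begbie:0:3", PEER_WAIT },
        { "mx://tommy:0:3",  PEER_WAIT },
    };

    wait_for_peers(peers, 2);
    printf("all peers ready, server can continue\n");
    return 0;
}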

Josh.
This output is from renton. It tries to connect to begbie and tommy, but they do not have an open MX endpoint. The connect fails and PVFS2 gives up.

I have not experimented much with multiple servers. Perhaps someone else can chime in on whether the servers need to be brought up in a specific order (e.g. in Lustre the metadata server must come up before the storage servers).

Another possibility is that PVFS2 retries failed connections when using sockets but not when using MX. Can anyone verify this?

Lastly, I expected to see more messages from bmi_mx. Is BMX_DB_CONN set in BMX_DB_MASK?
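
(For reference, the connection messages are only printed when that bit is part of the debug mask -- something along these lines. This is only a sketch of the usual pattern; the real bmi_mx macro and bit values may differ.)

/* Sketch of the usual debug-mask pattern; not the literal bmi_mx code. */
#include <stdio.h>

#define BMX_DB_CONN  0x4                 /* bit value is a guess */
#define BMX_DB_MASK  (BMX_DB_CONN)       /* connection messages enabled */

#define BMX_DEBUG(lvl, ...) \
    do { if ((lvl) & BMX_DB_MASK) fprintf(stderr, "bmi_mx: " __VA_ARGS__); } while (0)

int main(void)
{
    /* printed only when BMX_DB_CONN is included in BMX_DB_MASK */
    BMX_DEBUG(BMX_DB_CONN, "connect to %s failed, will retry\n", "mx://begbie:0:3");
    return 0;
}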

Scott

PVFS does sit in a loop and wait for the other servers to come up. It doesn't matter what order they are started in, as long as they all eventually start. My suspicion is that the open-mx library may be calling exit() or abort() when it encounters an error, causing the server to quit before it gets a chance to retry communication.
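
To illustrate what I mean (sketch only, with made-up names -- not the actual open-mx or BMI code): if the failed connect comes back as an error code, the retry loop keeps running; if the library treats it as fatal, the server never gets that far.

/* Sketch only, not actual open-mx or BMI code. */
#include <stdio.h>
#include <stdlib.h>

/* stand-in for "the remote MX endpoint is not open yet" */
static int remote_endpoint_closed(void) { return 1; }

/* If the failure comes back as an error code, the PVFS2 retry loop
 * sees it and simply tries again later: */
static int connect_reports_error(void)
{
    if (remote_endpoint_closed())
        return -1;              /* caller can retry */
    return 0;
}

/* But if the library treats the failure as fatal, the whole
 * pvfs2-server process dies before the retry loop runs again: */
static int connect_is_fatal(void)
{
    if (remote_endpoint_closed())
        abort();                /* no chance to retry */
    return 0;
}

int main(void)
{
    if (connect_reports_error() != 0)
        fprintf(stderr, "connect failed, retrying later\n");

    connect_is_fatal();         /* this one kills the process */
    return 0;
}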

What version of open-mx are you using?


This FAQ entry looks helpful in this regard:

http://open-mx.gforge.inria.fr/FAQ/#running-errors

Can you try repeating your experiment with the OMX_FATAL_ERRORS environment variable set to 0?
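
My assumption about what that variable controls (sketch only, not the actual open-mx source) is something like this:

/* Sketch of how I assume the fatal-errors switch is honored --
 * not the actual open-mx source. */
#include <stdio.h>
#include <stdlib.h>

static void omx_handle_error(const char *msg)
{
    const char *fatal = getenv("OMX_FATAL_ERRORS");

    fprintf(stderr, "OMX: %s\n", msg);
    if (fatal == NULL || atoi(fatal) != 0)
        abort();    /* default behavior: treat the error as fatal */

    /* with OMX_FATAL_ERRORS=0 we return instead, and the caller
     * (the pvfs2-server retry loop) can decide what to do */
}

int main(void)
{
    omx_handle_error("Completing iconnect request: Remote Endpoint is Closed");
    fprintf(stderr, "still running, error was not fatal\n");
    return 0;
}

If that is how it behaves, exporting OMX_FATAL_ERRORS=0 in the environment of each pvfs2-server should let them keep retrying instead of dying.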

-Phil
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
