On 08/24/2010 07:30 PM, Phil Carns wrote:

I modified the header file, recompiled, and ran it again -- here is the relevant portion of the debug output:

[D 08/24 20:06] Passing mx://renton:0:3 as BMI listen address.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] Server using shm key hint: 1937657261
[D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 11
[D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 12
[D 08/24 20:06] dbpf_thread_initialize: initialized
[D 08/24 20:06] dbpf_thread_function started
[D 08/24 20:06] [SYNC_COALESCE]: dbpf_sync_context_init for context 0 called
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] bmi_mx: Setting peer mx://begbie:0:3 to BMX_PEER_WAIT.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
[D 08/24 20:06] bmi_mx: Setting peer mx://tommy:0:3 to BMX_PEER_WAIT.
[D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
OMX: Completing iconnect request: Remote Endpoint is Closed
I don't really understand what is supposed to happen here. The other two machines are not running a pvfs2 server at the moment, because all three of them hit this error and exit before the others can be started. Surely what should happen is some kind of polling loop that waits for the other servers to be ready? That seems to be what going into the "BMX_PEER_WAIT" state implies, but for some reason it does not manage to stay in that state.
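
Roughly, this is the behaviour I was expecting -- just a sketch with made-up names to illustrate what I mean, not actual PVFS2/BMI code:

/*
 * Sketch only, invented names -- not actual PVFS2/BMI code.
 * A peer whose endpoint is not up yet stays in a WAIT state and
 * is retried later, rather than the whole server exiting.
 */
#include <stdio.h>
#include <unistd.h>

enum peer_state { PEER_WAIT, PEER_READY };

struct peer {
    const char *addr;
    enum peer_state state;
};

/* Stand-in for a real connect attempt: 0 on success, -1 if the
 * remote endpoint is not open yet.  Here it "succeeds" after a
 * few tries, as if the other servers came up in the meantime. */
static int try_connect(struct peer *p)
{
    static int attempts = 0;
    (void)p;
    return (++attempts > 4) ? 0 : -1;
}

static void wait_for_peers(struct peer *peers, int n)
{
    int remaining = n;

    while (remaining > 0) {
        for (int i = 0; i < n; i++) {
            if (peers[i].state != PEER_WAIT)
                continue;
            if (try_connect(&peers[i]) == 0) {
                printf("peer %s is up\n", peers[i].addr);
                peers[i].state = PEER_READY;
                remaining--;
            }
            /* on failure the peer just stays in PEER_WAIT */
        }
        sleep(1);   /* poll again instead of giving up */
    }
}

int main(void)
{
    struct peer peers[] = {
        { "mx://begbie:0:3", PEER_WAIT },
        { "mx://tommy:0:3",  PEER_WAIT },
    };

    wait_for_peers(peers, 2);
    printf("all peers ready, server can continue\n");
    return 0;
}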

Josh.
This output is from renton. It tries to connect to begbie and tommy, but they do not have an open MX endpoint. The connect fails and PVFS2 gives up.

I have not experimented much with multiple servers. Perhaps someone else can chime in on whether the servers need to be brought up in a specific order (e.g. in Lustre the metadata server must come up before the storage servers).

Another possibility is that PVFS2 retries failed connections when using sockets but not when using MX. Can anyone verify this?

Lastly, I expected to see more messages from bmi_mx. Is BMX_DB_CONN set in BMX_DB_MASK?
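
(For reference, the connection messages are only printed when that bit is part of the debug mask -- something along these lines. This is only a sketch of the usual pattern; the real bmi_mx macro and bit values may differ.)

/* Sketch of the usual debug-mask pattern; not the literal bmi_mx code. */
#include <stdio.h>

#define BMX_DB_CONN  0x4                 /* bit value is a guess */
#define BMX_DB_MASK  (BMX_DB_CONN)       /* connection messages enabled */

#define BMX_DEBUG(lvl, ...) \
    do { if ((lvl) & BMX_DB_MASK) fprintf(stderr, "bmi_mx: " __VA_ARGS__); } while (0)

int main(void)
{
    /* printed only when BMX_DB_CONN is included in BMX_DB_MASK */
    BMX_DEBUG(BMX_DB_CONN, "connect to %s failed, will retry\n", "mx://begbie:0:3");
    return 0;
}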

Scott

PVFS does sit in a loop and wait for the other servers to come up. It doesn't matter what order they are started in, as long as they all eventually start. My suspicion is that the open-mx library may be calling exit() or abort() when it encounters an error, causing the server to quit before it gets a chance to retry communication.
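
To illustrate what I mean (sketch only, with made-up names -- not the actual open-mx or BMI code): if the failed connect comes back as an error code, the retry loop keeps running; if the library treats it as fatal, the server never gets that far.

/* Sketch only, not actual open-mx or BMI code. */
#include <stdio.h>
#include <stdlib.h>

/* stand-in for "the remote MX endpoint is not open yet" */
static int remote_endpoint_closed(void) { return 1; }

/* If the failure comes back as an error code, the PVFS2 retry loop
 * sees it and simply tries again later: */
static int connect_reports_error(void)
{
    if (remote_endpoint_closed())
        return -1;              /* caller can retry */
    return 0;
}

/* But if the library treats the failure as fatal, the whole
 * pvfs2-server process dies before the retry loop runs again: */
static int connect_is_fatal(void)
{
    if (remote_endpoint_closed())
        abort();                /* no chance to retry */
    return 0;
}

int main(void)
{
    if (connect_reports_error() != 0)
        fprintf(stderr, "connect failed, retrying later\n");

    connect_is_fatal();         /* this one kills the process */
    return 0;
}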

What version of open-mx are you using?


This FAQ entry looks helpful in this regard:

http://open-mx.gforge.inria.fr/FAQ/#running-errors

Can you try repeating your experiment with the OMX_FATAL_ERRORS environment variable set to 0?
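
My assumption about what that variable controls (sketch only, not the actual open-mx source) is something like this:

/* Sketch of how I assume the fatal-errors switch is honored --
 * not the actual open-mx source. */
#include <stdio.h>
#include <stdlib.h>

static void omx_handle_error(const char *msg)
{
    const char *fatal = getenv("OMX_FATAL_ERRORS");

    fprintf(stderr, "OMX: %s\n", msg);
    if (fatal == NULL || atoi(fatal) != 0)
        abort();    /* default behavior: treat the error as fatal */

    /* with OMX_FATAL_ERRORS=0 we return instead, and the caller
     * (the pvfs2-server retry loop) can decide what to do */
}

int main(void)
{
    omx_handle_error("Completing iconnect request: Remote Endpoint is Closed");
    fprintf(stderr, "still running, error was not fatal\n");
    return 0;
}

If that is how it behaves, exporting OMX_FATAL_ERRORS=0 in the environment of each pvfs2-server should let them keep retrying instead of dying.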

-Phil
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
