Re: [Pvfs2-users] "Remote Endpoint is Closed" error starting pvfs2-server

Joshua Randall Wed, 25 May 2011 04:53:36 -0700

I now seem to have solved the problem by recompiling PVFS2 with:

/* disable shmem to allow clients and servers on the same machine */
setenv("MX_DISABLE_SHMEM", "1",1);



Changed to:

/* disable shmem to allow clients and servers on the same machine */
setenv("MX_DISABLE_SHMEM", "0",1);

With shared communication now enabled in Open-MX, everything appears to be working and I can run pvfs2-server on all three machines and use the filesystem from a client on any of the machines!

Can you solve the mystery of why shared memory needed to be disabled in MX, but seems to be required in Open-MX?


Josh.



On 25 May 2011, at 00:55, Joshua Randall wrote:

Scott,

From the Open-MX developers, I got the following response:
"self" is only about communicating between one endpoint and itself (within the same single process then). I don't think it applies to your case. What you need is either shared communication (1) or a local network communication (2).
(1) means that you would reenable MX_DISABLE_SHMEM=0. But I don't think it can work because PVFS doesn't check whether MX_DISABLE_SHMEM is already defined before forcing it to 1. However, setting OMX_DISABLE_SHARED to 0 may work because Open-MX uses its own env variables before looking at the MX specific ones.
(2) means that you attach the "lo" interface to the Open-MX driver. This is the usual way of testing Open-MX locally. But your log tells me that you didn't do this (I see eth3 and 00:1b:21:4f:4d:5a).
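The point in (1) about PVFS forcing the variable comes down to the third argument of setenv(). A minimal sketch of the difference, assuming a POSIX system; the two helper names are hypothetical and are not part of PVFS2 or Open-MX:

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not PVFS2 code): set a default for an environment
 * variable only when the user has not already defined it.
 * Returns 0 on success and -1 on failure, like setenv().
 */
static int set_env_default(const char *name, const char *value)
{
    /* overwrite == 0: an existing value is left untouched */
    return setenv(name, value, 0);
}

/*
 * Hypothetical helper: force the value, as the quoted PVFS2 line does
 * by passing 1 as the overwrite argument.
 */
static int set_env_forced(const char *name, const char *value)
{
    /* overwrite == 1: any value set by the user is replaced */
    return setenv(name, value, 1);
}
```

With the first form, a user who exports MX_DISABLE_SHMEM=0 before starting the program would keep that value; with the second form it is replaced regardless of what was exported.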
This has left me a bit confused about how PVFS2 is supposed to work.
In the PVFS2 source, the bmi_mx code explicitly sets MX_DISABLE_SHMEM to 1 in src/io/bmi/bmi_mx/mx.c:
/* disable shmem to allow clients and servers on the same machine */
setenv("MX_DISABLE_SHMEM", "1",1);
This seems to imply that "shared" mode has to be disabled in order to enable communication between clients and servers on the same machine, but the Open-MX developers seem to be saying exactly the opposite (that shared mode must be enabled to allow two endpoints in different processes to communicate). I am not familiar with how this works in MX as I have only tried Open-MX -- is it possible that Open-MX and MX handle self/shared communications differently?
As for the "lo" interface, I have tested my setup with the "lo" interface attached. The "lo" interface works just fine from within Open-MX (testing with omx_loopback_test), but PVFS2 does not seem to recognize that the "lo" interface can be used as an alternative to connect to the local host, although perhaps I am missing some PVFS2 configuration option that would enable that?
The aliases section of my pvfs2-fs.conf is:
<Aliases>
       Alias begbie mx://begbie:0:3
       Alias renton mx://renton:0:3
       Alias tommy mx://tommy:0:3
</Aliases>
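For anyone reading the aliases above, here is a rough sketch of how such a URL breaks down, assuming the form mx://hostname:board:endpoint. That layout is my reading of the addresses, and the struct and function names are hypothetical; this is not the parser PVFS2 itself uses:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three fields of an mx:// alias URL. */
struct mx_alias_addr {
    char host[64];      /* hostname, e.g. "begbie" */
    unsigned board;     /* MX board index */
    unsigned endpoint;  /* MX endpoint id */
};

/*
 * Parse "mx://hostname:board:endpoint" into *out.
 * Returns 0 on success, -1 if the string does not match exactly.
 */
static int parse_mx_url(const char *url, struct mx_alias_addr *out)
{
    char tail;
    /* %63[^:] reads the hostname up to the first colon; the final %c
       only converts if something follows the endpoint number */
    int n = sscanf(url, "mx://%63[^:]:%u:%u%c",
                   out->host, &out->board, &out->endpoint, &tail);
    return (n == 3) ? 0 : -1;
}
```

Under that assumption, "mx://begbie:0:3" parses as host begbie, board 0, endpoint 3, while a line carrying two URLs is rejected by this sketch because of the trailing text.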
Is there some way to specify that an alias can refer to two different mx endpoints so that we can tell PVFS2 which MX board is the localhost interface? (when I have both "eth3" and "lo" boards attached in Open-MX, I wind up with e.g. begbie:0 for "eth3" and begbie:1 for "lo"). I've tried changing the configuration line to "Alias begbie mx://begbie:0:3 mx://begbie:1:3" but that gives a config error. I had previously searched the documentation and list archive for a way to do this, but found nothing much except a note that one could not attach two MX boards on the same server (something I actually would like to do since I have two 10GbE cards in each server).
Please let me know if I've missed something about how I can tell PVFS2 to use the localhost interface -- if it is documented somewhere, I've managed to miss it despite lots of looking!
If the "lo" interface is not supported and shared/self communication is what allows two endpoints on the same host to talk to each other, why does PVFS2 disable the only mode (shared) that would actually allow Open-MX to talk to endpoints on the same host but running under different processes?
Josh.


On 23 May 2011, at 21:55, Atchley, Scott wrote:
On May 23, 2011, at 4:08 PM, Joshua Randall wrote:
Scott,

I compiled and ran your test program (with a few small errors
corrected):
I am surprised it was only a few small errors since it was written in my email client... ;-)
Running it is successful:
jrandall@begbie:/tmp$ ./omxtestself
iconnect completed with status Success
However, I'm not really sure what this is testing -- it seems like it
is only opening one endpoint, looking up that endpoint address, and
then trying to connect to it.  I thought the failure I was observing
was that when the client (on one endpoint) tried to connect to the
server (on another endpoint?) on the same host.  Am I mistaken about
that?
Yes, and they should be in different processes too. Good catch.
In any case, I've tried to extend your test to cover that
scenario:
<snip>
This results in:
jrandall@begbie:/tmp$ OMX_CONNECT_POLLALL=1 ./omxtestself2
OMX: Forcing connect polling all endpoints to enabled
iconnect completed with status Success
iconnect to nic_id from hostname completed with status Success
(it does not exit, but seems to loop forever in the final mx_test loop)
Hmmm. Not sure what to make of it not returning. Can you print out the value of result?
I'm not sure that I'm doing this right, and have actually also
implemented it as two separate test programs, one for each endpoint
and that has exactly the same issue.  I'll pass this test case along
to the Open-MX people and see if they can point out what is wrong.

Thanks!

Josh.
Ok, you wrote it as two apps and it connects or does not connect? You may also want to set MX_VERBOSE=3 in your environment or something like that to see if Open-MX is complaining internally. They may also have a separate OMX_VERBOSE environment flag. Check their docs.
Scott



