Scott,

From the Open-MX developers, I got the following response:

"self" is only about communicating between one endpoint and itself (within the same single process then). I don't think it applies to your case. What you need is either shared communication (1) or a local network communication (2).

(1) means that you would re-enable shared memory by setting MX_DISABLE_SHMEM=0. But I don't think it can work because PVFS doesn't check whether MX_DISABLE_SHMEM is already defined before forcing it to 1. However, setting OMX_DISABLE_SHARED to 0 may work because Open-MX uses its own env variables before looking at the MX specific ones.

(2) means that you attach the "lo" interface to the Open-MX driver. This is the usual way of testing Open-MX locally. But your log tells me that you didn't do this (I see eth3 and 00:1b:21:4f:4d:5a).

This has left me a bit confused about how PVFS2 is supposed to work.

In the PVFS2 source, the bmi_mx code explicitly sets MX_DISABLE_SHMEM to 1 in src/io/bmi/bmi_mx/mx.c:
/* disable shmem to allow clients and servers on the same machine */
setenv("MX_DISABLE_SHMEM", "1",1);
This seems to imply that "shared" mode has to be disabled in order to enable communication between clients and servers on the same machine, but the Open-MX developers seem to be saying exactly the opposite (that shared mode must be enabled to allow two endpoints in different processes to communicate). I am not familiar with how this works in MX, as I have only tried Open-MX -- is it possible that Open-MX and MX handle self/shared communications differently?

As for the "lo" interface, I have tested my setup with the "lo" interface attached. The "lo" interface works just fine from within Open-MX (testing with omx_loopback_test), but PVFS2 does not seem to recognize that the "lo" interface can be used as an alternative to connect to the local host, although perhaps I am missing some PVFS2 configuration option that would enable that?

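To illustrate the point the Open-MX developers make about PVFS not checking for an existing value: POSIX setenv() takes a third "overwrite" argument, and passing 0 leaves an already-defined variable untouched. The following is only a sketch of what a non-forcing default could look like, not the actual bmi_mx code; the helper names set_env_default and apply_shmem_default are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of PVFS2): apply a default value only
 * when the variable is not already present in the environment.
 * Returns 0 on success, -1 on error (same convention as setenv). */
int set_env_default(const char *name, const char *value)
{
    /* overwrite == 0: an existing value is left unchanged */
    return setenv(name, value, 0);
}

/* Hypothetical stand-in for the unconditional call in bmi_mx:
 * shmem is disabled by default, but a value set by the user wins. */
int apply_shmem_default(void)
{
    return set_env_default("MX_DISABLE_SHMEM", "1");
}
```

With a change along these lines, a value of MX_DISABLE_SHMEM exported before the process starts would be kept, and "1" would apply only when the variable is unset. Whether it is actually safe for PVFS2 to run with shared memory enabled is a separate question.
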
The aliases section of my pvfs2-fs.conf is:
<Aliases>
        Alias begbie mx://begbie:0:3
        Alias renton mx://renton:0:3
        Alias tommy mx://tommy:0:3
</Aliases>

Is there some way to specify that an alias can refer to two different mx endpoints so that we can tell PVFS2 which MX board is the localhost interface? (When I have both "eth3" and "lo" boards attached in Open-MX, I wind up with e.g. begbie:0 for "eth3" and begbie:1 for "lo".) I've tried changing the configuration line to "Alias begbie mx://begbie:0:3 mx://begbie:1:3" but that gives a config error. I had previously searched the documentation and list archive for a way to do this, but found nothing much except a note that one could not attach two MX boards on the same server (something I actually would like to do since I have two 10GbE cards in each server).

Please let me know if I've missed something about how I can tell PVFS2 to use the localhost interface -- if it is documented somewhere, I've managed to miss it despite lots of looking! If the "lo" interface is not supported and shared/self communication is what allows two endpoints on the same host to talk to each other, why does PVFS2 disable the only mode (shared) that would actually allow Open-MX to talk to endpoints on the same host but running under different processes?

Josh.


On 23 May 2011, at 21:55, Atchley, Scott wrote:


On May 23, 2011, at 4:08 PM, Joshua Randall wrote:

Scott,

I compiled and ran your test program (with a few small errors
corrected):

I am surprised it was only a few small errors since it was written in my email client... ;-)


Running it is successful:
jrandall@begbie:/tmp$ ./omxtestself
iconnect completed with status Success


However, I'm not really sure what this is testing -- it seems like it
is only opening one endpoint, looking up that endpoint address, and
then trying to connect to it.  I thought the failure I was observing
occurred when the client (on one endpoint) tried to connect to the
server (on another endpoint?) on the same host.  Am I mistaken about
that?

Yes, and they should be in different processes too. Good catch.

In any case, I've tried to extend your test to cover that
scenario:

<snip>

This results in:

jrandall@begbie:/tmp$ OMX_CONNECT_POLLALL=1 ./omxtestself2
OMX: Forcing connect polling all endpoints to enabled
iconnect completed with status Success
iconnect to nic_id from hostname completed with status Success

(it does not exit, but seems to loop forever in the final mx_test loop)

Hmmm. Not sure what to make of it not returning. Can you print out the value of result?

I'm not sure that I'm doing this right, and have actually also
implemented it as two separate test programs, one for each endpoint
and that has exactly the same issue.  I'll pass this test case along
to the Open-MX people and see if they can point out what is wrong.

Thanks!

Josh.

Ok, you wrote it as two apps and it connects or does not connect? You may also want to set MX_VERBOSE=3 in your environment or something like that to see if Open-MX is complaining internally. They may also have a separate OMX_VERBOSE environment flag. Check their docs.

Scott

