Scott,
From the Open-MX developers, I got the following response:
"self" is only about communicating between one endpoint and itself
(within the same single process then). I don't think it applies to
your case. What you need is either shared communication (1) or a
local network communication (2).
(1) means that you would re-enable shared memory by setting
MX_DISABLE_SHMEM=0. But I don't think that can work because PVFS
doesn't check whether
MX_DISABLE_SHMEM is already defined before forcing it to 1. However,
setting OMX_DISABLE_SHARED to 0 may work because Open-MX uses its
own env variables before looking at the MX specific ones.
(2) means that you attach the "lo" interface to the Open-MX driver.
This is the usual way of testing Open-MX locally. But your log tells
me that you didn't do this (I see eth3 and 00:1b:21:4f:4d:5a).
This has left me a bit confused about how PVFS2 is supposed to work.
In the PVFS2 source, the bmi_mx code explicitly sets MX_DISABLE_SHMEM
to 1 in src/io/bmi/bmi_mx/mx.c:
/* disable shmem to allow clients and servers on the same machine */
setenv("MX_DISABLE_SHMEM", "1",1);
This seems to imply that "shared" mode has to be disabled in order to
enable communication between clients and servers on the same machine,
but the Open-MX developers seem to be saying exactly the opposite
(that shared mode must be enabled to allow two endpoints in different
processes to communicate). I am not familiar with how this works in MX
as I have only tried Open-MX -- is it possible that Open-MX and MX
handle self/shared communications differently?
As for the "lo" interface, I have tested my setup with the "lo"
interface attached. The "lo" interface works just fine from within
Open-MX (testing with omx_loopback_test), but PVFS2 does not seem to
recognize that the "lo" interface can be used as an alternative way to
connect to the local host. Perhaps I am missing some PVFS2
configuration option that would enable that?
The aliases section of my pvfs2-fs.conf is:
<Aliases>
Alias begbie mx://begbie:0:3
Alias renton mx://renton:0:3
Alias tommy mx://tommy:0:3
</Aliases>
Is there some way to specify that an alias can refer to two different
mx endpoints so that we can tell PVFS2 which MX board is the localhost
interface? (When I have both "eth3" and "lo" boards attached in
Open-MX, I wind up with e.g. begbie:0 for "eth3" and begbie:1 for "lo".)
I've tried changing the configuration line to
"Alias begbie mx://begbie:0:3 mx://begbie:1:3" but that gives a config
error. I had previously searched the documentation and list archive for
a way to do this, but found nothing much except a note that one could
not attach two MX boards on the same server (something I actually would
like to do since I have two 10GbE cards in each server).
Please let me know if I've missed something about how I can tell PVFS2
to use the localhost interface -- if it is documented somewhere, I've
managed to miss it despite lots of looking!
If the "lo" interface is not supported and shared/self communication
is what allows two endpoints on the same host to talk to each other,
why does PVFS2 disable the only mode (shared) that would actually
allow Open-MX to talk to endpoints on the same host but running under
different processes?
Josh.
On 23 May 2011, at 21:55, Atchley, Scott wrote:
On May 23, 2011, at 4:08 PM, Joshua Randall wrote:
Scott,
I compiled and ran your test program (with a few small errors
corrected):
I am surprised it was only a few small errors since it was written
in my email client... ;-)
Running it is successful:
jrandall@begbie:/tmp$ ./omxtestself
iconnect completed with status Success
However, I'm not really sure what this is testing. It seems to open
only one endpoint, look up that endpoint's address, and then try to
connect to it. I thought the failure I was observing occurred when the
client (on one endpoint) tried to connect to the server (on another
endpoint?) on the same host. Am I mistaken about that?
Yes, and they should be in different processes too. Good catch.
In any case, I've tried to extend your test to cover that
scenario:
<snip>
This results in:
jrandall@begbie:/tmp$ OMX_CONNECT_POLLALL=1 ./omxtestself2
OMX: Forcing connect polling all endpoints to enabled
iconnect completed with status Success
iconnect to nic_id from hostname completed with status Success
(it does not exit, but seems to loop forever in the final mx_test
loop)
Hmmm. Not sure what to make of it not returning. Can you print out
the value of result?
I'm not sure that I'm doing this right. I have also implemented it as
two separate test programs, one for each endpoint, and that has exactly
the same issue. I'll pass this test case along to the Open-MX people
and see if they can point out what is wrong.
Thanks!
Josh.
OK, so you wrote it as two apps. Does it connect or not?
You may also want to set MX_VERBOSE=3 in your environment or
something like that to see if Open-MX is complaining internally.
They may also have a separate OMX_VERBOSE environment flag. Check
their docs.
Scott