I now seem to have solved the problem by recompiling PVFS2 with these
lines in src/io/bmi/bmi_mx/mx.c:
/* disable shmem to allow clients and servers on the same machine */
setenv("MX_DISABLE_SHMEM", "1",1);
changed to:
/* disable shmem to allow clients and servers on the same machine */
setenv("MX_DISABLE_SHMEM", "0",1);
With shared communication now enabled in Open-MX, everything appears
to be working and I can run pvfs2-server on all three machines and use
the filesystem from a client on any of the machines!
Can you solve the mystery of why shared memory needed to be disabled
in MX, but seems to be required in Open-MX?
Josh.
On 25 May 2011, at 00:55, Joshua Randall wrote:
Scott,
From the Open-MX developers, I got the following response:
"self" is only about communicating between one endpoint and itself
(within the same single process then). I don't think it applies to
your case. What you need is either shared communication (1) or a
local network communication (2).
(1) means that you would re-enable it with MX_DISABLE_SHMEM=0. But I don't
think it can work because PVFS doesn't check whether
MX_DISABLE_SHMEM is already defined before forcing it to 1.
However, setting OMX_DISABLE_SHARED to 0 may work because Open-MX
uses its own env variables before looking at the MX specific ones.
(2) means that you attach the "lo" interface to the Open-MX driver.
This is the usual way of testing Open-MX locally. But your log
tells me that you didn't do this (I see eth3 and 00:1b:21:4f:4d:5a).
This has left me a bit confused about how PVFS2 is supposed to work.
In the PVFS2 source, the bmi_mx code explicitly sets
MX_DISABLE_SHMEM to 1 in src/io/bmi/bmi_mx/mx.c:
/* disable shmem to allow clients and servers on the same machine */
setenv("MX_DISABLE_SHMEM", "1",1);
This seems to imply that "shared" mode has to be disabled in order
to enable communication between clients and servers on the same
machine, but the Open-MX developers seem to be saying exactly the
opposite (that shared mode must be enabled to allow two endpoints in
different processes to communicate). I am not familiar with how this
works in MX as I have only tried Open-MX -- is it possible that
Open-MX and MX handle self/shared communications differently?
As for the "lo" interface, I have tested my setup with the "lo"
interface attached. The "lo" interface works just fine from within
Open-MX (testing with omx_loopback_test), but PVFS2 does not seem to
recognize that the "lo" interface can be used as an alternative to
connect to the local host, although perhaps I am missing some PVFS2
configuration option that would enable that???
The aliases section of my pvfs2-fs.conf is:
<Aliases>
Alias begbie mx://begbie:0:3
Alias renton mx://renton:0:3
Alias tommy mx://tommy:0:3
</Aliases>
Is there some way to specify that an alias can refer to two
different mx endpoints so that we can tell PVFS2 which MX board is
the localhost interface? (when I have both "eth3" and "lo" boards
attached in Open-MX, I wind up with e.g. begbie:0 for "eth3" and
begbie:1 for "lo"). I've tried changing the configuration line to
"Alias begbie mx://begbie:0:3 mx://begbie:1:3" but that gives a
config error. I had previously searched the documentation and list
archive for a way to do this, but found nothing much except a note
that one could not attach two MX boards on the same server
(something I actually would like to do since I have two 10GbE cards
in each server).
Please let me know if I've missed something about how I can tell
PVFS2 to use the localhost interface -- if it is documented
somewhere, I've managed to miss it despite lots of looking!
If the "lo" interface is not supported and shared/self communication
is what allows two endpoints on the same host to talk to each other,
why does PVFS2 disable the only mode (shared) that would actually
allow Open-MX to talk to endpoints on the same host but running
under different processes?
Josh.
On 23 May 2011, at 21:55, Atchley, Scott wrote:
On May 23, 2011, at 4:08 PM, Joshua Randall wrote:
Scott,
I compiled and ran your test program (with a few small errors
corrected):
I am surprised it was only a few small errors since it was written
in my email client... ;-)
Running it is successful:
jrandall@begbie:/tmp$ ./omxtestself
iconnect completed with status Success
However, I'm not really sure what this is testing -- it seems like it
is only opening one endpoint, looking up that endpoint's address, and
then trying to connect to it. I thought the failure I was observing
occurred when the client (on one endpoint) tried to connect to the
server (on another endpoint?) on the same host. Am I mistaken about
that?
Yes, and they should be in different processes too. Good catch.
In any case, I've tried to extend your test to cover that
scenario:
<snip>
This results in:
jrandall@begbie:/tmp$ OMX_CONNECT_POLLALL=1 ./omxtestself2
OMX: Forcing connect polling all endpoints to enabled
iconnect completed with status Success
iconnect to nic_id from hostname completed with status Success
(it does not exit, but seems to loop forever in the final mx_test
loop)
Hmmm. Not sure what to make of it not returning. Can you print out
the value of result?
I'm not sure that I'm doing this right, and have actually also
implemented it as two separate test programs, one for each endpoint,
and that has exactly the same issue. I'll pass this test case along
to the Open-MX people and see if they can point out what is wrong.
Thanks!
Josh.
Ok, you wrote it as two apps and it connects or does not connect?
You may also want to set MX_VERBOSE=3 in your environment or
something like that to see if Open-MX is complaining internally.
They may also have a separate OMX_VERBOSE environment flag. Check
their docs.
Scott