[OMPI devel] For Open MPI + BPROC users

2006-11-29 Thread Galen Shipman


We have found a potential issue with BPROC that may affect Open MPI.
Open MPI uses PTYs for I/O forwarding by default; if PTYs aren't set
up on the compute nodes, Open MPI reverts to using pipes. Today we
found a potential issue with PTYs and BPROC: a simple reader/writer
using PTYs causes the writer to hang in uninterruptible sleep. The
consistency of the process table between the head node and the back-end
nodes is also affected; that is, "bps" shows no writer process, while
"bpsh NODE ps aux" shows the writer process in uninterruptible sleep.


Since Open MPI uses PTYs by default on BPROC, this results in ORTED or
MPI processes being orphaned on compute nodes. The workaround for this
issue is to configure Open MPI with --disable-pty-support and rebuild.



- Open MPI Team



Re: [OMPI devel] Create success (r1.3a1r13481)

2007-02-02 Thread Galen Shipman
Expect this rev to die all over the place: I had a bug in my r13481
checkin that prevented OB1 from being selected. I corrected this in
r13482.


Sorry about that.


- Galen

On Feb 2, 2007, at 7:44 PM, MPI Team wrote:


Creating nightly snapshot SVN tarball was a success.

Snapshot:   1.3a1r13481
Start time: Fri Feb  2 21:21:47 EST 2007
End time:   Fri Feb  2 21:44:05 EST 2007

Your friendly daemon,
Cyrador
___
testing mailing list
test...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/testing




Re: [OMPI devel] OMPI over ofed udapl over iwarp

2007-05-11 Thread Galen Shipman


More like trying to work around the race condition that exists: the
server side sends an RDMA message first, thus violating the iWARP
protocol. For those who want the gory details: when the server sends
first -and- that RDMA message arrives at the client _before_ the client
transitions into RDMA mode, then that RDMA message gets passed up to the
host driver as streaming-mode data. So when I originally ran OMPI/udapl
on Chelsio's RNIC, the client actually received the MPA start response
-and- the first FPDU from the server while still in streaming mode.
This results in a connection abort because it violates the MPA startup
protocol...



I would recommend making a state transition diagram of the lazy
connection establishment. I did this during implementation of the
Open IB BTL. Of course, I threw it out as soon as the code was
committed :-).
This shouldn't take very long, and then you can simply modify the
state diagram to prevent these race conditions, perhaps identifying
an existing state where you can post any buffers that you need. Don't
forget to throw away the state diagram as soon as the code is
committed ;-).


- Galen




By causing the server to sleep just after accepting the connection, I
give the client time to transition into rdma mode...

It was just a hack to continue testing OMPI/udapl/chelsio. And it
revealed the problem being discussed in this thread: the OMPI udapl
BTL doesn't pre-post the recvs for the sendrecv exchange.



Neither the client nor the server side of the udapl BTL connection
setup pre-posts RECV buffers before connecting. This can allow a
SEND to arrive before a RECV buffer is available. I _think_ IB will
handle this by retransmitting the SEND. Chelsio's iWARP device,
however, TERMINATEs the connection. My sleep() makes this condition
happen every time.





A compliant DAPL program also ensures that there are adequate
receive buffers in place before the remote peer sends. It is
explicitly noted that failure to follow this rule will invoke
a transport/device-dependent penalty: it may be that the send queue
will be fenced, or it may be that the connection will be terminated.

So any RDMA BTL should pre-post recv buffers before initiating or
accepting a connection.



I know of no uDAPL restriction saying a recv must be posted before
a send.

And yes, we do pre-post recv buffers, but since the BTL creates two
connections per peer, one for eager-size messages and one for max-size
messages, the BTL needs to know which connection the current endpoint
is to service so that it can post the proper-size recv buffer.

Also, I agree that in theory the BTL could post the recv which
currently occurs in mca_btl_udapl_sendrecv before the connect or
accept, but I think in practice we had issues doing this and had to
wait until a DAT_CONNECTION_EVENT_ESTABLISHED was received.



Well, I'm trying it now. It should work. If it doesn't, then DAPL or
the underlying providers are broken.





From what I can tell, the udapl BTL exchanges memory info as a first
order of business after connection establishment
(mca_btl_udapl_sendrecv()). The RECV buffer post for this
exchange, however, should really be done _before_ the
dat_ep_connect() on the active side, and _before_ the
dat_cr_accept() on the server side.
Currently it's done after the ESTABLISHED event is dequeued,
thus allowing the race condition.

I believe the rules are that the ULP must ensure that a RECV is
posted before the client can post a SEND for that buffer.
And further, the ULP must enforce flow control somehow so
that a SEND never arrives without a RECV buffer being available.



Maybe this is a rule iWARP imposes on its ULPs, but not uDAPL.


Perhaps this is just a bug and I exposed it with my sleep().

Or is the uDAPL BTL assuming the transport will deal with the
lack of a RECV buffer at the time a SEND arrives?



There may be a race condition here but you really have to try hard to
see it.



I agree it's a small race condition that I made very large by my
sleep! :-) But I can kill two birds with one stone here. I'm testing
now a change to the sendrecv exchange to:

1) pre-post the recvs just after dat_ep_create

2) force the side that issues the dat_ep_connect to send its addr-data
first

3) force the side that issues the dat_cr_accept to wait for the send
from the peer, then post its addr-data send

This will plug both race conditions. I'll post the patch once I debug
it, and we can discuss whether you think it's a good solution and/or
whether it works for Solaris uDAPL.


From Steve previously:
"Also: Given there is a message exchange _always_ after connection
setup, then we can change that exchange to support the
client-must-send-first issue..."

I agree we can do something, but if it includes an additional
message we should consider an MCA parameter to govern this, because
the connection wireup is already costly enough.



It won't add an additional message.  Stay tuned for a patch!

Steve

Re: [OMPI devel] OMPI over ofed udapl over iwarp

2007-05-11 Thread Galen Shipman


As an aside, my personal feeling is that even when running over IB,
the pre-posting of recvs is worth the small overhead of piggybacking
a credit system on the messages that already cross the wire. If
nothing else, this avoids the added congestion of RNR-NAKs and the
resends they trigger.

Put another way, I favor programming for IB as if it lacked the
link-level flow control that the current BTL apparently assumes.



We avoid the RNR-NAKs in the Open IB BTL via a credit system.
I would have to review the udapl BTL, but I believe it does something
similar.
I believe the problem only exists during lazy connection
establishment, when credits are probably initialized to the defaults
on both ends. We should really just set the credits as part of the
handshake (after the receiver has posted the receive buffers).




- Galen




-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-05-25 Thread Galen Shipman


On May 24, 2007, at 2:48 PM, George Bosilca wrote:

I see the problem this patch tries to solve, but I fail to correctly
understand the implementation. The patch affects every PML and BTL in
the code base by adding one more argument to some of the most often
called functions. And there is only one BTL (openib) that seems to
use it, while all others completely ignore it. Moreover, there seems
to be already a very similar mechanism based on the
MCA_BTL_DES_FLAGS_PRIORITY flag, which can be set by the PML level
into the btl_descriptor.


So what's the difference between the additional argument and a
correct usage of the MCA_BTL_DES_FLAGS_PRIORITY flag?



The problem is that MCA_BTL_DES_FLAGS_PRIORITY was meant to indicate
that the fragment was higher priority, but the fragment isn't higher
priority. It simply needs to be ordered w.r.t. a previous fragment,
an RDMA in this case.
That being said, we could have just added an RDMA FIN flag, but this
would mix protocol a bit too much between the BTL and the PML in my
opinion.
What we have with this fix is that the BTL can assign an order tag to
any descriptor if it wishes; this order tag is only valid after a
call to btl_send or btl_put/get. The order tag can then be used to
request another descriptor later that will enforce ordering. The
semantics here are clear, and the BTL doesn't have to do anything if
it doesn't wish to (w.r.t. assigning a valid order tag). So this was
the clearest semantics I could come up with that allowed for numerous
implementations at the BTL level. For example, even specifying an
RDMA FIN flag directly to the BTL would restrict the BTL further than
these semantics, because then all RDMAs would have to be sent on the
same endpoint/QP: all the PML would be able to indicate is that a FIN
is being sent, and the BTL wouldn't have the context to know which
RDMA the FIN belonged to, and hence couldn't enforce ordering easily.


The only reason openib is the only one to use this new functionality
is that I haven't had a chance to fix up udapl, which I plan to do
next week.
Note that GM semantics expose a similar problem (ordering is only
guaranteed for messages of the same priority), but Myrinet doesn't
buffer like some of the IB/iWARP stuff can, so we won't see it there.


There are also a number of optimizations that these semantics allow.
For example, the BTL no longer has to deliver a local completion
callback on an RDMA, as the FIN message can be used for local
completion of both.


I am also looking at adding a BTL_PUT_IMMEDIATE which provides remote
completion via an active-message tag callback along with 64 bits of
data. This would allow us to bypass the FIN entirely if the network
supports it, which would be useful for MX, for example. OpenIB also
supports a similar mechanism, but there are problems that would need
to be addressed, as OpenIB only delivers 32 bits with the remote
completion.



- Galen










  george.

On May 24, 2007, at 3:51 PM, gship...@osl.iu.edu wrote:


Author: gshipman
Date: 2007-05-24 15:51:26 EDT (Thu, 24 May 2007)
New Revision: 14768
URL: https://svn.open-mpi.org/trac/ompi/changeset/14768

Log:
Add optional ordering to the BTL interface.
This is required to tighten up the BTL semantics. Ordering is not
guaranteed, but if the BTL returns an order tag in a descriptor
(other than MCA_BTL_NO_ORDER) then we may request another descriptor
that will obey ordering w.r.t. the other descriptor.


This will allow sane behavior for RDMA networks, where local
completion of an RDMA operation on the active side does not imply
remote completion on the passive side. If we send a FIN message after
local completion and the FIN is not ordered w.r.t. the RDMA
operation, then badness may occur, as the passive side may try to
deregister the memory while the RDMA operation is still pending.

Note that this has no impact on networks that don't suffer from this
limitation, as the ORDER tag can simply always be specified as
MCA_BTL_NO_ORDER.





Text files modified:
   trunk/ompi/mca/bml/bml.h                         |    29 +++
   trunk/ompi/mca/btl/btl.h                         |    10 ++++
   trunk/ompi/mca/btl/gm/btl_gm.c                   |     8 ++++++
   trunk/ompi/mca/btl/gm/btl_gm.h                   |     3 ++
   trunk/ompi/mca/btl/mx/btl_mx.c                   |     8 ++++++
   trunk/ompi/mca/btl/mx/btl_mx.h                   |     3 ++
   trunk/ompi/mca/btl/openib/btl_openib.c           |    49 ++++++-
   trunk/ompi/mca/btl/openib/btl_openib.h           |     3 ++
   trunk/ompi/mca/btl/openib/btl_openib_endpoint.c  |     7 +++--
   trunk/ompi/mca/btl/openib/btl_openib_frag.c      |     7 +
   trunk/ompi/mca/btl/portals/btl_portals.c         |     8 -
   trunk/ompi/mca/btl/portals/btl_portals.h

Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-05-27 Thread Galen Shipman


With the current code this is not the case. The order tag is set
during fragment allocation. That seems wrong according to your
description. The attached patch fixes this: if no specific ordering
tag is provided to the allocation function, the order of the fragment
is set to MCA_BTL_NO_ORDER. After a call to send/put/get, the order
is set to whatever QP was used for communication. If the order is set
before the send call, it is used to choose the QP.



I do set the order tag during allocation/prepare, but the defined
semantics are that the tag is only valid after send/put/get. We can
set it up anywhere we wish in the BTL; the PML, however, cannot rely
on anything until after the send/put/get call. So really this is an
issue of semantics versus implementation. The implementation, I
believe, does conform to the semantics, as the upper layer (PML)
doesn't use the tag value until after a call to send/put/get.


I will look over the patch; however, it might make more sense to
delay setting the value until the actual send/put/get call.


Thanks,

Galen







--
Gleb.





Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-05-27 Thread Galen Shipman



The problem is that MCA_BTL_DES_FLAGS_PRIORITY was meant to indicate
that the fragment was higher priority, but the fragment isn't higher
priority. It simply needs to be ordered w.r.t. a previous fragment,
an RDMA in this case.

But after the change the priority flag is totally ignored.




So the priority flag was broken, I think. In openib we used the
priority flag to choose a QP on which to send the fragment, but there
was no check of whether the fragment could actually be sent on the
specified QP. So a max-send-size fragment could be set as priority,
the BTL would use the high-priority QP, and we would go bang. This is
how I read the code; I might have missed something.


If the priority flag is simply a "hint" and has hard requirements,
then we can still use it in the openib BTL, but it won't have any
effect, as only eager-size fragments can be marked high priority and
we already send these over the high-priority QP.


- Galen






--
Gleb.





Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-05-27 Thread Galen Shipman

doh, correction..

On May 27, 2007, at 10:23 AM, Galen Shipman wrote:




The problem is that MCA_BTL_DES_FLAGS_PRIORITY was meant to indicate
that the fragment was higher priority, but the fragment isn't higher
priority. It simply needs to be ordered w.r.t. a previous fragment,
an RDMA in this case.

But after the change the priority flag is totally ignored.




So the priority flag was broken, I think. In openib we used the
priority flag to choose a QP on which to send the fragment, but there
was no check of whether the fragment could actually be sent on the
specified QP. So a max-send-size fragment could be set as priority,
the BTL would use the high-priority QP, and we would go bang. This is
how I read the code; I might have missed something.

If the priority flag is simply a "hint" and has *NO* hard
requirements, then we can still use it in the openib BTL, but it
won't have any effect, as only eager-size fragments can be marked
high priority and we already send these over the high-priority QP.

- Galen






--
Gleb.





Re: [OMPI devel] [OMPI svn] svn:open-mpi r14780

2007-05-27 Thread Galen Shipman
Can we get rid of mca_pml_ob1_send_fin_btl and just have
mca_pml_ob1_send_fin? It seems we should just always send the FIN
over the same BTL, and this would clean up the code a bit.


Thanks,

Galen




On May 27, 2007, at 2:29 AM, g...@osl.iu.edu wrote:


Author: gleb
Date: 2007-05-27 04:29:38 EDT (Sun, 27 May 2007)
New Revision: 14780
URL: https://svn.open-mpi.org/trac/ompi/changeset/14780

Log:
Fix out of resource handling for FIN packets broken by r14768.

Text files modified:
   trunk/ompi/mca/pml/ob1/pml_ob1.c | 7 +++
   trunk/ompi/mca/pml/ob1/pml_ob1.h |14 --
   2 files changed, 15 insertions(+), 6 deletions(-)

Modified: trunk/ompi/mca/pml/ob1/pml_ob1.c
==============================================================================
--- trunk/ompi/mca/pml/ob1/pml_ob1.c	(original)
+++ trunk/ompi/mca/pml/ob1/pml_ob1.c	2007-05-27 04:29:38 EDT (Sun, 27 May 2007)
@@ -249,7 +249,7 @@
 MCA_PML_OB1_PROGRESS_PENDING(bml_btl);
 }

-int mca_pml_ob1_send_fin(
+int mca_pml_ob1_send_fin_btl(
 ompi_proc_t* proc,
 mca_bml_base_btl_t* bml_btl,
 void *hdr_des,
@@ -260,9 +260,8 @@
 mca_pml_ob1_fin_hdr_t* hdr;
 int rc;

-MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,  sizeof(mca_pml_ob1_fin_hdr_t));
+MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order, sizeof(mca_pml_ob1_fin_hdr_t));

 if(NULL == fin) {
-MCA_PML_OB1_ADD_FIN_TO_PENDING(proc, hdr_des, bml_btl, order);
 return OMPI_ERR_OUT_OF_RESOURCE;
 }
 fin->des_flags |= MCA_BTL_DES_FLAGS_PRIORITY;
@@ -349,7 +348,7 @@
 }
 break;
 case MCA_PML_OB1_HDR_TYPE_FIN:
-rc = mca_pml_ob1_send_fin(pckt->proc, send_dst,
+rc = mca_pml_ob1_send_fin_btl(pckt->proc, send_dst,
   pckt->hdr.hdr_fin.hdr_des.pval,
   pckt->order);
 MCA_PML_OB1_PCKT_PENDING_RETURN(pckt);

Modified: trunk/ompi/mca/pml/ob1/pml_ob1.h
==============================================================================
--- trunk/ompi/mca/pml/ob1/pml_ob1.h	(original)
+++ trunk/ompi/mca/pml/ob1/pml_ob1.h	2007-05-27 04:29:38 EDT (Sun, 27 May 2007)

@@ -283,9 +283,19 @@
 } while(0)


-int mca_pml_ob1_send_fin(ompi_proc_t* proc, mca_bml_base_btl_t* bml_btl,
- void *hdr_des, uint8_t order);
+int mca_pml_ob1_send_fin_btl(ompi_proc_t* proc, mca_bml_base_btl_t* bml_btl,
+void *hdr_des, uint8_t order);

+static inline int mca_pml_ob1_send_fin(ompi_proc_t* proc, void *hdr_des,
+mca_bml_base_btl_t* bml_btl, uint8_t order)
+{
+ if(mca_pml_ob1_send_fin_btl(proc, bml_btl, hdr_des, order) == OMPI_SUCCESS)
+ return OMPI_SUCCESS;
+
+MCA_PML_OB1_ADD_FIN_TO_PENDING(proc, hdr_des, bml_btl, order);
+
+return OMPI_ERR_OUT_OF_RESOURCE;
+}

 /* This function tries to resend FIN/ACK packets from the pckt_pending
  * queue. Packets are added to the queue when sending of a FIN or ACK failed due to

___
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn




Re: [OMPI devel] [OMPI svn] svn:open-mpi r14782

2007-05-27 Thread Galen Shipman
Actually, we still need MCA_BTL_FLAGS_FAKE_RDMA; it can be used as
a hint for components such as one-sided.



Galen

On May 27, 2007, at 5:25 AM, g...@osl.iu.edu wrote:


Author: gleb
Date: 2007-05-27 07:25:39 EDT (Sun, 27 May 2007)
New Revision: 14782
URL: https://svn.open-mpi.org/trac/ompi/changeset/14782

Log:
No need for MCA_BTL_FLAGS_NEED_ACK any more. As of commit r14768  
this is the

default behaviour.

Text files modified:
   trunk/ompi/mca/btl/btl.h   | 3 ---
   trunk/ompi/mca/btl/tcp/btl_tcp_component.c | 3 +--
   2 files changed, 1 insertions(+), 5 deletions(-)

Modified: trunk/ompi/mca/btl/btl.h
==============================================================================
--- trunk/ompi/mca/btl/btl.h	(original)
+++ trunk/ompi/mca/btl/btl.h	2007-05-27 07:25:39 EDT (Sun, 27 May 2007)

@@ -157,9 +157,6 @@
 #define MCA_BTL_FLAGS_NEED_ACK 0x10
 #define MCA_BTL_FLAGS_NEED_CSUM 0x20

-/* btl can report put/get completion before data hits the other side */
-#define MCA_BTL_FLAGS_FAKE_RDMA 0x40
-
 /* btl needs local rdma completion */
 #define MCA_BTL_FLAGS_RDMA_COMPLETION 0x80


Modified: trunk/ompi/mca/btl/tcp/btl_tcp_component.c
==============================================================================
--- trunk/ompi/mca/btl/tcp/btl_tcp_component.c  (original)
+++ trunk/ompi/mca/btl/tcp/btl_tcp_component.c	2007-05-27 07:25:39 EDT (Sun, 27 May 2007)
@@ -224,8 +224,7 @@
 mca_btl_tcp_module.super.btl_flags = MCA_BTL_FLAGS_PUT |
MCA_BTL_FLAGS_SEND_INPLACE |
MCA_BTL_FLAGS_NEED_CSUM |
-   MCA_BTL_FLAGS_NEED_ACK |
-   MCA_BTL_FLAGS_FAKE_RDMA;
+   MCA_BTL_FLAGS_NEED_ACK;
 mca_btl_tcp_module.super.btl_bandwidth = 100;
 mca_btl_tcp_module.super.btl_latency = 0;
 mca_btl_base_param_register(&mca_btl_tcp_component.super.btl_version,




Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-06-07 Thread Galen Shipman


On Jun 7, 2007, at 9:11 AM, George Bosilca wrote:

There is something weird with this change, and the patch reflects
it. The new argument "order" comes from the PML level and might be
MCA_BTL_NO_ORDER (which is kind of global) or BTL_OPENIB_LP_QP or
BTL_OPENIB_HP_QP (which are definitively Open IB related). Do you
really intend to let the PML know about Open IB internal constants?


No, the PML knows only one thing about the order tag: it is either
MCA_BTL_NO_ORDER or it is something that the BTL assigns.
The PML has no idea about BTL_OPENIB_LP_QP or BTL_OPENIB_HP_QP; to
the PML it is just an order tag assigned to a fragment by the BTL.


So the semantics are that after a btl_send/put/get, an order tag may
be assigned by the BTL to the descriptor. This order tag can then be
specified to subsequent calls to btl_alloc or btl_prepare. The PML
has no idea what the value means, other than that it is requesting a
descriptor that will be ordered w.r.t. a previously transmitted
descriptor.




If it's the case (which seems to be true from the following snippet

if(MCA_BTL_NO_ORDER == order) {
    frag->base.order = BTL_OPENIB_LP_QP;
} else {
    frag->base.order = order;
}

So I am choosing some ordering to use here because the PML told me it
doesn't care; what is wrong with this?




) I expect you to revise the patch in order to propose a generic
solution, or I'll trigger a vote against the patch.

This exports no knowledge of the Open IB BTL to the PML layer; the
PML doesn't know that this is a QP index, and it doesn't care! The
PML simply uses this value (if it wants to) to request ordering of
subsequent fragments. We use the QP index only as a BTL optimization;
it could have been anything. So the only new knowledge that the PML
has is how to request that ordering of fragments be enforced, and the
BTL doesn't even have to provide this if it doesn't want to; that is
the reason for MCA_BTL_NO_ORDER.



Please describe a use case where this is not a generic solution. Keep
in mind that MX, TCP, and GM all can provide ordering guarantees if
they wish; in fact, for MX you can simply always assign an order tag,
say the value 1. MX can then guarantee ordering of all fragments sent
over the same BTL.



I vote for it to be backed out of the trunk, as it exports way too
much knowledge from the Open IB BTL into the PML layer.


The only other option that I have identified that doesn't push
PML-level protocol into the BTL is to require that BTLs always
guarantee ordering of fragments sent/put/got over the same BTL.





  george.

PS: With Gleb's changes the problem is the same. The following
snippet reflects exactly the same behavior as the original patch.


Gleb's changes don't change the semantic guarantees that I have
described above.






frag->base.order = order;
assert(frag->base.order != BTL_OPENIB_HP_QP);

On Jun 7, 2007, at 9:49 AM, Gleb Natapov wrote:


Hi Galen,

On Sun, May 27, 2007 at 10:19:09AM -0600, Galen Shipman wrote:



With the current code this is not the case. The order tag is set
during fragment allocation. That seems wrong according to your
description. The attached patch fixes this: if no specific ordering
tag is provided to the allocation function, the order of the fragment
is set to MCA_BTL_NO_ORDER. After a call to send/put/get, the order
is set to whatever QP was used for communication. If the order is set
before the send call, it is used to choose the QP.



I do set the order tag during allocation/prepare, but the defined
semantics are that the tag is only valid after send/put/get. We can
set it up anywhere we wish in the BTL; the PML, however, cannot rely
on anything until after the send/put/get call. So really this is an
issue of semantics versus implementation. The implementation, I
believe, does conform to the semantics, as the upper layer (PML)
doesn't use the tag value until after a call to send/put/get.

I will look over the patch; however, it might make more sense to
delay setting the value until the actual send/put/get call.


Have you had a chance to look over the patch?

--
Gleb.




[OMPI devel] BTL Semantics Teleconference: was : Re: [OMPI svn] svn:open-mpi r14768

2007-06-07 Thread Galen Shipman

I just had a discussion with Rich regarding the BTL semantics.
I think what might be helpful here is for us to have a telecon to
discuss this further.


I only have one goal out of this, and that is to firmly define the
ordering semantics of the BTL, or alternatively the local/remote
completion semantics of the BTL, whatever they may be.


I have created a wiki page to help describe the issue as I currently
see it; please feel free to add to it with suggestions, etc.:


https://svn.open-mpi.org/trac/ompi/wiki/BTLSemantics


- Galen







Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-06-07 Thread Galen Shipman


Are people available today to discuss this over the phone?

- Galen



On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote:


On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote:

) I expect you to revise the patch in order to propose a generic
solution or I'll trigger a vote against the patch. I vote for it to
be backed out of the trunk, as it exports way too much knowledge from
the Open IB BTL into the PML layer.
The patch solves a real problem. If we want to back it out we need to
find another solution. I also didn't like this change too much, but I
thought about other solutions and haven't found something better than
what Galen did. If you have something in mind, let's discuss it.

As a general comment, this kind of discussion is why I prefer to send
significant changes as a patch to the list for discussion before
committing.



  george.

PS: With Gleb's changes the problem is the same. The following
snippet reflects exactly the same behavior as the original patch.

I didn't try to change the semantics, just make the code match the
semantics that Galen described.

--
Gleb.




Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-06-07 Thread Galen Shipman

Okay, how is 2:30 Mountain Time for everyone?

I will set up a call-in if this works.

Thanks,

Galen


On Jun 7, 2007, at 12:39 PM, George Bosilca wrote:


I'm available this afternoon.

  george.

On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote:



Are people available today to discuss this over the phone?

- Galen



On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote:


On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote:

) I expect you to revise the patch in order to propose a generic
solution or I'll trigger a vote against the patch. I vote to be
backed out of the trunk as it export way to much knowledge from the
Open IB BTL into the PML layer.

The patch solves real problem. If we want to back it out we need to
find
another solution. I also didn't like this change too much, but I
thought
about other solutions and haven't found something better that what
Galen did. If you have something in mind lets discuss it.

As a general comment this kind of discussion is why I prefer to send
significant changes as a patch to the list for discussion before
committing.



  george.

PS: With Gleb changes the problem is the same. The following  
snippet

reflect exactly the same behavior as the original patch.

I didn't try to change the semantic. Just make the code to match the
semantic that Galen described.

--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-06-07 Thread Galen Shipman


On Jun 7, 2007, at 12:49 PM, Don Kerr wrote:

It would be difficult for me to attend this afternoon.  Tomorrow is  
much

better for me.



Brian and I are both out tomorrow. I think what we will do is have a  
call today, report back to the group and then if necessary have  
another call on Monday/Tuesday.


- Galen



-DON

George Bosilca wrote:


I'm available this afternoon.

   george.

On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote:



Are people available today to discuss this over the phone?

- Galen



On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote:


On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote:


) I expect you to revise the patch in order to propose a generic
solution or I'll trigger a vote against the patch. I vote to be
backed out of the trunk as it export way to much knowledge from  
the

Open IB BTL into the PML layer.


The patch solves real problem. If we want to back it out we need to
find
another solution. I also didn't like this change too much, but I
thought
about other solutions and haven't found something better that what
Galen did. If you have something in mind lets discuss it.

As a general comment this kind of discussion is why I prefer to  
send

significant changes as a patch to the list for discussion before
committing.



  george.

PS: With Gleb changes the problem is the same. The following  
snippet

reflect exactly the same behavior as the original patch.


I didn't try to change the semantic. Just make the code to match  
the

semantic that Galen described.

--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-06-07 Thread Galen Shipman

Call-in details:

I have scheduled your requested audio conference "Open MPI" for today
from 2:30pm to 3:30pm Mountain time with 7 ports.

Dial-in number: 5-4165 (local), 866-260-0475 (toll free)


- Galen


On Jun 7, 2007, at 1:47 PM, Galen Shipman wrote:



On Jun 7, 2007, at 12:49 PM, Don Kerr wrote:


It would be difficult for me to attend this afternoon.  Tomorrow is
much
better for me.



Brian and I are both out tomorrow. I think what we will do is have a
call today, report back to the group and then if necessary have
another call on Monday/Tuesday.

- Galen



-DON

George Bosilca wrote:


I'm available this afternoon.

   george.

On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote:



Are people available today to discuss this over the phone?

- Galen



On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote:


On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote:


) I expect you to revise the patch in order to propose a generic
solution or I'll trigger a vote against the patch. I vote to be
backed out of the trunk as it export way to much knowledge from
the
Open IB BTL into the PML layer.


The patch solves real problem. If we want to back it out we  
need to

find
another solution. I also didn't like this change too much, but I
thought
about other solutions and haven't found something better that what
Galen did. If you have something in mind lets discuss it.

As a general comment this kind of discussion is why I prefer to
send
significant changes as a patch to the list for discussion before
committing.



  george.

PS: With Gleb changes the problem is the same. The following
snippet
reflect exactly the same behavior as the original patch.


I didn't try to change the semantic. Just make the code to match
the
semantic that Galen described.

--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



 

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768

2007-06-07 Thread Galen Shipman

Everyone:

George thought this was okay after the discussion; I should have made
the wiki page prior to my commit, as the change did look very Open IB
specific.


please review:

https://svn.open-mpi.org/trac/ompi/wiki/BTLSemantics

Let me know if you want to discuss this further and we can set up a
call early next week.



Thanks,

Galen



On Jun 7, 2007, at 12:49 PM, Don Kerr wrote:

It would be difficult for me to attend this afternoon.  Tomorrow is  
much

better for me.

-DON

George Bosilca wrote:


I'm available this afternoon.

   george.

On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote:



Are people available today to discuss this over the phone?

- Galen



On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote:


On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote:


) I expect you to revise the patch in order to propose a generic
solution or I'll trigger a vote against the patch. I vote to be
backed out of the trunk as it export way to much knowledge from  
the

Open IB BTL into the PML layer.


The patch solves real problem. If we want to back it out we need to
find
another solution. I also didn't like this change too much, but I
thought
about other solutions and haven't found something better that what
Galen did. If you have something in mind lets discuss it.

As a general comment this kind of discussion is why I prefer to  
send

significant changes as a patch to the list for discussion before
committing.



  george.

PS: With Gleb changes the problem is the same. The following  
snippet

reflect exactly the same behavior as the original patch.


I didn't try to change the semantic. Just make the code to match  
the

semantic that Galen described.

--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] threaded builds

2007-06-11 Thread Galen Shipman


On Jun 11, 2007, at 8:25 AM, Jeff Squyres wrote:


I leave it to the thread subgroup to decide...  Should we discuss on
the call tomorrow?

I don't have a strong opinion; I was just testing both because it was
easy to do so.  If we want to concentrate on the trunk, I can adjust
my MTT setup.



I think trying to worry about 1.2 would just be a time sink. We know  
that there are architectural issues with threads in some parts of the  
code. I don't see us re-architecting 1.2 in this regard.

Seems we should only focus on the trunk.


- Galen




On Jun 11, 2007, at 10:17 AM, Brian Barrett wrote:


Yes, this is a known issue.  I don't know -- are we trying to make
threads work on the 1.2 branch, or just the trunk?  I had thought
just the trunk?

Brian


On Jun 11, 2007, at 8:13 AM, Tim Prins wrote:


I had similar problems on the trunk, which was fixed by Brian with
r14877.

Perhaps 1.2 needs something similar?

Tim

On Monday 11 June 2007 10:08:15 am Jeff Squyres wrote:

Per the teleconf last week, I have started to revamp the Cisco MTT
infrastructure to do simplistic thread testing.  Specifically, I'm
building the OMPI trunk and v1.2 branches with "--with-threads --
enable-mpi-threads".

I haven't switched this into my production MTT setup yet, but in  
the

first trial runs, I'm noticing a segv in the test/threads/
opal_condition program.

It seems that in the thr1 test on the v1.2 branch, when it calls
opal_progress() underneath the condition variable wait, at some point
current_base becomes NULL.  Hence, the following segv's because the
passed-in value of "base" is NULL (event.c):

int
opal_event_base_loop(struct event_base *base, int flags)
{
 const struct opal_eventop *evsel = base->evsel;
...

Here's the full call stack:

#0  0x002a955a020e in opal_event_base_loop (base=0x0, flags=5) at event.c:520
#1  0x002a955a01f9 in opal_event_loop (flags=5) at event.c:514
#2  0x002a95599111 in opal_progress () at runtime/opal_progress.c:259
#3  0x004012c8 in opal_condition_wait (c=0x5025a0, m=0x502600) at ../../opal/threads/condition.h:81
#4  0x00401146 in thr1_run (obj=0x503110) at opal_condition.c:46
#5  0x0036e290610a in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0036e1ec68c3 in clone () from /lib64/tls/libc.so.6
#7  0x in ?? ()
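For what it's worth, a defensive NULL check at the top of the loop entry point would turn this crash into a clean error return. This is a sketch only: it would not fix the underlying teardown race, the struct is treated as opaque, and OPAL_ERR_BAD_PARAM is a stand-in for whatever error code is appropriate (the real opal_event_base_loop differs).

```c
#include <stddef.h>

/* Sketch only: a NULL guard for the entry point that crashed.  The
 * struct is opaque here and OPAL_ERR_BAD_PARAM is a stand-in error
 * code; the real opal_event_base_loop and its return values differ. */
struct event_base;

#define OPAL_ERR_BAD_PARAM (-5)

static int opal_event_base_loop_guarded(struct event_base *base, int flags)
{
    (void)flags;
    if (NULL == base) {
        /* current_base was torn down (or never set up) under us */
        return OPAL_ERR_BAD_PARAM;
    }
    /* ... original event loop body would run here ... */
    return 0;
}
```

Returning an error at least lets the caller (opal_progress in this stack) bail out instead of dereferencing NULL.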

This test seems to work fine on the trunk (at least, it didn't segv
in my small number of trail runs).

Is this a known problem in the 1.2 branch?  Should I skip the  
thread

testing on the 1.2 branch and concentrate on the trunk?

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Problem with openib on demand connection bring up.

2007-06-13 Thread Galen Shipman

Hi Gleb,

As we have discussed before, I am working on adding support for
multiple QPs with either per-peer or shared resources. As a result I
am trying to clean up a lot of the OpenIB code; it has grown
organically over the years and needs some attention. Perhaps we can
coordinate on commits, or even work from the same temp branch, to do
an overall cleanup as well as address the issue you describe in this
email.

I bring this up because this commit will conflict quite a bit with
what I am working on. I can always merge it by hand, but it may make
sense for us to get this all done in one place and then bring it all
over?


Thanks,

Galen


On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:


Hello everyone,

  I encountered a problem with the openib on-demand connection code.
Basically it works only by pure luck if you have more than one
endpoint for the same proc, and sometimes breaks in mysterious ways.

The algorithm works like this: A wants to connect to B, so it creates
a QP and sends it to B. B receives the QP from A, looks for an
endpoint that is not yet associated with a remote endpoint, creates a
QP for it, and sends the info back. Now A receives the QP and goes
through the same logic as B, i.e. looks for an endpoint that is not
yet connected, BUT there is no guarantee that it will find the
endpoint that initiated the connection in the first place! If it finds
another one, it will create a QP for it and send it back to B, and so
on and so forth. In the end I sometimes get a peculiar mesh of
connections where no QP has a connection back to it from the peer
process.

To overcome this problem, B needs to send back some info that allows A
to determine the endpoint that initiated the connection request; the
lid:qp pair will allow for this. But even then the problem remains if
two procs initiate a connection at the same time. To deal with
simultaneous connections, an asymmetric protocol has to be used: one
peer becomes the master, the other the slave. The slave always
initiates the connection to the master. The master chooses a local
endpoint to satisfy the incoming request and sends the info back to
the slave. If the master wants to initiate a connection, it sends a
message to the slave and the slave initiates the connection back to
the master.

The included patch implements the algorithm described above and works
for all scenarios in which the current code fails to create a
connection.
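The master/slave choice described above needs a deterministic tie-break so that both peers independently agree on who is master when they connect simultaneously. A minimal sketch, assuming the lid:qp pair mentioned above is used for the ordering (the names and the exact comparison rule are illustrative assumptions, not the actual patch):

```c
#include <stdint.h>

/* Illustrative only: a deterministic master/slave tie-break keyed on
 * the lid:qp pair, so both peers independently agree on who is master
 * when they try to connect simultaneously.  Names and the comparison
 * rule are assumptions, not the actual patch. */
typedef struct {
    uint16_t lid;
    uint32_t qp_num;
} peer_id_t;

/* Returns 1 if "local" should act as master toward "remote".
 * For any two distinct ids, exactly one side gets 1. */
static int is_master(peer_id_t local, peer_id_t remote)
{
    if (local.lid != remote.lid) {
        return local.lid > remote.lid;
    }
    return local.qp_num > remote.qp_num;
}
```

With such an ordering, the slave side always opens the connection toward the master, and a master that wants to connect asks the slave to do so, which removes the ambiguity about which endpoint answered which request.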

--
Gleb.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Problem with openib on demand connection bring up.

2007-06-13 Thread Galen Shipman


On Jun 13, 2007, at 9:49 AM, Torsten Hoefler wrote:


Hi Galen,Gleb,
there is also something weird going on if I call the basic alltoall
during the module_init() of a collective module (I need to wire up my
own QPs in my coll component). It takes 7 seconds for 4 nodes and more
than 30 minutes for 120 nodes. It seems to be an OpenIB wireup issue
because if I start with -mca btl tcp,self this goes as fast as
expected (<2 seconds).

Will this issue be fixed with your patch?


No, this is a separate issue.

Try:
-mca mpi_preconnect_oob 1

then try:

-mca mpi_preconnect_all 1

and let us know what the times are.

thx,

galen




Thanks,
  Torsten

--
 bash$ :(){ :|:&};: - http://www.unixer.de/ -
Indiana University| http://www.indiana.edu
Open Systems Lab  | http://osl.iu.edu/
150 S. Woodlawn Ave.  | Bloomington, IN, 474045-7104 | USA
Lindley Hall Room 135 | +01 (812) 855-3608
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Problem with openib on demand connection bring up.

2007-06-13 Thread Galen Shipman


On Jun 13, 2007, at 10:48 AM, Jeff Squyres wrote:


I wonder if this is bringing up the point that there are several of
us working in the openib code base -- I wonder if it would be
worthwhile to have a [short] teleconference to discuss what we're all
doing in openib, where we're doing it (trunk, branch, whatever), when
we expect to have it done, what version we need it in, etc.  Just a
coordination kind of teleconference.  If people think this is a good
idea, I can set up the call.


sounds good to me.

- Galen



For example, don't forget that Nysal and I have the openib btl port-
selection stuff off in /tmp/jnysal-openib-wireup (the btl_openib_if_
[in|ex]clude MCA params).  Per my prior e-mail, if no one objects, I
will be bringing that stuff in to the trunk tomorrow evening (I'm
pretty sure it won't conflict with what Galen is doing; Galen and I
discussed on the phone this morning).




On Jun 13, 2007, at 11:38 AM, Galen Shipman wrote:


Hi Gleb,

As we have discussed before I am working on adding support for
multiple QPs with either per peer resources or shared resources.
As a result of this I am trying to clean up a lot of the OpenIB code.
It has grown up organically over the years and needs some attention.
Perhaps we can coordinate on commits or even work from the same temp
branch to do an overall cleanup as well as addressing the issue you
describe in this email.

I bring this up because this commit will conflict quite a bit with
what I am working on, I can always merge it by hand but it may make
sense for us to get this all done in one area and then bring it all
over?

Thanks,

Galen


On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:


Hello everyone,

  I encountered a problem with openib on depend connection code.
Basically
it works only by pure luck if you have more then one endpoint for
the same
proc and sometimes breaks in mysterious ways.

The algo works like this: A wants to connect to B so it creates QP
and sends it
to B. B receives the QP from A and looks for endpoint that is not
yet associated
with remote endpoint, creates QP for it and sends info back. Now A
receives
the QP and goes through the same logic as B i.e looks for endpoint
that is not
yet connected, BUT there is no guaranty that it will find the
endpoint that
initiated the connection in the first place! And if it finds
another one it will
create QP for it and will send it back to B and so on and so forth.
In the end
I sometimes receive a peculiar mesh of connection where no QP has a
connection
back to it from the peer process.

To overcome this problem B needs to send back some info that will
allow A to
determine the endpoint that initiated a connection request. The
lid:qp pair
will allow for this. But even then the problem will remain if two
procs initiate
connection at the same time. To dial with simultaneous connection
asymmetry
protocol have to be used one peer became master another slave.
Slave alway
initiate a connection to master. Master choose local endpoint to
satisfy
incoming request and sends info back to a slave. If master wants to
initiate a
connection it send message to a slave and slave initiate connection
back to
master.

Included patch implements an algorithm described above and work for
all
scenarios for which current code fails to create a connection.

--
Gleb.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Problem with openib on demand connection bring up.

2007-06-13 Thread Galen Shipman


On Jun 13, 2007, at 11:15 AM, Nysal Jan wrote:




I was just bitten yesterday by a problem that I've known about for a
while but had never gotten around to looking into (I could have sworn
that there was an open trac ticket on this, but I can't find one
anywhere).

I have 2 hosts: one with 3 active ports and one with 2 active ports.
If I run an MPI job between them, the openib BTL wireup goes badly and
it aborts.  So a heterogeneous number of ports is not currently
handled properly in the code.

I don't know if Gleb's patch addresses this situation or not; I'll
look at his patch this afternoon.



There is a ticket (closed) here: https://svn.open-mpi.org/trac/ompi/ticket/548
It was fixed by Galen for 1.2. There is also a FAQ entry about this:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup


I think Gleb's patch addresses a potential race condition when both  
sides attempt to connect at the same time.




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Problem with openib on demand connection bring up.

2007-06-13 Thread Galen Shipman


On Jun 13, 2007, at 11:33 AM, Jeff Squyres wrote:


On Jun 13, 2007, at 1:15 PM, Nysal Jan wrote:


There is a ticket (closed) here: https://svn.open-mpi.org/trac/ompi/ticket/548
It was fixed by Galen for 1.2.


Ah -- I forgot to look at closed tickets.  I think we broke it again;
it certainly fails on the trunk (perhaps related to what Gleb
found?).  I did not test 1.2.


There is a FAQ entry also about this: http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup


That's what it *should* be doing, but I wonder if that's what it
*actually* is doing.


It has been a while, but we tested this on our local cluster with
differing numbers of ports and it worked; I was only doing simple
ping-pongs, though. If both sides try to open a connection at the same
time, however, badness can occur, from my understanding of this.

- Galen





--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Problem with openib on demand connection bring up.

2007-06-13 Thread Galen Shipman


On Jun 13, 2007, at 12:07 PM, Gleb Natapov wrote:


On Wed, Jun 13, 2007 at 02:05:00PM -0400, Jeff Squyres wrote:

On Jun 13, 2007, at 1:54 PM, Jeff Squyres wrote:


With today's trunk, I still see the problem:


Same thing happens on v1.2 branch.  I'll re-open #548.


I am sure it was never tested with multiple subnets. I'll try to get
such setup.


I tested  this with multiple subnets but it was quite some time ago.
- Galen



--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] openib coord teleconf (was: Problem with openib on demand connection bring up)

2007-06-13 Thread Galen Shipman


On Jun 13, 2007, at 12:23 PM, Jeff Squyres wrote:


On Jun 13, 2007, at 1:40 PM, Gleb Natapov wrote:


[snip]
coordination kind of teleconference.  If people think this is a  
good

idea, I can setup the call.


sounds good to me.

Sounds good to me too. Pasha is also working on the async event
thread. This patch is not something I planned to work on; this problem
prevented me from testing my changes to OB1 and is serious enough to
be fixed in v1.2.


Pasha tells me that the best times for Ishai and him are:

- 2000-2030 Israel time
- 1300-1300 US Eastern
- 1100-1130 US Mountain
- 2230-2300 India (Bangalore)


These times work for me but not until next week.

- Galen



Although they could also do the preceding half hour as well.

Does this work for everyone?

--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] openib coord teleconf (was: Problem with openib on demand connection bring up)

2007-06-13 Thread Galen Shipman


On Jun 13, 2007, at 12:52 PM, Gleb Natapov wrote:


On Wed, Jun 13, 2007 at 02:48:02PM -0400, Jeff Squyres wrote:

On Jun 13, 2007, at 2:41 PM, Gleb Natapov wrote:


Pasha tells me that the best times for Ishai and him are:

- 2000-2030 Israel time
- 1300-1300 US Eastern
- 1100-1130 US Mountain
- 2230-2300 India (Bangalore)

Although they could also do the preceding half hour as well.


Depends on the date. The closest I can at 20:00 is June 19.


Oops!  I left out the date -- sorry.  I meant to say Monday, June
18th.  And I got the US eastern time wrong; that should have been
noon, not 1300.

20:00 Israel June 19th is right after the weekly OMPI teleconf; want
to do it then?


Yes.



On my calendar.

- Galen



--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Problem with openib on demand connection bring up.

2007-06-14 Thread Galen Shipman


The patch applies to ib_multifrag as-is without a conflict. But the
branch doesn't compile with or without the patch, so I was not able to
test it.

Do you have some uncommitted changes that may generate a conflict? Can
you commit them so they can be resolved? If there is no conflict
between your work and this patch, maybe it is a good idea to commit it
to your branch and the trunk for testing?



I have a whole pile of changes that need to be committed, and even
with these changes it still doesn't compile, as I am reworking names,
data structures, etc. I will commit what I have now and will work on
this a bit more over the weekend.

- Galen






Thanks,

Galen


On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:


Hello everyone,

  I encountered a problem with openib on depend connection code.
Basically
it works only by pure luck if you have more then one endpoint for
the same
proc and sometimes breaks in mysterious ways.

The algo works like this: A wants to connect to B so it creates QP
and sends it
to B. B receives the QP from A and looks for endpoint that is not
yet associated
with remote endpoint, creates QP for it and sends info back. Now A
receives
the QP and goes through the same logic as B i.e looks for endpoint
that is not
yet connected, BUT there is no guaranty that it will find the
endpoint that
initiated the connection in the first place! And if it finds
another one it will
create QP for it and will send it back to B and so on and so forth.
In the end
I sometimes receive a peculiar mesh of connection where no QP has a
connection
back to it from the peer process.

To overcome this problem B needs to send back some info that will
allow A to
determine the endpoint that initiated a connection request. The
lid:qp pair
will allow for this. But even then the problem will remain if two
procs initiate
connection at the same time. To dial with simultaneous connection
asymmetry
protocol have to be used one peer became master another slave.
Slave alway
initiate a connection to master. Master choose local endpoint to
satisfy
incoming request and sends info back to a slave. If master wants to
initiate a
connection it send message to a slave and slave initiate connection
back to
master.

Included patch implements an algorithm described above and work for
all
scenarios for which current code fails to create a connection.

--
Gleb.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] flags in openib btl

2007-06-16 Thread Galen Shipman

These two:


MCA_BTL_FLAGS_NEED_ACK MCA_BTL_FLAGS_NEED_CSUM


Are used by DR. They aren't used by OB1.

- Galen


On Jun 15, 2007, at 9:27 AM, Jeff Squyres wrote:


I notice that our help message for the btl_openib_flags MCA parameter
seems to be a bit out of date:

CHECK(reg_int("flags", "BTL flags, added together: SEND=1, PUT=2,
GET=4 "
   "(cannot be 0)",
   MCA_BTL_FLAGS_RDMA | MCA_BTL_FLAGS_NEED_ACK |
   MCA_BTL_FLAGS_NEED_CSUM, &ival, REGINT_GE_ZERO));
mca_btl_openib_module.super.btl_flags = (uint32_t) ival;

Specifically, we only list values of 1, 2, and 4.  But the default
value is 54.  So clearly, there's quite a few more flags that can be
set there.

What are they?
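Bitmask parameters like this decode by testing each bit. As a sketch: SEND=1, PUT=2 and GET=4 come from the help string above, while the higher bits present in the default of 54 belong to other MCA_BTL_FLAGS constants whose names would have to be read out of the BTL header, so they are left as plain numbers here.

```c
/* SEND=1, PUT=2 and GET=4 are taken from the help string; the higher
 * bits in the default value of 54 belong to other MCA_BTL_FLAGS
 * constants not listed there, so they appear as bare numbers below. */
#define FLAG_SEND 1
#define FLAG_PUT  2
#define FLAG_GET  4

static int has_flag(int flags, int bit)
{
    return (flags & bit) != 0;
}
```

Decoding 54 this way shows PUT and GET set, SEND clear, plus the 16 and 32 bits, which is why the "SEND=1, PUT=2, GET=4" list alone cannot explain the default.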

--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [devel-core] Collective Communications Optimization - MeetingScheduled in Albuquerque!

2007-07-10 Thread Galen Shipman
Hotels near the airport / university area; I pulled this list off of
this site: http://www.airnav.com/airport/KABQ



Hotel                                                          Miles  Price ($)
FAIRFIELD INN BY MARRIOTT ALBUQUERQUE UNIVERSITY AREA           4.8   79-80
COMFORT INN AIRPORT                                             1.3   52-101
COURTYARD BY MARRIOTT ALBUQUERQUE AIRPORT                       1.6   74-139
SLEEP INN AIRPORT                                               1.6   45-71
LA QUINTA INN ALBUQUERQUE AIRPORT                               1.4   69-91
AMERISUITES ALBUQUERQUE AIRPORT                                 1.5   59-109
RAMADA LTD ALBUQUERQUE AIRPORT                                  1.7   62-89
SUBURBAN EXTENDED STAY                                          4.6   51-52
HOWARD JOHNSON - ALBUQUERQUE (EAST)                             5.4   54-69
WYNDHAM ALBUQUERQUE AIRPORT                                     1.0   80-109
COMFORT INN EAST                                                6.3   55-56
PARK PLAZA HOTEL ALBUQUERQUE                                    4.7   60-61

Other hotels near Albuquerque International Sunport Airport

Hotel                                                          Miles  Price ($)
HILTON GARDEN INN ALBUQUERQUE AIRPORT                           1.2   85-180
BEST WESTERN INNSUITES (AIRPORT)                                1.2   49-71
HAMPTON INN ALBUQERQUE AP                                       1.4   71-105
ESA ALBUQUERQUE-AIRPORT                                         1.6   54-70
HAWTHORN INN & SUITES - ALBUQUERQUE (AIRPORT)                   1.8   69-70
QUALITY SUITES ALBUQUERQUE                                      1.8   58-65
VAGABOND EXECUTIVE INN - FORMERLY THE AIRPORT UNIVERSITY INN    1.9   49-56
COUNTRY INN & SUITES BY CARLSON - ALBUQUERQUE AIRPORT           1.9   79-109
HOMEWOOD STE ALBUQUERQUE ARPT


On Jul 10, 2007, at 2:49 PM, Gil Bloch wrote:

What time do we plan to start on Aug. 6? I am trying to figure out  
if I have to be there the day before.


Also, is there any specific hotel you would recommend?

Regards,
Gil Bloch


-Original Message-
From: devel-core-boun...@open-mpi.org [mailto:devel-core- 
boun...@open-mpi.org] On Behalf Of Galen Shipman

Sent: ב 09 יולי 2007 15:44
To: Open MPI Developers
Subject: Re: [devel-core] Collective Communications Optimization -  
MeetingScheduled in Albuquerque!



All,

I have confirmed the meeting to be held at the HPC facility at UNM on
Aug 6,7,8.

Here is a link to the HPC center:

http://www.hpc.unm.edu/

Here is the visitor information link:

http://www.hpc.unm.edu/info/visitor-information


I hope everyone who expressed interest is able to attend!

Thanks,

Galen







On Jun 29, 2007, at 6:23 PM, Galen Shipman wrote:



So we are looking at a change of venue for this meeting.
Santa Fe turned out to be a bit too costly in terms of hotel rooms
for some participants.
I am looking into getting the HPC conference room in Albuquerque.
This is a convenient location for most and the hotels are cheaper.
I am firming up the details with the new HPC director at UNM, the
dates will remain August 6,7,8.

Thanks,

Galen


On Jun 6, 2007, at 2:43 PM, Galen Shipman wrote:


Updated Attendees as of  June 6th

(5 tentative, 12 confirmed):

Cisco
Jeff (tentative)

IU
Tim
Andrew (tentative)
Josh (tentative)
Torsten

LANL
Brian
Ollie
Galen

Mellanox
Gil

Myricom
Patrick (tentative)

ORNL
Rich

SNL
Ron

UH
Edgar

UT
George
Jelena (tentative)

SUN
Rolf


QLogic
Christian



On Jun 5, 2007, at 10:10 AM, Galen Shipman wrote:



Sorry for the duplicate (this one includes a reasonable subject line):

Okay, so we tried to get the Hilton at a reasonable rate, didn't
happen. Instead we got the eldorado hotel:

http://www.eldoradohotel.com/

So the meeting will be held here.

The room rates at the hotel are probably a bit high, but there  
are a

number of other hotels in and around the area. I will try to get a
list from our admin.

I have the following attendees so far, if you are on the list and
marked as tentative please let me know ASAP if you are definitely
coming. If you are on the list and not marked as tentative, then we
are expecting you, so please let me know today if you are unable to
make it.

This should be a good meeting, we will be located in the heart of
Santa Fe so travel will be easier (you still need a car from ABQ  
but

it is less than one hour) and there are lots of things to do/see
before and after the meetings.

Thanks,

Galen




Updated Attendees (15 in total):

Cisco
Jeff (tentative)

IU
Tim
Andrew (tentative)
Josh (tentative)
Torsten

LANL
Brian
Ollie
Galen

Mellanox
Gil

Myricom
Patrick (tentative)

ORNL
Rich

SNL
Ron

UH
Edgar

UT
George
Jelena

___
devel-core mailing list
devel-c...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel-core
___
devel-core mailing list
devel-c...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel-core


___
devel-core mailing list
devel-c...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel-core




___
devel-core mailing list
devel-c...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel-core




Re: [OMPI devel] OpenIB BTL and SRQs

2007-07-12 Thread Galen Shipman


On Jul 12, 2007, at 10:29 AM, Don Kerr wrote:


Through MCA parameters one can select the use of shared receive queues
in the openib BTL. Other than having fewer queues, I am wondering what
the benefits of using this option are. Can anyone elaborate on using
them vs. the default?

In the trunk the number of queue pairs is the same regardless of SRQ
or non-SRQ (henceforth named PP, per-peer) usage.
The difference is that PP receive resources scale with the number of
active QP connections, while SRQ receive resources do not.
So the real difference is the memory footprint of the receive
resources: SRQ's is potentially much smaller. This comes at a cost;
SRQ does not have flow control, as we cannot reserve resources for a
particular peer, so we do have the possibility of an RNR (receiver
not ready) NAK if all the shared receive resources are consumed and
some peer is still transmitting messages. This has a performance
penalty, as an RNR NAK stalls the IB pipeline. With PP, we can
guarantee that resources are available to the peer and thereby avoid
RNR (although there is a bug in the trunk right now in that we
sometimes get RNR even with PP, but this is being worked on).


I have been working on a modification to the OpenIB BTL which allows
the user to specify SRQ and PP QPs arbitrarily. That is, we can use a
mix of PP and SRQ with a mix of receive sizes for each. This is
coming into the trunk very soon, perhaps tomorrow, but we need to
verify the branch with some additional testing.


I hope this helps. I have a paper at EuroPVM/MPI that discusses much
of this; I will send you a copy off list.


- Galen



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] OMPI_FREE_LIST improvements

2007-07-12 Thread Galen Shipman
In working on my changes in the ib_multifrag branch I modified the
ompi_free_list.
The change enables a free list to have a bit more personality than
what is dictated by the type of the item on the free list. The
overall problem was that we often use different free list item types
simply to distinguish the sizes of free list items. In an ideal world
we would just have constructors that accepted arguments. There are
numerous problems with that approach, but mostly it would require a
major change to the object system, and I don't think we want that. So
instead I modified the free list to allow an optional "post
constructor" initialization function to be run on each free list item,
with optional opaque data passed to the initialization routine.


Here is the signature of the initialization routine:


typedef void (*ompi_free_list_item_init_fn_t) (
    struct ompi_free_list_item_t* item, void* ctx);



I also added two new items to the free list struct:

struct ompi_free_list_t
{
.
ompi_free_list_item_init_fn_t item_init;
void* ctx;
};




The current ompi_free_list_init function didn't change at all,  
instead I added these optional params to ompi_free_list_init_ex:


OMPI_DECLSPEC int ompi_free_list_init_ex(
ompi_free_list_t *free_list,
size_t element_size,
size_t alignment,
opal_class_t* element_class,
int num_elements_to_alloc,
int max_elements_to_alloc,
int num_elements_per_alloc,
struct mca_mpool_base_module_t*,
ompi_free_list_item_init_fn_t item_init,
void *ctx
);


So all the free list does is run the function specified by
"item_init" on each created free list item (after calling
OBJ_CONSTRUCT_INTERNAL).



For those that don't need this new functionality, simply pass two
NULLs to ompi_free_list_init_ex:



ompi_free_list_init_ex(&btl->udapl_frag_eager,
    sizeof(mca_btl_udapl_frag_eager_t) +
        mca_btl_udapl_component.udapl_eager_frag_size,
    mca_btl_udapl_component.udapl_buffer_alignment,
    OBJ_CLASS(mca_btl_udapl_frag_eager_t),
    mca_btl_udapl_component.udapl_free_list_num,
    mca_btl_udapl_component.udapl_free_list_max,
    mca_btl_udapl_component.udapl_free_list_inc,
    btl->super.btl_mpool,
    NULL,
    NULL);


Again, if you are using ompi_free_list_init you won't be affected.


I think this functionality makes sense; it reduced the number of
different free list item types in the OpenIB BTL and allows me to
have numerous free lists of the same item type but with slightly
different characteristics.


Here is an example of how I use this in the OpenIB BTL:

init_data = (mca_btl_openib_frag_init_data_t*)
    malloc(sizeof(mca_btl_openib_frag_init_data_t));

init_data->length = length;
init_data->type = MCA_BTL_OPENIB_FRAG_SEND_USER;
init_data->order = mca_btl_openib_component.rdma_qp;
init_data->list = &openib_btl->send_user_free;

ompi_free_list_init_ex(&openib_btl->send_user_free,
    length,
    2,
    OBJ_CLASS(mca_btl_openib_send_user_frag_t),
    mca_btl_openib_component.ib_free_list_num,
    mca_btl_openib_component.ib_free_list_max,
    mca_btl_openib_component.ib_free_list_inc,
    NULL,
    mca_btl_openib_frag_init,
    (void*)init_data);




Thanks,

Galen



Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Galen Shipman


I think we need to take a step back from micro-optimizations such as  
header caching.


Rich, George, Brian and I are currently looking into latency  
improvements. We came up with several areas of performance  
enhancements that can be done with minimal disruption. The progress  
issue that Christian and others have pointed out does appear to be a  
problem, but will take a bit more work. I would like to see progress  
in these areas first as I really don't like the idea of caching more  
endpoint state in OMPI for micro-benchmark latency improvements until  
we are certain we have done the ground work for improving latency in  
the general case.





Here are the items we have identified:


 



1) remove the 0 byte optimization of not initializing the convertor.
This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
"if" in mca_pml_ob1_send_request_start_copy.

+++
Measure the convertor initialization before taking any other action.
 



 



2) get rid of mca_pml_ob1_send_request_start_prepare and
mca_pml_ob1_send_request_start_copy by removing the
MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
return OMPI_SUCCESS if the fragment can be marked as completed and
OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
solves another problem: with IB, if there are a bunch of isends
outstanding we end up buffering them all in the btl and marking
completion, but never get them on the wire because the BTL runs out
of credits; we never get credits back until finalize because we never
call progress, since the requests are complete. There is one issue
here: start_prepare calls prepare_src and start_copy calls alloc. I
think we can work around this by always using prepare_src; the
OpenIB BTL will give a fragment off the free list anyway because the
fragment is less than the eager limit.

+++
Make the BTL return different return codes for the send. If the
fragment is gone, then the PML is responsible for marking the MPI
request as completed, and so on. Only updated BTLs will get any
benefit from this feature. Add a flag to the descriptor to control
whether the BTL may free the fragment.


Add a three-level flag:
- BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL after
the send, which then reports a special return code back to the PML
- BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released
by the BTL once the completion callback has triggered
- PML_HAVE_OWNERSHIP: the BTL is not allowed to release the fragment
at all (the PML is responsible for this)


Return codes:
- done and there will be no callbacks
- not done, wait for a callback later
- error state
 



 



3) Change the remote callback function (and tag value, based on what
data we are sending); don't use mca_pml_ob1_recv_frag_callback for
everything!

I think we need:

mca_pml_ob1_recv_frag_match
mca_pml_ob1_recv_frag_rndv
mca_pml_ob1_recv_frag_rget

mca_pml_ob1_recv_match_ack_copy
mca_pml_ob1_recv_match_ack_pipeline

mca_pml_ob1_recv_copy_frag
mca_pml_ob1_recv_put_request
mca_pml_ob1_recv_put_fin
+++
Passing the callback as a parameter to the match function will save
us two switches. Add more registrations in the BTL in order to jump
directly to the correct function (the first 3 require a match while
the others don't). Split the tag 4 & 4 bits so each layer will have 4
bits of tag [i.e. the first 4 bits for the protocol tag and the lower
4 bits are up to the protocol]; the registration table will still be
local to each component.
 



 



4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
switch on hdr->hdr_common.hdr_type that mca_pml_ob1_recv_frag_callback
does!
	I think what we can do here is modify mca_pml_ob1_recv_frag_match to
take a function pointer for what it should call on a successful match.
	So based on the receive callback we can pass the correct scheduling
function to invoke into the generic mca_pml_ob1_recv_frag_match.

Recv_request_progress is called in a generic way from multiple
places, and we do a big switch inside. In the match function we might
want to pass a function pointer to the successful-match progress
function. This way we will be able to specialize what happens after
the match in a more optimized way. Or recv_request_match can return
the match and then the caller will have to specialize its action.
---

Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Galen Shipman



Ok here is the numbers on my machines:
0 bytes
mvapich with header caching: 1.56
mvapich without  header caching: 1.79
ompi 1.2: 1.59

So on zero bytes ompi is not so bad. Also we can see that header
caching decreases the mvapich latency by 0.23.

1 bytes
mvapich with header caching: 1.58
mvapich without  header caching: 1.83
ompi 1.2: 1.73



Is this just convertor initialization cost?

- Galen



And here ompi make some latency jump.

In mvapich, header caching decreases the header size from 56 bytes to
12 bytes.
What is the header size (pml + btl) in ompi?


The match header size is 16 bytes, so it looks like ours is already
optimized ...

So for a 0 byte message we are sending only 16 bytes on the wire; is
that correct?


Pasha.


  george.



Pasha
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




[O-MPI devel] Mellanox VAPI SRQ.

2005-07-28 Thread Galen Shipman
Using the mvapi btl you can now set OMPI_MCA_btl_mvapi_use_srq=1, which
will cause mvapi to use a shared receive queue. This will allow much
better scaling, as receives are posted per interface port and not per
queue pair. Note: older versions of Mellanox firmware may see a
substantial performance impact on small message latency, but the latest
firmware shows only a small cost, on the order of 2/10 of a microsecond.



- Galen



Re: [O-MPI devel] [PATCH] for ompi_free_list.c

2005-08-08 Thread Galen Shipman

Changes are in the trunk

Thanks,

Galen
On Aug 8, 2005, at 7:38 AM, Gleb Natapov wrote:


Hello,

Included patch fixes bugs in ompi_free_list in the case ompi_free_list 
was

created with NULL class and/or mpool parameters.

Index: ompi/class/ompi_free_list.c
===
--- ompi/class/ompi_free_list.c (revision 6760)
+++ ompi/class/ompi_free_list.c (working copy)
@@ -75,7 +75,7 @@ int ompi_free_list_grow(ompi_free_list_t
 unsigned char* ptr;
 size_t i;
 size_t mod;
-mca_mpool_base_registration_t* user_out;
+mca_mpool_base_registration_t* user_out = NULL;

 if (flist->fl_max_to_alloc > 0 && flist->fl_num_allocated + 
num_elements > flist->fl_max_to_alloc)

 return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
@@ -97,7 +97,10 @@ int ompi_free_list_grow(ompi_free_list_t
 item->user_data = user_out;
 if (NULL != flist->fl_elem_class) {
 OBJ_CONSTRUCT_INTERNAL(item, flist->fl_elem_class);
-}
+} else {
+   OBJ_CONSTRUCT (&item->super, opal_list_item_t);
+   }
+   
 opal_list_append(&(flist->super), &(item->super));
 ptr += flist->fl_elem_size;
 }
--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [O-MPI devel] [PATCH] wrong variable type in OpenIB.

2005-08-08 Thread Galen Shipman

Gleb,

Changes are in the trunk thanks,

Galen

On Aug 7, 2005, at 4:32 AM, Gleb Natapov wrote:


Hello Galen,

Included patch changes type of returned value from ibv_poll_cq.
It should be signed because we check if it is less then zero later
in the code.


Index: ompi/mca/btl/openib/btl_openib_component.c
===
--- ompi/mca/btl/openib/btl_openib_component.c  (revision 6757)
+++ ompi/mca/btl/openib/btl_openib_component.c  (working copy)
@@ -492,8 +492,8 @@ mca_btl_base_module_t** mca_btl_openib_c

 int mca_btl_openib_component_progress()
 {
-uint32_t i, ne;
-int count = 0;
+uint32_t i;
+int count = 0, ne;
 mca_btl_openib_frag_t* frag;
 mca_btl_openib_endpoint_t* endpoint;
 /* Poll for completions */
--
Gleb.




Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI

2005-08-09 Thread Galen Shipman

Hi
On Aug 9, 2005, at 8:15 AM, Sridhar Chirravuri wrote:


The same kind of output while running Pallas "pingpong" test.

-Sridhar

-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
Behalf Of Sridhar Chirravuri
Sent: Tuesday, August 09, 2005 7:44 PM
To: Open MPI Developers
Subject: Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI


I have run sendrecv function in Pallas but it failed to run it. Here is
the output

[root@micrompi-2 SRC_PMB]# mpirun -np 2 PMB-MPI1 sendrecv
Could not join a running, existing universe
Establishing a new one named: default-universe-5097
[0,1,1][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub
[0,1,1][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub


[0,1,0][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub

[0,1,0][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub

[0,1,0][btl_mvapi_endpoint.c:542:mca_btl_mvapi_endpoint_send]  
Connection

to endpoint closed ... connecting ...
[0,1,0][btl_mvapi_endpoint.c:318:mca_btl_mvapi_endpoint_start_connect]
Initialized High Priority QP num = 263177, Low Priority QP num =  
263178,

LID = 785

[0,1,0][btl_mvapi_endpoint.c:190: 
mca_btl_mvapi_endpoint_send_connect_req

] Sending High Priority QP num = 263177, Low Priority QP num = 263178,
LID = 785[0,1,0][btl_mvapi_endpoint.c:542:mca_btl_mvapi_endpoint_send]
Connection to endpoint closed ... connecting ...
[0,1,0][btl_mvapi_endpoint.c:318:mca_btl_mvapi_endpoint_start_connect]
Initialized High Priority QP num = 263179, Low Priority QP num =  
263180,

LID = 786

[0,1,0][btl_mvapi_endpoint.c:190: 
mca_btl_mvapi_endpoint_send_connect_req

] Sending High Priority QP num = 263179, Low Priority QP num = 263180,
LID = 786#---
#PALLAS MPI Benchmark Suite V2.2, MPI-1 part
#---
# Date   : Tue Aug  9 07:11:25 2005
# Machine: x86_64# System : Linux
# Release: 2.6.9-5.ELsmp
# Version: #1 SMP Wed Jan 5 19:29:47 EST 2005

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype   :   MPI_BYTE
# MPI_Datatype for reductions:   MPI_FLOAT
# MPI_Op :   MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv
[0,1,1][btl_mvapi_endpoint.c:368: 
mca_btl_mvapi_endpoint_reply_start_conn

ect] Initialized High Priority QP num = 263177, Low Priority QP num =
263178,  LID = 777

[0,1,1][btl_mvapi_endpoint.c:266: 
mca_btl_mvapi_endpoint_set_remote_info]
Received High Priority QP num = 263177, Low Priority QP num 263178,   
LID

= 785

[0,1,1][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query]
Modified to init..Qp
7080096[0,1,1][btl_mvapi_endpoint.c:791: 
mca_btl_mvapi_endpoint_qp_init_q

uery] Modified to RTR..Qp
7080096[0,1,1][btl_mvapi_endpoint.c:814: 
mca_btl_mvapi_endpoint_qp_init_q

uery] Modified to RTS..Qp 7080096

[0,1,1][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query]
Modified to init..Qp 7240736
[0,1,1][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query]
Modified to RTR..Qp
7240736[0,1,1][btl_mvapi_endpoint.c:814: 
mca_btl_mvapi_endpoint_qp_init_q

uery] Modified to RTS..Qp 7240736
[0,1,1][btl_mvapi_endpoint.c:190: 
mca_btl_mvapi_endpoint_send_connect_req

] Sending High Priority QP num = 263177, Low Priority QP num = 263178,
LID = 777
[0,1,0][btl_mvapi_endpoint.c:266: 
mca_btl_mvapi_endpoint_set_remote_info]
Received High Priority QP num = 263177, Low Priority QP num 263178,   
LID

= 777
[0,1,0][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query]
Modified to init..Qp 7081440
[0,1,0][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query]
Modified to RTR..Qp 7081440
[0,1,0][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_query]
Modified to RTS..Qp 7081440
[0,1,0][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query]
Modified to init..Qp 7241888
[0,1,0][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query]
Modified to RTR..Qp
7241888[0,1,0][btl_mvapi_endpoint.c:814: 
mca_btl_mvapi_endpoint_qp_init_q

uery] Modified to RTS..Qp 7241888
[0,1,1][btl_mvapi_component.c:523:mca_btl_mvapi_component_progress] Got
a recv completion


Thanks
-Sridhar




-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
Behalf Of Brian Barrett
Sent: Tuesday, August 09, 2005 7:35 PM
To: Open MPI Developers
Subject: Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI

On Aug 9, 2005, at 8:48 AM, Sridhar Chirravuri wrote:


Does r6774 has lot of changes that are related to 3rd generation
point-to-point? I am trying to run some benchmark tests (ex:
pallas) with Open MPI stack and just want to compare the
performance figures with MVAPICH 095 and MVAPICH 092.

In order to use 3rd generation p2p communication, I have added the
following line in the /openmpi/etc/openmpi-mca-params.conf

pml=ob1

I also exported (as double check) OMPI_MCA_pml=ob1.

Then, I 

Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI

2005-08-09 Thread Galen Shipman

Hi Sridhar,

I have committed changes that allow you to set the debug verbosity:

OMPI_MCA_btl_base_debug
0 - no debug output
1 - standard debug output
2 - very verbose debug output

Also we have run the Pallas tests and are not able to reproduce your 
failures. We do see a warning in the Reduce test but it does not hang 
and runs to completion. Attached is a simple ping pong program, try 
running this and let us know the results.


Thanks,

Galen


/*
 * MPI ping program
 *
 * Patterned after the example in the Quadrics documentation
 */

#define MPI_ALLOC_MEM 0
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>

#include <sys/time.h>

#include "mpi.h"

static int str2size(char *str)
{
int size;
char mod[32];

switch (sscanf(str, "%d%1[mMkK]", &size, mod)) {
case 1:
return (size);

case 2:
switch (*mod) {
case 'm':
case 'M':
return (size << 20);

case 'k':
case 'K':
return (size << 10);

default:
return (size);
}

default:
return (-1);
}
}


static void usage(void)
{
fprintf(stderr,
"Usage: mpi-ping [flags]  [] []\n"
"   mpi-ping -h\n");
exit(EXIT_FAILURE);
}


static void help(void)
{
printf
("Usage: mpi-ping [flags]  [] []\n"
 "\n" "   Flags may be any of\n"
 "  -Buse blocking send/recv\n"
 "  -Ccheck data\n"
 "  -Ooverlapping pings\n"
 "  -Wperform warm-up phase\n"
 "  -r number repetitions to time\n"
 "  -A   use MPI_Alloc_mem to register memory\n" 
 "  -hprint this info\n" "\n"
 "   Numbers may be postfixed with 'k' or 'm'\n\n");

exit(EXIT_SUCCESS);
}


int main(int argc, char *argv[])
{
MPI_Status status;
MPI_Request recv_request;
MPI_Request send_request;
unsigned char *rbuf;
unsigned char *tbuf;
int c;
int i;
int bytes;
int nproc;
int peer;
int proc;
int r;
int tag = 0x666;

/*
 * default options / arguments
 */
int reps = 1;
int blocking = 0;
int check = 0;
int overlap = 0;
int warmup = 0;
int inc_bytes = 0;
int max_bytes = 0;
int min_bytes = 0;
int alloc_mem = 0; 

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &proc);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);

while ((c = getopt(argc, argv, "BCOWAr:h")) != -1) {
switch (c) {

case 'B':
blocking = 1;
break;

case 'C':
check = 1;
break;

case 'O':
overlap = 1;
break;

case 'W':
warmup = 1;
break;


case 'A': 
alloc_mem=1; 
break; 


case 'r':
if ((reps = str2size(optarg)) <= 0) {
usage();
}
break;

case 'h':
help();

default:
usage();
}
}

if (optind == argc) {
min_bytes = 0;
} else if ((min_bytes = str2size(argv[optind++])) < 0) {
usage();
}

if (optind == argc) {
max_bytes = min_bytes;
} else if ((max_bytes = str2size(argv[optind++])) < min_bytes) {
usage();
}

if (optind == argc) {
inc_bytes = 0;
} else if ((inc_bytes = str2size(argv[optind++])) < 0) {
usage();
}

if (nproc == 1) {
exit(EXIT_SUCCESS);
}

#if MPI_ALLOC_MEM
if(alloc_mem) { 
 MPI_Alloc_mem(max_bytes ? max_bytes: 8, MPI_INFO_NULL, &rbuf);
 MPI_Alloc_mem(max_bytes ? max_bytes: 8, MPI_INFO_NULL, &tbuf);
} 
else { 
#endif 
if ((rbuf = (unsigned char *) malloc(max_bytes ? max_bytes : 8)) == 
NULL) { 
perror("malloc"); 
exit(EXIT_FAILURE); 
} 
if ((tbuf = (unsigned char *) malloc(max_bytes ? max_bytes : 8)) == 
NULL) { 
perror("malloc"); 
exit(EXIT_FAILURE); 
} 
#if MPI_ALLOC_MEM
} 
#endif 

if (check) {
for (i = 0; i < max_bytes; i++) {
tbuf[i] = i & 255;
rbuf[i] = 0;
}
}

if (proc == 0) {
if (overlap) {
printf("mpi-ping: overlapping ping-pong\n");
} else if (blocking) {
printf("mpi-ping: ping-pong (using blocking send/recv)\n");
} else {
printf("mpi-ping: ping-pong\n");
}
if (check) {
printf("data checking enabled\n");
}
printf("nprocs=%d, reps=%d, min bytes=%d, max bytes=%d inc bytes=%d\n",
   nproc, reps, min_bytes, max_bytes, inc_bytes);
fflush(stdout);
}

MPI_Barrier(MPI_COMM_WORLD);

peer = proc ^ 1;

if ((peer < nproc) && (peer & 1)) {
printf("%d pings %d\n", proc, peer);
fflush(stdout

Re: [O-MPI devel] couple of problems in openib mpool.

2005-08-12 Thread Galen Shipman

Hey Gleb,

Sorry for the delay.. we have been doing a bit of reworking of the 
pml/btl so that the btl's can be shared outside of just the pml 
(collectives, etc).


I have added the bug fix (old_reg). Will look at the assumption of 
non-null registration next.


Thanks (and keep them coming ;-) ,

Galen

On Aug 11, 2005, at 8:27 AM, Gleb Natapov wrote:


Hello,

There are a couple of bugs/typos in the openib mpool. The first one is
fixed by the included patch. The second one is in the function
mca_mpool_openib_free(). This function assumes that the registration
is never NULL, but there are callers that think otherwise
(ompi/class/ompi_fifo.h, ompi/class/ompi_circular_buffer_fifo.h).


Index: ompi/mca/mpool/openib/mpool_openib_module.c
===
--- ompi/mca/mpool/openib/mpool_openib_module.c (revision 6806)
+++ ompi/mca/mpool/openib/mpool_openib_module.c (working copy)
@@ -127,7 +127,7 @@
 mca_mpool_base_registration_t* old_reg  = *registration;
 void* new_mem = mpool->mpool_alloc(mpool, size, 0, registration);
 memcpy(new_mem, addr, old_reg->bound - old_reg->base);
-mpool->mpool_free(mpool, addr, &old_reg);
+mpool->mpool_free(mpool, addr, old_reg);
 return new_mem;

 }
--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




[O-MPI devel] build warnings..

2005-08-12 Thread Galen Shipman

Current build warnings:

mca_base_parse_paramfile_lex.c:1664: warning: 'yy_flex_realloc' defined 
but not used

qsort.c:163: warning: cast from pointer to integer of different size
show_help_lex.c:1606: warning: 'yy_flex_realloc' defined but not used
rmgr_proxy.c:237: warning: ISO C forbids conversion of object pointer 
to function pointer type
rmgr_proxy.c:356: warning: ISO C forbids conversion of function pointer 
to object pointer type
rmgr_urm.c:184: warning: ISO C forbids conversion of object pointer to 
function pointer type
rmgr_urm.c:309: warning: ISO C forbids conversion of function pointer 
to object pointer type

comm_cid.c:167: warning: comparison between signed and unsigned
fake_stack.c:46: warning: no previous prototype for 
'ompi_convertor_create_stack_with_pos_general'