[ofa-general] Re: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification

2008-05-27 Thread Or Gerlitz

Sean Hefty wrote:

+static void cma_ndev_work_handler(struct work_struct *_work)
+{
+   struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, work);
+   struct rdma_id_private *id_priv = work->id;
+   int destroy = 0;
+
+   mutex_lock(&id_priv->handler_mutex);
+
+   if (id_priv->id.event_handler(&id_priv->id, &work->event)) {

How do we know that the user hasn't tried to destroy the id from another
callback?  We need some sort of state check here.

correct, will be fixed.


+   cma_exch(id_priv, CMA_DESTROYING);
+   destroy = 1;
+   }
+
+   cma_enable_remove(id_priv);


I didn't see the matching cma_disable_remove() call.
As you can see also in patch 3/5, places in the code which
originally did --not-- call cma_disable_remove() but rather just did
atomic_inc(&conn_id->dev_remove) were simply replaced with
mutex_lock(&id_priv->handler_mutex). This is because cma_disable_remove()
does two things: 1) it does the state validation, and 2) it locks the
handler_mutex, so places in the code which don't need the state
validation don't call it... a bit dirty.
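
A minimal sketch of how the two points above could combine -- the state check
Sean asked for, plus taking handler_mutex directly because this path does not
need the cma_disable_remove() entry filtering. This is an illustration only,
not the follow-up patch; CMA_DEVICE_REMOVAL, cma_deref_id() and
rdma_destroy_id() are assumed from the rest of cma.c / the rdma_cm API rather
than from the hunks quoted here:

static void cma_ndev_work_handler(struct work_struct *_work)
{
	struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, work);
	struct rdma_id_private *id_priv = work->id;
	int destroy = 0;

	mutex_lock(&id_priv->handler_mutex);

	/* bail out if the id is already on its way out */
	if (id_priv->state == CMA_DESTROYING ||
	    id_priv->state == CMA_DEVICE_REMOVAL)
		goto out;

	if (id_priv->id.event_handler(&id_priv->id, &work->event)) {
		cma_exch(id_priv, CMA_DESTROYING);
		destroy = 1;
	}

out:
	mutex_unlock(&id_priv->handler_mutex);
	cma_deref_id(id_priv);
	if (destroy)
		rdma_destroy_id(&id_priv->id);
	kfree(work);
}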




+static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private *id_priv)
+{


nit - function name isn't clear to me.  Maybe something like
cma_netdev_change_handler()?  Although I'm not sure that netdev change is what
the user is really interested in.  What they really want to know is if IP
address mapping/resolution changed.  netdev is hidden from the user.

OK, I will see how to improve the name

Maybe call this RDMA_CM_EVENT_ADDR_CHANGE?

let me think about it

+static int cma_netdev_callback(struct notifier_block *self, unsigned long event,
+                              void *ctx)
+{
+   struct net_device *ndev = (struct net_device *)ctx;
+   struct cma_device *cma_dev;
+   struct rdma_id_private *id_priv;
+   int ret = NOTIFY_DONE;
+
+   if (dev_net(ndev) != &init_net)
+       return NOTIFY_DONE;
+
+   if (event != NETDEV_BONDING_FAILOVER)
+       return NOTIFY_DONE;
+
+   if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING))
+       return NOTIFY_DONE;
+
+   mutex_lock(&lock);
+   list_for_each_entry(cma_dev, &dev_list, list)


It seems like we just need to find the cma_dev that has the current mapping.

correct. So I can take the lock, find the device in the list, increment
its refcount and release the lock.  To that end I would have to save in
the cma device structure the names of the network devices which are
associated with it; I can't use a comparison of the pdev etc. pointers
to the DMA device, since some network devices (e.g. bonding / VLAN
interfaces in Ethernet) have a NULL pdev (they are virtual devices).


Later I can scan this device ID list, but I must do it under the lock
in order not to race with the device removal code, which removes IDs
from this list in cma_process_remove(), correct?
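
A rough sketch of the scan being described, purely for illustration: only
'lock', 'dev_list' and cma_process_remove() appear in today's cma.c;
cma_dev->id_list and the cma_queue_ndev_event() helper are assumptions
standing in for whatever the next version of the patch does per id:

static int cma_netdev_change(struct net_device *ndev)
{
	struct cma_device *cma_dev;
	struct rdma_id_private *id_priv;
	int ret = 0;

	/* same mutex the device-removal path (cma_process_remove) holds,
	 * so ids cannot be unlinked from under us while we walk the lists */
	mutex_lock(&lock);
	list_for_each_entry(cma_dev, &dev_list, list)
		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
			ret = cma_queue_ndev_event(id_priv, ndev);
			if (ret)
				goto out;
		}
out:
	mutex_unlock(&lock);
	return ret;
}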


Or.



[ofa-general] Re: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks

2008-05-27 Thread Or Gerlitz

Sean Hefty wrote:

@@ -359,7 +358,7 @@ static int cma_disable_remove(struct rdm

	spin_lock_irqsave(&id_priv->lock, flags);
	if (id_priv->state == state) {
-		atomic_inc(&id_priv->dev_remove);
+		mutex_lock(&id_priv->handler_mutex);

This just tried to acquire a mutex while holding a spinlock.


I see. So can taking this spin lock be avoided here? I understand that 
spin lock came to protect the state check, correct?


Or.



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Talpey, Thomas
At 11:33 AM 5/27/2008, Tom Tucker wrote:
So I think from an NFSRDMA coding perspective it's a wash...

Just to be clear, you're talking about the NFS/RDMA server. However, it's
pretty much a wash on the client, for different reasons.

When posting the WR, We check the fastreg capabilities bit + transport 
type bit:
If fastreg is true --
   Post FastReg
   If iWARP (or with a cap bit read-with-inv-flag)
   post rdma read w/ invalidate

... For iWARP's case, this means rdma-read-w-inv,
plus rdma-send-w-inv, etc... 


Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests
don't support remote invalidate. At least, the table in RFC5040 (p.22) doesn't:



    -------+------------+--------+-------+--------+------------+--------------
    RDMA   | Message    | Tagged | STag  | Queue  | Invalidate | Message
    Message| Type       | Flag   | and   | Number | STag       | Length
    OpCode |            |        | TO    |        |            | Communicated
           |            |        |       |        |            | between DDP
           |            |        |       |        |            | and RDMAP
    -------+------------+--------+-------+--------+------------+--------------
    0000b  | RDMA Write | 1      | Valid | N/A    | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0001b  | RDMA Read  | 0      | N/A   | 1      | N/A        | Yes
           | Request    |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0010b  | RDMA Read  | 1      | Valid | N/A    | N/A        | Yes
           | Response   |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0011b  | Send       | 0      | N/A   | 0      | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0100b  | Send with  | 0      | N/A   | 0      | Valid      | Yes
           | Invalidate |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0101b  | Send with  | 0      | N/A   | 0      | N/A        | Yes
           | SE         |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0110b  | Send with  | 0      | N/A   | 0      | Valid      | Yes
           | SE and     |        |       |        |            |
           | Invalidate |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0111b  | Terminate  | 0      | N/A   | 2      | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    1000b  |            |
    to     | Reserved   |   Not Specified
    1111b  |            |
    -------+------------+--------------------------------------------------



I want to take this opportunity to also mention that the RPC/RDMA client-server
exchange does not support remote-invalidate currently. Because of the multiple
stags supported by the rpcrdma chunking header, and because the client needs
to verify that the stags were in fact invalidated, there is significant 
overhead,
and the jury is out on that benefit. In fact, I suspect it's a lose at the 
client.

Tom (Talpey).  



Re: [ofa-general] OpenSM?

2008-05-27 Thread Ira Weiny
Charles,

Here at LLNL we have been running OpenSM for some time.  Thus far we are very
happy with its performance.  Our largest cluster is 1152 nodes and OpenSM can
bring it up (not counting boot time) in less than a minute.

Here are some details.

We are running v3.1.10 of OpenSM with some minor modifications (mostly patches
which have been submitted upstream and been accepted by Sasha but are not yet
in a release.)

Our clusters are all Fat-tree topologies.

We have a node which is more or less dedicated to running OpenSM.  We have some
other monitoring software running on it, but OpenSM can utilize the CPU/Memory
if it needs to.

   A) On our large clusters this node is a 4 socket, dual core (8 cores
   total) Opteron running at 2.4Gig with 16Gig of memory.  I don't believe
   OpenSM needs this much but the nodes were built all the same so this is
   what it got.

   B) On one of our smaller clusters (128 nodes) OpenSM is running on a
   dual socket, single core (2 cores) 2.4Gig Opteron node with 2Gig of
   memory.  We have not seen any issues with this cluster and OpenSM.

We run with the up/down algorithm; ftree has not panned out for us yet.  I
can't say how that would compare to the Cisco algorithms.

In short OpenSM should work just fine on your cluster.

Hope this helps,
Ira


On Tue, 27 May 2008 11:15:14 -0400
Charles Taylor [EMAIL PROTECTED] wrote:

 
 We have a 400 node IB cluster.  We are running an embedded SM in
 failover mode on our TS270/Cisco7008 core switches.  Lately we have
 been seeing problems with LID assignment when rebooting nodes (see log
 messages below).  It is also taking far too long for LIDs to be
 assigned, as it takes on the order of minutes for the ports to
 transition to ACTIVE.
 
 This seems like a bug to us and we are considering switching to  
 OpenSM  on a host.   I'm wondering about experience with running  
 OpenSM for medium to large clusters (Fat Tree) and what resources  
 (memory/cpu) we should plan on for the host node.
 
 Thanks,
 
 Charlie Taylor
 UF HPC Center
 
 May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
 **  NEW SWEEP 
 May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]:  
 Rediscover
 the subnet
 May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM  
 OUT_OF_SERVICE
 trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
 May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An  
 existing IB
 node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed
 May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM  
 DELETE_MC_GROUP
 trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59
 May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]:  
 Topology
 changed
 May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
 caused by
 discovering removed ports
 May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]:  
 async events
 require sweep
 May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
 **  NEW SWEEP 
 May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]:  
 Rediscover
 the subnet
 May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no
 routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194
 May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]:  
 Topology
 changed
 May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
 caused by
 discovering new ports
 May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
 caused by
 multicast membership change
 May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]:  
 Force port to
 go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
 May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]:  
 Program port
 state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor
 node=00:02:c9:02:00:21:4b:58, port= 1, current state 2
 May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]:  
 Failed to
 negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad  
 status 0x1c
 May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM  
 IN_SERVICE trap
 for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
 May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new  
 IB node
 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0
 May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]:  
 async events
 require sweep
 May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
 **  NEW SWEEP 
 May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]:  
 Rediscover
 the subnet
 May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No  
 topology
 change
 May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
 caused by
 previous GET/SET operation 

[ofa-general] Re: [PATCH] saquery: --smkey command line option

2008-05-27 Thread Sasha Khapyorsky
On 04:33 Tue 27 May , Hal Rosenstock wrote:
  
  Maybe yes, but could you be more specific? Store SMKey in read-only
  file on a client side?
 
 Treating smkey as su treats a password, rather than as a command line
 parameter, is another alternative.

Ok, let's do it as '--smkey X' and then saquery will ask for a value,
just like su does. Good?
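
For illustration only (not saquery code), the prompt-like-su idea boils down
to reading the key with echo disabled so it never shows up in ps output or
shell history; getpass(3) and strtoull(3) are standard libc calls, the helper
name is made up:

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

static uint64_t prompt_for_smkey(void)
{
	/* getpass() turns terminal echo off while the key is typed */
	char *s = getpass("SM_Key: ");

	if (!s)
		return 0;
	return strtoull(s, NULL, 0);	/* accepts decimal or 0x-prefixed hex */
}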

  I'm not proposing to expose SM_Key, just added such option where this
  key could be specified.
 
 How is that not exposing it ?

Because (1) and (2) below.

Sasha

 
 -- Hal
 
   So: 1) this is *optional*, 2) there is no
  suggestions about how the right value should be determined.
  
  Sasha
 


[ofa-general] Re: [PATCH] saquery: --smkey command line option

2008-05-27 Thread Sasha Khapyorsky
On 04:29 Tue 27 May , Hal Rosenstock wrote:
 On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote:
   Following your logic we will need to disable root passwords
   typing too.
  
  That's taking it too far. Root passwords are at least hidden when
  typing.
 
 At least hide the key typing from plain sight when typing like su does.

There are a lot of tools where a password can be specified as clear text on
the command line (wget, smbclient, etc.) - it is the user's responsibility
to keep their sensitive data safe.

Sasha


[ofa-general] mthca MR attrs userspace change

2008-05-27 Thread Pete Wyckoff
My kernel started complaining to me recently:

ib_mthca 0000:02:00.0: Process 'tgtd' did not pass in MR attrs.
ib_mthca 0000:02:00.0:   Update libmthca to fix this.

It comes from commit cb9fbc5c37b69ac584e61d449cfd590f5ae1f90d
(IB: expand ib_umem_get() prototype) and the fix in
baaad380c0aa955f7d62e846467316c94067f1a5 (IB/mthca: Avoid changing
userspace ABI to handle DMA write barrier attribute).

Nice that everything still works with old userspace, but where is
the latest libmthca these days?  The one at kernel.org still has
ABI_VERSION 1:


http://git.kernel.org/?p=libs/infiniband/libmthca.git;a=blob;f=src/mthca-abi.h;h=2557274e4cbd9f36df2be42379644d31b4ff5da3;hb=HEAD

By the way, Roland, your efforts at Fedora packaging are certainly
appreciated here.  If the new libmthca just showed up in updates to
F-9, that would be most convenient.

-- Pete


[ofa-general] Re: mthca MR attrs userspace change

2008-05-27 Thread Roland Dreier
  ib_mthca 0000:02:00.0: Process 'tgtd' did not pass in MR attrs.
  ib_mthca 0000:02:00.0:   Update libmthca to fix this.

Heh, thanks for the reminder.  Will fix libmthca to handle this properly
today.

  By the way, Roland, your efforts at Fedora packaging are certainly
  appreciated here.  If the new libmthca just showed up in updates to
  F-9, that would be most convenient.

It will, although the Fedora process takes a while to grind through ;)

 - R.


[ofa-general] RE: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks

2008-05-27 Thread Sean Hefty
 spin_lock_irqsave(&id_priv->lock, flags);
 if (id_priv->state == state) {
 -   atomic_inc(&id_priv->dev_remove);
 +   mutex_lock(&id_priv->handler_mutex);
 This just tried to acquire a mutex while holding a spinlock.

I see. So can taking this spin lock be avoided here? I understand that
spin lock came to protect the state check, correct?

I think we should just remove cma_disable_remove() and cma_enable_remove(), and
instead call mutex_lock/unlock directly in their places.  Where
cma_disable_remove() is called, add in appropriate state checks after acquiring
the mutex.
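
A hedged sketch of the call-site pattern being proposed (the helper name here
is hypothetical; the point is just: mutex first, then the state check, with no
spinlock involved):

static int cma_begin_handler(struct rdma_id_private *id_priv,
			     enum cma_state expected)
{
	mutex_lock(&id_priv->handler_mutex);
	if (id_priv->state != expected) {
		/* id is being destroyed or has moved on: drop this event */
		mutex_unlock(&id_priv->handler_mutex);
		return -EINVAL;
	}
	return 0;	/* caller runs the handler, then unlocks handler_mutex */
}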

- Sean



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Tom Tucker

On Tue, 2008-05-27 at 10:33 -0500, Tom Tucker wrote:
 On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote:
   The invalidate local stag part of a read is just a local sink side
operation (ie no wire protocol change from a read).  It's not like
processing an ingress send-with-inv.  It is really functionally like a
read followed immediately by a fenced invalidate-local, but it doesn't
stall the pipe.  So the device has to remember the read is a with inv
local stag and invalidate the stag after the read response is placed
and before the WCE is reaped by the application.
  
  Yes, understood.  My point was just that in IB, at least in theory, one
  could just use an L_Key that doesn't have any remote permissions in the
  scatter list of an RDMA read, while in iWARP, the STag used to place an
  RDMA read response has to have remote write permission.  So RDMA read
  with invalidate makes sense for iWARP, because it gives a race-free way
  to allow an STag to be invalidated immediately after an RDMA read
  response is placed, while in IB it's simpler just to never give remote
  access at all.
  
 
 So I think from an NFSRDMA coding perspective it's a wash...
 
 When creating the local data sink, We need to check the transport type.
 
 If it's IB -- only local access,
 if it's iWARP -- local + remote access.
 
 When posting the WR, We check the fastreg capabilities bit + transport type 
 bit:
 If fastreg is true --
   Post FastReg
   If iWARP (or with a cap bit read-with-inv-flag)
   post rdma read w/ invalidate
   else /* IB */
   post rdma read

Steve pointed out a good optimization here. Instead of fencing the RDMA
READ here in advance of the INVALIDATE, we should post the INVALIDATE
when the READ WR completes. This will avoid stalling the SQ. Since IB
doesn't put the LKEY on the wire, there's no security issue to close. We
need to keep a bunch of fastreg MRs around anyway for concurrent RPCs.

Thoughts?
Tom
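
A sketch of the deferred-invalidate idea, for illustration only: instead of a
fenced invalidate queued behind the READ, react to the READ completion and
post the invalidate then.  The rpc_ctxt type, the wr_id encoding and the
post_local_invalidate() helper are assumptions, not NFSRDMA code:

static void handle_sq_completion(struct ib_wc *wc)
{
	struct rpc_ctxt *ctxt = (struct rpc_ctxt *)(unsigned long)wc->wr_id;

	switch (wc->opcode) {
	case IB_WC_RDMA_READ:
		/* data is placed; retire the fastreg stag now, without
		 * having stalled the SQ behind a fenced WR */
		post_local_invalidate(ctxt);
		break;
	default:
		break;
	}
}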

   post invalidate
   fi
 else
   ... today's logic
 fi
 
 I make the observation, however, that the transport type is now overloaded
 with a set of required verbs. For iWARP's case, this means rdma-read-w-inv,
 plus rdma-send-w-inv, etc... This also means that new transport types will
 inherit one or the other set of verbs (IB or iWARP).
 
 Tom
 
 
   - R.


Re: [ofa-general] Re: mthca MR attrs userspace change

2008-05-27 Thread Roland Dreier
Here's the patch I plan to add to match the kernel, in case someone
wants to check it over:

diff --git a/src/mthca-abi.h b/src/mthca-abi.h
index 2557274..7e47d70 100644
--- a/src/mthca-abi.h
+++ b/src/mthca-abi.h
@@ -36,7 +36,8 @@
 
 #include <infiniband/kern-abi.h>
 
-#define MTHCA_UVERBS_ABI_VERSION   1
+#define MTHCA_UVERBS_MIN_ABI_VERSION   1
+#define MTHCA_UVERBS_MAX_ABI_VERSION   2
 
 struct mthca_alloc_ucontext_resp {
struct ibv_get_context_resp ibv_resp;
@@ -50,6 +51,17 @@ struct mthca_alloc_pd_resp {
__u32   reserved;
 };
 
+struct mthca_reg_mr {
+   struct ibv_reg_mr   ibv_cmd;
+/*
+ * Mark the memory region with a DMA attribute that causes
+ * in-flight DMA to be flushed when the region is written to:
+ */
+#define MTHCA_MR_DMASYNC   0x1
+   __u32   mr_attrs;
+   __u32   reserved;
+};
+
 struct mthca_create_cq {
	struct ibv_create_cq	ibv_cmd;
__u32   lkey;
diff --git a/src/mthca.c b/src/mthca.c
index e00c4ee..dd95636 100644
--- a/src/mthca.c
+++ b/src/mthca.c
@@ -282,9 +282,11 @@ static struct ibv_device *mthca_driver_init(const char *uverbs_sys_path,
 		return NULL;
 
 found:
-	if (abi_version > MTHCA_UVERBS_ABI_VERSION) {
-		fprintf(stderr, PFX "Fatal: ABI version %d of %s is too new (expected %d)\n",
-			abi_version, uverbs_sys_path, MTHCA_UVERBS_ABI_VERSION);
+	if (abi_version > MTHCA_UVERBS_MAX_ABI_VERSION ||
+	    abi_version < MTHCA_UVERBS_MIN_ABI_VERSION) {
+		fprintf(stderr, PFX "Fatal: ABI version %d of %s is not in supported range %d-%d\n",
+			abi_version, uverbs_sys_path, MTHCA_UVERBS_MIN_ABI_VERSION,
+			MTHCA_UVERBS_MAX_ABI_VERSION);
 		return NULL;
 	}
 
diff --git a/src/verbs.c b/src/verbs.c
index 6c9b53a..3d273d4 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -117,12 +117,21 @@ int mthca_free_pd(struct ibv_pd *pd)
 
 static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr,
 size_t length, uint64_t hca_va,
-enum ibv_access_flags access)
+enum ibv_access_flags access,
+int dma_sync)
 {
struct ibv_mr *mr;
-   struct ibv_reg_mr cmd;
+   struct mthca_reg_mr cmd;
int ret;
 
+   /*
+* Old kernels just ignore the extra data we pass in with the
+* reg_mr command structure, so there's no need to add an ABI
+* version check here.
+*/
+   cmd.mr_attrs = dma_sync ? MTHCA_MR_DMASYNC : 0;
+   cmd.reserved = 0;
+
mr = malloc(sizeof *mr);
if (!mr)
return NULL;
@@ -132,11 +141,11 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr,
 		struct ibv_reg_mr_resp resp;
 
 		ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr,
-				     &cmd, sizeof cmd, &resp, sizeof resp);
+				     &cmd.ibv_cmd, sizeof cmd, &resp, sizeof resp);
 	}
 #else
 	ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr,
-			     &cmd, sizeof cmd);
+			     &cmd.ibv_cmd, sizeof cmd);
 #endif
if (ret) {
free(mr);
@@ -149,7 +158,7 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, 
void *addr,
 struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr,
size_t length, enum ibv_access_flags access)
 {
-   return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access);
+   return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0);
 }
 
 int mthca_dereg_mr(struct ibv_mr *mr)
@@ -202,7 +211,7 @@ struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe,
 
 	cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf,
 				cqe * MTHCA_CQ_ENTRY_SIZE,
-				0, IBV_ACCESS_LOCAL_WRITE);
+				0, IBV_ACCESS_LOCAL_WRITE, 1);
 	if (!cq->mr)
 		goto err_buf;
 
@@ -297,7 +306,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, int cqe)
 
	mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf,
			    cqe * MTHCA_CQ_ENTRY_SIZE,
-			    0, IBV_ACCESS_LOCAL_WRITE);
+			    0, IBV_ACCESS_LOCAL_WRITE, 1);
	if (!mr) {
		mthca_free_buf(&buf);
ret = ENOMEM;
@@ -405,7 +414,7 @@ struct ibv_srq *mthca_create_srq(struct ibv_pd *pd,
	if (mthca_alloc_srq_buf(pd, &attr->attr, srq))
		goto err;
 
-	srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0);
+	srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0);
	if (!srq->mr)
 

Re: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile.

2008-05-27 Thread Roland Dreier
  Makes sense. We will get rid of this CONFIG option. Apart from this
  are there any other changes you
  would like to see in the patch series ?

Have not reviewed the latest in detail but I think we are at least
pretty close to something ready to merge.

 - R.


[ofa-general] Re: mthca MR attrs userspace change

2008-05-27 Thread Roland Dreier
  Nice that everything still works with old userspace, but where is
  the latest libmthca these days?  The one at kernel.org still has
  ABI_VERSION 1:

Actually this tricked me... the kernel ABI didn't get bumped so there
was no reason to bump the libmthca ABI.  I'll actually use this slightly
simpler patch:

diff --git a/src/mthca-abi.h b/src/mthca-abi.h
index 2557274..4fbd98b 100644
--- a/src/mthca-abi.h
+++ b/src/mthca-abi.h
@@ -50,6 +50,17 @@ struct mthca_alloc_pd_resp {
__u32   reserved;
 };
 
+struct mthca_reg_mr {
+   struct ibv_reg_mr   ibv_cmd;
+/*
+ * Mark the memory region with a DMA attribute that causes
+ * in-flight DMA to be flushed when the region is written to:
+ */
+#define MTHCA_MR_DMASYNC   0x1
+   __u32   mr_attrs;
+   __u32   reserved;
+};
+
 struct mthca_create_cq {
	struct ibv_create_cq	ibv_cmd;
__u32   lkey;
diff --git a/src/verbs.c b/src/verbs.c
index 6c9b53a..def0f30 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -117,12 +117,22 @@ int mthca_free_pd(struct ibv_pd *pd)
 
 static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr,
 size_t length, uint64_t hca_va,
-enum ibv_access_flags access)
+enum ibv_access_flags access,
+int dma_sync)
 {
struct ibv_mr *mr;
-   struct ibv_reg_mr cmd;
+   struct mthca_reg_mr cmd;
int ret;
 
+   /*
+* Old kernels just ignore the extra data we pass in with the
+* reg_mr command structure, so there's no need to add an ABI
+* version check here (and indeed the kernel ABI was not
+* incremented due to this change).
+*/
+   cmd.mr_attrs = dma_sync ? MTHCA_MR_DMASYNC : 0;
+   cmd.reserved = 0;
+
mr = malloc(sizeof *mr);
if (!mr)
return NULL;
@@ -132,11 +142,11 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr,
 		struct ibv_reg_mr_resp resp;
 
 		ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr,
-				     &cmd, sizeof cmd, &resp, sizeof resp);
+				     &cmd.ibv_cmd, sizeof cmd, &resp, sizeof resp);
 	}
 #else
 	ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr,
-			     &cmd, sizeof cmd);
+			     &cmd.ibv_cmd, sizeof cmd);
 #endif
if (ret) {
free(mr);
@@ -149,7 +159,7 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr,
 struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr,
size_t length, enum ibv_access_flags access)
 {
-   return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access);
+   return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0);
 }
 
 int mthca_dereg_mr(struct ibv_mr *mr)
@@ -202,7 +212,7 @@ struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe,
 
 	cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf,
 				cqe * MTHCA_CQ_ENTRY_SIZE,
-				0, IBV_ACCESS_LOCAL_WRITE);
+				0, IBV_ACCESS_LOCAL_WRITE, 1);
 	if (!cq->mr)
 		goto err_buf;
 
@@ -297,7 +307,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, int cqe)
 
	mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf,
			    cqe * MTHCA_CQ_ENTRY_SIZE,
-			    0, IBV_ACCESS_LOCAL_WRITE);
+			    0, IBV_ACCESS_LOCAL_WRITE, 1);
	if (!mr) {
		mthca_free_buf(&buf);
ret = ENOMEM;
@@ -405,7 +415,7 @@ struct ibv_srq *mthca_create_srq(struct ibv_pd *pd,
	if (mthca_alloc_srq_buf(pd, &attr->attr, srq))
		goto err;
 
-	srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0);
+	srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0);
	if (!srq->mr)
goto err_free;
 
@@ -525,7 +535,7 @@ struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr)
 	    pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE))
 		goto err_free;
 
-	qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0);
+	qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0);
 	if (!qp->mr)
 		goto err_free;
 


[ofa-general] [PATCH RFC v4 0/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Steve Wise
The following patch series proposes:

- The API and core changes needed to implement the IB BMME and
  iWARP equivalent memory extensions.

- cxgb3 support.

Changes since version 3:
- better comments to ib_alloc_fast_reg_page_list() function
  to explicitly state the page list is owned by the device until
  the fast_reg WR completes _and_ that the page_list can be
  modified by the device.
- cxgb3 - when allocating a page list, set max_page_list_len.
- page_size -> page_shift in fast_reg union of ib_send_wr struct.
- key support via ib_update_fast_reg_key()

Changes since version 2:
- added device attribute max_fast_reg_page_list_len
- added cxgb3 patch

Changes since version 1:
- ib_alloc_mr() -> ib_alloc_fast_reg_mr()
- pbl_depth -> max_page_list_len
- page_list_len -> max_page_list_len where it makes sense
- int -> unsigned int where needed
- fbo -> first_byte_offset
- added page size and page_list_len to fast_reg union in ib_send_wr
- rearranged work request fast_reg union of ib_send_wr to pack it
- dropped remove_access parameter from ib_alloc_fast_reg_mr()
- IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS
- compiled

Steve.


[ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Steve Wise

Support for the IB BMME and iWARP equivalent memory extensions to 
non shared memory regions.  This includes:

- allocation of an ib_mr for use in fast register work requests

- device-specific alloc/free of physical buffer lists for use in fast
register work requests.  This allows devices to allocate this memory as
needed (like via dma_alloc_coherent).

- fast register memory region work request

- invalidate local memory region work request

- read with invalidate local memory region work request (iWARP only)


Design details:

- New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates
device support for this feature.

- New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request.

- New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr.

- New API function, ib_alloc_fast_reg_mr(), used to allocate fast_reg memory
regions.

- New API function, ib_alloc_fast_reg_page_list to allocate
device-specific page lists.

- New API function, ib_free_fast_reg_page_list to free said page lists.

- New API function, ib_update_fast_reg_key to allow the key portion of
the R_Key and L_Key of a fast_reg MR to be updated.  Applications call
this if desired before posting the IB_WR_FAST_REG_MR.


Usage Model:

- MR allocated with ib_alloc_fast_reg_mr()

- Page lists allocated via ib_alloc_fast_reg_page_list().

- MR R_Key/L_Key key field updated with ib_update_fast_reg_key().

- MR made VALID and bound to a specific page list via
ib_post_send(IB_WR_FAST_REG_MR)

- MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)

- MR deallocated with ib_dereg_mr()

- page lists dealloced via ib_free_fast_reg_page_list().

Applications can allocate a fast_reg mr once, and then repeatedly
bind the mr to different physical memory SGLs by posting work
requests to the SQ.  For each outstanding mr-to-pbl binding in the SQ pipe,
a fast_reg_page_list needs to be allocated.  Thus pipelining can be
achieved while still allowing device-specific page_list processing.

The 4B fast_reg rkey or stag is composed of a 3B index, and a 1B key.
The application can change the key each time it fast-registers thus
allowing more control over the peer's use of the rkey (ie it can
effectively be changed each time the rkey is rebound to a page list).
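
A hedged sketch of this usage model, using only the entry points named above
(ib_alloc_fast_reg_mr, ib_alloc_fast_reg_page_list, ib_update_fast_reg_key,
IB_WR_FAST_REG_MR).  The layout of the fast_reg union in ib_send_wr is not
fully shown in the quoted hunks, so the wr.fast_reg field names, the
ib_update_fast_reg_key() signature and the page_list array member below are
assumptions for illustration, not the RFC's final structure:

static int fastreg_bind_example(struct ib_pd *pd, struct ib_qp *qp,
				u64 *dma_pages, int npages, u64 iova)
{
	struct ib_mr *mr;
	struct ib_fast_reg_page_list *pl;
	struct ib_send_wr wr, *bad_wr;
	int i, ret;

	mr = ib_alloc_fast_reg_mr(pd, npages);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	pl = ib_alloc_fast_reg_page_list(pd->device, npages);
	if (IS_ERR(pl)) {
		ib_dereg_mr(mr);
		return PTR_ERR(pl);
	}
	for (i = 0; i < npages; i++)
		pl->page_list[i] = dma_pages[i];	/* assumed member name */

	ib_update_fast_reg_key(mr, 0x5a);		/* assumed: (mr, new 1B key) */

	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_FAST_REG_MR;
	wr.send_flags = IB_SEND_SIGNALED;
	/* assumed union members: iova, length, page list, shift, access, rkey */
	wr.wr.fast_reg.iova_start = iova;
	wr.wr.fast_reg.length = (u64) npages << PAGE_SHIFT;
	wr.wr.fast_reg.page_list = pl;
	wr.wr.fast_reg.page_list_len = npages;
	wr.wr.fast_reg.page_shift = PAGE_SHIFT;
	wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
	wr.wr.fast_reg.rkey = mr->rkey;

	ret = ib_post_send(qp, &wr, &bad_wr);
	if (ret) {
		/* WR never reached the device, so the page list is still ours */
		ib_free_fast_reg_page_list(pl);
		ib_dereg_mr(mr);
	}
	return ret;
}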

Signed-off-by: Steve Wise [EMAIL PROTECTED]
---

 drivers/infiniband/core/verbs.c |   46 
 include/rdma/ib_verbs.h |   76 +++
 2 files changed, 122 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 0504208..0a334b4 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr)
 }
 EXPORT_SYMBOL(ib_dereg_mr);
 
+struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len)
+{
+   struct ib_mr *mr;
+
+   if (!pd->device->alloc_fast_reg_mr)
+       return ERR_PTR(-ENOSYS);
+
+   mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len);
+
+   if (!IS_ERR(mr)) {
+       mr->device  = pd->device;
+       mr->pd      = pd;
+       mr->uobject = NULL;
+       atomic_inc(&pd->usecnt);
+       atomic_set(&mr->usecnt, 0);
+   }
+
+   return mr;
+}
+EXPORT_SYMBOL(ib_alloc_fast_reg_mr);
+
+struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list(
+   struct ib_device *device, int max_page_list_len)
+{
+   struct ib_fast_reg_page_list *page_list;
+
+   if (!device->alloc_fast_reg_page_list)
+       return ERR_PTR(-ENOSYS);
+
+   page_list = device->alloc_fast_reg_page_list(device, max_page_list_len);
+
+   if (!IS_ERR(page_list)) {
+       page_list->device = device;
+       page_list->max_page_list_len = max_page_list_len;
+   }
+
+   return page_list;
+}
+EXPORT_SYMBOL(ib_alloc_fast_reg_page_list);
+
+void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list)
+{
+   page_list->device->free_fast_reg_page_list(page_list);
+}
+EXPORT_SYMBOL(ib_free_fast_reg_page_list);
+
 /* Memory windows */
 
 struct ib_mw *ib_alloc_mw(struct ib_pd *pd)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 911a661..ede0c80 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -106,6 +106,7 @@ enum ib_device_cap_flags {
	IB_DEVICE_UD_IP_CSUM		= (1<<18),
	IB_DEVICE_UD_TSO		= (1<<19),
	IB_DEVICE_SEND_W_INV		= (1<<21),
+	IB_DEVICE_MEM_MGT_EXTENSIONS	= (1<<22),
 };
 
 enum ib_atomic_cap {
@@ -151,6 +152,7 @@ struct ib_device_attr {
int max_srq;
int max_srq_wr;
int max_srq_sge;
+	unsigned int		max_fast_reg_page_list_len;
u16 max_pkeys;
u8  local_ca_ack_delay;
 };
@@ -414,6 +416,8 @@ enum ib_wc_opcode {

[ofa-general] [PATCH RFC v4 2/2] RDMA/cxgb3: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Steve Wise

- set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit.
- set max_fast_reg_page_list_len device attribute.
- add iwch_alloc_fast_reg_mr function.
- add iwch_alloc_fastreg_pbl 
- add iwch_free_fastreg_pbl
- adjust the WQ depth for kernel mode work queues to account for
  fastreg possibly taking 2 WR slots.
- add fastreg_mr work request support.
- add invalidate_mr work request support.
- add send_with_inv and send_with_se_inv work request support.

Signed-off-by: Steve Wise [EMAIL PROTECTED]
---

 drivers/infiniband/hw/cxgb3/cxio_hal.c  |   13 ++-
 drivers/infiniband/hw/cxgb3/cxio_hal.h  |1 
 drivers/infiniband/hw/cxgb3/cxio_wr.h   |   51 ++-
 drivers/infiniband/hw/cxgb3/iwch_provider.c |   78 -
 drivers/infiniband/hw/cxgb3/iwch_qp.c   |  123 +++
 5 files changed, 216 insertions(+), 50 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c 
b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index 3f441fc..6315c77 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -145,7 +145,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, 
u32 qpid)
}
wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
memset(wqe, 0, sizeof(*wqe));
-   build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7);
+   build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7, 3);
	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
	sge_cmd = qpid << 8 | 3;
	wqe->sge_cmd = cpu_to_be64(sge_cmd);
@@ -558,7 +558,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p)
wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
memset(wqe, 0, sizeof(*wqe));
build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 0,
-  T3_CTL_QP_TID, 7);
+  T3_CTL_QP_TID, 7, 3);
	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
	sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3;
	wqe->sge_cmd = cpu_to_be64(sge_cmd);
@@ -674,7 +674,7 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev 
*rdev_p, u32 addr,
	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag,
		       Q_GENBIT(rdev_p->ctrl_qp.wptr,
				T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID,
-		       wr_len);
+		       wr_len, 3);
	if (flag == T3_COMPLETION_FLAG)
		ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID);
len -= 96;
@@ -816,6 +816,13 @@ int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 
stag)
 0, 0);
 }
 
+int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid)
+{
+   *stag = T3_STAG_UNSET;
+   return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, 
+0, 0, 0ULL, 0, 0, 0, 0);
+}
+
 int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr)
 {
struct t3_rdma_init_wr *wqe;
diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h 
b/drivers/infiniband/hw/cxgb3/cxio_hal.h
index 6e128f6..e7659f6 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.h
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h
@@ -165,6 +165,7 @@ int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * 
stag, u32 pdid,
 int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size,
   u32 pbl_addr);
 int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid);
+int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid);
 int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag);
 int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr);
 void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h 
b/drivers/infiniband/hw/cxgb3/cxio_wr.h
index f1a25a8..2a24962 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_wr.h
+++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h
@@ -72,7 +72,8 @@ enum t3_wr_opcode {
T3_WR_BIND = FW_WROPCODE_RI_BIND_MW,
T3_WR_RCV = FW_WROPCODE_RI_RECEIVE,
T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT,
-   T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP
+   T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP,
+   T3_WR_FASTREG = FW_WROPCODE_RI_FASTREGISTER_MR
 } __attribute__ ((packed));
 
 enum t3_rdma_opcode {
@@ -89,7 +90,8 @@ enum t3_rdma_opcode {
T3_FAST_REGISTER,
T3_LOCAL_INV,
T3_QP_MOD,
-   T3_BYPASS
+   T3_BYPASS,  
+   T3_RDMA_READ_REQ_WITH_INV,
 } __attribute__ ((packed));
 
 static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop)
@@ -170,11 +172,46 @@ struct t3_send_wr {
struct t3_sge sgl[T3_MAX_SGE];  /* 4+ */
 };
 
+#define T3_MAX_FASTREG_DEPTH 18
+
+struct t3_fastreg_wr {
+   struct fw_riwrh wrh;/* 0 */
+   union t3_wrid wrid; /* 1 */
+   __be32 page_type_perms; 

Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Tom Tucker

On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
 At 11:33 AM 5/27/2008, Tom Tucker wrote:
 So I think from an NFSRDMA coding perspective it's a wash...
 
 Just to be clear, you're talking about the NFS/RDMA server. However, it's
 pretty much a wash on the client, for different reasons.
 
Tom:

What client side memory registration strategy do you recommend if the
default on the server side is fastreg?

On the performance side we are limited by the min size of the
read/write-chunk element. If the client still gives the server a 4k
chunk, the performance benefit (fewer PDU on the wire) goes away.

Tom

 When posting the WR, We check the fastreg capabilities bit + transport 
 type bit:
 If fastreg is true --
Post FastReg
If iWARP (or with a cap bit read-with-inv-flag)
post rdma read w/ invalidate
 
 ... For iWARP's case, this means rdma-read-w-inv,
 plus rdma-send-w-inv, etc... 
 
 
 Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests
 don't support remote invalidate. At least, the table in RFC5040 (p.22) 
 doesn't:
 
 
 
    -------+------------+--------+-------+--------+------------+--------------
    RDMA   | Message    | Tagged | STag  | Queue  | Invalidate | Message
    Message| Type       | Flag   | and   | Number | STag       | Length
    OpCode |            |        | TO    |        |            | Communicated
           |            |        |       |        |            | between DDP
           |            |        |       |        |            | and RDMAP
    -------+------------+--------+-------+--------+------------+--------------
    0000b  | RDMA Write | 1      | Valid | N/A    | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0001b  | RDMA Read  | 0      | N/A   | 1      | N/A        | Yes
           | Request    |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0010b  | RDMA Read  | 1      | Valid | N/A    | N/A        | Yes
           | Response   |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0011b  | Send       | 0      | N/A   | 0      | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0100b  | Send with  | 0      | N/A   | 0      | Valid      | Yes
           | Invalidate |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0101b  | Send with  | 0      | N/A   | 0      | N/A        | Yes
           | SE         |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0110b  | Send with  | 0      | N/A   | 0      | Valid      | Yes
           | SE and     |        |       |        |            |
           | Invalidate |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0111b  | Terminate  | 0      | N/A   | 2      | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    1000b  |            |
    to     | Reserved   |   Not Specified
    1111b  |            |
    -------+------------+--------------------------------------------------
 
 
 
 I want to take this opportunity to also mention that the RPC/RDMA 
 client-server
 exchange does not support remote-invalidate currently. Because of the multiple
 stags supported by the rpcrdma chunking header, and because the client needs
 to verify that the stags were in fact invalidated, there is significant 
 overhead,
 and the jury is out on that benefit. In fact, I suspect it's a lose at the 
 client.
 
 Tom (Talpey).  
 


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Steve Wise

Tom Tucker wrote:

On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
  

At 11:33 AM 5/27/2008, Tom Tucker wrote:


So I think from an NFSRDMA coding perspective it's a wash...
  

Just to be clear, you're talking about the NFS/RDMA server. However, it's
pretty much a wash on the client, for different reasons.



Tom:

What client side memory registration strategy do you recommend if the
default on the server side is fastreg?

On the performance side we are limited by the min size of the
read/write-chunk element. If the client still gives the server a 4k
chunk, the performance benefit (fewer PDU on the wire) goes away.

Tom

  


I would hope that dma_mr usage will be replaced with fast_reg on both 
the client and the server. 

When posting the WR, We check the fastreg capabilities bit + transport 
type bit:

If fastreg is true --
  Post FastReg
  If iWARP (or with a cap bit read-with-inv-flag)
  post rdma read w/ invalidate
  
... For iWARP's case, this means rdma-read-w-inv,
plus rdma-send-w-inv, etc... 
  

Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests
don't support remote invalidate. At least, the table in RFC5040 (p.22) doesn't:



    -------+------------+--------+-------+--------+------------+--------------
    RDMA   | Message    | Tagged | STag  | Queue  | Invalidate | Message
    Message| Type       | Flag   | and   | Number | STag       | Length
    OpCode |            |        | TO    |        |            | Communicated
           |            |        |       |        |            | between DDP
           |            |        |       |        |            | and RDMAP
    -------+------------+--------+-------+--------+------------+--------------
    0000b  | RDMA Write | 1      | Valid | N/A    | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0001b  | RDMA Read  | 0      | N/A   | 1      | N/A        | Yes
           | Request    |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0010b  | RDMA Read  | 1      | Valid | N/A    | N/A        | Yes
           | Response   |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0011b  | Send       | 0      | N/A   | 0      | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0100b  | Send with  | 0      | N/A   | 0      | Valid      | Yes
           | Invalidate |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0101b  | Send with  | 0      | N/A   | 0      | N/A        | Yes
           | SE         |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0110b  | Send with  | 0      | N/A   | 0      | Valid      | Yes
           | SE and     |        |       |        |            |
           | Invalidate |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    0111b  | Terminate  | 0      | N/A   | 2      | N/A        | Yes
           |            |        |       |        |            |
    -------+------------+--------+-------+--------+------------+--------------
    1000b  |            |
    to     | Reserved   |   Not Specified
    1111b  |            |
    -------+------------+--------------------------------------------------



I want to take this opportunity to also mention that the RPC/RDMA client-server
exchange does not support remote-invalidate currently. Because of the multiple
stags supported by the rpcrdma chunking header, and because the client needs
to verify that the stags were in fact invalidated, there is significant 
overhead,
and the jury is out on that benefit. In fact, I suspect it's a lose at the 
client.

Tom (Talpey).  




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Talpey, Thomas
At 02:58 PM 5/27/2008, Tom Tucker wrote:

On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
 At 11:33 AM 5/27/2008, Tom Tucker wrote:
 So I think from an NFSRDMA coding perspective it's a wash...
 
 Just to be clear, you're talking about the NFS/RDMA server. However, it's
 pretty much a wash on the client, for different reasons.
 
Tom:

What client side memory registration strategy do you recommend if the
default on the server side is fastreg?

Whatever is fastest and safest. Given that the client and server won't
necessarily be using the same hardware, nor the same kernel for that
matter, I don't think we can or should legislate it.

That said, I am hopeful that fastreg does turn out to be fast and
therefore will become the only logical choice for the NFS/RDMA Linux
client. But the future Linux client is only one such system. I cannot
speak for others.

Tom.


On the performance side we are limited by the min size of the
read/write-chunk element. If the client still gives the server a 4k
chunk, the performance benefit (fewer PDU on the wire) goes away.



[ofa-general] Re: mthca MR attrs userspace change

2008-05-27 Thread akepner
On Tue, May 27, 2008 at 11:33:55AM -0700, Roland Dreier wrote:
 ...
 Actually this tricked me... the kernel ABI didn't get bumped so there
 was no reason to bump the libmthca ABI.  I'll actually use this slightly
 simpler patch:

I was wondering about that... FWIW, the patch (which I deleted 
below) looked good to me.

-- 
Arthur



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support

2008-05-27 Thread Talpey, Thomas
At 02:40 PM 5/27/2008, Steve Wise wrote:
Talpey, Thomas wrote:
 At 12:58 PM 5/27/2008, Felix Marti wrote:
   
 RDMA Read with Local Invalidate does not affect the wire. The 'must
 invalidate' state is kept in the RNIC that issues the RDMA Read
 Request...
 

 Aha, okay that was not clear to me. What information does the RNIC use
 to line up the arrival of the RDMA Read response with the must invalidate
 state? 

The rnic already tracks outstanding read requests.  It now also will 
track the local stag to invalidate when the read completes.

Ah - okay, so the stag that actually gets invalidated was provided with
the RDMA Read request posting, and is not necessarily the stag that
arrived in the peer's RDMA Read response. That helps.

What happens if the upper layer gives up and invalidates the stag itself,
and the peer's RDMA Read response arrives later? Nothing bad, I assume,
and the peer's response is denied?


 Also, how does the RNIC signal whether the invalidation actually
 occurred, so the upper layer can defend itself from attack?

   

The stag is guaranteed to be in the invalid state by the time the app 
reaps the read-inv-local work completion...

Ok, given my correct understanding of the source of the stag above.

Tom.



[ofa-general] Re: mthca MR attrs userspace change

2008-05-27 Thread akepner
On Tue, May 27, 2008 at 03:18:25PM -0400, Pete Wyckoff wrote:
 
 ... Only CQ changes need dmasync, apparently.

Right.

-- 
Arthur


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support

2008-05-27 Thread Steve Wise

Talpey, Thomas wrote:

At 02:40 PM 5/27/2008, Steve Wise wrote:
  

Talpey, Thomas wrote:


At 12:58 PM 5/27/2008, Felix Marti wrote:
  
  

RDMA Read with Local Invalidate does not affect the wire. The 'must
invalidate' state is kept in the RNIC that issues the RDMA Read
Request...



Aha, okay that was not clear to me. What information does the RNIC use
to line up the arrival of the RDMA Read response with the must invalidate
state? 
  
The rnic already tracks outstanding read requests.  It now also will 
track the local stag to invalidate when the read completes.



Ah - okay, so the stag that actually gets invalidated was provided with
the RDMA Read request posting, and is not necessarily the stag that
arrived in the peer's RDMA Read response. That helps.

What happens if the upper layer gives up and invalidates the stag itself,
and the peer's RDMA Read response arrives later? Nothing bad, I assume,
and the peer's response is denied?

  


It behaves just like any other tagged message arriving and the target 
stag is invalid.  The connection is torn down via an RDMAP TERMINATE...





Also, how does the RNIC signal whether the invalidation actually
occurred, so the upper layer can defend itself from attack?

  
  
The stag is guaranteed to be in the invalid state by the time the app 
reaps the read-inv-local work completion...



Ok, given my correct understanding of the source of the stag above.

Tom.
  




Re: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode

2008-05-27 Thread Roland Dreier
thanks, applied for 2.6.27...

  +/*
  + * we rely on this condition when copying small skbs and we
  + * pass ownership of the first fragment only.
  + */
  +	if (SKB_TSHOLD > IPOIB_CM_HEAD_SIZE) {
  +		printk("%s: SKB_TSHOLD(%d) must not be larger than %d\n",
  +		       THIS_MODULE->name, SKB_TSHOLD, IPOIB_CM_HEAD_SIZE);
  +		return -EINVAL;
  +	}

I changed this to a build bug, to avoid waiting until runtime to notice
this problem:

+   /*
+* When copying small received packets, we only copy from the
+* linear data part of the SKB, so we rely on this condition.
+*/
+	BUILD_BUG_ON(IPOIB_CM_COPYBREAK > IPOIB_CM_HEAD_SIZE);


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support

2008-05-27 Thread Steve Wise




Talpey, Thomas wrote:

  At 03:59 PM 5/27/2008, Steve Wise wrote:
  
  
Talpey, Thomas wrote:


  What happens if the upper layer gives up and invalidates the stag itself,
and the peer's RDMA Read response arrives later? Nothing bad, I assume,
and the peer's response is denied?

  
  

It behaves just like any other tagged message arriving and the target 
stag is invalid.  The connection is torn down via an RDMAP TERMINATE...

  
  
I was wondering more about the dangling stag reference that the original work
request carried. Normally, it would reference the still-valid stag, but if that
stag was torn down (causing the invalidation to point to nothing), or worse,
re-bound (causing it to point at something else!), then it's a possible issue?

Sorry to seem paranoid here. Storage is pretty sensitive to silent data
corruption avenues. Because they always find a way to happen.

  


These WRs allow the user to shoot himself in the foot.  You're describing
one such case...

Steve.




[ofa-general] [ANNOUNCE] libmthca 1.0.5 released

2008-05-27 Thread Roland Dreier
libmthca is a userspace driver for Mellanox InfiniBand HCAs.  It is a
plug-in module for libibverbs that allows programs to use Mellanox
hardware directly from userspace.

A new stable release, libmthca 1.0.5, is available from

http://www.openfabrics.org/downloads/mlx4/libmthca-1.0.5.tar.gz

with sha1sum

a68b1de47d320546c7bcc92bfa9c482f7d74fac1  
/data/home/roland/libmthca-1.0.5.tar.gz

I also tagged the 1.0.5 release of libmthca and pushed it out to my
git tree on kernel.org:

git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git

(the name of the tag is libmthca-1.0.5).

Builds for the Ubuntu 7.10 and 8.04 releases will be available by
adding the lines (replacing hardy by gutsy if needed)

deb http://ppa.launchpad.net/roland-digitalvampire/ubuntu hardy main
deb-src http://ppa.launchpad.net/roland-digitalvampire/ubuntu hardy main

to your /etc/sources.list file, and updated Debian and Fedora packages
will work their way into the main archives.

This release fixes several bugs and adds support for the new kernel
interface for mr attrs (which will be shipped in 2.6.26).

The complete list of changes since 1.0.4 is:

Eli Cohen (3):
      Ensure all receive WQEs are in memory before linking to chain
  Remove checks for srq-first_free  0
  IB/ib_mthca: Pre-link receive WQEs in Tavor mode

Jack Morgenstein (1):
  Clear context struct at allocation time

Michael S. Tsirkin (2):
  Fix posting 255 recv WRs for Tavor
  Set cleaned CQEs back to HW ownership when cleaning CQ

Roland Dreier (21):
  Fix paths in Debian install files for libibverbs 1.1
  Update Debian changelog
  debian/rules: Remove DEB_DH_STRIP_ARGS
  Fix handling of send CQE with error for QPs connected to SRQ
  Add missing wmb() in mthca_tavor_post_send()
  Remove deprecated ${Source-Version} from debian/control
  Clean up NVALGRIND comment in config.h.in
  Fix Valgrind annotations so they can actually be built
  Remove ibv_driver_init from linker version script
  Fix spec file License: tag
  Mark driver file in sysconfdir with %config
  Update Debian policy version to 3.7.3
  Fix Valgrind false positives in mthca_create_cq() and mthca_create_srq()
  Add debian/watch file
  Update Debian build to avoid setting RPATH
  Change openib.org URLs to openfabrics.org URLs
  Fix CQ cleanup when QP is destroyed
  Update libmthca to handle new kernel ABI
  Include spec file changes from Fedora CVS
  Remove %config tag from mthca.driver file
  Roll libmthca-1.0.5 release


Re: [ofa-general] OpenSM?

2008-05-27 Thread Yevgeny Kliteynik

Charles,

Ira Weiny wrote:

Charles,

Here at LLNL we have been running OpenSM for some time.  Thus far we are very
happy with its performance.  Our largest cluster is 1152 nodes and OpenSM can
bring it up (not counting boot time) in less than a minute.


OpenSM is successfully running on some large clusters with 4-5K nodes.
It takes about 2-3 minutes to bring up such clusters.


Here are some details.

We are running v3.1.10 of OpenSM with some minor modifications (mostly patches
which have been submitted upstream and been accepted by Sasha but are not yet
in a release.)

Our clusters are all Fat-tree topologies.

We have a node which is more or less dedicated to running OpenSM.  We have some
other monitoring software running on it, but OpenSM can utilize the CPU/Memory
if it needs to.

   A) On our large clusters this node is a 4 socket, dual core (8 cores
   total) Opteron running at 2.4Gig with 16Gig of memory.  I don't believe
   OpenSM needs this much but the nodes were built all the same so this is
   what it got.

   B) On one of our smaller clusters (128 nodes) OpenSM is running on a
   dual socket, single core (2 core) 2.4Gig Opteron nodes with 2Gig of
   memory.  We have not seen any issues with this cluster and OpenSM.

We run with the up/down algorithm, ftree has not panned out for us yet.  I
can't say how that would compare to the Cisco algorithms.


If the cluster topology is a fat-tree, then there are ftree and up/down routing engines.
Ftree would be a good choice if you need LMC=0 (plus if the topology complies
with certain fat-tree rules).  For any other tree, or for LMC > 0, up/down should
work.

-- Yevgeny


In short, OpenSM should work just fine on your cluster.

Hope this helps,
Ira


On Tue, 27 May 2008 11:15:14 -0400
Charles Taylor [EMAIL PROTECTED] wrote:

We have a 400 node IB cluster.  We are running an embedded SM in
failover mode on our TS270/Cisco7008 core switches.  Lately we have
been seeing problems with LID assignment when rebooting nodes (see log
messages below).  It is also taking far too long for LIDs to be
assigned, as it takes on the order of minutes for the ports to
transition to ACTIVE.

This seems like a bug to us and we are considering switching to
OpenSM on a host.  I'm wondering about experience with running
OpenSM for medium to large clusters (Fat Tree) and what resources
(memory/CPU) we should plan on for the host node.

Thanks,

Charlie Taylor
UF HPC Center

May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ** NEW SWEEP
May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An existing IB node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed
May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59
May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed
May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering removed ports
May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ** NEW SWEEP
May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194
May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed
May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering new ports
May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change
May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2
May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: Failed to negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad status 0x1c
May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new IB node 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0
May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:

[ofa-general] Re: mthca MR attrs userspace change

2008-05-27 Thread Roland Dreier
libmthca-1.0.5 is now in Fedora 9 proposed updates --
http://koji.fedoraproject.org/koji/buildinfo?buildID=50682

I'm not sure exactly how it makes it into real F-9.

 - R.
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented?

2008-05-27 Thread Yicheng Jia
Thanks for your reply. I'm using one CQ for all the WRs. Do you know why
there's no ARM-N support in the MLX drivers? My concern is performance:
the overhead of the software poll_cq loop is quite significant when there are
many small pieces of data to be transferred on both the
sender and receiver sides. For instance, on the sender the data I have are
1k, 1k, 2k, 1k..., and on the receiver side the data sizes and blocks are the
same, 1k, 1k, 2k, 1k... Do you have a good solution for this kind of
problem?

Best,
Yicheng




Dotan Barak [EMAIL PROTECTED]
05/23/2008 01:27 PM

To:      Yicheng Jia [EMAIL PROTECTED]
cc:      general@lists.openfabrics.org
Subject: Re: [ofa-general] MLX HCA: CQ request notification for multiple
         completions not implemented?

Hi.

Yicheng Jia wrote:

 Hi Folks,

 I'm trying to use CQ event notification for multiple completions
 (ARM_N), according to the Mellanox Lx III user manual, for scatter/gather
 RDMA. However, I couldn't find it in the current MLX driver. It seems to me
 that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are
 multiple work requests, I have to use poll_cq to synchronously wait
 until all the requests are done, is that correct? Is there a way to do
 an asynchronous multiple send by subscribing for an ARM_N event?
You are right: the low-level drivers for Mellanox devices don't support
ARM-N (this feature is supported by the devices, but it wasn't implemented in
the low-level drivers).

You are right that in order to read all of the completions you need to use
poll_cq.

By the way: do you really have to create a completion for every WR?
(If you are using one QP, this may solve your problem.)

Dotan
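
The usual pattern when only ARM_NEXT/ARM_SOLICIT are available is to re-arm
the CQ once per completion event and then drain it with poll_cq, while
reducing the number of completions by posting most send WRs without
IBV_SEND_SIGNALED. A rough userspace sketch with libibverbs follows; the
helper names drain_cq and wait_and_drain are illustrative, not part of any
driver:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Drain everything currently in the CQ; returns the number of completions. */
static int drain_cq(struct ibv_cq *cq)
{
	struct ibv_wc wc;
	int n = 0;

	while (ibv_poll_cq(cq, 1, &wc) > 0) {
		if (wc.status != IBV_WC_SUCCESS)
			fprintf(stderr, "WR %llu failed with status %d\n",
				(unsigned long long) wc.wr_id, wc.status);
		++n;
	}
	return n;
}

/* One iteration of the event loop: sleep until the CQ fires, ack the event,
 * re-arm (ARM_NEXT), then poll until empty.  Re-arming before the drain means
 * completions that slip in are either polled now or wake us up next time. */
static int wait_and_drain(struct ibv_comp_channel *chan)
{
	struct ibv_cq *cq;
	void *cq_ctx;

	if (ibv_get_cq_event(chan, &cq, &cq_ctx))
		return -1;
	ibv_ack_cq_events(cq, 1);

	if (ibv_req_notify_cq(cq, 0))
		return -1;

	return drain_cq(cq);
}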

_
Scanned by IBM Email Security Management Services powered by MessageLabs. 
For more information please visit http://www.ers.ibm.com
_



_
Scanned by IBM Email Security Management Services powered by MessageLabs. For 
more information please visit http://www.ers.ibm.com

general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Steve Wise

 enum ib_send_flags {
@@ -676,6 +683,19 @@ struct ib_send_wr {
u16 pkey_index; /* valid for GSI only */
u8  port_num;   /* valid for DR SMPs on switch only */
} ud;
+   struct {
+       u64                             iova_start;
+       struct ib_mr                    *mr;
+       struct ib_fast_reg_page_list    *page_list;
+       unsigned int                    page_shift;
+       unsigned int                    page_list_len;
+       unsigned int                    first_byte_offset;
+       u32                             length;
+       int                             access_flags;
+   } fast_reg;
+   struct {
+       struct ib_mr                    *mr;
+   } local_inv;
} wr;
 };


Ok, while writing a test case for all this jazz, I see that passing the 
struct ib_mr pointer to both IB_WR_FAST_REGISTER_MR and 
IB_WR_INVALIDATE_MR is perhaps bad.  Consider a chain of WRs: 
INVALIDATE_MR linked to a FAST_REGISTER_MR and passed to the provider 
via a single ib_post_send() call.  You can't do that if you want to bump
the key value between the invalidate and the fast_reg (registering with the
new key), which is probably what apps want to do.  Under this proposed API
you are forced to post the two WRs separately and call
ib_update_fast_reg_key() in between the ib_post_send() calls.


Perhaps we should just pass in a u32 rkey for both WRs instead of the mr 
pointer?  Then the code could put the old rkey in the invalidate WR, and 
the newly updated rkey in the fast_reg WR and chain the two together and 
do a single post.


I think this is the way to go: change the fast_reg and local_inv unions 
to take a u32 rkey instead of a struct ib_mr *mr.


Thoughts?

Steve.
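
To make the rkey-based alternative concrete, here is a rough consumer-side
sketch.  The wr.fast_reg.rkey and wr.local_inv.rkey fields are the proposed
replacements for the struct ib_mr pointers and do not exist in the patch as
posted; ib_update_fast_reg_key() and the opcode/access-flag names follow the
RFC discussion.  Treat it as an illustration of the intended posting
sequence, not the final API:

#include <linux/string.h>
#include <rdma/ib_verbs.h>

/* Invalidate the stale mapping and fast-register a new one with a bumped
 * key, chained in a single ib_post_send(). */
static int reregister_fast_reg_mr(struct ib_qp *qp, struct ib_mr *mr,
				  struct ib_fast_reg_page_list *page_list,
				  unsigned int npages, u64 iova, u32 length)
{
	struct ib_send_wr inv_wr, frmr_wr, *bad_wr;
	u32 old_rkey = mr->rkey;

	memset(&inv_wr, 0, sizeof inv_wr);
	inv_wr.opcode = IB_WR_INVALIDATE_MR;
	inv_wr.wr.local_inv.rkey = old_rkey;	/* key to invalidate */
	inv_wr.next = &frmr_wr;

	/* Consumer bumps the 8-bit key portion before re-registering. */
	ib_update_fast_reg_key(mr, (u8) (old_rkey & 0xff) + 1);

	memset(&frmr_wr, 0, sizeof frmr_wr);
	frmr_wr.opcode = IB_WR_FAST_REGISTER_MR;
	frmr_wr.send_flags = IB_SEND_SIGNALED;
	frmr_wr.wr.fast_reg.rkey = mr->rkey;	/* newly bumped key */
	frmr_wr.wr.fast_reg.iova_start = iova;
	frmr_wr.wr.fast_reg.page_list = page_list;
	frmr_wr.wr.fast_reg.page_list_len = npages;
	frmr_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
	frmr_wr.wr.fast_reg.first_byte_offset = 0;
	frmr_wr.wr.fast_reg.length = length;
	frmr_wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
					   IB_ACCESS_REMOTE_READ |
					   IB_ACCESS_REMOTE_WRITE;

	return ib_post_send(qp, &inv_wr, &bad_wr);
}

With the rkey carried in the WR itself, nothing in the chain depends on when
the provider reads the mr struct, which is the ordering problem described
above.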
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Roland Dreier
  Perhaps we should just pass in a u32 rkey for both WRs instead of the
  mr pointer?  Then the code could put the old rkey in the invalidate
  WR, and the newly updated rkey in the fast_reg WR and chain the two
  together and do a single post.

Makes sense to me.  The only thing I would worry about is whether some
device needs the actual mr struct pointer to post the work request, but
at least mlx4 (and I guess cxgb3) doesn't, and I don't see a good reason why
another device would.  Let's go for it.

 - R.
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general




[ofa-general] Re: [PATCH v2 08/13] QLogic VNIC: sysfs interface implementation for the driver

2008-05-27 Thread Roland Dreier
  +ssize_t vnic_create_primary(struct device *dev,
  +                            struct device_attribute *dev_attr, const char *buf,
  +                            size_t count)

  +ssize_t vnic_create_secondary(struct device *dev,
  +                              struct device_attribute *dev_attr,
  +                              const char *buf, size_t count)

  +ssize_t vnic_delete(struct device *dev, struct device_attribute *dev_attr,
  +                    const char *buf, size_t count)

These are all only referenced from a sysfs attribute defined in the same
file, so they can be made static (and don't need extern declarations in
a header file).
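
For illustration only (the attribute name is taken from the patch and the
handler body is a stub), the static version would look roughly like this,
with the DEVICE_ATTR() table in the same file as its only reference:

#include <linux/device.h>

/* Referenced only by the DEVICE_ATTR() below, so no extern declaration
 * in a header is needed. */
static ssize_t vnic_create_primary(struct device *dev,
				   struct device_attribute *dev_attr,
				   const char *buf, size_t count)
{
	/* ... parse buf and create the primary interface here ... */
	return count;
}

static DEVICE_ATTR(create_primary, S_IWUSR, NULL, vnic_create_primary);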
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx

2008-05-27 Thread Roland Dreier
  +void viport_disconnect(struct viport *viport)
  +{
  +	VIPORT_FUNCTION("viport_disconnect()\n");
  +	viport->disconnect = 1;
  +	viport_failure(viport);
  +	wait_event(viport->disconnect_queue, viport->disconnect == 0);
  +}
  +
  +void viport_free(struct viport *viport)
  +{
  +	VIPORT_FUNCTION("viport_free()\n");
  +	viport_disconnect(viport);	/* NOTE: this can sleep */

There are no other calls to viport_disconnect() that I can see, so it
can be made static (and the declaration in vnic_viport.h can be dropped).
In fact, given how small the function is and the fact that it has only a
single call site, it might be easier just to merge it into
viport_free().  But that's a matter of taste.

 - R.
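
A sketch of the merged version, based only on the fragment quoted above
(the rest of viport_free() and the driver's viport definitions are assumed
from the patch):

static void viport_free(struct viport *viport)
{
	VIPORT_FUNCTION("viport_free()\n");

	/* Inlined from viport_disconnect(); note that this can sleep. */
	viport->disconnect = 1;
	viport_failure(viport);
	wait_event(viport->disconnect_queue, viport->disconnect == 0);

	/* ... free the remaining viport resources as before ... */
}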
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general