RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

2007-09-27 Thread Kanevsky, Arkady
Sean,
What is the model for how a client connects, say for iSCSI, when client
and server both support iWARP and 10GbE or 1GbE, and would like to set
up the most performant connection for the ULP?
Thanks,

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.   Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Sean Hefty [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, September 27, 2007 2:39 PM
 To: Steve Wise
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 general@lists.openfabrics.org; [EMAIL PROTECTED]
 Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support 
 iwarp-only interfaces to avoid 4-tuple conflicts.
 
  The sysadmin creates alias interfaces, used only for iwarp, of the
  form devname:iw* where devname is the native interface name (eg
  eth0) of the iwarp netdev device.  The alias label can be anything
  starting with iw.  The iw immediately after the ':' is the key used
  by the iw_cxgb3 driver.
 
 I'm still not sure about this, but haven't come up with anything
 better myself.  And if there's a good chance of other RNICs needing
 the same support, I'd rather see the common code separated out, even
 if just encapsulated within this module for easy re-use.
 
 As for the code, I have a couple of questions about whether a deadlock
 and a race condition are possible, plus a few minor comments.
 
 +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
 +{
 +    struct iwch_addrlist *addr;
 +
 +    addr = kmalloc(sizeof *addr, GFP_KERNEL);
 +    if (!addr) {
 +        printk(KERN_ERR MOD "%s - failed to alloc memory!\n",
 +               __FUNCTION__);
 +        return;
 +    }
 +    addr->ifa = ifa;
 +    mutex_lock(&rnicp->mutex);
 +    list_add_tail(&addr->entry, &rnicp->addrlist);
 +    mutex_unlock(&rnicp->mutex);
 +}
 
 Should this return success/failure?
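 
 A minimal sketch of that variant (illustrative only, assuming the
 patch's surrounding types) might be:
 
     static int insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
     {
         struct iwch_addrlist *addr;
 
         addr = kmalloc(sizeof *addr, GFP_KERNEL);
         if (!addr)
             return -ENOMEM;    /* caller can then skip the listen update */
         addr->ifa = ifa;
         mutex_lock(&rnicp->mutex);
         list_add_tail(&addr->entry, &rnicp->addrlist);
         mutex_unlock(&rnicp->mutex);
         return 0;
     }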
 
 +static int nb_callback(struct notifier_block *self, unsigned long event,
 +                       void *ctx)
 +{
 +    struct in_ifaddr *ifa = ctx;
 +    struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb);
 +
 +    PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event);
 +
 +    switch (event) {
 +    case NETDEV_UP:
 +        if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) &&
 +            is_iwarp_label(ifa->ifa_label)) {
 +            PDBG("%s label %s addr 0x%x added\n",
 +                 __FUNCTION__, ifa->ifa_label, ifa->ifa_address);
 +            insert_ifa(rnicp, ifa);
 +            iwch_listeners_add_addr(rnicp, ifa->ifa_address);
 
 If insert_ifa() fails, what will iwch_listeners_add_addr() 
 do?  (I'm not easily seeing the relationship between the 
 address list and the listen list at this point.)
 
 +        }
 +        break;
 +    case NETDEV_DOWN:
 +        if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) &&
 +            is_iwarp_label(ifa->ifa_label)) {
 +            PDBG("%s label %s addr 0x%x deleted\n",
 +                 __FUNCTION__, ifa->ifa_label, ifa->ifa_address);
 +            iwch_listeners_del_addr(rnicp, ifa->ifa_address);
 +            remove_ifa(rnicp, ifa);
 +        }
 +        break;
 +    default:
 +        break;
 +    }
 +    return 0;
 +}
  +
 +static void delete_addrlist(struct iwch_dev *rnicp)
 +{
 +    struct iwch_addrlist *addr, *tmp;
 +
 +    mutex_lock(&rnicp->mutex);
 +    list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) {
 +        list_del(&addr->entry);
 +        kfree(addr);
 +    }
 +    mutex_unlock(&rnicp->mutex);
 +}
  +
 +static void populate_addrlist(struct iwch_dev *rnicp)
 +{
 +    int i;
 +    struct in_device *indev;
 +
 +    for (i = 0; i < rnicp->rdev.port_info.nports; i++) {
 +        indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]);
 +        if (!indev)
 +            continue;
 +        for_ifa(indev)
 +            if (is_iwarp_label(ifa->ifa_label)) {
 +                PDBG("%s label %s addr 0x%x added\n",
 +                     __FUNCTION__, ifa->ifa_label,
 +                     ifa->ifa_address);
 +                insert_ifa(rnicp, ifa);
 +            }
 +        endfor_ifa(indev);
 +    }
 +}
  +
  static void rnic_init(struct iwch_dev *rnicp)
  {
      PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp);
 @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r
      idr_init(&rnicp->qpidr);
      idr_init(&rnicp->mmidr);
      spin_lock_init(&rnicp->lock);
 +    INIT_LIST_HEAD(&rnicp->addrlist);
 +    INIT_LIST_HEAD(&rnicp->listen_eps);
 +    mutex_init(&rnicp->mutex);
 +    rnicp->nb.notifier_call = nb_callback;
 +    populate_addrlist(rnicp);
 +    register_inetaddr_notifier(&rnicp->nb);
 
      rnicp->attr.vendor_id = 0x168;
      rnicp->attr.vendor_part_id = 7;
 @@ -148,6 +271,8 @@ static

RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

2007-09-24 Thread Glenn Grundstrom
I'm sure I saw a previous email in this thread that suggested using
a userspace library to open a socket in the shared port space.  It
seems that suggestion was dropped without explanation.  Does anyone
know why?

Thanks,
Glenn.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Steve Wise
Sent: Sunday, September 23, 2007 3:37 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED];
general@lists.openfabrics.org
Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only
interfaces to avoid 4-tuple conflicts.


iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

Version 3:

- don't use list_del_init() where list_del() is sufficient.

Version 2:

- added a per-device mutex for the address and listening endpoints
lists.

- wait for all replies if sending multiple passive_open requests to
rnic.

- log warning if no addresses are available when a listen is issued.

- tested

---

Design:

The sysadmin creates alias interfaces, used only for iwarp, of the form
devname:iw* where devname is the native interface name (eg eth0) of the
iwarp netdev device.  The alias label can be anything starting with iw.
The iw immediately after the ':' is the key used by the iw_cxgb3 driver.

EG:
ifconfig eth0 192.168.70.123 up
ifconfig eth0:iw1 192.168.71.123 up
ifconfig eth0:iw2 192.168.72.123 up

In the above example, 192.168.70/24 is for TCP traffic, while
192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use.
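
A minimal sketch of the label check this scheme implies (the actual
is_iwarp_label() is not shown in this excerpt, so its exact shape here
is an assumption):

    static int is_iwarp_label(const char *label)
    {
        const char *colon = strchr(label, ':');

        /* match "devname:iw*": the two characters after ':' must be "iw" */
        return colon && colon[1] == 'i' && colon[2] == 'w';
    }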

The rdma-only interface must be on its own IP subnet.  This allows
routing all rdma traffic onto this interface.
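
For example (illustrative only, not part of the patch), a test ULP such
as rping can then simply target an address on the iwarp-only subnet,
and routing puts all of its traffic on the rdma path:

rping -s -a 192.168.71.123 -p 9999
rping -c -a 192.168.71.123 -p 9999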

The iWARP driver must translate all listens on address 0.0.0.0 into the
set of rdma-only IP addresses for the device in question.  This prevents
incoming connect requests to the TCP IP addresses from going up the rdma
stack.

Implementation Details:

- The iw_cxgb3 driver registers for inetaddr events via
register_inetaddr_notifier().  This allows tracking the iwarp-only
addresses/subnets as they get added and deleted.  The iwarp driver
maintains a list of the current iwarp-only addresses.

- The iw_cxgb3 driver builds the initial list of iwarp-only addresses for
its devices at module insert time.  This is needed because the inetaddr
notifier does not replay address-add events to a newly registered
listener, so the driver must seed the list itself at module load time.

- When a listen is done on address 0.0.0.0, the iw_cxgb3 driver must
translate that into a set of listens on the iwarp-only addresses.  This
is implemented by maintaining a list of stid/addr entries per listening
endpoint (see the struct sketch after this list).

- When a new iwarp-only address is added or removed, the iw_cxgb3 driver
must traverse the set of listening endpoints and update them accordingly
(see the add-path sketch after this list).  This allows an application
to bind to 0.0.0.0 before the iwarp-only interfaces are configured.  It
also allows changing the iwarp-only set of addresses and getting the
expected behavior for apps already bound to 0.0.0.0.  This is done by
maintaining a list of listening endpoints off the device struct.

- The address list, the listening endpoint list, and each list of
stid/addrs in use per listening endpoint are all protected via a mutex
per iw_cxgb3 device.
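
A sketch of the per-listener stid/addr bookkeeping mentioned above
(field names are assumed for illustration; the actual struct lives in
the unquoted parts of the patch):

    struct iwch_listen_addr {
        struct list_head entry;    /* linked off the listening endpoint */
        u32 stid;                  /* hardware server tid for this listen */
        __be32 addr;               /* the iwarp-only IP address */
    };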
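
And a minimal sketch of the add path run when a new address appears
(iwch_listen_on_addr() is a stand-in for the real stid allocation and
passive-open request, which are not quoted here; the ep->entry linkage
is also assumed):

    static void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr)
    {
        struct iwch_listen_ep *ep;

        mutex_lock(&rnicp->mutex);
        list_for_each_entry(ep, &rnicp->listen_eps, entry)
            iwch_listen_on_addr(ep, addr);  /* one listen per endpoint */
        mutex_unlock(&rnicp->mutex);
    }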

Signed-off-by: Steve Wise [EMAIL PROTECTED]
---

 drivers/infiniband/hw/cxgb3/iwch.c    |  125
 drivers/infiniband/hw/cxgb3/iwch.h    |   11
 drivers/infiniband/hw/cxgb3/iwch_cm.c |  259
 drivers/infiniband/hw/cxgb3/iwch_cm.h |   15
 4 files changed, 360 insertions(+), 50 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c
b/drivers/infiniband/hw/cxgb3/iwch.c
index 0315c9d..d81d46e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -63,6 +63,123 @@ struct cxgb3_client t3c_client = {
 static LIST_HEAD(dev_list);
 static DEFINE_MUTEX(dev_mutex);
 
+static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
+{
+    struct iwch_addrlist *addr;
+
+    addr = kmalloc(sizeof *addr, GFP_KERNEL);
+    if (!addr) {
+        printk(KERN_ERR MOD "%s - failed to alloc memory!\n",
+               __FUNCTION__);
+        return;
+    }
+    addr->ifa = ifa;
+    mutex_lock(&rnicp->mutex);
+    list_add_tail(&addr->entry, &rnicp->addrlist);
+    mutex_unlock(&rnicp->mutex);
+}
+
+static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
+{
+    struct iwch_addrlist *addr, *tmp;
+
+    mutex_lock(&rnicp->mutex);
+    list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) {
+        if (addr->ifa == ifa) {
+            list_del(&addr->entry);
+            kfree(addr);
+            goto out;
+        }
+    }
+out:
+    mutex_unlock(&rnicp->mutex);
+}
+
+static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev)
+{
+    int i;
+
+    for (i = 0; i < rnicp->rdev.port_info.nports; i++)
+    

Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

2007-09-24 Thread Roland Dreier
  I'm sure I saw a previous email in this thread that suggested using
  a userspace library to open a socket in the shared port space.  It
  seems that suggestion was dropped without explanation.  Does anyone
  know why?

Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, iSER, etc).

Does the NetEffect NIC have the same issue as cxgb3 here?  What are
your thoughts on how to handle this?

 - R.


RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

2007-09-24 Thread Tom Tucker
On Mon, 2007-09-24 at 16:30 -0500, Glenn Grundstrom wrote:
  
  -Original Message-
  From: Roland Dreier [mailto:[EMAIL PROTECTED] 
  Sent: Monday, September 24, 2007 2:33 PM
  To: Glenn Grundstrom
  Cc: Steve Wise; [EMAIL PROTECTED]; general@lists.openfabrics.org
  Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support 
  iwarp-only interfaces to avoid 4-tuple conflicts.
  
    I'm sure I saw a previous email in this thread that suggested
    using a userspace library to open a socket in the shared port
    space.  It seems that suggestion was dropped without explanation.
    Does anyone know why?
  
  Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, iSER, etc).
 
 The kernel apps could open a Linux TCP socket and create an RDMA
 connection alongside it.  Both calls are standard, architected Linux
 kernel routines.

This approach was NAK'd by David Miller and others...

 Doesn't NFSoRDMA already open a TCP socket and another for RDMA
 traffic (ports 2049 & 2050 if I remember correctly)?

The NFS RDMA transport driver does not open a socket for the RDMA
connection. It uses a different port in order to allow both TCP and RDMA
mounts to the same filer.

 I currently
 don't know if iSER, RDS, etc. already do the same thing, but if they
 don't, they probably could very easily.
 

Woe be to those who do so...

  
  Does the NetEffect NIC have the same issue as cxgb3 here?  What are
  your thoughts on how to handle this?
 
 Yes, the NetEffect RNIC will have the same issue as Chelsio.  And all
 future RNICs that support a unified TCP address space with Linux will
 as well.
 
 Steve has put a lot of thought and energy into the problem, but I
 don't think users & admins will be very happy with us in the long run.
 

Agreed.

 In summary, short of having the rdma_cm share the kernel port space,
 I'd like to see the equivalent in userspace and have the kernel apps
 handle the issue in a similar way, as described above.  There are a
 few technical issues to work through (like passing the userspace IP
 address to the kernel),

This just moves the socket creation to code that is outside the purview
of the kernel maintainers.  The exchange of the resulting 4-tuple with
the kernel module, however, is back in the kernel and within the
maintainers' control and responsibility.  In my view, anything like this
will be seen as an attempt to sneak code into the kernel that the
maintainer has already vehemently rejected.  That will make people angry
and damage the cooperative working relationship we are trying to build.

 but I think we can solve that just like other information that
 gets passed from userspace into the IB/RDMA kernel modules.
 


Sharing the IP 4-tuple space cooperatively with the core in any fashion
has been NAK'd.  Without this cooperation, the options we've been able
to come up with are administrative/policy-based approaches.

Any ideas you have along these lines are welcome.

Tom

 Glenn.
 
  
   - R.
  

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general