[ofa-general] Re: [query] Multi path discovery in openSM
If there are multiple paths between two end nodes in a network and I set LMC > 0, does OpenSM itself identify those routes and update the switch forwarding tables, or is that the duty of some other consumer of OpenSM? I am using the min-hop algorithm with OpenSM. In this case, if there are multiple paths (some of which are not min-hop paths), will OpenSM (LMC > 0) configure those paths? regards, Mahesh ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
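Background for readers new to the thread (not part of the original email): with LID Mask Control, each end port is assigned 2^LMC consecutive LIDs, and OpenSM can program a different route for each LID, which is what makes multipathing possible. A minimal sketch of the LID arithmetic, using a hypothetical helper name:

```python
def lids_for_port(base_lid: int, lmc: int) -> list[int]:
    """Return all LIDs assigned to a port: 2**lmc consecutive LIDs
    starting at base_lid (the base LID has its low lmc bits zero)."""
    assert base_lid & ((1 << lmc) - 1) == 0, "base LID must be LMC-aligned"
    return [base_lid + i for i in range(1 << lmc)]

# LMC=0: a single LID, hence a single routed path per destination port
print(lids_for_port(4, 0))   # [4]
# LMC=1: two LIDs, allowing the SM to program two distinct paths
print(lids_for_port(4, 1))   # [4, 5]
```

With LMC=0 there is only one LID per port, so there is nothing for OpenSM to spread traffic across, which is the crux of the question above.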
[ofa-general] Re: [PATCHES] TX batching
jamal wrote: On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote: please be reminded that we're going to strip down e1000, and most of the features should go into e1000e, which has far fewer hardware workarounds. I'm still reluctant to put new stuff into e1000 - I really want to chop it down first ;) sure - the question then is, will you take those changes if I use e1000e? There are a few cleanups that have nothing to do with batching; take a look at the modified e1000 in the git tree. that's bad to begin with :) - please send those separately so I can fast-track them into e1000e and e1000 where applicable. But yes, I'm much more inclined to merge new features into e1000e than into e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like TX batching would absolutely be interesting. Auke
RE: [ofa-general] Re: [query] Multi path discovery in openSM
OpenSM will always use min-hop paths (no matter which routing algorithm is used, except maybe LASH). If you use the default algorithm, OpenSM will tend to spread traffic such that, if you have used LMC=1 (2 LIDs per port), the two paths going to LID0 and LID1 will go through different systems or, if that is not possible, through different nodes.

EZ
Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208 Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Keshetti Mahesh
Sent: Monday, September 24, 2007 8:54 AM
To: openIB
Subject: [ofa-general] Re: [query] Multi path discovery in openSM

If there are multiple paths between two end nodes in a network and I set LMC > 0, does OpenSM itself identify those routes and update the switch forwarding tables, or is that the duty of some other consumer of OpenSM? I am using the min-hop algorithm with OpenSM. In this case, if there are multiple paths (some of which are not min-hop paths), will OpenSM (LMC > 0) configure those paths?

regards, Mahesh
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Quoting Steve Wise [EMAIL PROTECTED]: Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? and git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 OK for that one. -- MST
[ofa-general] ipoib patches - resend subset
Hi Roland, as per your request for a smaller number of changes, I resend this subset of the previous series.
[ofa-general] [PATCH 1/11] IB/ipoib: high dma support
Add high DMA support to IPoIB. This patch assumes all IB devices support 64-bit DMA.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
Index: linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux-2.6.23-rc1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-15 20:50:16.000000000 +0300
+++ linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-15 20:50:27.000000000 +0300
@@ -1079,6 +1079,8 @@ static struct net_device *ipoib_add_port
 	SET_NETDEV_DEV(priv->dev, hca->dma_device);
 
+	priv->dev->features |= NETIF_F_HIGHDMA;
+
 	result = ib_query_pkey(hca, port, 0, &priv->pkey);
 	if (result) {
 		printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n",
[ofa-general] [PATCH 3/11] ib_core: add checksum offload support
Add checksum offload support to the core.

A device that publishes IB_DEVICE_IP_CSUM actually supports calculating the checksum on transmit and provides an indication of whether the checksum is OK on receive.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h
===================================================================
--- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h	2007-09-24 13:24:22.000000000 +0200
+++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h	2007-09-24 13:24:40.000000000 +0200
@@ -95,7 +95,8 @@ enum ib_device_cap_flags {
 	IB_DEVICE_N_NOTIFY_CQ	= (1<<14),
 	IB_DEVICE_ZERO_STAG	= (1<<15),
 	IB_DEVICE_SEND_W_INV	= (1<<16),
-	IB_DEVICE_MEM_WINDOW	= (1<<17)
+	IB_DEVICE_MEM_WINDOW	= (1<<17),
+	IB_DEVICE_IP_CSUM	= (1<<18),
 };
 
 enum ib_atomic_cap {
@@ -431,6 +432,7 @@ struct ib_wc {
 	u8			sl;
 	u8			dlid_path_bits;
 	u8			port_num;	/* valid only for DR SMPs on switches */
+	int			csum_ok;
 };
 
 enum ib_cq_notify_flags {
@@ -615,7 +617,9 @@ enum ib_send_flags {
 	IB_SEND_FENCE		= 1,
 	IB_SEND_SIGNALED	= (1<<1),
 	IB_SEND_SOLICITED	= (1<<2),
-	IB_SEND_INLINE		= (1<<3)
+	IB_SEND_INLINE		= (1<<3),
+	IB_SEND_IP_CSUM		= (1<<4),
+	IB_SEND_UDP_TCP_CSUM	= (1<<5)
 };
 
 struct ib_sge {
[ofa-general] [PATCH 5/11]: mlx4_ib: add checksum offload support
Add checksum offload support to mlx4 Signed-off-by: Ali Ayub [EMAIL PROTECTED] Signed-off-by: Eli Cohen [EMAIL PROTECTED] --- Index: ofa_1_3_dev_kernel/include/linux/mlx4/cq.h === --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/cq.h 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/include/linux/mlx4/cq.h 2007-09-24 12:36:46.0 +0200 @@ -45,11 +45,11 @@ struct mlx4_cqe { u8 sl; u8 reserved1; __be16 rlid; - u32 reserved2; + __be32 ipoib_status; __be32 byte_cnt; __be16 wqe_index; __be16 checksum; - u8 reserved3[3]; + u8 reserved2[3]; u8 owner_sr_opcode; }; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 12:38:29.0 +0200 @@ -439,6 +439,8 @@ static int mlx4_ib_poll_one(struct mlx4_ wc-wc_flags |= be32_to_cpu(cqe-g_mlpath_rqpn) 0x8000 ? IB_WC_GRH : 0; wc-pkey_index = be32_to_cpu(cqe-immed_rss_invalid) 16; + wc-csum_ok = be32_to_cpu(cqe-ipoib_status) 0x1000 + be16_to_cpu(cqe-checksum) == 0x; } return 0; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c2007-09-24 12:36:46.0 +0200 @@ -100,6 +100,8 @@ static int mlx4_ib_query_device(struct i props-device_cap_flags |= IB_DEVICE_AUTO_PATH_MIG; if (dev-dev-caps.flags MLX4_DEV_CAP_FLAG_UD_AV_PORT) props-device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE; + if (dev-dev-caps.flags MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + props-device_cap_flags |= IB_DEVICE_IP_CSUM; props-vendor_id = be32_to_cpup((__be32 *) (out_mad-data + 36)) 0xff; @@ -626,6 +628,9 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev-ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; ibdev-ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; + if (ibdev-dev-caps.flags MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + ibdev-ib_dev.flags |= IB_DEVICE_IP_CSUM; + if (init_node_data(ibdev)) goto 
err_map; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-24 12:36:46.0 +0200 @@ -1433,6 +1433,10 @@ int mlx4_ib_post_send(struct ib_qp *ibqp cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE) : 0) | (wr-send_flags IB_SEND_SOLICITED ? cpu_to_be32(MLX4_WQE_CTRL_SOLICITED) : 0) | + ((wr-send_flags IB_SEND_IP_CSUM) ? +cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM) : 0) | + ((wr-send_flags IB_SEND_UDP_TCP_CSUM) ? +cpu_to_be32(MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) | qp-sq_signal_bits; if (wr-opcode == IB_WR_SEND_WITH_IMM || Index: ofa_1_3_dev_kernel/include/linux/mlx4/qp.h === --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/qp.h 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/include/linux/mlx4/qp.h 2007-09-24 12:36:46.0 +0200 @@ -162,6 +162,8 @@ enum { MLX4_WQE_CTRL_FENCE = 1 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 2, MLX4_WQE_CTRL_SOLICITED = 1 1, + MLX4_WQE_CTRL_IP_CSUM = 1 4, + MLX4_WQE_CTRL_TCP_UDP_CSUM = 1 5, }; struct mlx4_wqe_ctrl_seg { Index: ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c === --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/fw.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c2007-09-24 12:36:46.0 +0200 @@ -741,6 +741,9 @@ int mlx4_INIT_HCA(struct mlx4_dev *dev, MLX4_PUT(inbox, (u8) (PAGE_SHIFT - 12), INIT_HCA_UAR_PAGE_SZ_OFFSET); MLX4_PUT(inbox, param-log_uar_sz, INIT_HCA_LOG_UAR_SZ_OFFSET); + if (dev-caps.flags MLX4_DEV_CAP_FLAG_IPOIB_CSUM) +
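The mlx4 completion check above declares the checksum OK only when the CQE's IPOK bit is set and `cqe->checksum` equals 0xffff. That condition relies on the standard Internet-checksum invariant: the one's-complement sum over data that includes a correct checksum folds to 0xffff. A small illustration of that invariant (plain Python, not driver code; the 16-bit words are made-up sample data):

```python
def ones_complement_sum(words):
    """16-bit one's-complement sum with end-around carry."""
    s = 0
    for w in words:
        s += w
        s = (s & 0xffff) + (s >> 16)  # fold the carry back in
    return s

def checksum(words):
    """Internet checksum: complement of the one's-complement sum."""
    return ~ones_complement_sum(words) & 0xffff

data = [0x4500, 0x0054, 0x1c46, 0x4000, 0x4001]
csum = checksum(data)
# Summing the data together with its checksum folds to 0xffff --
# exactly the value the CQE check compares against.
print(hex(ones_complement_sum(data + [csum])))  # 0xffff
```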
[ofa-general] [PATCH 6/11] IB/ipoib: add checksum offload support
Add checksum offload support to ipoib Signed-off-by: Eli Cohen [EMAIL PROTECTED] Signed-off-by: Ali Ayub [EMAIL PROTECTED] --- Add checksum offload support to ipoib Signed-off-by: Eli Cohen [EMAIL PROTECTED] Signed-off-by: Ali Ayub [EMAIL PROTECTED] --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:09:21.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:49:00.0 +0200 @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_RX_CSUM= 11, IPOIB_MAX_BACKOFF_SECONDS = 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 12:23:26.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 13:05:21.0 +0200 @@ -1258,6 +1258,13 @@ static ssize_t set_mode(struct device *d set_bit(IPOIB_FLAG_ADMIN_CM, priv-flags); ipoib_warn(priv, enabling connected mode will cause multicast packet drops\n); + + /* clear ipv6 flag too */ + dev-features = ~NETIF_F_IP_CSUM; + + priv-tx_wr.send_flags = + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + ipoib_flush_paths(dev); return count; } @@ -1266,6 +1273,10 @@ static ssize_t set_mode(struct device *d clear_bit(IPOIB_FLAG_ADMIN_CM, priv-flags); dev-mtu = min(priv-mcast_mtu, dev-mtu); ipoib_flush_paths(dev); + + if (priv-ca-flags IB_DEVICE_IP_CSUM) + dev-features |= NETIF_F_IP_CSUM; /* ipv6 too */ + return count; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-24 11:57:02.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-24 13:03:27.0 +0200 @@ -37,6 +37,7 @@ #include linux/delay.h #include linux/dma-mapping.h +#include linux/ip.h #include rdma/ib_cache.h @@ -231,6 +232,16 @@ static void 
ipoib_ib_handle_rx_wc(struct skb-dev = dev; /* XXX get correct PACKET_ type here */ skb-pkt_type = PACKET_HOST; + + /* check rx csum */ + if (test_bit(IPOIB_FLAG_RX_CSUM, priv-flags) likely(wc-csum_ok)) { + /* Note: this is a specific requirement for Mellanox + HW but since it is the only HW currently supporting + checksum offload I put it here */ + if struct iphdr *)(skb-data))-ihl) == 5) + skb-ip_summed = CHECKSUM_UNNECESSARY; + } + netif_receive_skb(skb); repost: @@ -396,6 +407,15 @@ void ipoib_send(struct net_device *dev, return; } + if (priv-ca-flags IB_DEVICE_IP_CSUM + skb-ip_summed == CHECKSUM_PARTIAL) + priv-tx_wr.send_flags |= + IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM; + else + priv-tx_wr.send_flags = + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + + if (unlikely(post_send(priv, priv-tx_head (ipoib_sendq_size - 1), address-ah, qpn, tx_req-mapping, skb_headlen(skb), Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 12:23:00.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 13:04:52.0 +0200 @@ -1109,6 +1109,29 @@ int ipoib_add_pkey_attr(struct net_devic return device_create_file(dev-dev, dev_attr_pkey); } +static void set_tx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (test_bit(IPOIB_FLAG_ADMIN_CM, priv-flags)) + return; + + if (!(priv-ca-flags IB_DEVICE_IP_CSUM)) + return; + + dev-features |= NETIF_F_SG | NETIF_F_IP_CSUM; /* turn on ipv6 too */ +} + +static void set_rx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!(priv-ca-flags IB_DEVICE_IP_CSUM)) + return; + + set_bit(IPOIB_FLAG_RX_CSUM, priv-flags); +} + static struct net_device *ipoib_add_port(const char *format, struct
[ofa-general] [PATCH 8/11]: Add support for modifying CQ params
Add support for modifying CQ parameters to control event generation moderation. This makes it possible to control the rate of event (interrupt) generation by specifying a minimum number of CQEs and/or a minimum period of time required before an event is generated.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h
===================================================================
--- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h	2007-09-24 12:33:41.000000000 +0200
+++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h	2007-09-24 13:07:59.000000000 +0200
@@ -967,6 +967,8 @@ struct ib_device {
 						  int comp_vector,
 						  struct ib_ucontext *context,
 						  struct ib_udata *udata);
+	int                        (*modify_cq)(struct ib_cq *cq, u16 cq_count,
+						u16 cq_period);
 	int                        (*destroy_cq)(struct ib_cq *cq);
 	int                        (*resize_cq)(struct ib_cq *cq, int cqe,
 						struct ib_udata *udata);
@@ -1372,6 +1374,16 @@ struct ib_cq *ib_create_cq(struct ib_dev
 int ib_resize_cq(struct ib_cq *cq, int cqe);
 
 /**
+ * ib_modify_cq - Modifies the moderation parameters of the CQ
+ * @cq: The CQ to modify.
+ * @cq_count: number of CQEs that will trigger an event
+ * @cq_period: max period of time before triggering an event
+ */
+int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
+
+/**
  * ib_destroy_cq - Destroys the specified CQ.
  * @cq: The CQ to destroy.
  */
Index: ofa_1_3_dev_kernel/drivers/infiniband/core/verbs.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/core/verbs.c	2007-09-24 11:19:03.000000000 +0200
+++ ofa_1_3_dev_kernel/drivers/infiniband/core/verbs.c	2007-09-24 13:07:59.000000000 +0200
@@ -628,6 +628,13 @@ struct ib_cq *ib_create_cq(struct ib_dev
 }
 EXPORT_SYMBOL(ib_create_cq);
 
+int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period)
+{
+	return cq->device->modify_cq ?
+		cq->device->modify_cq(cq, cq_count, cq_period) : -ENOSYS;
+}
+EXPORT_SYMBOL(ib_modify_cq);
+
 int ib_destroy_cq(struct ib_cq *cq)
 {
 	if (atomic_read(&cq->usecnt))
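The moderation semantics this patch exposes — an event fires once either cq_count completions have accumulated or cq_period microseconds have elapsed since the last event — can be sketched as a toy model (an illustration only, not kernel code):

```python
def event_times(completions_us, cq_count, cq_period):
    """Given completion timestamps (in usec), return the times at which
    a moderated CQ would raise an event: after cq_count completions or
    cq_period usec since the last event, whichever comes first."""
    events, pending, last_event = [], 0, 0
    for t in completions_us:
        pending += 1
        if pending >= cq_count or t - last_event >= cq_period:
            events.append(t)
            pending, last_event = 0, t
    return events

# 10 completions 1 usec apart, cq_count=4, cq_period=100:
# the count threshold dominates, so an event fires on every 4th completion.
print(event_times(list(range(1, 11)), 4, 100))  # [4, 8]
```

Larger cq_count / cq_period values mean fewer interrupts per completion at the cost of added latency, which is exactly the trade-off the ethtool coalescing knobs in the later IPoIB patch let users tune.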
[ofa-general] [PATCH 9/11] mlx4_ib: add support for modifying CQ parameters
Add support for modifying CQ parameters. Signed-off-by: Eli Cohen [EMAIL PROTECTED] --- Add support for modifying CQ parameters. Signed-off-by: Eli Cohen [EMAIL PROTECTED] --- Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-24 12:36:46.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c2007-09-24 13:08:55.0 +0200 @@ -613,6 +613,7 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev-ib_dev.post_send = mlx4_ib_post_send; ibdev-ib_dev.post_recv = mlx4_ib_post_recv; ibdev-ib_dev.create_cq = mlx4_ib_create_cq; + ibdev-ib_dev.modify_cq = mlx4_ib_modify_cq; ibdev-ib_dev.destroy_cq= mlx4_ib_destroy_cq; ibdev-ib_dev.poll_cq = mlx4_ib_poll_cq; ibdev-ib_dev.req_notify_cq = mlx4_ib_arm_cq; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 12:38:29.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 13:08:55.0 +0200 @@ -91,6 +91,25 @@ static struct mlx4_cqe *next_cqe_sw(stru return get_sw_cqe(cq, cq-mcq.cons_index); } +int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) +{ + struct mlx4_ib_cq *mcq = to_mcq(cq); + struct mlx4_ib_dev *dev = to_mdev(cq-device); + struct mlx4_cq_context *context; + int err; + + context = kzalloc(sizeof *context, GFP_KERNEL); + if (!context) + return -ENOMEM; + + context-cq_period = cpu_to_be16(cq_period); + context-cq_max_count = cpu_to_be16(cq_count); + err = mlx4_cq_modify(dev-dev, mcq-mcq, context, 1); + + kfree(context); + return err; +} + struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata) Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-24 11:19:03.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-24 13:08:55.0 +0200 @@ -249,6 
+249,7 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct struct ib_udata *udata); int mlx4_ib_dereg_mr(struct ib_mr *mr); +int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period); struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata); Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c === --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c 2007-09-24 11:19:03.0 +0200 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c2007-09-24 13:08:55.0 +0200 @@ -38,33 +38,11 @@ #include linux/hardirq.h #include linux/mlx4/cmd.h +#include linux/mlx4/cq.h #include mlx4.h #include icm.h -struct mlx4_cq_context { - __be32 flags; - u16 reserved1[3]; - __be16 page_offset; - __be32 logsize_usrpage; - u8 reserved2; - u8 cq_period; - u8 reserved3; - u8 cq_max_count; - u8 reserved4[3]; - u8 comp_eqn; - u8 log_page_size; - u8 reserved5[2]; - u8 mtt_base_addr_h; - __be32 mtt_base_addr_l; - __be32 last_notified_index; - __be32 solicit_producer_index; - __be32 consumer_index; - __be32 producer_index; - u32 reserved6[2]; - __be64 db_rec_addr; -}; - #define MLX4_CQ_STATUS_OK ( 0 28) #define MLX4_CQ_STATUS_OVERFLOW( 9 28) #define MLX4_CQ_STATUS_WRITE_FAIL (10 28) @@ -121,6 +99,13 @@ static int mlx4_SW2HW_CQ(struct mlx4_dev MLX4_CMD_TIME_CLASS_A); } +static int mlx4_MODIFY_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, +int cq_num, u32 opmod) +{ + return mlx4_cmd(dev, mailbox-dma, cq_num, opmod, MLX4_CMD_MODIFY_CQ, + MLX4_CMD_TIME_CLASS_A); +} + static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
[ofa-general] [PATCH 10/11]: IB/ipoib modify cq params
Implement support for modifying the IPoIB CQ moderation parameters. This can be used to tune, at run time, the parameters controlling the event (interrupt) generation rate, reducing the overhead incurred by handling interrupts and resulting in better throughput.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-24 13:07:43.000000000 +0200
+++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-24 13:12:21.000000000 +0200
@@ -270,6 +270,13 @@ struct ipoib_cm_dev_priv {
 	struct ib_recv_wr       rx_wr;
 };
 
+struct ipoib_ethtool_st {
+	u16	rx_coalesce_usecs;
+	u16	tx_coalesce_usecs;
+	u16	rx_max_coalesced_frames;
+	u16	tx_max_coalesced_frames;
+};
+
 /*
  * Device private locking: tx_lock protects members used in TX fast
  * path (and we use LLTX so upper layers don't do extra locking).
@@ -346,6 +353,7 @@ struct ipoib_dev_priv {
 	struct dentry *mcg_dentry;
 	struct dentry *path_dentry;
 #endif
+	struct ipoib_ethtool_st etool;
 };
 
 struct ipoib_ah {
Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_etool.c	2007-09-24 13:07:43.000000000 +0200
+++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c	2007-09-24 13:09:26.000000000 +0200
@@ -44,9 +44,49 @@ static void ipoib_get_drvinfo(struct net
 	strncpy(drvinfo->driver, "ipoib", sizeof(drvinfo->driver) - 1);
 }
 
+static int ipoib_get_coalesce(struct net_device *dev,
+			      struct ethtool_coalesce *coal)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	coal->rx_coalesce_usecs = priv->etool.rx_coalesce_usecs;
+	coal->tx_coalesce_usecs = priv->etool.tx_coalesce_usecs;
+	coal->rx_max_coalesced_frames = priv->etool.rx_max_coalesced_frames;
+	coal->tx_max_coalesced_frames = priv->etool.tx_max_coalesced_frames;
+
+	return 0;
+}
+
+static int ipoib_set_coalesce(struct net_device *dev,
+			      struct ethtool_coalesce *coal)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+
+	if (coal->rx_coalesce_usecs > 0xffff ||
+	    coal->tx_coalesce_usecs > 0xffff ||
+	    coal->rx_max_coalesced_frames > 0xffff ||
+	    coal->tx_max_coalesced_frames > 0xffff)
+		return -EINVAL;
+
+	ret = ib_modify_cq(priv->cq, coal->rx_max_coalesced_frames,
+			   coal->rx_coalesce_usecs);
+	if (ret)
+		return ret;
+
+	priv->etool.rx_coalesce_usecs = coal->rx_coalesce_usecs;
+	priv->etool.tx_coalesce_usecs = coal->tx_coalesce_usecs;
+	priv->etool.rx_max_coalesced_frames = coal->rx_max_coalesced_frames;
+	priv->etool.tx_max_coalesced_frames = coal->tx_max_coalesced_frames;
+
+	return 0;
+}
+
 static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_drvinfo	= ipoib_get_drvinfo,
 	.get_tso	= ethtool_op_get_tso,
+	.get_coalesce	= ipoib_get_coalesce,
+	.set_coalesce	= ipoib_set_coalesce,
 };
 
 void ipoib_set_ethtool_ops(struct net_device *dev)
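The set_coalesce path rejects any requested value above 0xffff because ib_modify_cq takes u16 parameters. That range check can be illustrated standalone (a hypothetical helper mirroring the -EINVAL logic, not the driver function itself):

```python
EINVAL = 22  # errno value for "invalid argument"

def validate_coalesce(rx_usecs, tx_usecs, rx_frames, tx_frames):
    """Mirror the patch's check: ib_modify_cq takes u16 moderation
    parameters, so any requested value above 0xffff is rejected."""
    for v in (rx_usecs, tx_usecs, rx_frames, tx_frames):
        if v > 0xffff:
            return -EINVAL
    return 0

print(validate_coalesce(10, 10, 16, 16))    # 0 (accepted)
print(validate_coalesce(0x10000, 0, 0, 0))  # -22 (rejected)
```

In practice these values would be set with `ethtool -C` on the IPoIB interface once the patch is applied.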
[ofa-general] [PATCH 11/11]: mlx4_core: use fixed CQ moderation parameters
From: Michael S. Tsirkin [EMAIL PROTECTED]

Enable fixed interrupt coalescing for CQs in mlx4.

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]
---
Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c	2007-09-24 13:08:55.000000000 +0200
+++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c	2007-09-24 13:12:42.000000000 +0200
@@ -43,6 +43,14 @@
 #include "mlx4.h"
 #include "icm.h"
 
+static int cq_max_count = 16;
+static int cq_period = 10;
+
+module_param(cq_max_count, int, 0444);
+MODULE_PARM_DESC(cq_max_count, "number of CQEs to generate event");
+module_param(cq_period, int, 0444);
+MODULE_PARM_DESC(cq_period, "time in usec for CQ event generation");
+
 #define MLX4_CQ_STATUS_OK		( 0 << 28)
 #define MLX4_CQ_STATUS_OVERFLOW		( 9 << 28)
 #define MLX4_CQ_STATUS_WRITE_FAIL	(10 << 28)
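Given the read-only (0444) module parameters this patch adds, the fixed moderation values could be overridden at module load time roughly like so (hypothetical usage sketch; the parameter names come from the patch, and the sysfs paths assume a standard module layout):

```shell
# Set the CQ moderation parameters when loading mlx4_core.
modprobe mlx4_core cq_max_count=32 cq_period=20

# Verify the values the module picked up (0444 = readable, not writable).
cat /sys/module/mlx4_core/parameters/cq_max_count
cat /sys/module/mlx4_core/parameters/cq_period
```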
[ofa-general] [PATCH 4/11] ib_mthca: add checksum offload support
Add checksum offload support in mthca Signed-off-by: Eli Cohen [EMAIL PROTECTED] --- resending - adding the openfabrics list Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-09-24 12:34:59.0 +0200 @@ -1377,6 +1377,9 @@ int mthca_INIT_HCA(struct mthca_dev *dev MTHCA_PUT(inbox, param-uarc_base, INIT_HCA_UAR_CTX_BASE_OFFSET); } + if (dev-device_cap_flags IB_DEVICE_IP_CSUM) + *(inbox + INIT_HCA_FLAGS2_OFFSET / 4) |= cpu_to_be32(7 3); + err = mthca_cmd(dev, mailbox-dma, 0, 0, CMD_INIT_HCA, HZ, status); mthca_free_mailbox(dev, mailbox); Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.h === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.h 2007-09-24 12:34:59.0 +0200 @@ -103,6 +103,7 @@ enum { DEV_LIM_FLAG_RAW_IPV6 = 1 4, DEV_LIM_FLAG_RAW_ETHER = 1 5, DEV_LIM_FLAG_SRQ= 1 6, + DEV_LIM_FLAG_IPOIB_CSUM = 1 7, DEV_LIM_FLAG_BAD_PKEY_CNTR = 1 8, DEV_LIM_FLAG_BAD_QKEY_CNTR = 1 9, DEV_LIM_FLAG_MW = 1 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cq.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cq.c 2007-09-24 12:36:06.0 +0200 @@ -119,7 +119,8 @@ struct mthca_cqe { __be32 my_qpn; __be32 my_ee; __be32 rqpn; - __be16 sl_g_mlpath; + u8 sl_ipok; + u8 g_mlpath; __be16 rlid; __be32 imm_etype_pkey_eec; __be32 byte_cnt; @@ -498,6 +499,7 @@ static inline int mthca_poll_one(struct int is_send; int free_cqe = 1; int err = 0; + u16 checksum; cqe = next_cqe_sw(cq); if (!cqe) @@ -639,12 +641,14 @@ static inline int mthca_poll_one(struct break; } entry-slid= be16_to_cpu(cqe-rlid); - entry-sl = be16_to_cpu(cqe-sl_g_mlpath) 12; + entry-sl = cqe-sl_ipok 4; 
entry-src_qp = be32_to_cpu(cqe-rqpn) 0xff; - entry-dlid_path_bits = be16_to_cpu(cqe-sl_g_mlpath) 0x7f; + entry-dlid_path_bits = cqe-g_mlpath 0x7f; entry-pkey_index = be32_to_cpu(cqe-imm_etype_pkey_eec) 16; - entry-wc_flags |= be16_to_cpu(cqe-sl_g_mlpath) 0x80 ? - IB_WC_GRH : 0; + entry-wc_flags |= cqe-g_mlpath 0x80 ? IB_WC_GRH : 0; + checksum = (be32_to_cpu(cqe-rqpn) 24) | + ((be32_to_cpu(cqe-my_ee) 16) 0xff00); + entry-csum_ok = (cqe-sl_ipok 1 checksum == 0x); } entry-status = IB_WC_SUCCESS; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_main.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_main.c 2007-09-24 12:34:59.0 +0200 @@ -289,6 +289,10 @@ static int mthca_dev_lim(struct mthca_de if (dev_lim-flags DEV_LIM_FLAG_SRQ) mdev-mthca_flags |= MTHCA_FLAG_SRQ; + if (mthca_is_memfree(mdev)) + if (dev_lim-flags DEV_LIM_FLAG_IPOIB_CSUM) + mdev-device_cap_flags |= IB_DEVICE_IP_CSUM; + return 0; } @@ -1125,6 +1129,8 @@ static int __mthca_init_one(struct pci_d if (err) goto err_cmd; + mdev-ib_dev.flags = mdev-device_cap_flags; + if (mdev-fw_ver mthca_hca_table[hca_type].latest_fw) { mthca_warn(mdev, HCA FW version %d.%d.%03d is old (%d.%d.%03d is current).\n, (int) (mdev-fw_ver 32), (int) (mdev-fw_ver 16) 0x, Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_qp.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_qp.c 2007-09-24 12:34:59.0 +0200 @@ -2024,6 +2024,10 @@ int mthca_arbel_post_send(struct ib_qp * cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0)
Re: [ofa-general] Re: [query] Multi path discovery in openSM
On 09:05 Mon 24 Sep, Eitan Zahavi wrote:
> OpenSM will always use min-hop paths (no matter what routing algorithm

I would clarify here - for LMC > 0, OpenSM will choose different paths from among the _discovered_ shortest paths. For the min-hop algorithm those shortest paths are real min-hop paths; for Up/Down they are the min-hop paths which satisfy the Up/Down constraint.

> except maybe for LASH).

For LASH too (LASH is an abbreviation of LAyered SHortest paths). There, different layers (VLs in the case of IB) are used for credit-loop resolution. However, the current LASH implementation does not support LMC > 0.

Sasha

> If you use the default algorithms OpenSM will tend to spread traffic such
> that if you have used LMC=1 (2 LIDs per port) the two paths going to LID0
> and LID1 will go through different systems or, if not possible, through
> different nodes.
>
> EZ
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208 Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Keshetti Mahesh
> Sent: Monday, September 24, 2007 8:54 AM
> To: openIB
> Subject: [ofa-general] Re: [query] Multi path discovery in openSM
>
> If there are multiple paths between two end nodes in a network and I set
> LMC > 0, does OpenSM itself identify those routes and update the switch
> forwarding tables, or is that the duty of some other consumer of OpenSM?
> I am using the min-hop algorithm with OpenSM. In this case, if there are
> multiple paths (some of which are not min-hop paths), will OpenSM
> (LMC > 0) configure those paths?
>
> regards, Mahesh
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Quoting Steve Wise [EMAIL PROTECTED]: Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? I figured it out. done. -- MST
[ofa-general] [PATCHv3] IB/ipoib: HW checksum support
Add a module option hw_csum: when set, IPoIB will report HW checksum and S/G support, and rely on the hardware end-to-end transport checksum (ICRC) instead of software-level protocol checksums. Forwarding such packets outside the IB subnet would increase the risk of data corruption, so it is safest not to set the hw_csum flag on gateways. To reduce the chance of such routing triggering data corruption by mistake, on RX we set the skb checksum field to CHECKSUM_UNNECESSARY - this way, if such a packet ends up outside the IB network, it is detected as malformed and dropped. To enable interoperability with IEEE IPoIB, the checksum for outgoing packets is calculated in software unless the remote side advertises the hw_csum capability by setting a bit in the hardware address flags. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- This patch has to be applied on top of [PATCH 2/11] IB/ipoib: support for sending gather skbs. Updates since v2: enable interoperability with IEEE IPoIB; split out S/G support to a separate patch. Updates since v1: fixed thinko in setting header flags. When applied on top of the previously posted mlx4 patches, and with hw_csum enabled on both ends, this patch speeds up single-stream netperf bandwidth on ConnectX DDR from 1000 to 1250 MBytes/sec.
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 285c143..485f979 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_HW_CSUM= 11, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -104,9 +105,11 @@ enum { /* structs */ +#define IPOIB_HEADER_F_HWCSUM 0x1 + struct ipoib_header { __be16 proto; - u16 reserved; + __be16 flags; }; struct ipoib_pseudoheader { @@ -430,6 +478,8 @@ void ipoib_pkey_poll(struct work_struct *work); int ipoib_pkey_dev_delay_open(struct net_device *dev); void ipoib_drain_cq(struct net_device *dev); +#define IPOIB_FLAGS_HWCSUM 0x01 + #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 08b4676..a308e92 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -407,6 +407,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; int frags; + struct ipoib_header *header; ipoib_dbg_data(priv, cm recv completion: id %d, status: %d\n, wr_id, wc-status); @@ -469,7 +470,10 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc-byte_len, newskb); - skb-protocol = ((struct ipoib_header *) skb-data)-proto; + header = (struct ipoib_header *)skb-data; + skb-protocol = header-proto; + if (header-flags cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb-ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 1094488..59b1735 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -170,6 +170,7 @@ static void 
ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int wr_id = wc-wr_id ~IPOIB_OP_RECV; struct sk_buff *skb; + struct ipoib_header *header; u64 addr; ipoib_dbg_data(priv, recv completion: id %d, status: %d\n, @@ -220,7 +221,10 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put(skb, wc-byte_len); skb_pull(skb, IB_GRH_BYTES); - skb-protocol = ((struct ipoib_header *) skb-data)-proto; + header = (struct ipoib_header *)skb-data; + skb-protocol = header-proto; + if (header-flags cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb-ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..74d10e6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -55,11 +55,14 @@ MODULE_LICENSE(Dual BSD/GPL); int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; +static int ipoib_hw_csum __read_mostly = 0; module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); MODULE_PARM_DESC(send_queue_size, Number of descriptors in send queue); module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444);
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Michael S. Tsirkin wrote: Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 I figured it out. done. And I did a new build of OFED 1.2.5 daily (look at http://www.openfabrics.org/builds/connectx/latest.txt) Tziporet ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister
Roland Dreier wrote: The action in bonding to a detach of a slave is to unregister the master (see patch 10). This can't be done from the context of unregister_netdevice itself (it is protected by rtnl_lock). I'm confused. Your patch has: + ipoib_slave_detach(cpriv->dev); unregister_netdev(cpriv->dev); And ipoib_slave_detach() is: +static inline void ipoib_slave_detach(struct net_device *dev) +{ + rtnl_lock(); + netdev_slave_detach(dev); + rtnl_unlock(); +} so you are calling netdev_slave_detach() with the rtnl lock held. Why can't you make the same call from the start of unregister_netdevice()? Anyway, if the rtnl lock is a problem, can you just add the call to netdev_slave_detach() to unregister_netdev() before it takes the rtnl lock? - R. Your comment made me do a little rethinking. In bonding, the device is released by calling unregister_netdevice(), which doesn't take the rtnl_lock (unlike unregister_netdev(), which does). I guess this is what confused me into thinking that it was not possible. So I could put the detach notification in unregister_netdev() and the reaction to the notification in the bonding driver would not block. However, I looked one more time at the code of unregister_netdevice() and found out that nothing prevents calling unregister_netdevice() again when the NETDEV_GOING_DOWN notification is sent. I tried that and it works. I have a new set of patches that does not send a slave detach and I will send it soon. Thanks for the comment, Roland. It makes this patch simpler. I'd also like to give credit to Jay for the idea of using the NETDEV_GOING_DOWN notification instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. He suggested it a while ago but I wrongly thought that it wouldn't work.
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Michael S. Tsirkin wrote: Quoting Steve Wise [EMAIL PROTECTED]: Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 Go look at http://www.openfabrics.org/git/?p=ofed_1_2_5/libcxgb3.git;a=summary It has a ofed_1_2_5 branch. I believe Vlad setup the build scripts to handle this. Yes? This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? and git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 OK for that one. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Michael S. Tsirkin wrote: Quoting Steve Wise [EMAIL PROTECTED]: Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 Go look at http://www.openfabrics.org/git/?p=ofed_1_2_5/libcxgb3.git;a=summary It has a ofed_1_2_5 branch. I believe Vlad setup the build scripts to handle this. Yes? This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? and git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 OK for that one. It's OK, done for both. -- MST ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [ewg] OFED teleconference today
Jeff Squyres wrote: Friendly reminder: the OFED teleconference is several hours from now (Monday, September 24, 2007). Noon US eastern / 9am US Pacific / -=6pm Israel=- 1. Monday, Sep 24, code 210062024 (***TODAY***) Agenda: 1. Agree on the new OFED 1.3 schedule: * Feature freeze - Sep 25 * Alpha release - Oct 1 * Beta release - Oct 17 (may change according to 2.6.24 rc1 availability) * RC1 - Oct 24 * RC2 - Nov 7 * RC3 - Nov 20 * RC4 - Dec 4 * GA release - Dec 18 2. Agree to move to kernel base 2.6.24 Start with what we have now (2.6.23) and move to 2.6.24 when RC1 is available. This will reduce many patches and with the new timeline seems more appropriate. Please send if you have any other agenda items Tziporet ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
This patch series is the sixth version (see below link to V5) of the suggested changes to the bonding driver so it would be able to support non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. Patches 1-8 were originally submitted in V5 and patch 9 is an addition by Jay. Major changes from the previous version: 1. Remove the patches to net/core. Bonding will use the NETDEV_GOING_DOWN notification instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. This reduces the amount of patches from 11 to 9. Links to earlier discussion: 1. A discussion in netdev about bonding support for IPoIB. http://lists.openwall.net/netdev/2006/11/30/46 2. V5 series http://lists.openfabrics.org/pipermail/general/2007-September/040996.html ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V6 1/9] IB/ipoib: Bound the net device to the ipoib_neigh structure
IPoIB uses a two layer neighboring scheme, such that for each struct neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy which is created on demand at the tx flow by an ipoib_neigh_alloc(skb-dst-neighbour) call. When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb-dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Under this scheme, ipoib_neigh_destructor assumption that for each struct neighbour it gets, n-dev is an ipoib device and hence netdev_priv(n-dev) can be casted to struct ipoib_dev_priv is buggy. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one, when n-dev-flags has the IFF_MASTER bit set. Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/infiniband/ulp/ipoib/ipoib.h |4 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c |3 ++- 3 files changed, 20 insertions(+), 11 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h2007-09-18 17:09:26.534874404 +0200 @@ -328,6 +328,7 @@ struct ipoib_neigh { struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_headlist; }; @@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; Index: 
net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:23:54.725744661 +0200 @@ -511,7 +511,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = ipoib_neigh_alloc(skb-dst-neighbour); + neigh = ipoib_neigh_alloc(skb-dst-neighbour, skb-dev); if (!neigh) { ++priv-stats.tx_dropped; dev_kfree_skb_any(skb); @@ -830,6 +830,13 @@ static void ipoib_neigh_cleanup(struct n unsigned long flags; struct ipoib_ah *ah = NULL; + neigh = *to_ipoib_neigh(n); + if (neigh) { + priv = netdev_priv(neigh-dev); + ipoib_dbg(priv, neigh_destructor for bonding device: %s\n, + n-dev-name); + } else + return; ipoib_dbg(priv, neigh_cleanup for %06x IPOIB_GID_FMT \n, IPOIB_QPN(n-ha), @@ -837,13 +844,10 @@ static void ipoib_neigh_cleanup(struct n spin_lock_irqsave(priv-lock, flags); - neigh = *to_ipoib_neigh(n); - if (neigh) { - if (neigh-ah) - ah = neigh-ah; - list_del(neigh-list); - ipoib_neigh_free(n-dev, neigh); - } + if (neigh-ah) + ah = neigh-ah; + list_del(neigh-list); + ipoib_neigh_free(n-dev, neigh); spin_unlock_irqrestore(priv-lock, flags); @@ -851,7 +855,8 @@ static void ipoib_neigh_cleanup(struct n ipoib_put_ah(ah); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour, + struct net_device *dev) { struct ipoib_neigh *neigh; @@ -860,6 +865,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st return NULL; neigh-neighbour = neighbour; + neigh-dev = dev; *to_ipoib_neigh(neighbour) = neigh; skb_queue_head_init(neigh-queue); ipoib_cm_set(neigh, NULL); Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:09:26.536874045 
+0200 @@ -727,7 +727,8 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { -
[ofa-general] [PATCH V6 2/9] IB/ipoib: Verify address handle validity on send
When the bonding device senses a carrier loss of its active slave it replaces that slave with a new one. In between the time when the carrier of an IPoIB device goes down and the time its ipoib_neigh is destroyed, it is possible that the bonding driver will send a packet on a new slave that uses an old ipoib_neigh. This patch detects and prevents this from happening.

Signed-off-by: Moni Shoua monis at voltaire.com
Signed-off-by: Or Gerlitz ogerlitz at voltaire.com
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-18 17:09:26.535874225 +0200
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-18 17:10:22.375853147 +0200
@@ -686,9 +686,10 @@ static int ipoib_start_xmit(struct sk_bu
 			goto out;
 		}
 	} else if (neigh->ah) {
-		if (unlikely(memcmp(&neigh->dgid.raw,
+		if (unlikely((memcmp(&neigh->dgid.raw,
 					skb->dst->neighbour->ha + 4,
-					sizeof(union ib_gid)))) {
+					sizeof(union ib_gid))) ||
+			     (neigh->dev != dev))) {
 			spin_lock(&priv->lock);
 			/*
 			 * It's safe to call ipoib_put_ah() inside
[ofa-general] [PATCH V6 3/9] net/bonding: Enable bonding to enslave non ARPHRD_ETHER
This patch changes some of the bond netdevice attributes and functions to be those of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. Basically it overrides the settings done by ether_setup(), which are netdevice **type** dependent and hence might not be appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves of dissimilar ether types, as was concluded over the v1 discussion. An IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 byte IB QP (Queue Pair) number and the 16 byte IB port GID (Global ID) of the port this IPoIB device is bound to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (I have omitted here some details which are not important for the bonding RFC).

Signed-off-by: Moni Shoua monis at voltaire.com
Signed-off-by: Or Gerlitz ogerlitz at voltaire.com
---
 drivers/net/bonding/bond_main.c |   39 +++
 1 files changed, 39 insertions(+)

Index: net-2.6/drivers/net/bonding/bond_main.c
===
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-08-15 10:08:59.0 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-08-15 10:54:13.424688411 +0300
@@ -1237,6 +1237,26 @@ static int bond_compute_features(struct
 	return 0;
 }
 
+static void bond_setup_by_slave(struct net_device *bond_dev,
+				struct net_device *slave_dev)
+{
+	bond_dev->hard_header       = slave_dev->hard_header;
+	bond_dev->rebuild_header    = slave_dev->rebuild_header;
+	bond_dev->hard_header_cache = slave_dev->hard_header_cache;
+	bond_dev->header_cache_update = slave_dev->header_cache_update;
+	bond_dev->hard_header_parse = slave_dev->hard_header_parse;
+
+	bond_dev->neigh_setup       = slave_dev->neigh_setup;
+
+	bond_dev->type              = slave_dev->type;
+	bond_dev->hard_header_len   = slave_dev->hard_header_len;
+	bond_dev->addr_len          = slave_dev->addr_len;
+
+	memcpy(bond_dev->broadcast, slave_dev->broadcast,
+		slave_dev->addr_len);
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1311,6 +1331,25 @@ int bond_enslave(struct net_device *bond
 		goto err_undo_flags;
 	}
 
+	/* set bonding device ether type by slave - bonding netdevices are
+	 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
+	 * there is a need to override some of the type dependent attribs/funcs.
+	 *
+	 * bond ether type mutual exclusion - don't allow slaves of dissimilar
+	 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
+	 */
+	if (bond->slave_cnt == 0) {
+		if (slave_dev->type != ARPHRD_ETHER)
+			bond_setup_by_slave(bond_dev, slave_dev);
+	} else if (bond_dev->type != slave_dev->type) {
+		printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different "
+			"from other slaves (%d), can not enslave it.\n",
+			slave_dev->name,
+			slave_dev->type, bond_dev->type);
+		res = -EINVAL;
+		goto err_undo_flags;
+	}
+
 	if (slave_dev->set_mac_address == NULL) {
 		printk(KERN_ERR DRV_NAME
 			": %s: Error: The slave device you specified does
[ofa-general] [PATCH V6 4/9] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address()
This patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond mac address is the one of the active slave, where remote peers are notified on the mac address (neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/net/bonding/bond_main.c | 87 +++- drivers/net/bonding/bonding.h |1 2 files changed, 60 insertions(+), 28 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-08-15 10:54:13.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.971632881 +0300 @@ -1095,6 +1095,14 @@ void bond_change_active_slave(struct bon if (new_active) { bond_set_slave_active_flags(new_active); } + + /* when bonding does not set the slave MAC address, the bond MAC +* address is the one of the active slave. +*/ + if (new_active !bond-do_set_mac_addr) + memcpy(bond-dev-dev_addr, new_active-dev-dev_addr, + new_active-dev-addr_len); + bond_send_gratuitous_arp(bond); } } @@ -1351,13 +1359,22 @@ int bond_enslave(struct net_device *bond } if (slave_dev-set_mac_address == NULL) { - printk(KERN_ERR DRV_NAME - : %s: Error: The slave device you specified does - not support setting the MAC address. - Your kernel likely does not support slave - devices.\n, bond_dev-name); - res = -EOPNOTSUPP; - goto err_undo_flags; + if (bond-slave_cnt == 0) { + printk(KERN_WARNING DRV_NAME + : %s: Warning: The first slave device you + specified does not support setting the MAC + address. This bond MAC address would be that + of the active slave.\n, bond_dev-name); + bond-do_set_mac_addr = 0; + } else if (bond-do_set_mac_addr) { + printk(KERN_ERR DRV_NAME + : %s: Error: The slave device you specified + does not support setting the MAC addres,. + but this bond uses this practice. 
\n + , bond_dev-name); + res = -EOPNOTSUPP; + goto err_undo_flags; + } } new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL); @@ -1378,16 +1395,18 @@ int bond_enslave(struct net_device *bond */ memcpy(new_slave-perm_hwaddr, slave_dev-dev_addr, ETH_ALEN); - /* -* Set slave to master's mac address. The application already -* set the master's mac address to that of the first slave -*/ - memcpy(addr.sa_data, bond_dev-dev_addr, bond_dev-addr_len); - addr.sa_family = slave_dev-type; - res = dev_set_mac_address(slave_dev, addr); - if (res) { - dprintk(Error %d calling set_mac_address\n, res); - goto err_free; + if (bond-do_set_mac_addr) { + /* +* Set slave to master's mac address. The application already +* set the master's mac address to that of the first slave +*/ + memcpy(addr.sa_data, bond_dev-dev_addr, bond_dev-addr_len); + addr.sa_family = slave_dev-type; + res = dev_set_mac_address(slave_dev, addr); + if (res) { + dprintk(Error %d calling set_mac_address\n, res); + goto err_free; + } } res = netdev_set_master(slave_dev, bond_dev); @@ -1612,9 +1631,11 @@ err_close: dev_close(slave_dev); err_restore_mac: - memcpy(addr.sa_data, new_slave-perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev-type; - dev_set_mac_address(slave_dev, addr); + if (bond-do_set_mac_addr) { + memcpy(addr.sa_data, new_slave-perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev-type; + dev_set_mac_address(slave_dev, addr); + } err_free: kfree(new_slave); @@ -1792,10 +1813,12 @@ int bond_release(struct net_device *bond /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original (permanent) mac address */ - memcpy(addr.sa_data, slave-perm_hwaddr, ETH_ALEN); -
[ofa-general] [PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices
Allow enslaving devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the cleanest way to go, and it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() has set the bonding device type to be ARPHRD_ETHER and the address length to be ETH_ALEN, the net core code computes a wrong multicast link address: ip_eth_mc_map() is called, whereas for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND).

Signed-off-by: Moni Shoua monis at voltaire.com
Signed-off-by: Or Gerlitz ogerlitz at voltaire.com
---
 drivers/net/bonding/bond_main.c  |    5 +++--
 drivers/net/bonding/bond_sysfs.c |    6 ++----
 2 files changed, 5 insertions(+), 6 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-08-15 10:54:41.0 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-08-15 10:55:48.431862446 +0300
@@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond
 	/* bond must be initialized by bond_open() before enslaving */
 	if (!(bond_dev->flags & IFF_UP)) {
-		dprintk("Error, master_dev is not up\n");
-		return -EPERM;
+		printk(KERN_WARNING DRV_NAME
+			" %s: master_dev is not up in bond_enslave\n",
+			bond_dev->name);
 	}
 
 	/* already enslaved */
Index: net-2.6/drivers/net/bonding/bond_sysfs.c
===
--- net-2.6.orig/drivers/net/bonding/bond_sysfs.c	2007-08-15 10:08:58.0 +0300
+++ net-2.6/drivers/net/bonding/bond_sysfs.c	2007-08-15 10:55:48.432862269 +0300
@@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru
 	/* Quick sanity check -- is the bond interface up? */
 	if (!(bond->dev->flags & IFF_UP)) {
-		printk(KERN_ERR DRV_NAME
-			": %s: Unable to update slaves because interface is down.\n",
+		printk(KERN_WARNING DRV_NAME
+			": %s: doing slave updates when interface is down.\n",
 			bond->dev->name);
-		ret = -EPERM;
-		goto out;
 	}
 
 	/* Note: We can't hold bond->lock here, as bond_create grabs it. */
[ofa-general] [PATCH V6 6/9] net/bonding: Handle wrong assumptions that the slave is always an Ethernet device
bonding sometimes uses Ethernet constants (such as MTU and address length) which are not good when it enslaves non Ethernet devices (such as InfiniBand). Signed-off-by: Moni Shoua monis at voltaire.com --- drivers/net/bonding/bond_main.c |3 ++- drivers/net/bonding/bond_sysfs.c | 10 -- drivers/net/bonding/bonding.h|1 + 3 files changed, 11 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-09-24 12:52:33.0 +0200 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-09-24 12:57:33.411459811 +0200 @@ -1224,7 +1224,8 @@ static int bond_compute_features(struct struct slave *slave; struct net_device *bond_dev = bond-dev; unsigned long features = bond_dev-features; - unsigned short max_hard_header_len = ETH_HLEN; + unsigned short max_hard_header_len = max((u16)ETH_HLEN, + bond_dev-hard_header_len); int i; features = ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES); Index: net-2.6/drivers/net/bonding/bond_sysfs.c === --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-09-24 12:55:09.0 +0200 +++ net-2.6/drivers/net/bonding/bond_sysfs.c2007-09-24 13:00:23.752680721 +0200 @@ -260,6 +260,7 @@ static ssize_t bonding_store_slaves(stru char command[IFNAMSIZ + 1] = { 0, }; char *ifname; int i, res, found, ret = count; + u32 original_mtu; struct slave *slave; struct net_device *dev = NULL; struct bonding *bond = to_bond(d); @@ -325,6 +326,7 @@ static ssize_t bonding_store_slaves(stru } /* Set the slave's MTU to match the bond */ + original_mtu = dev-mtu; if (dev-mtu != bond-dev-mtu) { if (dev-change_mtu) { res = dev-change_mtu(dev, @@ -339,6 +341,9 @@ static ssize_t bonding_store_slaves(stru } rtnl_lock(); res = bond_enslave(bond-dev, dev); + bond_for_each_slave(bond, slave, i) + if (strnicmp(slave-dev-name, ifname, IFNAMSIZ) == 0) + slave-original_mtu = original_mtu; rtnl_unlock(); if (res) { ret = res; @@ -351,6 +356,7 @@ static ssize_t bonding_store_slaves(stru bond_for_each_slave(bond, slave, i) if 
(strnicmp(slave-dev-name, ifname, IFNAMSIZ) == 0) { dev = slave-dev; + original_mtu = slave-original_mtu; break; } if (dev) { @@ -365,9 +371,9 @@ static ssize_t bonding_store_slaves(stru } /* set the slave MTU to the default */ if (dev-change_mtu) { - dev-change_mtu(dev, 1500); + dev-change_mtu(dev, original_mtu); } else { - dev-mtu = 1500; + dev-mtu = original_mtu; } } else { Index: net-2.6/drivers/net/bonding/bonding.h === --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-09-24 12:55:09.0 +0200 +++ net-2.6/drivers/net/bonding/bonding.h 2007-09-24 12:57:33.412459636 +0200 @@ -156,6 +156,7 @@ struct slave { s8 link;/* one of BOND_LINK_ */ s8 state; /* one of BOND_STATE_ */ u32original_flags; + u32original_mtu; u32link_failure_count; u16speed; u8 duplex; ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V6 7/9] net/bonding: Delay sending of gratuitous ARP to avoid failure
Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit in dev-state field is on. This improves the chances for the arp packet to be transmitted. Signed-off-by: Moni Shoua monis at voltaire.com --- drivers/net/bonding/bond_main.c | 24 +--- drivers/net/bonding/bonding.h |1 + 2 files changed, 22 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-08-15 10:56:33.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 11:04:37.221123652 +0300 @@ -1102,8 +1102,14 @@ void bond_change_active_slave(struct bon if (new_active !bond-do_set_mac_addr) memcpy(bond-dev-dev_addr, new_active-dev-dev_addr, new_active-dev-addr_len); - - bond_send_gratuitous_arp(bond); + if (bond-curr_active_slave + test_bit(__LINK_STATE_LINKWATCH_PENDING, + bond-curr_active_slave-dev-state)) { + dprintk(delaying gratuitous arp on %s\n, + bond-curr_active_slave-dev-name); + bond-send_grat_arp = 1; + } else + bond_send_gratuitous_arp(bond); } } @@ -2083,6 +2089,17 @@ void bond_mii_monitor(struct net_device * program could monitor the link itself if needed. 
*/ + if (bond-send_grat_arp) { + if (bond-curr_active_slave test_bit(__LINK_STATE_LINKWATCH_PENDING, + bond-curr_active_slave-dev-state)) + dprintk(Needs to send gratuitous arp but not yet\n); + else { + dprintk(sending delayed gratuitous arp on on %s\n, + bond-curr_active_slave-dev-name); + bond_send_gratuitous_arp(bond); + bond-send_grat_arp = 0; + } + } read_lock(bond-curr_slave_lock); oldcurrent = bond-curr_active_slave; read_unlock(bond-curr_slave_lock); @@ -2484,7 +2501,7 @@ static void bond_send_gratuitous_arp(str if (bond-master_ip) { bond_arp_send(slave-dev, ARPOP_REPLY, bond-master_ip, - bond-master_ip, 0); + bond-master_ip, 0); } list_for_each_entry(vlan, bond-vlan_list, vlan_list) { @@ -4293,6 +4310,7 @@ static int bond_init(struct net_device * bond-current_arp_slave = NULL; bond-primary_slave = NULL; bond-dev = bond_dev; + bond-send_grat_arp = 0; INIT_LIST_HEAD(bond-vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h === --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:56:33.0 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 11:05:41.516451497 +0300 @@ -187,6 +187,7 @@ struct bonding { struct timer_list arp_timer; s8 kill_timers; s8 do_set_mac_addr; + s8 send_grat_arp; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V6 8/9] net/bonding: Destroy bonding master when last slave is gone
When bonding enslaves non-Ethernet devices it takes pointers to functions in the module that owns the slaves. In this case it becomes unsafe to keep the bonding master registered after the last slave was unenslaved, because we don't know if the pointers are still valid. Destroying the bond when slave_cnt is zero ensures that these functions are not used anymore.

Signed-off-by: Moni Shoua monis at voltaire.com
---
 drivers/net/bonding/bond_main.c  | 37 +
 drivers/net/bonding/bond_sysfs.c |  9 +
 drivers/net/bonding/bonding.h    |  3 +++
 3 files changed, 45 insertions(+), 4 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-09-24 14:01:24.055441842 +0200
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-09-24 14:05:05.658979207 +0200
@@ -1256,6 +1256,7 @@ static int bond_compute_features(struct
 static void bond_setup_by_slave(struct net_device *bond_dev,
 				struct net_device *slave_dev)
 {
+	struct bonding *bond = bond_dev->priv;
 	bond_dev->hard_header       = slave_dev->hard_header;
 	bond_dev->rebuild_header    = slave_dev->rebuild_header;
 	bond_dev->hard_header_cache = slave_dev->hard_header_cache;
@@ -1270,6 +1271,7 @@ static void bond_setup_by_slave(struct n
 	memcpy(bond_dev->broadcast, slave_dev->broadcast,
 	       slave_dev->addr_len);
+	bond->setup_by_slave = 1;
 }

 /* enslave device slave to bond device master */
@@ -1838,6 +1840,35 @@ int bond_release(struct net_device *bond
 }

 /*
+* Destroy a bonding device.
+* Must be under rtnl_lock when this function is called.
+*/
+void bond_destroy(struct bonding *bond)
+{
+	bond_deinit(bond->dev);
+	bond_destroy_sysfs_entry(bond);
+	unregister_netdevice(bond->dev);
+}
+
+/*
+* First release a slave and then destroy the bond if no more slaves are left.
+* Must be under rtnl_lock when this function is called.
+*/
+int bond_release_and_destroy(struct net_device *bond_dev,
+			     struct net_device *slave_dev)
+{
+	struct bonding *bond = bond_dev->priv;
+	int ret;
+
+	ret = bond_release(bond_dev, slave_dev);
+	if ((ret == 0) && (bond->slave_cnt == 0)) {
+		printk(KERN_INFO DRV_NAME ": %s: destroying bond %s.\n",
+		       bond_dev->name, bond_dev->name);
+		bond_destroy(bond);
+	}
+	return ret;
+}
+
+/*
 * This function releases all slaves.
 */
 static int bond_release_all(struct net_device *bond_dev)
@@ -3337,6 +3368,11 @@ static int bond_slave_netdev_event(unsig
 		 * ... Or is it this?
 		 */
 		break;
+	case NETDEV_GOING_DOWN:
+		dprintk("slave %s is going down\n", slave_dev->name);
+		if (bond->setup_by_slave)
+			bond_release_and_destroy(bond_dev, slave_dev);
+		break;
 	case NETDEV_CHANGEMTU:
 		/*
 		 * TODO: Should slaves be allowed to
@@ -4311,6 +4347,7 @@ static int bond_init(struct net_device *
 	bond->primary_slave = NULL;
 	bond->dev = bond_dev;
 	bond->send_grat_arp = 0;
+	bond->setup_by_slave = 0;
 	INIT_LIST_HEAD(&bond->vlan_list);
 	/* Initialize the device entry points */
Index: net-2.6/drivers/net/bonding/bonding.h
===
--- net-2.6.orig/drivers/net/bonding/bonding.h	2007-09-24 14:01:24.055441842 +0200
+++ net-2.6/drivers/net/bonding/bonding.h	2007-09-24 14:01:24.627340013 +0200
@@ -188,6 +188,7 @@ struct bonding {
 	s8 kill_timers;
 	s8 do_set_mac_addr;
 	s8 send_grat_arp;
+	s8 setup_by_slave;
 	struct net_device_stats stats;
 #ifdef CONFIG_PROC_FS
 	struct proc_dir_entry *proc_entry;
@@ -295,6 +296,8 @@ static inline void bond_unset_master_alb
 struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr);
 int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, struct net_device *slave_dev);
 int bond_create(char *name, struct bond_params *params, struct bonding **newbond);
+void bond_destroy(struct bonding *bond);
+int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev);
 void bond_deinit(struct net_device *bond_dev);
 int bond_create_sysfs(void);
 void bond_destroy_sysfs(void);
Index:
net-2.6/drivers/net/bonding/bond_sysfs.c === --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-09-24 14:01:23.523536550 +0200 +++ net-2.6/drivers/net/bonding/bond_sysfs.c2007-09-24 14:01:24.628339835 +0200 @@ -164,9 +164,7 @@ static ssize_t bonding_store_bonds(struc printk(KERN_INFO DRV_NAME : %s
[ofa-general] [PATCH 9/9] bonding: Optionally allow ethernet slaves to keep own MAC
Update the don't change MAC of slaves functionality added in previous changes to be a generic option, rather than something tied to IB devices, as it's occasionally useful for regular ethernet devices as well. Adds fail_over_mac option (which is automatically enabled for IB slaves), applicable only to active-backup mode. Includes documentation update. Updates bonding driver version to 3.2.0. Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] --- Documentation/networking/bonding.txt | 33 +++ drivers/net/bonding/bond_main.c | 57 + drivers/net/bonding/bond_sysfs.c | 49 + drivers/net/bonding/bonding.h|6 ++-- 4 files changed, 121 insertions(+), 24 deletions(-) diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 1da5666..1134062 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -281,6 +281,39 @@ downdelay will be rounded down to the nearest multiple. The default value is 0. +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address (the traditional behavior), or, when + enabled, change the bond's MAC address when changing the + active interface (i.e., fail over the MAC address itself). + + Fail over MAC is useful for devices that cannot ever alter + their MAC address, or for devices that refuse incoming + broadcasts with their own source MAC (which interferes with + the ARP monitor). + + The down side of fail over MAC is that every device on the + network must be updated via gratuitous ARP, vs. just updating + a switch or set of switches (which often takes place for any + traffic, not just ARP traffic, if the switch snoops incoming + traffic to update its tables) for the traditional method. If + the gratuitous ARP is lost, communication may be disrupted. 
+
+	When fail over MAC is used in conjunction with the mii monitor,
+	devices which assert link up prior to being able to actually
+	transmit and receive are particularly susceptible to loss of
+	the gratuitous ARP, and an appropriate updelay setting may be
+	required.
+
+	A value of 0 disables fail over MAC, and is the default. A
+	value of 1 enables fail over MAC. This option is enabled
+	automatically if the first slave added cannot change its MAC
+	address. This option may be modified via sysfs only when no
+	slaves are present in the bond.
+
+	This option was added in bonding version 3.2.0.
+
 lacp_rate

	Option specifying the rate in which we'll ask our link partner

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77caca3..c01ff9d 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -97,6 +97,7 @@ static char *xmit_hash_policy = NULL;
 static int arp_interval = BOND_LINK_ARP_INTERV;
 static char *arp_ip_target[BOND_MAX_ARP_TARGETS] = { NULL, };
 static char *arp_validate = NULL;
+static int fail_over_mac = 0;
 struct bond_params bonding_defaults;
 module_param(max_bonds, int, 0);
@@ -130,6 +131,8 @@ module_param_array(arp_ip_target, charp, NULL, 0);
 MODULE_PARM_DESC(arp_ip_target, "arp targets in n.n.n.n form");
 module_param(arp_validate, charp, 0);
 MODULE_PARM_DESC(arp_validate, "validate src/dst of ARP probes: none (default), active, backup or all");
+module_param(fail_over_mac, int, 0);
+MODULE_PARM_DESC(fail_over_mac, "For active-backup, do not set all slaves to the same MAC. 0 for off (default), 1 for on.");

 /*- Global variables */

@@ -1099,7 +1102,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 	/* when bonding does not set the slave MAC address, the bond MAC
 	 * address is the one of the active slave.
 	 */
-	if (new_active && !bond->do_set_mac_addr)
+	if (new_active && bond->params.fail_over_mac)
 		memcpy(bond->dev->dev_addr, new_active->dev->dev_addr,
 		       new_active->dev->addr_len);
 	if (bond->curr_active_slave
@@ -1371,16 +1374,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	if (slave_dev->set_mac_address == NULL) {
 		if (bond->slave_cnt == 0) {
 			printk(KERN_WARNING DRV_NAME
-			       ": %s: Warning: The first slave device you "
-			       "specified does not support setting the MAC "
-			       "address. This bond MAC address would be that "
-			       "of the active slave.\n", bond_dev->name);
-			bond->do_set_mac_addr = 0;
-		} else
Re: [ofa-general] Re: [ewg] OFED teleconference today
I cannot make the meeting today. I vote for the 2.6.24 base. There is still the outstanding iwarp port space issue that will need to be pulled into ofed-1.3 when it finalizes. But it's a bug fix really, so not a new feature, I guess.

Tziporet Koren wrote: Jeff Squyres wrote: Friendly reminder: the OFED teleconference is several hours from now (Monday, September 24, 2007). Noon US eastern / 9am US Pacific / -=6pm Israel=- 1. Monday, Sep 24, code 210062024 (***TODAY***)

Agenda:
1. Agree on the new OFED 1.3 schedule:
   * Feature freeze - Sep 25
   * Alpha release - Oct 1
   * Beta release - Oct 17 (may change according to 2.6.24 rc1 availability)
   * RC1 - Oct 24
   * RC2 - Nov 7
   * RC3 - Nov 20
   * RC4 - Dec 4
   * GA release - Dec 18
2. Agree to move to kernel base 2.6.24. Start with what we have now (2.6.23) and move to 2.6.24 when RC1 is available. This will reduce many patches and with the new timeline seems more appropriate.

Please send any other agenda items you have. Tziporet
[ofa-general] Re: [PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices
On Mon, 24 Sep 2007 17:37:00 +0200 Moni Shoua [EMAIL PROTECTED] wrote:

Allow enslaving devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the cleanest way to go, where it is not expected to cause instabilities.

Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() has set the bonding device type to be ARPHRD_ETHER and the address len to be ETHER_ALEN, the net core code computes a wrong multicast link address. This is because ip_eth_mc_map() is called, whereas for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND).

Signed-off-by: Moni Shoua monis at voltaire.com
Signed-off-by: Or Gerlitz ogerlitz at voltaire.com
---
 drivers/net/bonding/bond_main.c  | 5 +++--
 drivers/net/bonding/bond_sysfs.c | 6 ++
 2 files changed, 5 insertions(+), 6 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-08-15 10:54:41.0 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-08-15 10:55:48.431862446 +0300
@@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond
 	/* bond must be initialized by bond_open() before enslaving */
 	if (!(bond_dev->flags & IFF_UP)) {
-		dprintk("Error, master_dev is not up\n");
-		return -EPERM;
+		printk(KERN_WARNING DRV_NAME
+		       "%s: master_dev is not up in bond_enslave\n",
+		       bond_dev->name);
 	}

 	/* already enslaved */
Index: net-2.6/drivers/net/bonding/bond_sysfs.c
===
--- net-2.6.orig/drivers/net/bonding/bond_sysfs.c	2007-08-15 10:08:58.0 +0300
+++ net-2.6/drivers/net/bonding/bond_sysfs.c	2007-08-15 10:55:48.432862269 +0300
@@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru
 	/* Quick sanity check -- is the bond interface up?
 	 */
 	if (!(bond->dev->flags & IFF_UP)) {
-		printk(KERN_ERR DRV_NAME
-		       ": %s: Unable to update slaves because interface is down.\n",
+		printk(KERN_WARNING DRV_NAME
+		       ": %s: doing slave updates when interface is down.\n",
 		       bond->dev->name);
-		ret = -EPERM;
-		goto out;
 	}

Please get rid of the warning. Make bonding work correctly and allow enslave/remove of a device when bonding is down.
Re: [ofa-general] [BUG report / PATCH] fix race in the core multicast management
Now, in this case there was --no-- previous event; when the port was brought back online there was a PORT_ACTIVE event (it's a driver issue which we are looking at). However, from the viewpoint of the SA there was a GID out event, so the HCA port was dropped out from the multicast group and the multicast routing (spanning tree, MFTs configuration etc) was computed without this port being included. This is the ipoib logging of what happens from its perspective (I have added the event number to the port state change event print): Do you know why there wasn't some sort of port down event? node 1 - switch A - switch B - switch C - SA The host would only see port up/down events as a result of changes in the link state of the local port or of the port which is connected to it through the cable. So, if you brought the link down/up between switches A and B, node 1 wouldn't receive any events, but it would be removed from the multicast group? - Sean
Re: [ofa-general] RE: OFA website edits
Jeff Becker wrote: I'm OK with these suggestions. Please let me know what you would like implemented. Thanks. I tried changing my WEB_README, and the updates didn't show up on the download page. How often should the page be updated? - Sean
[ofa-general] RE: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch
I used git-format-patch to extract patches from this tree and add them to ofed 1.3 kernel tree. Thanks ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] rdma_cm connect / disconnect / reject race....resulting in crash....
Sean, per our discussion here's the problem description from Olaf...

We start to shut down the connection, and call rdma_destroy_qp on our cm_id. We haven't executed rdma_destroy_id yet. Now apparently a connect reject message comes in from the other host, and cma_ib_handler() is called with an event of IB_CM_REJ_RECEIVED. It calls cma_modify_qp_err, which for some odd reason tries to modify the exact same QP we just destroyed. The crash looks like this:

RDS/IB: connection request while the connection exist: 11.0.0.18, disconnecting and reconnecting
ic f7ccb800 ic->i_cm_id f7cb2a00
rdma_destroy_qp(f7cb2a00)
Unable to handle kernel NULL pointer dereference at virtual address 00f8
EIP is at ib_modify_qp+0x5/0xe [ib_core]
Stack: f7cb2a00 f8ac36af 0006 1a0f4680 f6742e7c c011cc85 c495ede0 f671ce30 c495ede0 c495ede0 0086 c495ede0 c011d1a3 f671ce30 f671ce30 0002 c4966de0 0002 c495ede0 0001 0001
Call Trace:
 [f8ac36af] cma_modify_qp_err+0x22/0x2d [rdma_cm]
 [...]
 [f8ac3371] cma_disable_remove+0x35/0x3b [rdma_cm]
 [f8ac3e31] cma_ib_handler+0xe6/0x158 [rdma_cm]
 [f89150f7] cm_process_work+0x4a/0x80 [ib_cm]
 [f8916c33] cm_rej_handler+0xd3/0x114 [ib_cm]

It dies trying to dereference qp->device->modify_qp because qp->device is NULL. If you check the stack, you'll see the exact same cm_id that we just called rdma_destroy_qp() on (note that the printk(rdma_destroy_qp) that appears above comes *after* the call itself, so by the time this is printed, the QP is dead already).

That's easy, I thought. Obviously, rdma_destroy_qp just forgets to clear cm_id->qp after destroying the queue pair:

void rdma_destroy_qp(struct rdma_cm_id *id)
{
	ib_destroy_qp(id->qp);
+	id->qp = NULL;
}

But that didn't really fix it. So either there's something else going on which I don't grok yet, or this is just another case of bad locking.
Re: [ofa-general] RE: OFA website edits
Hi Sean. I just talked to Jeff Scott about this, as he had announced the new downloads page. It turns out that the new page does not use my php page that automatically updates, but rather took a snapshot of the page state. That's why your update doesn't show up. He said he would try to fix this. -jeff On 9/24/07, Sean Hefty [EMAIL PROTECTED] wrote: Jeff Becker wrote: I'm OK with these suggestions. Please let me know what you would like implemented. Thanks. I tried changing my WEB_README, and the updates didn't show up on the download page. How often should be the page be updated? - Sean ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
I have submitted this before; but here it is again. Against net-2.6.24 from yesterday for this and all following patches. cheers, jamal Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? I apologize for not having working patches right now, but I am working on them slowly as I have some blips of spare time. Thanks, -PJ Waskiewicz ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why? Thanks, Glenn.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Steve Wise
Sent: Sunday, September 23, 2007 3:37 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; general@lists.openfabrics.org
Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

Version 3:
- don't use list_del_init() where list_del() is sufficient.

Version 2:
- added a per-device mutex for the address and listening endpoints lists.
- wait for all replies if sending multiple passive_open requests to rnic.
- log warning if no addresses are available when a listen is issued.
- tested

---

Design:

The sysadmin creates, for iwarp use only, alias interfaces of the form devname:iw* where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with iw. The iw immediately after the ':' is the key used by the iw_cxgb3 driver.

EG:
ifconfig eth0 192.168.70.123 up
ifconfig eth0:iw1 192.168.71.123 up
ifconfig eth0:iw2 192.168.72.123 up

In the above example, 192.168.70/24 is for TCP traffic, while 192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use. The rdma-only interface must be on its own IP subnet. This allows routing all rdma traffic onto this interface. The iWARP driver must translate all listens on address 0.0.0.0 to the set of rdma-only ip addresses for the device in question. This prevents incoming connect requests to the TCP ip addresses from going up the rdma stack.

Implementation Details:
- The iw_cxgb3 driver registers for inetaddr events via register_inetaddr_notifier().
This allows tracking the iwarp-only addresses/subnets as they get added and deleted. The iwarp driver maintains a list of the current iwarp-only addresses. - The iw_cxgb3 driver builds the list of iwarp-only addresses for its devices at module insert time. This is needed because the inetaddr notifier callbacks don't replay address-add events when someone registers. So the driver must build the initial list at module load time. - When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver must translate that into a set of listens on the iwarp-only addresses. This is implemented by maintaining a list of stid/addr entries per listening endpoint. - When a new iwarp-only address is added or removed, the iw_cxgb3 driver must traverse the set of listening endpoints and update them accordingly. This allows an application to bind to 0.0.0.0 prior to the iwarp-only interfaces being configured. It also allows changing the iwarp-only set of addresses and getting the expected behavior for apps already bound to 0.0.0.0. This is done by maintaining a list of listening endpoints off the device struct. - The address list, the listening endpoint list, and each list of stid/addrs in use per listening endpoint are all protected via a mutex per iw_cxgb3 device. 
Signed-off-by: Steve Wise [EMAIL PROTECTED]
---
 drivers/infiniband/hw/cxgb3/iwch.c    | 125 ++++
 drivers/infiniband/hw/cxgb3/iwch.h    |  11 +
 drivers/infiniband/hw/cxgb3/iwch_cm.c | 259 +++--
 drivers/infiniband/hw/cxgb3/iwch_cm.h |  15 ++
 4 files changed, 360 insertions(+), 50 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
index 0315c9d..d81d46e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -63,6 +63,123 @@ struct cxgb3_client t3c_client = {
 static LIST_HEAD(dev_list);
 static DEFINE_MUTEX(dev_mutex);

+static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
+{
+	struct iwch_addrlist *addr;
+
+	addr = kmalloc(sizeof *addr, GFP_KERNEL);
+	if (!addr) {
+		printk(KERN_ERR MOD "%s - failed to alloc memory!\n",
+		       __FUNCTION__);
+		return;
+	}
+	addr->ifa = ifa;
+	mutex_lock(&rnicp->mutex);
+	list_add_tail(&addr->entry, &rnicp->addrlist);
+	mutex_unlock(&rnicp->mutex);
+}
+
+static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
+{
+	struct iwch_addrlist *addr, *tmp;
+
+	mutex_lock(&rnicp->mutex);
+	list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) {
+		if (addr->ifa == ifa) {
+			list_del(&addr->entry);
+			kfree(addr);
+			goto out;
+		}
+	}
+out:
+	mutex_unlock(&rnicp->mutex);
+}
+
+static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev)
+{
+	int i;
+
+	for (i = 0; i < rnicp->rdev.port_info.nports; i++)
Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why? Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, iSER, etc). Does the NetEffect NIC have the same issue as cxgb3 here? What are your thoughts on how to handle this? - R.
[ofa-general] [PATCH] rdma/cm: add locking around QP accesses
If a user allocates a QP on an rdma_cm_id, the rdma_cm will automatically transition the QP through its states (RTR, RTS, error, etc.). While the QP state transitions are occurring, the QP itself must remain valid. Provide locking around the QP pointer to prevent its destruction while accessing the pointer.

This fixes an issue reported by Olaf Kirch from Oracle that resulted in a system crash:

An incoming connection arrives and we decide to tear down the nascent connection. The remote end decides to do the same. We start to shut down the connection, and call rdma_destroy_qp on our cm_id. ... Now apparently a 'connect reject' message comes in from the other host, and cma_ib_handler() is called with an event of IB_CM_REJ_RECEIVED. It calls cma_modify_qp_err, which for some odd reason tries to modify the exact same QP we just destroyed.

Signed-off-by: Sean Hefty [EMAIL PROTECTED]
---
Rick, can you please test this patch and let me know if it fixes your problem?

 drivers/infiniband/core/cma.c | 90 +++--
 1 files changed, 60 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9ffb998..c6a6dba 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -120,6 +120,8 @@ struct rdma_id_private {
 	enum cma_state state;
 	spinlock_t lock;
+	struct mutex qp_mutex;
+
 	struct completion comp;
 	atomic_t refcount;
 	wait_queue_head_t wait_remove;
@@ -387,6 +389,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 	id_priv->id.event_handler = event_handler;
 	id_priv->id.ps = ps;
 	spin_lock_init(&id_priv->lock);
+	mutex_init(&id_priv->qp_mutex);
 	init_completion(&id_priv->comp);
 	atomic_set(&id_priv->refcount, 1);
 	init_waitqueue_head(&id_priv->wait_remove);
@@ -472,61 +475,86 @@ EXPORT_SYMBOL(rdma_create_qp);

 void rdma_destroy_qp(struct rdma_cm_id *id)
 {
-	ib_destroy_qp(id->qp);
+	struct rdma_id_private *id_priv;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	mutex_lock(&id_priv->qp_mutex);
+	ib_destroy_qp(id_priv->id.qp);
+	id_priv->id.qp = NULL;
+	mutex_unlock(&id_priv->qp_mutex);
 }
 EXPORT_SYMBOL(rdma_destroy_qp);

-static int cma_modify_qp_rtr(struct rdma_cm_id *id)
+static int cma_modify_qp_rtr(struct rdma_id_private *id_priv)
 {
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;

-	if (!id->qp)
-		return 0;
+	mutex_lock(&id_priv->qp_mutex);
+	if (!id_priv->id.qp) {
+		ret = 0;
+		goto out;
+	}

 	/* Need to update QP attributes from default values. */
 	qp_attr.qp_state = IB_QPS_INIT;
-	ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask);
+	ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask);
 	if (ret)
-		return ret;
+		goto out;

-	ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask);
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
 	if (ret)
-		return ret;
+		goto out;

 	qp_attr.qp_state = IB_QPS_RTR;
-	ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask);
+	ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask);
 	if (ret)
-		return ret;
+		goto out;

-	return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask);
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
+out:
+	mutex_unlock(&id_priv->qp_mutex);
+	return ret;
 }

-static int cma_modify_qp_rts(struct rdma_cm_id *id)
+static int cma_modify_qp_rts(struct rdma_id_private *id_priv)
 {
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;

-	if (!id->qp)
-		return 0;
+	mutex_lock(&id_priv->qp_mutex);
+	if (!id_priv->id.qp) {
+		ret = 0;
+		goto out;
+	}

 	qp_attr.qp_state = IB_QPS_RTS;
-	ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask);
+	ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask);
 	if (ret)
-		return ret;
+		goto out;

-	return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask);
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
+out:
+	mutex_unlock(&id_priv->qp_mutex);
+	return ret;
 }

-static int cma_modify_qp_err(struct rdma_cm_id *id)
+static int cma_modify_qp_err(struct rdma_id_private *id_priv)
 {
 	struct ib_qp_attr qp_attr;
+	int ret;

-	if (!id->qp)
-		return 0;
+	mutex_lock(&id_priv->qp_mutex);
+	if (!id_priv->id.qp) {
+		ret = 0;
+		goto out;
+	}

 	qp_attr.qp_state = IB_QPS_ERR;
-	return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE);
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr,
Re: [ofa-general] Re: A question about rdma_get_cm_event
Note that the private data length _is_ correct for iwarp. So the man pages should mention that this is an IB-only issue maybe? And maybe indicate that transport-independent applications should not rely on the length... I modified the man pages to describe private_data_len as: Specifies the size of the user-controlled data buffer. Note that the actual amount of data transferred to the remote side is transport dependent and may be larger than that requested. These changes have been pushed into my git tree. - Sean ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH] osm/osm_sa_path_record: trivial cosmetic change
Trivial fix in osm_sa_path_record.c

Signed-off-by: Yevgeny Kliteynik [EMAIL PROTECTED]
---
 opensm/opensm/osm_sa_path_record.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c
index 3b183d9..ce75ec8 100644
--- a/opensm/opensm/osm_sa_path_record.c
+++ b/opensm/opensm/osm_sa_path_record.c
@@ -723,7 +723,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv,
 	if (pkey) {
 		p_prtn =
 		    (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl,
-					       pkey & cl_ntoh16((uint16_t) ~
+					       pkey & cl_hton16((uint16_t) ~
 							       0x8000));
 		if (p_prtn ==
 		    (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl))
-- 
1.5.1.4
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfacesto avoid 4-tuple conflicts.
On Mon, 2007-09-24 at 16:30 -0500, Glenn Grundstrom wrote: -Original Message- From: Roland Dreier [mailto:[EMAIL PROTECTED] Sent: Monday, September 24, 2007 2:33 PM To: Glenn Grundstrom Cc: Steve Wise; [EMAIL PROTECTED]; general@lists.openfabrics.org Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts. I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why? Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, iSER, etc). The kernel apps could open a Linux tcp socket and create an RDMA socket connection. Both calls are standard Linux kernel architected routines. This approach was NAK'd by David Miller and others... Doesn't NFSoRDMA already open a TCP socket and another for RDMA traffic (ports 2049 and 2050 if I remember correctly)? The NFS RDMA transport driver does not open a socket for the RDMA connection. It uses a different port in order to allow both TCP and RDMA mounts to the same filer. I currently don't know if iSER, RDS, etc. already do the same thing, but if they don't, they probably could very easily. Woe be to those who do so... Does the NetEffect NIC have the same issue as cxgb3 here? What are your thoughts on how to handle this? Yes, the NetEffect RNIC will have the same issue as Chelsio. And all future RNICs which support a unified tcp address with Linux will as well. Steve has put a lot of thought and energy into the problem, but I don't think users and admins will be very happy with us in the long run. Agreed. In summary, short of having the rdma_cm share kernel port space, I'd like to see the equivalent in userspace and have the kernel apps handle the issue in a similar way as described above.
There are a few technical issues to work through (like passing the userspace IP address to the kernel). This just moves the socket creation to code that is outside the purview of the kernel maintainers. The exchanging of the 4-tuple created with the kernel module, however, is back in the kernel and in the maintainer's control and responsibility. In my view anything like this will be viewed as an attempt to sneak code into the kernel that the maintainer has already vehemently rejected. This will make people angry and damage the cooperative working relationship that we are trying to build. But I think we can solve that just like other information that gets passed from user into the IB/RDMA kernel modules. Sharing the IP 4-tuple space cooperatively with the core in any fashion has been nak'd. Without this cooperation, the options we've been able to come up with are administrative/policy based approaches. Any ideas you have along these lines are welcome. Tom Glenn. - R.
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 12:12 -0700, Waskiewicz Jr, Peter P wrote: Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? The one thing that seems obvious is to use dev->hard_prep_xmit() in the patches i posted to select the xmit ring. You should be able to figure out the xmit ring without holding any lock. I lost track of how/where things went since the last discussion; so i need to wrap my mind around it to make sensible suggestions - I know the core patches are in the kernel but haven't paid attention to details, and if you look at my second patch you'd see a comment in dev_batch_xmit() which says i need to scrutinize multiqueue more. cheers, jamal
[ofa-general] Re: [PATCHES] TX batching
jamal wrote: On Mon, 2007-24-09 at 00:00 -0700, Kok, Auke wrote: that's bad to begin with :) - please send those separately so I can fasttrack them into e1000e and e1000 where applicable. I've been CCing you ;-> Most of the changes are readability and reusability with the batching. But yes, I'm very inclined to merge more features into e1000e than e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like tx batching would absolutely be interesting. I looked at the e1000e and it is very close to e1000, so I should be able to move the changes easily. Most importantly, can I kill LLTX? For tx batching, we have to wait to see how Dave wants to move forward; I will have the patches but it is not something you need to push until we see where that is going. hmm, I thought I already removed that, but now I see some remnants of it. By all means, please send a separate patch for that! Auke ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [DOC] Net batching driver howto
I have updated the driver howto to match the patches I posted yesterday. Attached. cheers, jamal

Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices.

1.0 Netdevice Pre-requisites
--
For hardware based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e. having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case.

2.0 What is new in the driver API
---
There are 3 new methods and one new variable introduced. These are:
1) dev->hard_prep_xmit()
2) dev->hard_end_xmit()
3) dev->hard_batch_xmit()
4) dev->xmit_win

2.1 Using Core driver changes
-
To provide context, let's look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell the DMA engine to chew on, tx completion interrupts, etc.
[For code cleanliness/readability sake, regardless of this work, one should break the dev->hard_start_xmit() into those 4 functions anyways.]
A driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)
Note: There are drivers which may not need to support either of the two methods (example: the tun driver I patched), so the two methods are essentially optional.

2.1.1 Theory of operation
--
The core will first do the packet formatting by invoking your supplied dev->hard_prep_xmit() method. It will then pass you the packets via your dev->hard_start_xmit() method, for as many packets as you have advertised (via dev->xmit_win) you can consume. 
Lastly it will invoke your dev->hard_end_xmit() when it completes passing you all the packets queued for you.

2.1.1.1 Locking rules
-
dev->hard_prep_xmit() is invoked without holding any tx lock, but the rest are under TX_LOCK(). So you have to ensure that whatever you put in dev->hard_prep_xmit() doesn't require locking.

2.1.1.2 The slippery LLTX
-
LLTX drivers present a challenge in that we have to introduce a deviation from the norm and require the ->hard_batch_xmit() method. An LLTX driver presents us with ->hard_batch_xmit(), to which we pass a list of packets in a dev->blist skb queue. It is then the responsibility of the ->hard_batch_xmit() to exercise steps #b and #c for all packets passed in the dev->blist. Steps #a and #d are done by the core, should you register presence of dev->hard_prep_xmit() and dev->hard_end_xmit() in your setup.

2.1.1.3 xmit_win
The dev->xmit_win variable is set by the driver to tell us how much space it has in its rings/queues. dev->xmit_win is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, whenever it invokes netif_wake_queue, how much space it has for descriptors by setting this variable.

3.0 Driver Essentials
-
The typical driver tx state machine is:

-1- Core sends packets
    -> Driver puts packet onto hardware queue
    -> if hardware queue is full, netif_stop_queue(dev)
-2- Core stops sending because of netif_stop_queue(dev)
    .. time passes ..
-3- Driver has transmitted packets, opens up tx path by invoking netif_wake_queue(dev)
-1- Cycle repeats and core sends more packets (step 1).

3.1 Driver pre-requisite
--
This is _a very important_ requirement in making batching useful. The pre-requisite for batching changes is that the driver should provide a low threshold to open up the tx path. 
Drivers such as tg3 and e1000 already do this. Before you invoke netif_wake_queue(dev) you check that a threshold of free space has been reached for inserting new packets. Here's an example of how I added it to the tun driver. Observe the setting of dev->xmit_win.
---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
u32 t = skb_queue_len(&tun->readq);
if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
    tun->dev->xmit_win = tun->dev->tx_queue_len;
    netif_wake_queue(tun->dev);
}
---
Here's how the batching e1000 driver does it:
--
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
             E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
    if (netif_queue_stopped(netdev)) {
        int rspace = E1000_DESC_UNUSED(tx_ring) -
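The low-threshold wake logic in the two fragments above can be modeled in plain C. This is a userspace sketch, not driver code: tx_should_wake is a hypothetical helper, and TX_WAKE_THRESHOLD here just mirrors the name from the e1000 fragment.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_WAKE_THRESHOLD 32  /* illustrative low-water mark */

/* Decide whether the driver should reopen the tx path and how big a
 * batching window (xmit_win) to advertise. Mirrors the tg3/e1000 idea:
 * only wake the queue once a threshold of free descriptors is reached,
 * and advertise that free space so the core can batch up to it. */
static bool tx_should_wake(unsigned int free_desc, bool queue_stopped,
                           unsigned int *xmit_win)
{
    if (!queue_stopped || free_desc < TX_WAKE_THRESHOLD)
        return false;
    *xmit_win = free_desc;  /* tell the core how many packets we can take */
    return true;
}
```

The point of the threshold is to avoid waking the queue for one free slot and immediately stopping it again, which would defeat batching.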
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
The one thing that seems obvious is to use dev->hard_prep_xmit() in the patches I posted to select the xmit ring. You should be able to figure out the tx ring without holding any lock. I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. I lost track of how/where things went since the last discussion; so I need to wrap my mind around it to make sensible suggestions - I know the core patches are in the kernel but haven't paid attention to details, and if you look at my second patch you'd see a comment in dev_batch_xmit() which says I need to scrutinize multiqueue more. No worries. I'll try to get things together on my end and provide some patches to add a per-queue lock. In the meantime, I'll take a much closer look at the batching code, since I stopped looking at the patches in-depth about a month ago. :-( Thanks, -PJ Waskiewicz ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
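The reason ring selection can happen without a lock is that it only reads per-packet fields and stores the result in the packet itself; only the later enqueue, under a lock, acts on it. A minimal sketch of such a lock-free selector (all names hypothetical, loosely modeled on how skb->queue_mapping gets picked from flow fields):

```c
/* Pick a tx ring from immutable flow fields, without taking any lock.
 * Safe to call concurrently: it only reads its arguments, so the
 * caller can stash the result (e.g. in skb->queue_mapping) before
 * taking the per-queue lock for the actual enqueue. */
static unsigned int select_tx_ring(unsigned int src_ip, unsigned int dst_ip,
                                   unsigned short src_port,
                                   unsigned short dst_port,
                                   unsigned int num_rings)
{
    unsigned int hash = src_ip ^ dst_ip ^
                        (((unsigned int)src_port << 16) | dst_port);
    hash ^= hash >> 16;  /* fold the high bits down */
    hash ^= hash >> 8;
    return hash % num_rings;
}
```

A flow-hash like this keeps packets of one flow on one ring, which avoids reordering without any cross-CPU coordination.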
[ofa-general] Atomic operation question.
Hi, I have a question about atomic operations. If incoming atomic operations arrive on both ports of the HCA, can they work correctly? Thanks. --CQ ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? What is the policy for how the qdisc queues are locked/mapped to tx rings? cheers, jamal ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than making the actual dequeue efficient, though. Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? What is the policy for how the qdisc queues are locked/mapped to tx rings? The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued. The skb->queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy. And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now). I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. 
Thanks, -PJ Waskiewicz ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
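The per-queue trylock dequeue described above can be sketched in userspace C, with a pthread mutex standing in for the kernel spinlock (all names hypothetical; the real change would live in the netdev subqueue struct and sch_prio):

```c
#include <pthread.h>
#include <stdbool.h>

#define NUM_TX_QUEUES 4

/* One lock per hardware tx queue, as in the proposed subqueue struct. */
struct subqueue {
    pthread_mutex_t queue_lock;
    int pending;               /* packets waiting in this band */
};

static struct subqueue queues[NUM_TX_QUEUES];

static void queues_init(void)
{
    for (int i = 0; i < NUM_TX_QUEUES; i++) {
        pthread_mutex_init(&queues[i].queue_lock, NULL);
        queues[i].pending = 0;
    }
}

/* Try to dequeue one packet from queue q. Uses trylock so a dequeue
 * can move on (return false) instead of spinning when the queue is
 * busy with an enqueue or another dequeue, mirroring the proposed
 * spin_trylock() policy. */
static bool try_dequeue(int q)
{
    struct subqueue *sq = &queues[q];

    if (pthread_mutex_trylock(&sq->queue_lock) != 0)
        return false;          /* busy: caller can try another band */
    bool got = sq->pending > 0;
    if (got)
        sq->pending--;
    pthread_mutex_unlock(&sq->queue_lock);
    return got;
}
```

The trylock is what breaks the single heavily-contended queue_lock apart: a busy band never blocks progress on the others.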
Re: [ofa-general] Re: [PATCH] uDAPL 2.0 mods to co-exist with uDAPL 1.2
James Lentini wrote: Comments below: - +# version-info current:revision:age What does this comment do? just a comment regarding revisioning. # -# This example shows netdev name, enabling administrator to use same copy across cluster +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding The previous line is TODO, right? I'd suggest annotating it with that text to make it clear to users. ok --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -44,7 +44,7 @@ #include inttypes.h #ifndef DAPL_PROVIDER -#define DAPL_PROVIDER OpenIB-cma +#define DAPL_PROVIDER OpenIB-2-cma Should we update OpenIB to ofa? Obviously, this isn't necessary as part of this change I didn't want to change the 1.2 names for compatibility reasons but for 2.0 we could move to ofa names for both libraries and provider names. For example, libdaplcma.so becomes libdaplofa.so, OpenIB-cma becomes ofa. For example dat.conf 2.0 entries would look like this: ofa u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 ib0 0 ofa-1 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 ib1 0 ofa-2 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 ib2 0 ofa-3 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 ib3 0 ofa-bond u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 bond0 0 Is that what you had in mind? -arlin ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 24 Sep 2007 16:47:06 -0700 Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote: On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than making the actual dequeue efficient, though. Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? What is the policy for how the qdisc queues are locked/mapped to tx rings? The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued. The skb->queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy. And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now). 
I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz - Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. -- Stephen Hemminger [EMAIL PROTECTED] ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCHES] TX batching
jamal wrote: If the intel folks will accept the patch i'd really like to kill the e1000 LLTX interface. If I understood DaveM correctly, it is sounding like we want to deprecate all use of LLTX on real hardware? If so, several such projects might be considered, as well as possibly simplifying the TX batching work. Also, WRT e1000 specifically, I was hoping to minimize changes and focus people on e1000e. e1000e replaces (deprecates) large portions of e1000, namely the support for modern PCI Express chips. When e1000e has proven itself in the field, we can potentially look at several e1000 simplifications, during the large scale code removal that becomes possible. Jeff ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz - Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. That seems quite reasonable. I will certainly see what I can do. Thanks Stephen, -PJ Waskiewicz ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general