[ofa-general] Re: [query] Multi path discovery in openSM
If there are multiple paths between two end nodes in a network and I set LMC > 0, does OpenSM itself identify those routes and update the switch forwarding tables, or is that the duty of some other consumer of OpenSM? I am using the min-hop algorithm with OpenSM. In this case, if there are multiple paths (some of which are not min-hop paths), will OpenSM (LMC > 0) configure those paths? regards, Mahesh ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
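Background for readers new to the thread (not part of the original email): with LID Mask Control, each end port is assigned 2^LMC consecutive LIDs, and OpenSM can program a different route for each LID, which is what makes multipathing possible. A minimal sketch of the LID arithmetic, using a hypothetical helper name:

```python
def lids_for_port(base_lid: int, lmc: int) -> list[int]:
    """Return all LIDs assigned to a port: 2**lmc consecutive LIDs
    starting at base_lid (the base LID has its low lmc bits zero)."""
    assert base_lid & ((1 << lmc) - 1) == 0, "base LID must be LMC-aligned"
    return [base_lid + i for i in range(1 << lmc)]

# LMC=0: a single LID, hence a single routed path per destination port
print(lids_for_port(4, 0))   # [4]
# LMC=1: two LIDs, allowing the SM to program two distinct paths
print(lids_for_port(4, 1))   # [4, 5]
```

With LMC=0 there is only one LID per port, so there is nothing for OpenSM to spread traffic across, which is the crux of the question above.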
[ofa-general] Re: [PATCHES] TX batching
jamal wrote: On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote: please be reminded that we're going to strip down e1000, and most of the features should go into e1000e, which has far fewer hardware workarounds. I'm still reluctant to put new stuff into e1000 - I really want to chop it down first ;) sure - the question then is, will you take those changes if I use e1000e? There are a few cleanups that have nothing to do with batching; take a look at the modified e1000 in the git tree. that's bad to begin with :) - please send those separately so I can fast-track them into e1000e and e1000 where applicable. But yes, I'm much more inclined to merge new features into e1000e than into e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like TX batching would absolutely be interesting. Auke
RE: [ofa-general] Re: [query] Multi path discovery in openSM
OpenSM will always use min-hop paths (no matter which routing algorithm is used, except maybe LASH). If you use the default algorithm, OpenSM will tend to spread traffic such that, if you have used LMC=1 (2 LIDs per port), the two paths going to LID0 and LID1 will go through different systems or, if that is not possible, through different nodes.

EZ
Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208 Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Keshetti Mahesh
Sent: Monday, September 24, 2007 8:54 AM
To: openIB
Subject: [ofa-general] Re: [query] Multi path discovery in openSM

If there are multiple paths between two end nodes in a network and I set LMC > 0, does OpenSM itself identify those routes and update the switch forwarding tables, or is that the duty of some other consumer of OpenSM? I am using the min-hop algorithm with OpenSM. In this case, if there are multiple paths (some of which are not min-hop paths), will OpenSM (LMC > 0) configure those paths?

regards, Mahesh
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Quoting Steve Wise [EMAIL PROTECTED]: Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? and git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 OK for that one. -- MST
[ofa-general] ipoib patches - resend subset
Hi Roland, as per your request for a smaller number of changes, I resend this subset of the previous series.
[ofa-general] [PATCH 1/11] IB/ipoib: high dma support
Add high DMA support to IPoIB. This patch assumes all IB devices support 64-bit DMA.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
Index: linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux-2.6.23-rc1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-15 20:50:16.000000000 +0300
+++ linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-15 20:50:27.000000000 +0300
@@ -1079,6 +1079,8 @@ static struct net_device *ipoib_add_port
 	SET_NETDEV_DEV(priv->dev, hca->dma_device);
 
+	priv->dev->features |= NETIF_F_HIGHDMA;
+
 	result = ib_query_pkey(hca, port, 0, &priv->pkey);
 	if (result) {
 		printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n",
[ofa-general] [PATCH 3/11] ib_core: add checksum offload support
Add checksum offload support to the core.

A device that publishes IB_DEVICE_IP_CSUM actually supports calculating the checksum on transmit and provides an indication of whether the checksum is OK on receive.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h
===================================================================
--- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h	2007-09-24 13:24:22.000000000 +0200
+++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h	2007-09-24 13:24:40.000000000 +0200
@@ -95,7 +95,8 @@ enum ib_device_cap_flags {
 	IB_DEVICE_N_NOTIFY_CQ	= (1<<14),
 	IB_DEVICE_ZERO_STAG	= (1<<15),
 	IB_DEVICE_SEND_W_INV	= (1<<16),
-	IB_DEVICE_MEM_WINDOW	= (1<<17)
+	IB_DEVICE_MEM_WINDOW	= (1<<17),
+	IB_DEVICE_IP_CSUM	= (1<<18),
 };
 
 enum ib_atomic_cap {
@@ -431,6 +432,7 @@ struct ib_wc {
 	u8			sl;
 	u8			dlid_path_bits;
 	u8			port_num;	/* valid only for DR SMPs on switches */
+	int			csum_ok;
 };
 
 enum ib_cq_notify_flags {
@@ -615,7 +617,9 @@ enum ib_send_flags {
 	IB_SEND_FENCE		= 1,
 	IB_SEND_SIGNALED	= (1<<1),
 	IB_SEND_SOLICITED	= (1<<2),
-	IB_SEND_INLINE		= (1<<3)
+	IB_SEND_INLINE		= (1<<3),
+	IB_SEND_IP_CSUM		= (1<<4),
+	IB_SEND_UDP_TCP_CSUM	= (1<<5)
 };
 
 struct ib_sge {
[ofa-general] [PATCH 5/11]: mlx4_ib: add checksum offload support
Add checksum offload support to mlx4 Signed-off-by: Ali Ayub [EMAIL PROTECTED] Signed-off-by: Eli Cohen [EMAIL PROTECTED] --- Index: ofa_1_3_dev_kernel/include/linux/mlx4/cq.h === --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/cq.h 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/include/linux/mlx4/cq.h 2007-09-24 12:36:46.0 +0200 @@ -45,11 +45,11 @@ struct mlx4_cqe { u8 sl; u8 reserved1; __be16 rlid; - u32 reserved2; + __be32 ipoib_status; __be32 byte_cnt; __be16 wqe_index; __be16 checksum; - u8 reserved3[3]; + u8 reserved2[3]; u8 owner_sr_opcode; }; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 12:38:29.0 +0200 @@ -439,6 +439,8 @@ static int mlx4_ib_poll_one(struct mlx4_ wc-wc_flags |= be32_to_cpu(cqe-g_mlpath_rqpn) 0x8000 ? IB_WC_GRH : 0; wc-pkey_index = be32_to_cpu(cqe-immed_rss_invalid) 16; + wc-csum_ok = be32_to_cpu(cqe-ipoib_status) 0x1000 + be16_to_cpu(cqe-checksum) == 0x; } return 0; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c2007-09-24 12:36:46.0 +0200 @@ -100,6 +100,8 @@ static int mlx4_ib_query_device(struct i props-device_cap_flags |= IB_DEVICE_AUTO_PATH_MIG; if (dev-dev-caps.flags MLX4_DEV_CAP_FLAG_UD_AV_PORT) props-device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE; + if (dev-dev-caps.flags MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + props-device_cap_flags |= IB_DEVICE_IP_CSUM; props-vendor_id = be32_to_cpup((__be32 *) (out_mad-data + 36)) 0xff; @@ -626,6 +628,9 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev-ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; ibdev-ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; + if (ibdev-dev-caps.flags MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + ibdev-ib_dev.flags |= IB_DEVICE_IP_CSUM; + if (init_node_data(ibdev)) goto 
err_map; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-24 12:36:46.0 +0200 @@ -1433,6 +1433,10 @@ int mlx4_ib_post_send(struct ib_qp *ibqp cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE) : 0) | (wr-send_flags IB_SEND_SOLICITED ? cpu_to_be32(MLX4_WQE_CTRL_SOLICITED) : 0) | + ((wr-send_flags IB_SEND_IP_CSUM) ? +cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM) : 0) | + ((wr-send_flags IB_SEND_UDP_TCP_CSUM) ? +cpu_to_be32(MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) | qp-sq_signal_bits; if (wr-opcode == IB_WR_SEND_WITH_IMM || Index: ofa_1_3_dev_kernel/include/linux/mlx4/qp.h === --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/qp.h 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/include/linux/mlx4/qp.h 2007-09-24 12:36:46.0 +0200 @@ -162,6 +162,8 @@ enum { MLX4_WQE_CTRL_FENCE = 1 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 2, MLX4_WQE_CTRL_SOLICITED = 1 1, + MLX4_WQE_CTRL_IP_CSUM = 1 4, + MLX4_WQE_CTRL_TCP_UDP_CSUM = 1 5, }; struct mlx4_wqe_ctrl_seg { Index: ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c === --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/fw.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c2007-09-24 12:36:46.0 +0200 @@ -741,6 +741,9 @@ int mlx4_INIT_HCA(struct mlx4_dev *dev, MLX4_PUT(inbox, (u8) (PAGE_SHIFT - 12), INIT_HCA_UAR_PAGE_SZ_OFFSET); MLX4_PUT(inbox, param-log_uar_sz, INIT_HCA_LOG_UAR_SZ_OFFSET); + if (dev-caps.flags MLX4_DEV_CAP_FLAG_IPOIB_CSUM) +
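The mlx4 completion check above declares the checksum OK only when the CQE's IPOK bit is set and `cqe->checksum` equals 0xffff. That condition relies on the standard Internet-checksum invariant: the one's-complement sum over data that includes a correct checksum folds to 0xffff. A small illustration of that invariant (plain Python, not driver code; the 16-bit words are made-up sample data):

```python
def ones_complement_sum(words):
    """16-bit one's-complement sum with end-around carry."""
    s = 0
    for w in words:
        s += w
        s = (s & 0xffff) + (s >> 16)  # fold the carry back in
    return s

def checksum(words):
    """Internet checksum: complement of the one's-complement sum."""
    return ~ones_complement_sum(words) & 0xffff

data = [0x4500, 0x0054, 0x1c46, 0x4000, 0x4001]
csum = checksum(data)
# Summing the data together with its checksum folds to 0xffff --
# exactly the value the CQE check compares against.
print(hex(ones_complement_sum(data + [csum])))  # 0xffff
```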
[ofa-general] [PATCH 6/11] IB/ipoib: add checksum offload support
Add checksum offload support to ipoib Signed-off-by: Eli Cohen [EMAIL PROTECTED] Signed-off-by: Ali Ayub [EMAIL PROTECTED] --- Add checksum offload support to ipoib Signed-off-by: Eli Cohen [EMAIL PROTECTED] Signed-off-by: Ali Ayub [EMAIL PROTECTED] --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:09:21.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:49:00.0 +0200 @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_RX_CSUM= 11, IPOIB_MAX_BACKOFF_SECONDS = 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 12:23:26.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 13:05:21.0 +0200 @@ -1258,6 +1258,13 @@ static ssize_t set_mode(struct device *d set_bit(IPOIB_FLAG_ADMIN_CM, priv-flags); ipoib_warn(priv, enabling connected mode will cause multicast packet drops\n); + + /* clear ipv6 flag too */ + dev-features = ~NETIF_F_IP_CSUM; + + priv-tx_wr.send_flags = + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + ipoib_flush_paths(dev); return count; } @@ -1266,6 +1273,10 @@ static ssize_t set_mode(struct device *d clear_bit(IPOIB_FLAG_ADMIN_CM, priv-flags); dev-mtu = min(priv-mcast_mtu, dev-mtu); ipoib_flush_paths(dev); + + if (priv-ca-flags IB_DEVICE_IP_CSUM) + dev-features |= NETIF_F_IP_CSUM; /* ipv6 too */ + return count; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-24 11:57:02.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-24 13:03:27.0 +0200 @@ -37,6 +37,7 @@ #include linux/delay.h #include linux/dma-mapping.h +#include linux/ip.h #include rdma/ib_cache.h @@ -231,6 +232,16 @@ static void 
ipoib_ib_handle_rx_wc(struct skb-dev = dev; /* XXX get correct PACKET_ type here */ skb-pkt_type = PACKET_HOST; + + /* check rx csum */ + if (test_bit(IPOIB_FLAG_RX_CSUM, priv-flags) likely(wc-csum_ok)) { + /* Note: this is a specific requirement for Mellanox + HW but since it is the only HW currently supporting + checksum offload I put it here */ + if struct iphdr *)(skb-data))-ihl) == 5) + skb-ip_summed = CHECKSUM_UNNECESSARY; + } + netif_receive_skb(skb); repost: @@ -396,6 +407,15 @@ void ipoib_send(struct net_device *dev, return; } + if (priv-ca-flags IB_DEVICE_IP_CSUM + skb-ip_summed == CHECKSUM_PARTIAL) + priv-tx_wr.send_flags |= + IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM; + else + priv-tx_wr.send_flags = + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + + if (unlikely(post_send(priv, priv-tx_head (ipoib_sendq_size - 1), address-ah, qpn, tx_req-mapping, skb_headlen(skb), Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 12:23:00.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 13:04:52.0 +0200 @@ -1109,6 +1109,29 @@ int ipoib_add_pkey_attr(struct net_devic return device_create_file(dev-dev, dev_attr_pkey); } +static void set_tx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (test_bit(IPOIB_FLAG_ADMIN_CM, priv-flags)) + return; + + if (!(priv-ca-flags IB_DEVICE_IP_CSUM)) + return; + + dev-features |= NETIF_F_SG | NETIF_F_IP_CSUM; /* turn on ipv6 too */ +} + +static void set_rx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!(priv-ca-flags IB_DEVICE_IP_CSUM)) + return; + + set_bit(IPOIB_FLAG_RX_CSUM, priv-flags); +} + static struct net_device *ipoib_add_port(const char *format, struct
[ofa-general] [PATCH 8/11]: Add support for modifying CQ params
Add support for modifying CQ parameters to control event generation moderation. This makes it possible to control the rate of event (interrupt) generation by specifying a minimum number of CQEs and/or a minimum period of time required before an event is generated.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h
===================================================================
--- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h	2007-09-24 12:33:41.000000000 +0200
+++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h	2007-09-24 13:07:59.000000000 +0200
@@ -967,6 +967,8 @@ struct ib_device {
 						  int comp_vector,
 						  struct ib_ucontext *context,
 						  struct ib_udata *udata);
+	int                        (*modify_cq)(struct ib_cq *cq, u16 cq_count,
+						u16 cq_period);
 	int                        (*destroy_cq)(struct ib_cq *cq);
 	int                        (*resize_cq)(struct ib_cq *cq, int cqe,
 						struct ib_udata *udata);
@@ -1372,6 +1374,16 @@ struct ib_cq *ib_create_cq(struct ib_dev
 int ib_resize_cq(struct ib_cq *cq, int cqe);
 
 /**
+ * ib_modify_cq - Modifies the moderation parameters of the CQ
+ * @cq: The CQ to modify.
+ * @cq_count: number of CQEs that will trigger an event
+ * @cq_period: max period of time before triggering an event
+ */
+int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
+
+/**
  * ib_destroy_cq - Destroys the specified CQ.
  * @cq: The CQ to destroy.
  */
Index: ofa_1_3_dev_kernel/drivers/infiniband/core/verbs.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/core/verbs.c	2007-09-24 11:19:03.000000000 +0200
+++ ofa_1_3_dev_kernel/drivers/infiniband/core/verbs.c	2007-09-24 13:07:59.000000000 +0200
@@ -628,6 +628,13 @@ struct ib_cq *ib_create_cq(struct ib_dev
 }
 EXPORT_SYMBOL(ib_create_cq);
 
+int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period)
+{
+	return cq->device->modify_cq ?
+		cq->device->modify_cq(cq, cq_count, cq_period) : -ENOSYS;
+}
+EXPORT_SYMBOL(ib_modify_cq);
+
 int ib_destroy_cq(struct ib_cq *cq)
 {
 	if (atomic_read(&cq->usecnt))
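The moderation semantics this patch exposes — an event fires once either cq_count completions have accumulated or cq_period microseconds have elapsed since the last event — can be sketched as a toy model (an illustration only, not kernel code):

```python
def event_times(completions_us, cq_count, cq_period):
    """Given completion timestamps (in usec), return the times at which
    a moderated CQ would raise an event: after cq_count completions or
    cq_period usec since the last event, whichever comes first."""
    events, pending, last_event = [], 0, 0
    for t in completions_us:
        pending += 1
        if pending >= cq_count or t - last_event >= cq_period:
            events.append(t)
            pending, last_event = 0, t
    return events

# 10 completions 1 usec apart, cq_count=4, cq_period=100:
# the count threshold dominates, so an event fires on every 4th completion.
print(event_times(list(range(1, 11)), 4, 100))  # [4, 8]
```

Larger cq_count / cq_period values mean fewer interrupts per completion at the cost of added latency, which is exactly the trade-off the ethtool coalescing knobs in the later IPoIB patch let users tune.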
[ofa-general] [PATCH 9/11] mlx4_ib: add support for modifying CQ parameters
Add support for modifying CQ parameters. Signed-off-by: Eli Cohen [EMAIL PROTECTED] --- Add support for modifying CQ parameters. Signed-off-by: Eli Cohen [EMAIL PROTECTED] --- Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-24 12:36:46.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c2007-09-24 13:08:55.0 +0200 @@ -613,6 +613,7 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev-ib_dev.post_send = mlx4_ib_post_send; ibdev-ib_dev.post_recv = mlx4_ib_post_recv; ibdev-ib_dev.create_cq = mlx4_ib_create_cq; + ibdev-ib_dev.modify_cq = mlx4_ib_modify_cq; ibdev-ib_dev.destroy_cq= mlx4_ib_destroy_cq; ibdev-ib_dev.poll_cq = mlx4_ib_poll_cq; ibdev-ib_dev.req_notify_cq = mlx4_ib_arm_cq; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 12:38:29.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 13:08:55.0 +0200 @@ -91,6 +91,25 @@ static struct mlx4_cqe *next_cqe_sw(stru return get_sw_cqe(cq, cq-mcq.cons_index); } +int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) +{ + struct mlx4_ib_cq *mcq = to_mcq(cq); + struct mlx4_ib_dev *dev = to_mdev(cq-device); + struct mlx4_cq_context *context; + int err; + + context = kzalloc(sizeof *context, GFP_KERNEL); + if (!context) + return -ENOMEM; + + context-cq_period = cpu_to_be16(cq_period); + context-cq_max_count = cpu_to_be16(cq_count); + err = mlx4_cq_modify(dev-dev, mcq-mcq, context, 1); + + kfree(context); + return err; +} + struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata) Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-24 11:19:03.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-24 13:08:55.0 +0200 @@ -249,6 
+249,7 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct struct ib_udata *udata); int mlx4_ib_dereg_mr(struct ib_mr *mr); +int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period); struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata); Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c === --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c 2007-09-24 11:19:03.0 +0200 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c2007-09-24 13:08:55.0 +0200 @@ -38,33 +38,11 @@ #include linux/hardirq.h #include linux/mlx4/cmd.h +#include linux/mlx4/cq.h #include mlx4.h #include icm.h -struct mlx4_cq_context { - __be32 flags; - u16 reserved1[3]; - __be16 page_offset; - __be32 logsize_usrpage; - u8 reserved2; - u8 cq_period; - u8 reserved3; - u8 cq_max_count; - u8 reserved4[3]; - u8 comp_eqn; - u8 log_page_size; - u8 reserved5[2]; - u8 mtt_base_addr_h; - __be32 mtt_base_addr_l; - __be32 last_notified_index; - __be32 solicit_producer_index; - __be32 consumer_index; - __be32 producer_index; - u32 reserved6[2]; - __be64 db_rec_addr; -}; - #define MLX4_CQ_STATUS_OK ( 0 28) #define MLX4_CQ_STATUS_OVERFLOW( 9 28) #define MLX4_CQ_STATUS_WRITE_FAIL (10 28) @@ -121,6 +99,13 @@ static int mlx4_SW2HW_CQ(struct mlx4_dev MLX4_CMD_TIME_CLASS_A); } +static int mlx4_MODIFY_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, +int cq_num, u32 opmod) +{ + return mlx4_cmd(dev, mailbox-dma, cq_num, opmod, MLX4_CMD_MODIFY_CQ, + MLX4_CMD_TIME_CLASS_A); +} + static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
[ofa-general] [PATCH 10/11]: IB/ipoib modify cq params
Implement support for modifying the IPoIB CQ moderation parameters. This can be used to tune, at run time, the parameters controlling the event (interrupt) generation rate, reducing the overhead incurred by handling interrupts and resulting in better throughput.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-24 13:07:43.000000000 +0200
+++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-24 13:12:21.000000000 +0200
@@ -270,6 +270,13 @@ struct ipoib_cm_dev_priv {
 	struct ib_recv_wr       rx_wr;
 };
 
+struct ipoib_ethtool_st {
+	u16	rx_coalesce_usecs;
+	u16	tx_coalesce_usecs;
+	u16	rx_max_coalesced_frames;
+	u16	tx_max_coalesced_frames;
+};
+
 /*
  * Device private locking: tx_lock protects members used in TX fast
  * path (and we use LLTX so upper layers don't do extra locking).
@@ -346,6 +353,7 @@ struct ipoib_dev_priv {
 	struct dentry *mcg_dentry;
 	struct dentry *path_dentry;
 #endif
+	struct ipoib_ethtool_st etool;
 };
 
 struct ipoib_ah {
Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_etool.c	2007-09-24 13:07:43.000000000 +0200
+++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c	2007-09-24 13:09:26.000000000 +0200
@@ -44,9 +44,49 @@ static void ipoib_get_drvinfo(struct net
 	strncpy(drvinfo->driver, "ipoib", sizeof(drvinfo->driver) - 1);
 }
 
+static int ipoib_get_coalesce(struct net_device *dev,
+			      struct ethtool_coalesce *coal)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	coal->rx_coalesce_usecs = priv->etool.rx_coalesce_usecs;
+	coal->tx_coalesce_usecs = priv->etool.tx_coalesce_usecs;
+	coal->rx_max_coalesced_frames = priv->etool.rx_max_coalesced_frames;
+	coal->tx_max_coalesced_frames = priv->etool.tx_max_coalesced_frames;
+
+	return 0;
+}
+
+static int ipoib_set_coalesce(struct net_device *dev,
+			      struct ethtool_coalesce *coal)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+
+	if (coal->rx_coalesce_usecs > 0xffff ||
+	    coal->tx_coalesce_usecs > 0xffff ||
+	    coal->rx_max_coalesced_frames > 0xffff ||
+	    coal->tx_max_coalesced_frames > 0xffff)
+		return -EINVAL;
+
+	ret = ib_modify_cq(priv->cq, coal->rx_max_coalesced_frames,
+			   coal->rx_coalesce_usecs);
+	if (ret)
+		return ret;
+
+	priv->etool.rx_coalesce_usecs = coal->rx_coalesce_usecs;
+	priv->etool.tx_coalesce_usecs = coal->tx_coalesce_usecs;
+	priv->etool.rx_max_coalesced_frames = coal->rx_max_coalesced_frames;
+	priv->etool.tx_max_coalesced_frames = coal->tx_max_coalesced_frames;
+
+	return 0;
+}
+
 static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_drvinfo	= ipoib_get_drvinfo,
 	.get_tso	= ethtool_op_get_tso,
+	.get_coalesce	= ipoib_get_coalesce,
+	.set_coalesce	= ipoib_set_coalesce,
 };
 
 void ipoib_set_ethtool_ops(struct net_device *dev)
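The set_coalesce path rejects any requested value above 0xffff because ib_modify_cq takes u16 parameters. That range check can be illustrated standalone (a hypothetical helper mirroring the -EINVAL logic, not the driver function itself):

```python
EINVAL = 22  # errno value for "invalid argument"

def validate_coalesce(rx_usecs, tx_usecs, rx_frames, tx_frames):
    """Mirror the patch's check: ib_modify_cq takes u16 moderation
    parameters, so any requested value above 0xffff is rejected."""
    for v in (rx_usecs, tx_usecs, rx_frames, tx_frames):
        if v > 0xffff:
            return -EINVAL
    return 0

print(validate_coalesce(10, 10, 16, 16))    # 0 (accepted)
print(validate_coalesce(0x10000, 0, 0, 0))  # -22 (rejected)
```

In practice these values would be set with `ethtool -C` on the IPoIB interface once the patch is applied.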
[ofa-general] [PATCH 11/11]: mlx4_core: use fixed CQ moderation parameters
From: Michael S. Tsirkin [EMAIL PROTECTED]

Enable fixed interrupt coalescing for CQs in mlx4.

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]
---
Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c	2007-09-24 13:08:55.000000000 +0200
+++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c	2007-09-24 13:12:42.000000000 +0200
@@ -43,6 +43,14 @@
 #include "mlx4.h"
 #include "icm.h"
 
+static int cq_max_count = 16;
+static int cq_period = 10;
+
+module_param(cq_max_count, int, 0444);
+MODULE_PARM_DESC(cq_max_count, "number of CQEs to generate event");
+module_param(cq_period, int, 0444);
+MODULE_PARM_DESC(cq_period, "time in usec for CQ event generation");
+
 #define MLX4_CQ_STATUS_OK		( 0 << 28)
 #define MLX4_CQ_STATUS_OVERFLOW		( 9 << 28)
 #define MLX4_CQ_STATUS_WRITE_FAIL	(10 << 28)
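Given the read-only (0444) module parameters this patch adds, the fixed moderation values could be overridden at module load time roughly like so (hypothetical usage sketch; the parameter names come from the patch, and the sysfs paths assume a standard module layout):

```shell
# Set the CQ moderation parameters when loading mlx4_core.
modprobe mlx4_core cq_max_count=32 cq_period=20

# Verify the values the module picked up (0444 = readable, not writable).
cat /sys/module/mlx4_core/parameters/cq_max_count
cat /sys/module/mlx4_core/parameters/cq_period
```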
[ofa-general] [PATCH 4/11] ib_mthca: add checksum offload support
Add checksum offload support in mthca Signed-off-by: Eli Cohen [EMAIL PROTECTED] --- resending - adding the openfabrics list Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-09-24 12:34:59.0 +0200 @@ -1377,6 +1377,9 @@ int mthca_INIT_HCA(struct mthca_dev *dev MTHCA_PUT(inbox, param-uarc_base, INIT_HCA_UAR_CTX_BASE_OFFSET); } + if (dev-device_cap_flags IB_DEVICE_IP_CSUM) + *(inbox + INIT_HCA_FLAGS2_OFFSET / 4) |= cpu_to_be32(7 3); + err = mthca_cmd(dev, mailbox-dma, 0, 0, CMD_INIT_HCA, HZ, status); mthca_free_mailbox(dev, mailbox); Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.h === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.h 2007-09-24 12:34:59.0 +0200 @@ -103,6 +103,7 @@ enum { DEV_LIM_FLAG_RAW_IPV6 = 1 4, DEV_LIM_FLAG_RAW_ETHER = 1 5, DEV_LIM_FLAG_SRQ= 1 6, + DEV_LIM_FLAG_IPOIB_CSUM = 1 7, DEV_LIM_FLAG_BAD_PKEY_CNTR = 1 8, DEV_LIM_FLAG_BAD_QKEY_CNTR = 1 9, DEV_LIM_FLAG_MW = 1 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cq.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cq.c 2007-09-24 12:36:06.0 +0200 @@ -119,7 +119,8 @@ struct mthca_cqe { __be32 my_qpn; __be32 my_ee; __be32 rqpn; - __be16 sl_g_mlpath; + u8 sl_ipok; + u8 g_mlpath; __be16 rlid; __be32 imm_etype_pkey_eec; __be32 byte_cnt; @@ -498,6 +499,7 @@ static inline int mthca_poll_one(struct int is_send; int free_cqe = 1; int err = 0; + u16 checksum; cqe = next_cqe_sw(cq); if (!cqe) @@ -639,12 +641,14 @@ static inline int mthca_poll_one(struct break; } entry-slid= be16_to_cpu(cqe-rlid); - entry-sl = be16_to_cpu(cqe-sl_g_mlpath) 12; + entry-sl = cqe-sl_ipok 4; 
entry-src_qp = be32_to_cpu(cqe-rqpn) 0xff; - entry-dlid_path_bits = be16_to_cpu(cqe-sl_g_mlpath) 0x7f; + entry-dlid_path_bits = cqe-g_mlpath 0x7f; entry-pkey_index = be32_to_cpu(cqe-imm_etype_pkey_eec) 16; - entry-wc_flags |= be16_to_cpu(cqe-sl_g_mlpath) 0x80 ? - IB_WC_GRH : 0; + entry-wc_flags |= cqe-g_mlpath 0x80 ? IB_WC_GRH : 0; + checksum = (be32_to_cpu(cqe-rqpn) 24) | + ((be32_to_cpu(cqe-my_ee) 16) 0xff00); + entry-csum_ok = (cqe-sl_ipok 1 checksum == 0x); } entry-status = IB_WC_SUCCESS; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_main.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_main.c 2007-09-24 12:34:59.0 +0200 @@ -289,6 +289,10 @@ static int mthca_dev_lim(struct mthca_de if (dev_lim-flags DEV_LIM_FLAG_SRQ) mdev-mthca_flags |= MTHCA_FLAG_SRQ; + if (mthca_is_memfree(mdev)) + if (dev_lim-flags DEV_LIM_FLAG_IPOIB_CSUM) + mdev-device_cap_flags |= IB_DEVICE_IP_CSUM; + return 0; } @@ -1125,6 +1129,8 @@ static int __mthca_init_one(struct pci_d if (err) goto err_cmd; + mdev-ib_dev.flags = mdev-device_cap_flags; + if (mdev-fw_ver mthca_hca_table[hca_type].latest_fw) { mthca_warn(mdev, HCA FW version %d.%d.%03d is old (%d.%d.%03d is current).\n, (int) (mdev-fw_ver 32), (int) (mdev-fw_ver 16) 0x, Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_qp.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2007-09-24 11:19:08.0 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_qp.c 2007-09-24 12:34:59.0 +0200 @@ -2024,6 +2024,10 @@ int mthca_arbel_post_send(struct ib_qp * cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0)
Re: [ofa-general] Re: [query] Multi path discovery in openSM
On 09:05 Mon 24 Sep, Eitan Zahavi wrote:
> OpenSM will always use min-hop paths (no matter what routing algorithm

I would clarify here - for LMC > 0, OpenSM will choose different paths from among the _discovered_ shortest paths. For the min-hop algorithm those shortest paths are real min-hop paths; for Up/Down they are the min-hop paths which satisfy the Up/Down constraint.

> except maybe for LASH).

For LASH too (LASH is an abbreviation of LAyered SHortest paths). There, different layers (VLs in the case of IB) are used for credit-loop resolution. However, the current LASH implementation does not support LMC > 0.

Sasha

> If you use the default algorithms OpenSM will tend to spread traffic such
> that if you have used LMC=1 (2 LIDs per port) the two paths going to LID0
> and LID1 will go through different systems or, if not possible, through
> different nodes.
>
> EZ
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208 Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Keshetti Mahesh
> Sent: Monday, September 24, 2007 8:54 AM
> To: openIB
> Subject: [ofa-general] Re: [query] Multi path discovery in openSM
>
> If there are multiple paths between two end nodes in a network and I set
> LMC > 0, does OpenSM itself identify those routes and update the switch
> forwarding tables, or is that the duty of some other consumer of OpenSM?
> I am using the min-hop algorithm with OpenSM. In this case, if there are
> multiple paths (some of which are not min-hop paths), will OpenSM
> (LMC > 0) configure those paths?
>
> regards, Mahesh
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Quoting Steve Wise [EMAIL PROTECTED]: Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? I figured it out. done. -- MST
[ofa-general] [PATCHv3] IB/ipoib: HW checksum support
Add a module option hw_csum: when set, IPoIB will report HW checksum and S/G support, and rely on the hardware end-to-end transport checksum (ICRC) instead of software-level protocol checksums. Forwarding such packets outside the IB subnet would increase the risk of data corruption, so it is safest not to set the hw_csum flag on gateways. To reduce the chance of such routing triggering data corruption by mistake, on RX we set the skb checksum field to CHECKSUM_UNNECESSARY - this way, if such a packet ends up outside the IB network, it is detected as malformed and dropped. To enable interoperability with IEEE IPoIB, the checksum for outgoing packets is calculated in software unless the remote side advertises the hw_csum capability by setting a bit in the hardware address flags. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- This patch has to be applied on top of [PATCH 2/11] IB/ipoib: support for sending gather skbs. Updates since v2: enable interoperability with IEEE IPoIB; split out S/G support to a separate patch. Updates since v1: fixed thinko in setting header flags. When applied on top of the previously posted mlx4 patches, and with hw_csum enabled on both ends, this patch speeds up single-stream netperf bandwidth on ConnectX DDR from 1000 to 1250 MBytes/sec.
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 285c143..485f979 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_HW_CSUM= 11, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -104,9 +105,11 @@ enum { /* structs */ +#define IPOIB_HEADER_F_HWCSUM 0x1 + struct ipoib_header { __be16 proto; - u16 reserved; + __be16 flags; }; struct ipoib_pseudoheader { @@ -430,6 +478,8 @@ void ipoib_pkey_poll(struct work_struct *work); int ipoib_pkey_dev_delay_open(struct net_device *dev); void ipoib_drain_cq(struct net_device *dev); +#define IPOIB_FLAGS_HWCSUM 0x01 + #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 08b4676..a308e92 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -407,6 +407,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; int frags; + struct ipoib_header *header; ipoib_dbg_data(priv, cm recv completion: id %d, status: %d\n, wr_id, wc-status); @@ -469,7 +470,10 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc-byte_len, newskb); - skb-protocol = ((struct ipoib_header *) skb-data)-proto; + header = (struct ipoib_header *)skb-data; + skb-protocol = header-proto; + if (header-flags cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb-ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 1094488..59b1735 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -170,6 +170,7 @@ static void 
ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int wr_id = wc-wr_id ~IPOIB_OP_RECV; struct sk_buff *skb; + struct ipoib_header *header; u64 addr; ipoib_dbg_data(priv, recv completion: id %d, status: %d\n, @@ -220,7 +221,10 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put(skb, wc-byte_len); skb_pull(skb, IB_GRH_BYTES); - skb-protocol = ((struct ipoib_header *) skb-data)-proto; + header = (struct ipoib_header *)skb-data; + skb-protocol = header-proto; + if (header-flags cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb-ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..74d10e6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -55,11 +55,14 @@ MODULE_LICENSE(Dual BSD/GPL); int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; +static int ipoib_hw_csum __read_mostly = 0; module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); MODULE_PARM_DESC(send_queue_size, Number of descriptors in send queue); module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444);
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Michael S. Tsirkin wrote: Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 I figured it out. done. And I did a new build of OFED 1.2.5 daily (look at http://www.openfabrics.org/builds/connectx/latest.txt) Tziporet ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister
Roland Dreier wrote: The action in bonding to a detach of a slave is to unregister the master (see patch 10). This can't be done from the context of unregister_netdevice itself (it is protected by rtnl_lock). I'm confused. Your patch has: + ipoib_slave_detach(cpriv->dev); unregister_netdev(cpriv->dev); And ipoib_slave_detach() is: +static inline void ipoib_slave_detach(struct net_device *dev) +{ + rtnl_lock(); + netdev_slave_detach(dev); + rtnl_unlock(); +} so you are calling netdev_slave_detach() with the rtnl lock held. Why can't you make the same call from the start of unregister_netdevice()? Anyway, if the rtnl lock is a problem, can you just add the call to netdev_slave_detach() to unregister_netdev() before it takes the rtnl lock? - R. Your comment made me do a little rethinking. In bonding, the device is released by calling unregister_netdevice(), which doesn't take the rtnl_lock (unlike unregister_netdev(), which does). I guess this is what confused me into thinking that it was not possible. So I could put the detach notification in unregister_netdev() and the reaction to the notification in the bonding driver would not block. However, I looked one more time at the code of unregister_netdevice() and found out that nothing prevents calling unregister_netdevice() again when the NETDEV_GOING_DOWN notification is sent. I tried that and it works. I have a new set of patches that does not send a slave detach and I will send it soon. Thanks for the comment, Roland. It makes this patch simpler. I'd also like to give credit to Jay for the idea of using the NETDEV_GOING_DOWN notification instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. He suggested it a while ago but I wrongly thought that it wouldn't work.
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Michael S. Tsirkin wrote: Quoting Steve Wise [EMAIL PROTECTED]: Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 Go look at http://www.openfabrics.org/git/?p=ofed_1_2_5/libcxgb3.git;a=summary It has a ofed_1_2_5 branch. I believe Vlad setup the build scripts to handle this. Yes? This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? and git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 OK for that one. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Michael S. Tsirkin wrote: Quoting Steve Wise [EMAIL PROTECTED]: Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 Go look at http://www.openfabrics.org/git/?p=ofed_1_2_5/libcxgb3.git;a=summary It has a ofed_1_2_5 branch. I believe Vlad setup the build scripts to handle this. Yes? This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? and git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 OK for that one. It's OK, done for both. -- MST ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [ewg] OFED teleconference today
Jeff Squyres wrote: Friendly reminder: the OFED teleconference is several hours from now (Monday, September 24, 2007). Noon US eastern / 9am US Pacific / -=6pm Israel=- 1. Monday, Sep 24, code 210062024 (***TODAY***) Agenda: 1. Agree on the new OFED 1.3 schedule: * Feature freeze - Sep 25 * Alpha release - Oct 1 * Beta release - Oct 17 (may change according to 2.6.24 rc1 availability) * RC1 - Oct 24 * RC2 - Nov 7 * RC3 - Nov 20 * RC4 - Dec 4 * GA release - Dec 18 2. Agree to move to kernel base 2.6.24 Start with what we have now (2.6.23) and move to 2.6.24 when RC1 is available. This will reduce many patches and with the new timeline seems more appropriate. Please send if you have any other agenda items Tziporet ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
This patch series is the sixth version (see below link to V5) of the suggested changes to the bonding driver so it would be able to support non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. Patches 1-8 were originally submitted in V5 and patch 9 is an addition by Jay. Major changes from the previous version: 1. Remove the patches to net/core. Bonding will use the NETDEV_GOING_DOWN notification instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. This reduces the amount of patches from 11 to 9. Links to earlier discussion: 1. A discussion in netdev about bonding support for IPoIB. http://lists.openwall.net/netdev/2006/11/30/46 2. V5 series http://lists.openfabrics.org/pipermail/general/2007-September/040996.html ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V6 1/9] IB/ipoib: Bound the net device to the ipoib_neigh structure
IPoIB uses a two layer neighboring scheme, such that for each struct neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy which is created on demand at the tx flow by an ipoib_neigh_alloc(skb-dst-neighbour) call. When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb-dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Under this scheme, ipoib_neigh_destructor assumption that for each struct neighbour it gets, n-dev is an ipoib device and hence netdev_priv(n-dev) can be casted to struct ipoib_dev_priv is buggy. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one, when n-dev-flags has the IFF_MASTER bit set. Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/infiniband/ulp/ipoib/ipoib.h |4 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c |3 ++- 3 files changed, 20 insertions(+), 11 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h2007-09-18 17:09:26.534874404 +0200 @@ -328,6 +328,7 @@ struct ipoib_neigh { struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_headlist; }; @@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; Index: 
net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:23:54.725744661 +0200 @@ -511,7 +511,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = ipoib_neigh_alloc(skb-dst-neighbour); + neigh = ipoib_neigh_alloc(skb-dst-neighbour, skb-dev); if (!neigh) { ++priv-stats.tx_dropped; dev_kfree_skb_any(skb); @@ -830,6 +830,13 @@ static void ipoib_neigh_cleanup(struct n unsigned long flags; struct ipoib_ah *ah = NULL; + neigh = *to_ipoib_neigh(n); + if (neigh) { + priv = netdev_priv(neigh-dev); + ipoib_dbg(priv, neigh_destructor for bonding device: %s\n, + n-dev-name); + } else + return; ipoib_dbg(priv, neigh_cleanup for %06x IPOIB_GID_FMT \n, IPOIB_QPN(n-ha), @@ -837,13 +844,10 @@ static void ipoib_neigh_cleanup(struct n spin_lock_irqsave(priv-lock, flags); - neigh = *to_ipoib_neigh(n); - if (neigh) { - if (neigh-ah) - ah = neigh-ah; - list_del(neigh-list); - ipoib_neigh_free(n-dev, neigh); - } + if (neigh-ah) + ah = neigh-ah; + list_del(neigh-list); + ipoib_neigh_free(n-dev, neigh); spin_unlock_irqrestore(priv-lock, flags); @@ -851,7 +855,8 @@ static void ipoib_neigh_cleanup(struct n ipoib_put_ah(ah); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour, + struct net_device *dev) { struct ipoib_neigh *neigh; @@ -860,6 +865,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st return NULL; neigh-neighbour = neighbour; + neigh-dev = dev; *to_ipoib_neigh(neighbour) = neigh; skb_queue_head_init(neigh-queue); ipoib_cm_set(neigh, NULL); Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:09:26.536874045 
+0200 @@ -727,7 +727,8 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { -
[ofa-general] [PATCH V6 2/9] IB/ipoib: Verify address handle validity on send
When the bonding device senses a carrier loss of its active slave it replaces that slave with a new one. In between the time when the carrier of an IPoIB device goes down and the time its ipoib_neigh is destroyed, it is possible that the bonding driver will send a packet on a new slave that uses an old ipoib_neigh. This patch detects and prevents this from happening.

Signed-off-by: Moni Shoua monis at voltaire.com
Signed-off-by: Or Gerlitz ogerlitz at voltaire.com
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-18 17:09:26.535874225 +0200
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-18 17:10:22.375853147 +0200
@@ -686,9 +686,10 @@ static int ipoib_start_xmit(struct sk_bu
 			goto out;
 		}
 	} else if (neigh->ah) {
-		if (unlikely(memcmp(&neigh->dgid.raw,
+		if (unlikely((memcmp(&neigh->dgid.raw,
 					skb->dst->neighbour->ha + 4,
-					sizeof(union ib_gid)))) {
+					sizeof(union ib_gid))) ||
+			     (neigh->dev != dev))) {
 			spin_lock(&priv->lock);
 			/*
 			 * It's safe to call ipoib_put_ah() inside
[ofa-general] [PATCH V6 3/9] net/bonding: Enable bonding to enslave non ARPHRD_ETHER
This patch changes some of the bond netdevice attributes and functions to be those of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. Basically it overrides the settings done by ether_setup(), which are netdevice **type** dependent and hence might not be appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves of dissimilar ether types, as was concluded over the v1 discussion. An IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 byte IB QP (Queue Pair) number and the 16 byte IB port GID (Global ID) of the port this IPoIB device is bound to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (I have omitted here some details which are not important for the bonding RFC).

Signed-off-by: Moni Shoua monis at voltaire.com
Signed-off-by: Or Gerlitz ogerlitz at voltaire.com
---
 drivers/net/bonding/bond_main.c |   39 +++
 1 files changed, 39 insertions(+)

Index: net-2.6/drivers/net/bonding/bond_main.c
===
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-08-15 10:08:59.0 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-08-15 10:54:13.424688411 +0300
@@ -1237,6 +1237,26 @@ static int bond_compute_features(struct
 	return 0;
 }
 
+static void bond_setup_by_slave(struct net_device *bond_dev,
+				struct net_device *slave_dev)
+{
+	bond_dev->hard_header       = slave_dev->hard_header;
+	bond_dev->rebuild_header    = slave_dev->rebuild_header;
+	bond_dev->hard_header_cache = slave_dev->hard_header_cache;
+	bond_dev->header_cache_update = slave_dev->header_cache_update;
+	bond_dev->hard_header_parse = slave_dev->hard_header_parse;
+
+	bond_dev->neigh_setup       = slave_dev->neigh_setup;
+
+	bond_dev->type              = slave_dev->type;
+	bond_dev->hard_header_len   = slave_dev->hard_header_len;
+	bond_dev->addr_len          = slave_dev->addr_len;
+
+	memcpy(bond_dev->broadcast, slave_dev->broadcast,
+		slave_dev->addr_len);
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1311,6 +1331,25 @@ int bond_enslave(struct net_device *bond
 		goto err_undo_flags;
 	}
 
+	/* set bonding device ether type by slave - bonding netdevices are
+	 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
+	 * there is a need to override some of the type dependent attribs/funcs.
+	 *
+	 * bond ether type mutual exclusion - don't allow slaves of dissimilar
+	 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
+	 */
+	if (bond->slave_cnt == 0) {
+		if (slave_dev->type != ARPHRD_ETHER)
+			bond_setup_by_slave(bond_dev, slave_dev);
+	} else if (bond_dev->type != slave_dev->type) {
+		printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different "
+			"from other slaves (%d), can not enslave it.\n",
+			slave_dev->name,
+			slave_dev->type, bond_dev->type);
+		res = -EINVAL;
+		goto err_undo_flags;
+	}
+
 	if (slave_dev->set_mac_address == NULL) {
 		printk(KERN_ERR DRV_NAME
 			": %s: Error: The slave device you specified does
[ofa-general] [PATCH V6 4/9] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address()
This patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond mac address is the one of the active slave, where remote peers are notified on the mac address (neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/net/bonding/bond_main.c | 87 +++- drivers/net/bonding/bonding.h |1 2 files changed, 60 insertions(+), 28 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-08-15 10:54:13.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.971632881 +0300 @@ -1095,6 +1095,14 @@ void bond_change_active_slave(struct bon if (new_active) { bond_set_slave_active_flags(new_active); } + + /* when bonding does not set the slave MAC address, the bond MAC +* address is the one of the active slave. +*/ + if (new_active !bond-do_set_mac_addr) + memcpy(bond-dev-dev_addr, new_active-dev-dev_addr, + new_active-dev-addr_len); + bond_send_gratuitous_arp(bond); } } @@ -1351,13 +1359,22 @@ int bond_enslave(struct net_device *bond } if (slave_dev-set_mac_address == NULL) { - printk(KERN_ERR DRV_NAME - : %s: Error: The slave device you specified does - not support setting the MAC address. - Your kernel likely does not support slave - devices.\n, bond_dev-name); - res = -EOPNOTSUPP; - goto err_undo_flags; + if (bond-slave_cnt == 0) { + printk(KERN_WARNING DRV_NAME + : %s: Warning: The first slave device you + specified does not support setting the MAC + address. This bond MAC address would be that + of the active slave.\n, bond_dev-name); + bond-do_set_mac_addr = 0; + } else if (bond-do_set_mac_addr) { + printk(KERN_ERR DRV_NAME + : %s: Error: The slave device you specified + does not support setting the MAC addres,. + but this bond uses this practice. 
\n + , bond_dev-name); + res = -EOPNOTSUPP; + goto err_undo_flags; + } } new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL); @@ -1378,16 +1395,18 @@ int bond_enslave(struct net_device *bond */ memcpy(new_slave-perm_hwaddr, slave_dev-dev_addr, ETH_ALEN); - /* -* Set slave to master's mac address. The application already -* set the master's mac address to that of the first slave -*/ - memcpy(addr.sa_data, bond_dev-dev_addr, bond_dev-addr_len); - addr.sa_family = slave_dev-type; - res = dev_set_mac_address(slave_dev, addr); - if (res) { - dprintk(Error %d calling set_mac_address\n, res); - goto err_free; + if (bond-do_set_mac_addr) { + /* +* Set slave to master's mac address. The application already +* set the master's mac address to that of the first slave +*/ + memcpy(addr.sa_data, bond_dev-dev_addr, bond_dev-addr_len); + addr.sa_family = slave_dev-type; + res = dev_set_mac_address(slave_dev, addr); + if (res) { + dprintk(Error %d calling set_mac_address\n, res); + goto err_free; + } } res = netdev_set_master(slave_dev, bond_dev); @@ -1612,9 +1631,11 @@ err_close: dev_close(slave_dev); err_restore_mac: - memcpy(addr.sa_data, new_slave-perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev-type; - dev_set_mac_address(slave_dev, addr); + if (bond-do_set_mac_addr) { + memcpy(addr.sa_data, new_slave-perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev-type; + dev_set_mac_address(slave_dev, addr); + } err_free: kfree(new_slave); @@ -1792,10 +1813,12 @@ int bond_release(struct net_device *bond /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original (permanent) mac address */ - memcpy(addr.sa_data, slave-perm_hwaddr, ETH_ALEN); -
[ofa-general] [PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices
Allow enslaving devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the cleanest way to go, and it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() has set the bonding device type to be ARPHRD_ETHER and the address length to be ETH_ALEN, the net core code computes a wrong multicast link address: ip_eth_mc_map() is called, whereas for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND).

Signed-off-by: Moni Shoua monis at voltaire.com
Signed-off-by: Or Gerlitz ogerlitz at voltaire.com
---
 drivers/net/bonding/bond_main.c  |    5 +++--
 drivers/net/bonding/bond_sysfs.c |    6 ++----
 2 files changed, 5 insertions(+), 6 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-08-15 10:54:41.0 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-08-15 10:55:48.431862446 +0300
@@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond
 	/* bond must be initialized by bond_open() before enslaving */
 	if (!(bond_dev->flags & IFF_UP)) {
-		dprintk("Error, master_dev is not up\n");
-		return -EPERM;
+		printk(KERN_WARNING DRV_NAME
+			" %s: master_dev is not up in bond_enslave\n",
+			bond_dev->name);
 	}
 
 	/* already enslaved */
Index: net-2.6/drivers/net/bonding/bond_sysfs.c
===
--- net-2.6.orig/drivers/net/bonding/bond_sysfs.c	2007-08-15 10:08:58.0 +0300
+++ net-2.6/drivers/net/bonding/bond_sysfs.c	2007-08-15 10:55:48.432862269 +0300
@@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru
 	/* Quick sanity check -- is the bond interface up? */
 	if (!(bond->dev->flags & IFF_UP)) {
-		printk(KERN_ERR DRV_NAME
-			": %s: Unable to update slaves because interface is down.\n",
+		printk(KERN_WARNING DRV_NAME
+			": %s: doing slave updates when interface is down.\n",
 			bond->dev->name);
-		ret = -EPERM;
-		goto out;
 	}
 
 	/* Note: We can't hold bond->lock here, as bond_create grabs it. */
[ofa-general] [PATCH V6 6/9] net/bonding: Handle wrong assumptions that the slave is always an Ethernet device
bonding sometimes uses Ethernet constants (such as MTU and address length) which are not good when it enslaves non Ethernet devices (such as InfiniBand). Signed-off-by: Moni Shoua monis at voltaire.com --- drivers/net/bonding/bond_main.c |3 ++- drivers/net/bonding/bond_sysfs.c | 10 -- drivers/net/bonding/bonding.h|1 + 3 files changed, 11 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-09-24 12:52:33.0 +0200 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-09-24 12:57:33.411459811 +0200 @@ -1224,7 +1224,8 @@ static int bond_compute_features(struct struct slave *slave; struct net_device *bond_dev = bond-dev; unsigned long features = bond_dev-features; - unsigned short max_hard_header_len = ETH_HLEN; + unsigned short max_hard_header_len = max((u16)ETH_HLEN, + bond_dev-hard_header_len); int i; features = ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES); Index: net-2.6/drivers/net/bonding/bond_sysfs.c === --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-09-24 12:55:09.0 +0200 +++ net-2.6/drivers/net/bonding/bond_sysfs.c2007-09-24 13:00:23.752680721 +0200 @@ -260,6 +260,7 @@ static ssize_t bonding_store_slaves(stru char command[IFNAMSIZ + 1] = { 0, }; char *ifname; int i, res, found, ret = count; + u32 original_mtu; struct slave *slave; struct net_device *dev = NULL; struct bonding *bond = to_bond(d); @@ -325,6 +326,7 @@ static ssize_t bonding_store_slaves(stru } /* Set the slave's MTU to match the bond */ + original_mtu = dev-mtu; if (dev-mtu != bond-dev-mtu) { if (dev-change_mtu) { res = dev-change_mtu(dev, @@ -339,6 +341,9 @@ static ssize_t bonding_store_slaves(stru } rtnl_lock(); res = bond_enslave(bond-dev, dev); + bond_for_each_slave(bond, slave, i) + if (strnicmp(slave-dev-name, ifname, IFNAMSIZ) == 0) + slave-original_mtu = original_mtu; rtnl_unlock(); if (res) { ret = res; @@ -351,6 +356,7 @@ static ssize_t bonding_store_slaves(stru bond_for_each_slave(bond, slave, i) if 
(strnicmp(slave-dev-name, ifname, IFNAMSIZ) == 0) { dev = slave-dev; + original_mtu = slave-original_mtu; break; } if (dev) { @@ -365,9 +371,9 @@ static ssize_t bonding_store_slaves(stru } /* set the slave MTU to the default */ if (dev-change_mtu) { - dev-change_mtu(dev, 1500); + dev-change_mtu(dev, original_mtu); } else { - dev-mtu = 1500; + dev-mtu = original_mtu; } } else { Index: net-2.6/drivers/net/bonding/bonding.h === --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-09-24 12:55:09.0 +0200 +++ net-2.6/drivers/net/bonding/bonding.h 2007-09-24 12:57:33.412459636 +0200 @@ -156,6 +156,7 @@ struct slave { s8 link;/* one of BOND_LINK_ */ s8 state; /* one of BOND_STATE_ */ u32original_flags; + u32original_mtu; u32link_failure_count; u16speed; u8 duplex; ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V6 7/9] net/bonding: Delay sending of gratuitous ARP to avoid failure
Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit in dev-state field is on. This improves the chances for the arp packet to be transmitted. Signed-off-by: Moni Shoua monis at voltaire.com --- drivers/net/bonding/bond_main.c | 24 +--- drivers/net/bonding/bonding.h |1 + 2 files changed, 22 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-08-15 10:56:33.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 11:04:37.221123652 +0300 @@ -1102,8 +1102,14 @@ void bond_change_active_slave(struct bon if (new_active !bond-do_set_mac_addr) memcpy(bond-dev-dev_addr, new_active-dev-dev_addr, new_active-dev-addr_len); - - bond_send_gratuitous_arp(bond); + if (bond-curr_active_slave + test_bit(__LINK_STATE_LINKWATCH_PENDING, + bond-curr_active_slave-dev-state)) { + dprintk(delaying gratuitous arp on %s\n, + bond-curr_active_slave-dev-name); + bond-send_grat_arp = 1; + } else + bond_send_gratuitous_arp(bond); } } @@ -2083,6 +2089,17 @@ void bond_mii_monitor(struct net_device * program could monitor the link itself if needed. 
*/ + if (bond-send_grat_arp) { + if (bond-curr_active_slave test_bit(__LINK_STATE_LINKWATCH_PENDING, + bond-curr_active_slave-dev-state)) + dprintk(Needs to send gratuitous arp but not yet\n); + else { + dprintk(sending delayed gratuitous arp on on %s\n, + bond-curr_active_slave-dev-name); + bond_send_gratuitous_arp(bond); + bond-send_grat_arp = 0; + } + } read_lock(bond-curr_slave_lock); oldcurrent = bond-curr_active_slave; read_unlock(bond-curr_slave_lock); @@ -2484,7 +2501,7 @@ static void bond_send_gratuitous_arp(str if (bond-master_ip) { bond_arp_send(slave-dev, ARPOP_REPLY, bond-master_ip, - bond-master_ip, 0); + bond-master_ip, 0); } list_for_each_entry(vlan, bond-vlan_list, vlan_list) { @@ -4293,6 +4310,7 @@ static int bond_init(struct net_device * bond-current_arp_slave = NULL; bond-primary_slave = NULL; bond-dev = bond_dev; + bond-send_grat_arp = 0; INIT_LIST_HEAD(bond-vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h === --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:56:33.0 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 11:05:41.516451497 +0300 @@ -187,6 +187,7 @@ struct bonding { struct timer_list arp_timer; s8 kill_timers; s8 do_set_mac_addr; + s8 send_grat_arp; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V6 8/9] net/bonding: Destroy bonding master when last slave is gone
When bonding enslaves non-Ethernet devices it takes pointers to functions in the module that owns the slaves. In this case it becomes unsafe to keep the bonding master registered after the last slave was unenslaved, because we don't know if the pointers are still valid. Destroying the bond when slave_cnt is zero ensures that these functions are not used anymore.

Signed-off-by: Moni Shoua monis at voltaire.com
---
 drivers/net/bonding/bond_main.c  | 37 +
 drivers/net/bonding/bond_sysfs.c |  9 +
 drivers/net/bonding/bonding.h    |  3 +++
 3 files changed, 45 insertions(+), 4 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-09-24 14:01:24.055441842 +0200
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-09-24 14:05:05.658979207 +0200
@@ -1256,6 +1256,7 @@ static int bond_compute_features(struct
 static void bond_setup_by_slave(struct net_device *bond_dev,
 				struct net_device *slave_dev)
 {
+	struct bonding *bond = bond_dev->priv;
 	bond_dev->hard_header       = slave_dev->hard_header;
 	bond_dev->rebuild_header    = slave_dev->rebuild_header;
 	bond_dev->hard_header_cache = slave_dev->hard_header_cache;
@@ -1270,6 +1271,7 @@ static void bond_setup_by_slave(struct n
 	memcpy(bond_dev->broadcast, slave_dev->broadcast,
 	       slave_dev->addr_len);
+	bond->setup_by_slave = 1;
 }

 /* enslave device slave to bond device master */
@@ -1838,6 +1840,35 @@ int bond_release(struct net_device *bond
 }

 /*
+* Destroy a bonding device.
+* Must be under rtnl_lock when this function is called.
+*/
+void bond_destroy(struct bonding *bond)
+{
+	bond_deinit(bond->dev);
+	bond_destroy_sysfs_entry(bond);
+	unregister_netdevice(bond->dev);
+}
+
+/*
+* First release a slave and then destroy the bond if no more slaves are left.
+* Must be under rtnl_lock when this function is called.
+*/
+int bond_release_and_destroy(struct net_device *bond_dev,
+			     struct net_device *slave_dev)
+{
+	struct bonding *bond = bond_dev->priv;
+	int ret;
+
+	ret = bond_release(bond_dev, slave_dev);
+	if ((ret == 0) && (bond->slave_cnt == 0)) {
+		printk(KERN_INFO DRV_NAME ": %s: destroying bond %s.\n",
+		       bond_dev->name, bond_dev->name);
+		bond_destroy(bond);
+	}
+	return ret;
+}
+
+/*
 * This function releases all slaves.
 */
 static int bond_release_all(struct net_device *bond_dev)
@@ -3337,6 +3368,11 @@ static int bond_slave_netdev_event(unsig
 		 * ... Or is it this?
 		 */
 		break;
+	case NETDEV_GOING_DOWN:
+		dprintk("slave %s is going down\n", slave_dev->name);
+		if (bond->setup_by_slave)
+			bond_release_and_destroy(bond_dev, slave_dev);
+		break;
 	case NETDEV_CHANGEMTU:
 		/*
 		 * TODO: Should slaves be allowed to
@@ -4311,6 +4347,7 @@ static int bond_init(struct net_device *
 	bond->primary_slave = NULL;
 	bond->dev = bond_dev;
 	bond->send_grat_arp = 0;
+	bond->setup_by_slave = 0;
 	INIT_LIST_HEAD(&bond->vlan_list);
 	/* Initialize the device entry points */
Index: net-2.6/drivers/net/bonding/bonding.h
===
--- net-2.6.orig/drivers/net/bonding/bonding.h	2007-09-24 14:01:24.055441842 +0200
+++ net-2.6/drivers/net/bonding/bonding.h	2007-09-24 14:01:24.627340013 +0200
@@ -188,6 +188,7 @@ struct bonding {
 	s8 kill_timers;
 	s8 do_set_mac_addr;
 	s8 send_grat_arp;
+	s8 setup_by_slave;
 	struct net_device_stats stats;
 #ifdef CONFIG_PROC_FS
 	struct proc_dir_entry *proc_entry;
@@ -295,6 +296,8 @@ static inline void bond_unset_master_alb
 struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr);
 int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, struct net_device *slave_dev);
 int bond_create(char *name, struct bond_params *params, struct bonding **newbond);
+void bond_destroy(struct bonding *bond);
+int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev);
 void bond_deinit(struct net_device *bond_dev);
 int bond_create_sysfs(void);
 void bond_destroy_sysfs(void);
Index:
net-2.6/drivers/net/bonding/bond_sysfs.c === --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-09-24 14:01:23.523536550 +0200 +++ net-2.6/drivers/net/bonding/bond_sysfs.c2007-09-24 14:01:24.628339835 +0200 @@ -164,9 +164,7 @@ static ssize_t bonding_store_bonds(struc printk(KERN_INFO DRV_NAME : %s
[ofa-general] [PATCH 9/9] bonding: Optionally allow ethernet slaves to keep own MAC
Update the don't change MAC of slaves functionality added in previous changes to be a generic option, rather than something tied to IB devices, as it's occasionally useful for regular ethernet devices as well. Adds fail_over_mac option (which is automatically enabled for IB slaves), applicable only to active-backup mode. Includes documentation update. Updates bonding driver version to 3.2.0. Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] --- Documentation/networking/bonding.txt | 33 +++ drivers/net/bonding/bond_main.c | 57 + drivers/net/bonding/bond_sysfs.c | 49 + drivers/net/bonding/bonding.h|6 ++-- 4 files changed, 121 insertions(+), 24 deletions(-) diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 1da5666..1134062 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -281,6 +281,39 @@ downdelay will be rounded down to the nearest multiple. The default value is 0. +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address (the traditional behavior), or, when + enabled, change the bond's MAC address when changing the + active interface (i.e., fail over the MAC address itself). + + Fail over MAC is useful for devices that cannot ever alter + their MAC address, or for devices that refuse incoming + broadcasts with their own source MAC (which interferes with + the ARP monitor). + + The down side of fail over MAC is that every device on the + network must be updated via gratuitous ARP, vs. just updating + a switch or set of switches (which often takes place for any + traffic, not just ARP traffic, if the switch snoops incoming + traffic to update its tables) for the traditional method. If + the gratuitous ARP is lost, communication may be disrupted. 
+
+	When fail over MAC is used in conjunction with the mii monitor,
+	devices which assert link up prior to being able to actually
+	transmit and receive are particularly susceptible to loss of
+	the gratuitous ARP, and an appropriate updelay setting may be
+	required.
+
+	A value of 0 disables fail over MAC, and is the default. A
+	value of 1 enables fail over MAC. This option is enabled
+	automatically if the first slave added cannot change its MAC
+	address. This option may be modified via sysfs only when no
+	slaves are present in the bond.
+
+	This option was added in bonding version 3.2.0.
+
 lacp_rate

	Option specifying the rate in which we'll ask our link partner

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77caca3..c01ff9d 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -97,6 +97,7 @@ static char *xmit_hash_policy = NULL;
 static int arp_interval = BOND_LINK_ARP_INTERV;
 static char *arp_ip_target[BOND_MAX_ARP_TARGETS] = { NULL, };
 static char *arp_validate = NULL;
+static int fail_over_mac = 0;
 struct bond_params bonding_defaults;
 module_param(max_bonds, int, 0);
@@ -130,6 +131,8 @@ module_param_array(arp_ip_target, charp, NULL, 0);
 MODULE_PARM_DESC(arp_ip_target, "arp targets in n.n.n.n form");
 module_param(arp_validate, charp, 0);
 MODULE_PARM_DESC(arp_validate, "validate src/dst of ARP probes: none (default), active, backup or all");
+module_param(fail_over_mac, int, 0);
+MODULE_PARM_DESC(fail_over_mac, "For active-backup, do not set all slaves to the same MAC. 0 for off (default), 1 for on.");

 /*- Global variables */

@@ -1099,7 +1102,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 	/* when bonding does not set the slave MAC address, the bond MAC
 	 * address is the one of the active slave.
 	 */
-	if (new_active && !bond->do_set_mac_addr)
+	if (new_active && bond->params.fail_over_mac)
 		memcpy(bond->dev->dev_addr, new_active->dev->dev_addr,
 		       new_active->dev->addr_len);
 	if (bond->curr_active_slave
@@ -1371,16 +1374,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	if (slave_dev->set_mac_address == NULL) {
 		if (bond->slave_cnt == 0) {
 			printk(KERN_WARNING DRV_NAME
-			       ": %s: Warning: The first slave device you "
-			       "specified does not support setting the MAC "
-			       "address. This bond MAC address would be that "
-			       "of the active slave.\n", bond_dev->name);
-			bond->do_set_mac_addr = 0;
-		} else
Re: [ofa-general] Re: [ewg] OFED teleconference today
I cannot make the meeting today. I vote for the 2.6.24 base. There is still the outstanding iwarp port space issue that will need to be pulled into ofed-1.3 when it finalizes. But it's a bug fix really, so not a new feature, I guess.

Tziporet Koren wrote: Jeff Squyres wrote: Friendly reminder: the OFED teleconference is several hours from now (Monday, September 24, 2007). Noon US eastern / 9am US Pacific / -=6pm Israel=- 1. Monday, Sep 24, code 210062024 (***TODAY***)

Agenda:
1. Agree on the new OFED 1.3 schedule:
   * Feature freeze - Sep 25
   * Alpha release - Oct 1
   * Beta release - Oct 17 (may change according to 2.6.24 rc1 availability)
   * RC1 - Oct 24
   * RC2 - Nov 7
   * RC3 - Nov 20
   * RC4 - Dec 4
   * GA release - Dec 18
2. Agree to move to kernel base 2.6.24. Start with what we have now (2.6.23) and move to 2.6.24 when RC1 is available. This will reduce many patches and with the new timeline seems more appropriate.

Please send any other agenda items you have. Tziporet
[ofa-general] Re: [PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices
On Mon, 24 Sep 2007 17:37:00 +0200 Moni Shoua [EMAIL PROTECTED] wrote:

Allow enslaving devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the cleanest way to go, where it is not expected to cause instabilities.

Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() has set the bonding device type to be ARPHRD_ETHER and the address len to be ETHER_ALEN, the net core code computes a wrong multicast link address. This is because ip_eth_mc_map() is called, whereas for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND).

Signed-off-by: Moni Shoua monis at voltaire.com
Signed-off-by: Or Gerlitz ogerlitz at voltaire.com
---
 drivers/net/bonding/bond_main.c  | 5 +++--
 drivers/net/bonding/bond_sysfs.c | 6 ++
 2 files changed, 5 insertions(+), 6 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-08-15 10:54:41.0 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-08-15 10:55:48.431862446 +0300
@@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond
 	/* bond must be initialized by bond_open() before enslaving */
 	if (!(bond_dev->flags & IFF_UP)) {
-		dprintk("Error, master_dev is not up\n");
-		return -EPERM;
+		printk(KERN_WARNING DRV_NAME
+		       "%s: master_dev is not up in bond_enslave\n",
+		       bond_dev->name);
 	}

 	/* already enslaved */
Index: net-2.6/drivers/net/bonding/bond_sysfs.c
===
--- net-2.6.orig/drivers/net/bonding/bond_sysfs.c	2007-08-15 10:08:58.0 +0300
+++ net-2.6/drivers/net/bonding/bond_sysfs.c	2007-08-15 10:55:48.432862269 +0300
@@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru
 	/* Quick sanity check -- is the bond interface up?
 	 */
 	if (!(bond->dev->flags & IFF_UP)) {
-		printk(KERN_ERR DRV_NAME
-		       ": %s: Unable to update slaves because interface is down.\n",
+		printk(KERN_WARNING DRV_NAME
+		       ": %s: doing slave updates when interface is down.\n",
 		       bond->dev->name);
-		ret = -EPERM;
-		goto out;
 	}

Please get rid of the warning. Make bonding work correctly and allow enslave/remove of a device when bonding is down.
Re: [ofa-general] [BUG report / PATCH] fix race in the core multicast management
Now, in this case there was --no-- previous event; when the port was brought back online there was a PORT_ACTIVE event (it's a driver issue which we are looking at). However, from the viewpoint of the SA there was a GID out event, so the HCA port was dropped out from the multicast group and the multicast routing (spanning tree, MFTs configuration etc) was computed without this port being included. This is the ipoib logging of what happens from its perspective (I have added the event number to the port state change event print): Do you know why there wasn't some sort of port down event? node 1 - switch A - switch B - switch C - SA The host would only see port up/down events as a result of changes in the link state of the local port or of the port which is connected to it through the cable. So, if you brought the link down/up between switches A and B, node 1 wouldn't receive any events, but it would be removed from the multicast group? - Sean
Re: [ofa-general] RE: OFA website edits
Jeff Becker wrote: I'm OK with these suggestions. Please let me know what you would like implemented. Thanks. I tried changing my WEB_README, and the updates didn't show up on the download page. How often should the page be updated? - Sean
[ofa-general] RE: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch
I used git-format-patch to extract patches from this tree and add them to ofed 1.3 kernel tree. Thanks ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] rdma_cm connect / disconnect / reject race....resulting in crash....
Sean, per our discussion here's the problem description from Olaf...

We start to shut down the connection, and call rdma_destroy_qp on our cm_id. We haven't executed rdma_destroy_id yet. Now apparently a connect reject message comes in from the other host, and cma_ib_handler() is called with an event of IB_CM_REJ_RECEIVED. It calls cma_modify_qp_err, which for some odd reason tries to modify the exact same QP we just destroyed. The crash looks like this:

RDS/IB: connection request while the connection exist: 11.0.0.18, disconnecting and reconnecting
ic f7ccb800 ic->i_cm_id f7cb2a00
rdma_destroy_qp(f7cb2a00)
Unable to handle kernel NULL pointer dereference at virtual address 00f8
EIP is at ib_modify_qp+0x5/0xe [ib_core]
Stack: f7cb2a00 f8ac36af 0006 1a0f4680 f6742e7c c011cc85 c495ede0 f671ce30 c495ede0 c495ede0 0086 c495ede0 c011d1a3 f671ce30 f671ce30 0002 c4966de0 0002 c495ede0 0001 0001
Call Trace:
 [f8ac36af] cma_modify_qp_err+0x22/0x2d [rdma_cm]
 [...]
 [f8ac3371] cma_disable_remove+0x35/0x3b [rdma_cm]
 [f8ac3e31] cma_ib_handler+0xe6/0x158 [rdma_cm]
 [f89150f7] cm_process_work+0x4a/0x80 [ib_cm]
 [f8916c33] cm_rej_handler+0xd3/0x114 [ib_cm]

It dies trying to dereference qp->device->modify_qp because qp->device is NULL. If you check the stack, you'll see the exact same cm_id that we just called rdma_destroy_qp() on (note that the printk(rdma_destroy_qp) that appears above comes *after* the call itself, so by the time this is printed, the QP is dead already).

That's easy, I thought. Obviously, rdma_destroy_qp just forgets to clear cm_id->qp after destroying the queue pair:

void rdma_destroy_qp(struct rdma_cm_id *id)
{
	ib_destroy_qp(id->qp);
+	id->qp = NULL;
}

But that didn't really fix it. So either there's something else going on which I don't grok yet, or this is just another case of bad locking.
Re: [ofa-general] RE: OFA website edits
Hi Sean. I just talked to Jeff Scott about this, as he had announced the new downloads page. It turns out that the new page does not use my php page that automatically updates, but rather took a snapshot of the page state. That's why your update doesn't show up. He said he would try to fix this. -jeff On 9/24/07, Sean Hefty [EMAIL PROTECTED] wrote: Jeff Becker wrote: I'm OK with these suggestions. Please let me know what you would like implemented. Thanks. I tried changing my WEB_README, and the updates didn't show up on the download page. How often should be the page be updated? - Sean ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
I have submitted this before; but here it is again. Against net-2.6.24 from yesterday for this and all following patches. cheers, jamal Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? I apologize for not having working patches right now, but I am working on them slowly as I have some blips of spare time. Thanks, -PJ Waskiewicz ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why? Thanks, Glenn.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Steve Wise
Sent: Sunday, September 23, 2007 3:37 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; general@lists.openfabrics.org
Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

Version 3:
- don't use list_del_init() where list_del() is sufficient.

Version 2:
- added a per-device mutex for the address and listening endpoints lists.
- wait for all replies if sending multiple passive_open requests to rnic.
- log warning if no addresses are available when a listen is issued.
- tested

---

Design:

The sysadmin creates, for iwarp use only, alias interfaces of the form devname:iw* where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with iw. The iw immediately after the ':' is the key used by the iw_cxgb3 driver.

EG:
ifconfig eth0 192.168.70.123 up
ifconfig eth0:iw1 192.168.71.123 up
ifconfig eth0:iw2 192.168.72.123 up

In the above example, 192.168.70/24 is for TCP traffic, while 192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use. The rdma-only interface must be on its own IP subnet. This allows routing all rdma traffic onto this interface. The iWARP driver must translate all listens on address 0.0.0.0 to the set of rdma-only ip addresses for the device in question. This prevents incoming connect requests to the TCP ip addresses from going up the rdma stack.

Implementation Details:
- The iw_cxgb3 driver registers for inetaddr events via register_inetaddr_notifier().
This allows tracking the iwarp-only addresses/subnets as they get added and deleted. The iwarp driver maintains a list of the current iwarp-only addresses. - The iw_cxgb3 driver builds the list of iwarp-only addresses for its devices at module insert time. This is needed because the inetaddr notifier callbacks don't replay address-add events when someone registers. So the driver must build the initial list at module load time. - When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver must translate that into a set of listens on the iwarp-only addresses. This is implemented by maintaining a list of stid/addr entries per listening endpoint. - When a new iwarp-only address is added or removed, the iw_cxgb3 driver must traverse the set of listening endpoints and update them accordingly. This allows an application to bind to 0.0.0.0 prior to the iwarp-only interfaces being configured. It also allows changing the iwarp-only set of addresses and getting the expected behavior for apps already bound to 0.0.0.0. This is done by maintaining a list of listening endpoints off the device struct. - The address list, the listening endpoint list, and each list of stid/addrs in use per listening endpoint are all protected via a mutex per iw_cxgb3 device. 
Signed-off-by: Steve Wise [EMAIL PROTECTED]
---
 drivers/infiniband/hw/cxgb3/iwch.c    | 125 ++++
 drivers/infiniband/hw/cxgb3/iwch.h    |  11 +
 drivers/infiniband/hw/cxgb3/iwch_cm.c | 259 +++--
 drivers/infiniband/hw/cxgb3/iwch_cm.h |  15 ++
 4 files changed, 360 insertions(+), 50 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
index 0315c9d..d81d46e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -63,6 +63,123 @@ struct cxgb3_client t3c_client = {
 static LIST_HEAD(dev_list);
 static DEFINE_MUTEX(dev_mutex);

+static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
+{
+	struct iwch_addrlist *addr;
+
+	addr = kmalloc(sizeof *addr, GFP_KERNEL);
+	if (!addr) {
+		printk(KERN_ERR MOD "%s - failed to alloc memory!\n",
+		       __FUNCTION__);
+		return;
+	}
+	addr->ifa = ifa;
+	mutex_lock(&rnicp->mutex);
+	list_add_tail(&addr->entry, &rnicp->addrlist);
+	mutex_unlock(&rnicp->mutex);
+}
+
+static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
+{
+	struct iwch_addrlist *addr, *tmp;
+
+	mutex_lock(&rnicp->mutex);
+	list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) {
+		if (addr->ifa == ifa) {
+			list_del(&addr->entry);
+			kfree(addr);
+			goto out;
+		}
+	}
+out:
+	mutex_unlock(&rnicp->mutex);
+}
+
+static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev)
+{
+	int i;
+
+	for (i = 0; i < rnicp->rdev.port_info.nports; i++)
Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why? Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, iSER, etc). Does the NetEffect NIC have the same issue as cxgb3 here? What are your thoughts on how to handle this? - R.
[ofa-general] [PATCH] rdma/cm: add locking around QP accesses
If a user allocates a QP on an rdma_cm_id, the rdma_cm will automatically transition the QP through its states (RTR, RTS, error, etc.). While the QP state transitions are occurring, the QP itself must remain valid. Provide locking around the QP pointer to prevent its destruction while accessing the pointer.

This fixes an issue reported by Olaf Kirch from Oracle that resulted in a system crash:

An incoming connection arrives and we decide to tear down the nascent connection. The remote end decides to do the same. We start to shut down the connection, and call rdma_destroy_qp on our cm_id. ... Now apparently a 'connect reject' message comes in from the other host, and cma_ib_handler() is called with an event of IB_CM_REJ_RECEIVED. It calls cma_modify_qp_err, which for some odd reason tries to modify the exact same QP we just destroyed.

Signed-off-by: Sean Hefty [EMAIL PROTECTED]
---
Rick, can you please test this patch and let me know if it fixes your problem?

 drivers/infiniband/core/cma.c | 90 +++--
 1 files changed, 60 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9ffb998..c6a6dba 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -120,6 +120,8 @@ struct rdma_id_private {
 	enum cma_state state;
 	spinlock_t lock;
+	struct mutex qp_mutex;
+
 	struct completion comp;
 	atomic_t refcount;
 	wait_queue_head_t wait_remove;
@@ -387,6 +389,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 	id_priv->id.event_handler = event_handler;
 	id_priv->id.ps = ps;
 	spin_lock_init(&id_priv->lock);
+	mutex_init(&id_priv->qp_mutex);
 	init_completion(&id_priv->comp);
 	atomic_set(&id_priv->refcount, 1);
 	init_waitqueue_head(&id_priv->wait_remove);
@@ -472,61 +475,86 @@ EXPORT_SYMBOL(rdma_create_qp);

 void rdma_destroy_qp(struct rdma_cm_id *id)
 {
-	ib_destroy_qp(id->qp);
+	struct rdma_id_private *id_priv;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	mutex_lock(&id_priv->qp_mutex);
+	ib_destroy_qp(id_priv->id.qp);
+	id_priv->id.qp = NULL;
+	mutex_unlock(&id_priv->qp_mutex);
 }
 EXPORT_SYMBOL(rdma_destroy_qp);

-static int cma_modify_qp_rtr(struct rdma_cm_id *id)
+static int cma_modify_qp_rtr(struct rdma_id_private *id_priv)
 {
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;

-	if (!id->qp)
-		return 0;
+	mutex_lock(&id_priv->qp_mutex);
+	if (!id_priv->id.qp) {
+		ret = 0;
+		goto out;
+	}

 	/* Need to update QP attributes from default values. */
 	qp_attr.qp_state = IB_QPS_INIT;
-	ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask);
+	ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask);
 	if (ret)
-		return ret;
+		goto out;

-	ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask);
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
 	if (ret)
-		return ret;
+		goto out;

 	qp_attr.qp_state = IB_QPS_RTR;
-	ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask);
+	ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask);
 	if (ret)
-		return ret;
+		goto out;

-	return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask);
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
+out:
+	mutex_unlock(&id_priv->qp_mutex);
+	return ret;
 }

-static int cma_modify_qp_rts(struct rdma_cm_id *id)
+static int cma_modify_qp_rts(struct rdma_id_private *id_priv)
 {
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;

-	if (!id->qp)
-		return 0;
+	mutex_lock(&id_priv->qp_mutex);
+	if (!id_priv->id.qp) {
+		ret = 0;
+		goto out;
+	}

 	qp_attr.qp_state = IB_QPS_RTS;
-	ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask);
+	ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask);
 	if (ret)
-		return ret;
+		goto out;

-	return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask);
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
+out:
+	mutex_unlock(&id_priv->qp_mutex);
+	return ret;
 }

-static int cma_modify_qp_err(struct rdma_cm_id *id)
+static int cma_modify_qp_err(struct rdma_id_private *id_priv)
 {
 	struct ib_qp_attr qp_attr;
+	int ret;

-	if (!id->qp)
-		return 0;
+	mutex_lock(&id_priv->qp_mutex);
+	if (!id_priv->id.qp) {
+		ret = 0;
+		goto out;
+	}

 	qp_attr.qp_state = IB_QPS_ERR;
-	return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE);
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr,
Re: [ofa-general] Re: A question about rdma_get_cm_event
Note that the private data length _is_ correct for iwarp. So the man pages should mention that this is an IB-only issue maybe? And maybe indicate that transport-independent applications should not rely on the length... I modified the man pages to describe private_data_len as: Specifies the size of the user-controlled data buffer. Note that the actual amount of data transferred to the remote side is transport dependent and may be larger than that requested. These changes have been pushed into my git tree. - Sean ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH] osm/osm_sa_path_record: trivial cosmetic change
Trivial fix in osm_sa_path_record.c

Signed-off-by: Yevgeny Kliteynik [EMAIL PROTECTED]
---
 opensm/opensm/osm_sa_path_record.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c
index 3b183d9..ce75ec8 100644
--- a/opensm/opensm/osm_sa_path_record.c
+++ b/opensm/opensm/osm_sa_path_record.c
@@ -723,7 +723,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv,
 	if (pkey) {
 		p_prtn =
 		    (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl,
-					       pkey & cl_ntoh16((uint16_t) ~
+					       pkey & cl_hton16((uint16_t) ~
 							       0x8000));
 		if (p_prtn ==
 		    (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl))
-- 
1.5.1.4
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfacesto avoid 4-tuple conflicts.
On Mon, 2007-09-24 at 16:30 -0500, Glenn Grundstrom wrote: -Original Message- From: Roland Dreier [mailto:[EMAIL PROTECTED] Sent: Monday, September 24, 2007 2:33 PM To: Glenn Grundstrom Cc: Steve Wise; [EMAIL PROTECTED]; general@lists.openfabrics.org Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts. I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why? Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, iSER, etc). The kernel apps could open a Linux tcp socket and create an RDMA socket connection. Both calls are standard Linux kernel architected routines. This approach was NAK'd by David Miller and others... Doesn't NFSoRDMA already open a TCP socket and another for RDMA traffic (ports 2049 and 2050 if I remember correctly)? The NFS RDMA transport driver does not open a socket for the RDMA connection. It uses a different port in order to allow both TCP and RDMA mounts to the same filer. I currently don't know if iSER, RDS, etc. already do the same thing, but if they don't, they probably could very easily. Woe be to those who do so... Does the NetEffect NIC have the same issue as cxgb3 here? What are your thoughts on how to handle this? Yes, the NetEffect RNIC will have the same issue as Chelsio. And all future RNICs which support a unified tcp address with Linux will as well. Steve has put a lot of thought and energy into the problem, but I don't think users and admins will be very happy with us in the long run. Agreed. In summary, short of having the rdma_cm share kernel port space, I'd like to see the equivalent in userspace and have the kernel apps handle the issue in a similar way as described above.
There are a few technical issues to work through (like passing the userspace IP address to the kernel). This just moves the socket creation to code that is outside the purview of the kernel maintainers. The exchanging of the 4-tuple created with the kernel module, however, is back in the kernel and in the maintainer's control and responsibility. In my view anything like this will be viewed as an attempt to sneak code into the kernel that the maintainer has already vehemently rejected. This will make people angry and damage the cooperative working relationship that we are trying to build. But I think we can solve that just like other information that gets passed from user into the IB/RDMA kernel modules. Sharing the IP 4-tuple space cooperatively with the core in any fashion has been nak'd. Without this cooperation, the options we've been able to come up with are administrative/policy based approaches. Any ideas you have along these lines are welcome. Tom Glenn. - R.
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 12:12 -0700, Waskiewicz Jr, Peter P wrote: Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? The one thing that seems obvious is to use dev->hard_prep_xmit() in the patches i posted to select the xmit ring. You should be able to figure out the xmit ring without holding any lock. I lost track of how/where things went since the last discussion; so i need to wrap my mind around it to make sensible suggestions - I know the core patches are in the kernel but haven't paid attention to details, and if you look at my second patch you'd see a comment in dev_batch_xmit() which says i need to scrutinize multiqueue more. cheers, jamal
[ofa-general] Re: [PATCHES] TX batching
jamal wrote: On Mon, 2007-24-09 at 00:00 -0700, Kok, Auke wrote: that's bad to begin with :) - please send those separately so I can fasttrack them into e1000e and e1000 where applicable. I've been CCing you ;-> Most of the changes are readability and reusability with the batching. But yes, I'm very inclined to merge more features into e1000e than e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like tx batching would absolutely be interesting. I looked at the e1000e and it is very close to e1000, so I should be able to move the changes easily. Most importantly, can I kill LLTX? For tx batching, we have to wait to see how Dave wants to move forward; I will have the patches but it is not something you need to push until we see where that is going. hmm, I thought I already removed that, but now I see some remnants of it. By all means, please send a separate patch for that! Auke ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [DOC] Net batching driver howto
I have updated the driver howto to match the patches I posted yesterday. Attached. cheers, jamal

Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices.

1.0 Netdevice Pre-requisites
--
For hardware based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e. having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case.

2.0 What is new in the driver API
---
There are 3 new methods and one new variable introduced. These are:
1) dev->hard_prep_xmit()
2) dev->hard_end_xmit()
3) dev->hard_batch_xmit()
4) dev->xmit_win

2.1 Using Core driver changes
-
To provide context, let's look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell the DMA engine to chew on, tx completion interrupts, etc.
[For code cleanliness/readability sake, regardless of this work, one should break the dev->hard_start_xmit() into those 4 functions anyways.]
A driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)
Note: There are drivers which may not need to support either of the two methods (example: the tun driver I patched), so the two methods are essentially optional.

2.1.1 Theory of operation
--
The core will first do the packet formatting by invoking your supplied dev->hard_prep_xmit() method. It will then pass you the packets via your dev->hard_start_xmit() method, for as many packets as you have advertised (via dev->xmit_win) you can consume. 
Lastly it will invoke your dev->hard_end_xmit() when it completes passing you all the packets queued for you.

2.1.1.1 Locking rules
-
dev->hard_prep_xmit() is invoked without holding any tx lock, but the rest are under TX_LOCK(). So you have to ensure that whatever you put in dev->hard_prep_xmit() doesn't require locking.

2.1.1.2 The slippery LLTX
-
LLTX drivers present a challenge in that we have to introduce a deviation from the norm and require the ->hard_batch_xmit() method. An LLTX driver presents us with ->hard_batch_xmit(), to which we pass a list of packets in a dev->blist skb queue. It is then the responsibility of the ->hard_batch_xmit() to exercise steps #b and #c for all packets passed in the dev->blist. Steps #a and #d are done by the core, should you register presence of dev->hard_prep_xmit() and dev->hard_end_xmit() in your setup.

2.1.1.3 xmit_win
The dev->xmit_win variable is set by the driver to tell us how much space it has in its rings/queues. dev->xmit_win is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, whenever it invokes netif_wake_queue, how much space it has for descriptors by setting this variable.

3.0 Driver Essentials
-
The typical driver tx state machine is:

-1- Core sends packets
    -> Driver puts packet onto hardware queue
    -> if hardware queue is full, netif_stop_queue(dev)
-2- Core stops sending because of netif_stop_queue(dev)
    .. time passes ..
-3- Driver has transmitted packets, opens up tx path by invoking netif_wake_queue(dev)
-1- Cycle repeats and core sends more packets (step 1).

3.1 Driver pre-requisite
--
This is _a very important_ requirement in making batching useful. The pre-requisite for batching changes is that the driver should provide a low threshold to open up the tx path. 
Drivers such as tg3 and e1000 already do this. Before you invoke netif_wake_queue(dev) you check that a threshold of free space has been reached for inserting new packets. Here's an example of how I added it to the tun driver. Observe the setting of dev->xmit_win.
---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
u32 t = skb_queue_len(&tun->readq);
if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
    tun->dev->xmit_win = tun->dev->tx_queue_len;
    netif_wake_queue(tun->dev);
}
---
Here's how the batching e1000 driver does it:
--
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
             E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
    if (netif_queue_stopped(netdev)) {
        int rspace = E1000_DESC_UNUSED(tx_ring) -
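The low-threshold wake logic in the two fragments above can be modeled in plain C. This is a userspace sketch, not driver code: tx_should_wake is a hypothetical helper, and TX_WAKE_THRESHOLD here just mirrors the name from the e1000 fragment.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_WAKE_THRESHOLD 32  /* illustrative low-water mark */

/* Decide whether the driver should reopen the tx path and how big a
 * batching window (xmit_win) to advertise. Mirrors the tg3/e1000 idea:
 * only wake the queue once a threshold of free descriptors is reached,
 * and advertise that free space so the core can batch up to it. */
static bool tx_should_wake(unsigned int free_desc, bool queue_stopped,
                           unsigned int *xmit_win)
{
    if (!queue_stopped || free_desc < TX_WAKE_THRESHOLD)
        return false;
    *xmit_win = free_desc;  /* tell the core how many packets we can take */
    return true;
}
```

The point of the threshold is to avoid waking the queue for one free slot and immediately stopping it again, which would defeat batching.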
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
The one thing that seems obvious is to use dev->hard_prep_xmit() in the patches I posted to select the xmit ring. You should be able to figure out the tx ring without holding any lock. I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. I lost track of how/where things went since the last discussion; so I need to wrap my mind around it to make sensible suggestions - I know the core patches are in the kernel but haven't paid attention to details, and if you look at my second patch you'd see a comment in dev_batch_xmit() which says I need to scrutinize multiqueue more. No worries. I'll try to get things together on my end and provide some patches to add a per-queue lock. In the meantime, I'll take a much closer look at the batching code, since I stopped looking at the patches in-depth about a month ago. :-( Thanks, -PJ Waskiewicz ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
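The reason ring selection can happen without a lock is that it only reads per-packet fields and stores the result in the packet itself; only the later enqueue, under a lock, acts on it. A minimal sketch of such a lock-free selector (all names hypothetical, loosely modeled on how skb->queue_mapping gets picked from flow fields):

```c
/* Pick a tx ring from immutable flow fields, without taking any lock.
 * Safe to call concurrently: it only reads its arguments, so the
 * caller can stash the result (e.g. in skb->queue_mapping) before
 * taking the per-queue lock for the actual enqueue. */
static unsigned int select_tx_ring(unsigned int src_ip, unsigned int dst_ip,
                                   unsigned short src_port,
                                   unsigned short dst_port,
                                   unsigned int num_rings)
{
    unsigned int hash = src_ip ^ dst_ip ^
                        (((unsigned int)src_port << 16) | dst_port);
    hash ^= hash >> 16;  /* fold the high bits down */
    hash ^= hash >> 8;
    return hash % num_rings;
}
```

A flow-hash like this keeps packets of one flow on one ring, which avoids reordering without any cross-CPU coordination.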
[ofa-general] Atomic operation question.
Hi, I have a question about atomic operations. If incoming atomic operations arrive on both ports of the HCA, can they work correctly? Thanks. --CQ ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? What is the policy for how the qdisc queues are locked/mapped to tx rings? cheers, jamal ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than making the actual dequeue efficient, though. Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? What is the policy for how the qdisc queues are locked/mapped to tx rings? The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued. The skb->queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy. And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now). I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. 
Thanks, -PJ Waskiewicz ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
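The per-queue trylock dequeue described above can be sketched in userspace C, with a pthread mutex standing in for the kernel spinlock (all names hypothetical; the real change would live in the netdev subqueue struct and sch_prio):

```c
#include <pthread.h>
#include <stdbool.h>

#define NUM_TX_QUEUES 4

/* One lock per hardware tx queue, as in the proposed subqueue struct. */
struct subqueue {
    pthread_mutex_t queue_lock;
    int pending;               /* packets waiting in this band */
};

static struct subqueue queues[NUM_TX_QUEUES];

static void queues_init(void)
{
    for (int i = 0; i < NUM_TX_QUEUES; i++) {
        pthread_mutex_init(&queues[i].queue_lock, NULL);
        queues[i].pending = 0;
    }
}

/* Try to dequeue one packet from queue q. Uses trylock so a dequeue
 * can move on (return false) instead of spinning when the queue is
 * busy with an enqueue or another dequeue, mirroring the proposed
 * spin_trylock() policy. */
static bool try_dequeue(int q)
{
    struct subqueue *sq = &queues[q];

    if (pthread_mutex_trylock(&sq->queue_lock) != 0)
        return false;          /* busy: caller can try another band */
    bool got = sq->pending > 0;
    if (got)
        sq->pending--;
    pthread_mutex_unlock(&sq->queue_lock);
    return got;
}
```

The trylock is what breaks the single heavily-contended queue_lock apart: a busy band never blocks progress on the others.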
Re: [ofa-general] Re: [PATCH] uDAPL 2.0 mods to co-exist with uDAPL 1.2
James Lentini wrote: Comments below: - +# version-info current:revision:age What does this comment do? just a comment regarding revisioning. # -# This example shows netdev name, enabling administrator to use same copy across cluster +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding The previous line is TODO, right? I'd suggest annotating it with that text to make it clear to users. ok --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -44,7 +44,7 @@ #include inttypes.h #ifndef DAPL_PROVIDER -#define DAPL_PROVIDER OpenIB-cma +#define DAPL_PROVIDER OpenIB-2-cma Should we update OpenIB to ofa? Obviously, this isn't necessary as part of this change I didn't want to change the 1.2 names for compatibility reasons but for 2.0 we could move to ofa names for both libraries and provider names. For example, libdaplcma.so becomes libdaplofa.so, OpenIB-cma becomes ofa. For example dat.conf 2.0 entries would look like this: ofa u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 ib0 0 ofa-1 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 ib1 0 ofa-2 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 ib2 0 ofa-3 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 ib3 0 ofa-bond u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 bond0 0 Is that what you had in mind? -arlin ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 24 Sep 2007 16:47:06 -0700 Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote: On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than making the actual dequeue efficient, though. Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? What is the policy for how the qdisc queues are locked/mapped to tx rings? The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued. The skb->queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy. And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now). 
I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz - Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. -- Stephen Hemminger [EMAIL PROTECTED] ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCHES] TX batching
jamal wrote: If the intel folks will accept the patch i'd really like to kill the e1000 LLTX interface. If I understood DaveM correctly, it is sounding like we want to deprecate all use of LLTX on real hardware? If so, several such projects might be considered, as well as possibly simplifying the TX batching work. Also, WRT e1000 specifically, I was hoping to minimize changes and focus people on e1000e. e1000e replaces (deprecates) large portions of e1000, namely the support for modern PCI Express chips. When e1000e has proven itself in the field, we can potentially look at several e1000 simplifications, during the large scale code removal that becomes possible. Jeff ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz - Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. That seems quite reasonable. I will certainly see what I can do. Thanks Stephen, -PJ Waskiewicz ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general