Re: [PATCH] IB/IPoIB: Fix kernel panic on multicast flow

2016-01-07 Thread Christoph Lameter
On Thu, 7 Jan 2016, Erez Shitrit wrote:

> ipoib_mcast_restart_task calls ipoib_mcast_remove_list with the
> parameter mcast->dev. That mcast is a temporary (used as an iterator)
> variable that may be uninitialized.
> There is no need to send the variable dev to the function, as each mcast
> has its dev as a member in the mcast struct.

Reviewed-by: Christoph Lameter <c...@linux.com>
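
The point of the fix can be illustrated with a small userspace sketch. It assumes simplified stand-ins (struct device, struct mcast_entry and remove_all are hypothetical names, not the IPoIB types): because every entry carries its own device pointer, the removal helper needs no separate device argument, and an empty list is handled without reading a loop iterator that was never assigned.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and struct ipoib_mcast. */
struct device {
    int id;
};

struct mcast_entry {
    struct device *dev;           /* each entry remembers its own device */
    struct mcast_entry *next;
};

/*
 * Walk the list and use entry->dev for each element instead of a device
 * passed in by the caller. Returns the number of entries visited and
 * records the id of the last device seen.
 */
static int remove_all(struct mcast_entry *head, int *last_dev_id)
{
    int removed = 0;

    for (struct mcast_entry *e = head; e != NULL; e = e->next) {
        *last_dev_id = e->dev->id;
        removed++;
    }
    return removed;
}
```

With an empty list the loop body never runs, so nothing is dereferenced.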
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/sysfs: Fix sparse warning on attr_id

2016-01-04 Thread Christoph Lameter
On Sun, 3 Jan 2016, ira.we...@intel.com wrote:

> Attributed ID was declared as an int while the value should really be big
> endian 16.

Reviewed-by: Christoph Lameter <c...@linux.com>
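
As a rough illustration of why the type matters, the following userspace sketch stores a 16 bit attribute ID in big endian (wire) order by hand. The helper names put_be16 and get_be16 are made up for this example; in the kernel the conversion is done with cpu_to_be16() and be16_to_cpu() on a __be16 value.

```c
#include <stdint.h>

/* Store a 16 bit value in big endian byte order: high byte first. */
static void put_be16(uint8_t out[2], uint16_t host_value)
{
    out[0] = (uint8_t)(host_value >> 8);
    out[1] = (uint8_t)(host_value & 0xff);
}

/* Read a big endian 16 bit value back into host order. */
static uint16_t get_be16(const uint8_t in[2])
{
    return (uint16_t)((in[0] << 8) | in[1]);
}
```

Keeping the value in a dedicated big endian type lets sparse flag places where it is mixed with host-order integers.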



Re: [PATCH v2] IB/core: sysfs.c: Fix PerfMgt ClassPortInfo handling

2015-12-29 Thread Christoph Lameter

Reviewed-by: Christoph Lameter <c...@linux.com>



Re: [PATCH 3/3] Display extended counter set if available

2015-12-21 Thread Christoph Lameter
On Mon, 21 Dec 2015, Hal Rosenstock wrote:

> > Don't we need to change all the sysfs_remove_groups to use
> > get_counter_table as well?
>
> Looks like it to me too. Good catch.

Fix follows:

From: Christoph Lameter <c...@linux.com>
Subject: Fix sysfs entry removal by storing the table format in pma_table

Store the table being used in the ib_port structure and use it when sysfs
entries have to be removed.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -47,6 +47,7 @@ struct ib_port {
struct attribute_group gid_group;
struct attribute_group pkey_group;
u8 port_num;
+   struct attribute_group *pma_table;
 };

 struct port_attribute {
@@ -651,7 +652,8 @@ static int add_port(struct ib_device *de
return ret;
}

-   ret = sysfs_create_group(&p->kobj, get_counter_table(device));
+   p->pma_table = get_counter_table(device);
+   ret = sysfs_create_group(&p->kobj, p->pma_table);
if (ret)
goto err_put;

@@ -710,7 +712,7 @@ err_free_gid:
p->gid_group.attrs = NULL;

 err_remove_pma:
-   sysfs_remove_group(&p->kobj, &pma_group);
+   sysfs_remove_group(&p->kobj, p->pma_table);

 err_put:
kobject_put(&p->kobj);
@@ -923,7 +925,7 @@ static void free_port_list_attributes(st
list_for_each_entry_safe(p, t, &device->port_list, entry) {
struct ib_port *port = container_of(p, struct ib_port, kobj);
list_del(&p->entry);
-   sysfs_remove_group(p, &pma_group);
+   sysfs_remove_group(p, port->pma_table);
sysfs_remove_group(p, &port->pkey_group);
sysfs_remove_group(p, &port->gid_group);
kobject_put(p);
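
The idea behind the fix, remembering which table was used at creation so that the same one is used at removal, can be sketched in plain C as follows. All names here (struct table, struct port, port_create, port_destroy) are hypothetical stand-ins and not the sysfs API.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct attribute_group. */
struct table {
    const char *name;
};

static struct table basic_table = { "basic" };
static struct table ext_table   = { "extended" };

struct port {
    struct table *pma_table;      /* table chosen at creation time */
};

/* Pick a table based on a capability flag (stand-in for get_counter_table). */
static struct table *pick_table(int has_ext)
{
    return has_ext ? &ext_table : &basic_table;
}

static void port_create(struct port *p, int has_ext)
{
    /* Remember the choice so teardown does not have to guess. */
    p->pma_table = pick_table(has_ext);
}

/* Returns the table whose entries must be removed, then forgets it. */
static struct table *port_destroy(struct port *p)
{
    struct table *t = p->pma_table;

    p->pma_table = NULL;
    return t;
}
```

Storing the pointer avoids a second capability query at teardown, which could otherwise return a different answer than the one used at creation.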


[PATCH 3/3] Display extended counter set if available

2015-12-21 Thread Christoph Lameter
V2->V3: Add check for NOIETF mode and create special table
  for that case.

Check if the extended counters are available and if so
create the proper extended and additional counters.

Reviewed-by: Hal Rosenstock <h...@mellanox.com>
Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/core/sysfs.c | 104 +++-
 include/rdma/ib_pma.h   |   1 +
 2 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 34dcc23..b179fca 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -320,6 +320,13 @@ struct port_table_attribute port_pma_attr_##_name = {  
\
.attr_id = IB_PMA_PORT_COUNTERS ,   \
 }
 
+#define PORT_PMA_ATTR_EXT(_name, _width, _offset)  \
+struct port_table_attribute port_pma_attr_ext_##_name = {  \
+   .attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
+   .index = (_offset) | ((_width) << 16),  \
+   .attr_id = IB_PMA_PORT_COUNTERS_EXT ,   \
+}
+
 /*
  * Get a Perfmgmt MAD block of data.
  * Returns error code or the number of bytes retrieved.
@@ -400,6 +407,11 @@ static ssize_t show_pma_counter(struct ib_port *p, struct 
port_attribute *attr,
ret = sprintf(buf, "%u\n",
  be32_to_cpup((__be32 *)data));
break;
+   case 64:
+   ret = sprintf(buf, "%llu\n",
+   be64_to_cpup((__be64 *)data));
+   break;
+
default:
ret = 0;
}
@@ -424,6 +436,18 @@ static PORT_PMA_ATTR(port_rcv_data , 13, 32, 
224);
 static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256);
 static PORT_PMA_ATTR(port_rcv_packets  , 15, 32, 288);
 
+/*
+ * Counters added by extended set
+ */
+static PORT_PMA_ATTR_EXT(port_xmit_data, 64,  64);
+static PORT_PMA_ATTR_EXT(port_rcv_data , 64, 128);
+static PORT_PMA_ATTR_EXT(port_xmit_packets , 64, 192);
+static PORT_PMA_ATTR_EXT(port_rcv_packets  , 64, 256);
+static PORT_PMA_ATTR_EXT(unicast_xmit_packets  , 64, 320);
+static PORT_PMA_ATTR_EXT(unicast_rcv_packets   , 64, 384);
+static PORT_PMA_ATTR_EXT(multicast_xmit_packets, 64, 448);
+static PORT_PMA_ATTR_EXT(multicast_rcv_packets , 64, 512);
+
 static struct attribute *pma_attrs[] = {
&port_pma_attr_symbol_error.attr.attr,
&port_pma_attr_link_error_recovery.attr.attr,
@@ -444,11 +468,65 @@ static struct attribute *pma_attrs[] = {
NULL
 };
 
+static struct attribute *pma_attrs_ext[] = {
+   &port_pma_attr_symbol_error.attr.attr,
+   &port_pma_attr_link_error_recovery.attr.attr,
+   &port_pma_attr_link_downed.attr.attr,
+   &port_pma_attr_port_rcv_errors.attr.attr,
+   &port_pma_attr_port_rcv_remote_physical_errors.attr.attr,
+   &port_pma_attr_port_rcv_switch_relay_errors.attr.attr,
+   &port_pma_attr_port_xmit_discards.attr.attr,
+   &port_pma_attr_port_xmit_constraint_errors.attr.attr,
+   &port_pma_attr_port_rcv_constraint_errors.attr.attr,
+   &port_pma_attr_local_link_integrity_errors.attr.attr,
+   &port_pma_attr_excessive_buffer_overrun_errors.attr.attr,
+   &port_pma_attr_VL15_dropped.attr.attr,
+   &port_pma_attr_ext_port_xmit_data.attr.attr,
+   &port_pma_attr_ext_port_rcv_data.attr.attr,
+   &port_pma_attr_ext_port_xmit_packets.attr.attr,
+   &port_pma_attr_ext_port_rcv_packets.attr.attr,
+   &port_pma_attr_ext_unicast_rcv_packets.attr.attr,
+   &port_pma_attr_ext_unicast_xmit_packets.attr.attr,
+   &port_pma_attr_ext_multicast_rcv_packets.attr.attr,
+   &port_pma_attr_ext_multicast_xmit_packets.attr.attr,
+   NULL
+};
+
+static struct attribute *pma_attrs_noietf[] = {
+   &port_pma_attr_symbol_error.attr.attr,
+   &port_pma_attr_link_error_recovery.attr.attr,
+   &port_pma_attr_link_downed.attr.attr,
+   &port_pma_attr_port_rcv_errors.attr.attr,
+   &port_pma_attr_port_rcv_remote_physical_errors.attr.attr,
+   &port_pma_attr_port_rcv_switch_relay_errors.attr.attr,
+   &port_pma_attr_port_xmit_discards.attr.attr,
+   &port_pma_attr_port_xmit_constraint_errors.attr.attr,
+   &port_pma_attr_port_rcv_constraint_errors.attr.attr,
+   &port_pma_attr_local_link_integrity_errors.attr.attr,
+   &port_pma_attr_excessive_buffer_overrun_errors.attr.attr,
+   &port_pma_attr_VL15_dropped.attr.attr,
+   &port_pma_attr_ext_port_xmit_data.attr.attr,
+   &port_pma_attr_ext_port_rcv_data.attr.attr,
+   &port_pma_attr_ext_port_xmit_packets.attr.attr,
+   &port_pma_attr_ext_port_rcv_packets.attr.attr,
+   NULL
+};
+
 static struct attribute_group pma_group = {
.name  = "counters",
.attrs  = pma_attrs
 };
 
+static struct attribute_group pma_group_ext = {
+   .name  = "counters",
+   .attrs  = pma_at

[PATCH 0/3] IB core: 64 bit counter support V3

2015-12-21 Thread Christoph Lameter
V2->V3
  - Also add support for NOIETF counter mode where we have 64 bit
counters but not the multicast/unicast counters.
  - Add Reviewed-by's from Hal.

V1->V2
  - Add detection of the capability for 64 bit counter support
  - Lots of improvements as a result of suggestions by Hal Rosenstock.

Currently we only use 32 bits for the packet and byte counters. There have
been extended counters available for some time but we have no support for
those yet upstream. We keep having issues with 32 bit counters wrapping.
The byte counter in particular can wrap frequently (as in multiple times
per minute).

This patch adds 4 new counters (for full extended mode) and updates 4 32
bit counters to use the 64 bit sizes (for NOIETF and full extended mode)
so that they no longer wrap.

Should the device not support 64 bit counters then only the original 32
bit counters will be visible.
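
To get a feeling for how quickly a counter of a given width is exhausted, here is a back-of-the-envelope sketch. The function name and its parameters are made up for this example, and the unit size and data rate are inputs supplied by the caller, not measurements from this patchset.

```c
#include <stdint.h>

/*
 * Seconds until a counter that is `bits` wide is exhausted, given how many
 * bytes one counter increment represents and the sustained byte rate.
 */
static double seconds_to_wrap(unsigned bits, double bytes_per_unit,
                              double bytes_per_second)
{
    /* 2^64 does not fit in a 64 bit integer, so use the double constant. */
    double units = (bits >= 64) ? 18446744073709551616.0
                                : (double)(1ULL << bits);

    return units * bytes_per_unit / bytes_per_second;
}
```

For example, a 32 bit counter that counts in 4 byte units is exhausted after 4 seconds at a sustained 4 GiB/s (2^32 units * 4 bytes / 2^32 bytes per second), while a 64 bit counter at the same rate lasts 2^32 times longer.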

This patchset can be pulled from my git repo on kernel.org:

git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/rdma.git counter_64bit

Thanks to Hal Rosenstock and Ira Weiny for reviewing this patchset.



[PATCH 2/3] Specify attribute_id in port_table_attribute

2015-12-21 Thread Christoph Lameter
Add the attr_id on port_table_attribute since we will have to add
a different port_table_attribute for the extended attribute soon.

Reviewed-by: Hal Rosenstock <h...@mellanox.com>
Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/core/sysfs.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index acefe85..34dcc23 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -39,6 +39,7 @@
 #include 
 
 #include 
+#include 
 
 struct ib_port {
struct kobject kobj;
@@ -65,6 +66,7 @@ struct port_table_attribute {
struct port_attribute   attr;
charname[8];
int index;
+   int attr_id;
 };
 
 static ssize_t port_attr_show(struct kobject *kobj,
@@ -314,7 +316,8 @@ static ssize_t show_port_pkey(struct ib_port *p, struct 
port_attribute *attr,
 #define PORT_PMA_ATTR(_name, _counter, _width, _offset)
\
 struct port_table_attribute port_pma_attr_##_name = {  \
.attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
-   .index = (_offset) | ((_width) << 16) | ((_counter) << 24)  \
+   .index = (_offset) | ((_width) << 16) | ((_counter) << 24), \
+   .attr_id = IB_PMA_PORT_COUNTERS ,   \
 }
 
 /*
@@ -376,7 +379,7 @@ static ssize_t show_pma_counter(struct ib_port *p, struct 
port_attribute *attr,
ssize_t ret;
u8 data[8];
 
-   ret = get_perf_mad(p->ibdev, p->port_num, cpu_to_be16(0x12), &data,
+   ret = get_perf_mad(p->ibdev, p->port_num, tab_attr->attr_id, &data,
40 + offset / 8, sizeof(data));
if (ret < 0)
return sprintf(buf, "N/A (no PMA)\n");
-- 
2.5.0
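
The .index field in PORT_PMA_ATTR packs three small values into one integer: the bit offset in the low 16 bits, the width at bit 16 and the counter number at bit 24. A minimal userspace sketch of that packing follows. The helper names are invented for this example, and the counter accessor is an assumption, since the quoted code only decodes offset and width.

```c
#include <stdint.h>

/* Pack offset (16 bits), width (8 bits) and counter (8 bits) into one word. */
static uint32_t pma_index_pack(unsigned offset, unsigned width, unsigned counter)
{
    return (offset & 0xffffu) | ((width & 0xffu) << 16) |
           ((counter & 0xffu) << 24);
}

/* Bit offset of the counter inside the MAD payload. */
static unsigned pma_index_offset(uint32_t index)
{
    return index & 0xffffu;
}

/* Width of the counter in bits. */
static unsigned pma_index_width(uint32_t index)
{
    return (index >> 16) & 0xffu;
}

/* Counter number (hypothetical accessor, not used by the quoted code). */
static unsigned pma_index_counter(uint32_t index)
{
    return (index >> 24) & 0xffu;
}
```

Using the port_rcv_data entry from the table (counter 13, width 32, offset 224) gives the packed value 0x0D2000E0.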




[PATCH 1/3] Create get_perf_mad function in sysfs.c

2015-12-21 Thread Christoph Lameter
Create a new function to retrieve performance management
data from the existing code in get_pma_counter().

Reviewed-by: Hal Rosenstock <h...@mellanox.com>
Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/core/sysfs.c | 62 ++---
 1 file changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b1f37d4..acefe85 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -317,21 +317,21 @@ struct port_table_attribute port_pma_attr_##_name = { 
\
.index = (_offset) | ((_width) << 16) | ((_counter) << 24)  \
 }
 
-static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
-   char *buf)
+/*
+ * Get a Perfmgmt MAD block of data.
+ * Returns error code or the number of bytes retrieved.
+ */
+static int get_perf_mad(struct ib_device *dev, int port_num, int attr,
+   void *data, int offset, size_t size)
 {
-   struct port_table_attribute *tab_attr =
-   container_of(attr, struct port_table_attribute, attr);
-   int offset = tab_attr->index & 0xffff;
-   int width  = (tab_attr->index >> 16) & 0xff;
-   struct ib_mad *in_mad  = NULL;
-   struct ib_mad *out_mad = NULL;
+   struct ib_mad *in_mad;
+   struct ib_mad *out_mad;
size_t mad_size = sizeof(*out_mad);
u16 out_mad_pkey_index = 0;
ssize_t ret;
 
-   if (!p->ibdev->process_mad)
-   return sprintf(buf, "N/A (no PMA)\n");
+   if (!dev->process_mad)
+   return -ENOSYS;
 
in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
@@ -344,12 +344,12 @@ static ssize_t show_pma_counter(struct ib_port *p, struct 
port_attribute *attr,
in_mad->mad_hdr.mgmt_class= IB_MGMT_CLASS_PERF_MGMT;
in_mad->mad_hdr.class_version = 1;
in_mad->mad_hdr.method= IB_MGMT_METHOD_GET;
-   in_mad->mad_hdr.attr_id   = cpu_to_be16(0x12); /* PortCounters */
+   in_mad->mad_hdr.attr_id   = attr;
 
-   in_mad->data[41] = p->port_num; /* PortSelect field */
+   in_mad->data[41] = port_num;/* PortSelect field */
 
-   if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY,
-p->port_num, NULL, NULL,
+   if ((dev->process_mad(dev, IB_MAD_IGNORE_MKEY,
+port_num, NULL, NULL,
 (const struct ib_mad_hdr *)in_mad, mad_size,
 (struct ib_mad_hdr *)out_mad, &mad_size,
 &out_mad_pkey_index) &
@@ -358,31 +358,49 @@ static ssize_t show_pma_counter(struct ib_port *p, struct 
port_attribute *attr,
ret = -EINVAL;
goto out;
}
+   memcpy(data, out_mad->data + offset, size);
+   ret = size;
+out:
+   kfree(in_mad);
+   kfree(out_mad);
+   return ret;
+}
+
+static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
+   char *buf)
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   int offset = tab_attr->index & 0xffff;
+   int width  = (tab_attr->index >> 16) & 0xff;
+   ssize_t ret;
+   u8 data[8];
+
+   ret = get_perf_mad(p->ibdev, p->port_num, cpu_to_be16(0x12), &data,
+   40 + offset / 8, sizeof(data));
+   if (ret < 0)
+   return sprintf(buf, "N/A (no PMA)\n");
 
switch (width) {
case 4:
-   ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >>
+   ret = sprintf(buf, "%u\n", (*data >>
(4 - (offset % 8))) & 0xf);
break;
case 8:
-   ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]);
+   ret = sprintf(buf, "%u\n", *data);
break;
case 16:
ret = sprintf(buf, "%u\n",
- be16_to_cpup((__be16 *)(out_mad->data + 40 + 
offset / 8)));
+ be16_to_cpup((__be16 *)data));
break;
case 32:
ret = sprintf(buf, "%u\n",
- be32_to_cpup((__be32 *)(out_mad->data + 40 + 
offset / 8)));
+ be32_to_cpup((__be32 *)data));
break;
default:
ret = 0;
}
 
-out:
-   kfree(in_mad);
-   kfree(out_mad);
-
return ret;
 }
 
-- 
2.5.0
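
After this split, show_pma_counter() only has to interpret a small big endian buffer according to the counter width. The following userspace sketch mirrors that switch. The function name decode_be_counter is made up, the 64 bit case anticipates the extended counters added later in the series, and the guard against a negative shift is an addition that the quoted code does not have.

```c
#include <stdint.h>

/*
 * Decode a counter of the given bit width from a big endian byte buffer.
 * For 4 bit counters, bit_offset selects the high (0) or low (4) nibble,
 * using the same shift as the quoted patch.
 */
static uint64_t decode_be_counter(const uint8_t *data, int width, int bit_offset)
{
    uint64_t value = 0;
    int shift;

    switch (width) {
    case 4:
        shift = 4 - (bit_offset % 8);
        if (shift < 0)
            return 0;             /* not nibble aligned: treat as unknown */
        return (uint64_t)((data[0] >> shift) & 0xf);
    case 8:
    case 16:
    case 32:
    case 64:
        /* Most significant byte comes first on the wire. */
        for (int i = 0; i < width / 8; i++)
            value = (value << 8) | data[i];
        return value;
    default:
        return 0;
    }
}
```

Assembling the value byte by byte keeps the sketch independent of host endianness, which is what be16_to_cpup() and friends take care of in the kernel.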




[PATCH 1/2] Isolate common list remove code

2015-12-21 Thread Christoph Lameter
Code cleanup to remove multicast specific code from ipoib_main.c

The removal of a list of multicast groups occurs in three places.
Create a new function ipoib_mcast_remove_list(). Use this new
function in ipoib_main.c too.
That in turn allows the dropping of two functions that were
exported from ipoib_multicast.c for expiration of mc groups.

Reviewed-by: Iraq Weiny <ira.we...@intel.com>
Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  3 +--
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  7 ++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 24 ++--
 3 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 3ede103..989c409 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -495,7 +495,6 @@ void ipoib_dev_cleanup(struct net_device *dev);
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
-void ipoib_mcast_free(struct ipoib_mcast *mc);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
@@ -549,7 +548,7 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter,
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
   union ib_gid *mgid, int set_qkey);
-int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast);
+void ipoib_mcast_remove_list(struct net_device *dev, struct list_head 
*remove_list);
 struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);
 
 int ipoib_init_qp(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 7d32818..483ff20 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1150,7 +1150,7 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv 
*priv)
unsigned long flags;
int i;
LIST_HEAD(remove_list);
-   struct ipoib_mcast *mcast, *tmcast;
+   struct ipoib_mcast *mcast;
struct net_device *dev = priv->dev;
 
if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
@@ -1207,10 +1207,7 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv 
*priv)
 
 out_unlock:
spin_unlock_irqrestore(&priv->lock, flags);
-   list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-   ipoib_mcast_leave(dev, mcast);
-   ipoib_mcast_free(mcast);
-   }
+   ipoib_mcast_remove_list(dev, &remove_list);
 }
 
 static void ipoib_reap_neigh(struct work_struct *work)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index f357ca6..8acb420a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -106,7 +106,7 @@ static void __ipoib_mcast_schedule_join_thread(struct 
ipoib_dev_priv *priv,
queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 }
 
-void ipoib_mcast_free(struct ipoib_mcast *mcast)
+static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
struct net_device *dev = mcast->dev;
int tx_dropped = 0;
@@ -677,7 +677,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev)
return 0;
 }
 
-int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
+static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
int ret = 0;
@@ -704,6 +704,16 @@ int ipoib_mcast_leave(struct net_device *dev, struct 
ipoib_mcast *mcast)
return 0;
 }
 
+void ipoib_mcast_remove_list(struct net_device *dev, struct list_head 
*remove_list)
+{
+   struct ipoib_mcast *mcast, *tmcast;
+
+   list_for_each_entry_safe(mcast, tmcast, remove_list, list) {
+   ipoib_mcast_leave(dev, mcast);
+   ipoib_mcast_free(mcast);
+   }
+}
+
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -810,10 +820,7 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
wait_for_completion(&mcast->done);
 
-   list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-   ipoib_mcast_leave(dev, mcast);
-   ipoib_mcast_free(mcast);
-   }
+   ipoib_mcast_remove_list(dev, &remove_list);
 }
 
 static int ipoib_mcast_addr_is_valid(const u8 *addr, const u8 *broadcast)
@@ -939,10 +946,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
wait_for_completion(&mcast->done);
 
-   list_for_

[PATCH 2/2] Move multicast specific code out of ipoib_main.c

2015-12-21 Thread Christoph Lameter
V1->V2:
- Rename function as requested by Ira

Code cleanup to move multicast specific code that checks for
a sendonly join to ipoib_multicast.c. This allows the removal
of the export of __ipoib_mcast_find().

Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  3 ++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  | 13 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 -
 3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 989c409..a924933 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -549,7 +549,8 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter,
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
   union ib_gid *mgid, int set_qkey);
 void ipoib_mcast_remove_list(struct net_device *dev, struct list_head 
*remove_list);
-struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);
+void ipoib_check_and_add_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
+   struct list_head *remove_list);
 
 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 483ff20..620d9ca 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1150,7 +1150,6 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv 
*priv)
unsigned long flags;
int i;
LIST_HEAD(remove_list);
-   struct ipoib_mcast *mcast;
struct net_device *dev = priv->dev;
 
if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
@@ -1179,18 +1178,8 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv 
*priv)
  
lockdep_is_held(&priv->lock))) != NULL) {
/* was the neigh idle for two GC periods */
if (time_after(neigh_obsolete, neigh->alive)) {
-   u8 *mgid = neigh->daddr + 4;
 
-   /* Is this multicast ? */
-   if (*mgid == 0xff) {
-   mcast = __ipoib_mcast_find(dev, mgid);
-
-   if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
-   list_del(&mcast->list);
-   rb_erase(&mcast->rb_node, &priv->multicast_tree);
-   list_add_tail(&mcast->list, &remove_list);
-   }
-   }
+   ipoib_check_and_add_mcast_sendonly(priv, neigh->daddr + 4, &remove_list);
 
rcu_assign_pointer(*np,
   
rcu_dereference_protected(neigh->hnext,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 8acb420a..ab79b87 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -153,7 +153,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct 
net_device *dev,
return mcast;
 }
 
-struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
+static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void 
*mgid)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct rb_node *n = priv->multicast_tree.rb_node;
@@ -704,6 +704,25 @@ static int ipoib_mcast_leave(struct net_device *dev, 
struct ipoib_mcast *mcast)
return 0;
 }
 
+/*
+ * Check if the multicast group is sendonly. If so remove it from the maps
+ * and add to the remove list
+ */
+void ipoib_check_and_add_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
+   struct list_head *remove_list)
+{
+   /* Is this multicast ? */
+   if (*mgid == 0xff) {
+   struct ipoib_mcast *mcast = __ipoib_mcast_find(priv->dev, mgid);
+
+   if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+   list_del(&mcast->list);
+   rb_erase(&mcast->rb_node, &priv->multicast_tree);
+   list_add_tail(&mcast->list, remove_list);
+   }
+   }
+}
+
 void ipoib_mcast_remove_list(struct net_device *dev, struct list_head 
*remove_list)
 {
struct ipoib_mcast *mcast, *tmcast;
-- 
2.5.0
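
The check that moved into ipoib_multicast.c can be pictured with the simplified userspace sketch below. It assumes a flat array instead of the rb-tree and a boolean instead of the remove list; struct group and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct ipoib_mcast. */
struct group {
    uint8_t mgid[16];
    bool sendonly;
    bool queued_for_removal;
};

/* A multicast GID starts with the byte 0xff. */
static bool is_multicast_gid(const uint8_t *mgid)
{
    return mgid[0] == 0xff;
}

/* Linear lookup standing in for the rb-tree search. */
static struct group *find_group(struct group *groups, size_t n,
                                const uint8_t *mgid)
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(groups[i].mgid, mgid, sizeof(groups[i].mgid)) == 0)
            return &groups[i];
    return NULL;
}

/* Queue the group for removal only if it is multicast and sendonly. */
static bool check_and_queue_sendonly(struct group *groups, size_t n,
                                     const uint8_t *mgid)
{
    struct group *g;

    if (!is_multicast_gid(mgid))
        return false;

    g = find_group(groups, n, mgid);
    if (g == NULL || !g->sendonly)
        return false;

    g->queued_for_removal = true;
    return true;
}
```

Keeping the lookup and the flag test in one helper is what allows the lookup function itself to stay private to the multicast code.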




[PATCH 0/2] IB multicast cleanup patches V2

2015-12-21 Thread Christoph Lameter
V1->V2
 - Add Reviewed by's for first patch from Ira Weiny
 - Change name of ipoib_check_mcast_sendonly() to
ipoib_check_and_add_mcast_sendonly() as requested by Ira

This patchset cleans up the code a bit after the last round of multicast
patches related to the sendonly join logic. Some of the bits of code
landed in ipoib_main.c instead of ipoib_multicast.c.

- Move the multicast bits into that file so that everything is neatly together
- Reduce the number of functions exported from ipoib_multicast.c

This patchset can be retrieved from a git repo on kernel.org via

git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/rdma.git 
cleanup



Re: [PATCH 1/2] Isolate common list remove code

2015-12-21 Thread Christoph Lameter
On Mon, 21 Dec 2015, Leon Romanovsky wrote:

> On Mon, Dec 21, 2015 at 08:42:53AM -0600, Christoph Lameter wrote:
> > Code cleanup to remove multicast specific code from ipoib_main.c
> >
> > The removal of a list of multicast groups occurs in three places.
> > Create a new function ipoib_mcast_remove_list(). Use this new
> > function in ipoib_main.c too.
> > That in turn allows the dropping of two functions that were
> > exported from ipoib_multicast.c for expiration of mc groups.
> >
> > Reviewed-by: Iraq Weiny <ira.we...@intel.com>
> Iraq Weiny --> Ira Weiny

Ohh.. Bad typo.

> > +void ipoib_mcast_remove_list(struct net_device *dev, struct list_head 
> > *remove_list)
> Will it be beneficial to inline this function?

As far as I know it is not run in a latency critical context and the code
is too heavy for that. In particular we are calling other functions that
are not inlined.

> > +{
> > +   struct ipoib_mcast *mcast, *tmcast;
> > +
> > +   list_for_each_entry_safe(mcast, tmcast, remove_list, list) {
> > +   ipoib_mcast_leave(dev, mcast);
> > +   ipoib_mcast_free(mcast);
> > +   }
> > +}
> > +
>
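
For readers unfamiliar with the pattern, the loop in ipoib_mcast_remove_list() relies on list_for_each_entry_safe() keeping a second cursor, so that the current element can be freed while iterating. A minimal userspace sketch of the same idea with a hand-rolled singly linked list (struct node, push and remove_list are hypothetical names):

```c
#include <stdlib.h>

struct node {
    int group;
    struct node *next;
};

/* Prepend a node; returns the new head, or the old head if malloc fails. */
static struct node *push(struct node *head, int group)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return head;
    n->group = group;
    n->next = head;
    return n;
}

/* Free every node on the list and return how many were removed. */
static int remove_list(struct node **head)
{
    struct node *cur = *head;
    struct node *tmp;
    int removed = 0;

    while (cur != NULL) {
        tmp = cur->next;          /* save the successor before freeing */
        free(cur);
        cur = tmp;
        removed++;
    }
    *head = NULL;
    return removed;
}
```

Reading cur->next after free(cur) would be a use after free, which is why the successor is saved first.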


Re: [PATCH 3/3] IB Core: Display extended counter set if available

2015-12-18 Thread Christoph Lameter
On Thu, 17 Dec 2015, Hal Rosenstock wrote:

> > +   if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH) {
> > +   /* We have extended counters */
> > +
> > +   if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH_NOIETF)
> > +   /* But not the IETF ones */
> > +   return &pma_group_noietf;
>
> These 2 capability bits are mutually exclusive so I think it should be:
>
>   if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH) {
>   /* We have extended counters */
>   return &pma_group_ext;
>   }
>
>   if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH_NOIETF)
>   /* But not the IETF ones */
>   return &pma_group_noietf;

This case would then use the 64 bit counters despite
IB_PMA_CLASS_CAP_EXT_WIDTH not being set.

>   }
>
>   return &pma_group;
>
> > +

Each table contains the complete set of counters it exposes. So we would
need another table of counters that has the IETF counters but not the
64 bit extended ones?



Re: [PATCH 3/3] IB Core: Display extended counter set if available

2015-12-18 Thread Christoph Lameter
On Fri, 18 Dec 2015, Hal Rosenstock wrote:

> > This case would then use the 64 bit counters despite
> > IB_PMA_CLASS_CAP_EXT_WIDTH not being set.
>
> Yes, IB_PMA_CLASS_CAP_EXT_WIDTH means all extended counters including
> IETF ones whereas IB_PMA_CLASS_CAP_EXT_WIDTH_NOIETF means extended
> counters without IETF ones ([uni multi]cast [rcv xmit] pkts).

Ok so I updated the add on patch to the following. Doug: Is this enough
or do you want another rollup?


From: Christoph Lameter <c...@linux.com>
Subject: IB core counters: Support noietf extended counters V2

V1->V2: Fix logic to detect when 64 bit counters are available
based on Hal's suggestions.

Detect if we have extended counters but not IETF counters.
For that we need a special table and create a function that
returns the table address.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -493,6 +493,26 @@ static struct attribute *pma_attrs_ext[]
NULL
 };

+static struct attribute *pma_attrs_noietf[] = {
+   &port_pma_attr_symbol_error.attr.attr,
+   &port_pma_attr_link_error_recovery.attr.attr,
+   &port_pma_attr_link_downed.attr.attr,
+   &port_pma_attr_port_rcv_errors.attr.attr,
+   &port_pma_attr_port_rcv_remote_physical_errors.attr.attr,
+   &port_pma_attr_port_rcv_switch_relay_errors.attr.attr,
+   &port_pma_attr_port_xmit_discards.attr.attr,
+   &port_pma_attr_port_xmit_constraint_errors.attr.attr,
+   &port_pma_attr_port_rcv_constraint_errors.attr.attr,
+   &port_pma_attr_local_link_integrity_errors.attr.attr,
+   &port_pma_attr_excessive_buffer_overrun_errors.attr.attr,
+   &port_pma_attr_VL15_dropped.attr.attr,
+   &port_pma_attr_ext_port_xmit_data.attr.attr,
+   &port_pma_attr_ext_port_rcv_data.attr.attr,
+   &port_pma_attr_ext_port_xmit_packets.attr.attr,
+   &port_pma_attr_ext_port_rcv_packets.attr.attr,
+   NULL
+};
+
 static struct attribute_group pma_group = {
.name  = "counters",
.attrs  = pma_attrs
@@ -503,6 +523,11 @@ static struct attribute_group pma_group_
.attrs  = pma_attrs_ext
 };

+static struct attribute_group pma_group_noietf = {
+   .name  = "counters",
+   .attrs  = pma_attrs_noietf
+};
+
 static void ib_port_release(struct kobject *kobj)
 {
struct ib_port *p = container_of(kobj, struct ib_port, kobj);
@@ -576,10 +601,10 @@ err:
 }

 /*
- * Check if the port supports the Extended Counters.
- * Return error code of 0 for success
+ * Figure out which counter table to use depending on
+ * the device capabilities.
  */
-static int port_check_extended_counters(struct ib_device *dev)
+static struct attribute_group *get_counter_table(struct ib_device *dev)
 {
int ret = 0;
struct ib_class_port_info cpi;
@@ -587,12 +612,18 @@ static int port_check_extended_counters(
ret = get_perf_mad(dev, 0, IB_PMA_CLASS_PORT_INFO, &cpi, 40, sizeof(cpi));

if (ret >= 0) {
-   if (!(cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH) &&
-   !(cpi.capability_mask && 
IB_PMA_CLASS_CAP_EXT_WIDTH_NOIETF))
-   ret = -ENOSYS;
+
+   if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH)
+   /* We have extended counters */
+   return &pma_group_ext;
+
+   if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH_NOIETF)
+   /* But not the IETF ones */
+   return &pma_group_noietf;
}

-   return ret;
+   /* Fall back to normal counters */
+   return &pma_group;
 }

 static int add_port(struct ib_device *device, int port_num,
@@ -623,11 +654,7 @@ static int add_port(struct ib_device *de
return ret;
}

-   ret = sysfs_create_group(&p->kobj,
-   port_check_extended_counters(device) ?
-   &pma_group_ext :
-   &pma_group);
-
+   ret = sysfs_create_group(&p->kobj, get_counter_table(device));
if (ret)
goto err_put;



Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-17 Thread Christoph Lameter
On Thu, 17 Dec 2015, Hal Rosenstock wrote:

> > + * Get a MAD block of data.
>
> Nit: Get PerfMgt MAD block of data

Ok.

> > + * Returns error code or the number of bytes retrieved.
> > + */
> > +static int get_mad(struct ib_device *dev, int port_num, int attr,
>
> Nit: Maybe this is too verbose but better name might be get_perf_mad

Ok.

> > +static int port_check_extended_counters(struct ib_device *dev, int port)
> > +{
> > +   int ret = 0;
> > +   struct ib_class_port_info cpi;
> > +
> > +   ret = get_mad(dev, port, IB_PMA_CLASS_PORT_INFO, &cpi, 40, sizeof(cpi));
>
> ClassPortInfo is per class not per class per port so need to indicate to
> get_mad whether a port is supplied or not or conditionalize based on
> attr ID.

I thought a port is always supplied since we get the info for a particular
port and the directory only exists if there is a port?

> > -   ret = sysfs_create_group(>kobj, _group);
> > +   ret = sysfs_create_group(>kobj,
> > +   port_check_extended_counters(device, port_num) ?
> > +   _group_ext :
> > +   _group);
>
> PortExtendedCounters does not have all the error counters in
> PortCounters so this isn't an either or. When extended port counters are
> supported should still include the original port counters with the
> exception of the [xmit rcv] [pkts data] which should come from the
> extended counters.

The original port counters are still included. The _ext table refers to
both extended and regular counters.



[PATCH 2/3] IB core counters: Specify attribute_id in port_table_attribute

2015-12-17 Thread Christoph Lameter
Add the attr_id on port_table_attribute since we will have to add
a different port_table_attribute for the extended attribute soon.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -39,6 +39,7 @@
 #include 
 
 #include 
+#include 
 
 struct ib_port {
struct kobject kobj;
@@ -65,6 +66,7 @@ struct port_table_attribute {
struct port_attribute   attr;
charname[8];
int index;
+   int attr_id;
 };
 
 static ssize_t port_attr_show(struct kobject *kobj,
@@ -314,7 +316,8 @@ static ssize_t show_port_pkey(struct ib_
 #define PORT_PMA_ATTR(_name, _counter, _width, _offset)                \
 struct port_table_attribute port_pma_attr_##_name = {  \
.attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
-   .index = (_offset) | ((_width) << 16) | ((_counter) << 24)  \
+   .index = (_offset) | ((_width) << 16) | ((_counter) << 24), \
+   .attr_id = IB_PMA_PORT_COUNTERS ,   \
 }
 
 /*
@@ -376,7 +379,7 @@ static ssize_t show_pma_counter(struct i
ssize_t ret;
u8 data[8];
 
-   ret = get_perf_mad(p->ibdev, p->port_num, cpu_to_be16(0x12), ,
+   ret = get_perf_mad(p->ibdev, p->port_num, tab_attr->attr_id, ,
40 + offset / 8, sizeof(data));
if (ret < 0)
return sprintf(buf, "N/A (no PMA)\n");



[PATCH 0/3] IB 64 bit counter support V2

2015-12-17 Thread Christoph Lameter
V1->V2 Add detection of the capability for 64 bit counter support and lots
  of improvements as a result of suggestions by Hal Rosenstock.

Currently we only use 32 bits for the packet and byte counters. Extended
counters have been available for some time, but there is no upstream support
for them yet. We keep having issues with 32 bit counters wrapping. The byte
counter in particular can wrap frequently (as in multiple times per minute).
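
To put a rough number on the wrap rate, here is a small arithmetic sketch. It
assumes a link carrying 56 Gbit/s of payload (a hypothetical figure) and a data
counter that advances once per 4 octets (assumed to match how the IB data
counters are defined). The function names are made up for the example.

```c
#include <stdint.h>

/*
 * Counter increments per second for a link carrying gbit_per_s of payload,
 * when the counter advances once per octets_per_unit octets.
 */
static double units_per_second(double gbit_per_s, unsigned octets_per_unit)
{
	return gbit_per_s * 1e9 / 8.0 / octets_per_unit;
}

/*
 * Seconds until an unsigned counter of the given width (1..64 bits) wraps
 * at a constant rate of `rate` increments per second.
 */
static double wrap_seconds(unsigned bits, double rate)
{
	/* 2^64 does not fit in a 64 bit integer, so use the constant directly */
	double range = (bits >= 64) ? 18446744073709551616.0
				    : (double)(1ULL << bits);

	return range / rate;
}
```

Under these assumptions wrap_seconds(32, units_per_second(56.0, 4)) comes out
at roughly 2.5 seconds, while the same rate takes centuries to wrap a 64 bit
counter.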

This patch adds 4 new counters and updates 4 32 bit counters to use the
64 bit sizes so that they no longer wrap.
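
The counters arrive as big-endian fields in the MAD payload, so showing a 64
bit counter means reading 8 bytes and printing them as an unsigned 64 bit
value. A minimal user-space sketch of that step follows; read_be and
format_counter are hypothetical helpers, not the kernel's be64_to_cpup() path,
and only byte-aligned widths are handled.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Read an unsigned big-endian integer of nbytes (1..8) bytes. */
static uint64_t read_be(const uint8_t *p, size_t nbytes)
{
	uint64_t v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Format a counter of width_bits (8, 16, 32 or 64) as decimal text followed
 * by a newline. Returns the number of characters written, or 0 for a width
 * this sketch does not handle.
 */
static int format_counter(char *buf, size_t len, const uint8_t *data,
			  unsigned width_bits)
{
	switch (width_bits) {
	case 8:
	case 16:
	case 32:
	case 64:
		return snprintf(buf, len, "%" PRIu64 "\n",
				read_be(data, width_bits / 8));
	default:
		return 0;
	}
}
```

With 8 bytes of input, a value of 2^32 no longer truncates the way it would in
a 32 bit field.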

Should the device not support 64 bit counters then only the original 32
bit counters will be visible.



Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-17 Thread Christoph Lameter
On Thu, 17 Dec 2015, Hal Rosenstock wrote:

> > I thought a port is always supplied since we get the info for a particular
> > port and the directory only exists if there is a port?
>
> Yes, but there is no port (PortSelect) field in ClassPortInfo attribute
> unlike the PortCounters and PortExtendedCounters attributes.

Ok, but it's valid for all ports of that class, right? Then this does not
matter?
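
One way to handle an attribute without a PortSelect field is to fill the field
in only when a port number is given, and to pass 0 for class-wide queries such
as ClassPortInfo. A minimal sketch, assuming the request payload is a zeroed
byte buffer and that PortSelect lives at byte 41 as in the patch;
set_port_select is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

#define PORT_SELECT_OFFSET 41	/* byte used for PortSelect in the patch */

/*
 * Fill in PortSelect only for per-port attributes.
 * Returns 1 if the field was written, 0 if port_num was 0 (class-wide
 * query, field left untouched), -1 if the buffer is too short.
 */
static int set_port_select(uint8_t *data, size_t len, unsigned port_num)
{
	if (len <= PORT_SELECT_OFFSET)
		return -1;

	if (port_num == 0)
		return 0;

	data[PORT_SELECT_OFFSET] = (uint8_t)port_num;
	return 1;
}
```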


Re: [PATCH 3/3] IB Core: Display extended counter set if available

2015-12-17 Thread Christoph Lameter
On Thu, 17 Dec 2015, Hal Rosenstock wrote:

> > -   ret = sysfs_create_group(>kobj, _group);
> > +   ret = sysfs_create_group(>kobj,
> > +   port_check_extended_counters(device) ?
> > +   _group_ext :
>
> Would be nice to populate 2 different groups based on whether PMA
> supports full extended counters or extended counters without the IETF
> ones (no [uni mcast] [rcv xmit] pkt counters) in sysfs.

So port_check_extended_counters needs to return another value for this.
The IETF ones are the uni/mcast xxx counters?



Re: [PATCH 3/3] IB Core: Display extended counter set if available

2015-12-17 Thread Christoph Lameter
On Thu, 17 Dec 2015, Hal Rosenstock wrote:

> On 12/17/2015 4:28 PM, Christoph Lameter wrote:
> > So port_check_extended_counters needs to return another value for this.
> > The IETF ones are the uni/mcast xxx counters?
>
> Yes

Ok. Then this patch on top of the last one should give us all of what you
want:



Subject: IB core counters: Support noietf extended counters

Detect if we have extended counters but not IETF counters.
For that we need a special table and create a function that
returns the table address.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -493,6 +493,26 @@ static struct attribute *pma_attrs_ext[]
NULL
 };

+static struct attribute *pma_attrs_noietf[] = {
+   _pma_attr_symbol_error.attr.attr,
+   _pma_attr_link_error_recovery.attr.attr,
+   _pma_attr_link_downed.attr.attr,
+   _pma_attr_port_rcv_errors.attr.attr,
+   _pma_attr_port_rcv_remote_physical_errors.attr.attr,
+   _pma_attr_port_rcv_switch_relay_errors.attr.attr,
+   _pma_attr_port_xmit_discards.attr.attr,
+   _pma_attr_port_xmit_constraint_errors.attr.attr,
+   _pma_attr_port_rcv_constraint_errors.attr.attr,
+   _pma_attr_local_link_integrity_errors.attr.attr,
+   _pma_attr_excessive_buffer_overrun_errors.attr.attr,
+   _pma_attr_VL15_dropped.attr.attr,
+   _pma_attr_ext_port_xmit_data.attr.attr,
+   _pma_attr_ext_port_rcv_data.attr.attr,
+   _pma_attr_ext_port_xmit_packets.attr.attr,
+   _pma_attr_ext_port_rcv_packets.attr.attr,
+   NULL
+};
+
 static struct attribute_group pma_group = {
.name  = "counters",
.attrs  = pma_attrs
@@ -503,6 +523,11 @@ static struct attribute_group pma_group_
.attrs  = pma_attrs_ext
 };

+static struct attribute_group pma_group_noietf = {
+   .name  = "counters",
+   .attrs  = pma_attrs_noietf
+};
+
 static void ib_port_release(struct kobject *kobj)
 {
struct ib_port *p = container_of(kobj, struct ib_port, kobj);
@@ -576,23 +601,32 @@ err:
 }

 /*
- * Check if the port supports the Extended Counters.
- * Return error code or 0 for success
+ * Figure out which counter table to use depending on
+ * the device capabilities.
  */
-static int port_check_extended_counters(struct ib_device *dev)
+static struct attribute_group *get_counter_table(struct ib_device *dev)
 {
int ret = 0;
struct ib_class_port_info cpi;

ret = get_perf_mad(dev, 0, IB_PMA_CLASS_PORT_INFO, , 40, sizeof(cpi));

-   if (ret >= 0) {
-   if (!(cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH) &&
-   !(cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH_NOIETF))
-   ret = -ENOSYS;
+   if (ret < 0)
+   /* Fall back to normal counters */
+   return _group;
+
+
+   if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH) {
+   /* We have extended counters */
+
+   if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH_NOIETF)
+   /* But not the IETF ones */
+   return _group_noietf;
+
+   return _group_ext;
}

-   return ret;
+   return _group;
 }

 static int add_port(struct ib_device *device, int port_num,
@@ -623,11 +657,7 @@ static int add_port(struct ib_device *de
return ret;
}

-   ret = sysfs_create_group(>kobj,
-   port_check_extended_counters(device) ?
-   _group_ext :
-   _group);
-
+   ret = sysfs_create_group(>kobj, get_counter_table(device));
if (ret)
goto err_put;



[PATCH 3/3] IB Core: Display extended counter set if available

2015-12-17 Thread Christoph Lameter
Check if the extended counters are available and if so
create the proper extended and additional counters.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -320,6 +320,13 @@ struct port_table_attribute port_pma_att
.attr_id = IB_PMA_PORT_COUNTERS ,   \
 }
 
+#define PORT_PMA_ATTR_EXT(_name, _width, _offset)  \
+struct port_table_attribute port_pma_attr_ext_##_name = {  \
+   .attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
+   .index = (_offset) | ((_width) << 16),  \
+   .attr_id = IB_PMA_PORT_COUNTERS_EXT ,   \
+}
+
 /*
  * Get a Perfmgmt MAD block of data.
  * Returns error code or the number of bytes retrieved.
@@ -349,7 +356,8 @@ static int get_perf_mad(struct ib_device
in_mad->mad_hdr.method= IB_MGMT_METHOD_GET;
in_mad->mad_hdr.attr_id   = attr;
 
-   in_mad->data[41] = port_num;/* PortSelect field */
+   if (port_num)
+   in_mad->data[41] = port_num;/* PortSelect field */
 
if ((dev->process_mad(dev, IB_MAD_IGNORE_MKEY,
 port_num, NULL, NULL,
@@ -400,6 +408,11 @@ static ssize_t show_pma_counter(struct i
ret = sprintf(buf, "%u\n",
  be32_to_cpup((__be32 *)data));
break;
+   case 64:
+   ret = sprintf(buf, "%llu\n",
+   be64_to_cpup((__be64 *)data));
+   break;
+
default:
ret = 0;
}
@@ -424,6 +437,18 @@ static PORT_PMA_ATTR(port_rcv_data
 static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256);
 static PORT_PMA_ATTR(port_rcv_packets  , 15, 32, 288);
 
+/*
+ * Counters added by extended set
+ */
+static PORT_PMA_ATTR_EXT(port_xmit_data, 64,  64);
+static PORT_PMA_ATTR_EXT(port_rcv_data , 64, 128);
+static PORT_PMA_ATTR_EXT(port_xmit_packets , 64, 192);
+static PORT_PMA_ATTR_EXT(port_rcv_packets  , 64, 256);
+static PORT_PMA_ATTR_EXT(unicast_xmit_packets  , 64, 320);
+static PORT_PMA_ATTR_EXT(unicast_rcv_packets   , 64, 384);
+static PORT_PMA_ATTR_EXT(multicast_xmit_packets, 64, 448);
+static PORT_PMA_ATTR_EXT(multicast_rcv_packets , 64, 512);
+
 static struct attribute *pma_attrs[] = {
_pma_attr_symbol_error.attr.attr,
_pma_attr_link_error_recovery.attr.attr,
@@ -444,11 +469,40 @@ static struct attribute *pma_attrs[] = {
NULL
 };
 
+static struct attribute *pma_attrs_ext[] = {
+   _pma_attr_symbol_error.attr.attr,
+   _pma_attr_link_error_recovery.attr.attr,
+   _pma_attr_link_downed.attr.attr,
+   _pma_attr_port_rcv_errors.attr.attr,
+   _pma_attr_port_rcv_remote_physical_errors.attr.attr,
+   _pma_attr_port_rcv_switch_relay_errors.attr.attr,
+   _pma_attr_port_xmit_discards.attr.attr,
+   _pma_attr_port_xmit_constraint_errors.attr.attr,
+   _pma_attr_port_rcv_constraint_errors.attr.attr,
+   _pma_attr_local_link_integrity_errors.attr.attr,
+   _pma_attr_excessive_buffer_overrun_errors.attr.attr,
+   _pma_attr_VL15_dropped.attr.attr,
+   _pma_attr_ext_port_xmit_data.attr.attr,
+   _pma_attr_ext_port_rcv_data.attr.attr,
+   _pma_attr_ext_port_xmit_packets.attr.attr,
+   _pma_attr_ext_port_rcv_packets.attr.attr,
+   _pma_attr_ext_unicast_rcv_packets.attr.attr,
+   _pma_attr_ext_unicast_xmit_packets.attr.attr,
+   _pma_attr_ext_multicast_rcv_packets.attr.attr,
+   _pma_attr_ext_multicast_xmit_packets.attr.attr,
+   NULL
+};
+
 static struct attribute_group pma_group = {
.name  = "counters",
.attrs  = pma_attrs
 };
 
+static struct attribute_group pma_group_ext = {
+   .name  = "counters",
+   .attrs  = pma_attrs_ext
+};
+
 static void ib_port_release(struct kobject *kobj)
 {
struct ib_port *p = container_of(kobj, struct ib_port, kobj);
@@ -521,6 +575,26 @@ err:
return NULL;
 }
 
+/*
+ * Check if the port supports the Extended Counters.
+ * Return error code or 0 for success
+ */
+static int port_check_extended_counters(struct ib_device *dev)
+{
+   int ret = 0;
+   struct ib_class_port_info cpi;
+
+   ret = get_perf_mad(dev, 0, IB_PMA_CLASS_PORT_INFO, , 40, sizeof(cpi));
+
+   if (ret >= 0) {
+   if (!(cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH) &&
+   !(cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH_NOIETF))
+   ret = -ENOSYS;
+   }
+
+   return ret;
+}
+
 static int add_port(str

[PATCH 1/3] IB Core: Create get_perf_mad function in sysfs.c

2015-12-17 Thread Christoph Lameter
Create a new function to retrieve performance management
data from the existing code in get_pma_counter().

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -317,21 +317,21 @@ struct port_table_attribute port_pma_att
.index = (_offset) | ((_width) << 16) | ((_counter) << 24)  \
 }
 
-static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
-   char *buf)
+/*
+ * Get a Perfmgmt MAD block of data.
+ * Returns error code or the number of bytes retrieved.
+ */
+static int get_perf_mad(struct ib_device *dev, int port_num, int attr,
+   void *data, int offset, size_t size)
 {
-   struct port_table_attribute *tab_attr =
-   container_of(attr, struct port_table_attribute, attr);
-   int offset = tab_attr->index & 0x;
-   int width  = (tab_attr->index >> 16) & 0xff;
-   struct ib_mad *in_mad  = NULL;
-   struct ib_mad *out_mad = NULL;
+   struct ib_mad *in_mad;
+   struct ib_mad *out_mad;
size_t mad_size = sizeof(*out_mad);
u16 out_mad_pkey_index = 0;
ssize_t ret;
 
-   if (!p->ibdev->process_mad)
-   return sprintf(buf, "N/A (no PMA)\n");
+   if (!dev->process_mad)
+   return -ENOSYS;
 
in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
@@ -344,12 +344,12 @@ static ssize_t show_pma_counter(struct i
in_mad->mad_hdr.mgmt_class= IB_MGMT_CLASS_PERF_MGMT;
in_mad->mad_hdr.class_version = 1;
in_mad->mad_hdr.method= IB_MGMT_METHOD_GET;
-   in_mad->mad_hdr.attr_id   = cpu_to_be16(0x12); /* PortCounters */
+   in_mad->mad_hdr.attr_id   = attr;
 
-   in_mad->data[41] = p->port_num; /* PortSelect field */
+   in_mad->data[41] = port_num;/* PortSelect field */
 
-   if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY,
-p->port_num, NULL, NULL,
+   if ((dev->process_mad(dev, IB_MAD_IGNORE_MKEY,
+port_num, NULL, NULL,
 (const struct ib_mad_hdr *)in_mad, mad_size,
 (struct ib_mad_hdr *)out_mad, _size,
 _mad_pkey_index) &
@@ -358,31 +358,49 @@ static ssize_t show_pma_counter(struct i
ret = -EINVAL;
goto out;
}
+   memcpy(data, out_mad->data + offset, size);
+   ret = size;
+out:
+   kfree(in_mad);
+   kfree(out_mad);
+   return ret;
+}
+
+static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
+   char *buf)
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   int offset = tab_attr->index & 0x;
+   int width  = (tab_attr->index >> 16) & 0xff;
+   ssize_t ret;
+   u8 data[8];
+
+   ret = get_perf_mad(p->ibdev, p->port_num, cpu_to_be16(0x12), ,
+   40 + offset / 8, sizeof(data));
+   if (ret < 0)
+   return sprintf(buf, "N/A (no PMA)\n");
 
switch (width) {
case 4:
-   ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >>
+   ret = sprintf(buf, "%u\n", (*data >>
(4 - (offset % 8))) & 0xf);
break;
case 8:
-   ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]);
+   ret = sprintf(buf, "%u\n", *data);
break;
case 16:
ret = sprintf(buf, "%u\n",
- be16_to_cpup((__be16 *)(out_mad->data + 40 + offset / 8)));
+ be16_to_cpup((__be16 *)data));
break;
case 32:
ret = sprintf(buf, "%u\n",
- be32_to_cpup((__be32 *)(out_mad->data + 40 + offset / 8)));
+ be32_to_cpup((__be32 *)data));
break;
default:
ret = 0;
}
 
-out:
-   kfree(in_mad);
-   kfree(out_mad);
-
return ret;
 }
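
The index field used by show_pma_counter() packs the bit offset, the field
width and the counter number into one integer, and the byte position in the MAD
data is derived from it as 40 + offset / 8. A minimal sketch of that packing
and unpacking follows. The helper names are made up for the example, and the
16 bit mask for the offset is an assumption based on the width being shifted by
16.

```c
#include <stdint.h>

/* Pack (counter, width in bits, bit offset) the way PORT_PMA_ATTR does. */
static uint32_t pma_index(unsigned counter, unsigned width, unsigned offset)
{
	return (offset & 0xffffu) |
	       ((width & 0xffu) << 16) |
	       ((counter & 0xffu) << 24);
}

/* Bit offset of the field within the attribute (assumed 16 bit mask). */
static unsigned pma_offset_bits(uint32_t index)
{
	return index & 0xffffu;
}

/* Width of the field in bits. */
static unsigned pma_width_bits(uint32_t index)
{
	return (index >> 16) & 0xffu;
}

/* Byte position of the field in the MAD data, as used in the patch. */
static unsigned pma_byte_offset(uint32_t index)
{
	return 40 + pma_offset_bits(index) / 8;
}
```

For port_rcv_packets (counter 15, width 32, offset 288) this gives byte 76; for
an extended counter at bit offset 512 it gives byte 104.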
 



Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-16 Thread Christoph Lameter
On Wed, 16 Dec 2015, Christoph Lameter wrote:

> DRAFT: This is missing the check if this device supports
> extended counters.

Found some time and here is the patch with the detection of the extended
attribute through sending a mad request. Untested. Got the info on how
to do the proper mad request from an earlier patch by Or in 2011.


Subject: IB Core: Display extended counter set if available V2

Check if the extended counters are available and if so
create the proper extended and additional counters.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -39,6 +39,7 @@
 #include 

 #include 
+#include 

 struct ib_port {
struct kobject kobj;
@@ -65,6 +66,7 @@ struct port_table_attribute {
struct port_attribute   attr;
charname[8];
int index;
+   int attr_id;
 };

 static ssize_t port_attr_show(struct kobject *kobj,
@@ -314,24 +316,33 @@ static ssize_t show_port_pkey(struct ib_
 #define PORT_PMA_ATTR(_name, _counter, _width, _offset)                \
 struct port_table_attribute port_pma_attr_##_name = {  \
.attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
-   .index = (_offset) | ((_width) << 16) | ((_counter) << 24)  \
+   .index = (_offset) | ((_width) << 16) | ((_counter) << 24), \
+   .attr_id = IB_PMA_PORT_COUNTERS ,   \
 }

-static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
-   char *buf)
+#define PORT_PMA_ATTR_EXT(_name, _width, _offset)  \
+struct port_table_attribute port_pma_attr_ext_##_name = {  \
+   .attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
+   .index = (_offset) | ((_width) << 16),  \
+   .attr_id = IB_PMA_PORT_COUNTERS_EXT ,   \
+}
+
+
+/*
+ * Get a MAD block of data.
+ * Returns error code or the number of bytes retrieved.
+ */
+static int get_mad(struct ib_device *dev, int port_num, int attr,
+   void *data, int offset, size_t size)
 {
-   struct port_table_attribute *tab_attr =
-   container_of(attr, struct port_table_attribute, attr);
-   int offset = tab_attr->index & 0x;
-   int width  = (tab_attr->index >> 16) & 0xff;
-   struct ib_mad *in_mad  = NULL;
-   struct ib_mad *out_mad = NULL;
+   struct ib_mad *in_mad;
+   struct ib_mad *out_mad;
size_t mad_size = sizeof(*out_mad);
u16 out_mad_pkey_index = 0;
ssize_t ret;

-   if (!p->ibdev->process_mad)
-   return sprintf(buf, "N/A (no PMA)\n");
+   if (!dev->process_mad)
+   return -ENOSYS;

in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
@@ -344,12 +355,12 @@ static ssize_t show_pma_counter(struct i
in_mad->mad_hdr.mgmt_class= IB_MGMT_CLASS_PERF_MGMT;
in_mad->mad_hdr.class_version = 1;
in_mad->mad_hdr.method= IB_MGMT_METHOD_GET;
-   in_mad->mad_hdr.attr_id   = cpu_to_be16(0x12); /* PortCounters */
+   in_mad->mad_hdr.attr_id   = attr;

-   in_mad->data[41] = p->port_num; /* PortSelect field */
+   in_mad->data[41] = port_num;/* PortSelect field */

-   if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY,
-p->port_num, NULL, NULL,
+   if ((dev->process_mad(dev, IB_MAD_IGNORE_MKEY,
+port_num, NULL, NULL,
 (const struct ib_mad_hdr *)in_mad, mad_size,
 (struct ib_mad_hdr *)out_mad, _size,
 _mad_pkey_index) &
@@ -358,31 +369,54 @@ static ssize_t show_pma_counter(struct i
ret = -EINVAL;
goto out;
}
+   memcpy(data, out_mad->data + offset, size);
+   ret = size;
+out:
+   kfree(in_mad);
+   kfree(out_mad);
+   return ret;
+}
+
+static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
+   char *buf)
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   int offset = tab_attr->index & 0x;
+   int width  = (tab_attr->index >> 16) & 0xff;
+   ssize_t ret;
+   u8 data[8];
+
+   ret = get_mad(p->ibdev, p->port_num, tab_attr->attr_id, ,
+   40 + offset / 8, sizeof(data));
+   if (ret < 0)
+   return sprintf(buf, "N/A (no PMA)\n");

Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-16 Thread Christoph Lameter
On Tue, 15 Dec 2015, Doug Ledford wrote:

> On 12/15/2015 04:42 PM, Hal Rosenstock wrote:
> > On 12/15/2015 4:20 PM, Jason Gunthorpe wrote:
> >>> The unicast/multicast extended counters are not always supported -
>  depends on setting of PerfMgt ClassPortInfo
>  CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10).
> >
> >> Yes.. certainly this proposed patch needs to account for that and
> >> continue to use the 32 bit ones in that case.
> >
> > There are no 32 bit equivalents of those 4 "IETF" counters ([uni
> > multi]cast [xmit rcv] pkts).
> >
> > When not supported, perhaps it is best not to populate these counters in
> > sysfs so one can discern between counter not supported and 0 value.
> >
> > I'm still working on definitive mthca answer but think the attribute is
> > not supported there. Does anyone out there have an mthca setup where
> > they can try this ?
>
> Yes.

We can return ENOSYS for the counters that are not supported.

Or simply not create the sysfs files when the device is instantiated, and
fall back to the 32 bit counters at instantiation for those devices that do
not support the extended set.





Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-16 Thread Christoph Lameter
On Tue, 15 Dec 2015, Jason Gunthorpe wrote:

> > The unicast/multicast extended counters are not always supported -
> > depends on setting of PerfMgt ClassPortInfo
> > CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10).
>
> Yes.. certainly this proposed patch needs to account for that and
> continue to use the 32 bit ones in that case.

So this is in struct ib_class_port_info the capability_mask? This does not
seem to be used anywhere in the IB core.

Here is a draft patch to change the counters depending on a bit (which I
do not know how to determine). So this would hopefully work if someone
would insert the proper check. Note that this patch no longer needs the
earlier 2 patches.

From: Christoph Lameter <c...@linux.com>
Subject: IB Core: Display extended counter set if available

Check if the extended counters are available and if so
create the proper extended and additional counters.

DRAFT: This is missing the check if this device supports
extended counters.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -39,6 +39,7 @@
 #include 

 #include 
+#include 

 struct ib_port {
struct kobject kobj;
@@ -65,6 +66,7 @@ struct port_table_attribute {
struct port_attribute   attr;
charname[8];
int index;
+   int attr_id;
 };

 static ssize_t port_attr_show(struct kobject *kobj,
@@ -314,7 +316,15 @@ static ssize_t show_port_pkey(struct ib_
 #define PORT_PMA_ATTR(_name, _counter, _width, _offset)                \
 struct port_table_attribute port_pma_attr_##_name = {  \
.attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
-   .index = (_offset) | ((_width) << 16) | ((_counter) << 24)  \
+   .index = (_offset) | ((_width) << 16) | ((_counter) << 24), \
+   .attr_id = IB_PMA_PORT_COUNTERS ,   \
+}
+
+#define PORT_PMA_ATTR_EXT(_name, _width, _offset)  \
+struct port_table_attribute port_pma_attr_ext_##_name = {  \
+   .attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
+   .index = (_offset) | ((_width) << 16),  \
+   .attr_id = IB_PMA_PORT_COUNTERS_EXT ,   \
 }

 static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
@@ -344,7 +354,7 @@ static ssize_t show_pma_counter(struct i
in_mad->mad_hdr.mgmt_class= IB_MGMT_CLASS_PERF_MGMT;
in_mad->mad_hdr.class_version = 1;
in_mad->mad_hdr.method= IB_MGMT_METHOD_GET;
-   in_mad->mad_hdr.attr_id   = cpu_to_be16(0x12); /* PortCounters */
+   in_mad->mad_hdr.attr_id   = tab_attr->attr_id;

in_mad->data[41] = p->port_num; /* PortSelect field */

@@ -375,6 +385,11 @@ static ssize_t show_pma_counter(struct i
ret = sprintf(buf, "%u\n",
  be32_to_cpup((__be32 *)(out_mad->data + 40 + offset / 8)));
break;
+   case 64:
+   ret = sprintf(buf, "%llu\n",
+   be64_to_cpup((__be64 *)(out_mad->data + 40 + offset / 8)));
+   break;
+
default:
ret = 0;
}
@@ -403,6 +418,18 @@ static PORT_PMA_ATTR(port_rcv_data
 static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256);
 static PORT_PMA_ATTR(port_rcv_packets  , 15, 32, 288);

+/*
+ * Counters added by extended set
+ */
+static PORT_PMA_ATTR_EXT(port_xmit_data, 64,  64);
+static PORT_PMA_ATTR_EXT(port_rcv_data , 64, 128);
+static PORT_PMA_ATTR_EXT(port_xmit_packets , 64, 192);
+static PORT_PMA_ATTR_EXT(port_rcv_packets  , 64, 256);
+static PORT_PMA_ATTR_EXT(unicast_xmit_packets  , 64, 320);
+static PORT_PMA_ATTR_EXT(unicast_rcv_packets   , 64, 384);
+static PORT_PMA_ATTR_EXT(multicast_xmit_packets, 64, 448);
+static PORT_PMA_ATTR_EXT(multicast_rcv_packets , 64, 512);
+
 static struct attribute *pma_attrs[] = {
_pma_attr_symbol_error.attr.attr,
_pma_attr_link_error_recovery.attr.attr,
@@ -423,11 +450,40 @@ static struct attribute *pma_attrs[] = {
NULL
 };

+static struct attribute *pma_attrs_ext[] = {
+   _pma_attr_symbol_error.attr.attr,
+   _pma_attr_link_error_recovery.attr.attr,
+   _pma_attr_link_downed.attr.attr,
+   _pma_attr_port_rcv_errors.attr.attr,
+   _pma_attr_port_rcv_remote_physical_errors.attr.attr,
+   _pma_attr_port_rcv_switch_relay_errors.attr.attr,
+   _pma_attr_port_xmit_discards.

Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-15 Thread Christoph Lameter
On Mon, 14 Dec 2015, Hal Rosenstock wrote:

> > Mellanox should really confirm this for their hardware matrix.
>
> I am trying to get definitive answer to this.

I was told today on a conf call with a couple of Mellanox employees that
extended counters are always available.



RE: [PATCH 2/2] ipoib mcast sendonly join: Move multicast specific code out of ipoib_main.c.

2015-12-14 Thread Christoph Lameter
On Mon, 14 Dec 2015, Weiny, Ira wrote:

> > How about
> >
> > ipoib_check_and_add_mcast_sendonly()
>
> Better.

Fixup patch:


Subject: ipoib: Fix up naming of ipoib_check_and_add_mcast_sendonly

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -549,7 +549,7 @@ void ipoib_path_iter_read(struct ipoib_p
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
   union ib_gid *mgid, int set_qkey);
 void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list);
-void ipoib_check_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
+void ipoib_check_and_add_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
struct list_head *remove_list);

 int ipoib_init_qp(struct net_device *dev);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1179,7 +1179,7 @@ static void __ipoib_reap_neigh(struct ip
/* was the neigh idle for two GC periods */
if (time_after(neigh_obsolete, neigh->alive)) {

-   ipoib_check_mcast_sendonly(priv, neigh->daddr + 4, _list);
+   ipoib_check_and_add_mcast_sendonly(priv, neigh->daddr + 4, _list);

rcu_assign_pointer(*np,
   rcu_dereference_protected(neigh->hnext,
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -708,7 +708,7 @@ static int ipoib_mcast_leave(struct net_
  * Check if the multicast group is sendonly. If so remove it from the maps
  * and add to the remove list
  */
-void ipoib_check_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
+void ipoib_check_and_add_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
struct list_head *remove_list)
 {
/* Is this multicast ? */


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-14 Thread Christoph Lameter
On Mon, 14 Dec 2015, Matan Barak wrote:

> > No idea what the counter is doing. Saw another EXT counter implementation
> > use 0 so I thought that was fine.
>
> It seems like a counter index, but I might be wrong though. If it is,
> don't we want to preserve the existing non-EXT schema for the new
> counters too?

I do not see any use of that field so I am not sure what to put in there.
Could it be obsolete?



Re: [PATCH 2/2] ipoib mcast sendonly join: Move multicast specific code out of ipoib_main.c.

2015-12-14 Thread Christoph Lameter
On Fri, 11 Dec 2015, ira.weiny wrote:

> I think I would rather see this called something like
>
> ipoib_add_to_list_sendonly
>
> Or something...
>
> Calling it ipoib_check* sounds like it should return a bool.

Hmm... It only adds the multicast group if the check was successful.

How about

ipoib_check_and_add_mcast_sendonly()


> > +void ipoib_check_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
> > +   struct list_head *remove_list)
> > +{
> > +   /* Is this multicast ? */
> > +   if (*mgid == 0xff) {
>
> Odd to see a mgid variable which is only u8?
>
> How about "gid_prefix"?

That is only used in the qib driver and there it is a field.

mgid is a pointer to the series of bytes of the MGID, and the first byte of
that signifies multicast if it is 0xff.
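
A minimal sketch of that check, for illustration only: the helper names are
hypothetical, and the offset of 4 mirrors the neigh->daddr + 4 used in the
patch, on the assumption that the GID follows 4 leading bytes of the hardware
address.

```c
#include <stdint.h>

#define HWADDR_GID_OFFSET 4	/* assumed: GID starts after 4 leading bytes */

/* Pointer to the GID part of a link-layer hardware address. */
static const uint8_t *hwaddr_gid(const uint8_t *hwaddr)
{
	return hwaddr + HWADDR_GID_OFFSET;
}

/* A GID is treated as multicast when its first byte is 0xff. */
static int gid_is_multicast(const uint8_t *gid)
{
	return gid[0] == 0xff;
}
```

Only the first byte is inspected, which is why a u8 pointer is enough for the
test.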



Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-14 Thread Christoph Lameter
On Mon, 14 Dec 2015, Matan Barak wrote:

> > +static PORT_PMA_ATTR(unicast_rcv_packets   ,  0, 64, 384, IB_PMA_PORT_COUNTERS_EXT);
> > +static PORT_PMA_ATTR(multicast_xmit_packets,  0, 64, 448, IB_PMA_PORT_COUNTERS_EXT);
> > +static PORT_PMA_ATTR(multicast_rcv_packets ,  0, 64, 512, IB_PMA_PORT_COUNTERS_EXT);
> >
>
> Why do we use 0 as the counter argument for all EXT counters?

No idea what the counter is doing. Saw another EXT counter implementation
use 0 so I thought that was fine.


[PATCH 2/2] ipoib mcast sendonly join: Move multicast specific code out of ipoib_main.c.

2015-12-11 Thread Christoph Lameter
Code cleanup to move multicast specific code that checks for
a sendonly join to ipoib_multicast.c. This allows the removal
of the export of __ipoib_mcast_find().

Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  3 ++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  | 13 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 -
 3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 989c409..a13f48c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -549,7 +549,8 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter,
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
 		       union ib_gid *mgid, int set_qkey);
 void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list);
-struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);
+void ipoib_check_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
+				struct list_head *remove_list);
 
 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 483ff20..6b16428 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1150,7 +1150,6 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv *priv)
 	unsigned long flags;
 	int i;
 	LIST_HEAD(remove_list);
-	struct ipoib_mcast *mcast;
 	struct net_device *dev = priv->dev;
 
 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
@@ -1179,18 +1178,8 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv *priv)
 							  lockdep_is_held(&priv->lock))) != NULL) {
 			/* was the neigh idle for two GC periods */
 			if (time_after(neigh_obsolete, neigh->alive)) {
-				u8 *mgid = neigh->daddr + 4;
 
-				/* Is this multicast ? */
-				if (*mgid == 0xff) {
-					mcast = __ipoib_mcast_find(dev, mgid);
-
-					if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
-						list_del(&mcast->list);
-						rb_erase(&mcast->rb_node, &priv->multicast_tree);
-						list_add_tail(&mcast->list, &remove_list);
-					}
-				}
+				ipoib_check_mcast_sendonly(priv, neigh->daddr + 4, &remove_list);
 
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 8acb420a..1158819 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -153,7 +153,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev,
 	return mcast;
 }
 
-struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
+static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct rb_node *n = priv->multicast_tree.rb_node;
@@ -704,6 +704,25 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 	return 0;
 }
 
+/*
+ * Check if the multicast group is sendonly. If so remove it from the maps
+ * and add to the remove list
+ */
+void ipoib_check_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
+				struct list_head *remove_list)
+{
+	/* Is this multicast ? */
+	if (*mgid == 0xff) {
+		struct ipoib_mcast *mcast = __ipoib_mcast_find(priv->dev, mgid);
+
+		if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+			list_del(&mcast->list);
+			rb_erase(&mcast->rb_node, &priv->multicast_tree);
+			list_add_tail(&mcast->list, remove_list);
+		}
+	}
+}
+
 void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list)
 {
 	struct ipoib_mcast *mcast, *tmcast;
-- 
2.5.0




[PATCH 1/2] ipoib mcast sendonly join: Isolate common list remove code

2015-12-11 Thread Christoph Lameter
Code cleanup to remove multicast specific code from ipoib_main.c

The removal of a list of multicast groups occurs in three places.
Create a new function ipoib_mcast_remove_list(). Use this new
function in ipoib_main.c too.
That in turn allows the dropping of two functions that were
exported from ipoib_multicast.c for expiration of mc groups.

Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  3 +--
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  7 ++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 24 ++--
 3 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 3ede103..989c409 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -495,7 +495,6 @@ void ipoib_dev_cleanup(struct net_device *dev);
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
-void ipoib_mcast_free(struct ipoib_mcast *mc);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
@@ -549,7 +548,7 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter,
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
 		       union ib_gid *mgid, int set_qkey);
-int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast);
+void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list);
 struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);
 
 int ipoib_init_qp(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 7d32818..483ff20 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1150,7 +1150,7 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv *priv)
 	unsigned long flags;
 	int i;
 	LIST_HEAD(remove_list);
-	struct ipoib_mcast *mcast, *tmcast;
+	struct ipoib_mcast *mcast;
 	struct net_device *dev = priv->dev;
 
 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
@@ -1207,10 +1207,7 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv *priv)
 
 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
-	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-		ipoib_mcast_leave(dev, mcast);
-		ipoib_mcast_free(mcast);
-	}
+	ipoib_mcast_remove_list(dev, &remove_list);
 }
 
 static void ipoib_reap_neigh(struct work_struct *work)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index f357ca6..8acb420a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -106,7 +106,7 @@ static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv,
 		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 }
 
-void ipoib_mcast_free(struct ipoib_mcast *mcast)
+static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
 	int tx_dropped = 0;
@@ -677,7 +677,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev)
 	return 0;
 }
 
-int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
+static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret = 0;
@@ -704,6 +704,16 @@ int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 	return 0;
 }
 
+void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list)
+{
+	struct ipoib_mcast *mcast, *tmcast;
+
+	list_for_each_entry_safe(mcast, tmcast, remove_list, list) {
+		ipoib_mcast_leave(dev, mcast);
+		ipoib_mcast_free(mcast);
+	}
+}
+
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -810,10 +820,7 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
-	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-		ipoib_mcast_leave(dev, mcast);
-		ipoib_mcast_free(mcast);
-	}
+	ipoib_mcast_remove_list(dev, &remove_list);
 }
 
 static int ipoib_mcast_addr_is_valid(const u8 *addr, const u8 *broadcast)
@@ -939,10 +946,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
-	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-   

[PATCH 0/2] IB multicast cleanup patches

2015-12-11 Thread Christoph Lameter
This patchset cleans up the code a bit after the last round of multicast
patches related to the sendonly join logic. Some of the bits of code
landed in ipoib_main.c instead of ipoib_multicast.c.

- Move the multicast bits into that file so that everything is neatly together
- Reduce the number of functions exported from ipoib_multicast.c



[PATCH 1/3] IB core: Allow specification of attr_id in PORT_PMA_ATTR macro

2015-12-11 Thread Christoph Lameter
This is necessary to support the extended attributes which involves
a different attribute id.

Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/core/sysfs.c | 41 ++---
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b1f37d4..1c8716f 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -39,6 +39,7 @@
 #include 
 
 #include 
+#include 
 
 struct ib_port {
 	struct kobject kobj;
@@ -65,6 +66,7 @@ struct port_table_attribute {
 	struct port_attribute	attr;
 	char			name[8];
 	int			index;
+	int			attr_id;
 };
 
 static ssize_t port_attr_show(struct kobject *kobj,
@@ -311,10 +313,11 @@ static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
 	return sprintf(buf, "0x%04x\n", pkey);
 }
 
-#define PORT_PMA_ATTR(_name, _counter, _width, _offset)			\
+#define PORT_PMA_ATTR(_name, _counter, _width, _offset, _attr_id)	\
 struct port_table_attribute port_pma_attr_##_name = {			\
 	.attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),	\
-	.index = (_offset) | ((_width) << 16) | ((_counter) << 24)	\
+	.index = (_offset) | ((_width) << 16) | ((_counter) << 24),	\
+	.attr_id = _attr_id ,						\
 }
 
 static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
@@ -344,7 +347,7 @@ static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
 	in_mad->mad_hdr.mgmt_class    = IB_MGMT_CLASS_PERF_MGMT;
 	in_mad->mad_hdr.class_version = 1;
 	in_mad->mad_hdr.method        = IB_MGMT_METHOD_GET;
-	in_mad->mad_hdr.attr_id       = cpu_to_be16(0x12); /* PortCounters */
+	in_mad->mad_hdr.attr_id       = tab_attr->attr_id;
 
 	in_mad->data[41] = p->port_num;	/* PortSelect field */
 
@@ -386,22 +389,22 @@ out:
return ret;
 }
 
-static PORT_PMA_ATTR(symbol_error                   ,  0, 16,  32);
-static PORT_PMA_ATTR(link_error_recovery            ,  1,  8,  48);
-static PORT_PMA_ATTR(link_downed                    ,  2,  8,  56);
-static PORT_PMA_ATTR(port_rcv_errors                ,  3, 16,  64);
-static PORT_PMA_ATTR(port_rcv_remote_physical_errors,  4, 16,  80);
-static PORT_PMA_ATTR(port_rcv_switch_relay_errors   ,  5, 16,  96);
-static PORT_PMA_ATTR(port_xmit_discards             ,  6, 16, 112);
-static PORT_PMA_ATTR(port_xmit_constraint_errors    ,  7,  8, 128);
-static PORT_PMA_ATTR(port_rcv_constraint_errors     ,  8,  8, 136);
-static PORT_PMA_ATTR(local_link_integrity_errors    ,  9,  4, 152);
-static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10,  4, 156);
-static PORT_PMA_ATTR(VL15_dropped                   , 11, 16, 176);
-static PORT_PMA_ATTR(port_xmit_data                 , 12, 32, 192);
-static PORT_PMA_ATTR(port_rcv_data                  , 13, 32, 224);
-static PORT_PMA_ATTR(port_xmit_packets              , 14, 32, 256);
-static PORT_PMA_ATTR(port_rcv_packets               , 15, 32, 288);
+static PORT_PMA_ATTR(symbol_error                   ,  0, 16,  32, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(link_error_recovery            ,  1,  8,  48, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(link_downed                    ,  2,  8,  56, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_rcv_errors                ,  3, 16,  64, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_rcv_remote_physical_errors,  4, 16,  80, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_rcv_switch_relay_errors   ,  5, 16,  96, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_xmit_discards             ,  6, 16, 112, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_xmit_constraint_errors    ,  7,  8, 128, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_rcv_constraint_errors     ,  8,  8, 136, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(local_link_integrity_errors    ,  9,  4, 152, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10,  4, 156, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(VL15_dropped                   , 11, 16, 176, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_xmit_data                 , 12, 32, 192, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_rcv_data                  , 13, 32, 224, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_xmit_packets              , 14, 32, 256, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_rcv_packets               , 15, 32, 288, IB_PMA_PORT_COUNTERS);
 
 static struct attribute *pma_attrs[] = {
 	&port_pma_attr_symbol_error.attr.attr,
-- 
2.5.0



[PATCH 0/3] IB 64 bit counter support

2015-12-11 Thread Christoph Lameter
Currently we only use 32 bits for the packet and byte counters. Extended
counters have been available for some time, but we have no support for them
upstream yet. We keep having issues with 32 bit counters wrapping. The byte
counter in particular can wrap frequently (as in multiple times per minute).

This patch adds 4 new counters and updates 4 32 bit counters to use the
64 bit sizes so that they no longer wrap.
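
To put rough numbers on the wrapping problem, here is a minimal sketch. The
increment rate passed in is an assumed input, not a measured figure, and
both helper names are made up for this example. The second helper shows the
usual modulo trick for computing a delta across a single wrap, which stops
working once the counter can wrap more than once between two samples.

```c
#include <stdint.h>

/* Seconds until an unsigned counter of the given width wraps at a constant rate. */
static double wrap_seconds(unsigned int bits, double increments_per_sec)
{
	double span = 1.0;

	for (unsigned int i = 0; i < bits; i++)
		span *= 2.0;	/* span = 2^bits */
	return span / increments_per_sec;
}

/* Difference between two 32 bit samples; only valid across at most one wrap. */
static uint32_t counter_delta32(uint32_t prev, uint32_t now)
{
	return now - prev;	/* unsigned arithmetic is modulo 2^32 */
}
```

At an assumed 10^9 increments per second a 32 bit counter wraps in a bit
over four seconds, while a 64 bit counter at the same rate lasts for
centuries.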




[PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-11 Thread Christoph Lameter
Display the additional 64 bit counters available through the extended
set and replace the existing 32 bit counters if there is a 64 bit
alternative available.

Note: This requires universal support of extended counters in
the devices. If there are still devices around that do not
support extended counters then we will have to add some fallback
technique here.

Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/core/sysfs.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 0083a4f..f7f2954 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -406,10 +406,14 @@ static PORT_PMA_ATTR(port_rcv_constraint_errors     ,  8,  8, 136, IB_PMA_PORT_C
 static PORT_PMA_ATTR(local_link_integrity_errors    ,  9,  4, 152, IB_PMA_PORT_COUNTERS);
 static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10,  4, 156, IB_PMA_PORT_COUNTERS);
 static PORT_PMA_ATTR(VL15_dropped                   , 11, 16, 176, IB_PMA_PORT_COUNTERS);
-static PORT_PMA_ATTR(port_xmit_data                 , 12, 32, 192, IB_PMA_PORT_COUNTERS);
-static PORT_PMA_ATTR(port_rcv_data                  , 13, 32, 224, IB_PMA_PORT_COUNTERS);
-static PORT_PMA_ATTR(port_xmit_packets              , 14, 32, 256, IB_PMA_PORT_COUNTERS);
-static PORT_PMA_ATTR(port_rcv_packets               , 15, 32, 288, IB_PMA_PORT_COUNTERS);
+static PORT_PMA_ATTR(port_xmit_data                 ,  0, 64,  64, IB_PMA_PORT_COUNTERS_EXT);
+static PORT_PMA_ATTR(port_rcv_data                  ,  0, 64, 128, IB_PMA_PORT_COUNTERS_EXT);
+static PORT_PMA_ATTR(port_xmit_packets              ,  0, 64, 192, IB_PMA_PORT_COUNTERS_EXT);
+static PORT_PMA_ATTR(port_rcv_packets               ,  0, 64, 256, IB_PMA_PORT_COUNTERS_EXT);
+static PORT_PMA_ATTR(unicast_xmit_packets           ,  0, 64, 320, IB_PMA_PORT_COUNTERS_EXT);
+static PORT_PMA_ATTR(unicast_rcv_packets            ,  0, 64, 384, IB_PMA_PORT_COUNTERS_EXT);
+static PORT_PMA_ATTR(multicast_xmit_packets         ,  0, 64, 448, IB_PMA_PORT_COUNTERS_EXT);
+static PORT_PMA_ATTR(multicast_rcv_packets          ,  0, 64, 512, IB_PMA_PORT_COUNTERS_EXT);
 
 static struct attribute *pma_attrs[] = {
 	&port_pma_attr_symbol_error.attr.attr,
@@ -428,6 +432,10 @@ static struct attribute *pma_attrs[] = {
 	&port_pma_attr_port_rcv_data.attr.attr,
 	&port_pma_attr_port_xmit_packets.attr.attr,
 	&port_pma_attr_port_rcv_packets.attr.attr,
+	&port_pma_attr_unicast_rcv_packets.attr.attr,
+	&port_pma_attr_unicast_xmit_packets.attr.attr,
+	&port_pma_attr_multicast_rcv_packets.attr.attr,
+	&port_pma_attr_multicast_xmit_packets.attr.attr,
 	NULL
 };
 
-- 
2.5.0




[PATCH 2/3] IB core: Support 64 bit values in the port counters

2015-12-11 Thread Christoph Lameter
Add a branch to display 64 bit values

Signed-off-by: Christoph Lameter <c...@linux.com>
---
 drivers/infiniband/core/sysfs.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 1c8716f..0083a4f 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -378,6 +378,11 @@ static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
 		ret = sprintf(buf, "%u\n",
 			      be32_to_cpup((__be32 *)(out_mad->data + 40 + offset / 8)));
 		break;
+	case 64:
+		ret = sprintf(buf, "%llu\n",
+			be64_to_cpup((__be64 *)(out_mad->data + 40 + offset / 8)));
+		break;
+
 	default:
 		ret = 0;
 	}
-- 
2.5.0
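
For readers outside the kernel, a userspace sketch of what the new 64 bit
branch computes. It assumes the same layout as the patch (counter fields
start 40 bytes into the MAD data and the attribute offset is given in bits);
read_be64() stands in for be64_to_cpup(), and both function names and the
constant are hypothetical.

```c
#include <stdint.h>

#define PMA_DATA_START 40	/* matches "out_mad->data + 40" in the patch */

/* Assemble a big endian 64 bit value byte by byte (no alignment needed). */
static uint64_t read_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Fetch a 64 bit counter located bit_offset bits into the counter area. */
static uint64_t pma_counter64(const uint8_t *mad_data, unsigned int bit_offset)
{
	return read_be64(mad_data + PMA_DATA_START + bit_offset / 8);
}
```

Reading byte by byte avoids any dependence on host endianness, which is the
same job the be64 conversion does in the kernel code.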




Re: stalled again

2015-12-01 Thread Christoph Lameter
On Tue, 1 Dec 2015, Or Gerlitz wrote:

> We're again into this... upstream is on 4.4-rc3 while your latest branch in
> kernel.org (the one that carries the for-next tag) is rebased to 4.3-rc3...

Seems that everything was merged? So you can directly use 4.4-rc1
until he comes up with a for-next for 4.5.

> --> our internal build and review systems for patches to linux-rdma can't make
> any use of your tree. People here have to rebase their work against their own
> clones of Linus tree and can't work with our internal Gerrit rdma-next branch,
> etc, etc.

AFAICT the primary use for the next-trees is testing and merging. You
should be basing your work on 4.4-rc1 and all the patches you are carrying
need to apply to 4.4-rc1 cleanly. Unless you depend on functionality that
was added for the next merge cycle of course. But since there is no tree
yet nothing was added so there is nothing there for you to rely on.

Please do not base patches by default on -next trees unless there is a
good reason for it. Otherwise the code base you build on will change too
frequently: if the maintainer decides to drop a certain patchset, your
patches may no longer apply cleanly.






Re: When does IB Multicast drop?

2015-11-25 Thread Christoph Lameter
On Tue, 24 Nov 2015, Anuj Kalia wrote:

> InfiniBand flow control is done at the link layer, so UD does not drop
> packets due to congestion.

Correct. But multicast packets are dropped at the QP receive level if the
app does not provide enough buffers to accept the data stream. The
buffers can easily be overrun if one does not code carefully, given that
the maximum number of those is 16K or so. These drops occur silently.
Currently there is no accounting for these drops in the upstream kernel.
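
A toy model of that failure mode, to make the mechanism concrete. It does
not use the verbs API at all: the struct and the three functions are
hypothetical, and the "dropped" counter exists only in the model, which is
exactly the accounting that is missing today.

```c
#include <stdint.h>

/* Simplified receive queue: a bounded pool of posted receive buffers. */
struct rx_queue {
	unsigned int depth;	/* maximum number of posted buffers */
	unsigned int posted;	/* buffers currently available */
	uint64_t delivered;	/* packets that found a buffer */
	uint64_t dropped;	/* packets that arrived with no buffer posted */
};

static void rxq_init(struct rx_queue *q, unsigned int depth)
{
	q->depth = depth;
	q->posted = depth;
	q->delivered = 0;
	q->dropped = 0;
}

/* One incoming packet consumes a buffer or is lost. */
static void rxq_packet_arrives(struct rx_queue *q)
{
	if (q->posted > 0) {
		q->posted--;
		q->delivered++;
	} else {
		q->dropped++;
	}
}

/* The application hands back up to n buffers, never exceeding the depth. */
static void rxq_repost(struct rx_queue *q, unsigned int n)
{
	unsigned int room = q->depth - q->posted;

	q->posted += (n < room) ? n : room;
}
```

In the model, a burst longer than the number of posted buffers loses the
excess unless the application reposts in between. The sender sees nothing,
since there is no retransmission request in this scheme.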

> AFAIK, UD only drops packets due to irrecoverable bit errors and
> network device failures. Mellanox's FDR physical layer has BER less
> than 10^(-15), and forward error correction on top of that, so an
> irrecoverable bit error is extremely rare.

Yep. These are extremely rare. We rely on reliable delivery of "unreliable
datagrams" here to avoid having messaging layers that request
retransmission on packet drops.

> If the network topology does not have multipath, (Mellanox) UD will
> not reorder packets to a particular destination sent from the same UD
> QP. There is probably some guarantee in multipath topologies, too.

Correct.


Re: When does IB Multicast drop?

2015-11-25 Thread Christoph Lameter
On Wed, 25 Nov 2015, Peter Chinetti wrote:

> > Correct. But multicast packets are dropped at the QP receive level if the
> > app does not provide enough buffers to accept the data stream. The
> > buffers can easily be overrun if one does not code carefully given that
> > the maximum number of those is 16K or so. These drops occur silently.
> > Currently there is no accounting for these drops in the upstream kernel.

> How about when one of the destinations for the multicast group has its
> connection to the switch overloaded (because it is subscribing to many
> multicast groups whose combined bandwidth is momentarily greater than the
> bandwidth of the link to the switch). Are the messages destined for that
> endpoint dropped at the switch, or is traffic to the entire multicast group
> delayed?

The entire traffic to the multicast group will be delayed.




Re: [PATCH libmlx5 0/7] Completion timestamping

2015-11-15 Thread Christoph Lameter
On Sun, 15 Nov 2015, Matan Barak wrote:

> This series adds support for completion timestamp. In order to
> support this feature, several extended verbs were implemented
> (as instructed in libibverbs).

This is the portion that implements timestamping for libmlx5, and this
patchset depends on another one that needs to be merged into libibverbs.

Right?



Re: [PATCH v2 for-next 1/7] IB/core: Extend ib_uverbs_create_qp

2015-10-21 Thread Christoph Lameter
On Wed, 21 Oct 2015, Or Gerlitz wrote:

> Again, the kernel stack consuming rate here was < 1 bit a year when
> averaging over time since this was introduced. So we should be doing
> well for the coming ~10-20 years with this 32 bit field, and we can
> easily extend it later, I verified that with Haggai, so yes, don't
> want 64 bits now.

Hmmm... We are running out of dev_cap flags on mlx4 already. They are 32
bits.



[PATCH] ipoib mcast sendonly join: Isolate common list remove code

2015-10-16 Thread Christoph Lameter
From: Christoph Lameter <c...@linux.com>
Subject: ipoib mcast sendonly join: Isolate common list remove code

Code cleanup to remove multicast specific code from ipoib_main.c

The removal of a list of multicast groups occurs in three places.
Create a new function ipoib_mcast_remove_list(). Use this new
function in ipoib_main.c too.
That in turn allows the dropping of two functions that were
exported from ipoib_multicast.c for expiration of mc groups.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -495,7 +495,6 @@ void ipoib_dev_cleanup(struct net_device
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
-void ipoib_mcast_free(struct ipoib_mcast *mc);

 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
@@ -549,7 +548,7 @@ void ipoib_path_iter_read(struct ipoib_p

 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
   union ib_gid *mgid, int set_qkey);
-int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast);
+void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list);
 struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);

 int ipoib_init_qp(struct net_device *dev);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1150,7 +1150,7 @@ static void __ipoib_reap_neigh(struct ip
 	unsigned long flags;
 	int i;
 	LIST_HEAD(remove_list);
-	struct ipoib_mcast *mcast, *tmcast;
+	struct ipoib_mcast *mcast;
 	struct net_device *dev = priv->dev;
 
 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
@@ -1207,10 +1207,7 @@ static void __ipoib_reap_neigh(struct ip
 
 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
-	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-		ipoib_mcast_leave(dev, mcast);
-		ipoib_mcast_free(mcast);
-	}
+	ipoib_mcast_remove_list(dev, &remove_list);
 }

 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -106,7 +106,7 @@ static void __ipoib_mcast_schedule_join_
 		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 }
 
-void ipoib_mcast_free(struct ipoib_mcast *mcast)
+static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
 	int tx_dropped = 0;
@@ -677,7 +677,7 @@ int ipoib_mcast_stop_thread(struct net_d
 	return 0;
 }
 
-int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
+static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret = 0;
@@ -704,6 +704,16 @@ int ipoib_mcast_leave(struct net_device
 	return 0;
 }
 
+void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list)
+{
+	struct ipoib_mcast *mcast, *tmcast;
+
+	list_for_each_entry_safe(mcast, tmcast, remove_list, list) {
+		ipoib_mcast_leave(dev, mcast);
+		ipoib_mcast_free(mcast);
+	}
+}
+
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -810,10 +820,7 @@ void ipoib_mcast_dev_flush(struct net_de
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
-	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-		ipoib_mcast_leave(dev, mcast);
-		ipoib_mcast_free(mcast);
-	}
+	ipoib_mcast_remove_list(dev, &remove_list);
 }
 
 static int ipoib_mcast_addr_is_valid(const u8 *addr, const u8 *broadcast)
@@ -939,10 +946,7 @@ void ipoib_mcast_restart_task(struct wor
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
-	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-		ipoib_mcast_leave(mcast->dev, mcast);
-		ipoib_mcast_free(mcast);
-	}
+	ipoib_mcast_remove_list(mcast->dev, &remove_list);
 
 	/*
 	 * Double check that we are still up

[PATCH] ipoib mcast sendonly join: Move multicast specific code out of ipoib_main.c.

2015-10-16 Thread Christoph Lameter
From: Christoph Lameter <c...@linux.com>
Subject: ipoib mcast sendonly join: Move multicast specific code out of ipoib_main.c.

Code cleanup to move multicast specific code that checks for
a sendonly join to ipoib_multicast.c. This allows the removal
of the export of __ipoib_mcast_find().

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1150,7 +1150,6 @@ static void __ipoib_reap_neigh(struct ip
 	unsigned long flags;
 	int i;
 	LIST_HEAD(remove_list);
-	struct ipoib_mcast *mcast;
 	struct net_device *dev = priv->dev;
 
 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
@@ -1179,18 +1178,8 @@ static void __ipoib_reap_neigh(struct ip
 							  lockdep_is_held(&priv->lock))) != NULL) {
 			/* was the neigh idle for two GC periods */
 			if (time_after(neigh_obsolete, neigh->alive)) {
-				u8 *mgid = neigh->daddr + 4;
 
-				/* Is this multicast ? */
-				if (*mgid == 0xff) {
-					mcast = __ipoib_mcast_find(dev, mgid);
-
-					if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
-						list_del(&mcast->list);
-						rb_erase(&mcast->rb_node, &priv->multicast_tree);
-						list_add_tail(&mcast->list, &remove_list);
-					}
-				}
+				ipoib_check_mcast_sendonly(priv, neigh->daddr + 4, &remove_list);
 
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -549,7 +549,8 @@ void ipoib_path_iter_read(struct ipoib_p
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
   union ib_gid *mgid, int set_qkey);
 void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list);
-struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);
+void ipoib_check_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
+				struct list_head *remove_list);

 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -153,7 +153,7 @@ static struct ipoib_mcast *ipoib_mcast_a
 	return mcast;
 }
 
-struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
+static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct rb_node *n = priv->multicast_tree.rb_node;
@@ -704,6 +704,25 @@ static int ipoib_mcast_leave(struct net_
 	return 0;
 }
 
+/*
+ * Check if the multicast group is sendonly. If so remove it from the maps
+ * and add to the remove list
+ */
+void ipoib_check_mcast_sendonly(struct ipoib_dev_priv *priv, u8 *mgid,
+				struct list_head *remove_list)
+{
+	/* Is this multicast ? */
+	if (*mgid == 0xff) {
+		struct ipoib_mcast *mcast = __ipoib_mcast_find(priv->dev, mgid);
+
+		if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+			list_del(&mcast->list);
+			rb_erase(&mcast->rb_node, &priv->multicast_tree);
+			list_add_tail(&mcast->list, remove_list);
+		}
+	}
+}
+
 void ipoib_mcast_remove_list(struct net_device *dev, struct list_head *remove_list)
 {
 	struct ipoib_mcast *mcast, *tmcast;


Re: [PATCH v2 for-next 0/7] Add support for multicast loopback prevention to mlx4

2015-10-15 Thread Christoph Lameter
On Thu, 15 Oct 2015, eran ben elisha wrote:

> I rebased the series due to a small conflict in
> drivers/net/ethernet/mellanox/mlx4/fw.c

The git trees also need rebasing on github.com.


Re: [PATCH v1 for-next 0/7] Add support for multicast loopback prevention to mlx4

2015-10-14 Thread Christoph Lameter
On Wed, 26 Aug 2015, eran ben elisha wrote:

> > Do you have this in a git tree somewhere for testing?
>
> Yes,
> please pull from https://github.com/eranbenelisha/linux branch
> rebased-for-4.3/lb_prev
>
> If you wish to get the user space as well for testing, please use:
> https://github.com/eranbenelisha/libibverbs branch for-linux-rdma-lb_prev
> https://github.com/eranbenelisha/libmlx4 branch for-linux-rdma-lb_prev

This needs to be rebased. Cannot pull on top of current linus tree.



Re: Seeing WARN_ON in ib_dealloc_pd from ipoib in kernel 4.3-rc1-debug

2015-10-11 Thread Christoph Lameter
On Sun, 11 Oct 2015, Sagi Grimberg wrote:

> Is someone looking at this? It really should be fixed before 4.3
> final...

The following fixup patch is needed:



Subject: ipoib: For sendonly join free the multicast group on leave

When we leave the multicast group on expiration of a neighbor we
do not free the mcast structure. This results in a memory leak.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -495,6 +495,7 @@ void ipoib_dev_cleanup(struct net_device
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
+void ipoib_mcast_free(struct ipoib_mcast *mc);

 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1207,8 +1207,10 @@ static void __ipoib_reap_neigh(struct ip

 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
-	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(dev, mcast);
+		ipoib_mcast_free(mcast);
+	}
 }

 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -106,7 +106,7 @@ static void __ipoib_mcast_schedule_join_
 	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 }

-static void ipoib_mcast_free(struct ipoib_mcast *mcast)
+void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
struct net_device *dev = mcast->dev;
int tx_dropped = 0;
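
The pattern this fixup completes is: unlink the sendonly groups while holding the lock, then leave and free each one after the lock is dropped. Below is a simplified user-space sketch of that two-step pattern using a plain singly linked list. The struct and function names are hypothetical, and locking is only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ipoib_mcast. */
struct group {
	struct group *next;
	int sendonly;
};

/* Step 1 (would run under the lock): unlink sendonly groups onto a private list. */
struct group *detach_sendonly(struct group **head)
{
	struct group *removed = NULL;
	struct group **pp = head;

	assert(head != NULL);
	while (*pp) {
		struct group *g = *pp;

		if (g->sendonly) {
			*pp = g->next;		/* unlink from the live list */
			g->next = removed;	/* push onto the private list */
			removed = g;
		} else {
			pp = &g->next;
		}
	}
	return removed;
}

/* Step 2 (after unlock): leave each group, then free it. Returns the count freed. */
unsigned int leave_and_free(struct group *removed)
{
	unsigned int n = 0;

	while (removed) {
		/* Save the successor first, as list_for_each_entry_safe does. */
		struct group *next = removed->next;

		/* The driver would call ipoib_mcast_leave() here. */
		free(removed);	/* the step whose absence caused the leak */
		removed = next;
		n++;
	}
	return n;
}
```

Without the free in step 2 the groups are unreachable from both lists, which is the leak the commit message describes.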


Re: Mellanox CQE compression

2015-10-09 Thread Christoph Lameter
On Fri, 9 Oct 2015, Anuj Kalia wrote:

> I am pretty excited about the CQE compression feature introduced in
> Mellanox OFED 3.1. Is this feature supported for ConnectX-3 or
> Connect-IB cards?

The other obvious question to ask is when can we expect this feature to be
upstream?



Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-28 Thread Christoph Lameter
Ok I refactored the whole thing to make it less invasive and keep more
functionality in ipoib_multicast.c. Since you are working on it it would
be best for you to have the newest version. I split this into two patches:
One preparatory and one that implements the actual logic.
Both attached. The patch that implements the join is inline here:


Subject: ipoib multicast: Expire MC groups when the address expires

Upon address expiration do the proper thing to also expire the
sendonly multicast group.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-28 11:57:16.291764857 -0500
@@ -548,6 +548,8 @@

 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
   union ib_gid *mgid, int set_qkey);
+void ipoib_mcast_remove_mc_list(struct net_device *dev, struct list_head *list);
+void ipoib_mcast_detach_sendonly(struct ipoib_dev_priv *priv, u8 *mgid, struct list_head *list);

 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-28 11:56:59.775764388 -0500
@@ -1149,6 +1149,8 @@
unsigned long dt;
unsigned long flags;
int i;
+   LIST_HEAD(remove_list);
+   struct net_device *dev = priv->dev;

 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
return;
@@ -1176,6 +1178,9 @@
 							  lockdep_is_held(&priv->lock))) != NULL) {
 			/* was the neigh idle for two GC periods */
 			if (time_after(neigh_obsolete, neigh->alive)) {
+
+				ipoib_mcast_detach_sendonly(priv, neigh->daddr + 4, &remove_list);
+
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
 									     lockdep_is_held(&priv->lock)));
@@ -1191,6 +1196,7 @@

 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
+	ipoib_mcast_remove_mc_list(dev, &remove_list);
 }

 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:56:59.775764388 -0500
@@ -800,6 +800,23 @@
}
 }

+/*
+ * Check if this is a sendonly multicast group. If so remove it from the list and put it
+ * onto the given list for final removal.
+ */
+void ipoib_mcast_detach_sendonly(struct ipoib_dev_priv *priv, u8 *mgid, struct list_head *remove_list)
+{
+	struct ipoib_mcast *mcast;
+
+	/* Is this multicast ? */
+	if (mcast_auto_create && *mgid == 0xff) {
+		mcast = __ipoib_mcast_find(priv->dev, mgid);
+
+		if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+			ipoib_detach_mc_group(priv, mcast, remove_list);
+	}
+}
+
 void ipoib_mcast_dev_flush(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);

From: Christoph Lameter <c...@linux.com>
Subject: ipoib multicast: Extract two functions from ipoib_mcast_flush

We need these two functions later to do the implicit leave when the
address handle expires so refactor the code.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:20:05.387701463 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:20:53.819702839 -0500
@@ -775,6 +775,31 @@
 	spin_unlock_irqrestore(&priv->lock, flags);
 }
 
+/*
+ * Detach a multicast group from the device's multicast tree and move it
+ * to a list for future removal
+ */
+static void ipoib_detach_mc_group(struct ipoib_dev_priv *priv,
+		struct ipoib_mcast *mcast, struct list_head *remove_list)
+{
+	list_del(&mcast->list);
+	rb_erase(&mcast->rb_node, &priv->multicast_tree);
+	list_add_tail(&mcast->list, remove_list);
+}
+
+/*
+ * Remove a list of multicast groups that has been detached and free them
+ */
+void ipoib_mcast_remove_mc_list(struct net_device *dev

Re: libmlx4 and libmlx5 git trees? Who is handling those?

2015-09-28 Thread Christoph Lameter
On Mon, 28 Sep 2015, Jason Gunthorpe wrote:

> Should we combine the user side of the kapi 'core' stack (libverbs,
> all open source providers, libumad, libcm) into one source
> package? Many projects have been working in that model lately with
> some success, IMHO.

Yes please.

> Right now we even have the situation where some providers won't build
> with some verbs's, so it isn't even really the case they are actually
> independent.

Right. It's really nasty when you are trying to add features that require
libibverbs and libmlx? changes. Plus it may depend on kernel changes.




Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-28 Thread Christoph Lameter
On Mon, 28 Sep 2015, Doug Ledford wrote:

> No, I was referring to using this on top of your patch and my other two
> patches, which change the ipoib driver to create sendonly groups and
> then expire them when the neighbor expires.

Ok under which conditions could the joining be deferred and packets be
sent to broadcast?



Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-28 Thread Christoph Lameter
On Mon, 28 Sep 2015, Doug Ledford wrote:

> > We would like to keep
> > irrelevant traffic off the fabric as much as possible. And a reception
> > event that requires traffic to be thrown out will cause jitter in the
> > processing of inbound traffic that we also would like to avoid.
>
> That may not be optimal for your app, but we also need to try and
> maintain proper emulation of typical IP/Ethernet behavior since this is
> IPoIB after all.  That's why the app isn't required to join the group
> before sending, and also why it should be able to expect that we will
> fall back to sending via broadcast if needed.

Ok this needs to work with the existing ethernet gateways and be verified to
work with them.

> However, the following algorithm might be suitable here:
>
> On first packet:
>   create mcast group
>   queue packet to group
>   schedule join
>
> On subsequent packets:
>   find mcast group
>   check mcast state
> if already joined, send immediately
> if joining, queue packet to mcast queue
> if join is deferred, send via bcast

Hmmm... If the multicast group does not exist in the SM then we could only
bcast to all routers instead? No host in the fabric could then be
listening; the only possible listeners are outside the fabric.
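
The algorithm quoted above can be sketched as a small state machine. This is only an illustration under stated assumptions: the state names, the struct and both functions are hypothetical and do not correspond to the flags or helpers in the ipoib driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical join states; not the driver's actual flag bits. */
enum mc_join_state { MC_JOINING, MC_JOINED, MC_JOIN_DEFERRED };

/* What the send path does with one packet. */
enum mc_tx_action { TX_QUEUE, TX_SEND_MCAST, TX_SEND_BCAST };

struct mc_group {
	enum mc_join_state state;
	unsigned int queued;	/* packets waiting for the join to complete */
};

/* First packet: create the group, queue the packet, schedule the join. */
enum mc_tx_action mc_first_packet(struct mc_group *grp)
{
	assert(grp != NULL);
	grp->state = MC_JOINING;	/* stands in for "schedule join" */
	grp->queued = 1;
	return TX_QUEUE;
}

/* Subsequent packets: the group exists, so act on its join state. */
enum mc_tx_action mc_next_packet(struct mc_group *grp)
{
	assert(grp != NULL);
	switch (grp->state) {
	case MC_JOINED:
		return TX_SEND_MCAST;	/* already joined: send immediately */
	case MC_JOINING:
		grp->queued++;		/* join in flight: queue the packet */
		return TX_QUEUE;
	case MC_JOIN_DEFERRED:
		break;			/* join deferred: fall back to bcast */
	}
	return TX_SEND_BCAST;
}
```

The sketch makes the cost raised earlier in the thread visible: the state has to be checked on every packet once sends are no longer always queued.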



Re: libmlx4 and libmlx5 git trees? Who is handling those?

2015-09-28 Thread Christoph Lameter
On Mon, 28 Sep 2015, Doug Ledford wrote:

> On 09/28/2015 11:42 AM, Christoph Lameter wrote:
> > Where are these trees and who is maintaining them? I see that there is a
> > libibverbs on kernel.org that is updated by Doug.
> >
> > There are some mlx4/5 trees around but those have no recent commits. There
> > is a libmlx4 on kernel.org as well but the last merge there was by Roland
> > in May 2014.
> >
>
> git://git.openfabrics.org/~yishaih/libmlx4.git

Ahhh... There is active development here.

> git://git.openfabrics.org/~eli/libmlx5.git

More than 6 months of no activity.



Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-28 Thread Christoph Lameter
On Mon, 28 Sep 2015, Or Gerlitz wrote:

> Personally, up to few weeks ago, I was under the misimpression that
> not only IPoIB joins as full member also on the sendonly flow, but
> also that such group can be actually opened under that flow, and it
> turns out they don't. Later you said that your production environment
> was running a very old non upstream stack that had a knob to somehow

I said we run OFED 1.5.X on older systems. That is not custom.

> make it work and as of that didn't realize that something goes wrong
> for years w.r.t a gateway functionality with upstream/inbox code, so
> we all screwed up here over a time period which is a few orders of
> magnitude longer than a holiday duration.

We have been migrating to a RH7 native stack over the last months and
in a mixed environment the systems running OFED will create the MC groups
so the issue was hidden. We have talked about this migration a couple of
times even face to face. ???





Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-27 Thread Christoph Lameter

On Sat, 26 Sep 2015, Or Gerlitz wrote:

> It's possible that this was done for a reason, so

> sounds good, so taking into account that Erez is away till Oct 6th, we
> can probably pick your patch and later, if Erez proves us that there's
> deep problem there, revert it and take his.

Ok but if Erez does not have the time to participate in code development
and follow up on the patch as issues arise then I would rather rework the
code so that it is easily understandable and I will continue to follow up
on the issues with the code as they develop. This seems to be much more
important to my company than Mellanox.



Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-27 Thread Christoph Lameter
On Sun, 27 Sep 2015, Doug Ledford wrote:

> Currently I'm testing your patch with a couple other patches.  I dropped
> the patch of mine that added a module option, and added two different
> patches.  However, I'm still waffling on this patch somewhat.  In the
> discussions that Jason and I had, I pretty much decided that I would
> like to see all send-only multicast sends be sent immediately with no
> backlog queue.  That means that if we had to start a send-only join, or
> if we started one and it hasn't completed yet, we would send the packet
> immediately via the broadcast group versus queueing.  Doing so might
> trip this new code up.

If we send immediately then we would need to check on each packet if the
multicast creation has been completed?

Also broadcast could cause an unnecessary reception event on the NICs of
machines that have no interest in this traffic. We would like to keep
irrelevant traffic off the fabric as much as possible. And a reception
event that requires traffic to be thrown out will cause jitter in the
processing of inbound traffic that we also would like to avoid.


Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-25 Thread Christoph Lameter
On Fri, 25 Sep 2015, Or Gerlitz wrote:

> On Thu, Sep 24, 2015 at 8:00 PM, Christoph Lameter <c...@linux.com> wrote:
> > Ok here is the fixed up and tested V2 of the patch. Can this go in with
> > Doug's  patch?
>
>
> Repeating myself... do you find some over complexity in Erez's
> implementation? what's the rationale for not using his patch and yes
> using yours? Erez and Co were very busy with some internal deadlines
> and he's now OOO (it's a high Holiday season now) - will be able to
> review your patch once he's back (Oct 6, I believe). It seems that the
> patch does the job, but there are locking/contexts and such to
> consider here, so I can't just ack it, have you passed it through
> testing?

Yes the patch introduces a new callback and creates workqueues that
recheck conditions etc etc.

Makes it difficult to review and potentially creates new race conditions.
I'd rather have a straightforward solution.



Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-25 Thread Christoph Lameter
And yes this went through testing here and we want to run this as part of
our prod kernels.



Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications

2015-09-17 Thread Christoph Lameter
Could we simplify it a bit? This compiles but avoids all the
generalizations and workqueues. Had to export two new functions from
ipoib_multicast.c though.



Subject: ipoib: Expire sendonly multicast joins on neighbor expiration

Add mcast_leave functionality to __ipoib_reap_neighbor.

Based on Erez's work.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-09 13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-17 09:34:03.169844055 -0500
@@ -1149,6 +1149,8 @@ static void __ipoib_reap_neigh(struct ip
unsigned long dt;
unsigned long flags;
int i;
+   LIST_HEAD(remove_list);
+   struct ipoib_mcast *mcast, *tmcast;

 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
return;
@@ -1176,6 +1178,18 @@ static void __ipoib_reap_neigh(struct ip
 							  lockdep_is_held(&priv->lock))) != NULL) {
 			/* was the neigh idle for two GC periods */
 			if (time_after(neigh_obsolete, neigh->alive)) {
+
+				/* Is this multicast ? */
+				if (neigh->daddr[4] == 0xff) {
+					mcast = __ipoib_mcast_find(priv->dev, neigh->daddr + 4);
+
+					if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+						list_del(&mcast->list);
+						rb_erase(&mcast->rb_node, &priv->multicast_tree);
+						list_add_tail(&mcast->list, &remove_list);
+					}
+				}
+
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
 									     lockdep_is_held(&priv->lock)));
@@ -1191,6 +1205,8 @@ static void __ipoib_reap_neigh(struct ip

 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		ipoib_mcast_leave(priv->dev, mcast);
 }

 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-09 13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-17 09:36:17.342455845 -0500
@@ -548,6 +548,8 @@ void ipoib_path_iter_read(struct ipoib_p

 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
   union ib_gid *mgid, int set_qkey);
+int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast);
+struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);

 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-09 13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-17 09:36:55.305497262 -0500
@@ -153,7 +153,7 @@ static struct ipoib_mcast *ipoib_mcast_a
return mcast;
 }

-static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
+struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct rb_node *n = priv->multicast_tree.rb_node;
@@ -675,7 +675,7 @@ int ipoib_mcast_stop_thread(struct net_d
return 0;
 }

-static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
+int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
int ret = 0;


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-16 Thread Christoph Lameter
On Wed, 16 Sep 2015, Doug Ledford wrote:

> > Abusing it for send-side is probably the wrong
> > direction overall.
>
> I wouldn't "abuse" it for such, I would suggest adding a proper notion
> of send-only registrations.

That is really not necessary for IP traffic. There is no need to track
these since multicast can be sent without subscriptions. So I guess that
there will not be much support on netdev for such an approach.


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-16 Thread Christoph Lameter
On Wed, 16 Sep 2015, Doug Ledford wrote:

> On 09/15/2015 07:53 PM, Christoph Lameter wrote:
> > On Tue, 15 Sep 2015, Jason Gunthorpe wrote:
> >
> >> The mcast list in the core is solely for listing subscriptions for
> >> inbound - ie receive. Abusing it for send-side is probably the wrong
> >> direction overall.
> >
> > Ok then a simple approach would be to port timeout logic from
> > OFED-1.5.X.
>
> It's the simple fix, but not the right fix.  I would prefer to find the
> right fix for upstream.

We would have to track which sockets have sent sendonly multicast
traffic. Some sort of a refcount on the sendonly multicast group that
gets decremented when the socket is closed down. We need some sort of
custom callback during socket shutdown.

The IPoIB layer is not a protocol otherwise we would have a shutdown
callback to work with.

Hmmm... For the UDP protocol the shutdown function is not populated in the
protocol methods. There is an encap_destroy() that is called on
udp_destroy_sock(). We could add another check in udp_destroy_sock()
that does a callback for IPoIB. That could then release the refcount.

Question then is how do we know which socket has done a sendonly join to
which multicast groups? We cannot use the regular multicast list for a
socket. So add another list?
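
A rough sketch of the refcount idea described above, assuming one counter per sendonly group that is bumped when a socket first sends to the group and dropped when that socket closes. All names are hypothetical, and the open question of how to find the groups a given socket has used is not addressed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group bookkeeping; not a structure from the driver. */
struct sendonly_group {
	unsigned int refs;	/* sockets that have sent to this group */
	bool joined;		/* whether a sendonly join is outstanding */
};

/* Called when a socket sends to the group for the first time. */
void sendonly_get(struct sendonly_group *g)
{
	assert(g != NULL);
	if (g->refs++ == 0)
		g->joined = true;	/* first user triggers the join */
}

/* Called from the socket shutdown callback. Returns true if the group was left. */
bool sendonly_put(struct sendonly_group *g)
{
	assert(g != NULL && g->refs > 0);
	if (--g->refs == 0) {
		g->joined = false;	/* last user gone: leave the group */
		return true;
	}
	return false;
}
```

In the kernel this would need atomic or locked counters; the sketch leaves concurrency out.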


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-16 Thread Christoph Lameter
Another approach may be to tie the unsub from sendonly multicast joins to
the expiration of the layer 2 addresses in IPoIB. F.e. add code to
 __ipoib_reap_ah() to detect if the handle was used for a sendonly
multicast join. If so unsubscribe from the MC group. This will result in
behavior consistent with address resolution and caching on IPoIB.
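
Expiration of this kind comes down to a wraparound-safe comparison of tick counters, which the kernel expresses with time_after(). A stand-alone sketch follows, assuming an entry counts as stale once it has been idle for two GC periods. The function names and the gc_period parameter are illustrative only.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * True if tick value a is later than b, even across counter wraparound.
 * Relies on the usual two's complement conversion of the unsigned
 * difference to long.
 */
bool ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* True if the entry was last alive more than two GC periods before now. */
bool entry_is_stale(unsigned long now, unsigned long alive,
		    unsigned long gc_period)
{
	unsigned long obsolete = now - 2 * gc_period;

	assert(gc_period > 0);
	return ticks_after(obsolete, alive);
}
```

A sendonly group tied to such an entry would then be left at the same moment the entry is reaped.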



Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-16 Thread Christoph Lameter
On Wed, 16 Sep 2015, Or Gerlitz wrote:

> On Wed, Sep 16, 2015 at 7:31 PM, Christoph Lameter <c...@linux.com> wrote:
> > Another approach may be to tie the unsub from sendonly multicast joins to
> > the expiration of the layer 2 addresses in IPoIB. F.e. add code to
> >  __ipoib_reap_ah() to detect if the handle was used for a sendonly
> > multicast join. If so unsubscribe from the MC group. This will result in
> > behavior consistent with address resolution and caching on IPoIB.
>
> yep, Erez has the patches to do so.

Would you please share them?



Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-16 Thread Christoph Lameter
On Wed, 16 Sep 2015, Or Gerlitz wrote:

> Could you please post here a few (say 2-4) liner summary of what is
> still missing or done wrong in 4.3-rc1 and what is your suggestion how
> to resolve that.

With Doug's patch here the only thing that is left to be done is to
properly leave the multicast group. And it seems that Erez's patch does just that.

And then there are the 20 other things that I have pending with Mellanox
but those are different issues that do not belong here. This one is a
critical bug for us.



Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-15 Thread Christoph Lameter
On Tue, 15 Sep 2015, Jason Gunthorpe wrote:

> The mcast list in the core is solely for listing subscriptions for
> inbound - ie receive. Abusing it for send-side is probably the wrong
> direction overall.

Ok then a simple approach would be to port timeout logic from
OFED-1.5.X.



Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-15 Thread Christoph Lameter
On Tue, 15 Sep 2015, Doug Ledford wrote:

>
> I actually think this is a step in the right direction, but I think we
> are talking about a layering violation doing it the way you are
> suggesting.  What's more, there are a few difficulties here in that I'm

The function is used in various lower layer processing functions
throughout the network stack. As long as we still have a task context and
a socket accessible this should be fine.


> fairly certain the core networking layer doesn't have the concept of a
> send-only join, yet we would need it to have such.  If we had three apps

The networking layer supports a setsockopt that can set IP_MULTICAST_LOOP
behavior. The API is there to properly control this from user space
and thus it's possible to join sendonly from user space. Actually we would
need that for the proper implementation of sendonly joins at the fabric
level I would think. I can work on making this work properly with the IP
stack.
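
For reference, a minimal user-space sketch of the IP_MULTICAST_LOOP socket option mentioned above. The helper name set_mcast_loop is made up for illustration. The option itself only controls whether multicast datagrams sent on the socket are looped back to local receivers; mapping that onto a sendonly join at the fabric level is the proposal being discussed, not something this code does.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Set IP_MULTICAST_LOOP on an IPv4 socket and read the value back.
 * Returns the resulting setting (0 or 1), or -1 on error.
 */
int set_mcast_loop(int fd, int enable)
{
	int val = enable ? 1 : 0;
	socklen_t len = sizeof(val);

	assert(fd >= 0);
	if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &val, sizeof(val)) < 0)
		return -1;
	if (getsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &val, &len) < 0)
		return -1;
	return val;
}
```

A sender that never wants to see its own traffic would call set_mcast_loop(fd, 0) on its UDP socket before sending.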

> do send only sends, we would need to track all three of those sockets as
> being send only socket joins, and then if someone did a full join, we
> would need to automatically upgrade the join, and then if that app with
> the full join left, we would need to gracefully downgrade to send only
> again.  So, I think it would take work at the core level to get this
> right, but I think that's probably the right place to do the work.

Well the IP stack already has to deal with this. See mc_loop option in
the socket field as well as the sk_mc_loop() function in the network
stack.
>


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-14 Thread Christoph Lameter
On Fri, 11 Sep 2015, Doug Ledford wrote:

> > At a minimum, when the socket that did the send closes the send-only
> > could be de-refed..
>
> If we kept a ref count, but we don't.  Tracking this is not a small change.

We could call ip_mc_join_group() from ipoib_mcast_send() which would join
it at the socket layer. That layer would do the tracking for us and leave the
group when the process terminates. The join would be visible the same way
as if one would have done an explicit setsockopt().



Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Christoph Lameter
On Thu, 10 Sep 2015, Doug Ledford wrote:

> +  * 1) ifdown/ifup
> +  * 2) a regular mcast join/leave happens and we run
> +  *ipoib_mcast_restart_task
> +  * 3) a REREGISTER event comes in from the SM
> +  * 4) any other event that might cause a mcast flush

Could we have a timeout and leave the multicast group on process exit?
The old code base did that with the ipoib_mcast_leave_task() function.

With that timeout we no longer accumulate MC sendonly subscriptions
for long running systems.

Also IPOIB_MAX_MCAST_QUEUE's default of 3 is not really enough to capture
a burst of traffic sent to a multicast group. Can we make this
configurable or increase the max?
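
To illustrate why a limit of 3 is tight, here is a small sketch of a bounded pending-packet queue that drops once the limit is reached. The limit is passed as a parameter to show what a configurable maximum would look like. The struct and function names are hypothetical and only the counting is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters for packets held while a join is in progress. */
struct pending_queue {
	unsigned int len;	/* packets currently queued */
	unsigned int dropped;	/* packets discarded because the queue was full */
};

/* Try to queue one packet; returns false and counts a drop if the queue is full. */
bool pending_enqueue(struct pending_queue *q, unsigned int limit)
{
	assert(q != NULL);
	if (q->len >= limit) {
		q->dropped++;
		return false;
	}
	q->len++;
	return true;
}
```

With a limit of 3, a burst of ten packets sent before the join completes keeps three and drops seven.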



Re: [PATCH for-next 0/7] Add support for multicast loopback prevention to mlx4

2015-09-04 Thread Christoph Lameter


We ran this through our tests and this works both for Infiniband and
Ethernet.

Tested-by: Christoph Lameter <c...@linux.com>



Re: [PATCH 1/2] IB/ipoib: Clean up send-only multicast joins

2015-09-03 Thread Christoph Lameter
On Fri, 21 Aug 2015, Jason Gunthorpe wrote:

> Even though we don't expect the group to be created by the SM we
> still need to provide all the parameters to force the SM to validate
> they are correct.

Just ran into this issue with Redhat 7.1. Earlier code base. Same
problem. qkey etc not set. OFED-1.5.4.1 did set these parameters
correctly and it was a surprise that the kernel IB stack never had these
fixes.

The way sendonly joins are handled now bothers me a bit. The join
is delayed? How can that work? We are sending the multicast packets before
the join is complete? This means the multicast packets are going to be
dropped since there is no multicast subscription.



Re: [PATCH for-next 0/2] IB/{core,mlx4_ib}: RX/TX checksum offload

2015-09-02 Thread Christoph Lameter
On Wed, 5 Aug 2015, Amir Vadai wrote:

> This will be used by a revised version of the IP checksum patches [1], that
> will be sent later on.

Ok when can we get the full set for testing? Seems that the libibverbs and
libmlx4 portions are missing?


Re: [PATCH for-next V3 7/8] IB/mlx4: Add mmap call to map the hardware clock

2015-08-27 Thread Christoph Lameter
Could you please post an updated patch that reflects the current state in
Matan's tree?




Re: [PATCH for-next V3 1/8] IB/core: Change provider's API of create_cq to be extendible

2015-08-27 Thread Christoph Lameter
Ok we tested this patchset with Matan's timestamp-v2 branches from his repo
on github and the timestamps now work fine.

Can we please get the user space library bits into libibverbs and libmlx4?



Re: [PATCH v1 for-next 0/7] Add support for multicast loopback prevention to mlx4

2015-08-25 Thread Christoph Lameter
On Thu, 20 Aug 2015, Eran Ben Elisha wrote:

> This patch-set adds a new implementation for multicast loopback prevention
> for mlx4 driver.  The current implementation is very limited, especially if
> link layer is Ethernet. The new implementation is based on HW feature of
> dropping incoming multicast packets if the sender QP counter index is equal
> to the receiver counter index.

Do you have this in a git tree somewhere for testing?



Re: [PATCH] IB/hfi1: Remove some sysfs files

2015-08-07 Thread Christoph Lameter
On Fri, 7 Aug 2015, Mike Marciniszyn wrote:

>  {
> @@ -599,25 +581,21 @@ static ssize_t show_tempsense(struct device *device,
>  /* start of per-unit file structures and support code */
>  static DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
>  static DEVICE_ATTR(board_id, S_IRUGO, show_hfi, NULL);
> -static DEVICE_ATTR(version, S_IRUGO, show_version, NULL);
>  static DEVICE_ATTR(nctxts, S_IRUGO, show_nctxts, NULL);
>  static DEVICE_ATTR(nfreectxts, S_IRUGO, show_nfreectxts, NULL);
>  static DEVICE_ATTR(serial, S_IRUGO, show_serial, NULL);
>  static DEVICE_ATTR(boardversion, S_IRUGO, show_boardversion, NULL);
>  static DEVICE_ATTR(tempsense, S_IRUGO, show_tempsense, NULL);
> -static DEVICE_ATTR(localbus_info, S_IRUGO, show_localbus_info, NULL);
>  static DEVICE_ATTR(chip_reset, S_IWUSR, NULL, store_chip_reset);

AFAICT the remaining ones are also provided by generic APIs (aside from
nctxts and nfreectxts). I really want our management apps for devices etc.
not to crap out.

Could you get some experienced engineers to look at the driver
internally at Intel before publishing? There are numerous other drivers in
the kernel by Intel that do the right thing.

That this is duplicated, along with the other things, shows issues with
kernel basics. The driver in its entirety probably does not meet the quality
that we are used to from other developers at Intel. It's likely that
structural changes need to be made to the driver. Significant portions may
be duplicating functionality that the kernel already provides as generic
functionality. Also we have already established that there is
significant duplication of other drivers in the infiniband tree.

Should this not go into staging instead? If this is merged then the push
to clean these issues up goes away like it did with the earlier
incarnations of this hardware.



Re: [PATCH 0/2] update ocrdma to dual license

2015-07-31 Thread Christoph Lameter
On Wed, 8 Jul 2015, Christoph Hellwig wrote:

> So how about someone tells OFED to stop trying to enforce this BS?
>
> This just confirms my bias that Open-Fabrics Alliance are a bunch of
> idiots making life hard, similar to all their horrible OFED driver
> distributions that created a total mess for everyone involved.

There are a number of commits already that change the license. Let's
revert these:

commit b8f5595eb96c9fce1c907d13e89581e5061edf2e
Author: Devesh Sharma <devesh.sha...@avagotech.com>
Date:   Fri Jul 24 05:04:00 2015 +0530

    RDMA/ocrdma: update ocrdma module license string

    Change module_license from GPL to Dual BSD/GPL

    Cc: Tejun Heo <t...@kernel.org>
    Cc: Duan Jiong <duanj.f...@cn.fujitsu.com>
    Cc: Roland Dreier <rol...@purestorage.com>
    Cc: Jes Sorensen <jes.soren...@redhat.com>
    Cc: Sasha Levin <levinsasha...@gmail.com>
    Cc: Dan Carpenter <dan.carpen...@oracle.com>
    Cc: Prarit Bhargava <pra...@redhat.com>
    Cc: Colin Ian King <colin.k...@canonical.com>
    Cc: Wei Yongjun <yongjun_...@trendmicro.com.cn>
    Cc: Moni Shoua <mo...@mellanox.com>
    Cc: Rasmus Villemoes <li...@rasmusvillemoes.dk>
    Cc: Li RongQing <roy.qing...@gmail.com>
    Cc: Devendra Naga <devendra.a...@gmail.com>
    Signed-off-by: Devesh Sharma <devesh.sha...@avagotech.com>
    Signed-off-by: Doug Ledford <dledf...@redhat.com>

commit 71ee67306ecbdfc0c94ed93c77ff99d29e961d69
Author: Devesh Sharma <devesh.sha...@avagotech.com>
Date:   Fri Jul 24 05:03:59 2015 +0530

    RDMA/ocrdma: update ocrdma license to dual-license

    Change of license from GPLv2 to dual-license (GPLv2 and BSD 2-Clause)

    All contributors were contacted off-list and permission to make this
    change was received.  The complete list of contributors are Cc:ed here.

    Cc: Tejun Heo <t...@kernel.org>
    Cc: Duan Jiong <duanj.f...@cn.fujitsu.com>
    Cc: Roland Dreier <rol...@purestorage.com>
    Cc: Jes Sorensen <jes.soren...@redhat.com>
    Cc: Sasha Levin <levinsasha...@gmail.com>
    Cc: Dan Carpenter <dan.carpen...@oracle.com>
    Cc: Prarit Bhargava <pra...@redhat.com>
    Cc: Colin Ian King <colin.k...@canonical.com>
    Cc: Wei Yongjun <yongjun_...@trendmicro.com.cn>
    Cc: Moni Shoua <mo...@mellanox.com>
    Cc: Rasmus Villemoes <li...@rasmusvillemoes.dk>
    Cc: Li RongQing <roy.qing...@gmail.com>
    Cc: Devendra Naga <devendra.a...@gmail.com>
    Signed-off-by: Devesh Sharma <devesh.sha...@avagotech.com>
    Signed-off-by: Doug Ledford <dledf...@redhat.com>



Re: [PATCH 0/2] update ocrdma to dual license

2015-07-31 Thread Christoph Lameter
On Wed, 8 Jul 2015, Christoph Hellwig wrote:

> On Wed, Jul 08, 2015 at 03:33:03PM -0400, Doug Ledford wrote:
> > I am not a lawyer, but this has been explained to me on numerous
> > occasions, so I relay the layman's interpretation here:
> >
> > No, you don't always need everyone's approval.  There are contributions
> > that are not legally copyright worthy.
>
> There are.  But for an open source project trying to deal with slippery
> slope is not worth it.  Just get an ACK from everyone to be on the safe
> side and show that you act in good faith.

Note that there are numerous contributions in the IB subsystem from folks
not in the OFA. Those certainly have the expectation that their work was
under the GPLv2 and not BSD.






RE: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-31 Thread Christoph Lameter
On Fri, 31 Jul 2015, Marciniszyn, Mike wrote:
> > Please get someone knowledgeable at Intel to look at this. There is a
> > (rev 04) when using lspci on my nic here. This seems to be the hardware
> > revision.
>
> I'm not sure what else to say other than the source for the sysfs file
> display is not from the PCI_REVISION_ID.

No, it's from the PCI config block. So it's not needed in the sysfs display
since lspci can get to it that way.

if (c = get_conf_byte(d, PCI_REVISION_ID))
  printf(" (rev %02x)", c);

> Note that a zero value suppresses the print, which is the case right now.

Well yeah that is the initial release. You would increment it when the
next rev comes out.
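
The lspci fragment quoted in this message reads a single byte of PCI configuration space. As a rough illustration only, the same lookup can be sketched in Python, assuming a Linux system where the per-device sysfs `config` file is readable. The helper names and the device address in the docstring are made up for this sketch; the only fixed fact used is that the revision byte sits at config-space offset 0x08 (PCI_REVISION_ID).

```python
from pathlib import Path

PCI_REVISION_ID = 0x08  # standard offset of the revision byte in PCI config space


def revision_from_config(config: bytes) -> int:
    """Return the revision byte from a raw PCI config-space dump."""
    if len(config) <= PCI_REVISION_ID:
        raise ValueError("config space dump is too short")
    return config[PCI_REVISION_ID]


def format_rev(rev: int) -> str:
    """Mimic the lspci fragment: print nothing when the revision is zero."""
    return f" (rev {rev:02x})" if rev else ""


def read_revision(bdf: str) -> int:
    """Read the revision of a device such as '0000:00:19.0' via sysfs (Linux only)."""
    config = Path("/sys/bus/pci/devices") / bdf / "config"
    return revision_from_config(config.read_bytes())


if __name__ == "__main__":
    # Synthetic 64-byte header with revision 0x04, so no hardware is needed.
    fake = bytearray(64)
    fake[PCI_REVISION_ID] = 0x04
    print(format_rev(revision_from_config(bytes(fake))))
```

A revision of zero yields an empty string here, which matches the point made in the thread that an initial hardware release shows no "(rev ..)" suffix at all.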



RE: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-31 Thread Christoph Lameter
On Fri, 31 Jul 2015, Marciniszyn, Mike wrote:

> > > +HFI1
> > > +
> > > +  The hfi1 driver also creates these additional files:
> > > +
> > > +   hw_rev - hardware revision
> > >
> > I'm checking on this to see if it is indeed a duplicate.
> >
>
> Our hardware architect has indicated there is no PCIe equivalent for this
> case.

Please get someone knowledgeable at Intel to look at this. There is
a (rev 04) when using lspci on my nic here. This seems to be the hardware
revision.

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM
(rev 04)

lspci.c does the following to get that info. See where
PCI_REVISION_ID is used:

static void
show_terse(struct device *d)
{
  int c;
  struct pci_dev *p = d->dev;
  char classbuf[128], devbuf[128];

  show_slot_name(d);
  printf(" %s: %s",
         pci_lookup_name(pacc, classbuf, sizeof(classbuf),
                         PCI_LOOKUP_CLASS,
                         p->device_class),
         pci_lookup_name(pacc, devbuf, sizeof(devbuf),
                         PCI_LOOKUP_VENDOR | PCI_LOOKUP_DEVICE,
                         p->vendor_id, p->device_id));
  if (c = get_conf_byte(d, PCI_REVISION_ID))
    printf(" (rev %02x)", c);


Re: [PATCH 0/2] update ocrdma to dual license

2015-07-31 Thread Christoph Lameter
On Fri, 31 Jul 2015, Doug Ledford wrote:

> > Everyone on that Cc: list (and I note in particular that your name is
> > *not* on that list) has been contacted and gave permission to
> > Avagotech/Emulex to go ahead and change the copyright on the code.  As
> > such, it is their *right* to make that change if they see fit.  There
> > will be no revert, period.
>
> Also, just as a general rule, don't *EVER* come to me trying to assert
> copyright control on code you haven't even donated one line of effort to.

I have not asserted any copyright on the particular files.

But I have extensively contributed to core kernel code for 20 years with
the understanding that the license for the kernel code as a whole is under
the GPL and that others will contribute like I did under the GPL. It
certainly is a surprise to me that someone can change the license of parts
of the kernel to allow non-GPL licensing. Never seen that before.

I will assert that the modifications of the IB stack that I have
contributed over the years (mostly in passes over the kernel to
change functions globally) are under GPL only. In this case you are lucky
that I never touched those files.


RE: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-31 Thread Christoph Lameter
On Fri, 31 Jul 2015, Marciniszyn, Mike wrote:

> > And lspci as well as other tools will not be able to distinguish between
> > different versions of the hardware.
>
> Sorry if I misled.
>
> We do fully support the PCI revision number, and that will be set
> differently for different hardware versions.

Ah great.

> We have additional versioning information that we convey using chip
> registers (not PCI config registers), and the driver brings these values
> out for interpretation by our own tools. The hardware design does not
> support VPD.

Oww... Our IT folks will be really mad about this. Their inventory and
provisioning systems will not be able to work properly.




Re: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-31 Thread Christoph Lameter
On Fri, 31 Jul 2015, Jason Gunthorpe wrote:

> If that wasn't done, the HW can't be changed, we are stuck with
> wonky sysfs files.. Try and get it right next time <shrug>

And lspci as well as other tools will not be able to distinguish between
different versions of the hardware.



Re: [PATCH v4 01/50] IB: Add CNP opcode enumeration.

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Mike Marciniszyn wrote:

> This patch adds the value of the CNP opcode to the existing list of enumerated
> opcodes.

That is obvious and useless. Patches should have a meaningful
description and justify the changes.

Why do you add the CNP opcode and what in the world does it do? CNP is
what? And why do the other enum values not work for you?



RE: [PATCH v4 01/50] IB: Add CNP opcode enumeration.

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Marciniszyn, Mike wrote:

> > That is obvious and useless. Patches should have a meaningful description
> > and justify the changes.
> >
>
> The driver uses the CNP opcode for congestion control.

And that requires a new transport protocol???

> > Why do you add the CNP opcode and what in the world does it do? CNP is
> > what? And why do the other enum values not work for you?
>
> The driver supports congestion control in software vs. outboard
> firmware, so the opcode should be available in the appropriate kernel
> include file.

So is CNP an operation or a protocol?



Re: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Mike Marciniszyn wrote:

> +HFI1
> +
> +  The hfi1 driver also creates these additional files:
> +
> +   hw_rev - hardware revision
> +   board_id - manufacturing board id
> +   version - driver version
> +   tempsense - thermal sense information
> +   serial - board serial number
> +   nfreectxts - number of free user contexts
> +   nctxts - number of allowed contexts (PSM2)
> +   localbus_info - PCIe info
> +   chip_reset - diagnostic (root only)
> +   boardversion - board version

Aren't these already provided by the PCIe driver framework? Tools will not
work if you do not put the information out there in a way that they can be
scanned.

For example, the following output of lspci -vv shows a revision, and the
board_id is also usually available. The kernel driver version is also there
via the driver/module directory etc. Please integrate properly into the
kernel device driver infrastructure and do not create useless new entries.

lspci -vv

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 04)
	Subsystem: Fujitsu Technology Solutions Device 11ed
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 28
	Region 0: Memory at f7c0 (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at f7c3d000 (32-bit, non-prefetchable) [size=4K]
	Region 2: I/O ports at f080 [size=32]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: fee00378  Data: 
	Capabilities: [e0] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: e1000e

ls -l /sys/devices/pci\:00/:00:19.0/driver/module/
total 0
-r--r--r-- 1 root root 4096 Jul 30 14:59 coresize
drwxr-xr-x 2 root root    0 Jul 30 15:47 drivers
drwxr-xr-x 2 root root    0 Jul 30 14:59 holders
-r--r--r-- 1 root root 4096 Jul 30 15:47 initsize
-r--r--r-- 1 root root 4096 Jul 30 14:59 initstate
drwxr-xr-x 2 root root    0 Jul 30 15:47 notes
drwxr-xr-x 2 root root    0 Jul 30 15:47 parameters
-r--r--r-- 1 root root 4096 Jul 30 14:59 refcnt
drwxr-xr-x 2 root root    0 Jul 30 15:47 sections
-r--r--r-- 1 root root 4096 Jul 30 15:47 srcversion
-r--r--r-- 1 root root 4096 Jul 30 15:47 taint
--w------- 1 root root 4096 Jul 30 14:59 uevent
-r--r--r-- 1 root root 4096 Jul 30 15:49 version




Re: [PATCH v4 00/50] Add OPA gen1 driver

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Mike Marciniszyn wrote:

> As a verbs driver the device functions as an InfiniBand device and
> supports the standard features of the IBTA specification v1.3 with
> the exceptions noted below.

Hmmm... So OPA networks and IB networks (Truescale?) will be able to
interoperate?

> The public information can be reviewed at:
>
> http://www.intel.com/content/www/us/en/omni-path/omni-path-fabric-overview.html

That is very helpful although I have to guess what the various marketing
terms mean. Is there more detail on NICs and switch specifications
available?


Re: [TECH TOPIC] IRQ affinity

2015-07-15 Thread Christoph Lameter
On Wed, 15 Jul 2015, Christoph Hellwig wrote:

> Many years ago we decided to move setting of IRQ to core affinities to
> userspace with the irqbalance daemon.
>
> These days we have systems with lots of MSI-X vectors, and we have
> hardware and subsystem support for per-CPU I/O queues in the block
> layer, the RDMA subsystem and probably the network stack (I'm not too
> familiar with the recent developments there).  It would really help the
> out of the box performance and experience if we could allow such
> subsystems to bind interrupt vectors to the node that the queue is
> configured on.
>
> I'd like to discuss if the rationale for moving the IRQ affinity setting
> fully to userspace is still correct in today's world and any pitfalls
> we'll have to learn from in irqbalanced and the old in-kernel affinity
> code.

Configuration with processors that are trying to be OS noise free
(NOHZ) would also benefit if device interrupts would be directed to
processors that are not in the NOHZ set. Currently we use scripts on
bootup that redirect interrupts away from these.
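
The boot-time scripts mentioned here are not shown in the thread. As a hedged sketch of what such a script has to compute, the following builds the hex mask that Linux accepts in `/proc/irq/<n>/smp_affinity`, leaving out a set of CPUs that should stay noise free. The function names, the CPU count and the excluded set are all assumptions made for illustration.

```python
from typing import Iterable


def affinity_mask(total_cpus: int, excluded: Iterable[int]) -> str:
    """Hex mask covering CPUs 0..total_cpus-1 except those in `excluded`."""
    skip = set(excluded)
    mask = 0
    for cpu in range(total_cpus):
        if cpu not in skip:
            mask |= 1 << cpu
    if mask == 0:
        raise ValueError("at least one CPU must remain available for interrupts")
    # The kernel format is comma-separated 32-bit groups, most significant first.
    groups = []
    while mask:
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
    return ",".join(reversed(groups)).lstrip("0")


def set_irq_affinity(irq: int, mask: str) -> None:
    """Write the mask to procfs. Needs root, so the demo below does not call it."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(mask)


if __name__ == "__main__":
    # Hypothetical 8-CPU box where CPUs 2 and 3 are kept free of device interrupts.
    print(affinity_mask(8, {2, 3}))
```

Some interrupts cannot be moved from userspace at all, so a real script would also have to tolerate write errors on individual IRQs.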



RE: [PATCH 00/41] Add OPA gen1 driver

2015-07-09 Thread Christoph Lameter
On Wed, 8 Jul 2015, Marciniszyn, Mike wrote:

> > Are there any user space tools to control/exercise the driver and the
> > protocol stack?
>
> There are diagnostic tools that are Intel specific that have device driver
> hooks.

Ok, I hope these things are going to be merged at some point into the
ibnet diagnostic tools?



Re: [PATCH REPOST libibverbs] Add IP and TCP/UDP TX checksum offload support

2015-07-01 Thread Christoph Lameter
Is there any release schedule and/or upstream repo where I can see changes
for libibverbs and libmlx4?



Re: [PATCH REPOST libibverbs] Add IP and TCP/UDP TX checksum offload support

2015-06-18 Thread Christoph Lameter
We run those patches and we would like to see them upstream.



Re: [PATCH v2 47/49] IB/hfi1: add multicast routines

2015-06-15 Thread Christoph Lameter
Ummm... This looks eerily similar to qib_verbs_mcast.c. Sed job on the
file? Is there any way to get a description as to what the differences are
between qib and hfi?

Can you just use the same file?



Re: [PATCH 00/41] Add OPA gen1 driver

2015-06-15 Thread Christoph Lameter
Ummm.. Could we get some more descriptions as to what this code is for?

Do we have a new OmniPath protocol here as well or is it IB? Which
standards are followed?

I think the APIs that the driver uses need to be documented somewhere in
particular if new sysfs entries etc are created.

Are there any user space tools to control/exercise the driver and the
protocol stack?



Re: [PATCH 38/41] IB/hfi1: add general verbs handling

2015-06-12 Thread Christoph Lameter
On Thu, 11 Jun 2015, Mike Marciniszyn wrote:

> +static int query_device(struct ib_device *ibdev,
> +			struct ib_device_attr *props)
> +{
> +	struct hfi1_devdata *dd = dd_from_ibdev(ibdev);
> +	struct hfi1_ibdev *dev = to_idev(ibdev);
> +
> +	memset(props, 0, sizeof(*props));
> +
> +	props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR |
> +		IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT |
> +		IB_DEVICE_SYS_IMAGE_GUID | IB_DEVICE_RC_RNR_NAK_GEN |
> +		IB_DEVICE_PORT_ACTIVE_EVENT | IB_DEVICE_SRQ_RESIZE;

Hmmm... One thing that we need here is:

IB_DEVICE_BLOCK_MULTICAST_LOOPBACK

to avoid the flow back of large multicast streams. Looks like this is
IB/OP only. If you support ethernet then other flags are required too.

> +	props->page_size_cap = PAGE_SIZE;

No large page support?



RE: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-11 Thread Christoph Lameter
On Wed, 10 Jun 2015, Hefty, Sean wrote:

> > There are multiple problems with libfabric related to the use cases in my
> > area. Most of all the lack of multicast support. Then there is the build
> > up of software bloat on top. The interest here is in low latency
> > operations. Rendezvous and other new features are really not wanted if
> > they increase the latency.
>
> Multicast is only supported by one vendor that has taken a hostile position
> against libfabric.  Support for multicast will eventually be there, but it's
> definitely not a priority for me.  As an open source project, anyone is
> welcome to propose patches.

Intel is supporting multicast in hardware. It's just a bad implementation
(broadcast and filtering MC groups in the HCA or what was that?) and there
is no plan to fix the issues despite the problem being known for quite
some time. Also does this mean that libfabric is only to support the
features needed by Intel?

> For native providers, libfabric will reduce latency.  That's a provider
> implementation issue, and native providers will be available soon.  The OFIWG
> selected to have a working set of interfaces that applications can begin
> using immediately, versus waiting until there were a large set of native
> providers.

I would be interested to see some measurements. AFAICT the Intel solutions
are based on historically inferior IB technology from Qlogic which has
never been able in my lab tests to compete latency wise with other
vendors. I have heard these latency claims repeatedly from Qlogic
personnel over the years.

> IMO, this is exactly the problem.  The entire design is being driven by
> the implementation.  That produces an unmaintainable API and fractures the
> software ecosystem, which is exactly where we are today.

This is a well designed solution and it's easy to use.

It would help libfabric if you would work with other vendors and
industries to include support for their needs. MPI applications are not
the only ones running on the fabrics. I understand that is
historically the only area in which Qlogic hardware was able to compete
but I think you need to move beyond that. APIs should be as general as
possible, abstracting hardware as much as possible. A viable libfabric
needs to be easy to use and low overhead, as well as covering the
requirements of multiple vendors and use cases.



RE: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-09 Thread Christoph Lameter
On Mon, 8 Jun 2015, Hefty, Sean wrote:

> You're assuming that the only start time of interest is when a send
> operation has been posted.  Jason asked what I would do with libfabric.
> That interface supports triggered operations.  It has also been designed
> such that a rendezvous (that has to be one of the most difficult words in
> the English language to spell correctly, even with spell check) protocol
> could be implemented by the provider.  On the receive side, it may be of
> interest to report the start and ending time for larger transfers,
> primarily for debugging purposes.

There are multiple problems with libfabric related to the use cases in my
area. Most of all the lack of multicast support. Then there is the build
up of software bloat on top. The interest here is in low latency
operations. Rendezvous and other new features are really not wanted if
they increase the latency.

> I have no idea how the time stamps are expected to be used, so why limit
> it?  An app could just as easily create their own time stamp when reading a
> work completion, especially when the data is going into an anonymous
> receive buffer.  That would seem to work for your use case.

No it cannot, as described earlier. The work can be completed much earlier
than when the polling thread gets around to checking for it. We do that today
since there is nothing better but this means that there is a gap there.
On the send side you have no easy way of telling when the operation was
complete without the timestamp.

> I have no problem with a bare metal interface exposing this.  But
> pretending that it's generic and that this is the one and only way that
> this could be implemented doesn't make it so.

This is a way it was implemented and it's usable. Shooting for pie in the
sky does not bring us anything. Nor do ideas of requirements from a new
experimental API that does not support the basic features that we need
and seems to be on its way to mess up the latencies of access to RDMA
operations.




Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-06 Thread Christoph Lameter
On Wed, 3 Jun 2015, Jason Gunthorpe wrote:

> On Wed, Jun 03, 2015 at 07:55:58PM -0500, Christoph Lameter wrote:
>
> > I think the raw cycles and the rough oscillator speed are fine.
>
> Time keeping is designed to adjust for 100's of ppm drift between
> clocks.

What time keeping? NTP? PTP is supposed to be accurate to 10s of ns and
we would need an accuracy in that range.

> A communications clock source will be spec'd to be below 200ppm in
> accuracy. IB clocks are below 100 ppm, and PCI-E is 300ppm (approx, I
> didn't check, order of magnitude is close)

Well that is not usable. ns are a billionth of a second which is the unit
of measurement of these activities here. A send action can be around 600-1000ns.
If we are off by 200ppm then that is 200 microseconds per second, meaning
200000 ns. And it's our experience that these clocks can be off by
milliseconds in practice.

> That translates into 0.0625 Hz. for a 312.5 MHz ethernet reference clock

Ok that is around 3ns per cycle? And you think the accuracy is therefore
in femtoseconds? I have never seen something that accurate. Wish something
like that would exist. Maybe in some labs that provide the source of
global timekeeping?

> Compared to 5,000,000 Hz in error from rounding.

Huh?

> So no, I disagree that rough is fine for anything.

I am sorry but the practical issues that we are dealing with in
timekeeping today show just the opposite. For a true comparison of clocks
with nanosecond accuracy you would need time corrected values and that is
a challenge due to the variances of the clocks that we see.





RE: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-06 Thread Christoph Lameter
On Thu, 4 Jun 2015, Hefty, Sean wrote:

> If I were adding timestamps, I would probably define a new completion
> structure with 2 u64 time stamp fields (start and end times), and figure
> out when start occurred, end occurred, and the timing metric later.  :)

Not sure why you would need the start. The app knows when it submitted a
send request and incoming packets can be readily timed with taps if
necessary. If you want the start on inbound packets then you have the
challenge that the adapter needs to figure out when the first bit of the
message actually arrived and the timestamp information needs to be pushed
through all the way through the pipeline. Completion is easily done.

> I would assume that these are non-wrapping values.

It's fine what we have now as far as I can tell.

I am not sure why it is necessary to make this more complicated than it is
now. We need a simple means to obtain the completion time and that is what
the current implementation provides. There is even another vendor
(Chelsio) who has a similar implementation.



Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-06 Thread Christoph Lameter
On Sat, 6 Jun 2015, Doug Ledford wrote:

> The ppm rating is based upon the speed of the clock, not time.  It's how
> many cycles of variance you are allowed from the target speed given in
> cycles / millions of cycles of the target clock frequency.  If you have
> a 312.5MHz clock, and your accuracy is specified as 100ppm, then the
> total clock variability is 312.5 * 100 = 31250 cycles (I suspect that
> this is an absolute variance, and so the tolerance would be +-1/2 of the
> total amount, but I don't know that for certain).

Ok well then you also have the problem that the clock may be off in
general already by a certain factor from the true speed of the flow of
time due to manufacturing variances etc. We are only talking about the
instability of the clock source while operating it seems?
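
To make the ppm figures in this exchange concrete, here is a small arithmetic sketch. It only restates the numbers quoted in the thread (a 312.5 MHz clock, 100 and 200 ppm); the function names are invented for the example.

```python
def ppm_to_hz(nominal_hz: float, ppm: float) -> float:
    """Frequency deviation in Hz allowed by a ppm rating."""
    return nominal_hz * ppm / 1_000_000


def drift_ns(interval_ns: float, ppm: float) -> float:
    """Worst-case time error accumulated over an interval at a given ppm offset."""
    return interval_ns * ppm / 1_000_000


if __name__ == "__main__":
    # 312.5 MHz at 100 ppm: the 31250 cycles per second quoted above.
    print(ppm_to_hz(312.5e6, 100))
    # 200 ppm over one second of free running versus over a single 1000 ns send.
    print(drift_ns(1e9, 200), drift_ns(1000, 200))
```

The same ppm rating therefore means very different absolute errors depending on how long the clock runs uncorrected: 200 microseconds over a full second, but a fraction of a nanosecond over one send operation.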

  I am sorry but the practical issues that we are dealing with in
  timekeeping today shows just the opposite. For a true comparison of clocks
  with nanosecond accuracy you would need time corrected values and that is
  a challenge due to the variances of the clocks that we see.

> Jason's point, and one that isn't addressed yet, is that this might not
> be variance in the clocks and instead might be a design flaw in the API
> you are using and the way the clock speeds are passed to user space.
> Changing from int MHz to int KHz might solve your problem.

That sounds doable. Maybe we need to look at how clock speeds are
specified elsewhere?

man adjtimex

gives some ways that this is done in the general API for clock adjustment.

Or maybe better look at IEEE 1588 for ways to specify the clock
characteristics?

http://www.nist.gov/el/isd/ieee/ieee1588.cfm
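
The int MHz versus int KHz point quoted above can be illustrated with a
short sketch. It assumes the 312.5 MHz clock mentioned elsewhere in this
thread and a hypothetical interface that reports the frequency as a
truncated integer; the helper names are invented for illustration and are
not part of any existing API.

```python
def cycles_to_ns_mhz(cycles, freq_mhz):
    """Convert raw cycles to ns using a frequency given in integer MHz."""
    return cycles * 1_000 // freq_mhz

def cycles_to_ns_khz(cycles, freq_khz):
    """Convert raw cycles to ns using a frequency given in integer kHz."""
    return cycles * 1_000_000 // freq_khz

true_hz = 312_500_000                   # assumed clock: 312.5 MHz
reported_mhz = true_hz // 1_000_000     # truncates to 312
reported_khz = true_hz // 1_000         # 312500, nothing lost

cycles = true_hz                        # exactly one second of cycles
ns_mhz = cycles_to_ns_mhz(cycles, reported_mhz)
ns_khz = cycles_to_ns_khz(cycles, reported_khz)

print("error with int MHz:", ns_mhz - 1_000_000_000, "ns per second")
print("error with int kHz:", ns_khz - 1_000_000_000, "ns per second")
```

With these assumed numbers the integer MHz value alone introduces an error
of roughly 1.6 ms per second (about 1600 ppm), which is much larger than a
100 ppm oscillator tolerance, while the kHz value is exact for this clock.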



RE: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-06-03 Thread Christoph Lameter
On Mon, 1 Jun 2015, Hefty, Sean wrote:

> > We want to have a time stamp when the action is complete and the data is
> > available to the application or the send action is complete and the CQ
> > entry can be reused.

> This is what polling the completion from the CQ tells you, independent of
> there being a time stamp.

But you may not be polling that frequently. Polling threads may check
multiple sources of events and may also be busy executing code to handle
an event. There is also the problem of the OS interrupting you. All of
these sources of inaccuracy are removed by the timestamp.

That was for inbound. For outbound you do not get a timestamp without this
feature. Typically the reclaim of outbound work requests is delayed quite a
bit, and a timestamp taken later does not reflect the actual time the
message was sent.



Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-03 Thread Christoph Lameter
On Wed, 3 Jun 2015, Jason Gunthorpe wrote:

> MHz is fine *for mlx hardware* but someone else's hardware that uses,
> say 312.5 MHz (ie the ethernet symbol clock) is NOT OK because MHz
> loses too much precision.

Oscillators vary in frequency. In order to convert accurately to ns, the
drift due to temperature etc. needs to be taken into consideration. The
ns value there is pretty rough as well. Accurate time may need time
software to continually monitor the *actual* frequency of the oscillator.
I think the raw cycles and the rough oscillator speed are fine.
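
A rough sketch of what such monitoring could look like, assuming user space
can take paired readings of the raw cycle counter and a trusted system clock.
The function names and the sample values are hypothetical and only
illustrate the idea.

```python
def estimate_freq_hz(sample_a, sample_b):
    """Estimate the actual oscillator frequency from two paired readings.

    Each sample is (raw_cycles, system_time_ns), read as close together
    as possible. Assumes the raw counter did not wrap between samples.
    """
    d_cycles = sample_b[0] - sample_a[0]
    d_ns = sample_b[1] - sample_a[1]
    if d_ns <= 0:
        raise ValueError("samples must be in increasing time order")
    return d_cycles * 1_000_000_000 / d_ns

def cycles_to_ns(d_cycles, freq_hz):
    """Convert a raw cycle delta to ns with the given frequency."""
    return d_cycles * 1_000_000_000 / freq_hz

# Invented readings: a nominal 312.5 MHz oscillator running 50 ppm fast,
# sampled 10 seconds apart.
nominal_hz = 312_500_000
first = (0, 0)
second = (3_125_156_250, 10_000_000_000)

actual_hz = estimate_freq_hz(first, second)
one_second = 312_515_625    # cycles this oscillator emits per real second

print("estimated frequency:", actual_hz, "Hz")
print("ns using nominal speed:", cycles_to_ns(one_second, nominal_hz))
print("ns using measured speed:", cycles_to_ns(one_second, actual_hz))
```

Repeating the measurement periodically tracks temperature drift. The raw
cycles plus a rough nominal speed are enough input for this kind of
correction to be done in software.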


