Re: Unable to establish rdma connection, breaks rdma basic functionality

2016-01-06 Thread Matan Barak
On Wed, Jan 6, 2016 at 6:43 AM, Hariprasad S <haripra...@chelsio.com> wrote:
>
> Hi Doug,
>
> I am trying to run an rping server, but it fails when bound to any address other 
> than IF_ANY.
> # rping -s -a 102.1.1.129 -C1 -p  -vd
> created cm_id 0x23d7800
> rdma_bind_addr: No such file or directory
> destroy cm_id 0x23d7800
>
> If bound to IF_ANY address, server starts but client fails to establish 
> connection.
> # rping -s -C1 -p  -vvvd
> created cm_id 0xc34800
> rdma_bind_addr successful
> rdma_listen
>
> And the commit which introduced this regression is
>
> commit abae1b71dd37bab506b14a6cf6ba7148f4d57232
> Author: Matan Barak <mat...@mellanox.com>
> Date:   Thu Oct 15 18:38:49 2015 +0300
>
> IB/cma: cma_validate_port should verify the port and netdevice
>
> Previously, cma_validate_port searched for GIDs in IB cache and then
> tried to verify the found port. This could fail when there are
> identical GIDs on both ports. In addition, netdevice should be taken
> into account when searching the GID table.
> Fixing cma_validate_port to search only the relevant port's cache
> and netdevice.
>
> Signed-off-by: Matan Barak <mat...@mellanox.com>
> Signed-off-by: Doug Ledford <dledf...@redhat.com>
>
>
> The bug is easily reproducible with latest rc and breaks basic rdma 
> functionality.
> Since 4.4 is already in -rc8, can we have a quick fix.
>
> Thanks,
> Hari--

Hi,

I don't have an iWARP server, so could you please test this simple fix:

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 2cbf9c9..351e835 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -439,7 +439,7 @@ static inline int cma_validate_port(struct
ib_device *device, u8 port,
if ((dev_type != ARPHRD_INFINIBAND) && rdma_protocol_ib(device, port))
return ret;

-   if (dev_type == ARPHRD_ETHER)
+   if (dev_type == ARPHRD_ETHER && rdma_protocol_roce(device, port))
		ndev = dev_get_by_index(&init_net, bound_if_index);

ret = ib_find_cached_gid_by_port(device, gid, port, ndev, NULL);

Regards,
Matan



[PATCH] IB/cma: Fix RDMA port validation for iWarp

2016-01-06 Thread Matan Barak
cma_validate_port wrongly assumed that Ethernet devices are RoCE
devices and thus their ndev should be matched in the GID table.
This broke iWarp support. Fix that by matching the ndev only when
working on a RoCE port.

Fixes: abae1b71dd37 ('IB/cma: cma_validate_port should verify the port
 and netdevice')
Reported-by: Hariprasad Shenai <haripra...@chelsio.com>
Tested-by: Hariprasad Shenai <haripra...@chelsio.com>
Signed-off-by: Matan Barak <mat...@mellanox.com>
---

Hi Doug,

This patch fixes an iWarp issue that was introduced in the RoCE
refactoring series.

Regards,
Matan
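
As a side note, a minimal sketch of the lookup rule the one-liner restores
(check_port_gid() is an illustrative name, not kernel code; rdma_protocol_roce(),
dev_get_by_index() and ib_find_cached_gid_by_port() are the helpers used in the
diff below):

	static int check_port_gid(struct ib_device *device, u8 port,
				  const union ib_gid *gid, int bound_if_index)
	{
		struct net_device *ndev = NULL;

		/* Only RoCE GID entries are keyed by net_device; iWarp and IB
		 * entries must be looked up with ndev == NULL, otherwise the
		 * cached-GID search finds nothing and rdma_bind_addr() fails.
		 */
		if (rdma_protocol_roce(device, port))
			ndev = dev_get_by_index(&init_net, bound_if_index);

		/* the real code dev_put()s ndev after the lookup */
		return ib_find_cached_gid_by_port(device, gid, port, ndev, NULL);
	}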

 drivers/infiniband/core/cma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 2d762a2..17a15c5 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -453,7 +453,7 @@ static inline int cma_validate_port(struct ib_device 
*device, u8 port,
if ((dev_type != ARPHRD_INFINIBAND) && rdma_protocol_ib(device, port))
return ret;
 
-   if (dev_type == ARPHRD_ETHER)
+   if (dev_type == ARPHRD_ETHER && rdma_protocol_roce(device, port))
		ndev = dev_get_by_index(&init_net, bound_if_index);
 
ret = ib_find_cached_gid_by_port(device, gid, port, ndev, NULL);
-- 
2.1.0



Re: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC

2016-01-05 Thread Matan Barak
On Thu, Dec 24, 2015 at 9:46 AM, Matan Barak <mat...@dev.mellanox.co.il> wrote:
> On Wed, Dec 23, 2015 at 10:04 PM, Doug Ledford <dledf...@redhat.com> wrote:
>> On 10/15/2015 12:58 PM, Hefty, Sean wrote:
>>>>>> ib_create_ah_from_wc needs to resolve the DMAC in order to create the
>>>>>> AH (this may result sending an ARP and waiting for response).
>>>>>> CM uses this function (which is now sleepable).
>>>>>
>>>>> This is a significant change to the CM.  The CM calls are invoked
>>>> assuming that they return relatively quickly.  They're invoked from
>>>> callbacks and internally.  Having the calls now wait for an ARP response
>>>> requires that this be re-architected, so the calling thread doesn't go out
>>>> to lunch for several seconds.
>>>>
>>>> Agree - this is a significant change, but it was done a long time ago
>>>> (at v4.3 if I recall). When we need to send a message we need to
>>>
>>> We're at 4.3-rc5?
>>>
>>>> figure out the destination MAC. Even the passive side needs to do that
>>>> as some vendors don't report the source MAC of the packet in their wc.
>>>> Even if they did, since IP based addressing is rout-able by its
>>>> nature, it should follow the networking stack rules. Some crazy
>>>> configurations could force sending responses to packets that came from
>>>> router1 to router2 - so we have no choice than resolving the DMAC at
>>>> every side.
>>>
>>> Ib_create_ah_from_wc is broken.   It is now an asynchronous operation, only 
>>> the call itself was left as synchronous.  We can't block kernel threads for 
>>> a minute, or however long ARP takes to resolve.  The call itself must 
>>> change to be async, and all users of it updated to allocate some request, 
>>> queue it, and handle all race conditions that result -- such as state 
>>> changes or destruction of the work that caused the request to be initiated.
>>>
>>
>> I don't know who had intended to address this, but it got left out of
>> the 4.4 work.  We need to not let this drop through the cracks (for
>> another release).  Can someone please put fixing this properly on their
>> TODO list?
>>
>
> IMHO, the proposed patch makes things better. Not applying the current
> patch means we have a "sleeping while atomic" error (in addition to
> the fact that kernel threads could wait until the ARP process
> finishes), which is pretty bad. I tend to agree that adding another CM
> state is probably a better approach, but unless someone steps up and
> add this for v4.5, I think that's the best thing we have.
>
>> --
>> Doug Ledford <dledf...@redhat.com>
>>   GPG KeyID: 0E572FDD
>>
>>
>
> Matan

Yishai has found a double free bug in the error flow of this patch.
The fix is pretty simple.
Thanks Yishai for catching and testing this fix.

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 07a3bbf..832674f 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -296,10 +296,9 @@ static int _cm_alloc_response_msg(struct cm_port *port,
   0, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA,
   GFP_ATOMIC,
   IB_MGMT_BASE_VERSION);
-   if (IS_ERR(m)) {
-   ib_destroy_ah(ah);
+   if (IS_ERR(m))
return PTR_ERR(m);
-   }
+
m->ah = ah;
*msg = m;
return 0;
@@ -310,13 +309,18 @@ static int cm_alloc_response_msg(struct cm_port *port,
 struct ib_mad_send_buf **msg)
 {
struct ib_ah *ah;
+   int ret;

ah = ib_create_ah_from_wc(port->mad_agent->qp->pd, mad_recv_wc->wc,
  mad_recv_wc->recv_buf.grh, port->port_num);
if (IS_ERR(ah))
return PTR_ERR(ah);

-   return _cm_alloc_response_msg(port, mad_recv_wc, ah, msg);
+   ret = _cm_alloc_response_msg(port, mad_recv_wc, ah, msg);
+   if (ret)
+   ib_destroy_ah(ah);
+
+   return ret;
 }

 static void cm_free_msg(struct ib_mad_send_buf *msg)


Doug, if you intend to take this patch, I can squash this fix and respin it.

Thanks,
Matan


[PATCH V1 for-next 2/2] IB/core: Use hop-limit from IP stack for RoCE

2016-01-04 Thread Matan Barak
Previously, IPV6_DEFAULT_HOPLIMIT was used as the hop limit value for
RoCE. Fix that by taking ip4_dst_hoplimit and ip6_dst_hoplimit as the
hop limit values instead.

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/addr.c   |  9 -
 drivers/infiniband/core/cm.c |  1 +
 drivers/infiniband/core/cma.c| 12 +---
 drivers/infiniband/core/verbs.c  | 16 +++-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  3 ++-
 include/rdma/ib_addr.h   |  4 +++-
 6 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index ce3c68e..f924d90 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -252,6 +252,8 @@ static int addr4_resolve(struct sockaddr_in *src_in,
if (rt->rt_uses_gateway)
addr->network = RDMA_NETWORK_IPV4;
 
+   addr->hoplimit = ip4_dst_hoplimit(&rt->dst);
+
*prt = rt;
return 0;
 out:
@@ -295,6 +297,8 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
if (rt->rt6i_flags & RTF_GATEWAY)
addr->network = RDMA_NETWORK_IPV6;
 
+   addr->hoplimit = ip6_dst_hoplimit(dst);
+
*pdst = dst;
return 0;
 put:
@@ -542,7 +546,8 @@ static void resolve_cb(int status, struct sockaddr 
*src_addr,
 
 int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid,
 const union ib_gid *dgid,
-u8 *dmac, u16 *vlan_id, int *if_index)
+u8 *dmac, u16 *vlan_id, int *if_index,
+int *hoplimit)
 {
int ret = 0;
struct rdma_dev_addr dev_addr;
@@ -581,6 +586,8 @@ int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid,
*if_index = dev_addr.bound_dev_if;
if (vlan_id)
*vlan_id = rdma_vlan_dev_vlan_id(dev);
+   if (hoplimit)
+   *hoplimit = dev_addr.hoplimit;
dev_put(dev);
return ret;
 }
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index e3a95d1..cd3d345 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1641,6 +1641,7 @@ static int cm_req_handler(struct cm_work *work)
	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 
memcpy(work->path[0].dmac, cm_id_priv->av.ah_attr.dmac, ETH_ALEN);
+   work->path[0].hop_limit = cm_id_priv->av.ah_attr.grh.hop_limit;
ret = ib_get_cached_gid(work->port->cm_dev->ib_device,
work->port->port_num,
cm_id_priv->av.ah_attr.grh.sgid_index,
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 559ee3d..66983da 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2424,7 +2424,6 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
 {
	struct rdma_route *route = &id_priv->id.route;
	struct rdma_addr *addr = &route->addr;
-   enum ib_gid_type network_gid_type;
struct cma_work *work;
int ret;
struct net_device *ndev = NULL;
@@ -2478,14 +2477,13 @@ static int cma_resolve_iboe_route(struct 
rdma_id_private *id_priv)
		    &route->path_rec->dgid);
 
/* Use the hint from IP Stack to select GID Type */
-   network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
-   if (addr->dev_addr.network != RDMA_NETWORK_IB) {
-   route->path_rec->gid_type = network_gid_type;
+	if (route->path_rec->gid_type < ib_network_to_gid_type(addr->dev_addr.network))
+		route->path_rec->gid_type = ib_network_to_gid_type(addr->dev_addr.network);
+	if (((struct sockaddr *)&id_priv->id.route.addr.dst_addr)->sa_family != AF_IB)
/* TODO: get the hoplimit from the inet/inet6 device */
-   route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;
-   } else {
+   route->path_rec->hop_limit = addr->dev_addr.hoplimit;
+   else
route->path_rec->hop_limit = 1;
-   }
route->path_rec->reversible = 1;
	route->path_rec->pkey = cpu_to_be16(0xffff);
route->path_rec->mtu_selector = IB_SA_EQ;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 66eb498..b1998bc 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -434,6 +434,7 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 
port_num,
int ret;
enum rdma_network_type net_type = RDMA_NETWORK_IB;
enum ib_gid_type gid_type = IB_GID_TYPE_IB;
+   int hoplimit = 0xff;
union ib_gid dgid;
union ib_gid sgid;
 
@@ -471,7 +472,7 @@ int ib_init_ah_from_wc(struct ib_de

[PATCH V1 for-next 0/2] Fix hop-limit for RoCE

2016-01-04 Thread Matan Barak
Hi Doug,

Previously, the hop limit of RoCE packets was set to
IPV6_DEFAULT_HOPLIMIT. This generally works, but the RoCE stack needs
to follow the IP stack rules. Therefore, this patch series uses
ip4_dst_hoplimit and ip6_dst_hoplimit in order to set the correct
hop limit for RoCE traffic.
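
For reference, a minimal sketch (not part of the series) of where these values
come from in the IP stack; the flowi4/flowi6 setup and error handling are
omitted and only assumed to describe the destination being resolved:

	struct rtable *rt = ip_route_output_key(&init_net, &fl4);
	int ttl = ip4_dst_hoplimit(&rt->dst);	/* IPv4: per-route/namespace TTL */

	struct dst_entry *dst = ip6_route_output(&init_net, NULL, &fl6);
	int hops = ip6_dst_hoplimit(dst);	/* IPv6: per-route/device hop limit */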

The first patch refactors the name of rdma_addr_find_dmac_by_grh to
rdma_addr_find_l2_eth_by_grh while the second one does the actual
change.

Regards,
Matan

Changes from V0:
 - Hop limit in IB when using reversible path should be 0xff.

Matan Barak (2):
  IB/core: Rename rdma_addr_find_dmac_by_grh
  IB/core: Use hop-limit from IP stack for RoCE

 drivers/infiniband/core/addr.c   | 14 +++---
 drivers/infiniband/core/cm.c |  1 +
 drivers/infiniband/core/cma.c| 12 +---
 drivers/infiniband/core/verbs.c  | 30 ++
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  7 ---
 include/rdma/ib_addr.h   |  7 +--
 6 files changed, 40 insertions(+), 31 deletions(-)

-- 
2.1.0



[PATCH V1 for-next 1/2] IB/core: Rename rdma_addr_find_dmac_by_grh

2016-01-04 Thread Matan Barak
rdma_addr_find_dmac_by_grh resolves dmac, vlan_id and if_index, and a
downstream patch will also add hop_limit as an output parameter, so we
rename it to rdma_addr_find_l2_eth_by_grh.

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/addr.c   |  7 ---
 drivers/infiniband/core/verbs.c  | 18 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  6 +++---
 include/rdma/ib_addr.h   |  5 +++--
 4 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 0b5f245..ce3c68e 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -540,8 +540,9 @@ static void resolve_cb(int status, struct sockaddr 
*src_addr,
complete(&((struct resolve_cb_context *)context)->comp);
 }
 
-int rdma_addr_find_dmac_by_grh(const union ib_gid *sgid, const union ib_gid 
*dgid,
-  u8 *dmac, u16 *vlan_id, int *if_index)
+int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid,
+const union ib_gid *dgid,
+u8 *dmac, u16 *vlan_id, int *if_index)
 {
int ret = 0;
struct rdma_dev_addr dev_addr;
@@ -583,7 +584,7 @@ int rdma_addr_find_dmac_by_grh(const union ib_gid *sgid, 
const union ib_gid *dgi
dev_put(dev);
return ret;
 }
-EXPORT_SYMBOL(rdma_addr_find_dmac_by_grh);
+EXPORT_SYMBOL(rdma_addr_find_l2_eth_by_grh);
 
 int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 *smac, u16 *vlan_id)
 {
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 072b94d..66eb498 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -467,11 +467,11 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 
port_num,
if (!idev)
return -ENODEV;
 
-	ret = rdma_addr_find_dmac_by_grh(&dgid, &sgid,
-					 ah_attr->dmac,
-					 wc->wc_flags & IB_WC_WITH_VLAN ?
-					 NULL : &vlan_id,
-					 &if_index);
+	ret = rdma_addr_find_l2_eth_by_grh(&dgid, &sgid,
+					   ah_attr->dmac,
+					   wc->wc_flags & IB_WC_WITH_VLAN ?
+					   NULL : &vlan_id,
+					   &if_index);
if (ret) {
dev_put(idev);
return ret;
@@ -1158,10 +1158,10 @@ int ib_resolve_eth_dmac(struct ib_qp *qp,
 
ifindex = sgid_attr.ndev->ifindex;
 
-	ret = rdma_addr_find_dmac_by_grh(&sgid,
-					 &qp_attr->ah_attr.grh.dgid,
-					 qp_attr->ah_attr.dmac,
-					 NULL, &ifindex);
+	ret = rdma_addr_find_l2_eth_by_grh(&sgid,
+					   &qp_attr->ah_attr.grh.dgid,
+					   qp_attr->ah_attr.dmac,
+					   NULL, &ifindex);
 
dev_put(sgid_attr.ndev);
}
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
index a343e03..850e0d1 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
@@ -152,9 +152,9 @@ struct ib_ah *ocrdma_create_ah(struct ib_pd *ibpd, struct 
ib_ah_attr *attr)
if ((pd->uctx) &&
(!rdma_is_multicast_addr((struct in6_addr *)attr->grh.dgid.raw)) &&
(!rdma_link_local_addr((struct in6_addr *)attr->grh.dgid.raw))) {
-		status = rdma_addr_find_dmac_by_grh(&sgid, &attr->grh.dgid,
-						    attr->dmac, &vlan_tag,
-						    &sgid_attr.ndev->ifindex);
+		status = rdma_addr_find_l2_eth_by_grh(&sgid, &attr->grh.dgid,
+						      attr->dmac, &vlan_tag,
+						      &sgid_attr.ndev->ifindex);
if (status) {
pr_err("%s(): Failed to resolve dmac from gid." 
"status = %d\n", __func__, status);
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index 87156dc..73fd088 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -130,8 +130,9 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct 
net_device *dev,
 int rdma_addr_size(struct sockaddr *addr);
 
 int rdma_addr_find_smac_by_sgid(union ib_g

Re: [PATCH for-next 2/2] IB/core: Use hop-limit from IP stack for RoCE

2016-01-04 Thread Matan Barak
On Sun, Jan 3, 2016 at 9:03 PM, Jason Gunthorpe
<jguntho...@obsidianresearch.com> wrote:
> On Sun, Jan 03, 2016 at 03:59:11PM +0200, Matan Barak wrote:
>> @@ -434,6 +434,7 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 
>> port_num,
>>   int ret;
>>   enum rdma_network_type net_type = RDMA_NETWORK_IB;
>>   enum ib_gid_type gid_type = IB_GID_TYPE_IB;
>> + int hoplimit = grh->hop_limit;
>
>>   ah_attr->grh.flow_label = flow_class & 0xF;
>> - ah_attr->grh.hop_limit = 0xFF;
>> + ah_attr->grh.hop_limit = hoplimit;
>
> No, this is wrong for IB. Please be careful to follow the IB
> specification language for computing a hop limit on a reversible path.
>

You're right, this should be 0xff. Thanks.

> No idea about rocee, but I can't believe using grh->hop_limit is right
> there either.
>

Regarding RoCE, the hop limit is set from the routing table (in the
same function):
ret = rdma_addr_find_l2_eth_by_grh(&dgid, &sgid,
				   ah_attr->dmac,
				   wc->wc_flags & IB_WC_WITH_VLAN ?
				   NULL : &vlan_id,
				   &if_index,
				   &hoplimit);



> Jason
>

Regards,
Matan



[PATCH for-next 1/2] IB/core: Rename rdma_addr_find_dmac_by_grh

2016-01-03 Thread Matan Barak
rdma_addr_find_dmac_by_grh resolves dmac, vlan_id and if_index, and a
downstream patch will also add hop_limit as an output parameter, so we
rename it to rdma_addr_find_l2_eth_by_grh.

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/addr.c   |  7 ---
 drivers/infiniband/core/verbs.c  | 18 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  6 +++---
 include/rdma/ib_addr.h   |  5 +++--
 4 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 0b5f245..ce3c68e 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -540,8 +540,9 @@ static void resolve_cb(int status, struct sockaddr 
*src_addr,
complete(&((struct resolve_cb_context *)context)->comp);
 }
 
-int rdma_addr_find_dmac_by_grh(const union ib_gid *sgid, const union ib_gid 
*dgid,
-  u8 *dmac, u16 *vlan_id, int *if_index)
+int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid,
+const union ib_gid *dgid,
+u8 *dmac, u16 *vlan_id, int *if_index)
 {
int ret = 0;
struct rdma_dev_addr dev_addr;
@@ -583,7 +584,7 @@ int rdma_addr_find_dmac_by_grh(const union ib_gid *sgid, 
const union ib_gid *dgi
dev_put(dev);
return ret;
 }
-EXPORT_SYMBOL(rdma_addr_find_dmac_by_grh);
+EXPORT_SYMBOL(rdma_addr_find_l2_eth_by_grh);
 
 int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 *smac, u16 *vlan_id)
 {
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 072b94d..66eb498 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -467,11 +467,11 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 
port_num,
if (!idev)
return -ENODEV;
 
-	ret = rdma_addr_find_dmac_by_grh(&dgid, &sgid,
-					 ah_attr->dmac,
-					 wc->wc_flags & IB_WC_WITH_VLAN ?
-					 NULL : &vlan_id,
-					 &if_index);
+	ret = rdma_addr_find_l2_eth_by_grh(&dgid, &sgid,
+					   ah_attr->dmac,
+					   wc->wc_flags & IB_WC_WITH_VLAN ?
+					   NULL : &vlan_id,
+					   &if_index);
if (ret) {
dev_put(idev);
return ret;
@@ -1158,10 +1158,10 @@ int ib_resolve_eth_dmac(struct ib_qp *qp,
 
ifindex = sgid_attr.ndev->ifindex;
 
-	ret = rdma_addr_find_dmac_by_grh(&sgid,
-					 &qp_attr->ah_attr.grh.dgid,
-					 qp_attr->ah_attr.dmac,
-					 NULL, &ifindex);
+	ret = rdma_addr_find_l2_eth_by_grh(&sgid,
+					   &qp_attr->ah_attr.grh.dgid,
+					   qp_attr->ah_attr.dmac,
+					   NULL, &ifindex);
 
dev_put(sgid_attr.ndev);
}
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
index a343e03..850e0d1 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
@@ -152,9 +152,9 @@ struct ib_ah *ocrdma_create_ah(struct ib_pd *ibpd, struct 
ib_ah_attr *attr)
if ((pd->uctx) &&
(!rdma_is_multicast_addr((struct in6_addr *)attr->grh.dgid.raw)) &&
(!rdma_link_local_addr((struct in6_addr *)attr->grh.dgid.raw))) {
-		status = rdma_addr_find_dmac_by_grh(&sgid, &attr->grh.dgid,
-						    attr->dmac, &vlan_tag,
-						    &sgid_attr.ndev->ifindex);
+		status = rdma_addr_find_l2_eth_by_grh(&sgid, &attr->grh.dgid,
+						      attr->dmac, &vlan_tag,
+						      &sgid_attr.ndev->ifindex);
if (status) {
pr_err("%s(): Failed to resolve dmac from gid." 
"status = %d\n", __func__, status);
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index 87156dc..73fd088 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -130,8 +130,9 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct 
net_device *dev,
 int rdma_addr_size(struct sockaddr *addr);
 
 int rdma_addr_find_smac_by_sgid(union ib_g

[PATCH for-next 2/2] IB/core: Use hop-limit from IP stack for RoCE

2016-01-03 Thread Matan Barak
Previously, IPV6_DEFAULT_HOPLIMIT was used as the hop limit value for
RoCE. Fix that by taking ip4_dst_hoplimit and ip6_dst_hoplimit as the
hop limit values instead.

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/addr.c   |  9 -
 drivers/infiniband/core/cm.c |  1 +
 drivers/infiniband/core/cma.c| 12 +---
 drivers/infiniband/core/verbs.c  | 16 +++-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  3 ++-
 include/rdma/ib_addr.h   |  4 +++-
 6 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index ce3c68e..f924d90 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -252,6 +252,8 @@ static int addr4_resolve(struct sockaddr_in *src_in,
if (rt->rt_uses_gateway)
addr->network = RDMA_NETWORK_IPV4;
 
+   addr->hoplimit = ip4_dst_hoplimit(&rt->dst);
+
*prt = rt;
return 0;
 out:
@@ -295,6 +297,8 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
if (rt->rt6i_flags & RTF_GATEWAY)
addr->network = RDMA_NETWORK_IPV6;
 
+   addr->hoplimit = ip6_dst_hoplimit(dst);
+
*pdst = dst;
return 0;
 put:
@@ -542,7 +546,8 @@ static void resolve_cb(int status, struct sockaddr 
*src_addr,
 
 int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid,
 const union ib_gid *dgid,
-u8 *dmac, u16 *vlan_id, int *if_index)
+u8 *dmac, u16 *vlan_id, int *if_index,
+int *hoplimit)
 {
int ret = 0;
struct rdma_dev_addr dev_addr;
@@ -581,6 +586,8 @@ int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid,
*if_index = dev_addr.bound_dev_if;
if (vlan_id)
*vlan_id = rdma_vlan_dev_vlan_id(dev);
+   if (hoplimit)
+   *hoplimit = dev_addr.hoplimit;
dev_put(dev);
return ret;
 }
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index e3a95d1..cd3d345 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1641,6 +1641,7 @@ static int cm_req_handler(struct cm_work *work)
	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 
memcpy(work->path[0].dmac, cm_id_priv->av.ah_attr.dmac, ETH_ALEN);
+   work->path[0].hop_limit = cm_id_priv->av.ah_attr.grh.hop_limit;
ret = ib_get_cached_gid(work->port->cm_dev->ib_device,
work->port->port_num,
cm_id_priv->av.ah_attr.grh.sgid_index,
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 559ee3d..66983da 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2424,7 +2424,6 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
 {
	struct rdma_route *route = &id_priv->id.route;
	struct rdma_addr *addr = &route->addr;
-   enum ib_gid_type network_gid_type;
struct cma_work *work;
int ret;
struct net_device *ndev = NULL;
@@ -2478,14 +2477,13 @@ static int cma_resolve_iboe_route(struct 
rdma_id_private *id_priv)
		    &route->path_rec->dgid);
 
/* Use the hint from IP Stack to select GID Type */
-   network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
-   if (addr->dev_addr.network != RDMA_NETWORK_IB) {
-   route->path_rec->gid_type = network_gid_type;
+	if (route->path_rec->gid_type < ib_network_to_gid_type(addr->dev_addr.network))
+		route->path_rec->gid_type = ib_network_to_gid_type(addr->dev_addr.network);
+	if (((struct sockaddr *)&id_priv->id.route.addr.dst_addr)->sa_family != AF_IB)
/* TODO: get the hoplimit from the inet/inet6 device */
-   route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;
-   } else {
+   route->path_rec->hop_limit = addr->dev_addr.hoplimit;
+   else
route->path_rec->hop_limit = 1;
-   }
route->path_rec->reversible = 1;
	route->path_rec->pkey = cpu_to_be16(0xffff);
route->path_rec->mtu_selector = IB_SA_EQ;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 66eb498..8a525f6 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -434,6 +434,7 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 
port_num,
int ret;
enum rdma_network_type net_type = RDMA_NETWORK_IB;
enum ib_gid_type gid_type = IB_GID_TYPE_IB;
+   int hoplimit = grh->hop_limit;
union ib_gid dgid;
union ib_gid sgid;
 
@@ -471,7 +472,7 @@ int ib_init_ah_from_wc

[PATCH for-next 0/2] Fix hop-limit for RoCE

2016-01-03 Thread Matan Barak
Hi Doug,

Previously, the hop limit of RoCE packets was set to
IPV6_DEFAULT_HOPLIMIT. This generally works, but the RoCE stack needs
to follow the IP stack rules. Therefore, this patch series uses
ip4_dst_hoplimit and ip6_dst_hoplimit in order to set the correct
hop limit for RoCE traffic.

The first patch refactors the name of rdma_addr_find_dmac_by_grh to
rdma_addr_find_l2_eth_by_grh while the second one does the actual
change.

Regards,
Matan

Matan Barak (2):
  IB/core: Rename rdma_addr_find_dmac_by_grh
  IB/core: Use hop-limit from IP stack for RoCE

 drivers/infiniband/core/addr.c   | 14 +++---
 drivers/infiniband/core/cm.c |  1 +
 drivers/infiniband/core/cma.c| 12 +---
 drivers/infiniband/core/verbs.c  | 30 ++
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  7 ---
 include/rdma/ib_addr.h   |  7 +--
 6 files changed, 40 insertions(+), 31 deletions(-)

-- 
2.1.0



[PATCH V1] IB/core: Do not allocate more memory than required for cma_configfs

2015-12-31 Thread Matan Barak
We were allocating larger memory space than required for
cma_dev_group->default_ports_group.

Fixes: 045959db65c6 ('IB/cma: Add configfs for rdma_cm')
Signed-off-by: Matan Barak <mat...@mellanox.com>
---
Hi Doug,

This patch fixes a small issue where we allocated more space than we
actually needed. This was introduced in the RoCE v2 series.

Regards,
Matan

Changes from V0:
 - Change subject and fix spelling mistake in commit message
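
As a generic illustration of the idiom the fix switches to (made-up names,
not the configfs code): sizing a kcalloc() by the object actually being
assigned keeps the element size correct even when a neighbouring field has a
different type:

	grp = kcalloc(n + 1, sizeof(*grp), GFP_KERNEL);	/* not sizeof(*other) */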

 drivers/infiniband/core/cma_configfs.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cma_configfs.c 
b/drivers/infiniband/core/cma_configfs.c
index bd1d640..ab554df 100644
--- a/drivers/infiniband/core/cma_configfs.c
+++ b/drivers/infiniband/core/cma_configfs.c
@@ -169,9 +169,10 @@ static int make_cma_ports(struct cma_dev_group 
*cma_dev_group,
ports = kcalloc(ports_num, sizeof(*cma_dev_group->ports),
GFP_KERNEL);
 
-   cma_dev_group->default_ports_group = kcalloc(ports_num + 1,
-						sizeof(*cma_dev_group->ports),
-						GFP_KERNEL);
+   cma_dev_group->default_ports_group =
+   kcalloc(ports_num + 1,
+   sizeof(*cma_dev_group->default_ports_group),
+   GFP_KERNEL);
 
if (!ports || !cma_dev_group->default_ports_group) {
err = -ENOMEM;
-- 
2.1.0



Re: [PATCH] IB/core: Allocating larger memory than required for cma_configfs

2015-12-31 Thread Matan Barak
On Thu, Dec 31, 2015 at 9:50 AM, Bart Van Assche
<bart.vanass...@sandisk.com> wrote:
> On 12/30/2015 03:14 PM, Matan Barak wrote:
>>
>> We were allocating larger memory space than requried for
>> cma_dev_group->default_ports_group.
>
>
> Please change the subject into something like "Do not allocate more ...".
> Please also fix the spelling error in the patch description.
>

No problem, I'll fix and send V1.

> Thanks,
>
> Bart.
>

Regards,
Matan



Re: [PATCH for-next 3/7] IB/mlx4: Configure device to work in RoCEv2

2015-12-30 Thread Matan Barak



On 12/29/2015 4:37 PM, Or Gerlitz wrote:

On 12/29/2015 3:24 PM, Matan Barak wrote:

From: Moni Shoua <mo...@mellanox.com>

Some mlx4 adapters are RoCEv2 capable. To enable this feature some
hardware configuration is required. This is

1. Set port general parameters
2. Configure the outgoing UDP destination port
3. Configure the QP that work with RoCEv2

Signed-off-by: Moni Shoua <mo...@mellanox.com>
---
  drivers/infiniband/hw/mlx4/main.c | 19 ++---
  drivers/infiniband/hw/mlx4/qp.c   | 35
---
  drivers/net/ethernet/mellanox/mlx4/fw.c   | 16 +-
  drivers/net/ethernet/mellanox/mlx4/mlx4.h |  7 +--
  drivers/net/ethernet/mellanox/mlx4/port.c |  8 +++
  drivers/net/ethernet/mellanox/mlx4/qp.c   | 28
+
  include/linux/mlx4/device.h   |  1 +
  include/linux/mlx4/qp.h   | 15 +++--
  include/rdma/ib_verbs.h   |  2 ++
  9 files changed, 120 insertions(+), 11 deletions(-)


Better put (please do...) functionality which is plain mlx4 corish (such
as new/modified FW commands, new SW/FW fields of structs and such) into
mlx4_core patch.



diff --git a/drivers/infiniband/hw/mlx4/main.c
b/drivers/infiniband/hw/mlx4/main.c
index 988fa33..44e5699 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -384,6 +384,7 @@ int mlx4_ib_gid_index_to_real_index(struct
mlx4_ib_dev *ibdev,
  int i;
  int ret;
  unsigned long flags;
+struct ib_gid_attr attr;
  if (port_num > MLX4_MAX_PORTS)
  return -EINVAL;
@@ -394,10 +395,13 @@ int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev *ibdev,
  if (!rdma_cap_roce_gid_table(&ibdev->ib_dev, port_num))
  return index;
-ret = ib_get_cached_gid(&ibdev->ib_dev, port_num, index, &gid, NULL);
+ret = ib_get_cached_gid(&ibdev->ib_dev, port_num, index, &gid, &attr);
  if (ret)
  return ret;
+if (attr.ndev)
+dev_put(attr.ndev);
+
  if (!memcmp(&gid, &zgid, sizeof(gid)))
  return -EINVAL;
@@ -405,7 +409,8 @@ int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev *ibdev,
  port_gid_table = &iboe->gids[port_num - 1];
  for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
-if (!memcmp(&port_gid_table->gids[i].gid, &gid, sizeof(gid))) {
+if (!memcmp(&port_gid_table->gids[i].gid, &gid, sizeof(gid)) &&
+attr.gid_type == port_gid_table->gids[i].gid_type) {
  ctx = port_gid_table->gids[i].ctx;
  break;
  }
@@ -2481,7 +2486,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
  if (mlx4_ib_init_sriov(ibdev))
  goto err_mad;
-if (dev->caps.flags & MLX4_DEV_CAP_FLAG_IBOE) {
+if (dev->caps.flags & MLX4_DEV_CAP_FLAG_IBOE ||
+dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_ROCE_V1_V2) {
  if (!iboe->nb.notifier_call) {
  iboe->nb.notifier_call = mlx4_ib_netdev_event;
  err = register_netdevice_notifier(&iboe->nb);
@@ -2490,6 +2496,13 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
  goto err_notif;
  }
  }
+if (!mlx4_is_slave(dev) &&
+dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_ROCE_V1_V2) {
+err = mlx4_config_roce_v2_port(dev, ROCE_V2_UDP_DPORT);
+if (err) {
+goto err_notif;
+}
+}
  }
  for (j = 0; j < ARRAY_SIZE(mlx4_class_attributes); ++j) {
diff --git a/drivers/infiniband/hw/mlx4/qp.c
b/drivers/infiniband/hw/mlx4/qp.c
index 8d28059..c0dee79 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1508,6 +1508,24 @@ static int create_qp_lb_counter(struct
mlx4_ib_dev *dev, struct mlx4_ib_qp *qp)
  return 0;
  }
+enum {
+MLX4_QPC_ROCE_MODE_1 = 0,
+MLX4_QPC_ROCE_MODE_2 = 2,
+MLX4_QPC_ROCE_MODE_MAX = 0xff
+};
+
+static u8 gid_type_to_qpc(enum ib_gid_type gid_type)
+{
+switch (gid_type) {
+case IB_GID_TYPE_ROCE:
+return MLX4_QPC_ROCE_MODE_1;
+case IB_GID_TYPE_ROCE_UDP_ENCAP:
+return MLX4_QPC_ROCE_MODE_2;
+default:
+return MLX4_QPC_ROCE_MODE_MAX;
+}
+}
+
  static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 const struct ib_qp_attr *attr, int attr_mask,
 enum ib_qp_state cur_state, enum ib_qp_state
new_state)
@@ -1651,9 +1669,10 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
  u16 vlan = 0xffff;
  u8 smac[ETH_ALEN];
  int status = 0;
+int is_eth = rdma_cap_eth_ah(&dev->ib_dev, port_num) &&
+	attr->ah_attr.ah_flags & IB_AH_GRH;
-if (rdma_cap_eth_ah(&dev->ib_dev, port_num) &&
-    attr->ah_attr.ah_flags & IB_AH_GRH) {
+if (is_eth && attr->ah_attr.ah_flags & IB_AH_GRH) {
  int index = attr->ah_attr.grh.sgid_index;
   

Re: [PATCH for-next 6/7] IB/mlx4: Create and use another QP1 for RoCEv2

2015-12-30 Thread Matan Barak



On 12/29/2015 4:42 PM, Or Gerlitz wrote:

On 12/29/2015 3:24 PM, Matan Barak wrote:

The mlx4 driver uses a special QP to implement the GSI QP. This kind
of QP allows to build the InfiniBand headers in SW to be put before
the payload that comes in with the WR. The mlx4 HW builds the packet,
calculates the ICRC and puts it at the end of the payload. This ICRC
calculation however depends on the QP configuration which is
determined when QP is modified (roce_mode during INIT->RTR).
On the other hand, ICRC verification when packet is received does to
depend on this configuration.


I don't understand the part of the sentence saying "when packet is
received does to depend on this configuration"
maybe some typo/s there?



I'll rephrase Moni's commit message for V2:

The mlx4 driver uses a special QP to implement the GSI QP. This kind of 
QP allows building the InfiniBand headers in software.
When mlx4 hardware builds the packet, it calculates the ICRC and puts it 
at the end of the payload. However, this ICRC calculation depends
on the QP configuration, which is determined when the QP is modified 
(roce_mode during INIT->RTR).
When receiving a packet, the ICRC verification doesn't depend on this 
configuration.
Therefore, two GSI QPs for send (one for each RoCE version) and one 
GSI QP for receive are required.



Therefore, using 2 GSI QPs for send (one for each RoCE version) and 1
GSI QP for receive are required.


s/2/two/ and s/1/one/ please



No problem


Or.



Matan


Re: [PATCH for-next 2/7] IB/mlx4: Add RoCE per GID support for add_gid and del_gid

2015-12-30 Thread Matan Barak



On 12/29/2015 5:24 PM, Or Gerlitz wrote:

On 12/29/2015 3:24 PM, Matan Barak wrote:

[...] We use a new firmware command in order to populate the GID table
and store the type along with the GID value.


Its a new value to existing command.. so better say we use a new value
to the SET_PORT firmware command to do X



Ok


Also here, break out mlx4_core new functionality e.g the changes to
include/linux/mlx4/cmd.h into mlx4_core only patch. You don't need any
change to mlx4_core to have it's own patch, I guess one up to three mlx4
core patches would be OK.



I'll split mlx4_core logically.


Did you make sure (at the resource tracker) that VFs can't do this new
set port command flavor?



In mlx4_common_set_port:
if (slave != dev->caps.function &&
in_modifier != MLX4_SET_PORT_GENERAL &&
 in_modifier != MLX4_SET_PORT_GID_TABLE) {
mlx4_warn(dev, "denying SET_PORT for slave:%d\n", slave);
return -EINVAL;
}




Also find some spot to put blank line in the change-log, it's hard to
read this way.



No problem


Or.


Matan






Re: [PATCH for-next V3 00/11] Add RoCE v2 support

2015-12-30 Thread Matan Barak



On 12/30/2015 8:04 AM, Or Gerlitz wrote:

Hi Matan,

I see these two smatch complaints on code added with this series, can
you please take a look?

drivers/infiniband/core/addr.c:503 rdma_resolve_ip_route() warn:
variable dereferenced before check 'src_addr' (see line 500)
drivers/infiniband/core/cma_configfs.c:172 make_cma_ports() warn: double
check that we're allocating correct size: 8 vs 128



I'll send fixes for both of them. Thanks for posting this.



Or.


Matan




Re: [PATCH for-next 1/7] IB/mlx4: Query RoCE support

2015-12-30 Thread Matan Barak



On 12/30/2015 10:44 AM, Or Gerlitz wrote:

On 12/30/2015 10:27 AM, Matan Barak wrote:



On 12/29/2015 5:19 PM, Or Gerlitz wrote:

On 12/29/2015 3:24 PM, Matan Barak wrote:

@@ -905,6 +906,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev,
struct mlx4_dev_cap *dev_cap)
  dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_EQE_STRIDE;
  MLX4_GET(dev_cap->bmme_flags, outbox,
   QUERY_DEV_CAP_BMME_FLAGS_OFFSET);
+if (dev_cap->bmme_flags & MLX4_FLAG_ROCE_V1_V2)
+dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_ROCE_V1_V2;


Did you make sure that the query dev cap wrapper unsets this bit when
proxing VF queries?


In mlx4_dev_cap:
if (mlx4_is_mfunc(dev)) {
dev->caps.flags &= ~MLX4_DEV_CAP_FLAG_SENSE_SUPPORT;
dev_cap->flags2 &= ~MLX4_DEV_CAP_FLAG2_ROCE_V1_V2;
mlx4_dbg(dev, "RoCE V2 is not supported when SR-IOV is enabled\n");
}

mlx4_slave_cap calls mlx4_dev_cap and uses the dev_caps it queried, so
we should be safe here.


mlx4_slave_cap is part of the Linux VF driver flow, right?

So...  NO, this is the Linux implementation.

You should make things robust against any guest driver.

The only way to do that is patch the command wrapper used by the PF
to filter out unwanted cap bits, see other filtering we do in
mlx4_QUERY_DEV_CAP_wrapper



I agree, thanks


Or.


Matan









  if (dev_cap->bmme_flags & MLX4_FLAG_PORT_REMAP)
  dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_PORT_REMAP;
  MLX4_GET(field, outbox, QUERY_DEV_CAP_CONFIG_DEV_OFFSET);








Re: [PATCH for-next V3 00/11] Add RoCE v2 support

2015-12-30 Thread Matan Barak



On 12/30/2015 1:05 PM, Or Gerlitz wrote:

On 12/30/2015 12:48 PM, Matan Barak wrote:



On 12/30/2015 8:04 AM, Or Gerlitz wrote:

Hi Matan,

I see these two smatch complaints on code added with this series, can
you please take a look?

drivers/infiniband/core/addr.c:503 rdma_resolve_ip_route() warn:
variable dereferenced before check 'src_addr' (see line 500)
drivers/infiniband/core/cma_configfs.c:172 make_cma_ports() warn: double
check that we're allocating correct size: 8 vs 128



I'll send fixes for both of them. Thanks for posting this.


when the same smatch runs on older gcc, it produces more warnings, are
they false-positives?




Yeah, false positives - cma_configfs_params_get returns 0 iff both 
cma_dev and group are valid.
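
A minimal sketch of the pattern gcc is flagging (illustrative names, not the
configfs code): the helper either fills both out-parameters and returns 0 or
returns an error, but gcc cannot prove that, so it warns even though the
error return guards every use:

	int get_pair(struct foo **a, struct bar **b);	/* 0 iff *a and *b are set */

	struct foo *a;
	struct bar *b;

	if (get_pair(&a, &b))
		return -EINVAL;
	use(a, b);	/* only reached when both were initialized */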



drivers/infiniband/core/cma_configfs.c: In function
?default_roce_mode_store?:
drivers/infiniband/core/cma_configfs.c:123: warning: ?cma_dev? may be
used uninitialized in this function
drivers/infiniband/core/cma_configfs.c:124: warning: ?group? may be
used uninitialized in this function
drivers/infiniband/core/cma_configfs.c: In function
?default_roce_mode_show?:
drivers/infiniband/core/cma_configfs.c:102: warning: ?cma_dev? may be
used uninitialized in this function
drivers/infiniband/core/cma_configfs.c:103: warning: ?group? may be
used uninitialized in this function




Re: [PATCH for-next 2/3] IB/core: Change per-entry lock in RoCE GID table to one lock

2015-12-30 Thread Matan Barak



On 12/30/2015 9:36 AM, Bart Van Assche wrote:

On 12/30/2015 07:01 AM, Or Gerlitz wrote:

On 10/28/2015 4:52 PM, Matan Barak wrote:

@@ -134,16 +138,14 @@ static int write_gid(struct ib_device *ib_dev,
u8 port,
  {
  int ret = 0;
  struct net_device *old_net_dev;
-unsigned long flags;
  /* in rdma_cap_roce_gid_table, this funciton should be protected
by a
   * sleep-able lock.
   */
-write_lock_irqsave(&table->data_vec[ix].lock, flags);
  if (rdma_cap_roce_gid_table(ib_dev, port)) {
  table->data_vec[ix].props |= GID_TABLE_ENTRY_INVALID;
-write_unlock_irqrestore(&table->data_vec[ix].lock, flags);
+write_unlock_irq(&table->rwlock);
  /* GID_TABLE_WRITE_ACTION_MODIFY currently isn't supported by
   * RoCE providers and thus only updates the cache.
   */
@@ -153,7 +155,7 @@ static int write_gid(struct ib_device *ib_dev, u8 port,
  else if (action == GID_TABLE_WRITE_ACTION_DEL)
  ret = ib_dev->del_gid(ib_dev, port, ix,
                        &table->data_vec[ix].context);
-write_lock_irqsave(&table->data_vec[ix].lock, flags);
+write_lock_irq(&table->rwlock);
  }


sparse complains on

drivers/infiniband/core/cache.c:186:17: warning: context imbalance in
'write_gid' - unexpected unlock

is this false positive?




It is false positive.


Hello Or,

sparse expects __release() and __acquire() annotations for functions
that unlock a lock object that has been locked by its caller. See e.g.
http://lists.kernelnewbies.org/pipermail/kernelnewbies/2011-October/003541.html.



Thanks - adding __releases and __acquires eliminates this sparse warning.
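
For reference, a minimal sketch of the annotation pattern (not the cache.c
code itself): a helper that temporarily drops a lock held by its caller marks
the fact with __releases()/__acquires() so sparse's context tracking stays
balanced:

	static void slow_update(rwlock_t *lock)
		__releases(lock) __acquires(lock)
	{
		write_unlock_irq(lock);		/* drop the caller's lock */
		/* ... sleepable work, e.g. a vendor add_gid/del_gid call ... */
		write_lock_irq(lock);		/* re-acquire before returning */
	}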



Bart.


Matan


Re: [PATCH for-next 1/7] IB/mlx4: Query RoCE support

2015-12-30 Thread Matan Barak



On 12/29/2015 5:19 PM, Or Gerlitz wrote:

On 12/29/2015 3:24 PM, Matan Barak wrote:

@@ -905,6 +906,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev,
struct mlx4_dev_cap *dev_cap)
  dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_EQE_STRIDE;
  MLX4_GET(dev_cap->bmme_flags, outbox,
   QUERY_DEV_CAP_BMME_FLAGS_OFFSET);
+if (dev_cap->bmme_flags & MLX4_FLAG_ROCE_V1_V2)
+dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_ROCE_V1_V2;


Did you make sure that the query dev cap wrapper unsets this bit when
proxing VF queries?


In mlx4_dev_cap:
if (mlx4_is_mfunc(dev)) {
dev->caps.flags &= ~MLX4_DEV_CAP_FLAG_SENSE_SUPPORT;
dev_cap->flags2 &= ~MLX4_DEV_CAP_FLAG2_ROCE_V1_V2;
mlx4_dbg(dev, "RoCE V2 is not supported when SR-IOV is enabled\n");
}

mlx4_slave_cap calls mlx4_dev_cap and uses the dev_caps it queried, so 
we should be safe here.





  if (dev_cap->bmme_flags & MLX4_FLAG_PORT_REMAP)
  dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_PORT_REMAP;
  MLX4_GET(field, outbox, QUERY_DEV_CAP_CONFIG_DEV_OFFSET);






Re: [PATCH for-next V3 10/11] IB/core: Initialize UD header structure with IP and UDP headers

2015-12-30 Thread Matan Barak



On 12/30/2015 7:57 AM, Or Gerlitz wrote:

On 12/23/2015 2:56 PM, Matan Barak wrote:

+__be16 ib_ud_ip4_csum(struct ib_ud_header *header)
+{
+	struct iphdr iph;
+
+	iph.ihl = 5;
+	iph.version = 4;
+	iph.tos = header->ip4.tos;
+	iph.tot_len = header->ip4.tot_len;
+	iph.id = header->ip4.id;
+	iph.frag_off = header->ip4.frag_off;
+	iph.ttl = header->ip4.ttl;
+	iph.protocol = header->ip4.protocol;
+	iph.check = 0;
+	iph.saddr = header->ip4.saddr;
+	iph.daddr = header->ip4.daddr;
+
+	return ip_fast_csum((u8 *)&iph, iph.ihl);
+}
+EXPORT_SYMBOL(ib_ud_ip4_csum);


You have introduced here this sparse warning, please fix



Thanks, we'll fix this.


drivers/infiniband/core/ud_header.c:299:28: warning: incorrect type in
return expression (different base types)
drivers/infiniband/core/ud_header.c:299:28:expected restricted __be16
drivers/infiniband/core/ud_header.c:299:28:got restricted __sum16

Or.


Matan
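
One possible way to resolve it (a sketch and an assumption, not necessarily
the fix that was applied) is to make the helper's return type match what
ip_fast_csum() produces instead of force-casting:

	__sum16 ib_ud_ip4_csum(struct ib_ud_header *header)
	{
		struct iphdr iph;

		/* ... fill iph from header->ip4 as in the patch above ... */
		return ip_fast_csum((u8 *)&iph, iph.ihl);
	}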


Re: [PATCH for-next 7/7] IB/mlx4: Advertise RoCE support

2015-12-30 Thread Matan Barak



On 12/29/2015 4:44 PM, Or Gerlitz wrote:

On 12/29/2015 3:24 PM, Matan Barak wrote:

Advertise RoCE support in port_immutable according to the hardware
capabilities. This enables the verbs stack to use RoCE v2 mode.


Advertise RoCE V2 support



Signed-off-by: Matan Barak <mat...@mellanox.com>


I guess you wanted  "IB/mlx4: Advertise RoCE V2 support" for the patch
title? since we did
advertise RDMA_CORE_PORT_IBA_ROCE prior to this patch.



Correct, thanks!


Or.

---
  drivers/infiniband/hw/mlx4/main.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c
b/drivers/infiniband/hw/mlx4/main.c
index 44e5699..8cf2575 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -2183,6 +2183,7 @@ static int mlx4_port_immutable(struct ib_device
*ibdev, u8 port_num,
 struct ib_port_immutable *immutable)
  {
  struct ib_port_attr attr;
+struct mlx4_ib_dev *mdev = to_mdev(ibdev);
  int err;
  err = mlx4_ib_query_port(ibdev, port_num, &attr);
@@ -2192,10 +2193,15 @@ static int mlx4_port_immutable(struct
ib_device *ibdev, u8 port_num,
  immutable->pkey_tbl_len = attr.pkey_tbl_len;
  immutable->gid_tbl_len = attr.gid_tbl_len;
-if (mlx4_ib_port_link_layer(ibdev, port_num) == IB_LINK_LAYER_INFINIBAND)
+if (mlx4_ib_port_link_layer(ibdev, port_num) == IB_LINK_LAYER_INFINIBAND) {
  immutable->core_cap_flags = RDMA_CORE_PORT_IBA_IB;
-else
-immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE;
+} else {
+if (mdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IBOE)
+immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE;
+if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_ROCE_V1_V2)
+immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE |
+RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP;
+}
  immutable->max_mad_size = IB_MGMT_MAD_SIZE;





[PATCH] IB/core: Allocating larger memory than required for cma_configfs

2015-12-30 Thread Matan Barak
We were allocating larger memory space than requried for
cma_dev_group->default_ports_group.

Fixes: 045959db65c6 ('IB/cma: Add configfs for rdma_cm')
Signed-off-by: Matan Barak <mat...@mellanox.com>
---
Hi Doug,

This patch fixes a small issue where we allocated more space than we
actually needed. This was introduced in the RoCE v2 series.

Regards,
Matan

 drivers/infiniband/core/cma_configfs.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cma_configfs.c 
b/drivers/infiniband/core/cma_configfs.c
index bd1d640..ab554df 100644
--- a/drivers/infiniband/core/cma_configfs.c
+++ b/drivers/infiniband/core/cma_configfs.c
@@ -169,9 +169,10 @@ static int make_cma_ports(struct cma_dev_group 
*cma_dev_group,
ports = kcalloc(ports_num, sizeof(*cma_dev_group->ports),
GFP_KERNEL);
 
-   cma_dev_group->default_ports_group = kcalloc(ports_num + 1,
-						sizeof(*cma_dev_group->ports),
-						GFP_KERNEL);
+   cma_dev_group->default_ports_group =
+   kcalloc(ports_num + 1,
+   sizeof(*cma_dev_group->default_ports_group),
+   GFP_KERNEL);
 
if (!ports || !cma_dev_group->default_ports_group) {
err = -ENOMEM;
-- 
2.1.0



[PATCH] IB/core: Fix dereference before check

2015-12-30 Thread Matan Barak
Sparse complains about a dereference before a check. Fix this by
moving the check before the dereference.

Fixes: 200298326b27 ('IB/core: Validate route when we init ah')
Signed-off-by: Matan Barak <mat...@mellanox.com>
---
Hi Doug,

This patch eliminates a false-positive sparse warning about a dereference before a check.
This was introduced in the RoCE v2 series.

Regards,
Matan

 drivers/infiniband/core/addr.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 0b5f245..791cc98 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -497,13 +497,14 @@ int rdma_resolve_ip_route(struct sockaddr *src_addr,
struct sockaddr_storage ssrc_addr = {};
	struct sockaddr *src_in = (struct sockaddr *)&ssrc_addr;
 
-   if (src_addr->sa_family != dst_addr->sa_family)
-   return -EINVAL;
+   if (src_addr) {
+   if (src_addr->sa_family != dst_addr->sa_family)
+   return -EINVAL;
 
-   if (src_addr)
memcpy(src_in, src_addr, rdma_addr_size(src_addr));
-   else
+   } else {
src_in->sa_family = dst_addr->sa_family;
+   }
 
return addr_resolve(src_in, dst_addr, addr, false);
 }
-- 
2.1.0



[PATCH] IB/core: Eliminate sparse false positive warning on context imbalance

2015-12-30 Thread Matan Barak
When the write_gid function needs to do a sleepable operation, it unlocks
table->rwlock and then relocks it. Sparse complains about a context
imbalance.

This is safe, as write_gid is always called with table->rwlock held.
write_gid protects against simultaneous writes to this GID entry
by setting the GID_TABLE_ENTRY_INVALID flag.

Fixes: 9c584f049596 ('IB/core: Change per-entry lock in RoCE GID table to
 one lock')
Signed-off-by: Matan Barak <mat...@mellanox.com>

---
Hi Doug,

This patch eliminates a sparse false-positive warning about context
imbalance. We use __releases and __acquires in order to do so.

Regards,
Matan

 drivers/infiniband/core/cache.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 92cadbd..53343ff 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -174,6 +174,7 @@ static int write_gid(struct ib_device *ib_dev, u8 port,
 const struct ib_gid_attr *attr,
 enum gid_table_write_action action,
 bool  default_gid)
+   __releases(&table->rwlock) __acquires(&table->rwlock)
 {
int ret = 0;
struct net_device *old_net_dev;
-- 
2.1.0



Re: [PATCH] IB/core: sysfs.c: Fix PerfMgt ClassPortInfo handling

2015-12-29 Thread Matan Barak
On Mon, Dec 28, 2015 at 11:53 PM, Hal Rosenstock <h...@dev.mellanox.co.il> 
wrote:
>
> Port number is not part of ClassPortInfo attribute but is
> still needed as a parameter when invoking process_mad.
>
> To properly handle this attribute, port_num is added as a
> parameter to get_counter_table and get_perf_mad was changed
> not to store port_num in the attribute itself when it's
> querying the ClassPortInfo attribute.
>
> This handles issue pointed out by Matan Barak <mat...@dev.mellanox.co.il>
>
> Signed-off-by: Hal Rosenstock <h...@mellanox.com>
> Acked-by: Matan Barak <mat...@mellanox.com>
> ---
> diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
> index 539040f..2daf832 100644
> --- a/drivers/infiniband/core/sysfs.c
> +++ b/drivers/infiniband/core/sysfs.c
> @@ -438,7 +438,8 @@ static int get_perf_mad(struct ib_device *dev, int 
> port_num, int attr,
> in_mad->mad_hdr.method= IB_MGMT_METHOD_GET;
> in_mad->mad_hdr.attr_id   = attr;
>
> -   in_mad->data[41] = port_num;/* PortSelect field */
> +   if (attr != IB_PMA_CLASS_PORT_INFO)
> +   in_mad->data[41] = port_num;/* PortSelect field */
>
> if ((dev->process_mad(dev, IB_MAD_IGNORE_MKEY,
>  port_num, NULL, NULL,
> @@ -714,11 +715,12 @@ err:
>   * Figure out which counter table to use depending on
>   * the device capabilities.
>   */
> -static struct attribute_group *get_counter_table(struct ib_device *dev)
> +static struct attribute_group *get_counter_table(struct ib_device *dev,
> +int port_num)
>  {
> struct ib_class_port_info cpi;
>
> -   if (get_perf_mad(dev, 0, IB_PMA_CLASS_PORT_INFO,
> +   if (get_perf_mad(dev, port_num, IB_PMA_CLASS_PORT_INFO,
> &cpi, 40, sizeof(cpi)) >= 0) {
>
> if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH)
> @@ -776,7 +778,7 @@ static int add_port(struct ib_device *device, int 
> port_num,
> goto err_put;
> }
>
> -   p->pma_table = get_counter_table(device);
> +   p->pma_table = get_counter_table(device, port_num);
> ret = sysfs_create_group(&p->kobj, p->pma_table);
> if (ret)
> goto err_put_gid_attrs;

Please just add:
Fixes: 145d9c541032 ('IB/core: Display extended counter set if available')


[PATCH for-next 1/7] IB/mlx4: Query RoCE support

2015-12-29 Thread Matan Barak
From: Moni Shoua 

Query the RoCE support from firmware using the appropriate firmware
commands. Downstream patches will read these capabilities and act
accordingly.

Signed-off-by: Moni Shoua 
---
 drivers/net/ethernet/mellanox/mlx4/fw.c   |  3 +++
 drivers/net/ethernet/mellanox/mlx4/main.c |  6 +-
 include/linux/mlx4/device.h   | 11 +--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c 
b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 90db94e..bdd6822 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -157,6 +157,7 @@ static void dump_dev_cap_flags2(struct mlx4_dev *dev, u64 
flags)
[29] = "802.1ad offload support",
[31] = "Modifying loopback source checks using UPDATE_QP 
support",
[32] = "Loopback source checks support",
+   [33] = "RoCEv2 support"
};
int i;
 
@@ -905,6 +906,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct 
mlx4_dev_cap *dev_cap)
dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_EQE_STRIDE;
MLX4_GET(dev_cap->bmme_flags, outbox,
 QUERY_DEV_CAP_BMME_FLAGS_OFFSET);
+   if (dev_cap->bmme_flags & MLX4_FLAG_ROCE_V1_V2)
+   dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_ROCE_V1_V2;
if (dev_cap->bmme_flags & MLX4_FLAG_PORT_REMAP)
dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_PORT_REMAP;
MLX4_GET(field, outbox, QUERY_DEV_CAP_CONFIG_DEV_OFFSET);
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 31c491e..fb4968f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -424,8 +424,12 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct 
mlx4_dev_cap *dev_cap)
if (mlx4_priv(dev)->pci_dev_data & MLX4_PCI_DEV_FORCE_SENSE_PORT)
dev->caps.flags |= MLX4_DEV_CAP_FLAG_SENSE_SUPPORT;
/* Don't do sense port on multifunction devices (for now at least) */
-   if (mlx4_is_mfunc(dev))
+   /* Don't do enable RoCE V2 on multifunction devices */
+   if (mlx4_is_mfunc(dev)) {
dev->caps.flags &= ~MLX4_DEV_CAP_FLAG_SENSE_SUPPORT;
+   dev_cap->flags2 &= ~MLX4_DEV_CAP_FLAG2_ROCE_V1_V2;
+   mlx4_dbg(dev, "RoCE V2 is not supported when SR-IOV is 
enabled\n");
+   }
 
if (mlx4_low_memory_profile()) {
dev->caps.log_num_macs  = MLX4_MIN_LOG_NUM_MAC;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index d3133be..dbf39ab 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -216,6 +216,7 @@ enum {
MLX4_DEV_CAP_FLAG2_SKIP_OUTER_VLAN  = 1LL <<  30,
MLX4_DEV_CAP_FLAG2_UPDATE_QP_SRC_CHECK_LB = 1ULL << 31,
MLX4_DEV_CAP_FLAG2_LB_SRC_CHK   = 1ULL << 32,
+   MLX4_DEV_CAP_FLAG2_ROCE_V1_V2   = 1LL <<  33,
 };
 
 enum {
@@ -267,6 +268,7 @@ enum {
MLX4_BMME_FLAG_TYPE_2_WIN   = 1 <<  9,
MLX4_BMME_FLAG_RESERVED_LKEY= 1 << 10,
MLX4_BMME_FLAG_FAST_REG_WR  = 1 << 11,
+   MLX4_BMME_FLAG_ROCE_V1_V2   = 1 << 19,
MLX4_BMME_FLAG_PORT_REMAP   = 1 << 24,
MLX4_BMME_FLAG_VSD_INIT2RTR = 1 << 28,
 };
@@ -275,6 +277,10 @@ enum {
MLX4_FLAG_PORT_REMAP= MLX4_BMME_FLAG_PORT_REMAP
 };
 
+enum {
+   MLX4_FLAG_ROCE_V1_V2= MLX4_BMME_FLAG_ROCE_V1_V2
+};
+
 enum mlx4_event {
MLX4_EVENT_TYPE_COMP   = 0x00,
MLX4_EVENT_TYPE_PATH_MIG   = 0x01,
@@ -984,9 +990,10 @@ struct mlx4_mad_ifc {
if (((dev)->caps.port_mask[port] != MLX4_PORT_TYPE_IB))
 
 #define mlx4_foreach_ib_transport_port(port, dev) \
-   for ((port) = 1; (port) <= (dev)->caps.num_ports; (port)++)   \
+   for ((port) = 1; (port) <= (dev)->caps.num_ports; (port)++)   \
if (((dev)->caps.port_mask[port] == MLX4_PORT_TYPE_IB) || \
-   ((dev)->caps.flags & MLX4_DEV_CAP_FLAG_IBOE))
+   ((dev)->caps.flags & MLX4_DEV_CAP_FLAG_IBOE) || \
+   ((dev)->caps.flags2 & MLX4_DEV_CAP_FLAG2_ROCE_V1_V2))
 
 #define MLX4_INVALID_SLAVE_ID  0xFF
 #define MLX4_SINK_COUNTER_INDEX(dev)   (dev->caps.max_counters - 1)
-- 
2.1.0



[PATCH for-next 2/7] IB/mlx4: Add RoCE per GID support for add_gid and del_gid

2015-12-29 Thread Matan Barak
In RoCE, the GID table is managed by the IB core driver. The role of the
mlx4 driver is to synchronize the HW with the entries in that table.
Since the same GID value may appear more than once in the GID table
(though with different attributes), the mlx4 driver has to maintain a
per-entry reference count and program the HW with a single value. We use
a new firmware command in order to populate the HW GID table and store
the GID type along with the GID value.
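
As a rough, stand-alone sketch of the reference-counting idea described
above (table size, names and types are illustrative, not the driver's):

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 16

struct gid_entry {
	unsigned char gid[16];
	int gid_type;
	int refcount;	/* how many core GID-table entries map to this slot */
};

static struct gid_entry table[TABLE_SIZE];

/* Returns 1 when the HW table must be (re)programmed, 0 for a duplicate,
 * -1 when the table is full. */
static int add_gid(const unsigned char *gid, int gid_type)
{
	int i, free_slot = -1;

	for (i = 0; i < TABLE_SIZE; i++) {
		if (table[i].refcount && table[i].gid_type == gid_type &&
		    !memcmp(table[i].gid, gid, 16)) {
			table[i].refcount++;	/* same (GID, type): no HW update */
			return 0;
		}
		if (!table[i].refcount && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1;
	memcpy(table[free_slot].gid, gid, 16);
	table[free_slot].gid_type = gid_type;
	table[free_slot].refcount = 1;
	return 1;			/* new value: push it to the HW */
}

int main(void)
{
	unsigned char g[16] = "example-gid";
	int first = add_gid(g, 0);
	int second = add_gid(g, 0);

	printf("first=%d second=%d\n", first, second);	/* 1 then 0 */
	return 0;
}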

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/hw/mlx4/main.c| 69 +---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  1 +
 include/linux/mlx4/cmd.h |  3 +-
 3 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 627267f..988fa33 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -154,9 +154,9 @@ static struct net_device *mlx4_ib_get_netdev(struct 
ib_device *device, u8 port_n
return dev;
 }
 
-static int mlx4_ib_update_gids(struct gid_entry *gids,
-  struct mlx4_ib_dev *ibdev,
-  u8 port_num)
+static int mlx4_ib_update_gids_v1(struct gid_entry *gids,
+ struct mlx4_ib_dev *ibdev,
+ u8 port_num)
 {
struct mlx4_cmd_mailbox *mailbox;
int err;
@@ -187,6 +187,61 @@ static int mlx4_ib_update_gids(struct gid_entry *gids,
return err;
 }
 
+static int mlx4_ib_update_gids_v1_v2(struct gid_entry *gids,
+struct mlx4_ib_dev *ibdev,
+u8 port_num)
+{
+   struct mlx4_cmd_mailbox *mailbox;
+   int err;
+   struct mlx4_dev *dev = ibdev->dev;
+   int i;
+   struct {
+   union ib_gidgid;
+   __be32  rsrvd1[2];
+   __be16  rsrvd2;
+   u8  type;
+   u8  version;
+   __be32  rsrvd3;
+   } *gid_tbl;
+
+   mailbox = mlx4_alloc_cmd_mailbox(dev);
+   if (IS_ERR(mailbox))
+   return -ENOMEM;
+
+   gid_tbl = mailbox->buf;
+   for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i) {
+   memcpy(&gid_tbl[i].gid, &gids[i].gid, sizeof(union ib_gid));
+   if (gids[i].gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP) {
+   gid_tbl[i].version = 2;
+   if (!ipv6_addr_v4mapped((struct in6_addr *)&gids[i].gid))
+   gid_tbl[i].type = 1;
+   }
+   }
+
+   err = mlx4_cmd(dev, mailbox->dma,
+  MLX4_SET_PORT_ROCE_ADDR << 8 | port_num,
+  1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
+  MLX4_CMD_WRAPPED);
+   if (mlx4_is_bonded(dev))
+   err += mlx4_cmd(dev, mailbox->dma,
+   MLX4_SET_PORT_ROCE_ADDR << 8 | 2,
+   1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
+   MLX4_CMD_WRAPPED);
+
+   mlx4_free_cmd_mailbox(dev, mailbox);
+   return err;
+}
+
+static int mlx4_ib_update_gids(struct gid_entry *gids,
+  struct mlx4_ib_dev *ibdev,
+  u8 port_num)
+{
+   if (ibdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_ROCE_V1_V2)
+   return mlx4_ib_update_gids_v1_v2(gids, ibdev, port_num);
+
+   return mlx4_ib_update_gids_v1(gids, ibdev, port_num);
+}
+
 static int mlx4_ib_add_gid(struct ib_device *device,
   u8 port_num,
   unsigned int index,
@@ -215,7 +270,8 @@ static int mlx4_ib_add_gid(struct ib_device *device,
port_gid_table = >gids[port_num - 1];
spin_lock_bh(>lock);
for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i) {
-   if (!memcmp(&port_gid_table->gids[i].gid, gid, sizeof(*gid))) {
+   if (!memcmp(&port_gid_table->gids[i].gid, gid, sizeof(*gid)) &&
+   (port_gid_table->gids[i].gid_type == attr->gid_type))  {
found = i;
break;
}
@@ -233,6 +289,7 @@ static int mlx4_ib_add_gid(struct ib_device *device,
} else {
*context = port_gid_table->gids[free].ctx;
memcpy(&port_gid_table->gids[free].gid, gid, sizeof(*gid));
+   port_gid_table->gids[free].gid_type = 
attr->gid_type;
port_gid_table->gids[free].ctx->real_index = 
free;
port_gid_table->gids[free].ctx->refcount = 1;
hw_update = 1;
@@ -248,8 +305,10 @@ static int mlx4_ib_add_gid(struct ib_device *device,
if (!gids) {
ret = -ENOMEM;
  

[PATCH for-next 7/7] IB/mlx4: Advertise RoCE support

2015-12-29 Thread Matan Barak
Advertise RoCE support in port_immutable according to the hardware
capabilities. This enables the verbs stack to use RoCE v2 mode.
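
Purely as illustration of what the consumer side sees once both flags are
advertised (the bit values are local stand-ins, not the kernel's
RDMA_CORE_PORT_IBA_* definitions): with both bits set, the core may populate
RoCE v1 and RoCE v2 GIDs on the port.

#include <stdio.h>

#define PORT_ROCE	(1u << 0)	/* stand-in for RDMA_CORE_PORT_IBA_ROCE */
#define PORT_ROCE_V2	(1u << 1)	/* stand-in for ..._ROCE_UDP_ENCAP */

int main(void)
{
	unsigned core_cap_flags = PORT_ROCE | PORT_ROCE_V2;	/* what this patch advertises */

	if (core_cap_flags & PORT_ROCE)
		printf("RoCE v1 GIDs allowed on this port\n");
	if (core_cap_flags & PORT_ROCE_V2)
		printf("RoCE v2 GIDs allowed on this port\n");
	return 0;
}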

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/hw/mlx4/main.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 44e5699..8cf2575 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -2183,6 +2183,7 @@ static int mlx4_port_immutable(struct ib_device *ibdev, 
u8 port_num,
   struct ib_port_immutable *immutable)
 {
struct ib_port_attr attr;
+   struct mlx4_ib_dev *mdev = to_mdev(ibdev);
int err;
 
err = mlx4_ib_query_port(ibdev, port_num, );
@@ -2192,10 +2193,15 @@ static int mlx4_port_immutable(struct ib_device *ibdev, 
u8 port_num,
immutable->pkey_tbl_len = attr.pkey_tbl_len;
immutable->gid_tbl_len = attr.gid_tbl_len;
 
-   if (mlx4_ib_port_link_layer(ibdev, port_num) == 
IB_LINK_LAYER_INFINIBAND)
+   if (mlx4_ib_port_link_layer(ibdev, port_num) == 
IB_LINK_LAYER_INFINIBAND) {
immutable->core_cap_flags = RDMA_CORE_PORT_IBA_IB;
-   else
-   immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE;
+   } else {
+   if (mdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IBOE)
+   immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE;
+   if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_ROCE_V1_V2)
+   immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE |
+   RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP;
+   }
 
immutable->max_mad_size = IB_MGMT_MAD_SIZE;
 
-- 
2.1.0



[PATCH for-next 0/7] Add RoCE v2 support for mlx4 driver

2015-12-29 Thread Matan Barak
Hi Doug,

This series adds RoCE v2 support for mlx4 driver.
It implements the required bits in the new RoCE v2 API while adding
the necessary firmware commands and handling.

Patch 0001 queries the firmware if RoCE is supported.
Patch 0002 introduces a new firmware command that sets the GID table,
such that we store the GID type along the GID itself in the table.
Patch 0003 configures the device to work in RoCE v1 and RoCE v2 mixed
mode.
Patch 0004 adds the support to create steering rules for IPv4 based
packets. This is necessary in order to support RoCE multicast.
Patch 0005 introduces the support for sending RoCE v2 packets from
QP1.
Patch 0006 creates another QP in order to receive QP1 RoCE v2 traffic.
Patch 0007 advertises RoCE v2 support for upper layer. From this point
and on, the GID table will be populated with RoCE v2 based GIDs (if
the hardware supports so).

Regards,
Moni and Matan

Maor Gottlieb (1):
  net/mlx4_core: Add handling of RoCE v2 over IPV4 in attach_flow

Matan Barak (2):
  IB/mlx4: Add RoCE per GID support for add_gid and del_gid
  IB/mlx4: Advertise RoCE support

Moni Shoua (4):
  IB/mlx4: Query RoCE support
  IB/mlx4: Configure device to work in RoCEv2
  IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers
  IB/mlx4: Create and use another QP1 for RoCEv2

 drivers/infiniband/hw/mlx4/main.c | 100 +--
 drivers/infiniband/hw/mlx4/mlx4_ib.h  |   8 +
 drivers/infiniband/hw/mlx4/qp.c   | 283 --
 drivers/net/ethernet/mellanox/mlx4/fw.c   |  19 +-
 drivers/net/ethernet/mellanox/mlx4/main.c |   6 +-
 drivers/net/ethernet/mellanox/mlx4/mcg.c  |  14 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |   7 +-
 drivers/net/ethernet/mellanox/mlx4/port.c |   8 +
 drivers/net/ethernet/mellanox/mlx4/qp.c   |  28 +++
 include/linux/mlx4/cmd.h  |   3 +-
 include/linux/mlx4/device.h   |  18 +-
 include/linux/mlx4/qp.h   |  15 +-
 include/rdma/ib_verbs.h   |   2 +
 13 files changed, 434 insertions(+), 77 deletions(-)

-- 
2.1.0



[PATCH for-next 3/7] IB/mlx4: Configure device to work in RoCEv2

2015-12-29 Thread Matan Barak
From: Moni Shoua 

Some mlx4 adapters are RoCEv2 capable. To enable this feature, some
hardware configuration is required:

1. Set port general parameters
2. Configure the outgoing UDP destination port
3. Configure the QPs that work with RoCEv2
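
A stand-alone sketch of the sequence listed above; every helper name here is
invented for illustration, and 4791 is the IANA-assigned RoCEv2 UDP port
(assumed, not shown in this patch, to be the value behind ROCE_V2_UDP_DPORT):

#include <stdio.h>

#define ROCE_V2_UDP_DPORT 4791	/* IANA-assigned RoCEv2 destination port */

/* Placeholders standing in for the firmware commands this patch adds. */
static int set_port_general_params(void)      { return 0; }	/* step 1 */
static int config_roce_v2_udp_dport(int port) { (void)port; return 0; }	/* step 2 */
static int set_qp_roce_mode_v2(void)          { return 0; }	/* step 3 */

int main(void)
{
	if (set_port_general_params() ||
	    config_roce_v2_udp_dport(ROCE_V2_UDP_DPORT) ||
	    set_qp_roce_mode_v2())
		fprintf(stderr, "RoCEv2 configuration failed\n");
	else
		printf("RoCEv2 configured\n");
	return 0;
}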

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/hw/mlx4/main.c | 19 ++---
 drivers/infiniband/hw/mlx4/qp.c   | 35 ---
 drivers/net/ethernet/mellanox/mlx4/fw.c   | 16 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |  7 +--
 drivers/net/ethernet/mellanox/mlx4/port.c |  8 +++
 drivers/net/ethernet/mellanox/mlx4/qp.c   | 28 +
 include/linux/mlx4/device.h   |  1 +
 include/linux/mlx4/qp.h   | 15 +++--
 include/rdma/ib_verbs.h   |  2 ++
 9 files changed, 120 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 988fa33..44e5699 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -384,6 +384,7 @@ int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev 
*ibdev,
int i;
int ret;
unsigned long flags;
+   struct ib_gid_attr attr;
 
if (port_num > MLX4_MAX_PORTS)
return -EINVAL;
@@ -394,10 +395,13 @@ int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev 
*ibdev,
if (!rdma_cap_roce_gid_table(>ib_dev, port_num))
return index;
 
-   ret = ib_get_cached_gid(&ibdev->ib_dev, port_num, index, &gid, NULL);
+   ret = ib_get_cached_gid(&ibdev->ib_dev, port_num, index, &gid, &attr);
if (ret)
return ret;
 
+   if (attr.ndev)
+   dev_put(attr.ndev);
+
if (!memcmp(&gid, &zgid, sizeof(gid)))
return -EINVAL;
 
@@ -405,7 +409,8 @@ int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev 
*ibdev,
port_gid_table = &iboe->gids[port_num - 1];
 
for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
-   if (!memcmp(&port_gid_table->gids[i].gid, &gid, sizeof(gid))) {
+   if (!memcmp(&port_gid_table->gids[i].gid, &gid, sizeof(gid)) &&
+   attr.gid_type == port_gid_table->gids[i].gid_type) {
ctx = port_gid_table->gids[i].ctx;
break;
}
@@ -2481,7 +2486,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
if (mlx4_ib_init_sriov(ibdev))
goto err_mad;
 
-   if (dev->caps.flags & MLX4_DEV_CAP_FLAG_IBOE) {
+   if (dev->caps.flags & MLX4_DEV_CAP_FLAG_IBOE ||
+   dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_ROCE_V1_V2) {
if (!iboe->nb.notifier_call) {
iboe->nb.notifier_call = mlx4_ib_netdev_event;
err = register_netdevice_notifier(>nb);
@@ -2490,6 +2496,13 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
goto err_notif;
}
}
+   if (!mlx4_is_slave(dev) &&
+   dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_ROCE_V1_V2) {
+   err = mlx4_config_roce_v2_port(dev, ROCE_V2_UDP_DPORT);
+   if (err) {
+   goto err_notif;
+   }
+   }
}
 
for (j = 0; j < ARRAY_SIZE(mlx4_class_attributes); ++j) {
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 8d28059..c0dee79 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1508,6 +1508,24 @@ static int create_qp_lb_counter(struct mlx4_ib_dev *dev, 
struct mlx4_ib_qp *qp)
return 0;
 }
 
+enum {
+   MLX4_QPC_ROCE_MODE_1 = 0,
+   MLX4_QPC_ROCE_MODE_2 = 2,
+   MLX4_QPC_ROCE_MODE_MAX = 0xff
+};
+
+static u8 gid_type_to_qpc(enum ib_gid_type gid_type)
+{
+   switch (gid_type) {
+   case IB_GID_TYPE_ROCE:
+   return MLX4_QPC_ROCE_MODE_1;
+   case IB_GID_TYPE_ROCE_UDP_ENCAP:
+   return MLX4_QPC_ROCE_MODE_2;
+   default:
+   return MLX4_QPC_ROCE_MODE_MAX;
+   }
+}
+
 static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
   const struct ib_qp_attr *attr, int attr_mask,
   enum ib_qp_state cur_state, enum ib_qp_state 
new_state)
@@ -1651,9 +1669,10 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
u16 vlan = 0x;
u8 smac[ETH_ALEN];
int status = 0;
+   int is_eth = rdma_cap_eth_ah(>ib_dev, port_num) &&
+   attr->ah_attr.ah_flags & IB_AH_GRH;
 
-   if (rdma_cap_eth_ah(>ib_dev, port_num) &&
-   attr->ah_attr.ah_flags & IB_AH_GRH) {
+   if (is_eth && attr->ah_attr.ah_flags & IB_AH_GRH) {
int index = attr->ah_attr.grh.sgid_index;
 
status = 

[PATCH for-next 4/7] net/mlx4_core: Add handling of RoCE v2 over IPV4 in attach_flow

2015-12-29 Thread Matan Barak
From: Maor Gottlieb 

When attaching multicast for RoCE v2, we need to be able to steer
packets to the QPs. Hence, we add support for IPV4 over IB steering.
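
The IPv4 case relies on the fact that an IPv4 address appears in the GID as
an IPv4-mapped IPv6 address, so only the last four bytes of the GID
participate in the match. A stand-alone sketch of that extraction (not
driver code):

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned char gid[16];
	struct in_addr v4;

	/* 224.1.1.1 as it appears in a GID: the IPv4-mapped form ::ffff:224.1.1.1 */
	inet_pton(AF_INET6, "::ffff:224.1.1.1", gid);

	/* The steering rule keeps only bytes 12..15, with a 4-byte mask. */
	memcpy(&v4, gid + 12, 4);
	printf("steer on %s\n", inet_ntoa(v4));
	return 0;
}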

Signed-off-by: Maor Gottlieb 
---
 drivers/net/ethernet/mellanox/mlx4/mcg.c | 14 --
 include/linux/mlx4/device.h  |  6 ++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c 
b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 1d4e2e0..834e60e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -858,7 +858,9 @@ static int parse_trans_rule(struct mlx4_dev *dev, struct 
mlx4_spec_list *spec,
break;
 
case MLX4_NET_TRANS_RULE_ID_IB:
-   rule_hw->ib.l3_qpn = spec->ib.l3_qpn;
+   rule_hw->ib.l3_qpn = spec->ib.l3_qpn |
+   (spec->ib.roce_type == MLX4_FLOW_SPEC_IB_ROCE_TYPE_IPV4 
?
+(__force __be32)0x80 : (__force __be32)0);
rule_hw->ib.qpn_mask = spec->ib.qpn_msk;
memcpy(_hw->ib.dst_gid, >ib.dst_gid, 16);
memcpy(_hw->ib.dst_gid_msk, >ib.dst_gid_msk, 16);
@@ -1384,10 +1386,18 @@ int mlx4_trans_to_dmfs_attach(struct mlx4_dev *dev, 
struct mlx4_qp *qp,
memcpy(spec.eth.dst_mac_msk, _mask, ETH_ALEN);
break;
 
+   case MLX4_PROT_IB_IPV4:
+   spec.id = MLX4_NET_TRANS_RULE_ID_IB;
+   memcpy(spec.ib.dst_gid + 12, gid + 12, 4);
+   memset(spec.ib.dst_gid_msk + 12, 0xff, 4);
+   spec.ib.roce_type = MLX4_FLOW_SPEC_IB_ROCE_TYPE_IPV4;
+   break;
+
case MLX4_PROT_IB_IPV6:
spec.id = MLX4_NET_TRANS_RULE_ID_IB;
memcpy(spec.ib.dst_gid, gid, 16);
-   memset(_gid_msk, 0xff, 16);
+   memset(spec.ib.dst_gid_msk, 0xff, 16);
+   spec.ib.roce_type = MLX4_FLOW_SPEC_IB_ROCE_TYPE_IPV6;
break;
default:
return -EINVAL;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 0d873f1ae..cdc75b2 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -391,6 +391,11 @@ enum mlx4_protocol {
MLX4_PROT_FCOE
 };
 
+enum mlx4_flow_roce_type {
+   MLX4_FLOW_SPEC_IB_ROCE_TYPE_IPV6 = 0,
+   MLX4_FLOW_SPEC_IB_ROCE_TYPE_IPV4
+};
+
 enum {
MLX4_MTT_FLAG_PRESENT   = 1
 };
@@ -1197,6 +1202,7 @@ struct mlx4_spec_ipv4 {
 struct mlx4_spec_ib {
__be32  l3_qpn;
__be32  qpn_msk;
+   enummlx4_flow_roce_type roce_type;
u8  dst_gid[16];
u8  dst_gid_msk[16];
 };
-- 
2.1.0



[PATCH for-next 5/7] IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers

2015-12-29 Thread Matan Barak
From: Moni Shoua 

RoCEv2 packets are sent over IP/UDP protocols.
The mlx4 driver uses a type of RAW QP to send packets for QP1 and
therefore needs to build the network headers below BTH in software.

This patch adds the option to build QP1 packets with IP and UDP headers
when RoCEv2 is requested.
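
The decision reduces to the type of the source GID and whether that GID is
IPv4-mapped; a stand-alone sketch of the classification (local stand-ins for
the kernel enums, simplified from the hunk below):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

enum gid_type { GID_ROCE_V1, GID_ROCE_UDP_ENCAP };	/* stand-ins */

static void classify(enum gid_type t, const struct in6_addr *sgid,
		     int *ip_version, int *is_udp, int *is_grh)
{
	*is_udp = (t == GID_ROCE_UDP_ENCAP);
	if (*is_udp) {
		*ip_version = IN6_IS_ADDR_V4MAPPED(sgid) ? 4 : 6;
		*is_grh = 0;	/* v2: the IP header takes the place of the GRH */
	} else {
		*ip_version = 0;
		*is_grh = 1;	/* v1: keep the GRH (taken from the AH in the driver) */
	}
}

int main(void)
{
	struct in6_addr sgid;
	int v, udp, grh;

	inet_pton(AF_INET6, "::ffff:192.0.2.1", &sgid);
	classify(GID_ROCE_UDP_ENCAP, &sgid, &v, &udp, &grh);
	printf("ip_version=%d is_udp=%d is_grh=%d\n", v, udp, grh);	/* 4 1 0 */
	return 0;
}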

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/hw/mlx4/qp.c | 86 ++---
 1 file changed, 54 insertions(+), 32 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index c0dee79..8485602 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -32,6 +32,8 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -2282,16 +2284,7 @@ static int build_sriov_qp0_header(struct mlx4_ib_sqp 
*sqp,
return 0;
 }
 
-static void mlx4_u64_to_smac(u8 *dst_mac, u64 src_mac)
-{
-   int i;
-
-   for (i = ETH_ALEN; i; i--) {
-   dst_mac[i - 1] = src_mac & 0xff;
-   src_mac >>= 8;
-   }
-}
-
+#define MLX4_ROCEV2_QP1_SPORT 0xC000
 static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_ud_wr *wr,
void *wqe, unsigned *mlx_seg_len)
 {
@@ -2311,6 +2304,8 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, 
struct ib_ud_wr *wr,
bool is_eth;
bool is_vlan = false;
bool is_grh;
+   bool is_udp = false;
+   int ip_version = 0;
 
send_size = 0;
for (i = 0; i < wr->wr.num_sge; ++i)
@@ -2319,6 +2314,8 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, 
struct ib_ud_wr *wr,
is_eth = rdma_port_get_link_layer(sqp->qp.ibqp.device, sqp->qp.port) == 
IB_LINK_LAYER_ETHERNET;
is_grh = mlx4_ib_ah_grh_present(ah);
if (is_eth) {
+   struct ib_gid_attr gid_attr;
+
if (mlx4_is_mfunc(to_mdev(ib_dev)->dev)) {
/* When multi-function is enabled, the ib_core gid
 * indexes don't necessarily match the hw ones, so
@@ -2329,23 +2326,36 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, 
struct ib_ud_wr *wr,
if (err)
return err;
} else  {
-   err = ib_get_cached_gid(ib_dev,
+   err = ib_get_cached_gid(sqp->qp.ibqp.device,
be32_to_cpu(ah->av.ib.port_pd) 
>> 24,
ah->av.ib.gid_index, ,
-   NULL);
-   if (!err && !memcmp(, , sizeof(sgid)))
-   err = -ENOENT;
-   if (err)
+   _attr);
+   if (!err) {
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   if (!memcmp(, , sizeof(sgid)))
+   err = -ENOENT;
+   }
+   if (!err) {
+   is_udp = gid_attr.gid_type == 
IB_GID_TYPE_ROCE_UDP_ENCAP;
+   if (is_udp) {
+   if (ipv6_addr_v4mapped((struct in6_addr 
*)))
+   ip_version = 4;
+   else
+   ip_version = 6;
+   is_grh = false;
+   }
+   } else {
return err;
+   }
}
-
if (ah->av.eth.vlan != cpu_to_be16(0x)) {
vlan = be16_to_cpu(ah->av.eth.vlan) & 0x0fff;
is_vlan = 1;
}
}
err = ib_ud_header_init(send_size, !is_eth, is_eth, is_vlan, is_grh,
-   0, 0, 0, >ud_header);
+ ip_version, is_udp, 0, >ud_header);
if (err)
return err;
 
@@ -2356,7 +2366,7 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, 
struct ib_ud_wr *wr,
sqp->ud_header.lrh.source_lid = cpu_to_be16(ah->av.ib.g_slid & 
0x7f);
}
 
-   if (is_grh) {
+   if (is_grh || (ip_version == 6)) {
sqp->ud_header.grh.traffic_class =
(be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20) & 
0xff;
sqp->ud_header.grh.flow_label=
@@ -2385,6 +2395,25 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, 
struct ib_ud_wr *wr,
   ah->av.ib.dgid, 16);
}
 
+   if (ip_version == 4) {
+   sqp->ud_header.ip4.tos =
+   (be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20) & 
0xff;
+   sqp->ud_header.ip4.id 

[PATCH for-next 6/7] IB/mlx4: Create and use another QP1 for RoCEv2

2015-12-29 Thread Matan Barak
From: Moni Shoua 

The mlx4 driver uses a special QP to implement the GSI QP. This kind
of QP allows the driver to build the InfiniBand headers in SW and place
them before the payload that comes in with the WR. The mlx4 HW builds
the packet, calculates the ICRC and puts it at the end of the payload.
This ICRC calculation, however, depends on the QP configuration, which
is determined when the QP is modified (roce_mode during INIT->RTR).
ICRC verification on receive, on the other hand, does not depend on
this configuration.
Therefore, 2 GSI QPs for send (one for each RoCE version) and 1 GSI QP
for receive are required.
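
In effect the send path picks one of two GSI QPs per work request based on
the GID type behind the AH, while a single QP keeps receiving. A stand-alone
sketch of that dispatch (QP numbers and names are invented):

#include <stdio.h>

enum gid_type { GID_ROCE_V1, GID_ROCE_UDP_ENCAP };	/* stand-ins */

struct gsi_qps {
	int qp1_v1;	/* ICRC computed for RoCE v1 framing */
	int qp1_v2;	/* ICRC computed for RoCE v2 (IP/UDP) framing */
	int qp1_rx;	/* single receive QP: the ICRC check does not depend on framing */
};

static int pick_send_qp(const struct gsi_qps *g, enum gid_type t)
{
	return t == GID_ROCE_UDP_ENCAP ? g->qp1_v2 : g->qp1_v1;
}

int main(void)
{
	struct gsi_qps g = { .qp1_v1 = 100, .qp1_v2 = 101, .qp1_rx = 100 };

	printf("post v2 GSI send on qpn %d\n",
	       pick_send_qp(&g, GID_ROCE_UDP_ENCAP));
	return 0;
}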

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   7 ++
 drivers/infiniband/hw/mlx4/qp.c  | 162 ++-
 2 files changed, 149 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h 
b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 7179fb1..52ce7b0 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -177,11 +177,18 @@ struct mlx4_ib_wq {
unsignedtail;
 };
 
+enum {
+   MLX4_IB_QP_CREATE_ROCE_V2_GSI = IB_QP_CREATE_RESERVED_START
+};
+
 enum mlx4_ib_qp_flags {
MLX4_IB_QP_LSO = IB_QP_CREATE_IPOIB_UD_LSO,
MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK = 
IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK,
MLX4_IB_QP_NETIF = IB_QP_CREATE_NETIF_QP,
MLX4_IB_QP_CREATE_USE_GFP_NOIO = IB_QP_CREATE_USE_GFP_NOIO,
+
+   /* Mellanox specific flags start from IB_QP_CREATE_RESERVED_START */
+   MLX4_IB_ROCE_V2_GSI_QP = MLX4_IB_QP_CREATE_ROCE_V2_GSI,
MLX4_IB_SRIOV_TUNNEL_QP = 1 << 30,
MLX4_IB_SRIOV_SQP = 1 << 31,
 };
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 8485602..a154d51 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -87,6 +87,7 @@ struct mlx4_ib_sqp {
u32 send_psn;
struct ib_ud_header ud_header;
u8  header_buf[MLX4_IB_UD_HEADER_SIZE];
+   struct ib_qp*roce_v2_gsi;
 };
 
 enum {
@@ -155,7 +156,10 @@ static int is_sqp(struct mlx4_ib_dev *dev, struct 
mlx4_ib_qp *qp)
}
}
}
-   return proxy_sqp;
+   if (proxy_sqp)
+   return 1;
+
+   return !!(qp->flags & MLX4_IB_ROCE_V2_GSI_QP);
 }
 
 /* used for INIT/CLOSE port logic */
@@ -695,6 +699,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct 
ib_pd *pd,
qp = >qp;
qp->pri.vid = 0x;
qp->alt.vid = 0x;
+   sqp->roce_v2_gsi = NULL;
} else {
qp = kzalloc(sizeof (struct mlx4_ib_qp), gfp);
if (!qp)
@@ -1085,9 +1090,17 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, 
struct mlx4_ib_qp *qp,
del_gid_entries(qp);
 }
 
-static u32 get_sqp_num(struct mlx4_ib_dev *dev, struct ib_qp_init_attr *attr)
+static int get_sqp_num(struct mlx4_ib_dev *dev, struct ib_qp_init_attr *attr)
 {
/* Native or PPF */
+   if ((!mlx4_is_mfunc(dev->dev) || mlx4_is_master(dev->dev)) &&
+   attr->create_flags & MLX4_IB_QP_CREATE_ROCE_V2_GSI) {
+   int sqpn;
+   int res = mlx4_qp_reserve_range(dev->dev, 1, 1, , 0);
+
+   return res ? -abs(res) : sqpn;
+   }
+
if (!mlx4_is_mfunc(dev->dev) ||
(mlx4_is_master(dev->dev) &&
 attr->create_flags & MLX4_IB_SRIOV_SQP)) {
@@ -1102,9 +1115,9 @@ static u32 get_sqp_num(struct mlx4_ib_dev *dev, struct 
ib_qp_init_attr *attr)
return dev->dev->caps.qp1_proxy[attr->port_num - 1];
 }
 
-struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
-   struct ib_qp_init_attr *init_attr,
-   struct ib_udata *udata)
+static struct ib_qp *_mlx4_ib_create_qp(struct ib_pd *pd,
+   struct ib_qp_init_attr *init_attr,
+   struct ib_udata *udata)
 {
struct mlx4_ib_qp *qp = NULL;
int err;
@@ -1123,6 +1136,7 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
MLX4_IB_SRIOV_TUNNEL_QP |
MLX4_IB_SRIOV_SQP |
MLX4_IB_QP_NETIF |
+   MLX4_IB_QP_CREATE_ROCE_V2_GSI |
MLX4_IB_QP_CREATE_USE_GFP_NOIO))
return ERR_PTR(-EINVAL);
 
@@ -1131,15 +1145,21 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
return ERR_PTR(-EINVAL);
}
 
-   if (init_attr->create_flags &&
-   ((udata && init_attr->create_flags & ~(sup_u_create_flags)) ||
-((init_attr->create_flags & 

Re: [PATCH v2 for-next 5/7] IB/mlx4: Add IB counters table

2015-12-27 Thread Matan Barak
On Fri, Dec 25, 2015 at 5:56 PM, Hal Rosenstock <h...@dev.mellanox.co.il> wrote:
> On 12/25/2015 9:50 AM, Hal Rosenstock wrote:
>> On 12/24/2015 11:09 AM, Matan Barak wrote:
>>> On Thu, Dec 24, 2015 at 4:07 PM, Matan Barak <mat...@dev.mellanox.co.il> 
>>> wrote:
>>>> On Thu, Dec 24, 2015 at 2:38 PM, Or Gerlitz <ogerl...@mellanox.com> wrote:
>>>>> On 12/24/2015 12:42 PM, Sagi Grimberg wrote:
>>>>>>
>>>>>>
>>>>>>>> This patch seems to generate a list corruption [1] when I test
>>>>>>>> with Doug's for-4.5 tree. Eran, care to take a look at this?
>>>>>>>
>>>>>>>
>>>>>>> This patch is part from a series that was introduced in 4.3-rc1 [1],
>>>>>>
>>>>>>
>>>>>> Then something else broke it. Can people check their patches on doug's
>>>>>> tree? At the moment it's unusable...
>>>>>
>>>>
>>>> Leon and I have checked Doug's tree with mlx4_ib disabled and we
>>>> didn't encounter any error.
>>>> We ran ucmatose over IB connection (in mlx5) and it worked flawlessly.
>>>>
>>>>>
>>>>> Yes, I checked the branch up to commit 882f3b3 "Merge branches
>>>>> '4.5/Or-cleanup' and '4.5/rdma-cq' into k.o/for-4.5" and it works (rping,
>>>>> ibv_rc_pingpong over top of mlx4 VPI)
>>>>>
>>>
>>> Regarding mlx4, Eran and I analyzed it. We didn't test that, but it
>>> seems like the bug is introduced in the 64bit counters test. Here's a
>>> proposal:
>>>
>>> diff --git a/drivers/infiniband/core/sysfs.c 
>>> b/drivers/infiniband/core/sysfs.c
>>> index 539040f..8da3c83 100644
>>> --- a/drivers/infiniband/core/sysfs.c
>>> +++ b/drivers/infiniband/core/sysfs.c
>>> @@ -714,11 +714,12 @@ err:
>>>   * Figure out which counter table to use depending on
>>>   * the device capabilities.
>>>   */
>>> -static struct attribute_group *get_counter_table(struct ib_device *dev)
>>> +static struct attribute_group *get_counter_table(struct ib_device *dev,
>>> +  int port_num)
>>>  {
>>> struct ib_class_port_info cpi;
>>>
>>> -   if (get_perf_mad(dev, 0, IB_PMA_CLASS_PORT_INFO,
>>> + if (get_perf_mad(dev, port_num, IB_PMA_CLASS_PORT_INFO,
>>> , 40, sizeof(cpi)) >= 0) {
>>
>> Your proposal is similar to earlier version of Christoph's patch but was
>> changed since ClassPortInfo attribute does not have PortSelect field
>> like other PerfMgt attributes which is where this port num would be
>> placed. In ClassPortInfo attribute, that location would be the
>> ClassVersion field that would be set to port number in PerfMgt Get query.
>
> In actuality, I don't think it really matters as this is a Get not a Set
> and the PMA would do the right thing even if some field in the CPI were
> stepped on.
>

Well, it does matter as it calls the vendor driver with port_num = 0.
Since the kernel is trusted, the vendor driver expects a valid port number.
Giving it an invalid number might result in memory corruptions, as
demonstrated in this case.

>> -- Hal

Matan

>>
>>>
>>> if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH)
>>> @@ -776,7 +777,7 @@ static int add_port(struct ib_device *device, int 
>>> port_num,
>>> goto err_put;
>>> }
>>>
>>> -   p->pma_table = get_counter_table(device);
>>> + p->pma_table = get_counter_table(device, port_num);
>>> ret = sysfs_create_group(>kobj, p->pma_table);
>>> if (ret)
>>> goto err_put_gid_attrs;
>>>
>>>


Re: [PATCH v2 for-next 5/7] IB/mlx4: Add IB counters table

2015-12-24 Thread Matan Barak
On Thu, Dec 24, 2015 at 4:07 PM, Matan Barak <mat...@dev.mellanox.co.il> wrote:
> On Thu, Dec 24, 2015 at 2:38 PM, Or Gerlitz <ogerl...@mellanox.com> wrote:
>> On 12/24/2015 12:42 PM, Sagi Grimberg wrote:
>>>
>>>
>>>>> This patch seems to generate a list corruption [1] when I test
>>>>> with Doug's for-4.5 tree. Eran, care to take a look at this?
>>>>
>>>>
>>>> This patch is part from a series that was introduced in 4.3-rc1 [1],
>>>
>>>
>>> Then something else broke it. Can people check their patches on doug's
>>> tree? At the moment it's unusable...
>>
>
> Leon and I have checked Doug's tree with mlx4_ib disabled and we
> didn't encounter any error.
> We ran ucmatose over IB connection (in mlx5) and it worked flawlessly.
>
>>
>> Yes, I checked the branch up to commit 882f3b3 "Merge branches
>> '4.5/Or-cleanup' and '4.5/rdma-cq' into k.o/for-4.5" and it works (rping,
>> ibv_rc_pingpong over top of mlx4 VPI)
>>

Regarding mlx4, Eran and I analyzed it. We didn't test that, but it
seems like the bug is introduced in the 64bit counters test. Here's a
proposal:

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 539040f..8da3c83 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -714,11 +714,12 @@ err:
  * Figure out which counter table to use depending on
  * the device capabilities.
  */
-static struct attribute_group *get_counter_table(struct ib_device *dev)
+static struct attribute_group *get_counter_table(struct ib_device *dev,
+  int port_num)
 {
struct ib_class_port_info cpi;

-   if (get_perf_mad(dev, 0, IB_PMA_CLASS_PORT_INFO,
+ if (get_perf_mad(dev, port_num, IB_PMA_CLASS_PORT_INFO,
&cpi, 40, sizeof(cpi)) >= 0) {

if (cpi.capability_mask && IB_PMA_CLASS_CAP_EXT_WIDTH)
@@ -776,7 +777,7 @@ static int add_port(struct ib_device *device, int port_num,
goto err_put;
}

-   p->pma_table = get_counter_table(device);
+ p->pma_table = get_counter_table(device, port_num);
ret = sysfs_create_group(&p->kobj, p->pma_table);
if (ret)
goto err_put_gid_attrs;




Re: [PATCH v2 for-next 5/7] IB/mlx4: Add IB counters table

2015-12-24 Thread Matan Barak
On Thu, Dec 24, 2015 at 2:38 PM, Or Gerlitz  wrote:
> On 12/24/2015 12:42 PM, Sagi Grimberg wrote:
>>
>>
 This patch seems to generate a list corruption [1] when I test
 with Doug's for-4.5 tree. Eran, care to take a look at this?
>>>
>>>
>>> This patch is part from a series that was introduced in 4.3-rc1 [1],
>>
>>
>> Then something else broke it. Can people check their patches on doug's
>> tree? At the moment it's unusable...
>

Leon and I have checked Doug's tree with mlx4_ib disabled and we
didn't encounter any error.
We ran ucmatose over IB connection (in mlx5) and it worked flawlessly.

>
> Yes, I checked the branch up to commit 882f3b3 "Merge branches
> '4.5/Or-cleanup' and '4.5/rdma-cq' into k.o/for-4.5" and it works (rping,
> ibv_rc_pingpong over top of mlx4 VPI)
>


Re: [PATCH for-next V2 4/5] IB/mlx5: Add hca_core_clock_offset to udata in init_ucontext

2015-12-23 Thread Matan Barak
On Wed, Dec 23, 2015 at 11:03 AM, Or Gerlitz <ogerl...@mellanox.com> wrote:
> On 12/22/2015 11:37 PM, Or Gerlitz wrote:
>>
>> On Tue, Dec 15, 2015 at 8:30 PM, Matan Barak<mat...@mellanox.com>  wrote:
>>>
>>> >Pass hca_core_clock_offset to user-space is mandatory in order to
>>> >let the user-space read the free-running clock register from the
>>> >right offset in the memory mapped page.
>>> >Passing this value is done by changing the vendor's command
>>> >and response of init_ucontext to be in extensible form.
>>
>> Is "old" (unpatched) libmlx5 still operational over "new" (patched)
>> mlx5 IB driver?
>
>
> and same question for the other way around as well
>

new kernel, old lib - response length is initialized to
min(offsetof(typeof(resp), response_length) +
sizeof(resp.response_length), udata->outlen);
In this case, we initialize it to udata->outlen, so the user-space gets
exactly the same information as before.

new lib, old kernel - the response is cleared before it is sent to the
kernel. The command size has stayed the same.
So - (reqlen == sizeof(struct mlx5_ib_alloc_ucontext_req_v2)) still holds true.
Since the kernel isn't familiar with the new response fields, it
doesn't copy them. Hence the response's comp_mask is still zero and
libmlx5 knows that hca_core_clock isn't supported.
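
For reference, a stand-alone sketch of the clamping described above; the
struct layout is illustrative, the point is only that an old library's
smaller outlen bounds what a new kernel reports back:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct resp_v2 {
	uint32_t old_fields;		/* what an old library already knows */
	uint32_t response_length;
	uint32_t comp_mask;
	uint32_t hca_core_clock_offset;	/* new field, unknown to old libraries */
};

static size_t initial_resp_len(size_t outlen)
{
	size_t min_len = offsetof(struct resp_v2, response_length) +
			 sizeof(((struct resp_v2 *)0)->response_length);

	return outlen < min_len ? outlen : min_len;	/* min(min_len, outlen) */
}

int main(void)
{
	/* old library: outlen only covers the old fields */
	printf("old lib: start at %zu bytes\n", initial_resp_len(sizeof(uint32_t)));
	/* new library: outlen covers the whole struct; optional fields may follow */
	printf("new lib: start at %zu bytes\n", initial_resp_len(sizeof(struct resp_v2)));
	return 0;
}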

Matan



[PATCH for-next V1 09/10] IB/mlx5: Add RoCE fields to Address Vector

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Set the address handle and QP address path fields according to the
link layer type (IB/Eth).

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/ah.c  | 32 +--
 drivers/infiniband/hw/mlx5/main.c| 21 ++
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  5 +++--
 drivers/infiniband/hw/mlx5/qp.c  | 42 ++--
 include/linux/mlx5/qp.h  | 21 --
 5 files changed, 96 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/ah.c b/drivers/infiniband/hw/mlx5/ah.c
index 6608058..745efa4 100644
--- a/drivers/infiniband/hw/mlx5/ah.c
+++ b/drivers/infiniband/hw/mlx5/ah.c
@@ -32,8 +32,10 @@
 
 #include "mlx5_ib.h"
 
-struct ib_ah *create_ib_ah(struct ib_ah_attr *ah_attr,
-  struct mlx5_ib_ah *ah)
+static struct ib_ah *create_ib_ah(struct mlx5_ib_dev *dev,
+ struct mlx5_ib_ah *ah,
+ struct ib_ah_attr *ah_attr,
+ enum rdma_link_layer ll)
 {
if (ah_attr->ah_flags & IB_AH_GRH) {
memcpy(ah->av.rgid, _attr->grh.dgid, 16);
@@ -44,9 +46,20 @@ struct ib_ah *create_ib_ah(struct ib_ah_attr *ah_attr,
ah->av.tclass = ah_attr->grh.traffic_class;
}
 
-   ah->av.rlid = cpu_to_be16(ah_attr->dlid);
-   ah->av.fl_mlid = ah_attr->src_path_bits & 0x7f;
-   ah->av.stat_rate_sl = (ah_attr->static_rate << 4) | (ah_attr->sl & 0xf);
+   ah->av.stat_rate_sl = (ah_attr->static_rate << 4);
+
+   if (ll == IB_LINK_LAYER_ETHERNET) {
+   memcpy(ah->av.rmac, ah_attr->dmac, sizeof(ah_attr->dmac));
+   ah->av.udp_sport =
+   mlx5_get_roce_udp_sport(dev,
+   ah_attr->port_num,
+   ah_attr->grh.sgid_index);
+   ah->av.stat_rate_sl |= (ah_attr->sl & 0x7) << 1;
+   } else {
+   ah->av.rlid = cpu_to_be16(ah_attr->dlid);
+   ah->av.fl_mlid = ah_attr->src_path_bits & 0x7f;
+   ah->av.stat_rate_sl |= (ah_attr->sl & 0xf);
+   }
 
return >ibah;
 }
@@ -54,12 +67,19 @@ struct ib_ah *create_ib_ah(struct ib_ah_attr *ah_attr,
 struct ib_ah *mlx5_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
 {
struct mlx5_ib_ah *ah;
+   struct mlx5_ib_dev *dev = to_mdev(pd->device);
+   enum rdma_link_layer ll;
+
+   ll = pd->device->get_link_layer(pd->device, ah_attr->port_num);
+
+   if (ll == IB_LINK_LAYER_ETHERNET && !(ah_attr->ah_flags & IB_AH_GRH))
+   return ERR_PTR(-EINVAL);
 
ah = kzalloc(sizeof(*ah), GFP_ATOMIC);
if (!ah)
return ERR_PTR(-ENOMEM);
 
-   return create_ib_ah(ah_attr, ah); /* never fails */
+   return create_ib_ah(dev, ah, ah_attr, ll); /* never fails */
 }
 
 int mlx5_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr)
diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 6d160b5..2374007 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -252,6 +253,26 @@ static int mlx5_ib_del_gid(struct ib_device *device, u8 
port_num,
return set_roce_addr(device, port_num, index, NULL, NULL);
 }
 
+__be16 mlx5_get_roce_udp_sport(struct mlx5_ib_dev *dev, u8 port_num,
+  int index)
+{
+   struct ib_gid_attr attr;
+   union ib_gid gid;
+
+   if (ib_get_cached_gid(>ib_dev, port_num, index, , ))
+   return 0;
+
+   if (!attr.ndev)
+   return 0;
+
+   dev_put(attr.ndev);
+
+   if (attr.gid_type != IB_GID_TYPE_ROCE_UDP_ENCAP)
+   return 0;
+
+   return cpu_to_be16(MLX5_CAP_ROCE(dev->mdev, r_roce_min_src_udp_port));
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 1eaa611..b0deeb3 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -517,8 +517,6 @@ void mlx5_ib_free_srq_wqe(struct mlx5_ib_srq *srq, int 
wqe_index);
 int mlx5_MAD_IFC(struct mlx5_ib_dev *dev, int ignore_mkey, int ignore_bkey,
 u8 port, const struct ib_wc *in_wc, const struct ib_grh 
*in_grh,
 const void *in_mad, void *response_mad);
-struct ib_ah *create_ib_ah(struct ib_ah_attr *ah_attr,
-  struct mlx5_ib_ah *ah);
 struct ib_ah *mlx5_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr);
 int mlx5_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr);
 int mlx5_ib_destroy_ah(struct ib_ah *ah);
@@ -647,6 +645,9 @@ static inline void 

[PATCH for-next V1 10/10] IB/mlx5: Support RoCE

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Advertise RoCE support for IB/core layer and set the hardware to
work in RoCE mode.

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c | 48 +++
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 2374007..1c7c459 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1498,6 +1498,32 @@ static void destroy_dev_resources(struct 
mlx5_ib_resources *devr)
mlx5_ib_dealloc_pd(devr->p0);
 }
 
+static u32 get_core_cap_flags(struct ib_device *ibdev)
+{
+   struct mlx5_ib_dev *dev = to_mdev(ibdev);
+   enum rdma_link_layer ll = mlx5_ib_port_link_layer(ibdev, 1);
+   u8 l3_type_cap = MLX5_CAP_ROCE(dev->mdev, l3_type);
+   u8 roce_version_cap = MLX5_CAP_ROCE(dev->mdev, roce_version);
+   u32 ret = 0;
+
+   if (ll == IB_LINK_LAYER_INFINIBAND)
+   return RDMA_CORE_PORT_IBA_IB;
+
+   if (!(l3_type_cap & MLX5_ROCE_L3_TYPE_IPV4_CAP))
+   return 0;
+
+   if (!(l3_type_cap & MLX5_ROCE_L3_TYPE_IPV6_CAP))
+   return 0;
+
+   if (roce_version_cap & MLX5_ROCE_VERSION_1_CAP)
+   ret |= RDMA_CORE_PORT_IBA_ROCE;
+
+   if (roce_version_cap & MLX5_ROCE_VERSION_2_CAP)
+   ret |= RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP;
+
+   return ret;
+}
+
 static int mlx5_port_immutable(struct ib_device *ibdev, u8 port_num,
   struct ib_port_immutable *immutable)
 {
@@ -1510,7 +1536,7 @@ static int mlx5_port_immutable(struct ib_device *ibdev, 
u8 port_num,
 
immutable->pkey_tbl_len = attr.pkey_tbl_len;
immutable->gid_tbl_len = attr.gid_tbl_len;
-   immutable->core_cap_flags = RDMA_CORE_PORT_IBA_IB;
+   immutable->core_cap_flags = get_core_cap_flags(ibdev);
immutable->max_mad_size = IB_MGMT_MAD_SIZE;
 
return 0;
@@ -1518,12 +1544,27 @@ static int mlx5_port_immutable(struct ib_device *ibdev, 
u8 port_num,
 
 static int mlx5_enable_roce(struct mlx5_ib_dev *dev)
 {
+   int err;
+
dev->roce.nb.notifier_call = mlx5_netdev_event;
-   return register_netdevice_notifier(>roce.nb);
+   err = register_netdevice_notifier(>roce.nb);
+   if (err)
+   return err;
+
+   err = mlx5_nic_vport_enable_roce(dev->mdev);
+   if (err)
+   goto err_unregister_netdevice_notifier;
+
+   return 0;
+
+err_unregister_netdevice_notifier:
+   unregister_netdevice_notifier(>roce.nb);
+   return err;
 }
 
 static void mlx5_disable_roce(struct mlx5_ib_dev *dev)
 {
+   mlx5_nic_vport_disable_roce(dev->mdev);
unregister_netdevice_notifier(>roce.nb);
 }
 
@@ -1538,8 +1579,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
port_type_cap = MLX5_CAP_GEN(mdev, port_type);
ll = mlx5_port_type_cap_to_rdma_ll(port_type_cap);
 
-   /* don't create IB instance over Eth ports, no RoCE yet! */
-   if (ll == IB_LINK_LAYER_ETHERNET)
+   if ((ll == IB_LINK_LAYER_ETHERNET) && !MLX5_CAP_GEN(mdev, roce))
return NULL;
 
printk_once(KERN_INFO "%s", mlx5_version);
-- 
2.1.0



[PATCH for-next V1 06/10] IB/mlx5: Extend query_device/port to support RoCE

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Using the vport access functions to retrieve the Ethernet
specific information and return this information in
ib_query_device and ib_query_port.

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c | 75 +++
 include/linux/mlx5/driver.h   |  7 
 2 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 48da17d..7f34b5e 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -120,6 +121,50 @@ static struct net_device *mlx5_ib_get_netdev(struct 
ib_device *device,
return ndev;
 }
 
+static int mlx5_query_port_roce(struct ib_device *device, u8 port_num,
+   struct ib_port_attr *props)
+{
+   struct mlx5_ib_dev *dev = to_mdev(device);
+   struct net_device *ndev;
+   enum ib_mtu ndev_ib_mtu;
+
+   memset(props, 0, sizeof(*props));
+
+   props->port_cap_flags  |= IB_PORT_CM_SUP;
+   props->port_cap_flags  |= IB_PORT_IP_BASED_GIDS;
+
+   props->gid_tbl_len  = MLX5_CAP_ROCE(dev->mdev,
+   roce_address_table_size);
+   props->max_mtu  = IB_MTU_4096;
+   props->max_msg_sz   = 1 << MLX5_CAP_GEN(dev->mdev, log_max_msg);
+   props->pkey_tbl_len = 1;
+   props->state= IB_PORT_DOWN;
+   props->phys_state   = 3;
+
+   mlx5_query_nic_vport_qkey_viol_cntr(dev->mdev,
+   (u16 *)>qkey_viol_cntr);
+
+   ndev = mlx5_ib_get_netdev(device, port_num);
+   if (!ndev)
+   return 0;
+
+   if (netif_running(ndev) && netif_carrier_ok(ndev)) {
+   props->state  = IB_PORT_ACTIVE;
+   props->phys_state = 5;
+   }
+
+   ndev_ib_mtu = iboe_get_mtu(ndev->mtu);
+
+   dev_put(ndev);
+
+   props->active_mtu   = min(props->max_mtu, ndev_ib_mtu);
+
+   props->active_width = IB_WIDTH_4X;  /* TODO */
+   props->active_speed = IB_SPEED_QDR; /* TODO */
+
+   return 0;
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
@@ -158,13 +203,21 @@ static int mlx5_query_system_image_guid(struct ib_device 
*ibdev,
 
case MLX5_VPORT_ACCESS_METHOD_HCA:
err = mlx5_query_hca_vport_system_image_guid(mdev, );
-   if (!err)
-   *sys_image_guid = cpu_to_be64(tmp);
-   return err;
+   break;
+
+   case MLX5_VPORT_ACCESS_METHOD_NIC:
+   err = mlx5_query_nic_vport_system_image_guid(mdev, );
+   break;
 
default:
return -EINVAL;
}
+
+   if (!err)
+   *sys_image_guid = cpu_to_be64(tmp);
+
+   return err;
+
 }
 
 static int mlx5_query_max_pkeys(struct ib_device *ibdev,
@@ -218,13 +271,20 @@ static int mlx5_query_node_guid(struct mlx5_ib_dev *dev,
 
case MLX5_VPORT_ACCESS_METHOD_HCA:
err = mlx5_query_hca_vport_node_guid(dev->mdev, );
-   if (!err)
-   *node_guid = cpu_to_be64(tmp);
-   return err;
+   break;
+
+   case MLX5_VPORT_ACCESS_METHOD_NIC:
+   err = mlx5_query_nic_vport_node_guid(dev->mdev, );
+   break;
 
default:
return -EINVAL;
}
+
+   if (!err)
+   *node_guid = cpu_to_be64(tmp);
+
+   return err;
 }
 
 struct mlx5_reg_node_desc {
@@ -516,6 +576,9 @@ int mlx5_ib_query_port(struct ib_device *ibdev, u8 port,
case MLX5_VPORT_ACCESS_METHOD_HCA:
return mlx5_query_hca_port(ibdev, port, props);
 
+   case MLX5_VPORT_ACCESS_METHOD_NIC:
+   return mlx5_query_port_roce(ibdev, port, props);
+
default:
return -EINVAL;
}
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 5c857f2..7b9c976 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -632,13 +632,6 @@ extern struct workqueue_struct *mlx5_core_wq;
.struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field),  
\
.struct_size_bytes   = sizeof((struct ib_unpacked_ ## header *)0)->field
 
-struct ib_field {
-   size_t struct_offset_bytes;
-   size_t struct_size_bytes;
-   intoffset_bits;
-   intsize_bits;
-};
-
 static inline struct mlx5_core_dev *pci2mlx5_core_dev(struct pci_dev *pdev)
 {
return pci_get_drvdata(pdev);
-- 
2.1.0



[PATCH for-next V1 08/10] IB/mlx5: Support IB device's callbacks for adding/deleting GIDs

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

These callbacks write into the mlx5 RoCE address table.
Upon del_gid we write a zero'd GID.

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c | 89 +++
 include/linux/mlx5/device.h   | 20 +
 2 files changed, 109 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 7f34b5e..6d160b5 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -165,6 +165,93 @@ static int mlx5_query_port_roce(struct ib_device *device, 
u8 port_num,
return 0;
 }
 
+static void ib_gid_to_mlx5_roce_addr(const union ib_gid *gid,
+const struct ib_gid_attr *attr,
+void *mlx5_addr)
+{
+#define MLX5_SET_RA(p, f, v) MLX5_SET(roce_addr_layout, p, f, v)
+   char *mlx5_addr_l3_addr = MLX5_ADDR_OF(roce_addr_layout, mlx5_addr,
+  source_l3_address);
+   void *mlx5_addr_mac = MLX5_ADDR_OF(roce_addr_layout, mlx5_addr,
+  source_mac_47_32);
+
+   if (!gid)
+   return;
+
+   ether_addr_copy(mlx5_addr_mac, attr->ndev->dev_addr);
+
+   if (is_vlan_dev(attr->ndev)) {
+   MLX5_SET_RA(mlx5_addr, vlan_valid, 1);
+   MLX5_SET_RA(mlx5_addr, vlan_id, vlan_dev_vlan_id(attr->ndev));
+   }
+
+   switch (attr->gid_type) {
+   case IB_GID_TYPE_IB:
+   MLX5_SET_RA(mlx5_addr, roce_version, MLX5_ROCE_VERSION_1);
+   break;
+   case IB_GID_TYPE_ROCE_UDP_ENCAP:
+   MLX5_SET_RA(mlx5_addr, roce_version, MLX5_ROCE_VERSION_2);
+   break;
+
+   default:
+   WARN_ON(true);
+   }
+
+   if (attr->gid_type != IB_GID_TYPE_IB) {
+   if (ipv6_addr_v4mapped((void *)gid))
+   MLX5_SET_RA(mlx5_addr, roce_l3_type,
+   MLX5_ROCE_L3_TYPE_IPV4);
+   else
+   MLX5_SET_RA(mlx5_addr, roce_l3_type,
+   MLX5_ROCE_L3_TYPE_IPV6);
+   }
+
+   if ((attr->gid_type == IB_GID_TYPE_IB) ||
+   !ipv6_addr_v4mapped((void *)gid))
+   memcpy(mlx5_addr_l3_addr, gid, sizeof(*gid));
+   else
+   memcpy(_addr_l3_addr[12], >raw[12], 4);
+}
+
+static int set_roce_addr(struct ib_device *device, u8 port_num,
+unsigned int index,
+const union ib_gid *gid,
+const struct ib_gid_attr *attr)
+{
+   struct mlx5_ib_dev *dev = to_mdev(device);
+   u32  in[MLX5_ST_SZ_DW(set_roce_address_in)];
+   u32 out[MLX5_ST_SZ_DW(set_roce_address_out)];
+   void *in_addr = MLX5_ADDR_OF(set_roce_address_in, in, roce_address);
+   enum rdma_link_layer ll = mlx5_ib_port_link_layer(device, port_num);
+
+   if (ll != IB_LINK_LAYER_ETHERNET)
+   return -EINVAL;
+
+   memset(in, 0, sizeof(in));
+
+   ib_gid_to_mlx5_roce_addr(gid, attr, in_addr);
+
+   MLX5_SET(set_roce_address_in, in, roce_address_index, index);
+   MLX5_SET(set_roce_address_in, in, opcode, MLX5_CMD_OP_SET_ROCE_ADDRESS);
+
+   memset(out, 0, sizeof(out));
+   return mlx5_cmd_exec(dev->mdev, in, sizeof(in), out, sizeof(out));
+}
+
+static int mlx5_ib_add_gid(struct ib_device *device, u8 port_num,
+  unsigned int index, const union ib_gid *gid,
+  const struct ib_gid_attr *attr,
+  __always_unused void **context)
+{
+   return set_roce_addr(device, port_num, index, gid, attr);
+}
+
+static int mlx5_ib_del_gid(struct ib_device *device, u8 port_num,
+  unsigned int index, __always_unused void **context)
+{
+   return set_roce_addr(device, port_num, index, NULL, NULL);
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
@@ -1499,6 +1586,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
if (ll == IB_LINK_LAYER_ETHERNET)
dev->ib_dev.get_netdev  = mlx5_ib_get_netdev;
dev->ib_dev.query_gid   = mlx5_ib_query_gid;
+   dev->ib_dev.add_gid = mlx5_ib_add_gid;
+   dev->ib_dev.del_gid = mlx5_ib_del_gid;
dev->ib_dev.query_pkey  = mlx5_ib_query_pkey;
dev->ib_dev.modify_device   = mlx5_ib_modify_device;
dev->ib_dev.modify_port = mlx5_ib_modify_port;
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 84aa7e0..ea4281b 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -279,6 +279,26 @@ enum {
 };
 
 enum {
+   MLX5_ROCE_VERSION_1 = 0,
+   MLX5_ROCE_VERSION_2 = 2,
+};
+
+enum {
+   MLX5_ROCE_VERSION_1_CAP = 1 

[PATCH for-next V1 04/10] net/mlx5_core: Introduce access functions to enable/disable RoCE

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

A mlx5 Ethernet port must be explicitly enabled for RoCE.
When RoCE is not enabled on the port, the NIC will refuse to create
QPs attached to it and incoming RoCE packets will be considered by the
NIC as plain Ethernet packets.
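
The consequence for the IB side is a strict bracket around the port's
lifetime: enable RoCE on the vport before exposing it, disable on teardown,
and unwind if enabling fails. A stand-alone sketch of that ordering (all
functions are placeholders):

#include <stdio.h>

static int  register_netdev_notifier(void)   { return 0; }	/* placeholder */
static void unregister_netdev_notifier(void) { }		/* placeholder */
static int  vport_enable_roce(void)          { return 0; }	/* placeholder */
static void vport_disable_roce(void)         { }		/* placeholder */

static int enable_roce(void)
{
	int err = register_netdev_notifier();

	if (err)
		return err;
	err = vport_enable_roce();
	if (err)
		unregister_netdev_notifier();	/* unwind on failure */
	return err;
}

static void disable_roce(void)
{
	vport_disable_roce();
	unregister_netdev_notifier();
}

int main(void)
{
	if (!enable_roce()) {
		printf("RoCE enabled on the vport\n");
		disable_roce();
	}
	return 0;
}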

Signed-off-by: Achiad Shochat 
---
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 52 +
 include/linux/mlx5/vport.h  |  3 ++
 2 files changed, 55 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c 
b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 54ab63b..245ff4a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -70,6 +70,17 @@ static int mlx5_query_nic_vport_context(struct mlx5_core_dev 
*mdev, u32 *out,
return mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, outlen);
 }
 
+static int mlx5_modify_nic_vport_context(struct mlx5_core_dev *mdev, void *in,
+int inlen)
+{
+   u32 out[MLX5_ST_SZ_DW(modify_nic_vport_context_out)];
+
+   MLX5_SET(modify_nic_vport_context_in, in, opcode,
+MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT);
+
+   return mlx5_cmd_exec_check_status(mdev, in, inlen, out, sizeof(out));
+}
+
 void mlx5_query_nic_vport_mac_address(struct mlx5_core_dev *mdev, u8 *addr)
 {
u32 *out;
@@ -350,3 +361,44 @@ int mlx5_query_hca_vport_node_guid(struct mlx5_core_dev 
*dev,
return err;
 }
 EXPORT_SYMBOL_GPL(mlx5_query_hca_vport_node_guid);
+
+enum mlx5_vport_roce_state {
+   MLX5_VPORT_ROCE_DISABLED = 0,
+   MLX5_VPORT_ROCE_ENABLED  = 1,
+};
+
+static int mlx5_nic_vport_update_roce_state(struct mlx5_core_dev *mdev,
+   enum mlx5_vport_roce_state state)
+{
+   void *in;
+   int inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in);
+   int err;
+
+   in = mlx5_vzalloc(inlen);
+   if (!in) {
+   mlx5_core_warn(mdev, "failed to allocate inbox\n");
+   return -ENOMEM;
+   }
+
+   MLX5_SET(modify_nic_vport_context_in, in, field_select.roce_en, 1);
+   MLX5_SET(modify_nic_vport_context_in, in, nic_vport_context.roce_en,
+state);
+
+   err = mlx5_modify_nic_vport_context(mdev, in, inlen);
+
+   kvfree(in);
+
+   return err;
+}
+
+int mlx5_nic_vport_enable_roce(struct mlx5_core_dev *mdev)
+{
+   return mlx5_nic_vport_update_roce_state(mdev, MLX5_VPORT_ROCE_ENABLED);
+}
+EXPORT_SYMBOL_GPL(mlx5_nic_vport_enable_roce);
+
+int mlx5_nic_vport_disable_roce(struct mlx5_core_dev *mdev)
+{
+   return mlx5_nic_vport_update_roce_state(mdev, MLX5_VPORT_ROCE_DISABLED);
+}
+EXPORT_SYMBOL_GPL(mlx5_nic_vport_disable_roce);
diff --git a/include/linux/mlx5/vport.h b/include/linux/mlx5/vport.h
index 967e0fd..4c9ac60 100644
--- a/include/linux/mlx5/vport.h
+++ b/include/linux/mlx5/vport.h
@@ -52,4 +52,7 @@ int mlx5_query_hca_vport_system_image_guid(struct 
mlx5_core_dev *dev,
 int mlx5_query_hca_vport_node_guid(struct mlx5_core_dev *dev,
   u64 *node_guid);
 
+int mlx5_nic_vport_enable_roce(struct mlx5_core_dev *mdev);
+int mlx5_nic_vport_disable_roce(struct mlx5_core_dev *mdev);
+
 #endif /* __MLX5_VPORT_H__ */
-- 
2.1.0



[PATCH for-next V1 07/10] IB/mlx5: Set network_hdr_type upon RoCE responder completion

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

When handling a responder completion, if the link layer is Ethernet,
set the work completion network_hdr_type field according to CQE's
info and the IB_WC_WITH_NETWORK_HDR_TYPE flag.
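
The mapping itself is a two-bit decode of the CQE; a stand-alone sketch (the
field values mirror the hunk below, the NET_* names are local stand-ins for
the core RDMA_NETWORK_* enum):

#include <stdio.h>

enum { L3_HDR_GRH = 0x0, L3_HDR_IPV6 = 0x1, L3_HDR_IPV4 = 0x2 };	/* CQE bits */
enum { NET_IB, NET_IPV4, NET_IPV6, NET_UNKNOWN };	/* stand-ins for RDMA_NETWORK_* */

static int decode_network_hdr(unsigned char sl)
{
	switch (sl & 0x3) {
	case L3_HDR_GRH:
		return NET_IB;
	case L3_HDR_IPV6:
		return NET_IPV6;
	case L3_HDR_IPV4:
		return NET_IPV4;
	default:
		return NET_UNKNOWN;	/* 0x3: not produced by this mapping */
	}
}

int main(void)
{
	printf("cqe sl bits 0x2 -> network_hdr_type %d (IPv4)\n",
	       decode_network_hdr(0x2));
	return 0;
}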

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/cq.c | 17 +
 include/linux/mlx5/device.h |  6 ++
 2 files changed, 23 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 3dfd287..3ce5cfa7 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -171,6 +171,7 @@ enum {
 static void handle_responder(struct ib_wc *wc, struct mlx5_cqe64 *cqe,
 struct mlx5_ib_qp *qp)
 {
+   enum rdma_link_layer ll = rdma_port_get_link_layer(qp->ibqp.device, 1);
struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.device);
struct mlx5_ib_srq *srq;
struct mlx5_ib_wq *wq;
@@ -236,6 +237,22 @@ static void handle_responder(struct ib_wc *wc, struct 
mlx5_cqe64 *cqe,
} else {
wc->pkey_index = 0;
}
+
+   if (ll != IB_LINK_LAYER_ETHERNET)
+   return;
+
+   switch (wc->sl & 0x3) {
+   case MLX5_CQE_ROCE_L3_HEADER_TYPE_GRH:
+   wc->network_hdr_type = RDMA_NETWORK_IB;
+   break;
+   case MLX5_CQE_ROCE_L3_HEADER_TYPE_IPV6:
+   wc->network_hdr_type = RDMA_NETWORK_IPV6;
+   break;
+   case MLX5_CQE_ROCE_L3_HEADER_TYPE_IPV4:
+   wc->network_hdr_type = RDMA_NETWORK_IPV4;
+   break;
+   }
+   wc->wc_flags |= IB_WC_WITH_NETWORK_HDR_TYPE;
 }
 
 static void dump_cqe(struct mlx5_ib_dev *dev, struct mlx5_err_cqe *cqe)
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 0b473cb..84aa7e0 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -629,6 +629,12 @@ enum {
 };
 
 enum {
+   MLX5_CQE_ROCE_L3_HEADER_TYPE_GRH= 0x0,
+   MLX5_CQE_ROCE_L3_HEADER_TYPE_IPV6   = 0x1,
+   MLX5_CQE_ROCE_L3_HEADER_TYPE_IPV4   = 0x2,
+};
+
+enum {
CQE_L2_OK   = 1 << 0,
CQE_L3_OK   = 1 << 1,
CQE_L4_OK   = 1 << 2,
-- 
2.1.0



Re: [PATCH for-next V3 01/11] IB/core: Add gid_type to gid attribute

2015-12-23 Thread Matan Barak
On Wed, Dec 23, 2015 at 3:40 PM, Leon Romanovsky <l...@leon.nu> wrote:
> On Wed, Dec 23, 2015 at 02:56:47PM +0200, Matan Barak wrote:
>> In order to support multiple GID types, we need to store the gid_type
>> with each GID. This is also aligned with the RoCE v2 annex "RoCEv2 PORT
>> GID table entries shall have a "GID type" attribute that denotes the L3
>> Address type". The currently supported GID is IB_GID_TYPE_IB which is
>> also RoCE v1 GID type.
>>
>> This implies that gid_type should be added to roce_gid_table meta-data.
>>
>> Signed-off-by: Matan Barak <mat...@mellanox.com>
>> ---
>>  drivers/infiniband/core/cache.c   |  144 
>> +++-
>>  drivers/infiniband/core/cm.c  |2 +-
>>  drivers/infiniband/core/cma.c |3 +-
>>  drivers/infiniband/core/core_priv.h   |4 +
>>  drivers/infiniband/core/device.c  |9 ++-
>>  drivers/infiniband/core/multicast.c   |2 +-
>>  drivers/infiniband/core/roce_gid_mgmt.c   |   60 ++--
>>  drivers/infiniband/core/sa_query.c|5 +-
>>  drivers/infiniband/core/uverbs_marshall.c |1 +
>>  drivers/infiniband/core/verbs.c   |1 +
>>  include/rdma/ib_cache.h   |4 +
>>  include/rdma/ib_sa.h  |1 +
>>  include/rdma/ib_verbs.h   |   11 ++-
>>  13 files changed, 185 insertions(+), 62 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/cache.c 
>> b/drivers/infiniband/core/cache.c
>> index 097e9df..566fd8f 100644
>> --- a/drivers/infiniband/core/cache.c
>> +++ b/drivers/infiniband/core/cache.c
>> @@ -64,6 +64,7 @@ enum gid_attr_find_mask {
>>   GID_ATTR_FIND_MASK_GID  = 1UL << 0,
>>   GID_ATTR_FIND_MASK_NETDEV   = 1UL << 1,
>>   GID_ATTR_FIND_MASK_DEFAULT  = 1UL << 2,
>> + GID_ATTR_FIND_MASK_GID_TYPE = 1UL << 3,
>>  };
>>
>>  enum gid_table_entry_props {
>> @@ -125,6 +126,19 @@ static void dispatch_gid_change_event(struct ib_device 
>> *ib_dev, u8 port)
>>   }
>>  }
>>
>> +static const char * const gid_type_str[] = {
>^^ ^^
> IMHO, The white spaces can be a little bit confusing to understand.
>

Note the double const - I think it's clearer that way.
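
For readers less familiar with the idiom, a small illustration (not from the
patch) of what the second const buys:

/* One const: the strings are read-only, but the slots can be re-pointed. */
static const char *tbl_single[] = { "x", "y" };
/* Two consts: the slots are read-only too, so the table can live in .rodata. */
static const char * const tbl_double[] = { "x", "y" };

static void const_demo(void)
{
	tbl_single[0] = "z";	/* compiles */
	/* tbl_double[0] = "z";	   error: assignment of read-only location */
}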

>> + [IB_GID_TYPE_IB]= "IB/RoCE v1",
>> +};
>> +
>> +const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
>> +{
>> + if (gid_type < ARRAY_SIZE(gid_type_str) && gid_type_str[gid_type])
> Why do you need to check second condition?

It lets us leave a gap in the table for an invalid GID type (an entry
with no string). Anyway, we could remove this check in an incremental
future patch if that's really important.
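
To make that concrete, a small illustration (hypothetical table, not from the
patch; ARRAY_SIZE is the usual kernel macro) of why index-in-range alone is
not enough once a gap exists:

/* Designated initializers leave any skipped slot as NULL, so the
 * string-pointer check guards against such a gap.
 */
static const char * const example_type_str[] = {
	[0] = "type zero",
	[2] = "type two",	/* slot 1 stays NULL */
};

static const char *example_type_to_str(unsigned int t)
{
	if (t < ARRAY_SIZE(example_type_str) && example_type_str[t])
		return example_type_str[t];

	return "Invalid type";
}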

>> + return gid_type_str[gid_type];
>> +
>> + return "Invalid GID type";
>> +}
>> +EXPORT_SYMBOL(ib_cache_gid_type_str);
>> +
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/cma: cma_match_net_dev needs to take into account port_num

2015-12-23 Thread Matan Barak
On Wed, Dec 23, 2015 at 6:08 PM, Doug Ledford <dledf...@redhat.com> wrote:
> On 12/22/2015 02:26 PM, Matan Barak wrote:
>> On Tue, Dec 22, 2015 at 8:58 PM, Doug Ledford <dledf...@redhat.com> wrote:
>>> On 12/22/2015 05:47 AM, Or Gerlitz wrote:
>>>> On 12/21/2015 5:01 PM, Matan Barak wrote:
>>>>> Previously, cma_match_net_dev called cma_protocol_roce which
>>>>> tried to verify that the IB device uses RoCE protocol. However,
>>>>> if rdma_id didn't have a bounded port, it used the first port
>>>>> of the device.
>>>>>
>>>>> In VPI systems, the first port might be an IB port while the second
>>>>> one could be an Ethernet port. This made requests for unbounded rdma_ids
>>>>> that come from the Ethernet port fail.
>>>>> Fixing this by passing the port of the request and checking this port
>>>>> of the device.
>>>>>
>>>>> Fixes: b8cab5dab15f ('IB/cma: Accept connection without a valid netdev
>>>>> on RoCE')
>>>>> Signed-off-by: Matan Barak<mat...@mellanox.com>
>>>>
>>>> seems that the patch is missing from patchworks, I can't explain that.
>>>
>>> I've already downloaded it and marked it accepted.
>>>
>>
>> Thanks Doug. Would you like that I'll repost the patch with the commit
>> message changed as Or suggested or is the current version good enough?
>>
>> Regarding the Ethernet loopback issue, I started looking into that,
>> but as Or stated, it's broken even before the RoCE patches.
>
> Ping.  Any progress on this?

Yeah, there's some progress - the basic problem is that we don't have
a bound ndev and thus cma_resolve_iboe_route returns -ENODEV.
The root cause for this is that we have to store the ndev in
cma_bind_loopback. Even after doing that, cma_set_loopback changes the
sgid to be the localhost GID, which doesn't exist in the GID table and
thus will fail later in the GID lookup.
I think that regarding loopback, we actually want to send the data on
the link local default GID, which is guaranteed to exist. That's why I
think we should:
1. Change the cma_src_addr and cma_dst_addr in cma_bind_loopback to be
the default GID.
2. Store the associated ndev of this default GID as the bound device.
3. In cma_resolve_loopback, get the MAC of this bound device and
store it as the DMAC.
4. In cma_resolve_iboe_route, don't try to do route resolve if the
dGID matches the default GID.

It's still not working though, but this is where I'm headed. What do you think?
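
A rough, untested sketch of step 4 above - the helper name
cma_dgid_is_default() is made up, and it assumes the port's default
(link-local) GID sits at index 0 of the GID table, as reserved by the RoCE
GID management code:

static bool cma_dgid_is_default(struct rdma_id_private *id_priv,
				const union ib_gid *dgid)
{
	union ib_gid default_gid;

	/* assumes steps 1-2 above already bound a device and port */
	if (ib_get_cached_gid(id_priv->id.device, id_priv->id.port_num, 0,
			      &default_gid, NULL))
		return false;

	return !memcmp(dgid, &default_gid, sizeof(default_gid));
}

cma_resolve_iboe_route() would then skip the ndev/route lookup whenever this
returns true.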


>
>
> --
> Doug Ledford <dledf...@redhat.com>
>   GPG KeyID: 0E572FDD
>
>

Regards,
Matan


[PATCH for-next V1 00/10] Add RoCE support to the mlx5 driver

2015-12-23 Thread Matan Barak
Hi Doug,

This patchset adds RoCE V1 and RoCE V2 support to the mlx5 device
driver.

This patchset was applied and tested over the third version of
"Add RoCE v2 support".

Regards,
Achiad

Changes from V0:
 - Fixed using rwlock before initializing it.
 - Rebased over Doug's k.o/for-4.5 branch.

Achiad Shochat (10):
  IB/mlx5: Support IB device's callback for getting the link layer
  IB/mlx5: Support IB device's callback for getting its netdev
  net/mlx5_core: Break down the vport mac address query function
  net/mlx5_core: Introduce access functions to enable/disable RoCE
  net/mlx5_core: Introduce access functions to query vport RoCE fields
  IB/mlx5: Extend query_device/port to support RoCE
  IB/mlx5: Set network_hdr_type upon RoCE responder completion
  IB/mlx5: Support IB device's callbacks for adding/deleting GIDs
  IB/mlx5: Add RoCE fields to Address Vector
  IB/mlx5: Support RoCE

 drivers/infiniband/hw/mlx5/ah.c |  32 ++-
 drivers/infiniband/hw/mlx5/cq.c |  17 ++
 drivers/infiniband/hw/mlx5/main.c   | 318 ++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h|  15 +-
 drivers/infiniband/hw/mlx5/qp.c |  42 +++-
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 139 ++-
 include/linux/mlx5/device.h |  26 ++
 include/linux/mlx5/driver.h |   7 -
 include/linux/mlx5/mlx5_ifc.h   |  10 +-
 include/linux/mlx5/qp.h |  21 +-
 include/linux/mlx5/vport.h  |   8 +
 11 files changed, 578 insertions(+), 57 deletions(-)

-- 
2.1.0



[PATCH for-next V1 01/10] IB/mlx5: Support IB device's callback for getting the link layer

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Make the existing mlx5_ib_port_link_layer() signature match
the ib device callback signature (add port_num parameter).
Refactor it to use a sub function so that the link layer could
be queried also before the ibdev is created.

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index a51b594..de3a6b4 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -64,11 +64,9 @@ static char mlx5_version[] =
DRIVER_VERSION " (" DRIVER_RELDATE ")\n";
 
 static enum rdma_link_layer
-mlx5_ib_port_link_layer(struct ib_device *device)
+mlx5_port_type_cap_to_rdma_ll(int port_type_cap)
 {
-   struct mlx5_ib_dev *dev = to_mdev(device);
-
-   switch (MLX5_CAP_GEN(dev->mdev, port_type)) {
+   switch (port_type_cap) {
case MLX5_CAP_PORT_TYPE_IB:
return IB_LINK_LAYER_INFINIBAND;
case MLX5_CAP_PORT_TYPE_ETH:
@@ -78,6 +76,15 @@ mlx5_ib_port_link_layer(struct ib_device *device)
}
 }
 
+static enum rdma_link_layer
+mlx5_ib_port_link_layer(struct ib_device *device, u8 port_num)
+{
+   struct mlx5_ib_dev *dev = to_mdev(device);
+   int port_type_cap = MLX5_CAP_GEN(dev->mdev, port_type);
+
+   return mlx5_port_type_cap_to_rdma_ll(port_type_cap);
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
@@ -94,7 +101,7 @@ static int mlx5_get_vport_access_method(struct ib_device 
*ibdev)
if (mlx5_use_mad_ifc(to_mdev(ibdev)))
return MLX5_VPORT_ACCESS_METHOD_MAD;
 
-   if (mlx5_ib_port_link_layer(ibdev) ==
+   if (mlx5_ib_port_link_layer(ibdev, 1) ==
IB_LINK_LAYER_ETHERNET)
return MLX5_VPORT_ACCESS_METHOD_NIC;
 
@@ -1306,11 +1313,16 @@ static int mlx5_port_immutable(struct ib_device *ibdev, 
u8 port_num,
 static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 {
struct mlx5_ib_dev *dev;
+   enum rdma_link_layer ll;
+   int port_type_cap;
int err;
int i;
 
+   port_type_cap = MLX5_CAP_GEN(mdev, port_type);
+   ll = mlx5_port_type_cap_to_rdma_ll(port_type_cap);
+
/* don't create IB instance over Eth ports, no RoCE yet! */
-   if (MLX5_CAP_GEN(mdev, port_type) == MLX5_CAP_PORT_TYPE_ETH)
+   if (ll == IB_LINK_LAYER_ETHERNET)
return NULL;
 
printk_once(KERN_INFO "%s", mlx5_version);
@@ -1373,6 +1385,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
(1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE);
 
dev->ib_dev.query_port  = mlx5_ib_query_port;
+   dev->ib_dev.get_link_layer  = mlx5_ib_port_link_layer;
dev->ib_dev.query_gid   = mlx5_ib_query_gid;
dev->ib_dev.query_pkey  = mlx5_ib_query_pkey;
dev->ib_dev.modify_device   = mlx5_ib_modify_device;
-- 
2.1.0



[PATCH for-next V1 02/10] IB/mlx5: Support IB device's callback for getting its netdev

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

For Eth ports only:
Maintain a net device pointer in mlx5_ib_device and update it
upon NETDEV_REGISTER and NETDEV_UNREGISTER events if the
net-device and IB device have the same PCI parent device.
Implement the get_netdev callback to return this net device.
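
For context (not part of the patch), a caller of the new callback is expected
to honour the hold/put contract; a sketch with a made-up function name:

/* The driver returns the netdev with a reference held (dev_hold),
 * so the caller must dev_put() it when done.
 */
static void example_report_ndev(struct ib_device *device, u8 port_num)
{
	struct net_device *ndev;

	if (!device->get_netdev)
		return;

	ndev = device->get_netdev(device, port_num);
	if (!ndev)
		return;

	pr_info("%s port %u is backed by %s\n",
		device->name, port_num, ndev->name);
	dev_put(ndev);
}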

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c| 64 +++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 10 ++
 2 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index de3a6b4..48da17d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -85,6 +85,41 @@ mlx5_ib_port_link_layer(struct ib_device *device, u8 
port_num)
return mlx5_port_type_cap_to_rdma_ll(port_type_cap);
 }
 
+static int mlx5_netdev_event(struct notifier_block *this,
+unsigned long event, void *ptr)
+{
+   struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
+   struct mlx5_ib_dev *ibdev = container_of(this, struct mlx5_ib_dev,
+roce.nb);
+
+   if ((event != NETDEV_UNREGISTER) && (event != NETDEV_REGISTER))
+   return NOTIFY_DONE;
+
+   write_lock(&ibdev->roce.netdev_lock);
+   if (ndev->dev.parent == &ibdev->mdev->pdev->dev)
+   ibdev->roce.netdev = (event == NETDEV_UNREGISTER) ? NULL : ndev;
+   write_unlock(&ibdev->roce.netdev_lock);
+
+   return NOTIFY_DONE;
+}
+
+static struct net_device *mlx5_ib_get_netdev(struct ib_device *device,
+u8 port_num)
+{
+   struct mlx5_ib_dev *ibdev = to_mdev(device);
+   struct net_device *ndev;
+
+   /* Ensure ndev does not disappear before we invoke dev_hold()
+*/
+   read_lock(&ibdev->roce.netdev_lock);
+   ndev = ibdev->roce.netdev;
+   if (ndev)
+   dev_hold(ndev);
+   read_unlock(&ibdev->roce.netdev_lock);
+
+   return ndev;
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
@@ -1310,6 +1345,17 @@ static int mlx5_port_immutable(struct ib_device *ibdev, 
u8 port_num,
return 0;
 }
 
+static int mlx5_enable_roce(struct mlx5_ib_dev *dev)
+{
+   dev->roce.nb.notifier_call = mlx5_netdev_event;
+   return register_netdevice_notifier(&dev->roce.nb);
+}
+
+static void mlx5_disable_roce(struct mlx5_ib_dev *dev)
+{
+   unregister_netdevice_notifier(&dev->roce.nb);
+}
+
 static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 {
struct mlx5_ib_dev *dev;
@@ -1337,6 +1383,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
if (err)
goto err_dealloc;
 
+   rwlock_init(&dev->roce.netdev_lock);
err = get_port_caps(dev);
if (err)
goto err_dealloc;
@@ -1386,6 +1433,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 
dev->ib_dev.query_port  = mlx5_ib_query_port;
dev->ib_dev.get_link_layer  = mlx5_ib_port_link_layer;
+   if (ll == IB_LINK_LAYER_ETHERNET)
+   dev->ib_dev.get_netdev  = mlx5_ib_get_netdev;
dev->ib_dev.query_gid   = mlx5_ib_query_gid;
dev->ib_dev.query_pkey  = mlx5_ib_query_pkey;
dev->ib_dev.modify_device   = mlx5_ib_modify_device;
@@ -1442,9 +1491,15 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 
mutex_init(&dev->cap_mask_mutex);
 
+   if (ll == IB_LINK_LAYER_ETHERNET) {
+   err = mlx5_enable_roce(dev);
+   if (err)
+   goto err_dealloc;
+   }
+
err = create_dev_resources(&dev->devr);
if (err)
-   goto err_dealloc;
+   goto err_disable_roce;
 
err = mlx5_ib_odp_init_one(dev);
if (err)
@@ -1481,6 +1536,10 @@ err_odp:
 err_rsrc:
destroy_dev_resources(&dev->devr);
 
+err_disable_roce:
+   if (ll == IB_LINK_LAYER_ETHERNET)
+   mlx5_disable_roce(dev);
+
 err_dealloc:
ib_dealloc_device((struct ib_device *)dev);
 
@@ -1490,11 +1549,14 @@ err_dealloc:
 static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context)
 {
struct mlx5_ib_dev *dev = context;
+   enum rdma_link_layer ll = mlx5_ib_port_link_layer(&dev->ib_dev, 1);
 
ib_unregister_device(&dev->ib_dev);
destroy_umrc_res(dev);
mlx5_ib_odp_remove_one(dev);
destroy_dev_resources(&dev->devr);
+   if (ll == IB_LINK_LAYER_ETHERNET)
+   mlx5_disable_roce(dev);
ib_dealloc_device(&dev->ib_dev);
 }
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 6333472..1eaa611 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -407,9 +407,19 @@ struct mlx5_ib_resources {
struct ib_srq   *s1;
 };
 
+struct mlx5_roce {
+   /* Protect mlx5_ib_get_netdev from invoking dev_hold() with a 

Re: [PATCH for-next V1 00/10] Add RoCE support to the mlx5 driver

2015-12-23 Thread Matan Barak
On Wed, Dec 23, 2015 at 6:07 PM, Doug Ledford <dledf...@redhat.com> wrote:
> On 12/23/2015 08:17 AM, Matan Barak wrote:
>> Hi Doug,
>>
>> This patchset adds RoCE V1 and RoCE V2 support to the mlx5 device
>> driver.
>>
>> This patchset was applied and tested over the third version of
>> "Add RoCE v2 support".
>>
>> Regards,
>> Achiad
>>
>> Changes from V0:
>>  - Fixed using rwlock before initializing it.
>>  - Rebased over Doug's k.o/for-4.5 branch.
>>
>> Achiad Shochat (10):
>>   IB/mlx5: Support IB device's callback for getting the link layer
>>   IB/mlx5: Support IB device's callback for getting its netdev
>>   net/mlx5_core: Break down the vport mac address query function
>>   net/mlx5_core: Introduce access functions to enable/disable RoCE
>>   net/mlx5_core: Introduce access functions to query vport RoCE fields
>>   IB/mlx5: Extend query_device/port to support RoCE
>>   IB/mlx5: Set network_hdr_type upon RoCE responder completion
>>   IB/mlx5: Support IB device's callbacks for adding/deleting GIDs
>>   IB/mlx5: Add RoCE fields to Address Vector
>>   IB/mlx5: Support RoCE
>>
>>  drivers/infiniband/hw/mlx5/ah.c |  32 ++-
>>  drivers/infiniband/hw/mlx5/cq.c |  17 ++
>>  drivers/infiniband/hw/mlx5/main.c   | 318 
>> ++--
>>  drivers/infiniband/hw/mlx5/mlx5_ib.h|  15 +-
>>  drivers/infiniband/hw/mlx5/qp.c |  42 +++-
>>  drivers/net/ethernet/mellanox/mlx5/core/vport.c | 139 ++-
>>  include/linux/mlx5/device.h |  26 ++
>>  include/linux/mlx5/driver.h |   7 -
>>  include/linux/mlx5/mlx5_ifc.h   |  10 +-
>>  include/linux/mlx5/qp.h |  21 +-
>>  include/linux/mlx5/vport.h  |   8 +
>>  11 files changed, 578 insertions(+), 57 deletions(-)
>>
>
> This series doesn't apply to my tree even after the GID table lock
> series and the RoCEv2-V3 series.  I'm guessing you have some other mlx5
> specific series in your tree?
>

Strange. I'll respin them over your k.o/for-4.4-rc tree + the
required patches. Ok?

> --
> Doug Ledford <dledf...@redhat.com>
>   GPG KeyID: 0E572FDD
>
>



[PATCH for-next V2 02/10] IB/mlx5: Support IB device's callback for getting its netdev

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

For Eth ports only:
Maintain a net device pointer in mlx5_ib_device and update it
upon NETDEV_REGISTER and NETDEV_UNREGISTER events if the
net-device and IB device have the same PCI parent device.
Implement the get_netdev callback to return this net device.

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c| 64 +++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 10 ++
 2 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index a3295bb..8c6e144 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -85,6 +85,41 @@ mlx5_ib_port_link_layer(struct ib_device *device, u8 
port_num)
return mlx5_port_type_cap_to_rdma_ll(port_type_cap);
 }
 
+static int mlx5_netdev_event(struct notifier_block *this,
+unsigned long event, void *ptr)
+{
+   struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
+   struct mlx5_ib_dev *ibdev = container_of(this, struct mlx5_ib_dev,
+roce.nb);
+
+   if ((event != NETDEV_UNREGISTER) && (event != NETDEV_REGISTER))
+   return NOTIFY_DONE;
+
+   write_lock(&ibdev->roce.netdev_lock);
+   if (ndev->dev.parent == &ibdev->mdev->pdev->dev)
+   ibdev->roce.netdev = (event == NETDEV_UNREGISTER) ? NULL : ndev;
+   write_unlock(&ibdev->roce.netdev_lock);
+
+   return NOTIFY_DONE;
+}
+
+static struct net_device *mlx5_ib_get_netdev(struct ib_device *device,
+u8 port_num)
+{
+   struct mlx5_ib_dev *ibdev = to_mdev(device);
+   struct net_device *ndev;
+
+   /* Ensure ndev does not disappear before we invoke dev_hold()
+*/
+   read_lock(&ibdev->roce.netdev_lock);
+   ndev = ibdev->roce.netdev;
+   if (ndev)
+   dev_hold(ndev);
+   read_unlock(&ibdev->roce.netdev_lock);
+
+   return ndev;
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
@@ -1329,6 +1364,17 @@ static int mlx5_port_immutable(struct ib_device *ibdev, 
u8 port_num,
return 0;
 }
 
+static int mlx5_enable_roce(struct mlx5_ib_dev *dev)
+{
+   dev->roce.nb.notifier_call = mlx5_netdev_event;
+   return register_netdevice_notifier(&dev->roce.nb);
+}
+
+static void mlx5_disable_roce(struct mlx5_ib_dev *dev)
+{
+   unregister_netdevice_notifier(&dev->roce.nb);
+}
+
 static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 {
struct mlx5_ib_dev *dev;
@@ -1352,6 +1398,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 
dev->mdev = mdev;
 
+   rwlock_init(&dev->roce.netdev_lock);
err = get_port_caps(dev);
if (err)
goto err_dealloc;
@@ -1402,6 +1449,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
dev->ib_dev.query_device= mlx5_ib_query_device;
dev->ib_dev.query_port  = mlx5_ib_query_port;
dev->ib_dev.get_link_layer  = mlx5_ib_port_link_layer;
+   if (ll == IB_LINK_LAYER_ETHERNET)
+   dev->ib_dev.get_netdev  = mlx5_ib_get_netdev;
dev->ib_dev.query_gid   = mlx5_ib_query_gid;
dev->ib_dev.query_pkey  = mlx5_ib_query_pkey;
dev->ib_dev.modify_device   = mlx5_ib_modify_device;
@@ -1458,9 +1507,15 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 
mutex_init(&dev->cap_mask_mutex);
 
+   if (ll == IB_LINK_LAYER_ETHERNET) {
+   err = mlx5_enable_roce(dev);
+   if (err)
+   goto err_dealloc;
+   }
+
err = create_dev_resources(&dev->devr);
if (err)
-   goto err_dealloc;
+   goto err_disable_roce;
 
err = mlx5_ib_odp_init_one(dev);
if (err)
@@ -1497,6 +1552,10 @@ err_odp:
 err_rsrc:
destroy_dev_resources(&dev->devr);
 
+err_disable_roce:
+   if (ll == IB_LINK_LAYER_ETHERNET)
+   mlx5_disable_roce(dev);
+
 err_dealloc:
ib_dealloc_device((struct ib_device *)dev);
 
@@ -1506,11 +1565,14 @@ err_dealloc:
 static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context)
 {
struct mlx5_ib_dev *dev = context;
+   enum rdma_link_layer ll = mlx5_ib_port_link_layer(&dev->ib_dev, 1);
 
ib_unregister_device(&dev->ib_dev);
destroy_umrc_res(dev);
mlx5_ib_odp_remove_one(dev);
destroy_dev_resources(&dev->devr);
+   if (ll == IB_LINK_LAYER_ETHERNET)
+   mlx5_disable_roce(dev);
ib_dealloc_device(&dev->ib_dev);
 }
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 6333472..1eaa611 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -407,9 +407,19 @@ struct mlx5_ib_resources {
struct ib_srq   *s1;
 };
 
+struct mlx5_roce {
+   /* Protect 

[PATCH for-next V2 00/10] Add RoCE support to the mlx5 driver

2015-12-23 Thread Matan Barak
Hi Doug,

This patchset adds RoCE V1 and RoCE V2 support to the mlx5 device
driver.

This patchset was applied and tested over the third version of
"Add RoCE v2 support".

Regards,
Achiad and Matan

Changes from V1:
 - Rebased over Doug's k.o/for-4.4-rc branch.

Changes from V0:
 - Fixed using rwlock before initializing it.
 - Rebased over Doug's k.o/for-4.5 branch.


Achiad Shochat (10):
  IB/mlx5: Support IB device's callback for getting the link layer
  IB/mlx5: Support IB device's callback for getting its netdev
  net/mlx5_core: Break down the vport mac address query function
  net/mlx5_core: Introduce access functions to enable/disable RoCE
  net/mlx5_core: Introduce access functions to query vport RoCE fields
  IB/mlx5: Extend query_device/port to support RoCE
  IB/mlx5: Set network_hdr_type upon RoCE responder completion
  IB/mlx5: Support IB device's callbacks for adding/deleting GIDs
  IB/mlx5: Add RoCE fields to Address Vector
  IB/mlx5: Support RoCE

 drivers/infiniband/hw/mlx5/ah.c |  32 ++-
 drivers/infiniband/hw/mlx5/cq.c |  17 ++
 drivers/infiniband/hw/mlx5/main.c   | 318 ++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h|  15 +-
 drivers/infiniband/hw/mlx5/qp.c |  42 +++-
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 139 ++-
 include/linux/mlx5/device.h |  26 ++
 include/linux/mlx5/driver.h |   7 -
 include/linux/mlx5/mlx5_ifc.h   |  10 +-
 include/linux/mlx5/qp.h |  21 +-
 include/linux/mlx5/vport.h  |   8 +
 11 files changed, 578 insertions(+), 57 deletions(-)

-- 
2.1.0



[PATCH for-next V2 04/10] net/mlx5_core: Introduce access functions to enable/disable RoCE

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

An mlx5 Ethernet port must be explicitly enabled for RoCE.
When RoCE is not enabled on a port, the NIC refuses to create QPs
attached to it and treats incoming RoCE packets as plain Ethernet
packets.
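
A minimal usage sketch (function names made up; this mirrors what the last
patch of the series ends up doing in mlx5_ib): enable at init time and make
sure the teardown path disables again:

static int example_roce_setup(struct mlx5_core_dev *mdev)
{
	int err;

	err = mlx5_nic_vport_enable_roce(mdev);
	if (err)
		return err;

	/* ... register notifiers, populate the GID table, etc. ... */

	return 0;
}

static void example_roce_teardown(struct mlx5_core_dev *mdev)
{
	mlx5_nic_vport_disable_roce(mdev);
}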

Signed-off-by: Achiad Shochat 
---
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 52 +
 include/linux/mlx5/vport.h  |  3 ++
 2 files changed, 55 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c 
b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 54ab63b..245ff4a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -70,6 +70,17 @@ static int mlx5_query_nic_vport_context(struct mlx5_core_dev 
*mdev, u32 *out,
return mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, outlen);
 }
 
+static int mlx5_modify_nic_vport_context(struct mlx5_core_dev *mdev, void *in,
+int inlen)
+{
+   u32 out[MLX5_ST_SZ_DW(modify_nic_vport_context_out)];
+
+   MLX5_SET(modify_nic_vport_context_in, in, opcode,
+MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT);
+
+   return mlx5_cmd_exec_check_status(mdev, in, inlen, out, sizeof(out));
+}
+
 void mlx5_query_nic_vport_mac_address(struct mlx5_core_dev *mdev, u8 *addr)
 {
u32 *out;
@@ -350,3 +361,44 @@ int mlx5_query_hca_vport_node_guid(struct mlx5_core_dev 
*dev,
return err;
 }
 EXPORT_SYMBOL_GPL(mlx5_query_hca_vport_node_guid);
+
+enum mlx5_vport_roce_state {
+   MLX5_VPORT_ROCE_DISABLED = 0,
+   MLX5_VPORT_ROCE_ENABLED  = 1,
+};
+
+static int mlx5_nic_vport_update_roce_state(struct mlx5_core_dev *mdev,
+   enum mlx5_vport_roce_state state)
+{
+   void *in;
+   int inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in);
+   int err;
+
+   in = mlx5_vzalloc(inlen);
+   if (!in) {
+   mlx5_core_warn(mdev, "failed to allocate inbox\n");
+   return -ENOMEM;
+   }
+
+   MLX5_SET(modify_nic_vport_context_in, in, field_select.roce_en, 1);
+   MLX5_SET(modify_nic_vport_context_in, in, nic_vport_context.roce_en,
+state);
+
+   err = mlx5_modify_nic_vport_context(mdev, in, inlen);
+
+   kvfree(in);
+
+   return err;
+}
+
+int mlx5_nic_vport_enable_roce(struct mlx5_core_dev *mdev)
+{
+   return mlx5_nic_vport_update_roce_state(mdev, MLX5_VPORT_ROCE_ENABLED);
+}
+EXPORT_SYMBOL_GPL(mlx5_nic_vport_enable_roce);
+
+int mlx5_nic_vport_disable_roce(struct mlx5_core_dev *mdev)
+{
+   return mlx5_nic_vport_update_roce_state(mdev, MLX5_VPORT_ROCE_DISABLED);
+}
+EXPORT_SYMBOL_GPL(mlx5_nic_vport_disable_roce);
diff --git a/include/linux/mlx5/vport.h b/include/linux/mlx5/vport.h
index 967e0fd..4c9ac60 100644
--- a/include/linux/mlx5/vport.h
+++ b/include/linux/mlx5/vport.h
@@ -52,4 +52,7 @@ int mlx5_query_hca_vport_system_image_guid(struct 
mlx5_core_dev *dev,
 int mlx5_query_hca_vport_node_guid(struct mlx5_core_dev *dev,
   u64 *node_guid);
 
+int mlx5_nic_vport_enable_roce(struct mlx5_core_dev *mdev);
+int mlx5_nic_vport_disable_roce(struct mlx5_core_dev *mdev);
+
 #endif /* __MLX5_VPORT_H__ */
-- 
2.1.0



[PATCH for-next V2 01/10] IB/mlx5: Support IB device's callback for getting the link layer

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Make the existing mlx5_ib_port_link_layer() signature match
the ib device callback signature (add port_num parameter).
Refactor it to use a sub function so that the link layer could
be queried also before the ibdev is created.

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 7e97cb5..a3295bb 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -64,11 +64,9 @@ static char mlx5_version[] =
DRIVER_VERSION " (" DRIVER_RELDATE ")\n";
 
 static enum rdma_link_layer
-mlx5_ib_port_link_layer(struct ib_device *device)
+mlx5_port_type_cap_to_rdma_ll(int port_type_cap)
 {
-   struct mlx5_ib_dev *dev = to_mdev(device);
-
-   switch (MLX5_CAP_GEN(dev->mdev, port_type)) {
+   switch (port_type_cap) {
case MLX5_CAP_PORT_TYPE_IB:
return IB_LINK_LAYER_INFINIBAND;
case MLX5_CAP_PORT_TYPE_ETH:
@@ -78,6 +76,15 @@ mlx5_ib_port_link_layer(struct ib_device *device)
}
 }
 
+static enum rdma_link_layer
+mlx5_ib_port_link_layer(struct ib_device *device, u8 port_num)
+{
+   struct mlx5_ib_dev *dev = to_mdev(device);
+   int port_type_cap = MLX5_CAP_GEN(dev->mdev, port_type);
+
+   return mlx5_port_type_cap_to_rdma_ll(port_type_cap);
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
@@ -94,7 +101,7 @@ static int mlx5_get_vport_access_method(struct ib_device 
*ibdev)
if (mlx5_use_mad_ifc(to_mdev(ibdev)))
return MLX5_VPORT_ACCESS_METHOD_MAD;
 
-   if (mlx5_ib_port_link_layer(ibdev) ==
+   if (mlx5_ib_port_link_layer(ibdev, 1) ==
IB_LINK_LAYER_ETHERNET)
return MLX5_VPORT_ACCESS_METHOD_NIC;
 
@@ -1325,11 +1332,16 @@ static int mlx5_port_immutable(struct ib_device *ibdev, 
u8 port_num,
 static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 {
struct mlx5_ib_dev *dev;
+   enum rdma_link_layer ll;
+   int port_type_cap;
int err;
int i;
 
+   port_type_cap = MLX5_CAP_GEN(mdev, port_type);
+   ll = mlx5_port_type_cap_to_rdma_ll(port_type_cap);
+
/* don't create IB instance over Eth ports, no RoCE yet! */
-   if (MLX5_CAP_GEN(mdev, port_type) == MLX5_CAP_PORT_TYPE_ETH)
+   if (ll == IB_LINK_LAYER_ETHERNET)
return NULL;
 
printk_once(KERN_INFO "%s", mlx5_version);
@@ -1389,6 +1401,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 
dev->ib_dev.query_device= mlx5_ib_query_device;
dev->ib_dev.query_port  = mlx5_ib_query_port;
+   dev->ib_dev.get_link_layer  = mlx5_ib_port_link_layer;
dev->ib_dev.query_gid   = mlx5_ib_query_gid;
dev->ib_dev.query_pkey  = mlx5_ib_query_pkey;
dev->ib_dev.modify_device   = mlx5_ib_modify_device;
-- 
2.1.0



[PATCH for-next V2 05/10] net/mlx5_core: Introduce access functions to query vport RoCE fields

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Introduce access functions to query NIC vport system_image_guid,
node_guid and qkey_viol_cntr.

Signed-off-by: Achiad Shochat 
---
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 62 +
 include/linux/mlx5/mlx5_ifc.h   | 10 +++-
 include/linux/mlx5/vport.h  |  5 ++
 3 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c 
b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 245ff4a..ecb274a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -103,6 +103,68 @@ void mlx5_query_nic_vport_mac_address(struct mlx5_core_dev 
*mdev, u8 *addr)
 }
 EXPORT_SYMBOL(mlx5_query_nic_vport_mac_address);
 
+int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev,
+  u64 *system_image_guid)
+{
+   u32 *out;
+   int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out);
+
+   out = mlx5_vzalloc(outlen);
+   if (!out)
+   return -ENOMEM;
+
+   mlx5_query_nic_vport_context(mdev, out, outlen);
+
+   *system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
+   nic_vport_context.system_image_guid);
+
+   kfree(out);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_system_image_guid);
+
+int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid)
+{
+   u32 *out;
+   int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out);
+
+   out = mlx5_vzalloc(outlen);
+   if (!out)
+   return -ENOMEM;
+
+   mlx5_query_nic_vport_context(mdev, out, outlen);
+
+   *node_guid = MLX5_GET64(query_nic_vport_context_out, out,
+   nic_vport_context.node_guid);
+
+   kfree(out);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_node_guid);
+
+int mlx5_query_nic_vport_qkey_viol_cntr(struct mlx5_core_dev *mdev,
+   u16 *qkey_viol_cntr)
+{
+   u32 *out;
+   int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out);
+
+   out = mlx5_vzalloc(outlen);
+   if (!out)
+   return -ENOMEM;
+
+   mlx5_query_nic_vport_context(mdev, out, outlen);
+
+   *qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out,
+  nic_vport_context.qkey_violation_counter);
+
+   kfree(out);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_qkey_viol_cntr);
+
 int mlx5_query_hca_vport_gid(struct mlx5_core_dev *dev, u8 other_vport,
 u8 port_num, u16  vf_num, u16 gid_index,
 union ib_gid *gid)
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 1565324..49b34c6 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -2141,7 +2141,15 @@ struct mlx5_ifc_nic_vport_context_bits {
u8 reserved_0[0x1f];
u8 roce_en[0x1];
 
-   u8 reserved_1[0x760];
+   u8 reserved_1[0x120];
+
+   u8 system_image_guid[0x40];
+   u8 port_guid[0x40];
+   u8 node_guid[0x40];
+
+   u8 reserved_5[0x140];
+   u8 qkey_violation_counter[0x10];
+   u8 reserved_6[0x430];
 
u8 reserved_2[0x5];
u8 allowed_list_type[0x3];
diff --git a/include/linux/mlx5/vport.h b/include/linux/mlx5/vport.h
index 4c9ac60..dfb2d94 100644
--- a/include/linux/mlx5/vport.h
+++ b/include/linux/mlx5/vport.h
@@ -37,6 +37,11 @@
 
 u8 mlx5_query_vport_state(struct mlx5_core_dev *mdev, u8 opmod);
 void mlx5_query_nic_vport_mac_address(struct mlx5_core_dev *mdev, u8 *addr);
+int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev,
+  u64 *system_image_guid);
+int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid);
+int mlx5_query_nic_vport_qkey_viol_cntr(struct mlx5_core_dev *mdev,
+   u16 *qkey_viol_cntr);
 int mlx5_query_hca_vport_gid(struct mlx5_core_dev *dev, u8 other_vport,
 u8 port_num, u16  vf_num, u16 gid_index,
 union ib_gid *gid);
-- 
2.1.0



[PATCH for-next V2 03/10] net/mlx5_core: Break down the vport mac address query function

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Introduce a new function called mlx5_query_nic_vport_context().
This function gets all the NIC vport attributes from the device.

The MAC address is just one of the NIC vport attributes, so
mlx5_query_nic_vport_mac_address() is now just a wrapper function
above mlx5_query_nic_vport_context().

More NIC vport attributes will be used in following commits.

Signed-off-by: Achiad Shochat 
---
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 27 -
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c 
b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index b94177e..54ab63b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -57,12 +57,25 @@ u8 mlx5_query_vport_state(struct mlx5_core_dev *mdev, u8 
opmod)
 }
 EXPORT_SYMBOL(mlx5_query_vport_state);
 
+static int mlx5_query_nic_vport_context(struct mlx5_core_dev *mdev, u32 *out,
+   int outlen)
+{
+   u32 in[MLX5_ST_SZ_DW(query_nic_vport_context_in)];
+
+   memset(in, 0, sizeof(in));
+
+   MLX5_SET(query_nic_vport_context_in, in, opcode,
+MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT);
+
+   return mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, outlen);
+}
+
 void mlx5_query_nic_vport_mac_address(struct mlx5_core_dev *mdev, u8 *addr)
 {
-   u32  in[MLX5_ST_SZ_DW(query_nic_vport_context_in)];
u32 *out;
int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out);
u8 *out_addr;
+   int err;
 
out = mlx5_vzalloc(outlen);
if (!out)
@@ -71,15 +84,9 @@ void mlx5_query_nic_vport_mac_address(struct mlx5_core_dev 
*mdev, u8 *addr)
out_addr = MLX5_ADDR_OF(query_nic_vport_context_out, out,
nic_vport_context.permanent_address);
 
-   memset(in, 0, sizeof(in));
-
-   MLX5_SET(query_nic_vport_context_in, in, opcode,
-MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT);
-
-   memset(out, 0, outlen);
-   mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, outlen);
-
-   ether_addr_copy(addr, &out_addr[2]);
+   err = mlx5_query_nic_vport_context(mdev, out, outlen);
+   if (!err)
+   ether_addr_copy(addr, &out_addr[2]);
 
kvfree(out);
 }
-- 
2.1.0



[PATCH for-next V2 09/10] IB/mlx5: Add RoCE fields to Address Vector

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Set the address handle and QP address path fields according to the
link layer type (IB/Eth).

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/ah.c  | 32 +--
 drivers/infiniband/hw/mlx5/main.c| 21 ++
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  5 +++--
 drivers/infiniband/hw/mlx5/qp.c  | 42 ++--
 include/linux/mlx5/qp.h  | 21 --
 5 files changed, 96 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/ah.c b/drivers/infiniband/hw/mlx5/ah.c
index 6608058..745efa4 100644
--- a/drivers/infiniband/hw/mlx5/ah.c
+++ b/drivers/infiniband/hw/mlx5/ah.c
@@ -32,8 +32,10 @@
 
 #include "mlx5_ib.h"
 
-struct ib_ah *create_ib_ah(struct ib_ah_attr *ah_attr,
-  struct mlx5_ib_ah *ah)
+static struct ib_ah *create_ib_ah(struct mlx5_ib_dev *dev,
+ struct mlx5_ib_ah *ah,
+ struct ib_ah_attr *ah_attr,
+ enum rdma_link_layer ll)
 {
if (ah_attr->ah_flags & IB_AH_GRH) {
memcpy(ah->av.rgid, &ah_attr->grh.dgid, 16);
@@ -44,9 +46,20 @@ struct ib_ah *create_ib_ah(struct ib_ah_attr *ah_attr,
ah->av.tclass = ah_attr->grh.traffic_class;
}
 
-   ah->av.rlid = cpu_to_be16(ah_attr->dlid);
-   ah->av.fl_mlid = ah_attr->src_path_bits & 0x7f;
-   ah->av.stat_rate_sl = (ah_attr->static_rate << 4) | (ah_attr->sl & 0xf);
+   ah->av.stat_rate_sl = (ah_attr->static_rate << 4);
+
+   if (ll == IB_LINK_LAYER_ETHERNET) {
+   memcpy(ah->av.rmac, ah_attr->dmac, sizeof(ah_attr->dmac));
+   ah->av.udp_sport =
+   mlx5_get_roce_udp_sport(dev,
+   ah_attr->port_num,
+   ah_attr->grh.sgid_index);
+   ah->av.stat_rate_sl |= (ah_attr->sl & 0x7) << 1;
+   } else {
+   ah->av.rlid = cpu_to_be16(ah_attr->dlid);
+   ah->av.fl_mlid = ah_attr->src_path_bits & 0x7f;
+   ah->av.stat_rate_sl |= (ah_attr->sl & 0xf);
+   }
 
return &ah->ibah;
 }
@@ -54,12 +67,19 @@ struct ib_ah *create_ib_ah(struct ib_ah_attr *ah_attr,
 struct ib_ah *mlx5_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
 {
struct mlx5_ib_ah *ah;
+   struct mlx5_ib_dev *dev = to_mdev(pd->device);
+   enum rdma_link_layer ll;
+
+   ll = pd->device->get_link_layer(pd->device, ah_attr->port_num);
+
+   if (ll == IB_LINK_LAYER_ETHERNET && !(ah_attr->ah_flags & IB_AH_GRH))
+   return ERR_PTR(-EINVAL);
 
ah = kzalloc(sizeof(*ah), GFP_ATOMIC);
if (!ah)
return ERR_PTR(-ENOMEM);
 
-   return create_ib_ah(ah_attr, ah); /* never fails */
+   return create_ib_ah(dev, ah, ah_attr, ll); /* never fails */
 }
 
 int mlx5_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr)
diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index bbe92ca..3ee431a 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -252,6 +253,26 @@ static int mlx5_ib_del_gid(struct ib_device *device, u8 
port_num,
return set_roce_addr(device, port_num, index, NULL, NULL);
 }
 
+__be16 mlx5_get_roce_udp_sport(struct mlx5_ib_dev *dev, u8 port_num,
+  int index)
+{
+   struct ib_gid_attr attr;
+   union ib_gid gid;
+
+   if (ib_get_cached_gid(&dev->ib_dev, port_num, index, &gid, &attr))
+   return 0;
+
+   if (!attr.ndev)
+   return 0;
+
+   dev_put(attr.ndev);
+
+   if (attr.gid_type != IB_GID_TYPE_ROCE_UDP_ENCAP)
+   return 0;
+
+   return cpu_to_be16(MLX5_CAP_ROCE(dev->mdev, r_roce_min_src_udp_port));
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 1eaa611..b0deeb3 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -517,8 +517,6 @@ void mlx5_ib_free_srq_wqe(struct mlx5_ib_srq *srq, int 
wqe_index);
 int mlx5_MAD_IFC(struct mlx5_ib_dev *dev, int ignore_mkey, int ignore_bkey,
 u8 port, const struct ib_wc *in_wc, const struct ib_grh 
*in_grh,
 const void *in_mad, void *response_mad);
-struct ib_ah *create_ib_ah(struct ib_ah_attr *ah_attr,
-  struct mlx5_ib_ah *ah);
 struct ib_ah *mlx5_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr);
 int mlx5_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr);
 int mlx5_ib_destroy_ah(struct ib_ah *ah);
@@ -647,6 +645,9 @@ static inline void 

[PATCH for-next V2 08/10] IB/mlx5: Support IB device's callbacks for adding/deleting GIDs

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

These callbacks write into the mlx5 RoCE address table.
Upon del_gid we write a zero'd GID.
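
For orientation, a simplified (not the actual core code) view of how the IB
core GID cache drives these callbacks when a table entry is written or
cleared:

static const union ib_gid example_zgid;	/* all-zero GID */

static int example_write_gid(struct ib_device *dev, u8 port_num,
			     unsigned int index, const union ib_gid *gid,
			     const struct ib_gid_attr *attr, void **context)
{
	/* clearing a slot maps to del_gid, which zeroes the HW entry */
	if (!memcmp(gid, &example_zgid, sizeof(*gid)))
		return dev->del_gid(dev, port_num, index, context);

	return dev->add_gid(dev, port_num, index, gid, attr, context);
}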

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c | 89 +++
 include/linux/mlx5/device.h   | 20 +
 2 files changed, 109 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 2b6ac2e..bbe92ca 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -165,6 +165,93 @@ static int mlx5_query_port_roce(struct ib_device *device, 
u8 port_num,
return 0;
 }
 
+static void ib_gid_to_mlx5_roce_addr(const union ib_gid *gid,
+const struct ib_gid_attr *attr,
+void *mlx5_addr)
+{
+#define MLX5_SET_RA(p, f, v) MLX5_SET(roce_addr_layout, p, f, v)
+   char *mlx5_addr_l3_addr = MLX5_ADDR_OF(roce_addr_layout, mlx5_addr,
+  source_l3_address);
+   void *mlx5_addr_mac = MLX5_ADDR_OF(roce_addr_layout, mlx5_addr,
+  source_mac_47_32);
+
+   if (!gid)
+   return;
+
+   ether_addr_copy(mlx5_addr_mac, attr->ndev->dev_addr);
+
+   if (is_vlan_dev(attr->ndev)) {
+   MLX5_SET_RA(mlx5_addr, vlan_valid, 1);
+   MLX5_SET_RA(mlx5_addr, vlan_id, vlan_dev_vlan_id(attr->ndev));
+   }
+
+   switch (attr->gid_type) {
+   case IB_GID_TYPE_IB:
+   MLX5_SET_RA(mlx5_addr, roce_version, MLX5_ROCE_VERSION_1);
+   break;
+   case IB_GID_TYPE_ROCE_UDP_ENCAP:
+   MLX5_SET_RA(mlx5_addr, roce_version, MLX5_ROCE_VERSION_2);
+   break;
+
+   default:
+   WARN_ON(true);
+   }
+
+   if (attr->gid_type != IB_GID_TYPE_IB) {
+   if (ipv6_addr_v4mapped((void *)gid))
+   MLX5_SET_RA(mlx5_addr, roce_l3_type,
+   MLX5_ROCE_L3_TYPE_IPV4);
+   else
+   MLX5_SET_RA(mlx5_addr, roce_l3_type,
+   MLX5_ROCE_L3_TYPE_IPV6);
+   }
+
+   if ((attr->gid_type == IB_GID_TYPE_IB) ||
+   !ipv6_addr_v4mapped((void *)gid))
+   memcpy(mlx5_addr_l3_addr, gid, sizeof(*gid));
+   else
+   memcpy(&mlx5_addr_l3_addr[12], &gid->raw[12], 4);
+}
+
+static int set_roce_addr(struct ib_device *device, u8 port_num,
+unsigned int index,
+const union ib_gid *gid,
+const struct ib_gid_attr *attr)
+{
+   struct mlx5_ib_dev *dev = to_mdev(device);
+   u32  in[MLX5_ST_SZ_DW(set_roce_address_in)];
+   u32 out[MLX5_ST_SZ_DW(set_roce_address_out)];
+   void *in_addr = MLX5_ADDR_OF(set_roce_address_in, in, roce_address);
+   enum rdma_link_layer ll = mlx5_ib_port_link_layer(device, port_num);
+
+   if (ll != IB_LINK_LAYER_ETHERNET)
+   return -EINVAL;
+
+   memset(in, 0, sizeof(in));
+
+   ib_gid_to_mlx5_roce_addr(gid, attr, in_addr);
+
+   MLX5_SET(set_roce_address_in, in, roce_address_index, index);
+   MLX5_SET(set_roce_address_in, in, opcode, MLX5_CMD_OP_SET_ROCE_ADDRESS);
+
+   memset(out, 0, sizeof(out));
+   return mlx5_cmd_exec(dev->mdev, in, sizeof(in), out, sizeof(out));
+}
+
+static int mlx5_ib_add_gid(struct ib_device *device, u8 port_num,
+  unsigned int index, const union ib_gid *gid,
+  const struct ib_gid_attr *attr,
+  __always_unused void **context)
+{
+   return set_roce_addr(device, port_num, index, gid, attr);
+}
+
+static int mlx5_ib_del_gid(struct ib_device *device, u8 port_num,
+  unsigned int index, __always_unused void **context)
+{
+   return set_roce_addr(device, port_num, index, NULL, NULL);
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
@@ -1515,6 +1602,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
if (ll == IB_LINK_LAYER_ETHERNET)
dev->ib_dev.get_netdev  = mlx5_ib_get_netdev;
dev->ib_dev.query_gid   = mlx5_ib_query_gid;
+   dev->ib_dev.add_gid = mlx5_ib_add_gid;
+   dev->ib_dev.del_gid = mlx5_ib_del_gid;
dev->ib_dev.query_pkey  = mlx5_ib_query_pkey;
dev->ib_dev.modify_device   = mlx5_ib_modify_device;
dev->ib_dev.modify_port = mlx5_ib_modify_port;
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 84aa7e0..ea4281b 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -279,6 +279,26 @@ enum {
 };
 
 enum {
+   MLX5_ROCE_VERSION_1 = 0,
+   MLX5_ROCE_VERSION_2 = 2,
+};
+
+enum {
+   MLX5_ROCE_VERSION_1_CAP = 1 

[PATCH for-next V2 06/10] IB/mlx5: Extend query_device/port to support RoCE

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Using the vport access functions to retrieve the Ethernet
specific information and return this information in
ib_query_device and ib_query_port.

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c | 75 +++
 include/linux/mlx5/driver.h   |  7 
 2 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 8c6e144..2b6ac2e 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -120,6 +121,50 @@ static struct net_device *mlx5_ib_get_netdev(struct 
ib_device *device,
return ndev;
 }
 
+static int mlx5_query_port_roce(struct ib_device *device, u8 port_num,
+   struct ib_port_attr *props)
+{
+   struct mlx5_ib_dev *dev = to_mdev(device);
+   struct net_device *ndev;
+   enum ib_mtu ndev_ib_mtu;
+
+   memset(props, 0, sizeof(*props));
+
+   props->port_cap_flags  |= IB_PORT_CM_SUP;
+   props->port_cap_flags  |= IB_PORT_IP_BASED_GIDS;
+
+   props->gid_tbl_len  = MLX5_CAP_ROCE(dev->mdev,
+   roce_address_table_size);
+   props->max_mtu  = IB_MTU_4096;
+   props->max_msg_sz   = 1 << MLX5_CAP_GEN(dev->mdev, log_max_msg);
+   props->pkey_tbl_len = 1;
+   props->state= IB_PORT_DOWN;
+   props->phys_state   = 3;
+
+   mlx5_query_nic_vport_qkey_viol_cntr(dev->mdev,
+   (u16 *)&props->qkey_viol_cntr);
+
+   ndev = mlx5_ib_get_netdev(device, port_num);
+   if (!ndev)
+   return 0;
+
+   if (netif_running(ndev) && netif_carrier_ok(ndev)) {
+   props->state  = IB_PORT_ACTIVE;
+   props->phys_state = 5;
+   }
+
+   ndev_ib_mtu = iboe_get_mtu(ndev->mtu);
+
+   dev_put(ndev);
+
+   props->active_mtu   = min(props->max_mtu, ndev_ib_mtu);
+
+   props->active_width = IB_WIDTH_4X;  /* TODO */
+   props->active_speed = IB_SPEED_QDR; /* TODO */
+
+   return 0;
+}
+
 static int mlx5_use_mad_ifc(struct mlx5_ib_dev *dev)
 {
return !dev->mdev->issi;
@@ -158,13 +203,21 @@ static int mlx5_query_system_image_guid(struct ib_device 
*ibdev,
 
case MLX5_VPORT_ACCESS_METHOD_HCA:
err = mlx5_query_hca_vport_system_image_guid(mdev, &tmp);
-   if (!err)
-   *sys_image_guid = cpu_to_be64(tmp);
-   return err;
+   break;
+
+   case MLX5_VPORT_ACCESS_METHOD_NIC:
+   err = mlx5_query_nic_vport_system_image_guid(mdev, &tmp);
+   break;
 
default:
return -EINVAL;
}
+
+   if (!err)
+   *sys_image_guid = cpu_to_be64(tmp);
+
+   return err;
+
 }
 
 static int mlx5_query_max_pkeys(struct ib_device *ibdev,
@@ -218,13 +271,20 @@ static int mlx5_query_node_guid(struct mlx5_ib_dev *dev,
 
case MLX5_VPORT_ACCESS_METHOD_HCA:
err = mlx5_query_hca_vport_node_guid(dev->mdev, &tmp);
-   if (!err)
-   *node_guid = cpu_to_be64(tmp);
-   return err;
+   break;
+
+   case MLX5_VPORT_ACCESS_METHOD_NIC:
+   err = mlx5_query_nic_vport_node_guid(dev->mdev, &tmp);
+   break;
 
default:
return -EINVAL;
}
+
+   if (!err)
+   *node_guid = cpu_to_be64(tmp);
+
+   return err;
 }
 
 struct mlx5_reg_node_desc {
@@ -522,6 +582,9 @@ int mlx5_ib_query_port(struct ib_device *ibdev, u8 port,
case MLX5_VPORT_ACCESS_METHOD_HCA:
return mlx5_query_hca_port(ibdev, port, props);
 
+   case MLX5_VPORT_ACCESS_METHOD_NIC:
+   return mlx5_query_port_roce(ibdev, port, props);
+
default:
return -EINVAL;
}
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 5c857f2..7b9c976 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -632,13 +632,6 @@ extern struct workqueue_struct *mlx5_core_wq;
.struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field),  
\
.struct_size_bytes   = sizeof((struct ib_unpacked_ ## header *)0)->field
 
-struct ib_field {
-   size_t struct_offset_bytes;
-   size_t struct_size_bytes;
-   intoffset_bits;
-   intsize_bits;
-};
-
 static inline struct mlx5_core_dev *pci2mlx5_core_dev(struct pci_dev *pdev)
 {
return pci_get_drvdata(pdev);
-- 
2.1.0



[PATCH for-next V2 10/10] IB/mlx5: Support RoCE

2015-12-23 Thread Matan Barak
From: Achiad Shochat 

Advertise RoCE support for IB/core layer and set the hardware to
work in RoCE mode.
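
Once the immutable core_cap_flags advertise RoCE, generic code keys off them
through the existing rdma_* helpers; a small sketch (not part of the patch):

static void example_describe_port(struct ib_device *ibdev, u8 port_num)
{
	if (rdma_protocol_ib(ibdev, port_num))
		pr_info("port %u: InfiniBand\n", port_num);
	else if (rdma_protocol_roce(ibdev, port_num))
		pr_info("port %u: RoCE\n", port_num);
}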

Signed-off-by: Achiad Shochat 
---
 drivers/infiniband/hw/mlx5/main.c | 48 +++
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 3ee431a..22ae093 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1517,6 +1517,32 @@ static void destroy_dev_resources(struct 
mlx5_ib_resources *devr)
mlx5_ib_dealloc_pd(devr->p0);
 }
 
+static u32 get_core_cap_flags(struct ib_device *ibdev)
+{
+   struct mlx5_ib_dev *dev = to_mdev(ibdev);
+   enum rdma_link_layer ll = mlx5_ib_port_link_layer(ibdev, 1);
+   u8 l3_type_cap = MLX5_CAP_ROCE(dev->mdev, l3_type);
+   u8 roce_version_cap = MLX5_CAP_ROCE(dev->mdev, roce_version);
+   u32 ret = 0;
+
+   if (ll == IB_LINK_LAYER_INFINIBAND)
+   return RDMA_CORE_PORT_IBA_IB;
+
+   if (!(l3_type_cap & MLX5_ROCE_L3_TYPE_IPV4_CAP))
+   return 0;
+
+   if (!(l3_type_cap & MLX5_ROCE_L3_TYPE_IPV6_CAP))
+   return 0;
+
+   if (roce_version_cap & MLX5_ROCE_VERSION_1_CAP)
+   ret |= RDMA_CORE_PORT_IBA_ROCE;
+
+   if (roce_version_cap & MLX5_ROCE_VERSION_2_CAP)
+   ret |= RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP;
+
+   return ret;
+}
+
 static int mlx5_port_immutable(struct ib_device *ibdev, u8 port_num,
   struct ib_port_immutable *immutable)
 {
@@ -1529,7 +1555,7 @@ static int mlx5_port_immutable(struct ib_device *ibdev, 
u8 port_num,
 
immutable->pkey_tbl_len = attr.pkey_tbl_len;
immutable->gid_tbl_len = attr.gid_tbl_len;
-   immutable->core_cap_flags = RDMA_CORE_PORT_IBA_IB;
+   immutable->core_cap_flags = get_core_cap_flags(ibdev);
immutable->max_mad_size = IB_MGMT_MAD_SIZE;
 
return 0;
@@ -1537,12 +1563,27 @@ static int mlx5_port_immutable(struct ib_device *ibdev, 
u8 port_num,
 
 static int mlx5_enable_roce(struct mlx5_ib_dev *dev)
 {
+   int err;
+
dev->roce.nb.notifier_call = mlx5_netdev_event;
-   return register_netdevice_notifier(>roce.nb);
+   err = register_netdevice_notifier(&dev->roce.nb);
+   if (err)
+   return err;
+
+   err = mlx5_nic_vport_enable_roce(dev->mdev);
+   if (err)
+   goto err_unregister_netdevice_notifier;
+
+   return 0;
+
+err_unregister_netdevice_notifier:
+   unregister_netdevice_notifier(&dev->roce.nb);
+   return err;
 }
 
 static void mlx5_disable_roce(struct mlx5_ib_dev *dev)
 {
+   mlx5_nic_vport_disable_roce(dev->mdev);
unregister_netdevice_notifier(&dev->roce.nb);
 }
 
@@ -1557,8 +1598,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
port_type_cap = MLX5_CAP_GEN(mdev, port_type);
ll = mlx5_port_type_cap_to_rdma_ll(port_type_cap);
 
-   /* don't create IB instance over Eth ports, no RoCE yet! */
-   if (ll == IB_LINK_LAYER_ETHERNET)
+   if ((ll == IB_LINK_LAYER_ETHERNET) && !MLX5_CAP_GEN(mdev, roce))
return NULL;
 
printk_once(KERN_INFO "%s", mlx5_version);
-- 
2.1.0



Re: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC

2015-12-23 Thread Matan Barak
On Wed, Dec 23, 2015 at 10:04 PM, Doug Ledford  wrote:
> On 10/15/2015 12:58 PM, Hefty, Sean wrote:
> ib_create_ah_from_wc needs to resolve the DMAC in order to create the
> AH (this may result sending an ARP and waiting for response).
> CM uses this function (which is now sleepable).

 This is a significant change to the CM.  The CM calls are invoked
>>> assuming that they return relatively quickly.  They're invoked from
>>> callbacks and internally.  Having the calls now wait for an ARP response
>>> requires that this be re-architected, so the calling thread doesn't go out
>>> to lunch for several seconds.
>>>
>>> Agree - this is a significant change, but it was done a long time ago
>>> (at v4.3 if I recall). When we need to send a message we need to
>>
>> We're at 4.3-rc5?
>>
>>> figure out the destination MAC. Even the passive side needs to do that
>>> as some vendors don't report the source MAC of the packet in their wc.
>>> Even if they did, since IP based addressing is rout-able by its
>>> nature, it should follow the networking stack rules. Some crazy
>>> configurations could force sending responses to packets that came from
>>> router1 to router2 - so we have no choice than resolving the DMAC at
>>> every side.
>>
>> Ib_create_ah_from_wc is broken.   It is now an asynchronous operation, only 
>> the call itself was left as synchronous.  We can't block kernel threads for 
>> a minute, or however long ARP takes to resolve.  The call itself must change 
>> to be async, and all users of it updated to allocate some request, queue it, 
>> and handle all race conditions that result -- such as state changes or 
>> destruction of the work that caused the request to be initiated.
>>
>
> I don't know who had intended to address this, but it got left out of
> the 4.4 work.  We need to not let this drop through the cracks (for
> another release).  Can someone please put fixing this properly on their
> TODO list?
>

IMHO, the proposed patch makes things better. Not applying the current
patch means we have a "sleeping while atomic" error (in addition to
the fact that kernel threads could wait until the ARP process
finishes), which is pretty bad. I tend to agree that adding another CM
state is probably a better approach, but unless someone steps up and
adds this for v4.5, I think that's the best thing we have.

> --
> Doug Ledford 
>   GPG KeyID: 0E572FDD
>
>

Matan


Re: [PATCH] IB/cma: cma_match_net_dev needs to take into account port_num

2015-12-23 Thread Matan Barak
On Wed, Dec 23, 2015 at 7:57 PM, Doug Ledford <dledf...@redhat.com> wrote:
> On 12/23/2015 11:35 AM, Matan Barak wrote:
>> On Wed, Dec 23, 2015 at 6:08 PM, Doug Ledford <dledf...@redhat.com> wrote:
>>> On 12/22/2015 02:26 PM, Matan Barak wrote:
>>>> On Tue, Dec 22, 2015 at 8:58 PM, Doug Ledford <dledf...@redhat.com> wrote:
>>>>> On 12/22/2015 05:47 AM, Or Gerlitz wrote:
>>>>>> On 12/21/2015 5:01 PM, Matan Barak wrote:
>>>>>>> Previously, cma_match_net_dev called cma_protocol_roce which
>>>>>>> tried to verify that the IB device uses RoCE protocol. However,
>>>>>>> if rdma_id didn't have a bounded port, it used the first port
>>>>>>> of the device.
>>>>>>>
>>>>>>> In VPI systems, the first port might be an IB port while the second
>>>>>>> one could be an Ethernet port. This made requests for unbounded rdma_ids
>>>>>>> that come from the Ethernet port fail.
>>>>>>> Fixing this by passing the port of the request and checking this port
>>>>>>> of the device.
>>>>>>>
>>>>>>> Fixes: b8cab5dab15f ('IB/cma: Accept connection without a valid netdev
>>>>>>> on RoCE')
>>>>>>> Signed-off-by: Matan Barak<mat...@mellanox.com>
>>>>>>
>>>>>> seems that the patch is missing from patchworks, I can't explain that.
>>>>>
>>>>> I've already downloaded it and marked it accepted.
>>>>>
>>>>
>>>> Thanks Doug. Would you like that I'll repost the patch with the commit
>>>> message changed as Or suggested or is the current version good enough?
>>>>
>>>> Regarding the Ethernet loopback issue, I started looking into that,
>>>> but as Or stated, it's broken even before the RoCE patches.
>>>
>>> Ping.  Any progress on this?
>>
>> Yeah, there's some progress - the basic problem is that we don't have
>> a bounded ndev and thus cma_resolve_iboe_route returns -ENODEV.
>
> Which makes sense considering that 127.0.0.1 doesn't belong to any of
> the devs.
>
>> The root cause for this is that we have to store the ndev in
>> cma_bind_loopback. Even after doing that, cma_set_loopback changes the
>> sgid to be the localhost GID, which doesn't exist in the GID table and
>> thus will fail later in the GID lookup.
>
> Again, makes sense.
>
>> I think that regarding loopback, we actually want to send the data on
>> the link local default GID,
>
> Which link local default GID?  If you have more than one port or card,
> then that is not a unique value.

We assume that every RoCE port has an associated net device. Since a
net device should have a unique MAC, it should have a unique IPv6 link
local address and thus a unique GID.
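
For what it's worth, the default GID here is just the net device's MAC
run through the usual modified EUI-64 mapping into an fe80::/64
address. A rough sketch of that conversion (illustrative only, not the
exact kernel helper):

#include <string.h>

/* Derive an IPv6 link-local style default GID from a 48-bit MAC. */
static void mac_to_default_gid(const unsigned char mac[6],
			       unsigned char gid[16])
{
	memset(gid, 0, 16);
	gid[0] = 0xfe;			/* fe80::/64 link-local prefix */
	gid[1] = 0x80;
	gid[8] = mac[0] ^ 0x02;		/* flip the universal/local bit */
	gid[9] = mac[1];
	gid[10] = mac[2];
	gid[11] = 0xff;			/* insert ff:fe in the middle */
	gid[12] = 0xfe;
	gid[13] = mac[3];
	gid[14] = mac[4];
	gid[15] = mac[5];
}

So as long as the MACs are unique, the default GIDs are unique as well.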

>
>> which is guaranteed to exist.
>
> And in many cases, multiple times.
>
>> That's why I
>> think we should:
>> 1. Change the cma_src_addr and cma_dst_addr in cma_bind_loopback to be
>> the default GID.
>> 2. Store the associated ndev of this default GID as the bounded device.
>> 3. In cma_resolve_loopback, get the MAC of this bounded device and
>> store it as the DMAC.
>> 4. In cma_resolve_iboe_route, don't try to do route resolve if the
>> dGID matches the default GID.
>>
>> It's still not working though, but this is where I'm headed. What do you 
>> think?
>
> Let's punt this until later.  It only effects the situation when you use
> 127.0.0.1 as the address.  If you use the local IP address of a specific
> interface, you get the same loopback behavior, but no failures (and on
> top of that instead of getting a random device to handle the loopback
> transfer, you get a specific device of your choosing).  To me, that
> qualifies as a reasonable workaround.  The 127.0.0.1 behavior has been
> broken for a while (and I'm not sure it should have ever been relied
> upon anyway), so I don't think we have to hold things up.
>

I totally agree that it's better to use the local IP address and not
just get a random device by using 127.0.0.1. You could get a specific
device by binding to it, but then you should use its local IP instead of
127.0.0.1.
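
For anyone who hits this in the meantime, the workaround from user
space is simply to bind the cm_id to the port's own IP instead of
127.0.0.1. A minimal librdmacm sketch (the address here is just an
example, error handling omitted):

#include <rdma/rdma_cma.h>
#include <arpa/inet.h>

static int bind_to_local_ip(struct rdma_cm_id **id)
{
	struct rdma_event_channel *ch = rdma_create_event_channel();
	struct sockaddr_in sin = { .sin_family = AF_INET };

	/* use the RoCE port's own address, not 127.0.0.1 */
	inet_pton(AF_INET, "192.168.1.10", &sin.sin_addr);
	rdma_create_id(ch, id, NULL, RDMA_PS_TCP);
	return rdma_bind_addr(*id, (struct sockaddr *)&sin);
}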


> --
> Doug Ledford <dledf...@redhat.com>
>   GPG KeyID: 0E572FDD
>
>

Matan


[PATCH for-next V3 10/11] IB/core: Initialize UD header structure with IP and UDP headers

2015-12-23 Thread Matan Barak
From: Moni Shoua <mo...@mellanox.com>

ib_ud_header_init() is used to format InfiniBand headers
in a buffer up to (but not including) the BTH. For RoCE UDP ENCAP it is
required that this function also be able to build IP and UDP
headers.
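
To illustrate the intended use, here is a rough caller sketch for a
RoCE v2 (IPv4/UDP) UD header; the parameter order follows the updated
kernel-doc below, and the surrounding variable names are only
illustrative:

struct ib_ud_header hdr;
int err;

err = ib_ud_header_init(payload_len,
			0,		/* no LRH */
			1,		/* Ethernet header */
			vlan_present,
			0,		/* no GRH, IP header instead */
			4,		/* ip_version: IPv4 */
			1,		/* udp_present */
			0,		/* no immediate data */
			&hdr);
if (!err) {
	/* fill hdr.ip4 and hdr.udp, then: */
	hdr.ip4.check = ib_ud_ip4_csum(&hdr);
}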

Signed-off-by: Moni Shoua <mo...@mellanox.com>
Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/ud_header.c|  155 +---
 drivers/infiniband/hw/mlx4/qp.c|7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c |2 +-
 include/rdma/ib_pack.h |   45 --
 4 files changed, 188 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/ud_header.c 
b/drivers/infiniband/core/ud_header.c
index 72feee6..96697e7 100644
--- a/drivers/infiniband/core/ud_header.c
+++ b/drivers/infiniband/core/ud_header.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -116,6 +117,72 @@ static const struct ib_field vlan_table[]  = {
  .size_bits= 16 }
 };
 
+static const struct ib_field ip4_table[]  = {
+   { STRUCT_FIELD(ip4, ver),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 4 },
+   { STRUCT_FIELD(ip4, hdr_len),
+ .offset_words = 0,
+ .offset_bits  = 4,
+ .size_bits= 4 },
+   { STRUCT_FIELD(ip4, tos),
+ .offset_words = 0,
+ .offset_bits  = 8,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, tot_len),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, id),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, frag_off),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, ttl),
+ .offset_words = 2,
+ .offset_bits  = 0,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, protocol),
+ .offset_words = 2,
+ .offset_bits  = 8,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, check),
+ .offset_words = 2,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, saddr),
+ .offset_words = 3,
+ .offset_bits  = 0,
+ .size_bits= 32 },
+   { STRUCT_FIELD(ip4, daddr),
+ .offset_words = 4,
+ .offset_bits  = 0,
+ .size_bits= 32 }
+};
+
+static const struct ib_field udp_table[]  = {
+   { STRUCT_FIELD(udp, sport),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, dport),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, length),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, csum),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 }
+};
+
 static const struct ib_field grh_table[]  = {
{ STRUCT_FIELD(grh, ip_version),
  .offset_words = 0,
@@ -213,26 +280,57 @@ static const struct ib_field deth_table[] = {
  .size_bits= 24 }
 };
 
+__be16 ib_ud_ip4_csum(struct ib_ud_header *header)
+{
+   struct iphdr iph;
+
+   iph.ihl = 5;
+   iph.version = 4;
+   iph.tos = header->ip4.tos;
+   iph.tot_len = header->ip4.tot_len;
+   iph.id  = header->ip4.id;
+   iph.frag_off= header->ip4.frag_off;
+   iph.ttl = header->ip4.ttl;
+   iph.protocol= header->ip4.protocol;
+   iph.check   = 0;
+   iph.saddr   = header->ip4.saddr;
+   iph.daddr   = header->ip4.daddr;
+
+   return ip_fast_csum((u8 *)&iph, iph.ihl);
+}
+EXPORT_SYMBOL(ib_ud_ip4_csum);
+
 /**
  * ib_ud_header_init - Initialize UD header structure
  * @payload_bytes:Length of packet payload
  * @lrh_present: specify if LRH is present
  * @eth_present: specify if Eth header is present
  * @vlan_present: packet is tagged vlan
- * @grh_present:GRH flag (if non-zero, GRH will be included)
+ * @grh_present: GRH flag (if non-zero, GRH will be included)
+ * @ip_version: if non-zero, IP header, V4 or V6, will be included
+ * @udp_present :if non-zero, UDP header will be included
  * @immediate_present: specify if immediate data is present
  * @header:Structure to initialize
  */
-void ib_ud_header_init(int payload_bytes,
-  int  lrh_present,
-  int  eth_present,
-  int  vlan_present,
-  int  grh_present,
-  int  immediate_present,
-  struct ib_ud_header *header)
+int ib_ud_header_init(int payload_bytes,
+ intlrh_present

[PATCH for-next V3 06/11] IB/core: Move rdma_is_upper_dev_rcu to header file

2015-12-23 Thread Matan Barak
In order to validate the route, we need an easy way to check if a
net-device belongs to our RDMA device. Move this helper function
to a header file in order to make this check easier.

Signed-off-by: Matan Barak <mat...@mellanox.com>
Reviewed-by: Haggai Eran <hagg...@mellanox.com>
---
 drivers/infiniband/core/core_priv.h |   13 +
 drivers/infiniband/core/roce_gid_mgmt.c |   20 
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index d531f91..3b250a2 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -96,4 +96,17 @@ int ib_cache_setup_one(struct ib_device *device);
 void ib_cache_cleanup_one(struct ib_device *device);
 void ib_cache_release_one(struct ib_device *device);
 
+static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
+struct net_device *upper)
+{
+   struct net_device *_upper = NULL;
+   struct list_head *iter;
+
+   netdev_for_each_all_upper_dev_rcu(dev, _upper, iter)
+   if (_upper == upper)
+   break;
+
+   return _upper == upper;
+}
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 1e3673f..06556c3 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -139,18 +139,6 @@ static enum bonding_slave_state 
is_eth_active_slave_of_bonding_rcu(struct net_de
return BONDING_SLAVE_STATE_NA;
 }
 
-static bool is_upper_dev_rcu(struct net_device *dev, struct net_device *upper)
-{
-   struct net_device *_upper = NULL;
-   struct list_head *iter;
-
-   netdev_for_each_all_upper_dev_rcu(dev, _upper, iter)
-   if (_upper == upper)
-   break;
-
-   return _upper == upper;
-}
-
 #define REQUIRED_BOND_STATES   (BONDING_SLAVE_STATE_ACTIVE |   \
 BONDING_SLAVE_STATE_NA)
 static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
@@ -168,7 +156,7 @@ static int is_eth_port_of_netdev(struct ib_device *ib_dev, 
u8 port,
if (!real_dev)
real_dev = event_ndev;
 
-   res = ((is_upper_dev_rcu(rdma_ndev, event_ndev) &&
+   res = ((rdma_is_upper_dev_rcu(rdma_ndev, event_ndev) &&
   (is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) &
REQUIRED_BOND_STATES)) ||
   real_dev == rdma_ndev);
@@ -214,7 +202,7 @@ static int upper_device_filter(struct ib_device *ib_dev, u8 
port,
return 1;
 
rcu_read_lock();
-   res = is_upper_dev_rcu(rdma_ndev, event_ndev);
+   res = rdma_is_upper_dev_rcu(rdma_ndev, event_ndev);
rcu_read_unlock();
 
return res;
@@ -244,7 +232,7 @@ static void enum_netdev_default_gids(struct ib_device 
*ib_dev,
rcu_read_lock();
if (!rdma_ndev ||
((rdma_ndev != event_ndev &&
- !is_upper_dev_rcu(rdma_ndev, event_ndev)) ||
+ !rdma_is_upper_dev_rcu(rdma_ndev, event_ndev)) ||
 is_eth_active_slave_of_bonding_rcu(rdma_ndev,

netdev_master_upper_dev_get_rcu(rdma_ndev)) ==
 BONDING_SLAVE_STATE_INACTIVE)) {
@@ -274,7 +262,7 @@ static void bond_delete_netdev_default_gids(struct 
ib_device *ib_dev,
 
rcu_read_lock();
 
-   if (is_upper_dev_rcu(rdma_ndev, event_ndev) &&
+   if (rdma_is_upper_dev_rcu(rdma_ndev, event_ndev) &&
is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) ==
BONDING_SLAVE_STATE_INACTIVE) {
unsigned long gid_type_mask;
-- 
1.7.1



[PATCH for-next V3 07/11] IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path

2015-12-23 Thread Matan Barak
In order to make sure API users don't try to use SGIDs which don't
conform to the routing table, validate the route before searching
the RoCE GID table.
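
Conceptually, the check boils down to: the net device behind the SGID
must be the device the routing stack would use to reach the
destination, or related to it as a lower/upper device. A simplified
sketch of the idea (this is not the exact code in the patch), using the
helper moved to core_priv.h in the previous patch:

static bool gid_ndev_matches_route(struct net_device *gid_ndev,
				   struct net_device *route_ndev)
{
	bool ok;

	rcu_read_lock();
	/* same device, or one is stacked on top of the other */
	ok = gid_ndev == route_ndev ||
	     rdma_is_upper_dev_rcu(gid_ndev, route_ndev) ||
	     rdma_is_upper_dev_rcu(route_ndev, gid_ndev);
	rcu_read_unlock();
	return ok;
}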

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/addr.c   |  175 +-
 drivers/infiniband/core/cm.c |   10 ++-
 drivers/infiniband/core/cma.c|   30 +-
 drivers/infiniband/core/sa_query.c   |   75 --
 drivers/infiniband/core/verbs.c  |   48 ++---
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |2 +-
 include/rdma/ib_addr.h   |   10 ++-
 7 files changed, 270 insertions(+), 80 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 6e35299..0b5f245 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -121,7 +121,8 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct 
net_device *dev,
 }
 EXPORT_SYMBOL(rdma_copy_addr);
 
-int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
+int rdma_translate_ip(const struct sockaddr *addr,
+ struct rdma_dev_addr *dev_addr,
  u16 *vlan_id)
 {
struct net_device *dev;
@@ -139,7 +140,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
switch (addr->sa_family) {
case AF_INET:
dev = ip_dev_find(dev_addr->net,
-   ((struct sockaddr_in *) addr)->sin_addr.s_addr);
+   ((const struct sockaddr_in *)addr)->sin_addr.s_addr);
 
if (!dev)
return ret;
@@ -154,7 +155,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
rcu_read_lock();
for_each_netdev_rcu(dev_addr->net, dev) {
if (ipv6_chk_addr(dev_addr->net,
- &((struct sockaddr_in6 *) 
addr)->sin6_addr,
+ &((const struct sockaddr_in6 
*)addr)->sin6_addr,
  dev, 1)) {
ret = rdma_copy_addr(dev_addr, dev, NULL);
if (vlan_id)
@@ -198,7 +199,8 @@ static void queue_req(struct addr_req *req)
mutex_unlock();
 }
 
-static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, 
void *daddr)
+static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr,
+   const void *daddr)
 {
struct neighbour *n;
int ret;
@@ -222,8 +224,9 @@ static int dst_fetch_ha(struct dst_entry *dst, struct 
rdma_dev_addr *dev_addr, v
 }
 
 static int addr4_resolve(struct sockaddr_in *src_in,
-struct sockaddr_in *dst_in,
-struct rdma_dev_addr *addr)
+const struct sockaddr_in *dst_in,
+struct rdma_dev_addr *addr,
+struct rtable **prt)
 {
__be32 src_ip = src_in->sin_addr.s_addr;
__be32 dst_ip = dst_in->sin_addr.s_addr;
@@ -243,36 +246,23 @@ static int addr4_resolve(struct sockaddr_in *src_in,
src_in->sin_family = AF_INET;
src_in->sin_addr.s_addr = fl4.saddr;
 
-   if (rt->dst.dev->flags & IFF_LOOPBACK) {
-   ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
-   if (!ret)
-   memcpy(addr->dst_dev_addr, addr->src_dev_addr, 
MAX_ADDR_LEN);
-   goto put;
-   }
-
-   /* If the device does ARP internally, return 'done' */
-   if (rt->dst.dev->flags & IFF_NOARP) {
-   ret = rdma_copy_addr(addr, rt->dst.dev, NULL);
-   goto put;
-   }
-
/* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
 * routable) and we could set the network type accordingly.
 */
if (rt->rt_uses_gateway)
addr->network = RDMA_NETWORK_IPV4;
 
-   ret = dst_fetch_ha(>dst, addr, );
-put:
-   ip_rt_put(rt);
+   *prt = rt;
+   return 0;
 out:
return ret;
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int addr6_resolve(struct sockaddr_in6 *src_in,
-struct sockaddr_in6 *dst_in,
-struct rdma_dev_addr *addr)
+const struct sockaddr_in6 *dst_in,
+struct rdma_dev_addr *addr,
+struct dst_entry **pdst)
 {
struct flowi6 fl6;
struct dst_entry *dst;
@@ -299,49 +289,109 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
src_in->sin6_addr = fl6.saddr;
}
 
-   if (dst->dev->flags & IFF_LOOPBACK) {
-   ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
-   if (!ret)
- 

[PATCH for-next V3 04/11] IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type

2015-12-23 Thread Matan Barak
Adding RoCE v2 GID type and port type. Vendors
which support this type will get their GID table
populated with RoCE v2 GIDs automatically.
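
For a vendor driver, advertising the new protocol is a matter of
reporting the new capability bits for the port; a minimal sketch of a
get_port_immutable() callback (the function name and the omitted
fields are illustrative):

static int example_port_immutable(struct ib_device *ibdev, u8 port_num,
				  struct ib_port_immutable *immutable)
{
	immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE |
				    RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP;
	/* gid_tbl_len, pkey_tbl_len, max_mad_size set as usual */
	return 0;
}

With that in place the RoCE GID management code will populate both
RoCE v1 and RoCE v2 entries for the port.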

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/cache.c |1 +
 drivers/infiniband/core/roce_gid_mgmt.c |3 ++-
 include/rdma/ib_verbs.h |   23 +--
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 566fd8f..88b4b6f 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -128,6 +128,7 @@ static void dispatch_gid_change_event(struct ib_device 
*ib_dev, u8 port)
 
 static const char * const gid_type_str[] = {
[IB_GID_TYPE_IB]= "IB/RoCE v1",
+   [IB_GID_TYPE_ROCE_UDP_ENCAP]= "RoCE v2",
 };
 
 const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 61c27a7..1e3673f 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -71,7 +71,8 @@ static const struct {
bool (*is_supported)(const struct ib_device *device, u8 port_num);
enum ib_gid_type gid_type;
 } PORT_CAP_TO_GID_TYPE[] = {
-   {rdma_protocol_roce,   IB_GID_TYPE_ROCE},
+   {rdma_protocol_roce_eth_encap, IB_GID_TYPE_ROCE},
+   {rdma_protocol_roce_udp_encap, IB_GID_TYPE_ROCE_UDP_ENCAP},
 };
 
 #define CAP_TO_GID_TABLE_SIZE  ARRAY_SIZE(PORT_CAP_TO_GID_TYPE)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 3f2100b..5848696 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -73,6 +73,7 @@ enum ib_gid_type {
/* If link layer is Ethernet, this is RoCE V1 */
IB_GID_TYPE_IB= 0,
IB_GID_TYPE_ROCE  = 0,
+   IB_GID_TYPE_ROCE_UDP_ENCAP = 1,
IB_GID_TYPE_SIZE
 };
 
@@ -355,6 +356,7 @@ union rdma_protocol_stats {
 #define RDMA_CORE_CAP_PROT_IB   0x0010
 #define RDMA_CORE_CAP_PROT_ROCE 0x0020
 #define RDMA_CORE_CAP_PROT_IWARP0x0040
+#define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x0080
 
 #define RDMA_CORE_PORT_IBA_IB  (RDMA_CORE_CAP_PROT_IB  \
| RDMA_CORE_CAP_IB_MAD \
@@ -367,6 +369,12 @@ union rdma_protocol_stats {
| RDMA_CORE_CAP_IB_CM   \
| RDMA_CORE_CAP_AF_IB   \
| RDMA_CORE_CAP_ETH_AH)
+#define RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP  \
+   (RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP \
+   | RDMA_CORE_CAP_IB_MAD  \
+   | RDMA_CORE_CAP_IB_CM   \
+   | RDMA_CORE_CAP_AF_IB   \
+   | RDMA_CORE_CAP_ETH_AH)
 #define RDMA_CORE_PORT_IWARP   (RDMA_CORE_CAP_PROT_IWARP \
| RDMA_CORE_CAP_IW_CM)
 #define RDMA_CORE_PORT_INTEL_OPA   (RDMA_CORE_PORT_IBA_IB  \
@@ -1997,6 +2005,17 @@ static inline bool rdma_protocol_ib(const struct 
ib_device *device, u8 port_num)
 
 static inline bool rdma_protocol_roce(const struct ib_device *device, u8 
port_num)
 {
+   return device->port_immutable[port_num].core_cap_flags &
+   (RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP);
+}
+
+static inline bool rdma_protocol_roce_udp_encap(const struct ib_device 
*device, u8 port_num)
+{
+   return device->port_immutable[port_num].core_cap_flags & 
RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP;
+}
+
+static inline bool rdma_protocol_roce_eth_encap(const struct ib_device 
*device, u8 port_num)
+{
return device->port_immutable[port_num].core_cap_flags & 
RDMA_CORE_CAP_PROT_ROCE;
 }
 
@@ -2007,8 +2026,8 @@ static inline bool rdma_protocol_iwarp(const struct 
ib_device *device, u8 port_n
 
 static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num)
 {
-   return device->port_immutable[port_num].core_cap_flags &
-   (RDMA_CORE_CAP_PROT_IB | RDMA_CORE_CAP_PROT_ROCE);
+   return rdma_protocol_ib(device, port_num) ||
+   rdma_protocol_roce(device, port_num);
 }
 
 /**
-- 
1.7.1



[PATCH for-next V3 03/11] IB/core: Add gid attributes to sysfs

2015-12-23 Thread Matan Barak
This patch adds net-device and GID-type attributes to each GID
in the GID table. Users that use verbs directly need to specify
the GID index. Since the same GID could have different types or
associated net devices, users should have the ability to query the
associated GID attributes. Adding these attributes to sysfs.
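
A small user-space sketch of reading the new files (device name, port
and index here are only examples):

#include <stdio.h>

static void print_gid_attrs(void)
{
	char buf[64];
	FILE *f;

	/* RoCE type of GID 0 on port 1: "IB/RoCE v1" or "RoCE v2" */
	f = fopen("/sys/class/infiniband/mlx4_0/ports/1/gid_attrs/types/0", "r");
	if (f) {
		if (fgets(buf, sizeof(buf), f))
			printf("gid 0 type: %s", buf);
		fclose(f);
	}

	/* net device associated with GID 0 on port 1 */
	f = fopen("/sys/class/infiniband/mlx4_0/ports/1/gid_attrs/ndevs/0", "r");
	if (f) {
		if (fgets(buf, sizeof(buf), f))
			printf("gid 0 ndev: %s", buf);
		fclose(f);
	}
}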

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 Documentation/ABI/testing/sysfs-class-infiniband |   16 ++
 drivers/infiniband/core/sysfs.c  |  184 +-
 2 files changed, 198 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband

diff --git a/Documentation/ABI/testing/sysfs-class-infiniband 
b/Documentation/ABI/testing/sysfs-class-infiniband
new file mode 100644
index 000..a86abe6
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-infiniband
@@ -0,0 +1,16 @@
+What:  /sys/class/infiniband/<device>/ports/<port-num>/gid_attrs/ndevs/<gid-index>
+Date:  November 29, 2015
+KernelVersion: 4.4.0
+Contact:   linux-rdma@vger.kernel.org
+Description:   The net-device's name associated with the GID resides
+   at index <gid-index>.
+
+What:  /sys/class/infiniband/<device>/ports/<port-num>/gid_attrs/types/<gid-index>
+Date:  November 29, 2015
+KernelVersion: 4.4.0
+Contact:   linux-rdma@vger.kernel.org
+Description:   The RoCE type of the associated GID resides at index <gid-index>.
+   This could either be "IB/RoCE v1" for IB and RoCE v1 based GIDs
+   or "RoCE v2" for RoCE v2 based GIDs.
+
+
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index dbfa27c..be75994 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -37,12 +37,22 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
+struct ib_port;
+
+struct gid_attr_group {
+   struct ib_port  *port;
+   struct kobject  kobj;
+   struct attribute_group  ndev;
+   struct attribute_group  type;
+};
 struct ib_port {
struct kobject kobj;
struct ib_device  *ibdev;
+   struct gid_attr_group *gid_attr_group;
struct attribute_group gid_group;
struct attribute_group pkey_group;
u8 port_num;
@@ -84,6 +94,24 @@ static const struct sysfs_ops port_sysfs_ops = {
.show = port_attr_show
 };
 
+static ssize_t gid_attr_show(struct kobject *kobj,
+struct attribute *attr, char *buf)
+{
+   struct port_attribute *port_attr =
+   container_of(attr, struct port_attribute, attr);
+   struct ib_port *p = container_of(kobj, struct gid_attr_group,
+kobj)->port;
+
+   if (!port_attr->show)
+   return -EIO;
+
+   return port_attr->show(p, port_attr, buf);
+}
+
+static const struct sysfs_ops gid_attr_sysfs_ops = {
+   .show = gid_attr_show
+};
+
 static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
  char *buf)
 {
@@ -281,6 +309,46 @@ static struct attribute *port_default_attrs[] = {
NULL
 };
 
+static size_t print_ndev(struct ib_gid_attr *gid_attr, char *buf)
+{
+   if (!gid_attr->ndev)
+   return -EINVAL;
+
+   return sprintf(buf, "%s\n", gid_attr->ndev->name);
+}
+
+static size_t print_gid_type(struct ib_gid_attr *gid_attr, char *buf)
+{
+   return sprintf(buf, "%s\n", ib_cache_gid_type_str(gid_attr->gid_type));
+}
+
+static ssize_t _show_port_gid_attr(struct ib_port *p,
+  struct port_attribute *attr,
+  char *buf,
+  size_t (*print)(struct ib_gid_attr *gid_attr,
+  char *buf))
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr = {};
+   ssize_t ret;
+   va_list args;
+
+   ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid,
+  &gid_attr);
+   if (ret)
+   goto err;
+
+   ret = print(&gid_attr, buf);
+
+err:
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   va_end(args);
+   return ret;
+}
+
 static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr,
 char *buf)
 {
@@ -296,6 +364,19 @@ static ssize_t show_port_gid(struct ib_port *p, struct 
port_attribute *attr,
return sprintf(buf, "%pI6\n", gid.raw);
 }
 
+static ssize_t show_port_gid_attr_ndev(struct ib_port *p,
+  struct port_attribute *attr, char *buf)
+{
+   return _show_port_gid_attr(p, attr, buf, print_ndev);
+}
+
+static ssize_t show_port_gid_attr_gid_type(struct ib_port *p,
+ 

[PATCH for-next V3 01/11] IB/core: Add gid_type to gid attribute

2015-12-23 Thread Matan Barak
In order to support multiple GID types, we need to store the gid_type
with each GID. This is also aligned with the RoCE v2 annex "RoCEv2 PORT
GID table entries shall have a "GID type" attribute that denotes the L3
Address type". The currently supported GID is IB_GID_TYPE_IB which is
also RoCE v1 GID type.

This implies that gid_type should be added to roce_gid_table meta-data.
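
With the type stored per entry, cache lookups have to say which
flavour of the GID they are interested in; a rough caller sketch of
the updated lookup (variable names are illustrative):

	union ib_gid sgid;
	u16 index;
	int err;

	/* find the RoCE v1 / IB flavour of sgid on this port;
	 * ndev may be NULL if we don't care about the net device
	 */
	err = ib_find_cached_gid_by_port(ibdev, &sgid, IB_GID_TYPE_IB,
					 port_num, ndev, &index);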

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/cache.c   |  144 +++-
 drivers/infiniband/core/cm.c  |2 +-
 drivers/infiniband/core/cma.c |3 +-
 drivers/infiniband/core/core_priv.h   |4 +
 drivers/infiniband/core/device.c  |9 ++-
 drivers/infiniband/core/multicast.c   |2 +-
 drivers/infiniband/core/roce_gid_mgmt.c   |   60 ++--
 drivers/infiniband/core/sa_query.c|5 +-
 drivers/infiniband/core/uverbs_marshall.c |1 +
 drivers/infiniband/core/verbs.c   |1 +
 include/rdma/ib_cache.h   |4 +
 include/rdma/ib_sa.h  |1 +
 include/rdma/ib_verbs.h   |   11 ++-
 13 files changed, 185 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 097e9df..566fd8f 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -64,6 +64,7 @@ enum gid_attr_find_mask {
GID_ATTR_FIND_MASK_GID  = 1UL << 0,
GID_ATTR_FIND_MASK_NETDEV   = 1UL << 1,
GID_ATTR_FIND_MASK_DEFAULT  = 1UL << 2,
+   GID_ATTR_FIND_MASK_GID_TYPE = 1UL << 3,
 };
 
 enum gid_table_entry_props {
@@ -125,6 +126,19 @@ static void dispatch_gid_change_event(struct ib_device 
*ib_dev, u8 port)
}
 }
 
+static const char * const gid_type_str[] = {
+   [IB_GID_TYPE_IB]= "IB/RoCE v1",
+};
+
+const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
+{
+   if (gid_type < ARRAY_SIZE(gid_type_str) && gid_type_str[gid_type])
+   return gid_type_str[gid_type];
+
+   return "Invalid GID type";
+}
+EXPORT_SYMBOL(ib_cache_gid_type_str);
+
 /* This function expects that rwlock will be write locked in all
  * scenarios and that lock will be locked in sleep-able (RoCE)
  * scenarios.
@@ -233,6 +247,10 @@ static int find_gid(struct ib_gid_table *table, const 
union ib_gid *gid,
if (found >=0)
continue;
 
+   if (mask & GID_ATTR_FIND_MASK_GID_TYPE &&
+   attr->gid_type != val->gid_type)
+   continue;
+
if (mask & GID_ATTR_FIND_MASK_GID &&
memcmp(gid, >gid, sizeof(*gid)))
continue;
@@ -296,6 +314,7 @@ int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
write_lock_irq(>rwlock);
 
ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID |
+ GID_ATTR_FIND_MASK_GID_TYPE |
  GID_ATTR_FIND_MASK_NETDEV, );
if (ix >= 0)
goto out_unlock;
@@ -329,6 +348,7 @@ int ib_cache_gid_del(struct ib_device *ib_dev, u8 port,
 
ix = find_gid(table, gid, attr, false,
  GID_ATTR_FIND_MASK_GID  |
+ GID_ATTR_FIND_MASK_GID_TYPE |
  GID_ATTR_FIND_MASK_NETDEV   |
  GID_ATTR_FIND_MASK_DEFAULT,
  NULL);
@@ -427,11 +447,13 @@ static int _ib_cache_gid_table_find(struct ib_device 
*ib_dev,
 
 static int ib_cache_gid_find(struct ib_device *ib_dev,
 const union ib_gid *gid,
+enum ib_gid_type gid_type,
 struct net_device *ndev, u8 *port,
 u16 *index)
 {
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr gid_attr_val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+GID_ATTR_FIND_MASK_GID_TYPE;
+   struct ib_gid_attr gid_attr_val = {.ndev = ndev, .gid_type = gid_type};
 
if (ndev)
mask |= GID_ATTR_FIND_MASK_NETDEV;
@@ -442,14 +464,16 @@ static int ib_cache_gid_find(struct ib_device *ib_dev,
 
 int ib_find_cached_gid_by_port(struct ib_device *ib_dev,
   const union ib_gid *gid,
+  enum ib_gid_type gid_type,
   u8 port, struct net_device *ndev,
   u16 *index)
 {
int local_index;
struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
struct ib_gid_table *table;
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+G

[PATCH for-next V3 08/11] IB/rdma_cm: Add wrapper for cma reference count

2015-12-23 Thread Matan Barak
Currently, cma users can't increase or decrease the cma reference
count. This is necessary when setting cma attributes (like the
default GID type) in order to avoid use-after-free errors.
Adding cma_ref_dev and cma_deref_dev APIs.

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/cma.c   |   11 +--
 drivers/infiniband/core/core_priv.h |4 
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index fce11df..322f1c6 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -60,6 +60,8 @@
 #include 
 #include 
 
+#include "core_priv.h"
+
 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -185,6 +187,11 @@ enum {
CMA_OPTION_AFONLY,
 };
 
+void cma_ref_dev(struct cma_device *cma_dev)
+{
+   atomic_inc(&cma_dev->refcount);
+}
+
 /*
  * Device removal can occur at anytime, so we need extra handling to
  * serialize notifying the user of device removal with other callbacks.
@@ -339,7 +346,7 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 
ip_ver)
 static void cma_attach_to_dev(struct rdma_id_private *id_priv,
  struct cma_device *cma_dev)
 {
-   atomic_inc(&cma_dev->refcount);
+   cma_ref_dev(cma_dev);
id_priv->cma_dev = cma_dev;
id_priv->id.device = cma_dev->device;
id_priv->id.route.addr.dev_addr.transport =
@@ -347,7 +354,7 @@ static void cma_attach_to_dev(struct rdma_id_private 
*id_priv,
   list_add_tail(&id_priv->list, &cma_dev->id_list);
 }
 
-static inline void cma_deref_dev(struct cma_device *cma_dev)
+void cma_deref_dev(struct cma_device *cma_dev)
 {
   if (atomic_dec_and_test(&cma_dev->refcount))
   complete(&cma_dev->comp);
diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index 3b250a2..1945b4e 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -38,6 +38,10 @@
 
 #include 
 
+struct cma_device;
+void cma_ref_dev(struct cma_device *cma_dev);
+void cma_deref_dev(struct cma_device *cma_dev);
+
 int  ib_device_register_sysfs(struct ib_device *device,
  int (*port_callback)(struct ib_device *,
   u8, struct kobject *));
-- 
1.7.1



[PATCH for-next V3 00/11] Add RoCE v2 support

2015-12-23 Thread Matan Barak
Hi Doug,

This series adds the support for RoCE v2. In order to support RoCE v2,
we add gid_type attribute to every GID. When the RoCE GID management
populates the GID table, it duplicates each GID with all supported types.
This gives the user the ability to communicate over each supported
type.

Patches 0001, 0002 and 0003 add support for multiple GID types to the
cache and related APIs. The third patch exposes the GID attributes
information in sysfs.

Patch 0004 adds the RoCE v2 GID type and the capabilities required
from the vendor in order to implement RoCE v2. These capabilities
are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP.

RoCE v2 could work at IPv4 and IPv6 networks. When receiving ib_wc, this
information should come from the vendor's driver. In case the vendor
doesn't supply this information, we parse the packet headers and resolve
its network type. Patch 0005 adds this information and required utilities.

Patches 0006 and 0007 add route validation. This is mandatory to ensure
that we send packets using GIDs which correspond to a net-device that
can route to the destination.

Patches 0008 and 0009 add configfs support (and the required
infrastructure) for CMA. The administrator should be able to set the
default RoCE type. This is done through a new per-port
default_roce_mode configfs file.

Patch 0010 formats a QP1 packet in order to support RoCE v2 CM
packets. This is required for vendors which implement their
QP1 as a Raw QP.

Patch 0011 adds support for IPv4 multicast as an IPv4 network
requires IGMP to be sent in order to join multicast groups.

Vendor code isn't part of this patch-set. Soft-RoCE will be
sent soon and depends on these patches. Other vendors, like
mlx4, ocrdma and mlx5, will follow.

This patch-set applies on top of the "Change per-entry locks in GID cache to
table lock" series which was sent to the mailing list.

Thanks,
Matan

Changes from V2:
 - Rebase over Doug's k.o/for-4.5
 - Make INFINIBAND_ADDR_TRANS_CONFIGFS depends on CONFIGFS

Changes from V1:
 - Rebased against Linux 4.4-rc2 master branch.
 - Add route validation
 - ConfigFS - avoid compiling INFINIBAND=y and CONFIGFS_FS=m
 - Add documentation for configfs and sysfs ABI
 - Remove ifindex and gid_type from mcmember

Changes from V0:
 - Rebased patches against Doug's latest k.o/for-4.4 tree.
 - Fixed a bug in configfs (rmdir caused an incorrect free).

Matan Barak (8):
  IB/core: Add gid_type to gid attribute
  IB/cm: Use the source GID index type
  IB/core: Add gid attributes to sysfs
  IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
  IB/core: Move rdma_is_upper_dev_rcu to header file
  IB/core: Validate route in ib_init_ah_from_wc and
ib_init_ah_from_path
  IB/rdma_cm: Add wrapper for cma reference count
  IB/cma: Add configfs for rdma_cm

Moni Shoua (2):
  IB/core: Initialize UD header structure with IP and UDP headers
  IB/cma: Join and leave multicast groups with IGMP

Somnath Kotur (1):
  IB/core: Add rdma_network_type to wc

 Documentation/ABI/testing/configfs-rdma_cm   |   22 ++
 Documentation/ABI/testing/sysfs-class-infiniband |   16 +
 drivers/infiniband/Kconfig   |9 +
 drivers/infiniband/core/Makefile |2 +
 drivers/infiniband/core/addr.c   |  185 +
 drivers/infiniband/core/cache.c  |  169 ---
 drivers/infiniband/core/cm.c |   31 ++-
 drivers/infiniband/core/cma.c|  259 --
 drivers/infiniband/core/cma_configfs.c   |  322 ++
 drivers/infiniband/core/core_priv.h  |   45 +++
 drivers/infiniband/core/device.c |9 +-
 drivers/infiniband/core/multicast.c  |   17 +-
 drivers/infiniband/core/roce_gid_mgmt.c  |   81 --
 drivers/infiniband/core/sa_query.c   |   76 +-
 drivers/infiniband/core/sysfs.c  |  184 -
 drivers/infiniband/core/ud_header.c  |  155 ++-
 drivers/infiniband/core/uverbs_marshall.c|1 +
 drivers/infiniband/core/verbs.c  |  170 ++--
 drivers/infiniband/hw/mlx4/qp.c  |7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c   |2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |2 +-
 include/rdma/ib_addr.h   |   11 +-
 include/rdma/ib_cache.h  |4 +
 include/rdma/ib_pack.h   |   45 +++-
 include/rdma/ib_sa.h |3 +
 include/rdma/ib_verbs.h  |   78 +-
 26 files changed, 1703 insertions(+), 202 deletions(-)
 create mode 100644 Documentation/ABI/testing/configfs-rdma_cm
 create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband
 create mode 100644 drivers/infiniband/core/cma_configfs.c


[PATCH for-next V3 02/11] IB/cm: Use the source GID index type

2015-12-23 Thread Matan Barak
Previously, the cm and cma modules supported only the IB and RoCE v1 GID types.
In order to support multiple GID types, the gid_type is passed to
cm_init_av_by_path and stored in the path record.

The rdma cm client would use a default GID type that will be saved in
rdma_id_private.

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/cm.c  |   25 -
 drivers/infiniband/core/cma.c |2 ++
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 6960386..ce0ca90 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -364,7 +364,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, 
struct cm_av *av)
read_lock_irqsave(_lock, flags);
list_for_each_entry(cm_dev, _list, list) {
if (!ib_find_cached_gid(cm_dev->ib_device, >sgid,
-   IB_GID_TYPE_IB, ndev, , NULL)) {
+   path->gid_type, ndev, , NULL)) {
port = cm_dev->port[p-1];
break;
}
@@ -1600,6 +1600,8 @@ static int cm_req_handler(struct cm_work *work)
struct ib_cm_id *cm_id;
struct cm_id_private *cm_id_priv, *listen_cm_id_priv;
struct cm_req_msg *req_msg;
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr;
int ret;
 
req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -1639,11 +1641,24 @@ static int cm_req_handler(struct cm_work *work)
cm_format_paths_from_req(req_msg, >path[0], >path[1]);
 
memcpy(work->path[0].dmac, cm_id_priv->av.ah_attr.dmac, ETH_ALEN);
-   ret = cm_init_av_by_path(>path[0], _id_priv->av);
+   ret = ib_get_cached_gid(work->port->cm_dev->ib_device,
+   work->port->port_num,
+   cm_id_priv->av.ah_attr.grh.sgid_index,
+   &gid, &gid_attr);
+   if (!ret) {
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   work->path[0].gid_type = gid_attr.gid_type;
+   ret = cm_init_av_by_path(>path[0], _id_priv->av);
+   }
if (ret) {
-   ib_get_cached_gid(work->port->cm_dev->ib_device,
- work->port->port_num, 0, >path[0].sgid,
- NULL);
+   int err = ib_get_cached_gid(work->port->cm_dev->ib_device,
+   work->port->port_num, 0,
+   &work->path[0].sgid,
+   &gid_attr);
+   if (!err && gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   work->path[0].gid_type = gid_attr.gid_type;
ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
   >path[0].sgid, sizeof work->path[0].sgid,
   NULL, 0);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 2637ebf..446323a 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -228,6 +228,7 @@ struct rdma_id_private {
u8  tos;
u8  reuseaddr;
u8  afonly;
+   enum ib_gid_typegid_type;
 };
 
 struct cma_multicast {
@@ -2314,6 +2315,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
ndev = dev_get_by_index(_net, addr->dev_addr.bound_dev_if);
route->path_rec->net = _net;
route->path_rec->ifindex = addr->dev_addr.bound_dev_if;
+   route->path_rec->gid_type = id_priv->gid_type;
}
if (!ndev) {
ret = -ENODEV;
-- 
1.7.1



[PATCH for-next V3 09/11] IB/cma: Add configfs for rdma_cm

2015-12-23 Thread Matan Barak
Users would like to control the behaviour of rdma_cm.
For example, old applications which don't set the
required RoCE gid type could be executed on RoCE V2
network types. In order to support this configuration,
we implement a configfs for rdma_cm.

In order to use the configfs, one needs to mount it and
mkdir a directory named after the IB device inside the rdma_cm directory.

The patch adds support for a single configuration file,
default_roce_mode. The mode can either be "IB/RoCE v1" or
"RoCE v2".

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 Documentation/ABI/testing/configfs-rdma_cm |   22 ++
 drivers/infiniband/Kconfig |9 +
 drivers/infiniband/core/Makefile   |2 +
 drivers/infiniband/core/cache.c|   24 ++
 drivers/infiniband/core/cma.c  |  108 +-
 drivers/infiniband/core/cma_configfs.c |  322 
 drivers/infiniband/core/core_priv.h|   24 ++
 7 files changed, 504 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/ABI/testing/configfs-rdma_cm
 create mode 100644 drivers/infiniband/core/cma_configfs.c

diff --git a/Documentation/ABI/testing/configfs-rdma_cm 
b/Documentation/ABI/testing/configfs-rdma_cm
new file mode 100644
index 000..5c389aa
--- /dev/null
+++ b/Documentation/ABI/testing/configfs-rdma_cm
@@ -0,0 +1,22 @@
+What:  /config/rdma_cm
+Date:  November 29, 2015
+KernelVersion:  4.4.0
+Description:   Interface is used to configure RDMA-capable HCAs in respect to
+   RDMA-CM attributes.
+
+   Attributes are visible only when configfs is mounted. To mount
+   configfs in /config directory use:
+   # mount -t configfs none /config/
+
+   In order to set parameters related to a specific HCA, a directory
+   for this HCA has to be created:
+   mkdir -p /config/rdma_cm/<hca-device-name>
+
+
+What:  /config/rdma_cm/<hca-device-name>/ports/<port-num>/default_roce_mode
+Date:  November 29, 2015
+KernelVersion:  4.4.0
+Description:   RDMA-CM based connections from HCA <hca-device-name> at port <port-num>
+   will be initiated with this RoCE type as default.
+   The possible RoCE types are either "IB/RoCE v1" or "RoCE v2".
+   This parameter has RW access.
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 282ec0b..8a8440c 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -55,6 +55,15 @@ config INFINIBAND_ADDR_TRANS
depends on INFINIBAND
default y
 
+config INFINIBAND_ADDR_TRANS_CONFIGFS
+   bool
+   depends on INFINIBAND_ADDR_TRANS && CONFIGFS_FS && !(INFINIBAND=y && 
CONFIGFS_FS=m)
+   default y
+   ---help---
+ ConfigFS support for RDMA communication manager (CM).
+ This allows the user to config the default GID type that the CM
+ uses for each device, when initiating new connections.
+
 source "drivers/infiniband/hw/mthca/Kconfig"
 source "drivers/infiniband/hw/qib/Kconfig"
 source "drivers/infiniband/hw/cxgb3/Kconfig"
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index ae48d87..f818538 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -24,6 +24,8 @@ iw_cm-y :=iwcm.o iwpm_util.o iwpm_msg.o
 
 rdma_cm-y :=   cma.o
 
+rdma_cm-$(CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS) += cma_configfs.o
+
 rdma_ucm-y :=  ucma.o
 
 ib_addr-y :=   addr.o
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 88b4b6f..4aada52 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -140,6 +140,30 @@ const char *ib_cache_gid_type_str(enum ib_gid_type 
gid_type)
 }
 EXPORT_SYMBOL(ib_cache_gid_type_str);
 
+int ib_cache_gid_parse_type_str(const char *buf)
+{
+   unsigned int i;
+   size_t len;
+   int err = -EINVAL;
+
+   len = strlen(buf);
+   if (len == 0)
+   return -EINVAL;
+
+   if (buf[len - 1] == '\n')
+   len--;
+
+   for (i = 0; i < ARRAY_SIZE(gid_type_str); ++i)
+   if (gid_type_str[i] && !strncmp(buf, gid_type_str[i], len) &&
+   len == strlen(gid_type_str[i])) {
+   err = i;
+   break;
+   }
+
+   return err;
+}
+EXPORT_SYMBOL(ib_cache_gid_parse_type_str);
+
 /* This function expects that rwlock will be write locked in all
  * scenarios and that lock will be locked in sleep-able (RoCE)
  * scenarios.
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 322f1c6..75987b0 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -152,6 +152,7 @@ struct cma_device {
struct completion   comp;
atomic_trefcount;
struc

[PATCH for-next V3 11/11] IB/cma: Join and leave multicast groups with IGMP

2015-12-23 Thread Matan Barak
From: Moni Shoua 

Since RoCE v2 runs over an IP header, it is required to send IGMP
join and leave requests to the network when joining and leaving
multicast groups.

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/core/cma.c   |   96 ---
 drivers/infiniband/core/multicast.c |   17 ++-
 include/rdma/ib_sa.h|2 +
 3 files changed, 106 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 75987b0..559ee3d 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -304,6 +305,7 @@ struct cma_multicast {
void*context;
struct sockaddr_storage addr;
struct kref mcref;
+   booligmp_joined;
 };
 
 struct cma_work {
@@ -400,6 +402,26 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 
ip_ver)
hdr->ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF);
 }
 
+static int cma_igmp_send(struct net_device *ndev, union ib_gid *mgid, bool 
join)
+{
+   struct in_device *in_dev = NULL;
+
+   if (ndev) {
+   rtnl_lock();
+   in_dev = __in_dev_get_rtnl(ndev);
+   if (in_dev) {
+   if (join)
+   ip_mc_inc_group(in_dev,
+   *(__be32 *)(mgid->raw + 12));
+   else
+   ip_mc_dec_group(in_dev,
+   *(__be32 *)(mgid->raw + 12));
+   }
+   rtnl_unlock();
+   }
+   return (in_dev) ? 0 : -ENODEV;
+}
+
 static void _cma_attach_to_dev(struct rdma_id_private *id_priv,
   struct cma_device *cma_dev)
 {
@@ -1532,8 +1554,24 @@ static void cma_leave_mc_groups(struct rdma_id_private 
*id_priv)
  id_priv->id.port_num)) {
ib_sa_free_multicast(mc->multicast.ib);
kfree(mc);
-   } else
+   } else {
+   if (mc->igmp_joined) {
+   struct rdma_dev_addr *dev_addr =
+   _priv->id.route.addr.dev_addr;
+   struct net_device *ndev = NULL;
+
+   if (dev_addr->bound_dev_if)
+   ndev = dev_get_by_index(_net,
+   
dev_addr->bound_dev_if);
+   if (ndev) {
+   cma_igmp_send(ndev,
+ 
>multicast.ib->rec.mgid,
+ false);
+   dev_put(ndev);
+   }
+   }
kref_put(>mcref, release_mc);
+   }
}
 }
 
@@ -3645,12 +3683,23 @@ static int cma_ib_mc_handler(int status, struct 
ib_sa_multicast *multicast)
event.status = status;
event.param.ud.private_data = mc->context;
if (!status) {
+   struct rdma_dev_addr *dev_addr =
+   _priv->id.route.addr.dev_addr;
+   struct net_device *ndev =
+   dev_get_by_index(_net, dev_addr->bound_dev_if);
+   enum ib_gid_type gid_type =
+   id_priv->cma_dev->default_gid_type[id_priv->id.port_num 
-
+   rdma_start_port(id_priv->cma_dev->device)];
+
event.event = RDMA_CM_EVENT_MULTICAST_JOIN;
ib_init_ah_from_mcmember(id_priv->id.device,
 id_priv->id.port_num, >rec,
+ndev, gid_type,
 _attr);
event.param.ud.qp_num = 0xFF;
event.param.ud.qkey = be32_to_cpu(multicast->rec.qkey);
+   if (ndev)
+   dev_put(ndev);
} else
event.event = RDMA_CM_EVENT_MULTICAST_ERROR;
 
@@ -3783,9 +3832,10 @@ static int cma_iboe_join_multicast(struct 
rdma_id_private *id_priv,
 {
struct iboe_mcast_work *work;
struct rdma_dev_addr *dev_addr = _priv->id.route.addr.dev_addr;
-   int err;
+   int err = 0;
struct sockaddr *addr = (struct sockaddr *)>addr;
struct net_device *ndev = NULL;
+   enum ib_gid_type gid_type;
 
if (cma_zero_addr((struct sockaddr *)>addr))
return -EINVAL;
@@ -3815,9 +3865,25 @@ static int cma_iboe_join_multicast(struct 
rdma_id_private *id_priv,
mc->multicast.ib->rec.rate = iboe_get_rate(ndev);

[PATCH for-next V3 05/11] IB/core: Add rdma_network_type to wc

2015-12-23 Thread Matan Barak
From: Somnath Kotur <somnath.ko...@avagotech.com>

Providers should tell IB core the wc's network type.
This is used in order to search for the proper GID in the
GID table. When using HCAs that can't provide this info,
IB core falls back to examining the packet headers and extracting
the GID type itself.

We choose the sgid_index and type from all the matching entries in
RDMA-CM based on a hint from the IP stack, and we set the hop_limit for
the IP packet based on the same hint.
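
For providers that do know the packet's L3 type, reporting it is a
one-liner in the poll-cq path; a sketch (the flag and field names here
are the ones I assume from the ib_verbs.h part of this patch, so treat
them as illustrative):

	/* receive completion known to carry an IPv4 (RoCE v2) packet */
	wc->wc_flags |= IB_WC_WITH_NETWORK_HDR_TYPE;
	wc->network_hdr_type = RDMA_NETWORK_IPV4;	/* or _IPV6 / _IB */

Providers that can't do this simply leave the flag clear and the core
falls back to the header parsing described above.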

Signed-off-by: Matan Barak <mat...@mellanox.com>
Signed-off-by: Somnath Kotur <somnath.ko...@avagotech.com>
---
 drivers/infiniband/core/addr.c  |   14 +
 drivers/infiniband/core/cma.c   |   11 +++-
 drivers/infiniband/core/verbs.c |  123 +--
 include/rdma/ib_addr.h  |1 +
 include/rdma/ib_verbs.h |   44 ++
 5 files changed, 187 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 34b1ada..6e35299 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -256,6 +256,12 @@ static int addr4_resolve(struct sockaddr_in *src_in,
goto put;
}
 
+   /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+* routable) and we could set the network type accordingly.
+*/
+   if (rt->rt_uses_gateway)
+   addr->network = RDMA_NETWORK_IPV4;
+
ret = dst_fetch_ha(>dst, addr, );
 put:
ip_rt_put(rt);
@@ -270,6 +276,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 {
struct flowi6 fl6;
struct dst_entry *dst;
+   struct rt6_info *rt;
int ret;
 
memset(, 0, sizeof fl6);
@@ -281,6 +288,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
if ((ret = dst->error))
goto put;
 
+   rt = (struct rt6_info *)dst;
if (ipv6_addr_any()) {
ret = ipv6_dev_get_saddr(addr->net, ip6_dst_idev(dst)->dev,
 , 0, );
@@ -304,6 +312,12 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
goto put;
}
 
+   /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+* routable) and we could set the network type accordingly.
+*/
+   if (rt->rt6i_flags & RTF_GATEWAY)
+   addr->network = RDMA_NETWORK_IPV6;
+
ret = dst_fetch_ha(dst, addr, );
 put:
dst_release(dst);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 446323a..0a29d60 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2291,6 +2291,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
 {
struct rdma_route *route = _priv->id.route;
struct rdma_addr *addr = >addr;
+   enum ib_gid_type network_gid_type;
struct cma_work *work;
int ret;
struct net_device *ndev = NULL;
@@ -2329,7 +2330,15 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
rdma_ip2gid((struct sockaddr *)_priv->id.route.addr.dst_addr,
>path_rec->dgid);
 
-   route->path_rec->hop_limit = 1;
+   /* Use the hint from IP Stack to select GID Type */
+   network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
+   if (addr->dev_addr.network != RDMA_NETWORK_IB) {
+   route->path_rec->gid_type = network_gid_type;
+   /* TODO: get the hoplimit from the inet/inet6 device */
+   route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;
+   } else {
+   route->path_rec->hop_limit = 1;
+   }
route->path_rec->reversible = 1;
route->path_rec->pkey = cpu_to_be16(0x);
route->path_rec->mtu_selector = IB_SA_EQ;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 4d1737c..4efb119 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -305,8 +305,61 @@ struct ib_ah *ib_create_ah(struct ib_pd *pd, struct 
ib_ah_attr *ah_attr)
 }
 EXPORT_SYMBOL(ib_create_ah);
 
+static int ib_get_header_version(const union rdma_network_hdr *hdr)
+{
+   const struct iphdr *ip4h = (struct iphdr *)&hdr->roce4grh;
+   struct iphdr ip4h_checked;
+   const struct ipv6hdr *ip6h = (struct ipv6hdr *)&hdr->ibgrh;
+
+   /* If it's IPv6, the version must be 6, otherwise, the first
+* 20 bytes (before the IPv4 header) are garbled.
+*/
+   if (ip6h->version != 6)
+   return (ip4h->version == 4) ? 4 : 0;
+   /* version may be 6 or 4 because the first 20 bytes could be garbled */
+
+   /* RoCE v2 requires no options, thus header length
+* must be 5 words
+*/
+   if (ip4h->ihl != 5)
+   return 

Re: [PATCH] IB/cma: cma_match_net_dev needs to take into account port_num

2015-12-22 Thread Matan Barak
On Tue, Dec 22, 2015 at 8:58 PM, Doug Ledford <dledf...@redhat.com> wrote:
> On 12/22/2015 05:47 AM, Or Gerlitz wrote:
>> On 12/21/2015 5:01 PM, Matan Barak wrote:
>>> Previously, cma_match_net_dev called cma_protocol_roce which
>>> tried to verify that the IB device uses RoCE protocol. However,
>>> if rdma_id didn't have a bounded port, it used the first port
>>> of the device.
>>>
>>> In VPI systems, the first port might be an IB port while the second
>>> one could be an Ethernet port. This made requests for unbounded rdma_ids
>>> that come from the Ethernet port fail.
>>> Fixing this by passing the port of the request and checking this port
>>> of the device.
>>>
>>> Fixes: b8cab5dab15f ('IB/cma: Accept connection without a valid netdev
>>> on RoCE')
>>> Signed-off-by: Matan Barak<mat...@mellanox.com>
>>
>> seems that the patch is missing from patchworks, I can't explain that.
>
> I've already downloaded it and marked it accepted.
>

Thanks Doug. Would you like me to repost the patch with the commit
message changed as Or suggested, or is the current version good enough?

Regarding the Ethernet loopback issue, I started looking into that,
but as Or stated, it's broken even before the RoCE patches.

> --
> Doug Ledford <dledf...@redhat.com>
>   GPG KeyID: 0E572FDD
>
>

Matan


[PATCH] IB/cma: cma_match_net_dev needs to take into account port_num

2015-12-21 Thread Matan Barak
Previously, cma_match_net_dev called cma_protocol_roce which
tried to verify that the IB device uses RoCE protocol. However,
if rdma_id didn't have a bounded port, it used the first port
of the device.

In VPI systems, the first port might be an IB port while the second
one could be an Ethernet port. This made requests for unbounded rdma_ids
that come from the Ethernet port fail.
Fixing this by passing the port of the request and checking this port
of the device.

Fixes: b8cab5dab15f ('IB/cma: Accept connection without a valid netdev on RoCE')
Signed-off-by: Matan Barak <mat...@mellanox.com>
---
Hi Doug,

This patch fixes a bug in VPI systems, where the first port is configured
as IB and the second one is configured as Ethernet.
In this case, if the rdma_id isn't bounded to a port, cma_match_net_dev
will try to verify that the first port is a RoCE port and fail.
This is fixed by passing the port of the incoming request.

Regards,
Matan

 drivers/infiniband/core/cma.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d2d5d00..c8a265c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1265,15 +1265,17 @@ static bool cma_protocol_roce(const struct rdma_cm_id 
*id)
return cma_protocol_roce_dev_port(device, port_num);
 }
 
-static bool cma_match_net_dev(const struct rdma_id_private *id_priv,
- const struct net_device *net_dev)
+static bool cma_match_net_dev(const struct rdma_cm_id *id,
+ const struct net_device *net_dev,
+ u8 port_num)
 {
-   const struct rdma_addr *addr = _priv->id.route.addr;
+   const struct rdma_addr *addr = >route.addr;
 
if (!net_dev)
/* This request is an AF_IB request or a RoCE request */
-   return addr->src_addr.ss_family == AF_IB ||
-  cma_protocol_roce(_priv->id);
+   return (!id->port_num || id->port_num == port_num) &&
+  (addr->src_addr.ss_family == AF_IB ||
+   cma_protocol_roce_dev_port(id->device, port_num));
 
return !addr->dev_addr.bound_dev_if ||
   (net_eq(dev_net(net_dev), addr->dev_addr.net) &&
@@ -1295,13 +1297,13 @@ static struct rdma_id_private *cma_find_listener(
hlist_for_each_entry(id_priv, _list->owners, node) {
if (cma_match_private_data(id_priv, ib_event->private_data)) {
if (id_priv->id.device == cm_id->device &&
-   cma_match_net_dev(id_priv, net_dev))
+   cma_match_net_dev(_priv->id, net_dev, req->port))
return id_priv;
list_for_each_entry(id_priv_dev,
_priv->listen_list,
listen_list) {
if (id_priv_dev->id.device == cm_id->device &&
-   cma_match_net_dev(id_priv_dev, net_dev))
+   cma_match_net_dev(_priv_dev->id, 
net_dev, req->port))
return id_priv_dev;
}
}
-- 
1.7.1



Re: RoCE passive side failures on 4.4-rc5

2015-12-21 Thread Matan Barak
On Sun, Dec 20, 2015 at 9:29 AM, Or Gerlitz  wrote:
> On 12/17/2015 3:58 PM, Or Gerlitz wrote:
>>
>> Using 4.4-rc5+ [1] and **not** applying any of the patches I sent today,
>> I noted that RoCE passive side isn't working (rdma-cm, ibv_rc_pingpong
>> works).
>>
>> I have two nodes in ConnectX3 VPI config (port1 IB and port2 Eth), the one
>> with the 4.4-rc5 kernel can act as both (rping) client/server for IB links
>> but only (rping) client for RoCE.
>>
>> I tried both inter-node and loopback runs, in all cases, the client side
>> gets CM
>> reject with reason 28, see [2], tried both iser and rping. Eth (ICMP, TCP)
>> works OK.
>
>
> OK, small progress, when I force the Eth link type on my IB port (using mlx4
> sysfs), things work.
>
> You should be able to reproduce it on your non-VPI systems the other way
> around, by
> forcing IB link type on one of the Eth ports and see the failure.
>
> I saw the same behavior with both 4.4-rc2 and 4.4-rc5
>
> Or.

I've posted a patch that fixes that, please take a look at [1].

Regards,
Matan

[1] https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg30777.html

>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-20 Thread Matan Barak
On Mon, Dec 14, 2015 at 6:06 PM, Christoph Lameter <c...@linux.com> wrote:
> On Mon, 14 Dec 2015, Matan Barak wrote:
>
>> > No idea what the counter is doing. Saw another EXT counter implementation
>> > use 0 so I thought that was fine.
>>
>> It seems like a counter index, but I might be wrong though. If it is,
>> don't we want to preserve the existing non-EXT schema for the new
>> counters too?
>
> I do not see any use of that field so I am not sure what to put in there.
> Could it be obsolete?
>

I don't see any usage of that field either, but I think 0 is a bit misleading.
So maybe it's more appropriate to delete this field.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 3/5] IB/mlx5: Add support for hca_core_clock and timestamp_mask

2015-12-16 Thread Matan Barak
On Wed, Dec 16, 2015 at 4:43 PM, Sagi Grimberg  wrote:
>
>> Reporting the hca_core_clock (in kHZ) and the timestamp_mask in
>> query_device extended verb. timestamp_mask is used by users in order
>> to know what is the valid range of the raw timestamps, while
>> hca_core_clock reports the clock frequency that is used for
>> timestamps.
>
>
> Hi Matan,
>
> Shouldn't this patch come last?
>

Not necessarily. In order to support completion timestamping (which is
what this query_device patch defines), we only need create_cq_ex in
mlx5_ib.
The downstream patches add support for reading the HCA core clock
(via query_values).
One could have completion timestamping support without having
ibv_query_values support.
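For illustration, a consumer that only wants timestamped completions needs
nothing beyond the extended create-CQ path. A minimal user-space sketch,
assuming the extended-verbs API names ibv_create_cq_ex and
IBV_WC_EX_WITH_COMPLETION_TIMESTAMP (not part of this kernel series):

#include <infiniband/verbs.h>

/* Sketch only: create a CQ that reports raw HCA timestamps. */
static struct ibv_cq_ex *create_ts_cq(struct ibv_context *ctx)
{
	struct ibv_cq_init_attr_ex cq_attr = {
		.cqe      = 256,
		.wc_flags = IBV_WC_EX_WITH_COMPLETION_TIMESTAMP,
	};

	/* Needs only create_cq_ex support; no query_values required. */
	return ibv_create_cq_ex(ctx, &cq_attr);
}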

Thanks for taking a look.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 4/5] IB/mlx5: Add hca_core_clock_offset to udata in init_ucontext

2015-12-15 Thread Matan Barak
Passing hca_core_clock_offset to user-space is mandatory in order to
let user-space read the free-running clock register from the
right offset in the memory-mapped page.
Passing this value is done by changing the vendor's command
and response of init_ucontext to an extensible form.
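For reference, the user-space side of this handshake would check the
comp_mask bit before trusting the new field. A hypothetical sketch using
the response fields this patch defines in the driver's user.h:

/* Sketch only: consume the extensible alloc_ucontext response. */
static int get_clock_offset(const struct mlx5_ib_alloc_ucontext_resp *resp,
			    size_t *clk_off)
{
	if (!(resp->comp_mask &
	      MLX5_IB_ALLOC_UCONTEXT_RESP_MASK_CORE_CLOCK_OFFSET))
		return -1;	/* older kernel: offset not reported */

	*clk_off = resp->hca_core_clock_offset;
	return 0;
}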

Signed-off-by: Matan Barak <mat...@mellanox.com>
Reviewed-by: Moshe Lazer <mos...@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c| 37 
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  3 +++
 drivers/infiniband/hw/mlx5/user.h| 12 ++--
 include/linux/mlx5/device.h  |  7 +--
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index c707c43..917363c 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -582,8 +582,8 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
  struct ib_udata *udata)
 {
struct mlx5_ib_dev *dev = to_mdev(ibdev);
-   struct mlx5_ib_alloc_ucontext_req_v2 req;
-   struct mlx5_ib_alloc_ucontext_resp resp;
+   struct mlx5_ib_alloc_ucontext_req_v2 req = {};
+   struct mlx5_ib_alloc_ucontext_resp resp = {};
struct mlx5_ib_ucontext *context;
struct mlx5_uuar_info *uuari;
struct mlx5_uar *uars;
@@ -598,20 +598,19 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
if (!dev->ib_active)
return ERR_PTR(-EAGAIN);
 
-   memset(&req, 0, sizeof(req));
reqlen = udata->inlen - sizeof(struct ib_uverbs_cmd_hdr);
if (reqlen == sizeof(struct mlx5_ib_alloc_ucontext_req))
ver = 0;
-   else if (reqlen == sizeof(struct mlx5_ib_alloc_ucontext_req_v2))
+   else if (reqlen >= sizeof(struct mlx5_ib_alloc_ucontext_req_v2))
ver = 2;
else
return ERR_PTR(-EINVAL);
 
-   err = ib_copy_from_udata(&req, udata, reqlen);
+   err = ib_copy_from_udata(&req, udata, min(reqlen, sizeof(req)));
if (err)
return ERR_PTR(err);
 
-   if (req.flags || req.reserved)
+   if (req.flags)
return ERR_PTR(-EINVAL);
 
if (req.total_num_uuars > MLX5_MAX_UUARS)
@@ -620,6 +619,14 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
if (req.total_num_uuars == 0)
return ERR_PTR(-EINVAL);
 
+   if (req.comp_mask)
+   return ERR_PTR(-EOPNOTSUPP);
+
+   if (reqlen > sizeof(req) &&
+   !ib_is_udata_cleared(udata, sizeof(req),
+udata->inlen - sizeof(req)))
+   return ERR_PTR(-EOPNOTSUPP);
+
req.total_num_uuars = ALIGN(req.total_num_uuars,
MLX5_NON_FP_BF_REGS_PER_PAGE);
if (req.num_low_latency_uuars > req.total_num_uuars - 1)
@@ -635,6 +642,8 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
resp.max_send_wqebb = 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz);
resp.max_recv_wr = 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz);
resp.max_srq_recv_wr = 1 << MLX5_CAP_GEN(dev->mdev, log_max_srq_sz);
+   resp.response_length = min(offsetof(typeof(resp), response_length) +
+  sizeof(resp.response_length), udata->outlen);
 
context = kzalloc(sizeof(*context), GFP_KERNEL);
if (!context)
@@ -685,8 +694,20 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
 
resp.tot_uuars = req.total_num_uuars;
resp.num_ports = MLX5_CAP_GEN(dev->mdev, num_ports);
-   err = ib_copy_to_udata(udata, &resp,
-  sizeof(resp) - sizeof(resp.reserved));
+
+   if (field_avail(typeof(resp), reserved2, udata->outlen))
+   resp.response_length += sizeof(resp.reserved2);
+
+   if (field_avail(typeof(resp), hca_core_clock_offset, udata->outlen)) {
+   resp.comp_mask |=
+   MLX5_IB_ALLOC_UCONTEXT_RESP_MASK_CORE_CLOCK_OFFSET;
+   resp.hca_core_clock_offset =
+   offsetof(struct mlx5_init_seg, internal_timer_h) %
+   PAGE_SIZE;
+   resp.response_length += sizeof(resp.hca_core_clock_offset);
+   }
+
+   err = ib_copy_to_udata(udata, &resp, resp.response_length);
if (err)
goto out_uars;
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 6333472..43b3c58 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -55,6 +55,9 @@ pr_err("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, 
__func__,\
 pr_warn("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, __fun

[PATCH for-next V2 0/5] User-space time-stamping support for mlx5_ib

2015-12-15 Thread Matan Barak
Hi Eli,

This patch-set adds user-space support for time-stamping in mlx5_ib.
It implements the necessary API:
(a) ib_create_cq_ex - Add support for CQ creation flags
(b) ib_query_device - return timestamp_mask and hca_core_clock.

We also add support for mmaping the HCA's free running clock.
In order to do so, we use the response of the vendor's extended
part in init_ucontext. This allows us to pass the page offset
of the free running clock register to the user-space driver.
In order to implement it in a future-extensible manner, we use the
same mechanism of verbs extensions to the mlx5 vendor part as well.

Regards,
Matan

Changes from v1:
 * Change ib_is_udata_cleared to use memchr_inv.

Changes from v0:
 * Limit mmap PAGE_SIZE to 4K (security wise).
 * Optimize ib_is_udata_cleared.
 * Pass hca_core_clock_offset in the vendor's response part of init_ucontext.

Matan Barak (5):
  IB/mlx5: Add create_cq extended command
  IB/core: Add ib_is_udata_cleared
  IB/mlx5: Add support for hca_core_clock and timestamp_mask
  IB/mlx5: Add hca_core_clock_offset to udata in init_ucontext
  IB/mlx5: Mmap the HCA's core clock register to user-space

 drivers/infiniband/hw/mlx5/cq.c  |  7 
 drivers/infiniband/hw/mlx5/main.c| 67 +++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  7 +++-
 drivers/infiniband/hw/mlx5/user.h| 12 +--
 include/linux/mlx5/device.h  |  7 ++--
 include/linux/mlx5/mlx5_ifc.h|  9 +++--
 include/rdma/ib_verbs.h  | 27 +++
 7 files changed, 120 insertions(+), 16 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 5/5] IB/mlx5: Mmap the HCA's core clock register to user-space

2015-12-15 Thread Matan Barak
In order to read the HCA's current cycles register, we need
to map it to user-space. Add support to map this register
via mmap command.
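A rough sketch of the user-space read, assuming the register layout used
here (internal_timer_h followed by internal_timer_l, big endian) and taking
the driver-specific mmap offset for MLX5_IB_MMAP_CORE_CLOCK as a parameter;
the real consumer is libmlx5 and may differ in detail:

#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <endian.h>

static uint64_t read_core_clock(int cmd_fd, long page_size,
				off_t core_clock_mmap_off,
				size_t hca_core_clock_offset)
{
	void *page = mmap(NULL, page_size, PROT_READ, MAP_SHARED,
			  cmd_fd, core_clock_mmap_off);
	volatile uint32_t *clk;
	uint32_t hi, lo;

	if (page == MAP_FAILED)
		return 0;

	clk = (volatile uint32_t *)((char *)page + hca_core_clock_offset);
	hi = be32toh(clk[0]);	/* internal_timer_h */
	lo = be32toh(clk[1]);	/* internal_timer_l */
	/* A robust reader re-reads the high half to detect wraparound. */
	munmap(page, page_size);
	return ((uint64_t)hi << 32) | lo;
}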

Signed-off-by: Matan Barak <mat...@mellanox.com>
Reviewed-by: Moshe Lazer <mos...@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c| 28 
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  4 +++-
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 917363c..d037a72 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -810,6 +810,34 @@ static int mlx5_ib_mmap(struct ib_ucontext *ibcontext, 
struct vm_area_struct *vm
case MLX5_IB_MMAP_GET_CONTIGUOUS_PAGES:
return -ENOSYS;
 
+   case MLX5_IB_MMAP_CORE_CLOCK:
+   {
+   phys_addr_t pfn;
+
+   if (vma->vm_end - vma->vm_start != PAGE_SIZE)
+   return -EINVAL;
+
+   if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+   return -EPERM;
+
+   /* Don't expose to user-space information it shouldn't have */
+   if (PAGE_SIZE > 4096)
+   return -EOPNOTSUPP;
+
+   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+   pfn = (dev->mdev->iseg_base +
+  offsetof(struct mlx5_init_seg, internal_timer_h)) >>
+   PAGE_SHIFT;
+   if (io_remap_pfn_range(vma, vma->vm_start, pfn,
+  PAGE_SIZE, vma->vm_page_prot))
+   return -EAGAIN;
+
+   mlx5_ib_dbg(dev, "mapped internal timer at 0x%lx, PA 0x%llx\n",
+   vma->vm_start,
+   (unsigned long long)pfn << PAGE_SHIFT);
+   break;
+   }
+
default:
return -EINVAL;
}
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 43b3c58..a87f312 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -65,7 +65,9 @@ enum {
 
 enum mlx5_ib_mmap_cmd {
MLX5_IB_MMAP_REGULAR_PAGE   = 0,
-   MLX5_IB_MMAP_GET_CONTIGUOUS_PAGES   = 1, /* always last */
+   MLX5_IB_MMAP_GET_CONTIGUOUS_PAGES   = 1,
+   /* 5 is chosen in order to be compatible with old versions of libmlx5 */
+   MLX5_IB_MMAP_CORE_CLOCK = 5,
 };
 
 enum {
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 3/5] IB/mlx5: Add support for hca_core_clock and timestamp_mask

2015-12-15 Thread Matan Barak
Report the hca_core_clock (in kHz) and the timestamp_mask in the
extended query_device verb. timestamp_mask is used by users in order
to know the valid range of the raw timestamps, while
hca_core_clock reports the clock frequency that is used for
timestamps.
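To show how the two values fit together (illustrative only, not part of
the patch): the difference between two raw completion timestamps, masked
by timestamp_mask to handle counter wraparound, divided by the clock
frequency, gives the elapsed time.

#include <stdint.h>

/* Sketch: convert a delta of raw timestamps to nanoseconds. */
static uint64_t raw_delta_to_ns(uint64_t ts_start, uint64_t ts_end,
				uint64_t timestamp_mask,
				uint64_t hca_core_clock_khz)
{
	uint64_t cycles = (ts_end - ts_start) & timestamp_mask;

	/* cycles / (kHz * 1000) seconds == cycles * 10^6 / kHz nanoseconds */
	return cycles * 1000000ULL / hca_core_clock_khz;
}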

Signed-off-by: Matan Barak <mat...@mellanox.com>
Reviewed-by: Moshe Lazer <mos...@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c | 2 ++
 include/linux/mlx5/mlx5_ifc.h | 9 ++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 7e97cb5..c707c43 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -293,6 +293,8 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
props->max_total_mcast_qp_attach = props->max_mcast_qp_attach *
   props->max_mcast_grp;
props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
+   props->hca_core_clock = MLX5_CAP_GEN(mdev, device_frequency_khz);
+   props->timestamp_mask = 0x7FFFFFFFFFFFFFFFULL;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
if (MLX5_CAP_GEN(mdev, pg))
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 1565324..1a74114 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -794,15 +794,18 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 reserved_63[0x8];
u8 log_uar_page_sz[0x10];
 
-   u8 reserved_64[0x100];
+   u8 reserved_64[0x20];
+   u8 device_frequency_mhz[0x20];
+   u8 device_frequency_khz[0x20];
+   u8 reserved_65[0xa0];
 
-   u8 reserved_65[0x1f];
+   u8 reserved_66[0x1f];
u8 cqe_zip[0x1];
 
u8 cqe_zip_timeout[0x10];
u8 cqe_zip_max_num[0x10];
 
-   u8 reserved_66[0x220];
+   u8 reserved_67[0x220];
 };
 
 enum {
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 2/5] IB/core: Add ib_is_udata_cleared

2015-12-15 Thread Matan Barak
Extending core and vendor verb commands require us to check that the
unknown part of the user's given command is all zeros.
Adding ib_is_udata_cleared in order to do so.
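A typical call site (a hypothetical sketch of the intended usage, mirroring
the way the mlx5 alloc_ucontext patch later in this series uses it) rejects
a longer command whose unknown tail is not zeroed:

#include <rdma/ib_verbs.h>

/* Sketch only: accept an extended command only if the bytes beyond the
 * fields this kernel knows about are all zero. */
static int check_extended_cmd(struct ib_udata *udata, size_t known_len)
{
	if (udata->inlen > known_len &&
	    !ib_is_udata_cleared(udata, known_len, udata->inlen - known_len))
		return -EOPNOTSUPP;

	return 0;
}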

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 include/rdma/ib_verbs.h | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9a68a19..33ab4eb 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -50,6 +50,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1887,6 +1889,31 @@ static inline int ib_copy_to_udata(struct ib_udata 
*udata, void *src, size_t len
return copy_to_user(udata->outbuf, src, len) ? -EFAULT : 0;
 }
 
+static inline bool ib_is_udata_cleared(struct ib_udata *udata,
+  size_t offset,
+  size_t len)
+{
+   const void __user *p = udata->inbuf + offset;
+   bool ret = false;
+   u8 *buf;
+
+   if (len > USHRT_MAX)
+   return false;
+
+   buf = kmalloc(len, GFP_KERNEL);
+   if (!buf)
+   return false;
+
+   if (copy_from_user(buf, p, len))
+   goto free;
+
+   ret = !memchr_inv(buf, 0, len);
+
+free:
+   kfree(buf);
+   return ret;
+}
+
 /**
  * ib_modify_qp_is_ok - Check that the supplied attribute mask
  * contains all required attributes and no attributes not allowed for
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 1/5] IB/mlx5: Add create_cq extended command

2015-12-15 Thread Matan Barak
In order to create a CQ that supports timestamping, mlx5 needs to
support the extended create CQ command with the timestamp flag.

Signed-off-by: Matan Barak <mat...@mellanox.com>
Reviewed-by: Eli Cohen <e...@mellanox.com>
---
 drivers/infiniband/hw/mlx5/cq.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 3dfd287..186debf 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -743,6 +743,10 @@ static void destroy_cq_kernel(struct mlx5_ib_dev *dev, 
struct mlx5_ib_cq *cq)
mlx5_db_free(dev->mdev, &cq->db);
 }
 
+enum {
+   CQ_CREATE_FLAGS_SUPPORTED = IB_CQ_FLAGS_TIMESTAMP_COMPLETION
+};
+
 struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev,
const struct ib_cq_init_attr *attr,
struct ib_ucontext *context,
@@ -766,6 +770,9 @@ struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev,
if (entries < 0)
return ERR_PTR(-EINVAL);
 
+   if (attr->flags & ~CQ_CREATE_FLAGS_SUPPORTED)
+   return ERR_PTR(-EOPNOTSUPP);
+
entries = roundup_pow_of_two(entries + 1);
if (entries > (1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz)))
return ERR_PTR(-EINVAL);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-14 Thread Matan Barak
On Mon, Dec 14, 2015 at 4:55 PM, Christoph Lameter <c...@linux.com> wrote:
> On Mon, 14 Dec 2015, Matan Barak wrote:
>
>> > +static PORT_PMA_ATTR(unicast_rcv_packets   ,  0, 64, 384, 
>> > IB_PMA_PORT_COUNTERS_EXT);
>> > +static PORT_PMA_ATTR(multicast_xmit_packets,  0, 64, 448, 
>> > IB_PMA_PORT_COUNTERS_EXT);
>> > +static PORT_PMA_ATTR(multicast_rcv_packets ,  0, 64, 512, 
>> > IB_PMA_PORT_COUNTERS_EXT);
>> >
>>
>> Why do we use 0 as the counter argument for all EXT counters?
>
> No idea what the counter is doing. Saw another EXT counter implementation
> use 0 so I thought that was fine.

It seems like a counter index, but I might be wrong though. If it is,
don't we want to preserve the existing non-EXT schema for the new
counters too?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-14 Thread Matan Barak
On Fri, Dec 11, 2015 at 8:25 PM, Christoph Lameter  wrote:
> Display the additional 64 bit counters available through the extended
> set and replace the existing 32 bit counters if there is a 64 bit
> alternative available.
>
> Note: This requires universal support of extended counters in
> the devices. If there are still devices around that do not
> support extended counters then we will have to add some fallback
> technique here.
>
> Signed-off-by: Christoph Lameter 
> ---
>  drivers/infiniband/core/sysfs.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
> index 0083a4f..f7f2954 100644
> --- a/drivers/infiniband/core/sysfs.c
> +++ b/drivers/infiniband/core/sysfs.c
> @@ -406,10 +406,14 @@ static PORT_PMA_ATTR(port_rcv_constraint_errors   , 
>  8,  8, 136, IB_PMA_PORT_C
>  static PORT_PMA_ATTR(local_link_integrity_errors,  9,  4, 152, 
> IB_PMA_PORT_COUNTERS);
>  static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10,  4, 156, 
> IB_PMA_PORT_COUNTERS);
>  static PORT_PMA_ATTR(VL15_dropped  , 11, 16, 176, 
> IB_PMA_PORT_COUNTERS);
> -static PORT_PMA_ATTR(port_xmit_data, 12, 32, 192, 
> IB_PMA_PORT_COUNTERS);
> -static PORT_PMA_ATTR(port_rcv_data , 13, 32, 224, 
> IB_PMA_PORT_COUNTERS);
> -static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256, 
> IB_PMA_PORT_COUNTERS);
> -static PORT_PMA_ATTR(port_rcv_packets  , 15, 32, 288, 
> IB_PMA_PORT_COUNTERS);
> +static PORT_PMA_ATTR(port_xmit_data,  0, 64,  64, 
> IB_PMA_PORT_COUNTERS_EXT);
> +static PORT_PMA_ATTR(port_rcv_data ,  0, 64, 128, 
> IB_PMA_PORT_COUNTERS_EXT);
> +static PORT_PMA_ATTR(port_xmit_packets ,  0, 64, 192, 
> IB_PMA_PORT_COUNTERS_EXT);
> +static PORT_PMA_ATTR(port_rcv_packets  ,  0, 64, 256, 
> IB_PMA_PORT_COUNTERS_EXT);
> +static PORT_PMA_ATTR(unicast_xmit_packets  ,  0, 64, 320, 
> IB_PMA_PORT_COUNTERS_EXT);
> +static PORT_PMA_ATTR(unicast_rcv_packets   ,  0, 64, 384, 
> IB_PMA_PORT_COUNTERS_EXT);
> +static PORT_PMA_ATTR(multicast_xmit_packets,  0, 64, 448, 
> IB_PMA_PORT_COUNTERS_EXT);
> +static PORT_PMA_ATTR(multicast_rcv_packets ,  0, 64, 512, 
> IB_PMA_PORT_COUNTERS_EXT);
>

Why do we use 0 as the counter argument for all EXT counters?

>  static struct attribute *pma_attrs[] = {
> &port_pma_attr_symbol_error.attr.attr,
> @@ -428,6 +432,10 @@ static struct attribute *pma_attrs[] = {
> &port_pma_attr_port_rcv_data.attr.attr,
> &port_pma_attr_port_xmit_packets.attr.attr,
> &port_pma_attr_port_rcv_packets.attr.attr,
> +   &port_pma_attr_unicast_rcv_packets.attr.attr,
> +   &port_pma_attr_unicast_xmit_packets.attr.attr,
> +   &port_pma_attr_multicast_rcv_packets.attr.attr,
> +   &port_pma_attr_multicast_xmit_packets.attr.attr,
> NULL
>  };
>
> --
> 2.5.0
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-14 Thread Matan Barak
On Sun, Dec 13, 2015 at 5:47 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 10/12/2015 19:29, Matan Barak wrote:
>> On Thu, Dec 10, 2015 at 5:20 PM, Haggai Eran <hagg...@mellanox.com> wrote:
>>> On 10/12/2015 16:59, Matan Barak wrote:
>>>> On Mon, Dec 7, 2015 at 3:18 PM, Haggai Eran <hagg...@mellanox.com> wrote:
>>>>> On 12/03/2015 05:44 PM, Matan Barak wrote:
>>>>>> Extending core and vendor verb commands require us to check that the
>>>>>> unknown part of the user's given command is all zeros.
>>>>>> Adding ib_is_udata_cleared in order to do so.
>>>>>>
>>>>>
>>>>> Why not copy the data into kernel space and run memchr_inv() on it?
>>>>>
>>>>
>>>> Probably less efficient, isn't it?
>>> Why do you think it is less efficient?
>>>
>>> I'm not sure calling copy_from_user multiple times is very efficient.
>>> For once, you are calling access_ok multiple times. I guess it depends
>>> on the amount of data you are copying.
>>>
>>
>> Isn't access_ok pretty cheap?
>> It calls __chk_range_not_ok which on x86 seems like a very cheap
>> function and __chk_user_ptr which is a compiler check.
>> I guess most kernel-user implementation will be pretty much in sync,
>> so we'll possibly call it for a few/dozens of bytes. In that case, I
>> think this implementation is a bit faster.
>>
>>>> I know it isn't data path, but we'll execute this code in all extended
>>>> functions (sometimes even more than once).
>>> Do you think it is important enough to maintain our own copy of
>>> memchr_inv()?
>>>
>>
>> True, I'm not sure it's important enough, but do you think it's that
>> complicated?
>
> It is complicated in my opinion. It is 67 lines of code, it's
> architecture dependent and relies on preprocessor macros and conditional
> code. I think this kind of stuff belongs in lib/string.c and not in the
> RDMA stack.
>

I'm not sure about the string.c location, as it deals with user
buffers, but in order not to depend on this, I'll change the code to
the following.

static inline bool ib_is_udata_cleared(struct ib_udata *udata,
   u8 cleared_char,
   size_t offset,
   size_t len)
{
const void __user *p = udata->inbuf + offset;
bool ret = false;
u8 *buf;

if (len > USHRT_MAX)
return false;

buf = kmalloc(len, GFP_KERNEL);
if (!buf)
return false;

if (copy_from_user(buf, p, len))
goto free;

ret = !memchr_inv(buf, cleared_char, len);

free:
kfree(buf);
return ret;
}

> Haggai

Regards,
Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-13 Thread Matan Barak
On Sun, Dec 13, 2015 at 3:56 PM, Liran Liss  wrote:
>> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>
>>
>> > You are pushing abstraction into provider code instead of handling it in a
>> generic way.
>>
>> No, I am defining an API that *make sense* and doesn't leak useless details.
>> Of course that doesn't force code duplication or anyhting like that, just
>> implement it smartly.
>>
>> I think mlx made a big mistake returning network_type instead of gid index,
>> and I don't want to see that error enshrined in our API.
>>
>> > The Verbs are a low-level API, that should report exactly what was
>> > received from the wire.  In the RoCEv2 case, it should be the GID/IP
>> > addresses and the protocol type.  The addressing information is not
>> > intended to be used directly by applications; it is the raw bits that
>> > were accepted from the wire.
>>
>> Low level details isn't what any in kernel consumer needs. Everything in
>> kernel needs the gid index to determine the namespace, routing and other
>> details. It is not optional. A common API is thus needed to do this 
>> conversion.
>
> The Verbs are not intended only for kernel consumers, but also for the 
> ib_core, cma, etc.
> For the ib_core, a provider needs to report *all* relevant information that 
> is not visible in the packet payload.
> The network type is part of this information.
> The proposed changes are a straightforward extension to the code base, 
> directly follow the specification, and adhere to the RDMA stack design in 
> which IP addressing is handled by the cma.
>
> Also, I don't want to do any route resolution on the Rx path.
> A UD QP completion just reports the details of the packet it received.
>
> Conceptually, an incoming packet may not even match an SGID index at all.
> Maybe, responses should be sent from another device. This should not be 
> decided that the point that a packet was received.
>
>>
>> API-wise, once you get the gid index then it is trivial to make easy 
>> extractors
>> for everything else. ie for example:
>> rdma_get_ud_src_sockaddr(gid_index,,wc,grh)
>> rdma_get_ud_dst_sockaddr(gid_index,,wc,grh)
>>
>
> Nice ideas, but not relevant to completions.
> The resolved dst address could point to another port or device completely.
> The proper way to handle remote UD addresses is by rdma_ids that encapsulate 
> address handles and bound devices.
>
>> > ib_init_ah_from_wc() and friends is exactly the place that you want to
>> > create an address handle based on completion and packet fields.
>>
>> CMA needs exactly the same logic as well, the fact it doesn't have it is a 
>> bug
>> in this series.
>>
>
> init_ah_from_wc() should include a route lookup, and this will be fixed.
>

This was already fixed in v2.

>> Jason
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the
>> body of a message to majord...@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-10 Thread Matan Barak
On Thu, Dec 10, 2015 at 5:20 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 10/12/2015 16:59, Matan Barak wrote:
>> On Mon, Dec 7, 2015 at 3:18 PM, Haggai Eran <hagg...@mellanox.com> wrote:
>>> On 12/03/2015 05:44 PM, Matan Barak wrote:
>>>> Extending core and vendor verb commands require us to check that the
>>>> unknown part of the user's given command is all zeros.
>>>> Adding ib_is_udata_cleared in order to do so.
>>>>
>>>
>>> Why not copy the data into kernel space and run memchr_inv() on it?
>>>
>>
>> Probably less efficient, isn't it?
> Why do you think it is less efficient?
>
> I'm not sure calling copy_from_user multiple times is very efficient.
> For once, you are calling access_ok multiple times. I guess it depends
> on the amount of data you are copying.
>

Isn't access_ok pretty cheap?
It calls __chk_range_not_ok which on x86 seems like a very cheap
function and __chk_user_ptr which is a compiler check.
I guess most kernel-user implementation will be pretty much in sync,
so we'll possibly call it for a few/dozens of bytes. In that case, I
think this implementation is a bit faster.

>> I know it isn't data path, but we'll execute this code in all extended
>> functions (sometimes even more than once).
> Do you think it is important enough to maintain our own copy of
> memchr_inv()?
>

True, I'm not sure it's important enough, but do you think it's that
complicated?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 07/11] IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path

2015-12-10 Thread Matan Barak
On Mon, Dec 7, 2015 at 3:42 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 12/03/2015 03:47 PM, Matan Barak wrote:
>> +static int addr_resolve_neigh(struct dst_entry *dst,
>> +   const struct sockaddr *dst_in,
>> +   struct rdma_dev_addr *addr)
>> +{
>> + if (dst->dev->flags & IFF_LOOPBACK) {
>> + int ret;
>> +
>> + ret = rdma_translate_ip(dst_in, addr, NULL);
>> + if (!ret)
>> + memcpy(addr->dst_dev_addr, addr->src_dev_addr,
>> +MAX_ADDR_LEN);
>> +
>> + return ret;
>> + }
>> +
>> + /* If the device does ARP internally */
> You mean "doesn't do ARP internally" right?
>

Correct, nice catch :)

>> + if (!(dst->dev->flags & IFF_NOARP)) {
>> + const struct sockaddr_in *dst_in4 =
>> + (const struct sockaddr_in *)dst_in;
>> + const struct sockaddr_in6 *dst_in6 =
>> + (const struct sockaddr_in6 *)dst_in;
>> +
>> + return dst_fetch_ha(dst, addr,
>> + dst_in->sa_family == AF_INET ?
>> + (const void *)&dst_in4->sin_addr.s_addr :
>> + (const void *)&dst_in6->sin6_addr);
>> + }
>> +
>> + return rdma_copy_addr(addr, dst->dev, NULL);
>> +}
>
>> +int rdma_resolve_ip_route(struct sockaddr *src_addr,
>> +   const struct sockaddr *dst_addr,
>> +   struct rdma_dev_addr *addr)
>> +{
>> + struct sockaddr_storage ssrc_addr;
>> + struct sockaddr *src_in = (struct sockaddr *)&ssrc_addr;
>> +
>> + if (src_addr->sa_family != dst_addr->sa_family)
>> + return -EINVAL;
>> +
>> + if (src_addr)
>> + memcpy(src_in, src_addr, rdma_addr_size(src_addr));
>> + else
>> + src_in->sa_family = dst_addr->sa_family;
> Don't you need to clear the rest of src_in? I believe you pass
> uninitialized memory to it.
>

Correct, I'll fix that.

>> +
>> + return addr_resolve(src_in, dst_addr, addr, false);
>> +}
>> +EXPORT_SYMBOL(rdma_resolve_ip_route);
>
> Haggai

Thanks for taking a look.

Matan

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-10 Thread Matan Barak
On Mon, Dec 7, 2015 at 3:18 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 12/03/2015 05:44 PM, Matan Barak wrote:
>> Extending core and vendor verb commands require us to check that the
>> unknown part of the user's given command is all zeros.
>> Adding ib_is_udata_cleared in order to do so.
>>
>
> Why not copy the data into kernel space and run memchr_inv() on it?
>

Probably less efficient, isn't it?
I know it isn't data path, but we'll execute this code in all extended
functions (sometimes even more than once).

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libmlx5 V1 6/6] Add always_inline check

2015-12-10 Thread Matan Barak
On Mon, Dec 7, 2015 at 3:07 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 12/03/2015 06:02 PM, Matan Barak wrote:
>> Always inline isn't supported by every compiler. Adding it to
>> configure.ac in order to support it only when possible.
>> Inline other poll_one data path functions in order to eliminate
>> "ifs".
>>
>> Signed-off-by: Matan Barak <mat...@mellanox.com>
>> ---
>>  configure.ac | 17 +
>>  src/cq.c | 42 +-
>>  src/mlx5.h   |  6 ++
>>  3 files changed, 52 insertions(+), 13 deletions(-)
>>
>> diff --git a/configure.ac b/configure.ac
>> index fca0b46..50b4f9c 100644
>> --- a/configure.ac
>> +++ b/configure.ac
>> @@ -65,6 +65,23 @@ AC_CHECK_FUNC(ibv_read_sysfs_file, [],
>>  AC_MSG_ERROR([ibv_read_sysfs_file() not found.  libmlx5 requires 
>> libibverbs >= 1.0.3.]))
>>  AC_CHECK_FUNCS(ibv_dontfork_range ibv_dofork_range ibv_register_driver)
>>
>> +AC_MSG_CHECKING("always inline")
> Did you consider using an existing script like AX_GCC_FUNC_ATTRIBUTE [1]?
>

That's probably better, I'll change.

>> +CFLAGS_BAK="$CFLAGS"
>> +CFLAGS="$CFLAGS -Werror"
>> +AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
>> + static inline int f(void)
>> + __attribute__((always_inline));
>> + static inline int f(void)
>> + {
>> + return 1;
>> + }
>> +]],[[
>> + int a = f();
>> + a = a;
>> +]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if 
>> __attribute((always_inline)).])]
> The description here doesn't look right. How about "Define if
> __attribute__((always_inline) is supported"?
>

If I use AX_GCC_FUNC_ATTRIBUTE, I don't need this anymore.

> Regards,
> Haggai
>
> [1] https://www.gnu.org/software/autoconf-archive/ax_gcc_func_attribute.html

Thanks for the review.

Regards,
Matan

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 02/11] IB/cm: Use the source GID index type

2015-12-03 Thread Matan Barak
Previously, the cm and cma modules supported only the IB and RoCE v1 GID type.
In order to support multiple GID types, the gid_type is passed to
cm_init_av_by_path and stored in the path record.

The rdma cm client would use a default GID type that will be saved in
rdma_id_private.

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/cm.c  | 25 -
 drivers/infiniband/core/cma.c |  2 ++
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index af8b907..5ea78ab 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -364,7 +364,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, 
struct cm_av *av)
read_lock_irqsave(&cm.device_lock, flags);
list_for_each_entry(cm_dev, &cm.device_list, list) {
if (!ib_find_cached_gid(cm_dev->ib_device, &path->sgid,
-   IB_GID_TYPE_IB, ndev, &p, NULL)) {
+   path->gid_type, ndev, &p, NULL)) {
port = cm_dev->port[p-1];
break;
}
@@ -1600,6 +1600,8 @@ static int cm_req_handler(struct cm_work *work)
struct ib_cm_id *cm_id;
struct cm_id_private *cm_id_priv, *listen_cm_id_priv;
struct cm_req_msg *req_msg;
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr;
int ret;
 
req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -1639,11 +1641,24 @@ static int cm_req_handler(struct cm_work *work)
cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 
memcpy(work->path[0].dmac, cm_id_priv->av.ah_attr.dmac, ETH_ALEN);
-   ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
+   ret = ib_get_cached_gid(work->port->cm_dev->ib_device,
+   work->port->port_num,
+   cm_id_priv->av.ah_attr.grh.sgid_index,
+   &gid, &gid_attr);
+   if (!ret) {
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   work->path[0].gid_type = gid_attr.gid_type;
+   ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
+   }
if (ret) {
-   ib_get_cached_gid(work->port->cm_dev->ib_device,
- work->port->port_num, 0, &work->path[0].sgid,
- NULL);
+   int err = ib_get_cached_gid(work->port->cm_dev->ib_device,
+   work->port->port_num, 0,
+   &work->path[0].sgid,
+   &gid_attr);
+   if (!err && gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   work->path[0].gid_type = gid_attr.gid_type;
ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
   &work->path[0].sgid, sizeof work->path[0].sgid,
   NULL, 0);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index c19f822..2914e08 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -228,6 +228,7 @@ struct rdma_id_private {
u8  tos;
u8  reuseaddr;
u8  afonly;
+   enum ib_gid_type        gid_type;
 };
 
 struct cma_multicast {
@@ -2325,6 +2326,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
ndev = dev_get_by_index(&init_net, addr->dev_addr.bound_dev_if);
route->path_rec->net = &init_net;
route->path_rec->ifindex = addr->dev_addr.bound_dev_if;
+   route->path_rec->gid_type = id_priv->gid_type;
}
if (!ndev) {
ret = -ENODEV;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 03/11] IB/core: Add gid attributes to sysfs

2015-12-03 Thread Matan Barak
This patch set adds attributes of net device and gid type to each GID
in the GID table. Users that use verbs directly need to specify
the GID index. Since the same GID could have different types or
associated net devices, users should have the ability to query the
associated GID attributes. Adding these attributes to sysfs.
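For example, a user that already knows a GID index can read the associated
attributes back with something like the following (a sketch; the device
name "mlx5_0", port 1 and index 0 are placeholders):

#include <stdio.h>

/* Sketch: read the ndev and GID-type attributes for one GID index,
 * following the sysfs layout documented in this patch. */
static void print_gid_attrs(void)
{
	static const char *paths[] = {
		"/sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0",
		"/sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0",
	};
	char buf[64];

	for (unsigned int i = 0; i < 2; i++) {
		FILE *f = fopen(paths[i], "r");

		if (f && fgets(buf, sizeof(buf), f))
			printf("%s: %s", paths[i], buf);
		if (f)
			fclose(f);
	}
}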

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 Documentation/ABI/testing/sysfs-class-infiniband |  16 ++
 drivers/infiniband/core/sysfs.c  | 184 ++-
 2 files changed, 198 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband

diff --git a/Documentation/ABI/testing/sysfs-class-infiniband 
b/Documentation/ABI/testing/sysfs-class-infiniband
new file mode 100644
index 000..a86abe6
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-infiniband
@@ -0,0 +1,16 @@
+What:  
/sys/class/infiniband//ports//gid_attrs/ndevs/
+Date:  November 29, 2015
+KernelVersion: 4.4.0
+Contact:   linux-rdma@vger.kernel.org
+Description:   The net-device's name associated with the GID resides
+   at index .
+
+What:  
/sys/class/infiniband//ports//gid_attrs/types/
+Date:  November 29, 2015
+KernelVersion: 4.4.0
+Contact:   linux-rdma@vger.kernel.org
+Description:   The RoCE type of the associated GID resides at index 
.
+   This could either be "IB/RoCE v1" for IB and RoCE v1 based GIDs
+   or "RoCE v2" for RoCE v2 based GIDs.
+
+
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b1f37d4..4d5d87a 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -37,12 +37,22 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
+struct ib_port;
+
+struct gid_attr_group {
+   struct ib_port  *port;
+   struct kobject  kobj;
+   struct attribute_group  ndev;
+   struct attribute_group  type;
+};
 struct ib_port {
struct kobject kobj;
struct ib_device  *ibdev;
+   struct gid_attr_group *gid_attr_group;
struct attribute_group gid_group;
struct attribute_group pkey_group;
u8 port_num;
@@ -84,6 +94,24 @@ static const struct sysfs_ops port_sysfs_ops = {
.show = port_attr_show
 };
 
+static ssize_t gid_attr_show(struct kobject *kobj,
+struct attribute *attr, char *buf)
+{
+   struct port_attribute *port_attr =
+   container_of(attr, struct port_attribute, attr);
+   struct ib_port *p = container_of(kobj, struct gid_attr_group,
+kobj)->port;
+
+   if (!port_attr->show)
+   return -EIO;
+
+   return port_attr->show(p, port_attr, buf);
+}
+
+static const struct sysfs_ops gid_attr_sysfs_ops = {
+   .show = gid_attr_show
+};
+
 static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
  char *buf)
 {
@@ -281,6 +309,46 @@ static struct attribute *port_default_attrs[] = {
NULL
 };
 
+static size_t print_ndev(struct ib_gid_attr *gid_attr, char *buf)
+{
+   if (!gid_attr->ndev)
+   return -EINVAL;
+
+   return sprintf(buf, "%s\n", gid_attr->ndev->name);
+}
+
+static size_t print_gid_type(struct ib_gid_attr *gid_attr, char *buf)
+{
+   return sprintf(buf, "%s\n", ib_cache_gid_type_str(gid_attr->gid_type));
+}
+
+static ssize_t _show_port_gid_attr(struct ib_port *p,
+  struct port_attribute *attr,
+  char *buf,
+  size_t (*print)(struct ib_gid_attr *gid_attr,
+  char *buf))
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr = {};
+   ssize_t ret;
+   va_list args;
+
+   ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid,
+  &gid_attr);
+   if (ret)
+   goto err;
+
+   ret = print(&gid_attr, buf);
+
+err:
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   va_end(args);
+   return ret;
+}
+
 static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr,
 char *buf)
 {
@@ -296,6 +364,19 @@ static ssize_t show_port_gid(struct ib_port *p, struct 
port_attribute *attr,
return sprintf(buf, "%pI6\n", gid.raw);
 }
 
+static ssize_t show_port_gid_attr_ndev(struct ib_port *p,
+  struct port_attribute *attr, char *buf)
+{
+   return _show_port_gid_attr(p, attr, buf, print_ndev);
+}
+
+static ssize_t show_port_gid_attr_gid_type(struct ib_port *p,
+ 

[PATCH for-next V2 01/11] IB/core: Add gid_type to gid attribute

2015-12-03 Thread Matan Barak
In order to support multiple GID types, we need to store the gid_type
with each GID. This is also aligned with the RoCE v2 annex "RoCEv2 PORT
GID table entries shall have a "GID type" attribute that denotes the L3
Address type". The currently supported GID type is IB_GID_TYPE_IB, which is
also the RoCE v1 GID type.

This implies that gid_type should be added to roce_gid_table meta-data.

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 drivers/infiniband/core/cache.c   | 144 --
 drivers/infiniband/core/cm.c  |   2 +-
 drivers/infiniband/core/cma.c |   3 +-
 drivers/infiniband/core/core_priv.h   |   4 +
 drivers/infiniband/core/device.c  |   9 +-
 drivers/infiniband/core/multicast.c   |   2 +-
 drivers/infiniband/core/roce_gid_mgmt.c   |  60 +++--
 drivers/infiniband/core/sa_query.c|   5 +-
 drivers/infiniband/core/uverbs_marshall.c |   1 +
 drivers/infiniband/core/verbs.c   |   1 +
 include/rdma/ib_cache.h   |   4 +
 include/rdma/ib_sa.h  |   1 +
 include/rdma/ib_verbs.h   |  11 ++-
 13 files changed, 185 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 097e9df..566fd8f 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -64,6 +64,7 @@ enum gid_attr_find_mask {
GID_ATTR_FIND_MASK_GID  = 1UL << 0,
GID_ATTR_FIND_MASK_NETDEV   = 1UL << 1,
GID_ATTR_FIND_MASK_DEFAULT  = 1UL << 2,
+   GID_ATTR_FIND_MASK_GID_TYPE = 1UL << 3,
 };
 
 enum gid_table_entry_props {
@@ -125,6 +126,19 @@ static void dispatch_gid_change_event(struct ib_device 
*ib_dev, u8 port)
}
 }
 
+static const char * const gid_type_str[] = {
+   [IB_GID_TYPE_IB]= "IB/RoCE v1",
+};
+
+const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
+{
+   if (gid_type < ARRAY_SIZE(gid_type_str) && gid_type_str[gid_type])
+   return gid_type_str[gid_type];
+
+   return "Invalid GID type";
+}
+EXPORT_SYMBOL(ib_cache_gid_type_str);
+
 /* This function expects that rwlock will be write locked in all
  * scenarios and that lock will be locked in sleep-able (RoCE)
  * scenarios.
@@ -233,6 +247,10 @@ static int find_gid(struct ib_gid_table *table, const 
union ib_gid *gid,
if (found >=0)
continue;
 
+   if (mask & GID_ATTR_FIND_MASK_GID_TYPE &&
+   attr->gid_type != val->gid_type)
+   continue;
+
if (mask & GID_ATTR_FIND_MASK_GID &&
memcmp(gid, &data->gid, sizeof(*gid)))
continue;
@@ -296,6 +314,7 @@ int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
write_lock_irq(>rwlock);
 
ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID |
+ GID_ATTR_FIND_MASK_GID_TYPE |
  GID_ATTR_FIND_MASK_NETDEV, &empty);
if (ix >= 0)
goto out_unlock;
@@ -329,6 +348,7 @@ int ib_cache_gid_del(struct ib_device *ib_dev, u8 port,
 
ix = find_gid(table, gid, attr, false,
  GID_ATTR_FIND_MASK_GID  |
+ GID_ATTR_FIND_MASK_GID_TYPE |
  GID_ATTR_FIND_MASK_NETDEV   |
  GID_ATTR_FIND_MASK_DEFAULT,
  NULL);
@@ -427,11 +447,13 @@ static int _ib_cache_gid_table_find(struct ib_device 
*ib_dev,
 
 static int ib_cache_gid_find(struct ib_device *ib_dev,
 const union ib_gid *gid,
+enum ib_gid_type gid_type,
 struct net_device *ndev, u8 *port,
 u16 *index)
 {
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr gid_attr_val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+GID_ATTR_FIND_MASK_GID_TYPE;
+   struct ib_gid_attr gid_attr_val = {.ndev = ndev, .gid_type = gid_type};
 
if (ndev)
mask |= GID_ATTR_FIND_MASK_NETDEV;
@@ -442,14 +464,16 @@ static int ib_cache_gid_find(struct ib_device *ib_dev,
 
 int ib_find_cached_gid_by_port(struct ib_device *ib_dev,
   const union ib_gid *gid,
+  enum ib_gid_type gid_type,
   u8 port, struct net_device *ndev,
   u16 *index)
 {
int local_index;
struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
struct ib_gid_table *table;
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+G

[PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-03 Thread Matan Barak
Hi Doug,

This series adds the support for RoCE v2. In order to support RoCE v2,
we add gid_type attribute to every GID. When the RoCE GID management
populates the GID table, it duplicates each GID with all supported types.
This gives the user the ability to communicate over each supported
type.

Patch 0001, 0002 and 0003 add support for multiple GID types to the
cache and related APIs. The third patch exposes the GID attributes
information is sysfs.

Patch 0004 adds the RoCE v2 GID type and the capabilities required
from the vendor in order to implement RoCE v2. These capabilities
are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP.

RoCE v2 could work at IPv4 and IPv6 networks. When receiving ib_wc, this
information should come from the vendor's driver. In case the vendor
doesn't supply this information, we parse the packet headers and resolve
its network type. Patch 0005 adds this information and required utilities.

Patches 0006 and 0007 adds route validation. This is mandatory to ensure
that we send packets using GIDS which corresponds to a net-device that
can be routed to the destination.

Patches 0008 and 0009 add configfs support (and the required
infrastructure) for CMA. The administrator should be able to set the
default RoCE type. This is done through a new per-port
default_roce_mode configfs file.

Patch 0010 formats a QP1 packet in order to support RoCE v2 CM
packets. This is required for vendors which implement their
QP1 as a Raw QP.

Patch 0011 adds support for IPv4 multicast as an IPv4 network
requires IGMP to be sent in order to join multicast groups.

Vendor code isn't part of this patch-set. Soft-RoCE will be
sent soon and depends on these patches. Other vendors, like
mlx4, ocrdma and mlx5, will follow.

This patch-set is applied on top of "Change per-entry locks in GID cache to table lock",
which was sent to the mailing list.

Thanks,
Matan

Changed from V1:
 - Rebased against Linux 4.4-rc2 master branch.
 - Add route validation
 - ConfigFS - avoid compiling INFINIBAND=y and CONFIGFS_FS=m
 - Add documentation for configfs and sysfs ABI
 - Remove ifindex and gid_type from mcmember

Changes from V0:
 - Rebased patches against Doug's latest k.o/for-4.4 tree.
 - Fixed a bug in configfs (rmdir caused an incorrect free).

Matan Barak (8):
  IB/core: Add gid_type to gid attribute
  IB/cm: Use the source GID index type
  IB/core: Add gid attributes to sysfs
  IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
  IB/core: Move rdma_is_upper_dev_rcu to header file
  IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path
  IB/rdma_cm: Add wrapper for cma reference count
  IB/cma: Add configfs for rdma_cm

Moni Shoua (2):
  IB/core: Initialize UD header structure with IP and UDP headers
  IB/cma: Join and leave multicast groups with IGMP

Somnath Kotur (1):
  IB/core: Add rdma_network_type to wc

 Documentation/ABI/testing/configfs-rdma_cm   |  22 ++
 Documentation/ABI/testing/sysfs-class-infiniband |  16 ++
 drivers/infiniband/Kconfig   |   9 +
 drivers/infiniband/core/Makefile |   2 +
 drivers/infiniband/core/addr.c   | 185 +
 drivers/infiniband/core/cache.c  | 169 
 drivers/infiniband/core/cm.c |  31 ++-
 drivers/infiniband/core/cma.c| 261 --
 drivers/infiniband/core/cma_configfs.c   | 321 +++
 drivers/infiniband/core/core_priv.h  |  45 
 drivers/infiniband/core/device.c |  10 +-
 drivers/infiniband/core/multicast.c  |  17 +-
 drivers/infiniband/core/roce_gid_mgmt.c  |  81 --
 drivers/infiniband/core/sa_query.c   |  76 +-
 drivers/infiniband/core/sysfs.c  | 184 -
 drivers/infiniband/core/ud_header.c  | 155 ++-
 drivers/infiniband/core/uverbs_marshall.c|   1 +
 drivers/infiniband/core/verbs.c  | 170 ++--
 drivers/infiniband/hw/mlx4/qp.c  |   7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c   |   2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |   2 +-
 include/rdma/ib_addr.h   |  11 +-
 include/rdma/ib_cache.h  |   4 +
 include/rdma/ib_pack.h   |  45 +++-
 include/rdma/ib_sa.h |   3 +
 include/rdma/ib_verbs.h  |  78 +-
 26 files changed, 1704 insertions(+), 203 deletions(-)
 create mode 100644 Documentation/ABI/testing/configfs-rdma_cm
 create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband
 create mode 100644 drivers/infiniband/core/cma_configfs.c

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 09/11] IB/cma: Add configfs for rdma_cm

2015-12-03 Thread Matan Barak
Users would like to control the behaviour of rdma_cm.
For example, old applications which don't set the
required RoCE gid type could be executed on RoCE V2
network types. In order to support this configuration,
we implement a configfs for rdma_cm.

In order to use the configfs, one needs to mount it and
mkdir a directory named after the IB device inside the rdma_cm directory.

The patch adds support for a single configuration file,
default_roce_mode. The mode can either be "IB/RoCE v1" or
"RoCE v2".

Signed-off-by: Matan Barak <mat...@mellanox.com>
---
 Documentation/ABI/testing/configfs-rdma_cm |  22 ++
 drivers/infiniband/Kconfig |   9 +
 drivers/infiniband/core/Makefile   |   2 +
 drivers/infiniband/core/cache.c|  24 +++
 drivers/infiniband/core/cma.c  | 108 +-
 drivers/infiniband/core/cma_configfs.c | 321 +
 drivers/infiniband/core/core_priv.h|  24 +++
 7 files changed, 503 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/ABI/testing/configfs-rdma_cm
 create mode 100644 drivers/infiniband/core/cma_configfs.c

diff --git a/Documentation/ABI/testing/configfs-rdma_cm 
b/Documentation/ABI/testing/configfs-rdma_cm
new file mode 100644
index 000..5c389aa
--- /dev/null
+++ b/Documentation/ABI/testing/configfs-rdma_cm
@@ -0,0 +1,22 @@
+What:  /config/rdma_cm
+Date:  November 29, 2015
+KernelVersion:  4.4.0
+Description:   Interface is used to configure RDMA-capable HCAs with respect to
+   RDMA-CM attributes.
+
+   Attributes are visible only when configfs is mounted. To mount
+   configfs in /config directory use:
+   # mount -t configfs none /config/
+
+   In order to set parameters related to a specific HCA, a 
directory
+   for this HCA has to be created:
+   mkdir -p /config/rdma_cm/
+
+
+What:  /config/rdma_cm//ports//default_roce_mode
+Date:  November 29, 2015
+KernelVersion:  4.4.0
+Description:   RDMA-CM based connections from HCA  at port 
+   will be initiated with this RoCE type as default.
+   The possible RoCE types are either "IB/RoCE v1" or "RoCE v2".
+   This parameter has RW access.
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index aa26f3c..f5312da 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -54,6 +54,15 @@ config INFINIBAND_ADDR_TRANS
depends on INFINIBAND
default y
 
+config INFINIBAND_ADDR_TRANS_CONFIGFS
+   bool
+   depends on INFINIBAND_ADDR_TRANS && !(INFINIBAND=y && CONFIGFS_FS=m)
+   default y
+   ---help---
+ ConfigFS support for RDMA communication manager (CM).
+ This allows the user to configure the default GID type that the CM
+ uses for each device, when initiating new connections.
+
 source "drivers/infiniband/hw/mthca/Kconfig"
 source "drivers/infiniband/hw/qib/Kconfig"
 source "drivers/infiniband/hw/cxgb3/Kconfig"
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d43a899..7922fa7 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -24,6 +24,8 @@ iw_cm-y :=iwcm.o iwpm_util.o iwpm_msg.o
 
 rdma_cm-y :=   cma.o
 
+rdma_cm-$(CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS) += cma_configfs.o
+
 rdma_ucm-y :=  ucma.o
 
 ib_addr-y :=   addr.o
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 88b4b6f..4aada52 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -140,6 +140,30 @@ const char *ib_cache_gid_type_str(enum ib_gid_type 
gid_type)
 }
 EXPORT_SYMBOL(ib_cache_gid_type_str);
 
+int ib_cache_gid_parse_type_str(const char *buf)
+{
+   unsigned int i;
+   size_t len;
+   int err = -EINVAL;
+
+   len = strlen(buf);
+   if (len == 0)
+   return -EINVAL;
+
+   if (buf[len - 1] == '\n')
+   len--;
+
+   for (i = 0; i < ARRAY_SIZE(gid_type_str); ++i)
+   if (gid_type_str[i] && !strncmp(buf, gid_type_str[i], len) &&
+   len == strlen(gid_type_str[i])) {
+   err = i;
+   break;
+   }
+
+   return err;
+}
+EXPORT_SYMBOL(ib_cache_gid_parse_type_str);
+
 /* This function expects that rwlock will be write locked in all
  * scenarios and that lock will be locked in sleep-able (RoCE)
  * scenarios.
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index f78088a..8fab267 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -152,6 +152,7 @@ struct cma_device {
struct completion   comp;
atomic_t refcount;
struct list_headid_li


  1   2   3   4   5   >