Re: [PATCH] openvswitch: perform refragmentation for packets which pass through conntrack

2021-03-23 Thread Flavio Leitner


Hi,

On Fri, Mar 19, 2021 at 04:43:07PM -0400, Aaron Conole wrote:
> When a user instructs a flow pipeline to perform connection tracking,
> there is an implicit L3 operation that occurs - namely the IP fragments
> are reassembled and then processed as a single unit.  After this, new
> fragments are generated and then transmitted, with the hint that they
> should be fragmented along the max rx unit boundary.  In general, this
> behavior works well to forward packets along when the MTUs are congruent
> across the datapath.
> 
> However, if using a protocol such as UDP on a network with mismatching
> MTUs, it is possible that the refragmentation will still produce an
> invalid fragment, and that fragmented packet will not be delivered.
> Such a case shouldn't happen because the user explicitly requested a
> layer 3+4 function (conntrack), and that function generates new fragments,
> so we should perform the needed actions in that case (namely, refragment
> IPv4 along a correct boundary, or send a packet too big in the IPv6 case).
> 
> Additionally, introduce a test suite for openvswitch with a test case
> that ensures this MTU behavior, with the expectation that new tests are
> added when needed.
> 
> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Signed-off-by: Aaron Conole 
> ---
> NOTE: checkpatch reports a whitespace error with the openvswitch.sh
>   script - this is due to using tab as the IFS value.  This part
>   of the script was copied from
>   tools/testing/selftests/net/pmtu.sh so I think should be
>   permissible.
> 
>  net/openvswitch/actions.c  |   2 +-
>  tools/testing/selftests/net/.gitignore |   1 +
>  tools/testing/selftests/net/Makefile   |   1 +
>  tools/testing/selftests/net/openvswitch.sh | 394 +
>  4 files changed, 397 insertions(+), 1 deletion(-)
>  create mode 100755 tools/testing/selftests/net/openvswitch.sh
> 
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 92a0b67b2728..d858ea580e43 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff 
> *skb, int out_port,
>   if (likely(!mru ||
>  (skb->len <= mru + vport->dev->hard_header_len))) {
>   ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
> - } else if (mru <= vport->dev->mtu) {
> + } else if (mru) {
>   struct net *net = read_pnet(&dp->net);

If the egress port has MTU less than MRU, then before the patch
the packet is dropped and after the patch ip_do_fragment() will
correctly take care of fragmenting according with egress MTU.

That seems a reasonable change to me.

This condition below makes little sense to me now that this patch
is changing the MTU assumption:
skb->len <= mru + vport->dev->hard_header_len

The MRU depends on the ingress port. Perhaps that should check if
the skb length fits into the egress port even with mru available:

  if (!mru || (packet_length(skb, vport->dev) <= vport->dev->mtu)) {
  ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
  else {
  ovs_fragment(net, vport, skb, mru, key);
  }

IIRC, a GSO packet will always fall into the first condition
otherwise we need an additional skb_is_gso(skb).

Also, that last 'else' branch is not needed anymore.

Then we have check_pkt_len which Aaron and I discussed a bit
offline. I think we should revert commit[1] at least the part
relying on mru because a packet with mru > mtu is not dropped
after the patch.

[1] 178436557 ("openvswitch: take into account de-fragmentation/gso_size
in execute_check_pkt_len")

Thanks,
fbl


>  
>   ovs_fragment(net, vport, skb, mru, key);
> diff --git a/tools/testing/selftests/net/.gitignore 
> b/tools/testing/selftests/net/.gitignore
> index 61ae899cfc17..d4d7487833be 100644
> --- a/tools/testing/selftests/net/.gitignore
> +++ b/tools/testing/selftests/net/.gitignore
> @@ -30,3 +30,4 @@ hwtstamp_config
>  rxtimestamp
>  timestamping
>  txtimestamp
> +test_mismatched_mtu_with_conntrack
> diff --git a/tools/testing/selftests/net/Makefile 
> b/tools/testing/selftests/net/Makefile
> index 25f198bec0b2..dc9b556f86fd 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -23,6 +23,7 @@ TEST_PROGS += drop_monitor_tests.sh
>  TEST_PROGS += vrf_route_leaking.sh
>  TEST_PROGS += bareudp.sh
>  TEST_PROGS += unicast_extensions.sh
> +TEST_PROGS += openvswitch.sh
>  TEST_PROGS_EXTENDED := in_netns.sh
>  TEST_GEN_FILES =  socket nettest
>  TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
> diff --git a/tools/testing/selftests/net/openvswitch.sh 
> b/tools/testing/selftests/net/openvswitch.sh
> new file mode 100755
> index ..7b6341688cc3
> --- /dev/null
> +++ b/tools/testing/selftests/net/openvswitch.sh
> @@ -0,0 +1,394 @@

[PATCH net-next] openvswitch: Warn over-mtu packets only if iface is UP.

2021-03-16 Thread Flavio Leitner
It is not unusual to have the bridge port down. Sometimes
it has the old MTU, which is fine since it's not being used.

However, the kernel spams the log with a warning message
when a packet is going to be sent over such port. Fix that
by warning only if the interface is UP.

Signed-off-by: Flavio Leitner 
---
 net/openvswitch/vport.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 4ed7e52c7012..88deb5b41429 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -497,10 +497,12 @@ void ovs_vport_send(struct vport *vport, struct sk_buff 
*skb, u8 mac_proto)
 
if (unlikely(packet_length(skb, vport->dev) > mtu &&
 !skb_is_gso(skb))) {
-   net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
-vport->dev->name,
-packet_length(skb, vport->dev), mtu);
vport->dev->stats.tx_errors++;
+   if (vport->dev->flags & IFF_UP)
+   net_warn_ratelimited("%s: dropped over-mtu packet: "
+"%d > %d\n", vport->dev->name,
+packet_length(skb, vport->dev),
+mtu);
goto drop;
}
 
-- 
2.29.2



Re: [PATCH net-next] net: openvswitch: add log message for error case

2021-01-13 Thread Flavio Leitner
On Wed, Jan 13, 2021 at 02:50:00PM +0100, Eelco Chaudron wrote:
> As requested by upstream OVS, added some error messages in the
> validate_and_copy_dec_ttl function.
> 
> Includes a small cleanup, which removes an unnecessary parameter
> from the dec_ttl_exception_handler() function.
> 
> Reported-by: Flavio Leitner 
> Signed-off-by: Eelco Chaudron 
> ---

Acked-by: Flavio Leitner 



Re: [ovs-dev] [PATCH net-next] net: openvswitch: return an error instead of doing BUG_ON()

2019-05-02 Thread Flavio Leitner
On Thu, May 02, 2019 at 04:12:38PM -0400, Eelco Chaudron wrote:
> For all other error cases in queue_userspace_packet() the error is
> returned, so it makes sense to do the same for these two error cases.
> 
> Reported-by: Davide Caratti 
> Signed-off-by: Eelco Chaudron 
> ---

LGTM
Acked-by: Flavio Leitner 



[PATCH] Revert "openvswitch: Fix template leak in error cases."

2018-09-28 Thread Flavio Leitner
This reverts commit 90c7afc96cbbd77f44094b5b651261968e97de67.

When the commit was merged, the code used nf_ct_put() to free
the entry, but later on commit 76644232e612 ("openvswitch: Free
tmpl with tmpl_free.") replaced that with nf_ct_tmpl_free which
is a more appropriate. Now the original problem is removed.

Then 44d6e2f27328 ("net: Replace NF_CT_ASSERT() with WARN_ON().")
replaced a debug assert with a WARN_ON() which is trigged now.

Signed-off-by: Flavio Leitner 
---
 net/openvswitch/conntrack.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 86a75105af1a..0aeb34c6389d 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1624,10 +1624,6 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
OVS_NLERR(log, "Failed to allocate conntrack template");
return -ENOMEM;
}
-
-   __set_bit(IPS_CONFIRMED_BIT, &ct_info.ct->status);
-   nf_conntrack_get(&ct_info.ct->ct_general);
-
if (helper) {
err = ovs_ct_add_helper(&ct_info, helper, key, log);
if (err)
@@ -1639,6 +1635,8 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
if (err)
goto err_free_ct;
 
+   __set_bit(IPS_CONFIRMED_BIT, &ct_info.ct->status);
+   nf_conntrack_get(&ct_info.ct->ct_general);
return 0;
 err_free_ct:
__ovs_ct_free_action(&ct_info);
-- 
2.14.4



Re: [PATCH] netfilter: check if the socket netns is correct.

2018-09-27 Thread Flavio Leitner
On Thu, Sep 27, 2018 at 01:46:29PM -0700, Guenter Roeck wrote:
> Hi Flavio,
> 
> On Wed, Jun 27, 2018 at 10:34:25AM -0300, Flavio Leitner wrote:
> > Netfilter assumes that if the socket is present in the skb, then
> > it can be used because that reference is cleaned up while the skb
> > is crossing netns.
> > 
> > We want to change that to preserve the socket reference in a future
> > patch, so this is a preparation updating netfilter to check if the
> > socket netns matches before use it.
> > 
> > Signed-off-by: Flavio Leitner 
> > Acked-by: Florian Westphal 
> > Signed-off-by: David S. Miller 
> > ---
> ...
> > --- a/net/netfilter/xt_socket.c
> > +++ b/net/netfilter/xt_socket.c
> > @@ -56,8 +56,12 @@ socket_match(const struct sk_buff *skb, struct 
> > xt_action_param *par,
> > struct sk_buff *pskb = (struct sk_buff *)skb;
> > struct sock *sk = skb->sk;
> >  
> > +   if (!net_eq(xt_net(par), sock_net(sk)))
> > +   sk = NULL;
> > +
> 
> I am having trouble with this code. With CONFIG_NET_NS enabled, it crashes
> for me in read_pnet() because sk is NULL.
> 
> > if (!sk)
> > sk = nf_sk_lookup_slow_v4(xt_net(par), skb, xt_in(par));
> 
> The old code seems to suggest that sk == NULL was possible.
> 
> I see the problem with the Chrome OS kernel rebased to v4.19-rc5, so I
> can not guarantee that this really an upstream problem. The change seems
> odd, though. Are you sure that it is not (or, rather, no longer) necessary
> to check if sk == NULL before dereferencing it in sock_net() ?

Oops, it is necessary but if it's not and the netns doesn't match, we need
do the lookup. So, could you check if this fixes the problem for you?

>From a5f927e7f1368d753f87cb978d630d786d5adb62 Mon Sep 17 00:00:00 2001
From: Flavio Leitner 
Date: Thu, 27 Sep 2018 19:36:28 -0300
Subject: [PATCH] xt_socket: check sk before checking for netns.

Only check for the network namespace if the socket is available.

Fixes: f564650106a6 ("netfilter: check if the socket netns is correct.")
Reported-by: Guenter Roeck 
Signed-off-by: Flavio Leitner 
---
 net/netfilter/xt_socket.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 0472f3472842..ada144e5645b 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -56,7 +56,7 @@ socket_match(const struct sk_buff *skb, struct 
xt_action_param *par,
struct sk_buff *pskb = (struct sk_buff *)skb;
struct sock *sk = skb->sk;
 
-   if (!net_eq(xt_net(par), sock_net(sk)))
+   if (sk && !net_eq(xt_net(par), sock_net(sk)))
sk = NULL;
 
if (!sk)
@@ -117,7 +117,7 @@ socket_mt6_v1_v2_v3(const struct sk_buff *skb, struct 
xt_action_param *par)
struct sk_buff *pskb = (struct sk_buff *)skb;
struct sock *sk = skb->sk;
 
-   if (!net_eq(xt_net(par), sock_net(sk)))
+   if (sk && !net_eq(xt_net(par), sock_net(sk)))
sk = NULL;
 
if (!sk)
-- 
2.14.4



Re: Poor TCP performance with XPS enabled after scrubbing skb

2018-05-24 Thread Flavio Leitner
On Tue, May 15, 2018 at 02:08:09PM -0700, Eric Dumazet wrote:
> 
> 
> On 05/15/2018 12:31 PM, Flavio Leitner wrote:
> > Hi,
> > 
> > There is a significant throughput issue (~50% drop) for a single TCP
> > stream when the skb is scrubbed and XPS is enabled.
> > 
> > If I turn CONFIG_XPS off, then the issue never happens and the test
> > reaches line rate.  The same happens if I echo 0 to tx-*/xps_cpus.
> > 
> > It looks like that when the skb is scrubbed, there is no more reference
> > to the struct sock, 
> 
> And this is really the problem here, since it breaks back pressure (and TCP 
> Small queues)
> 
> I am not sure why skb_orphan() is used in this scrubbing really.
> 

veth originally called skb_orphan() on veth_xmit() most probably
because there was no TX completion. Then the code got generalized to
dev_forward_skb() and later on moved to skb_scrub_packet().

The issue is that we call skb_scrub_packet() on TX and RX paths and
that is done while crossing netns.  It doesn't look correct to keep
the ->sk because I suspect that iptables/selinux/bpf, or some code
path that I am probably missing could expose/use the wrong ->sk, for
example.

However, netdev_pick_tx() can't store the queue mapping without ->sk.

The hack in the first email relies on the headers (skb_tx_hash) to
always selected the same TX queue, which solves the original problem
but not the TCP small queues you mentioned.

-- 
Flavio



Poor TCP performance with XPS enabled after scrubbing skb

2018-05-15 Thread Flavio Leitner
Hi,

There is a significant throughput issue (~50% drop) for a single TCP
stream when the skb is scrubbed and XPS is enabled.

If I turn CONFIG_XPS off, then the issue never happens and the test
reaches line rate.  The same happens if I echo 0 to tx-*/xps_cpus.

It looks like that when the skb is scrubbed, there is no more reference
to the struct sock, which forces XPS to use a TX queue mapped to the
running CPU. However, since there is no mapping between RX queue and
TX queue, the returning traffic usually ends up in another CPU. This
other CPU process the skb and if the stack needs to send something,
then we have two TX queues being used in parallel for the same stream
and TCP seems to not like that (Out-Of-Order, dup ACKS, retransmissions..)

The test environment is quite simple. The iperf/iperf3 -s can be
just a NIC with IP address.  The peer running iperf/iperf3 -c needs
to use veth (scrub the packet), so create a pair, attach one end
to a linux bridge with the NIC and add the IP address to the other
end:
  Bridge
NIC ---/  \--- veth0  veth1 [ IP address ]

Paolo and I discussed the issue and we came up with a patch[1] that
supports the explanation above. It may not be the best way to fix the
problem though, so for now consider it just as an experiment :-)

Kernel net-next updated with today's:
commit f3002c1374fb2367c9d8dbb28852791ef90d2bac
Date:   Mon May 14 08:14:49 2018 -0400


Default config (CONFIG_XPS on)
# iperf -c 192.168.1.2 -t 30

Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)

[  3] local 192.168.1.1 port 40332 connected with 192.168.1.2 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-30.0 sec  16.8 GBytes  4.80 Gbits/sec


# ./xps_disable.sh; iperf -c 192.168.1.2 -t 30

Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)

[  3] local 192.168.1.1 port 40334 connected with 192.168.1.2 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-30.0 sec  32.2 GBytes  9.21 Gbits/sec


[root@dell-r430-23 ~]# ./xps_restore.sh; iperf -c 192.168.1.2 -t 30

Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)

[  3] local 192.168.1.1 port 40336 connected with 192.168.1.2 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-30.0 sec  16.0 GBytes  4.59 Gbits/sec


Experimental patch applied and XPS functioning:

# iperf -c 192.168.1.2 -t 30

Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)

[  3] local 192.168.1.1 port 34202 connected with 192.168.1.2 port
5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-30.0 sec  32.2 GBytes  9.21 Gbits/sec


Sometimes the return traffic ends up in the same CPU running iperf -c.
When that happens, the same TX queue is used and I see line rate.

The issue always happen with MLX and be2net NICs, but so far I am
unable to reproduce with i40e, though I could see two TX queues being
used in parallel as in other cases.

[1]
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 71c72a9..482d046 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -31,9 +31,10 @@
 
 /* 0 - Reserved to indicate value not set
  * 1..NR_CPUS - Reserved for sender_cpu
- *  NR_CPUS+1..~0 - Region available for NAPI IDs
+ *  NR_CPUS+1 - Scrubbed packet, do not use XPS
+ *  NR_CPUS+2..~0 - Region available for NAPI IDs
  */
-#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
+#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 2))
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 
diff --git a/net/core/dev.c b/net/core/dev.c
index af0558b..5567d4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3398,6 +3398,9 @@ static inline int get_xps_queue(struct net_device *dev, 
struct sk_buff *skb)
struct xps_map *map;
int queue_index = -1;
 
+   if (skb->sender_cpu ==  (u32)(NR_CPUS + 1))
+   return -1;
+
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps);
if (dev_maps) {
@@ -3459,7 +3462,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device 
*dev,
 #ifdef CONFIG_XPS
u32 sender_cpu = skb->sender_cpu - 1;
 
-   if (sender_cpu >= (u32)NR_CPUS)
+   if (sender_cpu >= (u32)NR_CPUS + 1)
skb->sender_cpu = raw_smp_processor_id() + 1;
 #endif
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b518..99040a0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4898,6 +4898,7 @@ void skb_scrub_packet(str

Re: [PATCH] rtnetlink: fix missing size for IFLA_IF_NETNSID

2017-11-15 Thread Flavio Leitner
On Mon, 6 Nov 2017 16:16:20 +0100
Jiri Benc  wrote:

> On Mon,  6 Nov 2017 15:04:54 +, Colin King wrote:
> > The size for IFLA_IF_NETNSID is missing from the size calculation
> > because the proceeding semicolon was not removed. Fix this by removing
> > the semicolon.  
> 
> Acked-by: Jiri Benc 
> 
> Thanks for spotting this! Looking at my initial code, I had that right,
> this was probably introduced during one of rebases, so I blame
> Flavio :-p (On a serious note, thank you, Flavio, for taking care of
> the rebases.)

That's right, ouch!
Thanks for fixing it.

-- 
Flavio



[PATCH net-next 2/3] openvswitch: reliable interface indentification in port dumps

2017-11-02 Thread Flavio Leitner
From: Jiri Benc 

This patch allows reliable identification of netdevice interfaces connected
to openvswitch bridges. In particular, user space queries the netdev
interfaces belonging to the ports for statistics, up/down state, etc.
Datapath dump needs to provide enough information for the user space to be
able to do that.

Currently, only interface names are returned. This is not sufficient, as
openvswitch allows its ports to be in different name spaces and the
interface name is valid only in its name space. What is needed and generally
used in other netlink APIs, is the pair ifindex+netnsid.

The solution is addition of the ifindex+netnsid pair (or only ifindex if in
the same name space) to vport get/dump operation.

On request side, ideally the ifindex+netnsid pair could be used to
get/set/del the corresponding vport. This is not implemented by this patch
and can be added later if needed.

Signed-off-by: Jiri Benc 
---
 include/uapi/linux/openvswitch.h |  2 ++
 net/openvswitch/datapath.c   | 47 +---
 net/openvswitch/datapath.h   |  4 ++--
 net/openvswitch/dp_notify.c  |  4 ++--
 4 files changed, 40 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 0cd6f8833147..c9b638c82941 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -257,6 +257,8 @@ enum ovs_vport_attr {
/* receiving upcalls */
OVS_VPORT_ATTR_STATS,   /* struct ovs_vport_stats */
OVS_VPORT_ATTR_PAD,
+   OVS_VPORT_ATTR_IFINDEX,
+   OVS_VPORT_ATTR_NETNSID,
__OVS_VPORT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index c3aec6227c91..4d38ac044cee 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1848,7 +1848,8 @@ static struct genl_family dp_datapath_genl_family 
__ro_after_init = {
 
 /* Called with ovs_mutex or RCU read lock. */
 static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb,
-  u32 portid, u32 seq, u32 flags, u8 cmd)
+  struct net *net, u32 portid, u32 seq,
+  u32 flags, u8 cmd)
 {
struct ovs_header *ovs_header;
struct ovs_vport_stats vport_stats;
@@ -1864,9 +1865,17 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, 
struct sk_buff *skb,
if (nla_put_u32(skb, OVS_VPORT_ATTR_PORT_NO, vport->port_no) ||
nla_put_u32(skb, OVS_VPORT_ATTR_TYPE, vport->ops->type) ||
nla_put_string(skb, OVS_VPORT_ATTR_NAME,
-  ovs_vport_name(vport)))
+  ovs_vport_name(vport)) ||
+   nla_put_u32(skb, OVS_VPORT_ATTR_IFINDEX, vport->dev->ifindex))
goto nla_put_failure;
 
+   if (!net_eq(net, dev_net(vport->dev))) {
+   int id = peernet2id_alloc(net, dev_net(vport->dev));
+
+   if (nla_put_s32(skb, OVS_VPORT_ATTR_NETNSID, id))
+   goto nla_put_failure;
+   }
+
ovs_vport_get_stats(vport, &vport_stats);
if (nla_put_64bit(skb, OVS_VPORT_ATTR_STATS,
  sizeof(struct ovs_vport_stats), &vport_stats,
@@ -1896,8 +1905,8 @@ static struct sk_buff *ovs_vport_cmd_alloc_info(void)
 }
 
 /* Called with ovs_mutex, only via ovs_dp_notify_wq(). */
-struct sk_buff *ovs_vport_cmd_build_info(struct vport *vport, u32 portid,
-u32 seq, u8 cmd)
+struct sk_buff *ovs_vport_cmd_build_info(struct vport *vport, struct net *net,
+u32 portid, u32 seq, u8 cmd)
 {
struct sk_buff *skb;
int retval;
@@ -1906,7 +1915,7 @@ struct sk_buff *ovs_vport_cmd_build_info(struct vport 
*vport, u32 portid,
if (!skb)
return ERR_PTR(-ENOMEM);
 
-   retval = ovs_vport_cmd_fill_info(vport, skb, portid, seq, 0, cmd);
+   retval = ovs_vport_cmd_fill_info(vport, skb, net, portid, seq, 0, cmd);
BUG_ON(retval < 0);
 
return skb;
@@ -1920,6 +1929,8 @@ static struct vport *lookup_vport(struct net *net,
struct datapath *dp;
struct vport *vport;
 
+   if (a[OVS_VPORT_ATTR_IFINDEX])
+   return ERR_PTR(-EOPNOTSUPP);
if (a[OVS_VPORT_ATTR_NAME]) {
vport = ovs_vport_locate(net, nla_data(a[OVS_VPORT_ATTR_NAME]));
if (!vport)
@@ -1944,6 +1955,7 @@ static struct vport *lookup_vport(struct net *net,
return vport;
} else
return ERR_PTR(-EINVAL);
+
 }
 
 /* Called with ovs_mutex */
@@ -1983,6 +1995,8 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
if (!a[OVS_VPORT_ATTR_NAME] || !a[OVS_VPORT_ATTR_TYPE] ||
!a[OVS_VPORT_ATTR_UPCALL_PID])
return -EINVAL;
+   if (a[OVS_VPORT_ATTR_IFINDEX])
+   return 

[PATCH net-next 3/3] rtnetlink: use netnsid to query interface

2017-11-02 Thread Flavio Leitner
From: Jiri Benc 

Currently, when an application gets netnsid from the kernel (for example as
the result of RTM_GETLINK call on one end of the veth pair), it's not much
useful. There's no reliable way to get to the netns fd from the netnsid, nor
does any kernel API accept netnsid.

Extend the RTM_GETLINK call to also accept netnsid. It will operate on the
netns with the given netnsid in such case. Of course, the calling process
needs to have enough capabilities in the target name space; for now, require
CAP_NET_ADMIN. This can be relaxed in the future.

To signal to the calling process that the kernel understood the new
IFLA_IF_NETNSID attribute in the query, it will include it in the response.
This is needed to detect older kernels, as they will just ignore
IFLA_IF_NETNSID and query in the current name space.

This patch implemetns IFLA_IF_NETNSID only for get and dump. For set
operations, this can be extended later.

Signed-off-by: Jiri Benc 
---
 include/uapi/linux/if_link.h |   1 +
 net/core/rtnetlink.c | 103 +++
 2 files changed, 86 insertions(+), 18 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b037e0ab1975..ba705219df40 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -159,6 +159,7 @@ enum {
IFLA_XDP,
IFLA_EVENT,
IFLA_NEW_NETNSID,
+   IFLA_IF_NETNSID,
__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index de24d394c69e..8a8c51937edf 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -921,7 +921,8 @@ static noinline size_t if_nlmsg_size(const struct 
net_device *dev,
   + nla_total_size(4)  /* IFLA_EVENT */
   + nla_total_size(4)  /* IFLA_NEW_NETNSID */
   + nla_total_size(1); /* IFLA_PROTO_DOWN */
-
+  + nla_total_size(4)  /* IFLA_IF_NETNSID */
+  + 0;
 }
 
 static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
@@ -1370,13 +1371,14 @@ static noinline_for_stack int nla_put_ifalias(struct 
sk_buff *skb,
 }
 
 static int rtnl_fill_link_netnsid(struct sk_buff *skb,
- const struct net_device *dev)
+ const struct net_device *dev,
+ struct net *src_net)
 {
if (dev->rtnl_link_ops && dev->rtnl_link_ops->get_link_net) {
struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
 
if (!net_eq(dev_net(dev), link_net)) {
-   int id = peernet2id_alloc(dev_net(dev), link_net);
+   int id = peernet2id_alloc(src_net, link_net);
 
if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
return -EMSGSIZE;
@@ -1427,10 +1429,11 @@ static int rtnl_fill_link_af(struct sk_buff *skb,
return 0;
 }
 
-static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
+static int rtnl_fill_ifinfo(struct sk_buff *skb,
+   struct net_device *dev, struct net *src_net,
int type, u32 pid, u32 seq, u32 change,
unsigned int flags, u32 ext_filter_mask,
-   u32 event, int *new_nsid)
+   u32 event, int *new_nsid, int tgt_netnsid)
 {
struct ifinfomsg *ifm;
struct nlmsghdr *nlh;
@@ -1448,6 +1451,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct 
net_device *dev,
ifm->ifi_flags = dev_get_flags(dev);
ifm->ifi_change = change;
 
+   if (tgt_netnsid >= 0 && nla_put_s32(skb, IFLA_IF_NETNSID, tgt_netnsid))
+   goto nla_put_failure;
+
if (nla_put_string(skb, IFLA_IFNAME, dev->name) ||
nla_put_u32(skb, IFLA_TXQLEN, dev->tx_queue_len) ||
nla_put_u8(skb, IFLA_OPERSTATE,
@@ -1513,7 +1519,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct 
net_device *dev,
goto nla_put_failure;
}
 
-   if (rtnl_fill_link_netnsid(skb, dev))
+   if (rtnl_fill_link_netnsid(skb, dev, src_net))
goto nla_put_failure;
 
if (new_nsid &&
@@ -1571,6 +1577,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_XDP]  = { .type = NLA_NESTED },
[IFLA_EVENT]= { .type = NLA_U32 },
[IFLA_GROUP]= { .type = NLA_U32 },
+   [IFLA_IF_NETNSID]   = { .type = NLA_S32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1674,9 +1681,28 @@ static bool link_dump_filtered(struct net_device *dev,
return false;
 }
 
+static struct net *get_target_net(struct sk_buff *skb, int netnsid)
+{
+   struct net *net;
+
+   net = get_net_ns_by_id(sock_net(skb->sk), netnsid);
+   if (!net)
+   return ERR_PTR(-EINVAL);
+
+   /* For now, the caller is required to have CA

[PATCH net-next 1/3] net: export peernet2id_alloc

2017-11-02 Thread Flavio Leitner
From: Jiri Benc 

It will be used by openvswitch.

Signed-off-by: Jiri Benc 
---
 net/core/net_namespace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6cfdc7c84c48..b797832565d3 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -234,6 +234,7 @@ int peernet2id_alloc(struct net *net, struct net *peer)
rtnl_net_notifyid(net, RTM_NEWNSID, id);
return id;
 }
+EXPORT_SYMBOL_GPL(peernet2id_alloc);
 
 /* This function returns, if assigned, the id of a peer netns. */
 int peernet2id(struct net *net, struct net *peer)
-- 
2.13.6



[PATCH net-next 0/3] Allow openvswitch to query ports in another netns.

2017-11-02 Thread Flavio Leitner
Today Open vSwitch users are moving internal ports to other namespaces and
although packets are flowing OK, the userspace daemon can't find out basic
information like if the port is UP or DOWN, for instance.

This patchset extends openvswitch API to retrieve the current netnsid of
a port. It will be used by the userspace daemon to find out in which netns
the port is located.

This patchset also extends the rtnetlink getlink call to accept and operate
on a given netnsid.  More details are available in each patch.

Jiri Benc (3):
  net: export peernet2id_alloc
  openvswitch: reliable interface indentification in port dumps
  rtnetlink: use netnsid to query interface

 include/uapi/linux/if_link.h |   1 +
 include/uapi/linux/openvswitch.h |   2 +
 net/core/net_namespace.c |   1 +
 net/core/rtnetlink.c | 103 ---
 net/openvswitch/datapath.c   |  47 +-
 net/openvswitch/datapath.h   |   4 +-
 net/openvswitch/dp_notify.c  |   4 +-
 7 files changed, 127 insertions(+), 35 deletions(-)

-- 
2.13.6



Re: [PATCH net] netlink: don't send unknown nsid

2017-06-09 Thread Flavio Leitner
On Thu, Jun 08, 2017 at 10:31:53AM +0200, Nicolas Dichtel wrote:
> Le 07/06/2017 à 21:14, Flavio Leitner a écrit :
> > Let's say the app is restarted, or another monitoring app is executed
> > with enough perms.  How will it identify the error condition?
> Your app wants to monitor a subset of netns. It means that you already have a
> way to identify those netns, something like a file stored somewhere
> (/var/run/netns/, /proc//ns/net, ...). Thus, it's easy to check if those
> netns have a nsid assigned in the netns where your app will open the socket.
> 
> This option was called NETLINK_F_LISTEN_ALL_NSID, because it only enables to
> listen netns *with* a nsid assigned, nothing more. It's up to the user to 
> ensure
> that nsid are correctly assigned.

Makes sense, thanks.
-- 
Flavio



Re: [PATCH net] netlink: don't send unknown nsid

2017-06-07 Thread Flavio Leitner
On Mon, Jun 05, 2017 at 10:40:24AM +0200, Nicolas Dichtel wrote:
> > Let me ask this instead: How do you think userspace should behave when
> > netnsid allocation fails?
> > 
> There is two ways to assign a nsid:
>  - manually with netlink ('ip netns set'). In this case, the error is reported
>to userspace via netlink.

OK.

>  - automatically when a x-netns interface is created. The link-nsid is also
>reported to userspace. If the allocation failed, NETNSA_NSID_NOT_ASSIGNED 
> is
>reported. And if you were able to create this x-netns interface, it means
>that you have access to this peer netns, thus you can try to assign the 
> nsid
>manually.

Does that prevent the interface to be created?

> So, in both cases, userland knows that something went wrong.
> Do you have another scenario in mind?

Let's say the app is restarted, or another monitoring app is executed
with enough perms.  How will it identify the error condition?

-- 
Flavio



Re: [PATCH net] netlink: don't send unknown nsid

2017-06-01 Thread Flavio Leitner
On Thu, Jun 01, 2017 at 10:42:13PM +0200, Nicolas Dichtel wrote:
> Le 01/06/2017 à 19:02, Flavio Leitner a écrit :
> > On Thu, Jun 01, 2017 at 10:00:07AM +0200, Nicolas Dichtel wrote:
> >> The NETLINK_F_LISTEN_ALL_NSID otion enables to listen all netns that have a
> >> nsid assigned into the netns where the netlink socket is opened.
> >> The nsid is sent as metadata to userland, but the existence of this nsid is
> >> checked only for netns that are different from the socket netns. Thus, if
> >> no nsid is assigned to the socket netns, NETNSA_NSID_NOT_ASSIGNED is
> >> reported to the userland. This value is confusing and useless.
> >> After this patch, only valid nsid are sent to userland.
> >>
> >> Reported-by: Flavio Leitner 
> >> Signed-off-by: Nicolas Dichtel 
> >> ---
> >>  net/netlink/af_netlink.c | 4 +++-
> >>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> >> index ee841f00a6ec..7586d446d7dc 100644
> >> --- a/net/netlink/af_netlink.c
> >> +++ b/net/netlink/af_netlink.c
> >> @@ -62,6 +62,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  
> >>  #include 
> >>  #include 
> >> @@ -1415,7 +1416,8 @@ static void do_one_broadcast(struct sock *sk,
> >>goto out;
> >>}
> >>NETLINK_CB(p->skb2).nsid = peernet2id(sock_net(sk), p->net);
> >> -  NETLINK_CB(p->skb2).nsid_is_set = true;
> >> +  if (NETLINK_CB(p->skb2).nsid != NETNSA_NSID_NOT_ASSIGNED)
> >> +  NETLINK_CB(p->skb2).nsid_is_set = true;
> >>val = netlink_broadcast_deliver(sk, p->skb2);
> >>if (val < 0) {
> >>netlink_overrun(sk);
> > 
> > If the assumption is that nsid allocation can never fail or that if it
> > does, we can't report to userspace, then the patch is good, but it
> > doesn't sound like a good long term solution.
> > 
> > Let's consider that the allocation of an id fails for whatever reason.
> > I think that should be reported to userspace to allow it to retry, or
> > do something else to handle this situation properly.  Not sending
> > anything means that it's in the same netns as the old kernels did,
> > which is incorrect.
> This is correct, because if nsid allocation fails, no netlink messages from 
> this
> netns are sent to userspace (the check is done at the beginning of
> do_one_broadcast). The only netns allowed to send netlink messages to 
> userspace
> without nsid is the netns of the socket.

I say it's incorrect because of the explanation below.

> > On the other hand, with the original patch, if the socket and the
> > device are in the same netns, we don't need to report any ID.  Previous
> > kernels did that, so we are not breaking anything.  When the netns
> > differs, then we either should report the real ID or an error.
> > 
> I don't understand. With or without my last patch, the kernel sends netlink
> messages of other netns than the netns where the socket is opened, only if an
> nsid is assigned.

"only if an nsid is assigned" that's the issue.

Let me ask this instead: How do you think userspace should behave when
netnsid allocation fails?

-- 
Flavio



Re: [PATCH net] netlink: don't send unknown nsid

2017-06-01 Thread Flavio Leitner
On Thu, Jun 01, 2017 at 10:00:07AM +0200, Nicolas Dichtel wrote:
> The NETLINK_F_LISTEN_ALL_NSID otion enables to listen all netns that have a
> nsid assigned into the netns where the netlink socket is opened.
> The nsid is sent as metadata to userland, but the existence of this nsid is
> checked only for netns that are different from the socket netns. Thus, if
> no nsid is assigned to the socket netns, NETNSA_NSID_NOT_ASSIGNED is
> reported to the userland. This value is confusing and useless.
> After this patch, only valid nsid are sent to userland.
> 
> Reported-by: Flavio Leitner 
> Signed-off-by: Nicolas Dichtel 
> ---
>  net/netlink/af_netlink.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index ee841f00a6ec..7586d446d7dc 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -62,6 +62,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -1415,7 +1416,8 @@ static void do_one_broadcast(struct sock *sk,
>   goto out;
>   }
>   NETLINK_CB(p->skb2).nsid = peernet2id(sock_net(sk), p->net);
> - NETLINK_CB(p->skb2).nsid_is_set = true;
> + if (NETLINK_CB(p->skb2).nsid != NETNSA_NSID_NOT_ASSIGNED)
> + NETLINK_CB(p->skb2).nsid_is_set = true;
>   val = netlink_broadcast_deliver(sk, p->skb2);
>   if (val < 0) {
>   netlink_overrun(sk);

If the assumption is that nsid allocation can never fail or that if it
does, we can't report to userspace, then the patch is good, but it
doesn't sound like a good long term solution.

Let's consider that the allocation of an id fails for whatever reason.
I think that should be reported to userspace to allow it to retry, or
do something else to handle this situation properly.  Not sending
anything means that it's in the same netns as the old kernels did,
which is incorrect.

On the other hand, with the original patch, if the socket and the
device are in the same netns, we don't need to report any ID.  Previous
kernels did that, so we are not breaking anything.  When the netns
differs, then we either should report the real ID or an error.

-- 
Flavio



Re: [PATCH net-next] netlink: include netnsid only when netns differs.

2017-05-31 Thread Flavio Leitner
On Wed, May 31, 2017 at 03:48:06PM +0200, Nicolas Dichtel wrote:
> Le 31/05/2017 à 14:28, Flavio Leitner a écrit :
> > On Wed, May 31, 2017 at 10:38:21AM +0200, Nicolas Dichtel wrote:
> >> Le 30/05/2017 à 23:33, Flavio Leitner a écrit :
> >>> Don't include netns id for notifications broadcasts when the
> >>> socket and the skb are in the same netns because it will be
> >>> an error which can't be distinguished from a peer netns failing
> >>> to allocate an id.
> >> I don't understand the problem. peernet2id() doesn't allocate ids, it only 
> >> do a
> >> lookup. If you need an id for the current netns, you have to allocate one.
> > 
> > The issue is that if you query an interface on the same netns, the
> > error is returned, then we cannot tell if the iface is on the same
> > netns or if there was an error while allocating the ID and the
> > iface is on another netns.
> If the returned id is NETNSA_NSID_NOT_ASSIGNED, then the netns is the same.
> 
> Some lines before your patch, we call peernet_has_id() when the netns differ,
> thus we ensure that the id is available.

Right, but that's internal to the kernel.

> The principle was that netlink messages of other netns can be sent only if an 
> id
> is assigned.

OK, could you please update include/uapi/linux/net_namespace.h to reflect that?
It says NETNSA_NSID_NOT_ASSIGNED are attributes for RTM_NEWNSID or RTM_GETNSID
which makes sense, but NOT_ASSIGNED sounds little like SAME_NSID for other
message types.

-- 
Flavio



Re: [PATCH net-next] netlink: include netnsid only when netns differs.

2017-05-31 Thread Flavio Leitner
On Wed, May 31, 2017 at 10:38:21AM +0200, Nicolas Dichtel wrote:
> Le 30/05/2017 à 23:33, Flavio Leitner a écrit :
> > Don't include netns id for notifications broadcasts when the
> > socket and the skb are in the same netns because it will be
> > an error which can't be distinguished from a peer netns failing
> > to allocate an id.
> I don't understand the problem. peernet2id() doesn't allocate ids, it only do 
> a
> lookup. If you need an id for the current netns, you have to allocate one.

The issue is that if you query an interface on the same netns, the
error is returned, then we cannot tell if the iface is on the same
netns or if there was an error while allocating the ID and the
iface is on another netns.

> This patch changes the metadata exported to the userland and will break 
> existing
> tools.

It should not break because it changes only for interfaces on
the same netns where there is no ID and that value wasn't
exported until recently.

-- 
Flavio



[PATCH net-next] netlink: include netnsid only when netns differs.

2017-05-30 Thread Flavio Leitner
Don't include netns id for notifications broadcasts when the
socket and the skb are in the same netns because it will be
an error which can't be distinguished from a peer netns failing
to allocate an id.

Signed-off-by: Flavio Leitner 
---
 net/netlink/af_netlink.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ee841f0..b9f1392 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1414,8 +1414,10 @@ static void do_one_broadcast(struct sock *sk,
p->skb2 = NULL;
goto out;
}
-   NETLINK_CB(p->skb2).nsid = peernet2id(sock_net(sk), p->net);
-   NETLINK_CB(p->skb2).nsid_is_set = true;
+   if (!net_eq(sock_net(sk), p->net)) {
+   NETLINK_CB(p->skb2).nsid = peernet2id(sock_net(sk), p->net);
+   NETLINK_CB(p->skb2).nsid_is_set = true;
+   }
val = netlink_broadcast_deliver(sk, p->skb2);
if (val < 0) {
netlink_overrun(sk);
-- 
2.9.4




Re: Cannot set ageing to zero

2016-01-26 Thread Flavio Leitner
On Tue, 26 Jan 2016 18:30:41 +0100
Jiri Pirko  wrote:

> Tue, Jan 26, 2016 at 06:26:30PM CET, f...@sysclose.org wrote:
> >
> >Hi,
> >
> >After the commit[1] below, we can't set ageing on a Linux bridge
> >device to zero.  It seems rocker needs the minimum value, but we
> >can't break an old and valid Linux bridge behavior.   
> 
> The commit below adds check if the value being set is within
> BR_MIN_AGEING_TIME and BR_MAX_AGEING_TIME. I believe that the check is
> correct as it implements the standard.
>
> Why do you set ageing_time to 0? Why don't just just disable learning?

It's a documented behavior:
http://www.linuxcertif.com/man/5/ifcfg-bridge/
http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
http://comments.gmane.org/gmane.linux.network.bridge/2060

fbl

> 
> 
> >
> >[1] commit c62987bbd8a1a1664f99e89e3959339350a6131e
> >Author: Scott Feldman 
> >Date:   Thu Oct 8 19:23:19 2015 -0700
> >
> >bridge: push bridge setting ageing_time down to switchdev
> >
> >Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that
> >don't support setting ageing_time (or setting bridge attrs in
> >general). 
> >If push fails, don't update ageing_time in bridge and return err
> > to user. 
> >If push succeeds, update ageing_time in bridge and run gc_timer
> > now to recalabrate when to run gc_timer next, based on new
> > ageing_time. 
> >Signed-off-by: Scott Feldman 
> >Signed-off-by: Jiri Pirko 
> >Acked-by: Jiri Pirko 
> >Signed-off-by: David S. Miller 
> >
> >
> >-- 
> >fbl
> >  



-- 
fbl



Cannot set ageing to zero

2016-01-26 Thread Flavio Leitner

Hi,

After the commit[1] below, we can't set ageing on a Linux bridge device
to zero.  It seems rocker needs the minimum value, but we can't break
an old and valid Linux bridge behavior. 

[1] commit c62987bbd8a1a1664f99e89e3959339350a6131e
Author: Scott Feldman 
Date:   Thu Oct 8 19:23:19 2015 -0700

bridge: push bridge setting ageing_time down to switchdev

Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that
don't support setting ageing_time (or setting bridge attrs in
general). 
If push fails, don't update ageing_time in bridge and return err to
user. 
If push succeeds, update ageing_time in bridge and run gc_timer now
to recalabrate when to run gc_timer next, based on new ageing_time.

Signed-off-by: Scott Feldman 
Signed-off-by: Jiri Pirko 
Acked-by: Jiri Pirko 
Signed-off-by: David S. Miller 


-- 
fbl



Re: [PATCH net] openvswitch: fix hangup on vxlan/gre/geneve device deletion

2015-12-02 Thread Flavio Leitner
On Tue, Dec 01, 2015 at 06:33:36PM +0100, Paolo Abeni wrote:
> Each openvswitch tunnel vport (vxlan,gre,geneve) holds a reference
> to the underlying tunnel device, but never released it when such
> device is deleted.
> Deleting the underlying device via the ip tool cause the kernel to
> hangup in the netdev_wait_allrefs() loop.
> This commit ensure that on device unregistration dp_detach_port_notify()
> is called for all vports that hold the device reference, properly
> releasing it.
> 
> Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device")
> Fixes: b2acd1dc3949 ("openvswitch: Use regular GRE net_device instead of 
> vport")
> Fixes: 6b001e682e90 ("openvswitch: Use Geneve device.")
> Signed-off-by: Paolo Abeni 
> ---

LGTM and fixes the issue in my env.

Acked-by: Flavio Leitner 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] netfilter: remove dead code

2015-10-01 Thread Flavio Leitner
On Tue, Sep 29, 2015 at 09:54:35PM -0700, David Miller wrote:
> From: Florian Westphal 
> Date: Wed, 30 Sep 2015 02:45:07 +0200
> 
> > Flavio Leitner  wrote:
> >> Remove __nf_conntrack_find() from headers.
> >> Fixes: dcd93ed4cd1 ("netfilter: nf_conntrack: remove dead code"
> > 
> > For the record: netfilter patches should go to
> > netfilter-de...@vger.kernel.org .
> > 
> > That being said, in this case I doubt Pablo minds if David takes this
> > directly, patch ts obviously correct[tm] :)

That's what I thought.

> I don't want to create any unnecessary merge hassles, so please
> resubmit this properly to netfilter-devel, thanks.

Ok, done.
http://marc.info/?l=netfilter-devel&m=144361945411969&w=2
Thanks,
fbl

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next] netfilter: remove dead code

2015-09-29 Thread Flavio Leitner
Remove __nf_conntrack_find() from headers.
Fixes: dcd93ed4cd1 ("netfilter: nf_conntrack: remove dead code"

Signed-off-by: Flavio Leitner 
---
 include/net/netfilter/nf_conntrack.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index d642f68..fde4068 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -183,10 +183,6 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int 
nulls);
 
 void nf_ct_free_hashtable(void *hash, unsigned int size);
 
-struct nf_conntrack_tuple_hash *
-__nf_conntrack_find(struct net *net, u16 zone,
-   const struct nf_conntrack_tuple *tuple);
-
 int nf_conntrack_hash_check_insert(struct nf_conn *ct);
 bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report);
 
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] virtio-net: add default_mtu configuration field

2015-08-20 Thread Flavio Leitner
On Sun, Aug 16, 2015 at 04:42:25PM +0300, Victor Kaplansky wrote:
> Sometimes it is essential for libvirt to be able to configure MTU
> on guest's NICs to a value different from 1500.
> 
> The change adds a new field to configuration area of network
> devices. It will be used to pass initial MTU from the device to
> the driver, and to pass modified MTU from driver to the device
> when a new MTU is assigned by the guest OS.
> 
> In addition, in order to support backward and forward
> compatibility, we introduce a new feature bit called
> VIRTIO_NET_F_DEFAULT_MTU.
> 
> Added conformance statements for a device and a driver.
> 
> Signed-off-by: Victor Kaplansky 
> 
> Signed-off-by: Victor Kaplansky 
> ---
>  content.tex | 25 +
>  1 file changed, 25 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 342183b..439d005 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -3078,6 +3078,12 @@ features.
>  
>  \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control
>  channel.
> +
> +\item[VIRTIO_NET_F_DEFAULT_MTU(24)] Default MTU is supported.  If
> +offered by the device, device advises driver about initial MTU to
> +be used. If negotiated, the driver uses \field{default_mtu} as
> +an initial value and reports MTU changes to the device.
> +
>  \end{description}
>  
>  \subsubsection{Feature bit requirements}\label{sec:Device Types / Network 
> Device / Feature bits / Feature bit requirements}
> @@ -3128,6 +3134,7 @@ struct virtio_net_config {
>  u8 mac[6];
>  le16 status;
>  le16 max_virtqueue_pairs;
> +le16 default_mtu;
>  };
>  \end{lstlisting}
>  
> @@ -3158,6 +3165,15 @@ by the driver after negotiation.
>  \field{max_virtqueue_pairs} is valid only if VIRTIO_NET_F_MQ is
>  set and can be read by the driver.
>  
> +\item [\field{default_mtu}] is a hint to the driver set by the
> +device. It is valid during feature negotiation only if
> +VIRTIO_NET_F_DEFAULT_MTU is offered and holds the initial value
> +of MTU to be used by the driver. If VIRTIO_NET_F_DEFAULT_MTU is
> +negotiated, the driver uses the \field{default_mtu} as an initial
> +value, and also reports MTU changes to the device by writes to
> +\field{default_mtu}.  Such reporting can be used for debugging,

As already said, it's better to change to 'mtu' since changes can
be reported back by writing to the field.

fbl

> +or it can be used for tunning MTU along the network.
> +
>  \end{description}
>  
>  \devicenormative{\subsubsection}{Device configuration layout}{Device Types / 
> Network Device / Device configuration layout}
> @@ -3165,6 +3181,9 @@ by the driver after negotiation.
>  The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 
> inclusive,
>  if it offers VIRTIO_NET_F_MQ.
>  
> +The device MUST set \field{default_mtu} to between 68 and 65535
> +inclusive, if it offers VIRTIO_NET_F_DEFAULT_MTU.
> +
>  \drivernormative{\subsubsection}{Device configuration layout}{Device Types / 
> Network Device / Device configuration layout}
>  
>  A driver SHOULD negotiate VIRTIO_NET_F_MAC if the device offers it.
> @@ -3177,6 +3196,12 @@ If the driver does not negotiate the 
> VIRTIO_NET_F_STATUS feature, it SHOULD
>  assume the link is active, otherwise it SHOULD read the link status from
>  the bottom bit of \field{status}.
>  
> +A driver SHOULD negotiate VIRTIO_NET_F_DEFAULT_MTU if the device
> +offers it.  If the driver negotiates the VIRTIO_NET_F_DEFAULT_MTU
> +feature, the driver MUST use \field{default_mtu} as an initial value
> +for MTU and the driver MUST report the value of MTU to
> +\field{default_mtu} when MTU is modified by the guest.
> +
>  \subsubsection{Legacy Interface: Device configuration 
> layout}\label{sec:Device Types / Network Device / Device configuration layout 
> / Legacy Interface: Device configuration layout}
>  \label{sec:Device Types / Block Device / Feature bits / Device configuration 
> layout / Legacy Interface: Device configuration layout}
>  When using the legacy interface, transitional devices and drivers
> -- 
> --Victor
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ovs-dev] [net-next RFC 00/14] Convert OVS tunnel vports to use regular net_devices

2015-06-02 Thread Flavio Leitner

It seems patch 01 didn't make it to ovs dev mailing list,
but it is available on netdev mailing list.
fbl

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] openvswitch: disable LRO

2015-05-28 Thread Flavio Leitner
On Thu, May 28, 2015 at 03:04:53PM +0200, Jiri Benc wrote:
> Currently, openvswitch tries to disable LRO from the user space. This does
> not work correctly when the device added is a vlan interface, though.
> Instead of dealing with possibly complex stacked cross name space relations
> in the user space, do the same as bridging does and call dev_disable_lro in
> the kernel.
> 
> Signed-off-by: Jiri Benc 
> ---
LGTM
Acked-by: Flavio Leitner 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ovs-dev] [PATCH net] openvswitch: disable LRO unless stated otherwise

2015-05-27 Thread Flavio Leitner
On Wed, May 27, 2015 at 10:50:21AM +0200, Jiri Benc wrote:
> On Tue, 26 May 2015 15:03:56 -0700, Pravin Shelar wrote:
> > OVS interface for generic networking device operation looks odd. have
> > you considered adding new device ioctl to do this?
> 
> New ioctls for networking configuration are generally not allowed. The
> preferred way to configure networking is netlink. And as this is very
> much ovs specific (all other users of dev_disable_lro such as bridging
> want to do this unconditionally), ovs netlink is the correct place to
> put this to.

Exactly. Team, Bonding, bridge, and when you enable forwarding on a
networking device gets LRO automatically disabled in the kernel.

I suggest to always disable LRO in the kernel as the other examples
do until there is a real need in OVS to benefit from LRO to implement
such API change.

fbl

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 2/2] tcp: improve REUSEADDR/NOREUSEADDR cohabitation

2015-05-21 Thread Flavio Leitner
On Wed, May 20, 2015 at 10:59:02AM -0700, Eric Dumazet wrote:
> inet_csk_get_port() randomization effort tends to spread
> sockets on all the available range (ip_local_port_range)
> 
> This is unfortunate because SO_REUSEADDR sockets have
> less requirements than non SO_REUSEADDR ones.
> 
> If an application uses SO_REUSEADDR hint, it is to try to
> allow source ports being shared.
> 
> So instead of picking a random port number in ip_local_port_range,
> lets try first in first half of the range.
> 
> This gives more chances to use upper half of the range for the
> sockets with strong requirements (not using SO_REUSEADDR)
> 
> Note this patch does not add a new sysctl, and only changes
> the way we try to pick port number.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Marcelo Ricardo Leitner 
> Cc: Flavio Leitner 
> ---

The only downside I can see is that after the patch the applications
using the SO_REUSEADDR will reuse ports more often and that could
potentially trigger some bug.

Looks like a good change to me.

Acked-by: Flavio Leitner 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 1/2] inet_hashinfo: remove bsocket counter

2015-05-21 Thread Flavio Leitner
On Wed, May 20, 2015 at 10:59:01AM -0700, Eric Dumazet wrote:
> We no longer need bsocket atomic counter, as inet_csk_get_port()
> calls bind_conflict() regardless of its value, after commit
> 2b05ad33e1e624e ("tcp: bind() fix autoselection to share ports")
> 
> This patch removes overhead of maintaining this counter and
> double inet_csk_get_port() calls under pressure.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Marcelo Ricardo Leitner 
> Cc: Flavio Leitner 
> ---

The patch looks good to me.
Acked-by: Flavio Leitner 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [NET]: fix multicast list when cloning sockets

2007-07-31 Thread Flavio Leitner
On Tue, Jul 31, 2007 at 12:00:41AM -0300, Arnaldo Carvalho de Melo wrote:
> On 7/30/07, David Miller <[EMAIL PROTECTED]> wrote:
> > Allowing non-datagram sockets to end up with a non-NULL inet->mc_list
> > in the first place is a bug.
> >
> > Multicast subscriptions cannot even be used with TCP and DCCP, which
> > are the only two users of these connection oriented socket functions.
> >
> > The first thing that TCP and DCCP do, in fact, for input packet
> > processing is drop the packet if it is not unicast.
> >
> > Therefore the fix really is for the inet layer to reject multicast
> > subscription requests on sockets for which that absolutely does not
> > make sense.  There is no reason these functions in
> > inet_connection_sock.c should need to be mindful of multicast
> > state. :-)
> 
> Well, we can add a BUG_ON there then 8)
> 
> Flavio, take a look at  do_ip_setsockopt in net/ipv4/ip_sockglue.c, in
> the IP_{ADD,DROP}_MEMBERSHIP labels.
> 
> Don't forget IPV6 (net/ipv6/ipv6_sockglue.c)

yes, right. What about the one below?

[NET]: Fix IP_ADD/DROP_MEMBERSHIP to handle only connectionless

Fix IP[V6]_ADD_MEMBERSHIP and IP[V6]_DROP_MEMBERSHIP to
return -EPROTO for connection oriented sockets.

Signed-off-by: Flavio Leitner <[EMAIL PROTECTED]>

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 4d54457..6b420ae 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -625,6 +625,10 @@ static int do_ip_setsockopt(struct sock *sk, int level,
{
struct ip_mreqn mreq;
 
+   err = -EPROTO;
+   if (inet_sk(sk)->is_icsk)
+   break;
+
if (optlen < sizeof(struct ip_mreq))
goto e_inval;
err = -EFAULT;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index d684639..350e584 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -554,6 +554,10 @@ done:
{
struct ipv6_mreq mreq;
 
+   retv = -EPROTO;
+   if (inet_sk(sk)->is_icsk)
+   break;
+
retv = -EFAULT;
if (copy_from_user(&mreq, optval, sizeof(struct ipv6_mreq)))
break;
-- 
1.5.2.4

-- 
Flavio
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [NET]: fix multicast list when cloning sockets

2007-07-30 Thread Flavio Leitner

The sock_copy() function uses memcpy() to clone the socket
including the struct ip_mc_socklist *mc_list pointer.

The ip_mc_drop_socket() function is called when socket is closed
to free these objects leaving the other sockets cloned from the
same master socket with invalid pointers.

This patch sets mc_list of cloned socket to NULL.

Signed-off-by: Flavio Leitner <[EMAIL PROTECTED]>

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fbe7714..8ee0f54 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -506,6 +506,8 @@ struct sock *inet_csk_clone(struct sock *sk, const struct 
request_sock *req,
newicsk->icsk_backoff = 0;
newicsk->icsk_probes_out  = 0;
 
+   inet_sk(inet)->mc_list = NULL;
+
/* Deinitialize accept_queue to trap illegal accesses. */
memset(&newicsk->icsk_accept_queue, 0, 
sizeof(newicsk->icsk_accept_queue));
 
-- 
1.5.2.4

-- 
Flavio
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html