date:20150928

Re: [PATCH net-next 0/4] switchdev: push bridge attributes down

2015-09-28 Thread David Miller

From: sfel...@gmail.com
Date: Thu, 24 Sep 2015 13:59:26 -0700

> From: Scott Feldman 
> 
> Push bridge-level attributes down to switchdev drivers.  This patchset
> adds the infrastructure and then pushes, as an example, ageing_time attribute
> down from bridge to switchdev (rocker) driver.  Add some range-checking
> for ageing_time.
> 
> # ip link set dev br0 type bridge ageing_time 1000
> 
> # ip link set dev br0 type bridge ageing_time 999
> RTNETLINK answers: Numerical result out of range
> 
> Up until now, switchdev attrs were port-level attrs, so the netdev used in
> switchdev_attr_set() would be a switch port or bond of switch ports.  With
> bridge-level attrs, the netdev passed to switchdev_attr_set() is the bridge
> netdev.  The same recursive algo is used to visit the leaves of the stacked
> drivers to set the attr, it's just in this case we start one layer higher in
> the stack.  One note is not all ports in the bridge may support setting a
> bridge-level attribute, so rather than failing the entire set, we'll skip over
> those ports returning -EOPNOTSUPP.

This doesn't apply cleanly to net-next, please respin.

Thanks.
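
For readers following the thread, the range check and the "skip ports that return -EOPNOTSUPP" behaviour described in the cover letter can be sketched in plain userspace C roughly as follows. This is only a model, not the patchset's code: struct port, port_attr_set() and bridge_set_ageing_time() are hypothetical names, and the lower bound of 1000 is taken solely from the ip link example quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical lower bound, taken from the "ip link" example above
 * (1000 accepted, 999 rejected); the real limits live in the bridge code. */
#define MIN_AGEING_TIME 1000UL

/* Hypothetical stand-in for a switch port stacked under the bridge. */
struct port {
	/* NULL models a driver that does not implement the attribute. */
	int (*set_ageing_time)(struct port *p, unsigned long t);
	unsigned long ageing_time;
};

/* Example driver callback: simply remember the value. */
static int port_store_ageing_time(struct port *p, unsigned long t)
{
	p->ageing_time = t;
	return 0;
}

static int port_attr_set(struct port *p, unsigned long t)
{
	if (!p->set_ageing_time)
		return -EOPNOTSUPP;
	return p->set_ageing_time(p, t);
}

/* Range-check once at the bridge level, then visit every port.
 * Ports answering -EOPNOTSUPP are skipped; any other error aborts. */
static int bridge_set_ageing_time(struct port *ports, size_t n, unsigned long t)
{
	size_t i;

	assert(ports != NULL || n == 0);

	if (t < MIN_AGEING_TIME)
		return -ERANGE;	/* "Numerical result out of range" */

	for (i = 0; i < n; i++) {
		int err = port_attr_set(&ports[i], t);

		if (err && err != -EOPNOTSUPP)
			return err;
	}
	return 0;
}
```

With one port that implements the callback and one that does not, setting 1000 succeeds and only the first port is updated, while 999 is rejected before any port is touched.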
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net v2] skbuff: Fix skb checksum flag on skb pull

2015-09-28 Thread David Miller

From: Pravin Shelar 
Date: Fri, 25 Sep 2015 19:48:57 -0700

> On Thu, Sep 24, 2015 at 2:09 PM, David Miller  wrote:
>> From: Pravin B Shelar 
>> Date: Tue, 22 Sep 2015 12:57:53 -0700
>>
>>> VXLAN device can receive skb with checksum partial. But the checksum
>>> offset could be in outer header which is pulled on receive. This results
>>> in negative checksum offset for the skb. Such skb can cause the assert
>>> failure in skb_checksum_help(). Following patch fixes the bug by setting
>>> checksum-none while pulling outer header.
>>>
>>> Following is the kernel panic msg from old kernel hitting the bug.
>>  ...
>>> Reported-by: Anupam Chanda 
>>> Signed-off-by: Pravin B Shelar 
>>
>> Applied, thanks.
> 
> Thanks for applying it. Since I have seen this bug on older kernel can
> you also queue it for -stable.

Not until we resolve all of the regressions caused by it.


Re: [PATCH] net: fec: Remove unneeded FEATURES_NEED_QUIESCE definition

2015-09-28 Thread David Miller

From: Fabio Estevam 
Date: Fri, 25 Sep 2015 18:31:32 -0300

> From: Fabio Estevam 
> 
> There is no need to have FEATURES_NEED_QUIESCE defined as we
> can simply use NETIF_F_RXCSUM instead as done in other parts
> of the driver.
> 
> Signed-off-by: Fabio Estevam 

Applied to net-next, thanks.

Re: [PATCH net-next v4] net: Fix Hisilicon Network Subsystem Support Compilation

2015-09-28 Thread David Miller

From: huangdaode 
Date: Sun, 27 Sep 2015 15:22:44 +0800

> This patch fixes the compilation error with arm allmodconfig. The error was
> caused by the unavailability of readq() on 32-bit platforms and was
> found during net-next daily compilation. At the same time, fix all the
> hns driver compilation warnings.
> 
> Signed-off-by: huangdaode 
> Signed-off-by: zhaungyuzeng 
> Signed-off-by: kenneth Lee 
> Signed-off-by: yankejian 

Applied, thanks.

Re: [PATCH net-next] cxgb4: Add HW timesptamp support for RX

2015-09-28 Thread David Miller

From: Hariprasad Shenai 
Date: Mon, 28 Sep 2015 10:26:53 +0530

> Adds support for ethtool get time stamp ioctl, which is used by
> tcpdump to get the supported time stamp types
> 
> eg: tcpdump -i eth5 -J
> Time stamp types for eth5 (use option -j to set):
>   host (Host)
>   adapter_unsynced (Adapter, not synced with system time)
> 
> Adds support for adapter unsynced mode, by adding SIOCSHWTSTAMP support
> in driver. 
> 
> Signed-off-by: Hariprasad Shenai 

Applied, thanks.

Re: [PATCH] net/mlx4: Handle return codes in mlx4_qp_attach_common

2015-09-28 Thread David Miller

From: Robb Manes 
Date: Fri, 25 Sep 2015 10:39:21 -0400

> @@ -1184,10 +1184,11 @@ out:
>  if (prot == MLX4_PROT_ETH) {
>  /* manage the steering entry for promisc mode */
>  if (new_entry)
> -new_steering_entry(dev, port, steer, index, qp->qpn);
> +err = new_steering_entry(dev, port, steer,
> + index, qp->qpn);
>  else
> -existing_steering_entry(dev, port, steer,
> -index, qp->qpn);
> +err = existing_steering_entry(dev, port, steer,
> +  index, qp->qpn);

Please indent this properly.

When a function call spans multiple lines, the second and
subsequent lines must start precisely at the first column
after the opening parenthesis of the first line.  You must
use the appropriate number of TAB then SPACE characters
necessary to achieve this.

Thanks.
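
The alignment rule stated above can be turned into a small calculation. The sketch below is a userspace illustration only, assuming the 8-column tabs of the kernel coding style; column_after_paren() and continuation_indent() are hypothetical helper names, not part of any kernel tool.

```c
#include <assert.h>
#include <stddef.h>

#define TAB_WIDTH 8	/* kernel coding style uses 8-column tabs */

/* Visual column (0-based) just past the first '(' on the line,
 * expanding each tab to the next multiple of TAB_WIDTH.
 * Returns -1 if the line has no opening parenthesis. */
static int column_after_paren(const char *line)
{
	int col = 0;

	assert(line != NULL);
	for (; *line; line++) {
		if (*line == '\t')
			col += TAB_WIDTH - (col % TAB_WIDTH);
		else
			col++;
		if (*line == '(')
			return col;
	}
	return -1;
}

/* Split a target column into leading TABs followed by SPACEs. */
static void continuation_indent(int col, int *tabs, int *spaces)
{
	*tabs = col / TAB_WIDTH;
	*spaces = col % TAB_WIDTH;
}
```

For a first line of three TABs followed by "err = new_steering_entry(", the arguments start at column 49, so each continuation line would begin with six TABs and one SPACE.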

Re: [PATCH net-next] lan78xx: Return 0 when lan78xx_suspend() has no error.

2015-09-28 Thread David Miller

From: 
Date: Fri, 25 Sep 2015 21:13:48 +

> lan78xx_suspend() may return non-zero from lan78xx_write_reg() in some
> scenarios.
> Fix to return 0 when lan78xx_suspend() has no error.
> 
> Signed-off-by: Woojung Huh 

Applied, thanks.

Re: [PATCH net-next] net: Remove redundant oif checks in rt6_device_match

2015-09-28 Thread David Miller

From: David Ahern 
Date: Fri, 25 Sep 2015 15:22:54 -0600

> The oif has already been checked that it is non-zero; the 2 additional
> checks on oif within that if (oif) {...} block are redundant.
> 
> CC: YOSHIFUJI Hideaki 
> Signed-off-by: David Ahern 

Looks good, applied, thanks.

Re: [PATCH v3 3/3] net: irda: pxaficp_ir: dmaengine conversion

2015-09-28 Thread David Miller

From: Robert Jarzmik 
Date: Sat, 26 Sep 2015 20:49:20 +0200

> Convert pxaficp_ir to dmaengine. As pxa architecture is shifting from
> raw DMA registers access to pxa_dma dmaengine driver, convert this
> driver to dmaengine.
> 
> Signed-off-by: Robert Jarzmik 
> Tested-by: Petr Cvek 
> ---
> Since v1: removed mach/dma.h include, which is the goal

Applied to net-next.

Re: [PATCH v3 2/3] net: irda: pxaficp_ir: convert to readl and writel

2015-09-28 Thread David Miller

From: Robert Jarzmik 
Date: Sat, 26 Sep 2015 20:49:19 +0200

> Convert the pxa IRDA driver to readl and writel primitives, and remove
> another set of direct registers access. This leaves only the DMA
> registers access, which will be dealt with dmaengine conversion.
> 
> Signed-off-by: Robert Jarzmik 
> Tested-by: Petr Cvek 
> ---
> Since v1: modified __REG macro to cope with STIER, ST* registers

Applied to net-next.

Re: [PATCH v3 1/3] net: irda: pxaficp_ir: use sched_clock() for time management

2015-09-28 Thread David Miller

From: Robert Jarzmik 
Date: Sat, 26 Sep 2015 20:49:18 +0200

> Instead of using directly the OS timer through direct register access,
> use the standard sched_clock(), which will end up in OSCR reading
> anyway.
> 
> This is a first step for direct access register removal and machine
> specific code removal from this driver.
> 
> This commit changes the behavior, as previously the minimum turnaround
> time was counted in 76ns steps, while with this patch it is counted in
> microsecond steps. The strictly equal formula would have been :
>   while ((sched_clock() - si->last_clk) * 76 < mtt)
> 
> Signed-off-by: Robert Jarzmik 
> ---
> Since v2: fixed clock calculation as pointed out by David

Applied to net-next.

Re: [net-next PATCH] net: help compiler generate better code in eth_get_headlen

2015-09-28 Thread David Miller

From: Jesper Dangaard Brouer 
Date: Mon, 28 Sep 2015 12:47:14 +0200

> Noticed that the compiler (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC))
> generated suboptimal assembler code in eth_get_headlen().
> 
> This early return coding style is usually not an issue on superscalar CPUs,
> but the compiler chose to put the return statement after this very unlikely
> branch, thus creating a larger jump down to the likely code path.
> 
> Performance wise, I could measure slightly less L1-icache-load-misses
> and less branch-misses, and an improvement of 1 nanosec with an IP-forwarding
> use-case with 257 bytes packets with ixgbe (CPU i7-4790K @ 4.00GHz).
> 
> Signed-off-by: Jesper Dangaard Brouer 

Applied, thanks Jesper.

Re: [PATCH net-next] tcp: avoid reorders for TFO passive connections

2015-09-28 Thread David Miller

From: Eric Dumazet 
Date: Thu, 24 Sep 2015 17:16:05 -0700

> From: Eric Dumazet 
> 
> We found that a TCP Fast Open passive connection was vulnerable
> to reorders, as the exchange might look like
> 
> [1] C -> S S  
> [2] S -> C S. ack request 
> [3] S -> C . 
> 
> packets [2] and [3] can be generated at almost the same time.
> 
> If C receives the 3rd packet before the 2nd, it will drop it as
> the socket is in SYN_SENT state and expects a SYNACK.
> 
> S will have to retransmit the answer.
> 
> Current OOO avoidance in linux is defeated because SYNACK
> packets are attached to the LISTEN socket, while DATA packets
> are attached to the children. They might be sent by different cpus,
> and different TX queues might be selected.
> 
> It turns out that for TFO, we created a child, which is a
> full blown socket in TCP_SYN_RECV state, and we simply can attach
> the SYNACK packet to this socket.
> 
> This means that at the time tcp_sendmsg() pushes DATA packet,
> skb->ooo_okay will be set iff the SYNACK packet had been sent
> and TX completed.
> 
> This removes the reorder source at the host level.
> 
> We also removed the export of tcp_try_fastopen(), as it is no
> longer called from IPv6.
> 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Yuchung Cheng 

Applied, thanks.

Re: [PATCH] net ipv4 ipconfig: use preferred log methods

2015-09-28 Thread David Miller

From: Bastian Stender 
Date: Fri, 25 Sep 2015 11:58:30 +0200

> @@ -69,7 +69,7 @@
>  #undef IPCONFIG_DEBUG
>  
>  #ifdef IPCONFIG_DEBUG
> -#define DBG(x) printk x
> +#define DBG(x) pr_debug(x)
>  #else
>  #define DBG(x) do { } while(0)
>  #endif

I agree with Stephen, just get rid of this and use pr_debug()
unconditionally.

Re: [PATCH] l2tp: protect tunnel->del_work by ref_count

2015-09-28 Thread David Miller

From: Alexander Couzens 
Date: Mon, 28 Sep 2015 11:32:42 +0200

> There is a small chance that tunnel_free() is called before tunnel->del_work
> is scheduled, resulting in a zero pointer dereference.
> 
> Signed-off-by: Alexander Couzens 

Applied and queued up for -stable, thanks.

Re: [PATCH net-next 0/8] Mellanox mlx5 driver update

2015-09-28 Thread David Miller

From: Or Gerlitz 
Date: Fri, 25 Sep 2015 10:49:08 +0300

> Bunch of changes from the team, while warming engines for the 
> upcoming SRIOV support.

Series applied, thanks.

Re: [PATCH v2 17/19] tools: bpf_jit_disasm: make get_last_jit_image return unsigned

2015-09-28 Thread David Miller

From: Andrzej Hajda 
Date: Fri, 25 Sep 2015 08:45:43 +0200

> The function returns always non-negative values.
> 
> The problem has been detected using proposed semantic patch
> scripts/coccinelle/tests/assign_signed_to_unsigned.cocci [1].
> 
> [1]: http://permalink.gmane.org/gmane.linux.kernel/2046107
> 
> Signed-off-by: Andrzej Hajda 
> ---
> v2: fixed indentation

Applied, thanks.

Re: [PATCH v2] net: sctp: Don't use 64 kilobyte lookup table for four elements

2015-09-28 Thread David Miller

From: Denys Vlasenko 
Date: Mon, 28 Sep 2015 14:34:04 +0200

> Seemingly innocuous sctp_trans_state_to_prio_map[] array
> is way bigger than it looks, since
> "[SCTP_UNKNOWN] = 2" expands into "[0x] = 2" !
> 
> This patch replaces it with switch() statement.
> 
> Signed-off-by: Denys Vlasenko 

Applied, thank you.
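
The effect described in the commit message is easy to reproduce outside the kernel. The following is a hedged sketch, not the SCTP code: the enum and function names are hypothetical, the value 0xffff for the "unknown" state is an assumption (it is the value that makes such a table 64 kilobytes), and only the mapping of the unknown state to 2 comes from the quoted text; the other priorities are illustrative.

```c
#include <assert.h>

/* Hypothetical state values; only the large UNKNOWN value matters here.
 * 0xffff is assumed, since that is what yields a 64 KiB table. */
enum trans_state {
	STATE_INACTIVE = 0,
	STATE_PF       = 1,
	STATE_ACTIVE   = 2,
	STATE_UNKNOWN  = 0xffff,
};

/* Looks like a four-entry table, but the designated initializer for
 * index 0xffff forces the array to have 0x10000 one-byte elements. */
static const unsigned char state_to_prio_map[] = {
	[STATE_ACTIVE]   = 3,	/* illustrative */
	[STATE_UNKNOWN]  = 2,	/* from the commit message */
	[STATE_PF]       = 1,	/* illustrative */
	[STATE_INACTIVE] = 0,	/* illustrative */
};

/* Same mapping expressed as a switch: no table at all. */
static unsigned char state_to_prio(enum trans_state state)
{
	switch (state) {
	case STATE_ACTIVE:
		return 3;
	case STATE_UNKNOWN:
		return 2;
	case STATE_PF:
		return 1;
	default:	/* STATE_INACTIVE */
		return 0;
	}
}

/* Sanity check: both forms agree for every defined state. */
static void check_same_mapping(void)
{
	assert(state_to_prio(STATE_ACTIVE) == state_to_prio_map[STATE_ACTIVE]);
	assert(state_to_prio(STATE_UNKNOWN) == state_to_prio_map[STATE_UNKNOWN]);
	assert(state_to_prio(STATE_PF) == state_to_prio_map[STATE_PF]);
	assert(state_to_prio(STATE_INACTIVE) == state_to_prio_map[STATE_INACTIVE]);
}
```

Here sizeof(state_to_prio_map) evaluates to 0x10000 bytes even though only four entries are written out, which is the size the switch() version avoids.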

Re: [net-next v2 0/6][pull request] Intel Wired LAN Driver Updates 2015-09-28

2015-09-28 Thread David Miller

From: Jeff Kirsher 
Date: Mon, 28 Sep 2015 17:55:28 -0700

> This series contains updates to i40e, i40evf and igb to resolve issues
> seen and reported by Red Hat.

Pulled, thanks Jeff.

Re: [net] i40e/i40evf: check for stopped admin queue

2015-09-28 Thread David Miller

From: Jeff Kirsher 
Date: Mon, 28 Sep 2015 17:31:26 -0700

> From: Mitch Williams 
> 
> It's possible that while we are waiting for the spinlock, another
> entity (that owns the spinlock) has shut down the admin queue.
> If we then attempt to use the queue, we will panic.
> 
> Add a check for this condition on the receive side. This matches
> an existing check on the send queue side.
> 
> Signed-off-by: Mitch Williams 
> Acked-by: Jesse Brandeburg 
> Signed-off-by: Jeff Kirsher 

Applied.
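
The pattern the commit message describes, re-checking the queue state after the lock has been acquired, can be sketched in userspace C as below. This is an assumption-laden model, not the i40e code: struct recv_queue, the rq_*() helpers, the C11 atomic_flag spinlock and the error codes are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a receive queue guarded by a spinlock. */
struct recv_queue {
	atomic_flag lock;
	unsigned int count;	/* 0 means the queue has been shut down */
	unsigned int pending;	/* events waiting to be consumed */
};

static void rq_lock(struct recv_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the current owner releases the lock */
}

static void rq_unlock(struct recv_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Shut the queue down; may run while another caller waits for the lock. */
static void rq_shutdown(struct recv_queue *q)
{
	rq_lock(q);
	q->count = 0;
	q->pending = 0;
	rq_unlock(q);
}

/* Consume one event. The state is re-checked after the lock is taken,
 * because the previous lock holder may have shut the queue down. */
static int rq_clean_one(struct recv_queue *q)
{
	int ret = 0;

	assert(q != NULL);
	rq_lock(q);
	if (q->count == 0)
		ret = -EIO;	/* queue stopped while we waited */
	else if (q->pending == 0)
		ret = -ENOENT;	/* nothing to consume */
	else
		q->pending--;
	rq_unlock(q);
	return ret;
}
```

The point of the sketch is only the ordering: the "is the queue still alive" test sits inside the locked region, so a shutdown that completed while the caller was spinning is noticed before the queue is used.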

Re: [net] i40e: fix VLAN inside VXLAN

2015-09-28 Thread David Miller

From: Jeff Kirsher 
Date: Mon, 28 Sep 2015 11:21:48 -0700

> From: Jesse Brandeburg 
> 
> Prior to this patch, the hardware was removing
> VLAN tags from the inner header of VXLAN packets.  The
> hardware configuration can be changed to leave the
> packet alone since that is what the linux stack
> expects for this type of VLAN in VXLAN packet.
> 
> Signed-off-by: Jesse Brandeburg 
> Tested-by: Andrew Bowers 
> Signed-off-by: Jeff Kirsher 

Applied.

Re: [PATCH net 0/2] sctp: Fix SCTP deadlock

2015-09-28 Thread David Miller

From: Karl Heiss 
Date: Thu, 24 Sep 2015 12:15:05 -0400

> These patches fix a deadlock during accept() of an SCTP connection.
> 
> The first patch fixes whitespace issues.
> 
> The second patch actually fixes the deadlock race.

Seems reasonable, series applied, thanks.

Re: [PATCH net-next 0/4] switchdev: push bridge attributes down

2015-09-28 Thread Florian Fainelli

2015-09-24 21:26 GMT-07:00 Scott Feldman :
> On Thu, Sep 24, 2015 at 6:23 PM, Florian Fainelli wrote:
>> On 24/09/15 13:59, sfel...@gmail.com wrote:
>>> From: Scott Feldman 
>>>
>>> Push bridge-level attributes down to switchdev drivers.  This patchset
>>> adds the infrastructure and then pushes, as an example, ageing_time attribute
>>> down from bridge to switchdev (rocker) driver.  Add some range-checking
>>> for ageing_time.
>>>
>>> # ip link set dev br0 type bridge ageing_time 1000
>>>
>>> # ip link set dev br0 type bridge ageing_time 999
>>> RTNETLINK answers: Numerical result out of range
>>>
>>> Up until now, switchdev attrs were port-level attrs, so the netdev used in
>>> switchdev_attr_set() would be a switch port or bond of switch ports.  With
>>> bridge-level attrs, the netdev passed to switchdev_attr_set() is the bridge
>>> netdev.  The same recursive algo is used to visit the leaves of the stacked
>>> drivers to set the attr, it's just in this case we start one layer higher in
>>> the stack.  One note is not all ports in the bridge may support setting a
>>> bridge-level attribute, so rather than failing the entire set, we'll skip over
>>> those ports returning -EOPNOTSUPP.
>>
>> So, without a better device to hold that kind of information (in the
>> future it could be a global, switch-specific device holding that
>> information), I agree with your decision to take the bridge device to
>> hold that attribute, it still feels a bit uncomfortable to have
>> switchdev_attr_port() take a bridge device parameter, but whatever, here
>> is a scenario I am wondering how we would want to proceed with:
>>
>> - suppose we have a switch which is only able to control ageing
>> globally, not per port or any other kind of logical domain
>>
>> - we have enabled two software bridges on the same physical switch, with
>> different ageing timeouts
>>
>> It does not seem to me like it hurts ageing the other bridge faster than
>> expected (even though that could be expensive for MDIO devices), but we
>> would need to have consistent reporting here for the other bridge.
>
> It could hurt a little bit by ageing out entries prematurely, forcing
> relearning.  ;)
>
> I think if the switch can't support multiple ageing timeouts (which is
> probably typical), then the switch driver should not implement this
> switchdev attr at the port level (this patchset).  So how does ageing
> timeout get set for a switch with a global timer?

For Broadcom switches, you have a global timer which is 20 bits wide,
allowing both the min and max ageing times to be set. So I would
suspect you would have to setup a combination of hardware and software
timers if you want different ageing timeouts to be made per
bridge/VLAN/port etc.

>  I believe we need
> the switch-specific device to handle those switch-global attr sets.
> I'll send out a refresh of my RFC patches for switch device class, and
> add this ageing_time attr.

Works for me, thanks!
-- 
Florian

Re: [PATCH v2] net/ibm/emac: bump version numbers for correct work with ethtool

2015-09-28 Thread David Miller

From: Ivan Mikhaylov 
Date: Fri, 25 Sep 2015 11:52:27 +0400

> The size of the MAC register dump used to be the size specified by the
> reg property in the device tree.  Userland has no good way of finding
> out that size, and it was not specified consistently for each MAC type,
> so ethtool would end up printing junk at the end of the register dump
> if the device tree didn't match the size it assumed.
> 
> Using the new version numbers indicates unambiguously that the size of
> the MAC register dump is dependent only on the MAC type.
> 
> Fixes: 5369c71f7ca2 ("net/ibm/emac: fix size of emac dump memory areas")
> 
> Signed-off-by: Ivan Mikhaylov 

Applied and queued up for -stable.

Re: [PATCH v2 net-next] tcp: Fix CWV being too strict on thin streams

2015-09-28 Thread David Miller

From: Eric Dumazet 
Date: Sun, 27 Sep 2015 09:33:00 -0700

> David, any idea of what happened to Bendik patch ?
> 
> https://patchwork.ozlabs.org/patch/521765
> 
> Do we need to re-submit or something ?

Sorry, applied and build testing, I don't know how this disappeared
like that :-)

Re: [RFT] geneve: implement support for IPv6-based tunnels

2015-09-28 Thread John W. Linville

On Fri, Sep 25, 2015 at 02:08:44PM +0200, Jiri Benc wrote:
> On Thu, 24 Sep 2015 14:34:42 -0400, John W. Linville wrote:
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev)
> > +{
> > +   struct geneve_dev *geneve = netdev_priv(dev);
> > +   struct geneve_sock *gs = geneve->sock;
> > +   struct ip_tunnel_info *info = NULL;
> > +   struct dst_entry *dst = NULL;
> > +   struct flowi6 fl6;
> > +   __u8 ttl;
> > +   __be16 sport;
> > +   bool udp_csum;
> > +   int err;
> > +   bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
> > +
> > +   if (geneve->collect_md) {
> > +   info = skb_tunnel_info(skb);
> > +   if (unlikely(info && info->mode != IP_TUNNEL_INFO_TX)) {
> > +   netdev_dbg(dev, "no tunnel metadata\n");
> > +   goto tx_error;
> > +   }
> > +   }
> 
> You may get IPv4 tunnel info here. Either a check whether it's really
> IPv6 is needed or, better, decide whether to use IPv4 or IPv6 for xmit
> based on the tunnel info. See below.
> 
> > +static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev)
> > +{
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +   struct geneve_dev *geneve = netdev_priv(dev);
> > +
> > +   if (geneve->remote.sa.sa_family == AF_INET6)
> > +   return geneve6_xmit_skb(skb, dev);
> > +#endif
> > +   return geneve_xmit_skb(skb, dev);
> > +}
> 
> For metadata based tunnels, there should be no requirement for the
> remote to be specified. As the consequence, you cannot decide based on
> the remote type.

Sure, that makes sense.  I'm testing something now...

> To be really useful, geneve should open both IPv4 and IPv6 socket when
> it's metadata based. Take a look at my recent patchset that does this
> for vxlan: http://thread.gmane.org/gmane.linux.network/379282

OK, that seems simple enough.  So we should just assume that a metadata
tunnel could do either protocol at any time?  Or are there more rules
than that?
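
One way to picture the suggestion in this exchange (pick the address family from the per-packet tunnel info for metadata-based tunnels, and from the configured remote otherwise) is the small userspace sketch below. It is not the geneve driver: struct tun_info, struct tun_dev and xmit_family() are hypothetical names, and returning -EINVAL for missing metadata is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Hypothetical per-packet tunnel metadata. */
struct tun_info {
	int family;	/* AF_INET or AF_INET6 */
};

/* Hypothetical device configuration. */
struct tun_dev {
	bool collect_md;	/* metadata-based tunnel */
	int remote_family;	/* only meaningful when !collect_md */
};

/* Metadata tunnels take the family from the packet's tunnel info,
 * since no remote needs to be configured; others use the remote. */
static int xmit_family(const struct tun_dev *dev, const struct tun_info *info)
{
	assert(dev != NULL);
	if (dev->collect_md) {
		if (!info)
			return -EINVAL;	/* no tunnel metadata */
		return info->family;
	}
	return dev->remote_family;
}
```

Under this model a metadata tunnel can emit an IPv4 packet and an IPv6 packet back to back, which is why it would need both kinds of socket open.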

John
-- 
John W. Linville		Someday the world will need a hero, and you
linvi...@tuxdriver.com			might be all we have.  Be ready.

Re: [RFC PATCH] Fix false positives in can_checksum_protocol()

2015-09-28 Thread Tom Herbert

On Mon, Sep 28, 2015 at 12:26 PM, David Woodhouse  wrote:
> On Mon, 2015-09-28 at 12:13 -0700, Tom Herbert wrote:
>>
>> > Perhaps a better solution would be a bit in the skbuff which indicates
>> > that it *is* a TCP or UDP checksum. That would be set by our UDP and
>> > TCP sockets, cleared by encapsulation, also set if appropriate by
>> > skb_partial_csum_set().
>> >
>> Yes I agree. What I have been thinking to do is steal two bits from
>> csum_offset that would indicate that the checksum is IPv4 or IPv6
>> (specifically that the checksum value is seeded with an IPv4 or IPv6
>> pseudo header). This information plus the csum_offset would be
>> sufficient for drivers to identify the checksum as UDP/TCP-IPv4/IPv6.
>
>> The other case that needs special handling is inner vs. outer
>> checksum, but that can be deduced by comparing (inner of outer)
>> transport offset to checksum start. With this and a couple of utility
>> functions we should be able to start deprecating NETIF_F_IP_CSUM and
>> NETIF_F_IPV6_CSUM.
>
> You mean drivers which currently set NETIF_F_IP_CSUM would need to
> provide a .ndo_features_check() which tolerates only the packets they
> can actually handle? And we'd just ensure that the bits are there for
> them to use, in the skbuff? That seems reasonable.
>
I think it's easier to just call skb_checksum_help from the driver
when the packet is actually sent to the device (should be no cost for
late binding).

> Note that 'seeded with an IPv[46] pseudo header' isn't quite
> sufficient. Some hardware like 8139cp is explicitly told to do a UDP or
> a TCP checksum with a bit in the descriptor, so any UDP-like or TCP
> -like checksum works out fine.
>
UDP or TCP can be determined from csum_offset, e.g. 16=>TCP 6=>UDP
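
The offsets 16 and 6 mentioned here are the positions of the checksum field within the TCP and UDP headers. A minimal sketch, assuming header layouts per RFC 793 and RFC 768; the struct and function names are illustrative rather than the kernel's, and the result is called "TCP-like" or "UDP-like" because the offset alone does not prove which protocol is in use:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal TCP header layout (RFC 793); field names are illustrative. */
struct tcp_hdr {
	uint16_t source;
	uint16_t dest;
	uint32_t seq;
	uint32_t ack_seq;
	uint16_t flags;		/* data offset, reserved bits, control flags */
	uint16_t window;
	uint16_t check;
	uint16_t urg_ptr;
};

/* Minimal UDP header layout (RFC 768); field names are illustrative. */
struct udp_hdr {
	uint16_t source;
	uint16_t dest;
	uint16_t len;
	uint16_t check;
};

static_assert(offsetof(struct tcp_hdr, check) == 16, "TCP checksum at byte 16");
static_assert(offsetof(struct udp_hdr, check) == 6, "UDP checksum at byte 6");

enum l4_guess { L4_OTHER, L4_TCP_LIKE, L4_UDP_LIKE };

/* Guess the transport checksum type from the checksum offset alone. */
static enum l4_guess guess_l4_from_csum_offset(size_t csum_offset)
{
	if (csum_offset == offsetof(struct tcp_hdr, check))
		return L4_TCP_LIKE;
	if (csum_offset == offsetof(struct udp_hdr, check))
		return L4_UDP_LIKE;
	return L4_OTHER;
}
```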

> Other hardware works out whether to do a UDP or a TCP checksum for
> *itself*, so it *can't* cope with other protocols which just happen to
> look the same. For those it really *must* be IPPROTO_TCP or IPPROTO_UDP
> and they're going to be looking in the IP header for it.
>
In those cases driver can just look into the packet to determine the protocol.

> I do suspect we'll want a bit which says it's *actually* TCP or UDP,
> not just 'seeded with a pseudo-header'. That's the important
> distinction for NETIF_F_IP_CSUM vs. NETIF_F_HW_CSUM.
>
> --
> David Woodhouse                            Open Source Technology Centre
> david.woodho...@intel.com                              Intel Corporation
>

[PATCH 1/6] net/core: make sock_diag.c explicitly non-modular

2015-09-28 Thread Paul Gortmaker

The Makefile currently controlling compilation of this code lists
it under "obj-y" ...meaning that it currently is not being built as
a module by anyone.

Let's remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.  We can
change to one of the other priority initcalls (subsys?) at any later
date, if desired.

We can't remove module.h since the file uses other module related
stuff even though it is not modular itself.

We move the information from the MODULE_LICENSE tag to the top of the
file, since that information is not captured anywhere else.  The
MODULE_ALIAS_NET_PF_PROTO becomes a no-op in the non modular case, so
it is removed.

Cc: "David S. Miller" 
Cc: Eric Dumazet 
Cc: Nicolas Dichtel 
Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Cc: Craig Gallek 
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker 
---
 net/core/sock_diag.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 817622f3dbb7..0c1d58d43f67 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -1,3 +1,5 @@
+/* License: GPL */
+
 #include 
 #include 
 #include 
@@ -323,14 +325,4 @@ static int __init sock_diag_init(void)
BUG_ON(!broadcast_wq);
return register_pernet_subsys(_net_ops);
 }
-
-static void __exit sock_diag_exit(void)
-{
-   unregister_pernet_subsys(_net_ops);
-   destroy_workqueue(broadcast_wq);
-}
-
-module_init(sock_diag_init);
-module_exit(sock_diag_exit);
-MODULE_LICENSE("GPL");
-MODULE_ALIAS_NET_PF_PROTO(PF_NETLINK, NETLINK_SOCK_DIAG);
+device_initcall(sock_diag_init);
-- 
2.6.0.rc3


[PATCH net-next 0/6] make non-modular code explicitly non-modular

2015-09-28 Thread Paul Gortmaker

In a previous merge window, we made changes to allow better
delineation between modular and non-modular code in commit
0fd972a7d91d6e15393c449492a04d94c0b89351 ("module: relocate module_init
from init.h to module.h").  This allows us to now ensure module code
looks modular and non-modular code does not accidentally look modular
just to avoid suffering build breakage.

Here we target code that is, by nature of their Makefile and/or
Kconfig settings, only available to be built-in, but implicitly
presenting itself as being possibly modular by way of using modular
headers, macros, and functions.

The goal here is to remove that illusion of modularity from these
files, but in a way that leaves the actual runtime unchanged.
In doing so, we remove code that has never been tested and adds
no value to the tree.  And we continue the process of expecting a
level of consistency between the Kconfig/Makefile of code and the
code in use itself.

Fortunately the net subsystem has relatively few instances, given
the overall amount of code and drivers it contains.  For comparison
there are over 300 instances tree wide, resulting in a possible net
removal of on the order of 5000 lines of unused code.

Build tested on net-next 34c2d9fb0498 on m68k, since that is the arch
where the three ethernet drivers changed here are available.

Paul.
--

Cc: Alexei Starovoitov 
Cc: Anish Bhatt 
Cc: Craig Gallek 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Eric Dumazet 
Cc: Jamal Hadi Salim 
Cc: John Fastabend 
Cc: Nicolas Dichtel 
Cc: Or Gerlitz 
Cc: Shani Michaeli 
Cc: linux-m...@lists.linux-m68k.org
Cc: netdev@vger.kernel.org

Paul Gortmaker (6):
  net/core: make sock_diag.c explicitly non-modular
  net/dcb: make dcbnl.c explicitly non-modular
  net/sched: make sch_blackhole.c explicitly non-modular
  net/ethernet: make amd/hplance.c driver explicitly non-modular
  net/ethernet: make 8390/mac8390.c driver explicitly non-modular
  net/ethernet: make apple/macmace.c driver explicitly non-modular

 drivers/net/ethernet/8390/mac8390.c  | 40 ++--
 drivers/net/ethernet/amd/hplance.c   | 13 ++--
 drivers/net/ethernet/apple/macmace.c | 38 +++---
 net/core/sock_diag.c | 14 +++--
 net/dcb/dcbnl.c  | 30 +++
 net/sched/sch_blackhole.c| 15 +++---
 6 files changed, 16 insertions(+), 134 deletions(-)

-- 
2.6.0.rc3


[PATCH] net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set

2015-09-28 Thread David Ahern

Wolfgang reported that the IPv6 stack is ignoring oif in output route lookups:

With ipv6, ip -6 route get always returns the specific route.

$ ip -6 r
2001:db8:e2::1 dev enp2s0  proto kernel  metric 256
2001:db8:e2::/64 dev enp2s0  metric 1024
2001:db8:e3::1 dev enp3s0  proto kernel  metric 256
2001:db8:e3::/64 dev enp3s0  metric 1024
fe80::/64 dev enp3s0  proto kernel  metric 256
default via 2001:db8:e3::255 dev enp3s0  metric 1024

$ ip -6 r get 2001:db8:e2::100
2001:db8:e2::100 from :: dev enp2s0  src 2001:db8:e3::1  metric 0
cache

$ ip -6 r get 2001:db8:e2::100 oif enp3s0
2001:db8:e2::100 from :: dev enp2s0  src 2001:db8:e3::1  metric 0
cache

The stack does consider the oif but a mismatch in rt6_device_match is not
considered fatal because RT6_LOOKUP_F_IFACE is not set in the flags.

Cc: Wolfgang Nothdurft 
Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f204089e854c..cb32ce250db0 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1193,7 +1193,8 @@ struct dst_entry *ip6_route_output(struct net *net, const 
struct sock *sk,
 
fl6->flowi6_iif = LOOPBACK_IFINDEX;
 
-   if ((sk && sk->sk_bound_dev_if) || rt6_need_strict(&fl6->daddr))
+   if ((sk && sk->sk_bound_dev_if) || rt6_need_strict(&fl6->daddr) ||
+   fl6->flowi6_oif)
flags |= RT6_LOOKUP_F_IFACE;
 
if (!ipv6_addr_any(&fl6->saddr))
-- 
2.3.8 (Apple Git-58)
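
The behaviour described in the commit message above can be modelled in
userspace.  This is only a simplified sketch, not the kernel code: struct
route, LOOKUP_F_IFACE, device_match() and output_lookup_flags() are
hypothetical stand-ins for the route entries, RT6_LOOKUP_F_IFACE,
rt6_device_match() and the flag computation in ip6_route_output().

```c
#include <stddef.h>

/* Hypothetical stand-in for RT6_LOOKUP_F_IFACE. */
#define LOOKUP_F_IFACE 0x1

/* Hypothetical, minimal route entry: only the output device matters here. */
struct route {
	int ifindex;
};

/*
 * Model of the device-match step.  A route on the requested oif is
 * preferred.  On a mismatch, a strict lookup (LOOKUP_F_IFACE set) fails
 * with NULL, while a non-strict lookup silently falls back to the first
 * candidate, which is the behaviour Wolfgang observed.
 */
static const struct route *device_match(const struct route *routes, size_t n,
					int oif, int flags)
{
	size_t i;

	if (n == 0)
		return NULL;
	if (oif == 0)
		return &routes[0];	/* no oif requested */

	for (i = 0; i < n; i++)
		if (routes[i].ifindex == oif)
			return &routes[i];

	return (flags & LOOKUP_F_IFACE) ? NULL : &routes[0];
}

/* Model of the patched flag computation: a set oif now forces strictness. */
static int output_lookup_flags(int sk_bound_dev_if, int need_strict, int oif)
{
	int flags = 0;

	if (sk_bound_dev_if || need_strict || oif)
		flags |= LOOKUP_F_IFACE;
	return flags;
}
```

With flags of 0 a lookup for oif 3 against a single route on ifindex 2
still returns that route; with the flags from output_lookup_flags(0, 0, 3)
the same lookup returns NULL.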


Re: 4.3-rc3 Regression: NFS access stall by commit 6ae459bdaaee

2015-09-28 Thread Pravin Shelar

On Mon, Sep 28, 2015 at 6:12 AM, Takashi Iwai  wrote:
> [I resent this since the previous mail didn't go out properly, as it
>  seems; apologies if you already read it, please disregard]
>
> Hi,
>
> I noticed that NFS access from my workstation slowed down drastically,
> almost stalls, with the fresh 4.3-rc3.  There are no particular kernel
> errors / warnings.
>

I have seen error reports related to IPv6 traffic which I am
debugging. Are you trying to access NFS over IPv6?


> Then I performed git bisection, and it led to the commit:
> 6ae459bdaaeebc632b16e54dcbabb490c6931d61
> skbuff: Fix skb checksum flag on skb pull
>
> Reverting this commit from 4.3-rc3 fixed the issue indeed.
>
> Could you take a look at this?  I added Trond to Cc in case he might
> already know of it.
>
>
> thanks,
>
> Takashi

RESEND: [PATCH v3 net-next] sky2: use random address if EEPROM is bad

2015-09-28 Thread Liviu Dudau

On some embedded systems the EEPROM does not contain a valid MAC address.
In that case it is better to fall back to a generated MAC address and
let init scripts fix the value later.

Reported-by: Liviu Dudau 
Signed-off-by: Stephen Hemminger 
[Changed handcoded setup to use eth_hw_addr_random() and to save new address 
into HW]
Signed-off-by: Liviu Dudau 
---
 drivers/net/ethernet/marvell/sky2.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/marvell/sky2.c 
b/drivers/net/ethernet/marvell/sky2.c
index d9f4498..5606a04 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -4819,6 +4819,18 @@ static struct net_device *sky2_init_netdev(struct 
sky2_hw *hw, unsigned port,
memcpy_fromio(dev->dev_addr, hw->regs + B2_MAC_1 + port * 8,
  ETH_ALEN);
 
+   /* if the address is invalid, use a random value */
+   if (!is_valid_ether_addr(dev->dev_addr)) {
+   struct sockaddr sa = { AF_UNSPEC };
+
+   netdev_warn(dev,
+   "Invalid MAC address, defaulting to random\n");
+   eth_hw_addr_random(dev);
+   memcpy(sa.sa_data, dev->dev_addr, ETH_ALEN);
+   if (sky2_set_mac_address(dev, &sa))
+   netdev_warn(dev, "Failed to set MAC address.\n");
+   }
+
return dev;
 }
 
-- 
2.5.1
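
For readers unfamiliar with the helpers used in the hunk above, the
following is a userspace sketch of the same fallback logic.  It is an
illustration under stated assumptions, not the kernel implementation:
mac_is_valid(), mac_random() and mac_fixup() are hypothetical names, and
rand() merely stands in for a proper random source.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ETH_ALEN 6

/* A usable station address is unicast and not all zeroes. */
static int mac_is_valid(const uint8_t addr[ETH_ALEN])
{
	static const uint8_t zero[ETH_ALEN];

	if (addr[0] & 0x01)		/* group (multicast/broadcast) bit */
		return 0;
	return memcmp(addr, zero, ETH_ALEN) != 0;
}

/* Fill addr with a random, locally administered unicast address. */
static void mac_random(uint8_t addr[ETH_ALEN])
{
	int i;

	for (i = 0; i < ETH_ALEN; i++)
		addr[i] = (uint8_t)(rand() & 0xff);	/* placeholder RNG */
	addr[0] &= 0xfe;		/* clear the multicast bit */
	addr[0] |= 0x02;		/* set the locally administered bit */
}

/* Returns 1 if a random address had to be substituted, 0 otherwise. */
static int mac_fixup(uint8_t addr[ETH_ALEN])
{
	if (mac_is_valid(addr))
		return 0;
	mac_random(addr);
	return 1;
}
```

Because the locally administered bit is always set, the generated address
can never be all zeroes, so it passes the validity check by construction.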


[PATCH net-next 10/11] net: Rename FLOWI_FLAG_VRFSRC to FLOWI_FLAG_L3MDEV_SRC

2015-09-28 Thread David Ahern

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c   | 4 ++--
 include/net/flow.h  | 2 +-
 include/net/route.h | 2 +-
 net/ipv4/udp.c      | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 64f2ab663ffe..f2f9c7091130 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -208,7 +208,7 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff 
*skb,
.flowi4_oif = vrf_dev->ifindex,
.flowi4_iif = LOOPBACK_IFINDEX,
.flowi4_tos = RT_TOS(ip4h->tos),
-   .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC |
+   .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_L3MDEV_SRC |
FLOWI_FLAG_SKIP_NH_OIF,
.daddr = ip4h->daddr,
};
@@ -545,7 +545,7 @@ static struct rtable *vrf_get_rtable(const struct 
net_device *dev,
 {
struct rtable *rth = NULL;
 
-   if (!(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+   if (!(fl4->flowi4_flags & FLOWI_FLAG_L3MDEV_SRC)) {
struct net_vrf *vrf = netdev_priv(dev);
 
rth = vrf->rth;
diff --git a/include/net/flow.h b/include/net/flow.h
index 9b85db85f13c..83969eebebf3 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -34,7 +34,7 @@ struct flowi_common {
__u8    flowic_flags;
 #define FLOWI_FLAG_ANYSRC  0x01
 #define FLOWI_FLAG_KNOWN_NH    0x02
-#define FLOWI_FLAG_VRFSRC  0x04
+#define FLOWI_FLAG_L3MDEV_SRC  0x04
 #define FLOWI_FLAG_SKIP_NH_OIF 0x08
__u32   flowic_secid;
struct flowi_tunnel flowic_tun_key;
diff --git a/include/net/route.h b/include/net/route.h
index e211dc167db1..7929c9c33587 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -258,7 +258,7 @@ static inline void ip_route_connect_init(struct flowi4 
*fl4, __be32 dst, __be32
flow_flags |= FLOWI_FLAG_ANYSRC;
 
if (netif_index_is_l3_master(sock_net(sk), oif))
-   flow_flags |= FLOWI_FLAG_VRFSRC | FLOWI_FLAG_SKIP_NH_OIF;
+   flow_flags |= FLOWI_FLAG_L3MDEV_SRC | FLOWI_FLAG_SKIP_NH_OIF;
 
flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
   protocol, flow_flags, dst, src, dport, sport);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 156ba75b6000..b2882cfd3136 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1024,7 +1024,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
if (netif_index_is_l3_master(net, ipc.oif)) {
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  (flow_flags | FLOWI_FLAG_VRFSRC |
+  (flow_flags | FLOWI_FLAG_L3MDEV_SRC |
FLOWI_FLAG_SKIP_NH_OIF),
   faddr, saddr, dport,
   inet->inet_sport);
-- 
1.9.1


[PATCH net-next 07/11] net: Remove the now unused vrf_ptr

2015-09-28 Thread David Ahern

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c         | 32 ++--
 include/linux/netdevice.h |  2 --
 include/net/vrf.h         |  6 --
 3 files changed, 2 insertions(+), 38 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 72f1892ebad0..df872f4efb0d 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -396,18 +396,15 @@ static void __vrf_insert_slave(struct slave_queue *queue, 
struct slave *slave)
 
 static int do_vrf_add_slave(struct net_device *dev, struct net_device 
*port_dev)
 {
-   struct net_vrf_dev *vrf_ptr = kmalloc(sizeof(*vrf_ptr), GFP_KERNEL);
struct slave *slave = kzalloc(sizeof(*slave), GFP_KERNEL);
struct net_vrf *vrf = netdev_priv(dev);
struct slave_queue *queue = &vrf->queue;
int ret = -ENOMEM;
 
-   if (!slave || !vrf_ptr)
+   if (!slave)
goto out_fail;
 
slave->dev = port_dev;
-   vrf_ptr->ifindex = dev->ifindex;
-   vrf_ptr->tb_id = vrf->tb_id;
 
/* register the packet handler for slave ports */
ret = netdev_rx_handler_register(port_dev, vrf_handle_frame, dev);
@@ -424,7 +421,6 @@ static int do_vrf_add_slave(struct net_device *dev, struct 
net_device *port_dev)
 
port_dev->flags |= IFF_SLAVE;
__vrf_insert_slave(queue, slave);
-   rcu_assign_pointer(port_dev->vrf_ptr, vrf_ptr);
cycle_netdev(port_dev);
 
return 0;
@@ -432,7 +428,6 @@ static int do_vrf_add_slave(struct net_device *dev, struct 
net_device *port_dev)
 out_unregister:
netdev_rx_handler_unregister(port_dev);
 out_fail:
-   kfree(vrf_ptr);
kfree(slave);
return ret;
 }
@@ -448,21 +443,15 @@ static int vrf_add_slave(struct net_device *dev, struct 
net_device *port_dev)
 /* inverse of do_vrf_add_slave */
 static int do_vrf_del_slave(struct net_device *dev, struct net_device 
*port_dev)
 {
-   struct net_vrf_dev *vrf_ptr = rtnl_dereference(port_dev->vrf_ptr);
struct net_vrf *vrf = netdev_priv(dev);
struct slave_queue *queue = &vrf->queue;
struct slave *slave;
 
-   RCU_INIT_POINTER(port_dev->vrf_ptr, NULL);
-
netdev_upper_dev_unlink(port_dev, dev);
port_dev->flags &= ~IFF_SLAVE;
 
netdev_rx_handler_unregister(port_dev);
 
-   /* after netdev_rx_handler_unregister for synchronize_rcu */
-   kfree(vrf_ptr);
-
cycle_netdev(port_dev);
 
slave = __vrf_find_slave_dev(queue, port_dev);
@@ -601,10 +590,6 @@ static int vrf_validate(struct nlattr *tb[], struct nlattr 
*data[])
 
 static void vrf_dellink(struct net_device *dev, struct list_head *head)
 {
-   struct net_vrf_dev *vrf_ptr = rtnl_dereference(dev->vrf_ptr);
-
-   RCU_INIT_POINTER(dev->vrf_ptr, NULL);
-   kfree_rcu(vrf_ptr, rcu);
unregister_netdevice_queue(dev, head);
 }
 
@@ -612,7 +597,6 @@ static int vrf_newlink(struct net *src_net, struct 
net_device *dev,
   struct nlattr *tb[], struct nlattr *data[])
 {
struct net_vrf *vrf = netdev_priv(dev);
-   struct net_vrf_dev *vrf_ptr;
int err;
 
if (!data || !data[IFLA_VRF_TABLE])
@@ -622,24 +606,13 @@ static int vrf_newlink(struct net *src_net, struct 
net_device *dev,
 
dev->priv_flags |= IFF_L3MDEV_MASTER;
 
-   err = -ENOMEM;
-   vrf_ptr = kmalloc(sizeof(*dev->vrf_ptr), GFP_KERNEL);
-   if (!vrf_ptr)
-   goto out_fail;
-
-   vrf_ptr->ifindex = dev->ifindex;
-   vrf_ptr->tb_id = vrf->tb_id;
-
err = register_netdevice(dev);
if (err < 0)
goto out_fail;
 
-   rcu_assign_pointer(dev->vrf_ptr, vrf_ptr);
-
return 0;
 
 out_fail:
-   kfree(vrf_ptr);
free_netdev(dev);
return err;
 }
@@ -683,10 +656,9 @@ static int vrf_device_event(struct notifier_block *unused,
 
/* only care about unregister events to drop slave references */
if (event == NETDEV_UNREGISTER) {
-   struct net_vrf_dev *vrf_ptr = rtnl_dereference(dev->vrf_ptr);
struct net_device *vrf_dev;
 
-   if (!vrf_ptr || netif_is_l3_master(dev))
+   if (netif_is_l3_master(dev))
goto out;
 
vrf_dev = netdev_master_upper_dev_get(dev);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c7f14794fe14..72bf9e37a2f0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1427,7 +1427,6 @@ enum netdev_priv_flags {
  * @dn_ptr:    DECnet specific data
  * @ip6_ptr:   IPv6 specific data
  * @ax25_ptr:  AX.25 specific data
- * @vrf_ptr:   VRF specific data
  * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
  *
  * @last_rx:   Time of last Rx
@@ -1649,7 +1648,6 @@ struct net_device {
struct dn_dev __rcu *dn_ptr;
struct inet6_dev __rcu  *ip6_ptr;
void    *ax25_ptr;
-

[PATCH net-next 03/11] net: Add support for l3mdev ops to VRF driver

2015-09-28 Thread David Ahern

Signed-off-by: David Ahern 
---
 drivers/net/Kconfig |  1 +
 drivers/net/vrf.c   | 29 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index d18eb607bee6..b9ebd0d18a52 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -299,6 +299,7 @@ config NLMON
 config NET_VRF
tristate "Virtual Routing and Forwarding (Lite)"
depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+   depends on NET_L3_MASTER_DEV
---help---
  This option enables the support for mapping interfaces into VRF's. The
  support enables VRF devices.
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 2d7418e0b908..72f1892ebad0 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRV_NAME   "vrf"
 #define DRV_VERSION    "1.0"
@@ -529,6 +530,33 @@ static const struct net_device_ops vrf_netdev_ops = {
.ndo_del_slave  = vrf_del_slave,
 };
 
+static u32 vrf_fib_table(const struct net_device *dev)
+{
+   struct net_vrf *vrf = netdev_priv(dev);
+
+   return vrf->tb_id;
+}
+
+static struct rtable *vrf_get_rtable(const struct net_device *dev,
+const struct flowi4 *fl4)
+{
+   struct rtable *rth = NULL;
+
+   if (!(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+   struct net_vrf *vrf = netdev_priv(dev);
+
+   rth = vrf->rth;
+   atomic_inc(&rth->dst.__refcnt);
+   }
+
+   return rth;
+}
+
+static const struct l3mdev_ops vrf_l3mdev_ops = {
+   .l3mdev_fib_table   = vrf_fib_table,
+   .l3mdev_get_rtable  = vrf_get_rtable,
+};
+
 static void vrf_get_drvinfo(struct net_device *dev,
struct ethtool_drvinfo *info)
 {
@@ -546,6 +574,7 @@ static void vrf_setup(struct net_device *dev)
 
/* Initialize the device structure. */
dev->netdev_ops = &vrf_netdev_ops;
+   dev->l3mdev_ops = &vrf_l3mdev_ops;
dev->ethtool_ops = &vrf_ethtool_ops;
dev->destructor = free_netdev;
 
-- 
1.9.1
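
The ops-table pattern introduced by this patch can be tried out in
userspace.  The sketch below is a simplified model only: struct netdev,
struct l3_ops, vrf_like_fib_table() and fib_table_for() are hypothetical
stand-ins for struct net_device, struct l3mdev_ops, vrf_fib_table() and
l3mdev_fib_table_rcu(), and all RCU handling is left out.

```c
#include <stddef.h>
#include <stdint.h>

struct netdev;

/* Hypothetical stand-in for struct l3mdev_ops. */
struct l3_ops {
	uint32_t (*fib_table)(const struct netdev *dev);
};

/* Hypothetical, minimal stand-in for struct net_device. */
struct netdev {
	int is_l3_master;		/* models IFF_L3MDEV_MASTER */
	const struct l3_ops *ops;	/* NULL for ordinary devices */
	const struct netdev *master;	/* upper device, if enslaved */
	uint32_t tb_id;			/* private data of a VRF-like master */
};

/* What a VRF-like driver would plug into the ops table. */
static uint32_t vrf_like_fib_table(const struct netdev *dev)
{
	return dev->tb_id;
}

static const struct l3_ops vrf_like_ops = {
	.fib_table = vrf_like_fib_table,
};

/* Returns the FIB table id for dev, or 0 when no L3 master applies. */
static uint32_t fib_table_for(const struct netdev *dev)
{
	const struct netdev *m;

	if (!dev)
		return 0;

	/* A master answers for itself, a slave defers to its master. */
	m = dev->is_l3_master ? dev : dev->master;
	if (m && m->is_l3_master && m->ops && m->ops->fib_table)
		return m->ops->fib_table(m);

	return 0;
}
```

The generic code never looks at tb_id directly; only the driver callback
does, which is what lets drivers other than VRF reuse the same hooks.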


[PATCH net-next 02/11] net: Introduce L3 Master device abstraction

2015-09-28 Thread David Ahern

L3 master devices allow users of the abstraction to influence FIB lookups
for enslaved devices. Current API provides a means for the master device
to return a specific FIB table for an enslaved device, to return an
rtable/custom dst and influence the OIF used for fib lookups.

Signed-off-by: David Ahern 
---
 MAINTAINERS               |   7 +++
 include/linux/netdevice.h |   3 ++
 include/net/l3mdev.h      | 127 ++
 net/Kconfig               |   1 +
 net/Makefile              |   3 ++
 net/l3mdev/Kconfig        |  10 
 net/l3mdev/Makefile       |   5 ++
 net/l3mdev/l3mdev.c       |  78 
 8 files changed, 234 insertions(+)
 create mode 100644 include/net/l3mdev.h
 create mode 100644 net/l3mdev/Kconfig
 create mode 100644 net/l3mdev/Makefile
 create mode 100644 net/l3mdev/l3mdev.c

diff --git a/MAINTAINERS b/MAINTAINERS
index bcd263de4827..3f2d7a9d0bbf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6095,6 +6095,13 @@ F:   Documentation/auxdisplay/ks0108
 F: drivers/auxdisplay/ks0108.c
 F: include/linux/ks0108.h
 
+L3MDEV
+M: David Ahern 
+L: netdev@vger.kernel.org
+S: Maintained
+F: net/l3mdev
+F: include/net/l3mdev.h
+
 LAPB module
 L: linux-...@vger.kernel.org
 S: Orphan
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 99c33e83822f..c7f14794fe14 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1587,6 +1587,9 @@ struct net_device {
 #ifdef CONFIG_NET_SWITCHDEV
const struct switchdev_ops *switchdev_ops;
 #endif
+#ifdef CONFIG_NET_L3_MASTER_DEV
+   const struct l3mdev_ops *l3mdev_ops;
+#endif
 
const struct header_ops *header_ops;
 
diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
new file mode 100644
index ..cd44d24eac57
--- /dev/null
+++ b/include/net/l3mdev.h
@@ -0,0 +1,127 @@
+/*
+ * include/net/l3mdev.h - L3 master device API
+ * Copyright (c) 2015 Cumulus Networks
+ * Copyright (c) 2015 David Ahern 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#ifndef _NET_L3MDEV_H_
+#define _NET_L3MDEV_H_
+
+#include 
+
+/**
+ * struct l3mdev_ops - l3mdev operations
+ *
+ * @l3mdev_fib_table: Get FIB table id to use for lookups
+ *
+ * @l3mdev_get_rtable: Get cached IPv4 rtable (dst_entry) for device
+ */
+
+struct l3mdev_ops {
+   u32 (*l3mdev_fib_table)(const struct net_device *dev);
+   struct rtable * (*l3mdev_get_rtable)(const struct net_device *dev,
+const struct flowi4 *fl4);
+};
+
+#ifdef CONFIG_NET_L3_MASTER_DEV
+
+int l3mdev_master_ifindex_rcu(struct net_device *dev);
+static inline int l3mdev_master_ifindex(struct net_device *dev)
+{
+   int ifindex;
+
+   rcu_read_lock();
+   ifindex = l3mdev_master_ifindex_rcu(dev);
+   rcu_read_unlock();
+
+   return ifindex;
+}
+
+/* get index of an interface to use for FIB lookups. For devices
+ * enslaved to an L3 master device FIB lookups are based on the
+ * master index
+ */
+static inline int l3mdev_fib_oif_rcu(struct net_device *dev)
+{
+   return l3mdev_master_ifindex_rcu(dev) ? : dev->ifindex;
+}
+
+static inline int l3mdev_fib_oif(struct net_device *dev)
+{
+   int oif;
+
+   rcu_read_lock();
+   oif = l3mdev_fib_oif_rcu(dev);
+   rcu_read_unlock();
+
+   return oif;
+}
+
+u32 l3mdev_fib_table_rcu(const struct net_device *dev);
+u32 l3mdev_fib_table_by_index(struct net *net, int ifindex);
+static inline u32 l3mdev_fib_table(const struct net_device *dev)
+{
+   u32 tb_id;
+
+   rcu_read_lock();
+   tb_id = l3mdev_fib_table_rcu(dev);
+   rcu_read_unlock();
+
+   return tb_id;
+}
+
+static inline struct rtable *l3mdev_get_rtable(const struct net_device *dev,
+  const struct flowi4 *fl4)
+{
+   if (netif_is_l3_master(dev) && dev->l3mdev_ops->l3mdev_get_rtable)
+   return dev->l3mdev_ops->l3mdev_get_rtable(dev, fl4);
+
+   return NULL;
+}
+
+#else
+
+static inline int l3mdev_master_ifindex_rcu(struct net_device *dev)
+{
+   return 0;
+}
+static inline int l3mdev_master_ifindex(struct net_device *dev)
+{
+   return 0;
+}
+
+static inline int l3mdev_fib_oif_rcu(struct net_device *dev)
+{
+   return dev ? dev->ifindex : 0;
+}
+static inline int l3mdev_fib_oif(struct net_device *dev)
+{
+   return dev ? dev->ifindex : 0;
+}
+
+static inline u32 l3mdev_fib_table_rcu(const struct net_device *dev)
+{
+   return 0;
+}
+static inline u32 l3mdev_fib_table(const struct net_device *dev)
+{
+   return 0;
+}
+static inline u32 l3mdev_fib_table_by_index(struct net *net, int

[PATCH net-next 05/11] net: Replace vrf_dev_table and friends

2015-09-28 Thread David Ahern

Replace calls to vrf_dev_table and friends with l3mdev_fib_table
and kin.

Signed-off-by: David Ahern 
---
 include/net/vrf.h       | 80 -
 net/ipv4/af_inet.c      |  4 +--
 net/ipv4/fib_frontend.c |  7 ++---
 3 files changed, 5 insertions(+), 86 deletions(-)

diff --git a/include/net/vrf.h b/include/net/vrf.h
index 874a6c9e4217..b05b96646e2a 100644
--- a/include/net/vrf.h
+++ b/include/net/vrf.h
@@ -34,66 +34,6 @@ struct net_vrf {
 
 
 #if IS_ENABLED(CONFIG_NET_VRF)
-/* called with rcu_read_lock */
-static inline u32 vrf_dev_table_rcu(const struct net_device *dev)
-{
-   u32 tb_id = 0;
-
-   if (dev) {
-   struct net_vrf_dev *vrf_ptr;
-
-   vrf_ptr = rcu_dereference(dev->vrf_ptr);
-   if (vrf_ptr)
-   tb_id = vrf_ptr->tb_id;
-   }
-   return tb_id;
-}
-
-static inline u32 vrf_dev_table(const struct net_device *dev)
-{
-   u32 tb_id;
-
-   rcu_read_lock();
-   tb_id = vrf_dev_table_rcu(dev);
-   rcu_read_unlock();
-
-   return tb_id;
-}
-
-static inline u32 vrf_dev_table_ifindex(struct net *net, int ifindex)
-{
-   struct net_device *dev;
-   u32 tb_id = 0;
-
-   if (!ifindex)
-   return 0;
-
-   rcu_read_lock();
-
-   dev = dev_get_by_index_rcu(net, ifindex);
-   if (dev)
-   tb_id = vrf_dev_table_rcu(dev);
-
-   rcu_read_unlock();
-
-   return tb_id;
-}
-
-/* called with rtnl */
-static inline u32 vrf_dev_table_rtnl(const struct net_device *dev)
-{
-   u32 tb_id = 0;
-
-   if (dev) {
-   struct net_vrf_dev *vrf_ptr;
-
-   vrf_ptr = rtnl_dereference(dev->vrf_ptr);
-   if (vrf_ptr)
-   tb_id = vrf_ptr->tb_id;
-   }
-   return tb_id;
-}
-
 /* caller has already checked netif_is_l3_master(dev) */
 static inline struct rtable *vrf_dev_get_rth(const struct net_device *dev)
 {
@@ -108,26 +48,6 @@ static inline struct rtable *vrf_dev_get_rth(const struct 
net_device *dev)
 }
 
 #else
-static inline u32 vrf_dev_table_rcu(const struct net_device *dev)
-{
-   return 0;
-}
-
-static inline u32 vrf_dev_table(const struct net_device *dev)
-{
-   return 0;
-}
-
-static inline u32 vrf_dev_table_ifindex(struct net *net, int ifindex)
-{
-   return 0;
-}
-
-static inline u32 vrf_dev_table_rtnl(const struct net_device *dev)
-{
-   return 0;
-}
-
 static inline struct rtable *vrf_dev_get_rth(const struct net_device *dev)
 {
return ERR_PTR(-ENETUNREACH);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8a556643b874..0df3f0527648 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -119,7 +119,7 @@
 #ifdef CONFIG_IP_MROUTE
 #include 
 #endif
-#include 
+#include 
 
 
 /* The inetsw table contains everything that inet_create needs to
@@ -450,7 +450,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
goto out;
}
 
-   tb_id = vrf_dev_table_ifindex(net, sk->sk_bound_dev_if) ? : tb_id;
+   tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id;
chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
 
/* Not specified by any standard per-se, however it breaks too
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index b901b344f22d..fac172370276 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,7 +45,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -256,7 +255,7 @@ EXPORT_SYMBOL(inet_addr_type);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr)
 {
-   u32 rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+   u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_LOCAL;
 
return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
@@ -269,7 +268,7 @@ unsigned int inet_addr_type_dev_table(struct net *net,
  const struct net_device *dev,
  __be32 addr)
 {
-   u32 rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+   u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_LOCAL;
 
return __inet_dev_addr_type(net, NULL, addr, rt_table);
 }
@@ -804,7 +803,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct 
in_ifaddr *ifa)
 {
struct net *net = dev_net(ifa->ifa_dev->dev);
-   u32 tb_id = vrf_dev_table_rtnl(ifa->ifa_dev->dev);
+   u32 tb_id = l3mdev_fib_table(ifa->ifa_dev->dev);
struct fib_table *tb;
struct fib_config cfg = {
.fc_protocol = RTPROT_KERNEL,
-- 
1.9.1
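
Several hunks above rely on the GNU C "x ? : y" shorthand to fall back to
a default table when no L3 master table is found.  A portable spelling of
the same idea is sketched below; table_or_default() and addr_type_table()
are hypothetical helper names, while 255 is the value Linux assigns to
RT_TABLE_LOCAL.

```c
#include <stdint.h>

#define RT_TABLE_LOCAL 255u

/*
 * Portable equivalent of "l3_table ? : fallback": use the L3 master
 * table when one was found (non-zero), otherwise the fallback.
 */
static uint32_t table_or_default(uint32_t l3_table, uint32_t fallback)
{
	return l3_table ? l3_table : fallback;
}

/* Mirrors the pattern used for rt_table in inet_dev_addr_type(). */
static uint32_t addr_type_table(uint32_t l3_table)
{
	return table_or_default(l3_table, RT_TABLE_LOCAL);
}
```

A table id of 0 therefore doubles as "no L3 master", which is why the
stubs for the disabled configuration simply return 0.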


[PATCH net-next 08/11] net: Remove vrf header file

2015-09-28 Thread David Ahern

Move remaining structs to VRF driver and delete the vrf header file.

Signed-off-by: David Ahern 
---
 MAINTAINERS   |  1 -
 drivers/net/vrf.c | 16 +++-
 include/net/vrf.h | 29 -
 3 files changed, 15 insertions(+), 31 deletions(-)
 delete mode 100644 include/net/vrf.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 3f2d7a9d0bbf..fa43fa2f30e4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11273,7 +11273,6 @@ M:  Shrijeet Mukherjee 
 L: netdev@vger.kernel.org
 S: Maintained
 F: drivers/net/vrf.c
-F: include/net/vrf.h
 F: Documentation/networking/vrf.txt
 
 VT1211 HARDWARE MONITOR DRIVER
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index df872f4efb0d..64f2ab663ffe 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -34,7 +34,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define DRV_NAME   "vrf"
@@ -45,6 +44,21 @@
 #define vrf_master_get_rcu(dev) \
((struct net_device *)rcu_dereference(dev->rx_handler_data))
 
+struct slave {
+   struct list_head    list;
+   struct net_device   *dev;
+};
+
+struct slave_queue {
+   struct list_head    all_slaves;
+};
+
+struct net_vrf {
+   struct slave_queue  queue;
+   struct rtable   *rth;
+   u32 tb_id;
+};
+
 struct pcpu_dstats {
u64 tx_pkts;
u64 tx_bytes;
diff --git a/include/net/vrf.h b/include/net/vrf.h
deleted file mode 100644
index e83fc38770dd..
--- a/include/net/vrf.h
+++ /dev/null
@@ -1,29 +0,0 @@
-/*
- * include/net/net_vrf.h - adds vrf dev structure definitions
- * Copyright (c) 2015 Cumulus Networks
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#ifndef __LINUX_NET_VRF_H
-#define __LINUX_NET_VRF_H
-
-struct slave {
-   struct list_head    list;
-   struct net_device   *dev;
-};
-
-struct slave_queue {
-   struct list_head    all_slaves;
-};
-
-struct net_vrf {
-   struct slave_queue  queue;
-   struct rtable   *rth;
-   u32 tb_id;
-};
-
-#endif /* __LINUX_NET_VRF_H */
-- 
1.9.1


[PATCH net-next 11/11] net: Add netif_is_l3_slave

2015-09-28 Thread David Ahern

IPv6 addrconf keys off of IFF_SLAVE so can not use it for L3 slave.
Add a new private flag and add netif_is_l3_slave function for checking
it.

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c         | 8 +++-
 include/linux/netdevice.h | 7 +++
 net/l3mdev/l3mdev.c       | 8 
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index f2f9c7091130..277a8e0a6a4f 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -39,8 +39,6 @@
 #define DRV_NAME   "vrf"
 #define DRV_VERSION    "1.0"
 
-#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
-
 #define vrf_master_get_rcu(dev) \
((struct net_device *)rcu_dereference(dev->rx_handler_data))
 
@@ -433,7 +431,7 @@ static int do_vrf_add_slave(struct net_device *dev, struct 
net_device *port_dev)
if (ret < 0)
goto out_unregister;
 
-   port_dev->flags |= IFF_SLAVE;
+   port_dev->priv_flags |= IFF_L3MDEV_SLAVE;
__vrf_insert_slave(queue, slave);
cycle_netdev(port_dev);
 
@@ -448,7 +446,7 @@ static int do_vrf_add_slave(struct net_device *dev, struct 
net_device *port_dev)
 
 static int vrf_add_slave(struct net_device *dev, struct net_device *port_dev)
 {
-   if (netif_is_l3_master(port_dev) || vrf_is_slave(port_dev))
+   if (netif_is_l3_master(port_dev) || netif_is_l3_slave(port_dev))
return -EINVAL;
 
return do_vrf_add_slave(dev, port_dev);
@@ -462,7 +460,7 @@ static int do_vrf_del_slave(struct net_device *dev, struct 
net_device *port_dev)
struct slave *slave;
 
netdev_upper_dev_unlink(port_dev, dev);
-   port_dev->flags &= ~IFF_SLAVE;
+   port_dev->priv_flags &= ~IFF_L3MDEV_SLAVE;
 
netdev_rx_handler_unregister(port_dev);
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b9450784ae06..b3374402c1ea 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1261,6 +1261,7 @@ struct net_device_ops {
  * @IFF_L3MDEV_MASTER: device is an L3 master device
  * @IFF_NO_QUEUE: device can run without qdisc attached
  * @IFF_OPENVSWITCH: device is a Open vSwitch master
+ * @IFF_L3MDEV_SLAVE: device is enslaved to an L3 master device
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1286,6 +1287,7 @@ enum netdev_priv_flags {
IFF_L3MDEV_MASTER   = 1<<20,
IFF_NO_QUEUE    = 1<<21,
IFF_OPENVSWITCH = 1<<22,
+   IFF_L3MDEV_SLAVE    = 1<<23,
 };
 
 #define IFF_802_1Q_VLAN    IFF_802_1Q_VLAN
@@ -3830,6 +3832,11 @@ static inline bool netif_is_l3_master(const struct 
net_device *dev)
return dev->priv_flags & IFF_L3MDEV_MASTER;
 }
 
+static inline bool netif_is_l3_slave(const struct net_device *dev)
+{
+   return dev->priv_flags & IFF_L3MDEV_SLAVE;
+}
+
 static inline bool netif_is_bridge_master(const struct net_device *dev)
 {
return dev->priv_flags & IFF_EBRIDGE;
diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
index 9efc8dd1ac4d..f43fc0e816f7 100644
--- a/net/l3mdev/l3mdev.c
+++ b/net/l3mdev/l3mdev.c
@@ -15,11 +15,11 @@ int l3mdev_master_ifindex_rcu(struct net_device *dev)
 
if (netif_is_l3_master(dev)) {
ifindex = dev->ifindex;
-   } else if (dev->flags & IFF_SLAVE) {
+   } else if (netif_is_l3_slave(dev)) {
struct net_device *master;
 
master = netdev_master_upper_dev_get_rcu(dev);
-   if (master && netif_is_l3_master(master))
+   if (master)
ifindex = master->ifindex;
}
 
@@ -42,7 +42,7 @@ u32 l3mdev_fib_table_rcu(const struct net_device *dev)
if (netif_is_l3_master(dev)) {
if (dev->l3mdev_ops->l3mdev_fib_table)
tb_id = dev->l3mdev_ops->l3mdev_fib_table(dev);
-   } else if (dev->flags & IFF_SLAVE) {
+   } else if (netif_is_l3_slave(dev)) {
/* Users of netdev_master_upper_dev_get_rcu need non-const,
 * but current inet_*type functions take a const
 */
@@ -50,7 +50,7 @@ u32 l3mdev_fib_table_rcu(const struct net_device *dev)
const struct net_device *master;
 
master = netdev_master_upper_dev_get_rcu(_dev);
-   if (master && netif_is_l3_master(master) &&
+   if (master &&
master->l3mdev_ops->l3mdev_fib_table)
tb_id = master->l3mdev_ops->l3mdev_fib_table(master);
}
-- 
1.9.1
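
The effect of keying on a dedicated private flag rather than IFF_SLAVE
can be seen in a small userspace model.  This is a sketch under
assumptions: struct ndev and the helper names are hypothetical, only the
two bit positions are taken from the enum in the patch, and RCU is
ignored.

```c
#include <stddef.h>

/* Bit positions as listed in enum netdev_priv_flags in the patch. */
#define PRIV_L3MDEV_MASTER (1u << 20)
#define PRIV_L3MDEV_SLAVE  (1u << 23)

/* Hypothetical, minimal stand-in for struct net_device. */
struct ndev {
	unsigned int priv_flags;
	int ifindex;
	const struct ndev *master;	/* upper device, if any */
};

static int is_l3_master(const struct ndev *d)
{
	return !!(d->priv_flags & PRIV_L3MDEV_MASTER);
}

static int is_l3_slave(const struct ndev *d)
{
	return !!(d->priv_flags & PRIV_L3MDEV_SLAVE);
}

/* Model of l3mdev_master_ifindex_rcu(): 0 means "no L3 master". */
static int master_ifindex(const struct ndev *d)
{
	if (!d)
		return 0;
	if (is_l3_master(d))
		return d->ifindex;
	if (is_l3_slave(d) && d->master)
		return d->master->ifindex;
	return 0;
}

/* Model of l3mdev_fib_oif_rcu(): master index if any, else own index. */
static int fib_oif(const struct ndev *d)
{
	int m = master_ifindex(d);

	return m ? m : (d ? d->ifindex : 0);
}
```

A device that has an upper device but lacks the L3 slave flag (a bonding
port, for example) keeps its own ifindex for FIB lookups in this model.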


Re: unregister_netdevice warnings when deleting netns

2015-09-28 Thread Julian Anastasov


Hello,

On Mon, 28 Sep 2015, Anand Gurram wrote:

> I am currently using kernel version 3.16.7 on a linux switch.
> While creating and destroying network namespaces I am observing below logs
> on the console
> "unregister_netdevice: waiting for lo to become free. Usage count = 1"
> 
> Can you please suggest and provide instructions on how to debug this issue.
> If any fix already available can you please point me to the link.

There are two commits from Linux 4.2 that may help:

commit e9e4dd3267d0 ("net: do not process device backlog during unregistration")
commit 2c17d27c36dc ("net: call rcu_read_lock early in process_backlog")

For now I see them only in 3.2.71+ and 3.12.48+.
I think, they will appear in other stable versions too...

Regards

--
Julian Anastasov 

Re: [RFC PATCH] Fix false positives in can_checksum_protocol()

2015-09-28 Thread Tom Herbert

>>  Also, this doesn't help those drivers that can offload TCP and
>> UDP for IPv6 but only if there are no extension headers; in those
>> cases the driver needs to look at the packet to see if it is a
>> "simple" UDP/TCP packet.
>
> Hm, are such devices even permitted to set NETIF_F_IPV6_CSUM?
>
Apparently this may be a problem in ixgbe. See "[net-next 05/19]
ixgbe: Add support for UDP-encapsulated tx checksum offload" thread.

>> AFAIK, the only non UDP/TCP transport IP checksum in the stack is GRE
>> checksum which as I pointed out we don't attempt to offload. So the
>> only way to trip the bug that you are seeing is probably through a
>> userspace packet interface like in the test code. I think this
>> actually might expose a much more serious issue. Looking at tun.c, I
>> don't see anything that validates that the csum_start and csum_offset
>> provided by userspace actually refers to a sane checksum offset.
>
> That's handled in skb_partial_csum_set().
>
That only checks that start and offset are within skb_headlen. It
doesn't check whether the checksum offset refers to a TCP/UDP/GRE/ICMP
checksum or to the first two bytes of the IP destination
address. Maybe there's something later in the path that would catch
this, but I didn't readily see it.

>> Not only is this a way to ask the stack to perform checksums for non
>> TCP/UDP, but it actually seems like the interface could be used by a
>> malicious application to have a device arbitrarily overwrite two
>> bytes anywhere in the packet with its own data far below the stack,
>> netfilter, routing. To really fix this we should probably be doing
>> validation in tun, if the checksum isn't for TCP or UDP then call
>> skb_checksum_help before sending the packet into the stack.
>
> So... if it's never valid to ask for a hardware checksum on anything
> but TCP or UDP, why do we bother with NETIF_F_GEN_CSUM at all? Should
> we just be removing it entirely? That seems like something of a
> retrograde step.
>
No, we want to do the opposite! In your example the request to
checksum is being generated from outside the stack so we need to
verify that for sanity-- requests generated by the stack would be
trusted. Presumably, within the stack we want a generic checksum
offload for new protocols, new extension headers (I am almost certain
that segment routing exthdr would break some NETIF_F_IPV6_CSUM), and
new flavors of encapsulation. NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM
are not generic and are becoming impediments to protocol development--
drivers moving to NETIF_F_HW_CSUM is the answer.

> Perhaps a better solution would be a bit in the skbuff which indicates
> that it *is* a TCP or UDP checksum. That would be set by our UDP and
> TCP sockets, cleared by encapsulation, also set if appropriate by
> skb_partial_csum_set().
>
Yes I agree. What I have been thinking to do is steal two bits from
csum_offset that would indicate that the checksum is IPv4 or IPv6
(specifically that the checksum value is seeded with an IPv4 or IPv6
pseudo header). This information plus the csum_offset would be
sufficient for drivers to identify the checksum as UDP/TCP-IPv4/IPv6.
The other case that needs special handling is inner vs. outer
checksum, but that can be deduced by comparing (inner or outer)
transport offset to checksum start. With this and a couple of utility
functions we should be able to start deprecating NETIF_F_IP_CSUM and
NETIF_F_IPV6_CSUM.
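
The idea of stealing two bits from csum_offset can be sketched in userspace
as below. This is only an illustration under stated assumptions: the flag
values, the mask, and the function names are hypothetical and do not reflect
any actual sk_buff layout. The classification reports the checksum *shape*
(pseudo-header family plus field offset), which is all that the offset and
the two bits can convey.

```c
#include <stdint.h>

/* Hypothetical encoding: top two bits of a 16-bit csum_offset are flags. */
#define CSUM_PSEUDO_V4   0x8000u /* checksum seeded with IPv4 pseudo header */
#define CSUM_PSEUDO_V6   0x4000u /* checksum seeded with IPv6 pseudo header */
#define CSUM_OFFSET_MASK 0x3fffu /* remaining bits hold the real offset */

enum l4_kind { L4_UNKNOWN, L4_TCP_V4, L4_UDP_V4, L4_TCP_V6, L4_UDP_V6 };

/* Pack an offset and one pseudo-header flag into a single 16-bit value. */
static uint16_t csum_encode(uint16_t offset, uint16_t pseudo_flag)
{
	return (uint16_t)((offset & CSUM_OFFSET_MASK) | pseudo_flag);
}

/* Deduce the checksum shape from the flags and the field offset. */
static enum l4_kind csum_classify(uint16_t enc)
{
	uint16_t off = enc & CSUM_OFFSET_MASK;
	int v4 = (enc & CSUM_PSEUDO_V4) != 0;
	int v6 = (enc & CSUM_PSEUDO_V6) != 0;

	if (v4 == v6) /* neither flag set, or both: not classifiable */
		return L4_UNKNOWN;
	if (off == 16) /* TCP checksum field offset */
		return v4 ? L4_TCP_V4 : L4_TCP_V6;
	if (off == 6)  /* UDP checksum field offset */
		return v4 ? L4_UDP_V4 : L4_UDP_V6;
	return L4_UNKNOWN;
}
```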

Thanks,
Tom

> And then the check in can_checksum_protocol() is trivial and clearly
> correct.
>
> --
> David Woodhouse                            Open Source Technology Centre
> david.woodho...@intel.com                              Intel Corporation
>

Re: [RFC PATCH] Fix false positives in can_checksum_protocol()

2015-09-28 Thread David Woodhouse

On Mon, 2015-09-28 at 12:13 -0700, Tom Herbert wrote:
> 
> > Perhaps a better solution would be a bit in the skbuff which indicates
> > that it *is* a TCP or UDP checksum. That would be set by our UDP and
> > TCP sockets, cleared by encapsulation, also set if appropriate by
> > skb_partial_csum_set().
> >
> Yes I agree. What I have been thinking to do is steal two bits from
> csum_offset that would indicate that the checksum is IPv4 or IPv6
> (specifically that the checksum value is seeded with an IPv4 or IPv6
> pseudo header). This information plus the csum_offset would be
> sufficient for drivers to identify the checksum as UDP/TCP-IPv4/IPv6.

> The other case that needs special handling is inner vs. outer
> checksum, but that can be deduced by comparing (inner or outer)
> transport offset to checksum start. With this and a couple of utility
> functions we should be able to start deprecating NETIF_F_IP_CSUM and
> NETIF_F_IPV6_CSUM.

You mean drivers which currently set NETIF_F_IP_CSUM would need to
provide a .ndo_features_check() which tolerates only the packets they
can actually handle? And we'd just ensure that the bits are there for
them to use, in the skbuff? That seems reasonable.

Note that 'seeded with an IPv[46] pseudo header' isn't quite
sufficient. Some hardware like 8139cp is explicitly told to do a UDP or
a TCP checksum with a bit in the descriptor, so any UDP-like or
TCP-like checksum works out fine.

Other hardware works out whether to do a UDP or a TCP checksum for
*itself*, so it *can't* cope with other protocols which just happen to
look the same. For those it really *must* be IPPROTO_TCP or IPPROTO_UDP
and they're going to be looking in the IP header for it.

I do suspect we'll want a bit which says it's *actually* TCP or UDP,
not just 'seeded with a pseudo-header'. That's the important
distinction for NETIF_F_IP_CSUM vs. NETIF_F_HW_CSUM.

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation


[net-next 4/5] i40e: Fix RS bit update in Tx path and disable force WB workaround

2015-09-28 Thread Jeff Kirsher

From: Anjali Singhai 

This patch fixes the issue of forcing WB too often causing us to not
benefit from NAPI.

Without this patch we were forcing WB/arming interrupt too often taking
away the benefits of NAPI and causing a performance impact.

With this patch we disable force WB in the clean routine for X710
and XL710 adapters. X722 adapters do not enable an interrupt to force
a WB; they benefit from WB_ON_ITR, so force WB is left enabled
for those adapters.
For XL710 and X710 adapters, if fewer than 4 packets are pending,
a software interrupt triggered from the service task will force a WB.

This patch also changes the conditions for setting RS bit as described
in code comments. This optimizes when the HW does a tail bump and when
it does a WB. It also optimizes when we do a wmb.
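
The "fewer than 4 pending" condition can be sketched in isolation as below.
This is a simplified userspace model: the function name needs_forced_wb is
hypothetical, WB_STRIDE is taken as 0x3 as stated in the driver comment, and
the checks for the down state and for an empty ring are deliberately left out.

```c
#include <stdbool.h>

#define WB_STRIDE 0x3 /* 4 descriptors per 64B cacheline, minus one */

/*
 * Return true when a forced writeback would be armed: we are polling
 * (budget != 0) and between 1 and WB_STRIDE descriptors are pending,
 * i.e. fewer than a full cacheline's worth.
 */
static bool needs_forced_wb(unsigned int pending, int budget)
{
	return budget != 0 &&
	       pending != 0 &&
	       (pending / (WB_STRIDE + 1)) == 0;
}
```

Integer division by WB_STRIDE + 1 is zero exactly when pending is below 4, so
the expression is equivalent to 0 < pending < 4.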

Signed-off-by: Anjali Singhai Jain 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 129 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   2 +
 2 files changed, 87 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index d699fc9..3ce4900 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -726,17 +726,22 @@ static bool i40e_clean_tx_irq(struct i40e_ring *tx_ring, int budget)
tx_ring->q_vector->tx.total_bytes += total_bytes;
tx_ring->q_vector->tx.total_packets += total_packets;
 
-   /* check to see if there are any non-cache aligned descriptors
-* waiting to be written back, and kick the hardware to force
-* them to be written back in case of napi polling
-*/
-   if (budget &&
-   !((i & WB_STRIDE) == WB_STRIDE) &&
-   !test_bit(__I40E_DOWN, &tx_ring->vsi->state) &&
-   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-   tx_ring->arm_wb = true;
-   else
-   tx_ring->arm_wb = false;
+   if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+   unsigned int j = 0;
+
+   /* check to see if there are < 4 descriptors
+* waiting to be written back, then kick the hardware to force
+* them to be written back in case we stay in NAPI.
+* In this mode on X722 we do not enable Interrupt.
+*/
+   j = i40e_get_tx_pending(tx_ring);
+
+   if (budget &&
+   ((j / (WB_STRIDE + 1)) == 0) && (j != 0) &&
+   !test_bit(__I40E_DOWN, &tx_ring->vsi->state) &&
+   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+   tx_ring->arm_wb = true;
+   }
 
netdev_tx_completed_queue(netdev_get_tx_queue(tx_ring->netdev,
  tx_ring->queue_index),
@@ -2500,6 +2505,9 @@ static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
u32 td_tag = 0;
dma_addr_t dma;
u16 gso_segs;
+   u16 desc_count = 0;
+   bool tail_bump = true;
+   bool do_rs = false;
 
if (tx_flags & I40E_TX_FLAGS_HW_VLAN) {
td_cmd |= I40E_TX_DESC_CMD_IL2TAG1;
@@ -2540,6 +2548,8 @@ static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
tx_desc++;
i++;
+   desc_count++;
+
if (i == tx_ring->count) {
tx_desc = I40E_TX_DESC(tx_ring, 0);
i = 0;
@@ -2559,6 +2569,8 @@ static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
tx_desc++;
i++;
+   desc_count++;
+
if (i == tx_ring->count) {
tx_desc = I40E_TX_DESC(tx_ring, 0);
i = 0;
@@ -2573,34 +2585,6 @@ static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
tx_bi = &tx_ring->tx_bi[i];
}
 
-   /* Place RS bit on last descriptor of any packet that spans across the
-* 4th descriptor (WB_STRIDE aka 0x3) in a 64B cacheline.
-*/
-   if (((i & WB_STRIDE) != WB_STRIDE) &&
-   (first <= &tx_ring->tx_bi[i]) &&
-   (first >= &tx_ring->tx_bi[i & ~WB_STRIDE])) {
-   tx_desc->cmd_type_offset_bsz =
-   build_ctob(td_cmd, td_offset, size, td_tag) |
-   cpu_to_le64((u64)I40E_TX_DESC_CMD_EOP <<
-I40E_TXD_QW1_CMD_SHIFT);
-   } else {
-   tx_desc->cmd_type_offset_bsz =
-   build_ctob(td_cmd, td_offset, size, td_tag) |
-   cpu_to_le64((u64)I40E_TXD_CMD <<
-

[net-next 2/5] i40e/i40evf: refactor tx timeout logic

2015-09-28 Thread Jeff Kirsher

From: Kiran Patil 

This patch modifies the driver timeout logic by issuing a writeback
request via a software interrupt to the hardware the first time the
driver detects a hang. The driver was too aggressive in resetting a hung
queue, so back that off by removing logic to down the netdevice after
too many hangs, and move the function to the service task.
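
The hung-queue search relies on a wraparound-safe tick comparison. The sketch
below models that in userspace under stated assumptions: tick_after stands in
for the kernel's time_after() on a 32-bit counter, and struct txq and
find_hung_queue are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if tick a is later than tick b, even across counter wraparound. */
static bool tick_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Hypothetical per-queue state: stopped flag and last transmit tick. */
struct txq {
	bool stopped;
	uint32_t trans_start; /* 0 means "use the device-wide value" */
};

/*
 * Return the index of the first stopped queue whose last transmit is
 * older than the watchdog timeout, or -1 if no queue qualifies.
 */
static int find_hung_queue(const struct txq *q, int n, uint32_t now,
			   uint32_t dev_trans_start, uint32_t timeout)
{
	for (int i = 0; i < n; i++) {
		uint32_t start = q[i].trans_start ? q[i].trans_start
						  : dev_trans_start;

		if (q[i].stopped && tick_after(now, start + timeout))
			return i;
	}
	return -1;
}
```

The unsigned subtraction followed by a signed cast is what keeps the
comparison correct when the tick counter wraps.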

Change-ID: Ife100b9d124cd08cbdb81ab659008c1b9abbedea
Signed-off-by: Kiran Patil 
Signed-off-by: Shannon Nelson 
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h|   1 -
 drivers/net/ethernet/intel/i40e/i40e_main.c   | 272 +++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  75 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  10 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  94 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   8 -
 6 files changed, 173 insertions(+), 287 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index e746279..4797269 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -243,7 +243,6 @@ struct i40e_pf {
struct pci_dev *pdev;
struct i40e_hw hw;
unsigned long state;
-   unsigned long link_check_timeout;
struct msix_entry *msix_entries;
bool fc_autoneg_status;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 530d8b6..c246dca 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -299,25 +299,69 @@ static void i40e_tx_timeout(struct net_device *netdev)
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
struct i40e_pf *pf = vsi->back;
+   struct i40e_ring *tx_ring = NULL;
+   unsigned int i, hung_queue = 0;
+   u32 head, val;
 
pf->tx_timeout_count++;
 
+   /* find the stopped queue the same way the stack does */
+   for (i = 0; i < netdev->num_tx_queues; i++) {
+   struct netdev_queue *q;
+   unsigned long trans_start;
+
+   q = netdev_get_tx_queue(netdev, i);
+   trans_start = q->trans_start ? : netdev->trans_start;
+   if (netif_xmit_stopped(q) &&
+   time_after(jiffies,
+  (trans_start + netdev->watchdog_timeo))) {
+   hung_queue = i;
+   break;
+   }
+   }
+
+   if (i == netdev->num_tx_queues) {
+   netdev_info(netdev, "tx_timeout: no netdev hung queue found\n");
+   } else {
+   /* now that we have an index, find the tx_ring struct */
+   for (i = 0; i < vsi->num_queue_pairs; i++) {
+   if (vsi->tx_rings[i] && vsi->tx_rings[i]->desc) {
+   if (hung_queue ==
+   vsi->tx_rings[i]->queue_index) {
+   tx_ring = vsi->tx_rings[i];
+   break;
+   }
+   }
+   }
+   }
+
if (time_after(jiffies, (pf->tx_timeout_last_recovery + HZ*20)))
-   pf->tx_timeout_recovery_level = 1;
+   pf->tx_timeout_recovery_level = 1;  /* reset after some time */
+   else if (time_before(jiffies,
+ (pf->tx_timeout_last_recovery + netdev->watchdog_timeo)))
+   return;   /* don't do any new action before the next timeout */
+
+   if (tx_ring) {
+   head = i40e_get_head(tx_ring);
+   /* Read interrupt register */
+   if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+   val = rd32(&pf->hw,
+I40E_PFINT_DYN_CTLN(tx_ring->q_vector->v_idx +
+   tx_ring->vsi->base_vector - 1));
+   else
+   val = rd32(&pf->hw, I40E_PFINT_DYN_CTL0);
+
+   netdev_info(netdev, "tx_timeout: VSI_seid: %d, Q %d, NTC: 0x%x, HWB: 0x%x, NTU: 0x%x, TAIL: 0x%x, INT: 0x%x\n",
+   vsi->seid, hung_queue, tx_ring->next_to_clean,
+   head, tx_ring->next_to_use,
+   readl(tx_ring->tail), val);
+   }
+
pf->tx_timeout_last_recovery = jiffies;
-   netdev_info(netdev, "tx_timeout recovery level %d\n",
-   pf->tx_timeout_recovery_level);
+   netdev_info(netdev, "tx_timeout recovery level %d, hung_queue %d\n",
+   pf->tx_timeout_recovery_level, hung_queue);
 
switch (pf->tx_timeout_recovery_level) {
-   case

[net-next 1/5] i40e: Move i40e_get_head into header file

2015-09-28 Thread Jeff Kirsher

From: Kiran Patil 

i40e_get_head needs to be called in multiple files in a further patch,
prepare by moving the function into a header file.
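
For readers unfamiliar with head writeback, the layout that i40e_get_head()
depends on can be modelled in userspace as below. This is a sketch under
stated assumptions: the struct and function names are hypothetical, the
descriptor is reduced to two 64-bit words, host byte order is used instead of
le32_to_cpu(), and ring_hw_write_head() merely stands in for the device's DMA
write.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified 16-byte Tx descriptor. */
struct tx_desc {
	uint64_t buffer_addr;
	uint64_t cmd_type_offset_bsz;
};

struct ring {
	struct tx_desc *desc; /* count descriptors, then one u32 head slot */
	uint16_t count;
};

/* Allocate the descriptors plus the trailing head-writeback word. */
static int ring_alloc(struct ring *r, uint16_t count)
{
	r->desc = calloc(1, (size_t)count * sizeof(struct tx_desc) +
			    sizeof(uint32_t));
	r->count = count;
	return r->desc ? 0 : -1;
}

/* Read the head value stored just past the last descriptor. */
static uint32_t ring_get_head(const struct ring *r)
{
	uint32_t head;

	memcpy(&head, r->desc + r->count, sizeof(head));
	return head;
}

/* Stand-in for the hardware writing back its current head index. */
static void ring_hw_write_head(struct ring *r, uint32_t head)
{
	memcpy(r->desc + r->count, &head, sizeof(head));
}
```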

Signed-off-by: Kiran Patil 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 13 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.h | 14 ++
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 738aca6..9a800f9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -600,19 +600,6 @@ void i40e_free_tx_resources(struct i40e_ring *tx_ring)
}
 }
 
-/**
- * i40e_get_head - Retrieve head from head writeback
- * @tx_ring:  tx ring to fetch head of
- *
- * Returns value of Tx ring head based on value stored
- * in head write-back location
- **/
-static inline u32 i40e_get_head(struct i40e_ring *tx_ring)
-{
-   void *head = (struct i40e_tx_desc *)tx_ring->desc + tx_ring->count;
-
-   return le32_to_cpu(*(volatile __le32 *)head);
-}
 
 /**
  * i40e_get_tx_pending - how many tx descriptors not processed
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index f1385a1..51487ef 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -326,4 +326,18 @@ int i40e_xmit_descriptor_count(struct sk_buff *skb, struct i40e_ring *tx_ring);
 int i40e_tx_prepare_vlan_flags(struct sk_buff *skb,
   struct i40e_ring *tx_ring, u32 *flags);
 #endif
+
+/**
+ * i40e_get_head - Retrieve head from head writeback
+ * @tx_ring:  tx ring to fetch head of
+ *
+ * Returns value of Tx ring head based on value stored
+ * in head write-back location
+ **/
+static inline u32 i40e_get_head(struct i40e_ring *tx_ring)
+{
+   void *head = (struct i40e_tx_desc *)tx_ring->desc + tx_ring->count;
+
+   return le32_to_cpu(*(volatile __le32 *)head);
+}
 #endif /* _I40E_TXRX_H_ */
-- 
2.4.3


[net-next 5/5] igb: assume MSI-X interrupts during initialization

2015-09-28 Thread Jeff Kirsher

From: Stefan Assmann 

In igb_sw_init() the sequence of calls was changed from
igb_init_queue_configuration()
igb_init_interrupt_scheme()
igb_probe_vfs()
to
igb_probe_vfs()
igb_init_queue_configuration()
igb_init_interrupt_scheme()

This results in adapter->flags not having the IGB_FLAG_HAS_MSIX bit set
during igb_probe_vfs()->igb_enable_sriov(). Therefore SR-IOV does not
get enabled properly and we run into a NULL pointer if the max_vfs
module parameter is specified (adapter->vf_data does not get allocated,
crash on accessing the structure).

[7.419348] BUG: unable to handle kernel NULL pointer dereference at 0048
[7.419367] IP: [] igb_reset+0xe6/0x5d0 [igb]
[7.419370] PGD 0
[7.419373] Oops: 0002 [#1] SMP
[7.419381] Modules linked in: ahci(+) libahci igb(+) i40e(+) vxlan ip6_udp_tunnel udp_tunnel megaraid_sas(+) ixgbe(+) mdio
[7.419385] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.2.0+ #153
[7.419387] Hardware name: Dell Inc. PowerEdge R720/0C4Y3R, BIOS 1.6.0 03/07/2013
[...]
[7.419431] Call Trace:
[7.419442]  [] igb_probe+0x8b6/0x1340 [igb]
[7.419447]  [] local_pci_probe+0x45/0xa0

Prevent this by setting the IGB_FLAG_HAS_MSIX bit before calling
igb_probe_vfs(). The real interrupt capabilities will be checked during
igb_init_interrupt_scheme() so this is safe to do.
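
The ordering problem can be modelled with a small userspace sketch. All names
here (struct adapter, probe_vfs, sw_init, and so on) are hypothetical
stand-ins for the igb functions mentioned above, and the model only captures
the flag dependency described in this message, not the real driver logic.

```c
#include <stdbool.h>
#include <stdlib.h>

#define FLAG_HAS_MSIX 0x1u

struct adapter {
	unsigned int flags;
	unsigned int max_vfs; /* requested via module parameter */
	int *vf_data;         /* per-VF state, NULL until SR-IOV is enabled */
};

/* SR-IOV is only set up when MSI-X is believed to be available. */
static void probe_vfs(struct adapter *a)
{
	if (!(a->flags & FLAG_HAS_MSIX) || a->max_vfs == 0)
		return;
	a->vf_data = calloc(a->max_vfs, sizeof(*a->vf_data));
}

/* The real capability check happens here and corrects the flag. */
static void init_interrupt_scheme(struct adapter *a, bool msix_available)
{
	if (msix_available)
		a->flags |= FLAG_HAS_MSIX;
	else
		a->flags &= ~FLAG_HAS_MSIX;
}

/* A later reset touches vf_data whenever VFs were requested. */
static bool reset_is_safe(const struct adapter *a)
{
	return a->max_vfs == 0 || a->vf_data != NULL;
}

/* Model of the new call order, with and without the up-front assumption. */
static bool sw_init(struct adapter *a, bool assume_msix)
{
	if (assume_msix)
		a->flags |= FLAG_HAS_MSIX;
	probe_vfs(a);
	init_interrupt_scheme(a, true);
	return reset_is_safe(a);
}
```

Without the up-front assumption, probe_vfs() runs before the flag is set and
vf_data stays NULL, which is the state that leads to the oops shown above.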

Signed-off-by: Stefan Assmann 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index e174fbb..ba019fc 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2986,6 +2986,9 @@ static int igb_sw_init(struct igb_adapter *adapter)
}
 #endif /* CONFIG_PCI_IOV */
 
+   /* Assume MSI-X interrupts, will be checked during IRQ allocation */
+   adapter->flags |= IGB_FLAG_HAS_MSIX;
+
igb_probe_vfs(adapter);
 
igb_init_queue_configuration(adapter);
-- 
2.4.3


[net-next 0/5][pull request] Intel Wired LAN Driver Updates 2015-09-28

2015-09-28 Thread Jeff Kirsher

This series contains updates to i40e, i40evf and igb to resolve issues
seen and reported by Red Hat.

Kiran moves i40e_get_head() in preparation for the refactor of the Tx
timeout logic, so that it can be used in other areas of the driver.
Kiran also refactors the driver timeout logic to issue a writeback
request via a software interrupt to the hardware the first time the
driver detects a hang, because the driver was too aggressive in
resetting a hung queue.

Shannon adds the GRE protocol to the transmit checksum encoding.

Anjali fixes an issue of forcing writeback too often, which caused us to
not benefit from NAPI.  We now disable force writeback in the clean
routine for X710 and XL710 adapters.  The X722 adapters do not enable
interrupt to force a writeback and benefit from WB_ON_ITR and so force
WB is left enabled for those adapters.

Stefan Assmann provides a fix for igb where SR-IOV was not getting
enabled properly and we ran into a NULL pointer if the max_vfs module
parameter is specified.  This is prevented by setting the
IGB_FLAG_HAS_MSIX bit before calling igb_probe_vfs().

The following are changes since commit 34c2d9fb0498c066afbe610b15e18995fd8be792:
  bridge: Allow forward delay to be cfgd when STP enabled
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Anjali Singhai (1):
  i40e: Fix RS bit update in Tx path and disable force WB workaround

Kiran Patil (2):
  i40e: Move i40e_get_head into header file
  i40e/i40evf: refactor tx timeout logic

Shannon Nelson (1):
  i40e: add GRE tunnel type to csum encoding

Stefan Assmann (1):
  igb: assume MSI-X interrupts during initialization

 drivers/net/ethernet/intel/i40e/i40e.h|   1 -
 drivers/net/ethernet/intel/i40e/i40e_main.c   | 272 +++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 214 
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  26 ++-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  94 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   8 -
 drivers/net/ethernet/intel/igb/igb_main.c |   3 +
 7 files changed, 277 insertions(+), 341 deletions(-)

-- 
2.4.3


[net-next 3/5] i40e: add GRE tunnel type to csum encoding

2015-09-28 Thread Jeff Kirsher

From: Shannon Nelson 

Make sure the Tx checksum encoder knows about GRE protocol and sets the
descriptor flag appropriately.

Signed-off-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 4f9ff89..d699fc9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2240,6 +2240,9 @@ static void i40e_tx_enable_csum(struct sk_buff *skb, u32 *tx_flags,
l4_tunnel = I40E_TXD_CTX_UDP_TUNNELING;
*tx_flags |= I40E_TX_FLAGS_VXLAN_TUNNEL;
break;
+   case IPPROTO_GRE:
+   l4_tunnel = I40E_TXD_CTX_GRE_TUNNELING;
+   break;
default:
return;
}
-- 
2.4.3


[PATCH 3/6] net/sched: make sch_blackhole.c explicitly non-modular

2015-09-28 Thread Paul Gortmaker

The Kconfig currently controlling compilation of this code is:

net/sched/Kconfig:menuconfig NET_SCHED
net/sched/Kconfig:  bool "QoS and/or fair queueing"

...meaning that it currently is not being built as a module by anyone.

Let's remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.  We can
change to one of the other priority initcalls (subsys?) at any later
date, if desired.

We also delete the MODULE_LICENSE tag since all that information
is already contained at the top of the file in the comments.

Cc: Jamal Hadi Salim 
Cc: "David S. Miller" 
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker 
---
 net/sched/sch_blackhole.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_blackhole.c b/net/sched/sch_blackhole.c
index 094a874b48bc..3fee70d9814f 100644
--- a/net/sched/sch_blackhole.c
+++ b/net/sched/sch_blackhole.c
@@ -11,7 +11,7 @@
  * Note: Quantum tunneling is not supported.
  */
 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -37,17 +37,8 @@ static struct Qdisc_ops blackhole_qdisc_ops __read_mostly = {
.owner  = THIS_MODULE,
 };
 
-static int __init blackhole_module_init(void)
+static int __init blackhole_init(void)
 {
return register_qdisc(&blackhole_qdisc_ops);
 }
-
-static void __exit blackhole_module_exit(void)
-{
-   unregister_qdisc(&blackhole_qdisc_ops);
-}
-
-module_init(blackhole_module_init)
-module_exit(blackhole_module_exit)
-
-MODULE_LICENSE("GPL");
+device_initcall(blackhole_init)
-- 
2.6.0.rc3


Re: [RFC PATCH] Fix false positives in can_checksum_protocol()

2015-09-28 Thread Tom Herbert

On Fri, Sep 25, 2015 at 5:55 AM, David Woodhouse  wrote:
> The check in harmonize_features() is supposed to match the skb against
> the features of the device in question, and prevent us from handing a
> skb to a device which can't handle it.
>
> It doesn't work correctly. A device with NETIF_F_IP_CSUM or
> NETIF_F_IPV6_CSUM capabilities is only required to checksum TCP or UDP,
> on Legacy IP and IPv6 respectively. But the existing check will allow
> us to pass it *any* ETH_P_IP/ETH_P_IPV6 packets for hardware checksum
> offload.
>
> Depending on the driver in use, this leads to a BUG, a WARNING, or just
> silent data corruption.
>
> This is one approach to fixing that, and my test program at
> http://bombadil.infradead.org/~dwmw2/raw.c can no longer trivially
> reproduce the problem.
>
> The test does now have false *negatives*, but those shouldn't happen
> for locally-generated packets; only packets injected from af_packet,
> tun, virtio_net and other places that allow us to inject
> CHECKSUM_PARTIAL packets in order to make use of hardware offload
> features. And false negatives aren't anywhere near as much of a problem
> as false positives are — we just finish the checksum in software and
> send the packet anyway.
>
> It would be possible to fix those false negatives, if we really wanted
> to — perhaps by adding an additional bit in the skbuff which indicates
> that it *is* a TCP or UDP packet, rather than using ->sk->sk_protocol.
> Then that bit could be set if appropriate in skb_partial_csum_set(), as
> well as the places where we locally generate such packets. And the
> check in can_checksum_protocol() would just check for that bit.
>
> Signed-off-by: David Woodhouse 
> ---
> Since can_checksum_protocol is inline, the compiler ought to know that
> it doesn't even need to dereference skb->sk in the case where the
> device has the NETIF_F_GEN_CSUM feature. So the additional check should
> not slow down the (hopefully) common case in the fast path.
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 2d15e38..76c8330 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3628,15 +3628,23 @@ struct sk_buff *skb_gso_segment(struct sk_buff *skb, netdev_features_t features)
>  __be16 skb_network_protocol(struct sk_buff *skb, int *depth);
>
>  static inline bool can_checksum_protocol(netdev_features_t features,
> -__be16 protocol)
> -{
> -   return ((features & NETIF_F_GEN_CSUM) ||
> -   ((features & NETIF_F_V4_CSUM) &&
> -protocol == htons(ETH_P_IP)) ||
> -   ((features & NETIF_F_V6_CSUM) &&
> -protocol == htons(ETH_P_IPV6)) ||
> -   ((features & NETIF_F_FCOE_CRC) &&
> -protocol == htons(ETH_P_FCOE)));
> +__be16 protocol, u8 sk_protocol)
> +{
> +   if ((features & NETIF_F_GEN_CSUM) ||
> +   ((features & NETIF_F_FCOE_CRC) && protocol == htons(ETH_P_FCOE)))
> +   return 1;
> +
> +   /* NETIF_F_V[46]_CSUM are defined to work only on TCP and UDP.
> +* That is, when it needs to start checksumming at the transport
> +* header, and place the result at an offset of either 6 (for UDP)
> +* or 16 (for TCP).
> +*/
> +   if ((((features & NETIF_F_V4_CSUM) && protocol == htons(ETH_P_IP)) ||
> +((features & NETIF_F_V6_CSUM) && protocol == htons(ETH_P_IPV6))) &&
> +   (sk_protocol == IPPROTO_TCP || sk_protocol == IPPROTO_UDP))
> +   return 1;
> +
Relying on skb->sk->sk_protocol is problematic. This is making the
assumption that the checksum being offloaded for the packet is the
same as that of the protocol for the socket-- this may not be the case
when we are offloading an outer checksum in encapsulation. Currently
this wouldn't be a problem since we're probably only offloading outer
UDP checksums, but if we ever start trying to offload other outer
checksums (e.g. GRE) then this probably doesn't work so well. Also,
this doesn't help those drivers that can offload TCP and UDP for
IPv6 but only if there are no extension headers; in those cases the
driver needs to look at the packet to see if it is a "simple" UDP/TCP
packet.
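
For reference, the check proposed in the quoted patch can be modelled in
userspace as below. This is a sketch, not the kernel function: the F_* bit
values and the can_checksum name are hypothetical, ethertypes are compared in
host order rather than via htons(), and l4_proto is simply passed in, whereas
the patch takes it from the socket, which is the assumption questioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits for this sketch only. */
#define F_GEN_CSUM (1u << 0)
#define F_V4_CSUM  (1u << 1)
#define F_V6_CSUM  (1u << 2)
#define F_FCOE_CRC (1u << 3)

#define ETHERTYPE_IPV4 0x0800
#define ETHERTYPE_IPV6 0x86DD
#define ETHERTYPE_FCOE 0x8906

#define PROTO_TCP 6
#define PROTO_UDP 17
#define PROTO_GRE 47

static bool can_checksum(uint32_t features, uint16_t ethertype,
			 uint8_t l4_proto)
{
	/* Generic offload and FCoE CRC do not depend on the L4 protocol. */
	if ((features & F_GEN_CSUM) ||
	    ((features & F_FCOE_CRC) && ethertype == ETHERTYPE_FCOE))
		return true;

	/* The IPv4/IPv6-specific features only cover TCP and UDP. */
	if ((((features & F_V4_CSUM) && ethertype == ETHERTYPE_IPV4) ||
	     ((features & F_V6_CSUM) && ethertype == ETHERTYPE_IPV6)) &&
	    (l4_proto == PROTO_TCP || l4_proto == PROTO_UDP))
		return true;

	return false;
}
```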

AFAIK, the only non UDP/TCP transport IP checksum in the stack is GRE
checksum which as I pointed out we don't attempt to offload. So the
only way to trip the bug that you are seeing is probably through a
userspace packet interface like in the test code. I think this
actually might expose a much more serious issue. Looking at tun.c, I
don't see anything that validates that the csum_start and csum_offset
provided by userspace actually refers to a sane checksum offset. Not
only is this a way to ask the stack to perform checksums for non
TCP/UDP, but it actually seems like the interface could be used by a
malicious application to have a

[PATCH net-next 09/11] net: Move netif_index_is_l3_master to l3mdev.h

2015-09-28 Thread David Ahern

Change CONFIG dependency to CONFIG_NET_L3_MASTER_DEV as well.

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h | 21 -
 include/net/l3mdev.h  | 24 
 include/net/route.h   |  1 +
 3 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 72bf9e37a2f0..b9450784ae06 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3840,27 +3840,6 @@ static inline bool netif_is_ovs_master(const struct net_device *dev)
return dev->priv_flags & IFF_OPENVSWITCH;
 }
 
-static inline bool netif_index_is_l3_master(struct net *net, int ifindex)
-{
-   bool rc = false;
-
-#if IS_ENABLED(CONFIG_NET_VRF)
-   struct net_device *dev;
-
-   if (ifindex == 0)
-   return false;
-
-   rcu_read_lock();
-
-   dev = dev_get_by_index_rcu(net, ifindex);
-   if (dev)
-   rc = netif_is_l3_master(dev);
-
-   rcu_read_unlock();
-#endif
-   return rc;
-}
-
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
index cd44d24eac57..3353a0d7cae4 100644
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -83,6 +83,25 @@ static inline struct rtable *l3mdev_get_rtable(const struct net_device *dev,
return NULL;
 }
 
+static inline bool netif_index_is_l3_master(struct net *net, int ifindex)
+{
+   struct net_device *dev;
+   bool rc = false;
+
+   if (ifindex == 0)
+   return false;
+
+   rcu_read_lock();
+
+   dev = dev_get_by_index_rcu(net, ifindex);
+   if (dev)
+   rc = netif_is_l3_master(dev);
+
+   rcu_read_unlock();
+
+   return rc;
+}
+
 #else
 
 static inline int l3mdev_master_ifindex_rcu(struct net_device *dev)
@@ -122,6 +141,11 @@ static inline struct rtable *l3mdev_get_rtable(const struct net_device *dev,
return NULL;
 }
 
+static inline bool netif_index_is_l3_master(struct net *net, int ifindex)
+{
+   return false;
+}
+
 #endif
 
 #endif /* _NET_L3MDEV_H_ */
diff --git a/include/net/route.h b/include/net/route.h
index a565d0dad12c..e211dc167db1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
1.9.1


[PATCH net-next 01/11] net: Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER

2015-09-28 Thread David Ahern

Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the
netif_is_vrf and netif_index_is_vrf macros.

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c |  6 +++---
 include/linux/netdevice.h | 14 +++---
 include/net/route.h   |  2 +-
 include/net/vrf.h |  4 ++--
 net/ipv4/ip_output.c  |  2 +-
 net/ipv4/route.c  |  2 +-
 net/ipv4/udp.c|  2 +-
 7 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 4ecb3a3e516a..2d7418e0b908 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -438,7 +438,7 @@ static int do_vrf_add_slave(struct net_device *dev, struct net_device *port_dev)
 
 static int vrf_add_slave(struct net_device *dev, struct net_device *port_dev)
 {
-   if (netif_is_vrf(port_dev) || vrf_is_slave(port_dev))
+   if (netif_is_l3_master(port_dev) || vrf_is_slave(port_dev))
return -EINVAL;
 
return do_vrf_add_slave(dev, port_dev);
@@ -591,7 +591,7 @@ static int vrf_newlink(struct net *src_net, struct 
net_device *dev,
 
vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
 
-   dev->priv_flags |= IFF_VRF_MASTER;
+   dev->priv_flags |= IFF_L3MDEV_MASTER;
 
err = -ENOMEM;
vrf_ptr = kmalloc(sizeof(*dev->vrf_ptr), GFP_KERNEL);
@@ -657,7 +657,7 @@ static int vrf_device_event(struct notifier_block *unused,
struct net_vrf_dev *vrf_ptr = rtnl_dereference(dev->vrf_ptr);
struct net_device *vrf_dev;
 
-   if (!vrf_ptr || netif_is_vrf(dev))
+   if (!vrf_ptr || netif_is_l3_master(dev))
goto out;
 
vrf_dev = netdev_master_upper_dev_get(dev);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d2ffeafc9998..99c33e83822f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1258,7 +1258,7 @@ struct net_device_ops {
  * @IFF_LIVE_ADDR_CHANGE: device supports hardware address
  * change when it's running
  * @IFF_MACVLAN: Macvlan device
- * @IFF_VRF_MASTER: device is a VRF master
+ * @IFF_L3MDEV_MASTER: device is an L3 master device
  * @IFF_NO_QUEUE: device can run without qdisc attached
  * @IFF_OPENVSWITCH: device is a Open vSwitch master
  */
@@ -1283,7 +1283,7 @@ enum netdev_priv_flags {
IFF_XMIT_DST_RELEASE_PERM   = 1<<17,
IFF_IPVLAN_MASTER   = 1<<18,
IFF_IPVLAN_SLAVE= 1<<19,
-   IFF_VRF_MASTER  = 1<<20,
+   IFF_L3MDEV_MASTER   = 1<<20,
IFF_NO_QUEUE= 1<<21,
IFF_OPENVSWITCH = 1<<22,
 };
@@ -1308,7 +1308,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM  IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER  IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE   IFF_IPVLAN_SLAVE
-#define IFF_VRF_MASTER IFF_VRF_MASTER
+#define IFF_L3MDEV_MASTER  IFF_L3MDEV_MASTER
 #define IFF_NO_QUEUE   IFF_NO_QUEUE
 #define IFF_OPENVSWITCHIFF_OPENVSWITCH
 
@@ -3824,9 +3824,9 @@ static inline bool netif_supports_nofcs(struct net_device 
*dev)
return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
-static inline bool netif_is_vrf(const struct net_device *dev)
+static inline bool netif_is_l3_master(const struct net_device *dev)
 {
-   return dev->priv_flags & IFF_VRF_MASTER;
+   return dev->priv_flags & IFF_L3MDEV_MASTER;
 }
 
 static inline bool netif_is_bridge_master(const struct net_device *dev)
@@ -3839,7 +3839,7 @@ static inline bool netif_is_ovs_master(const struct 
net_device *dev)
return dev->priv_flags & IFF_OPENVSWITCH;
 }
 
-static inline bool netif_index_is_vrf(struct net *net, int ifindex)
+static inline bool netif_index_is_l3_master(struct net *net, int ifindex)
 {
bool rc = false;
 
@@ -3853,7 +3853,7 @@ static inline bool netif_index_is_vrf(struct net *net, 
int ifindex)
 
dev = dev_get_by_index_rcu(net, ifindex);
if (dev)
-   rc = netif_is_vrf(dev);
+   rc = netif_is_l3_master(dev);
 
rcu_read_unlock();
 #endif
diff --git a/include/net/route.h b/include/net/route.h
index d1bd90bb3187..a565d0dad12c 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -256,7 +256,7 @@ static inline void ip_route_connect_init(struct flowi4 
*fl4, __be32 dst, __be32
if (inet_sk(sk)->transparent)
flow_flags |= FLOWI_FLAG_ANYSRC;
 
-   if (netif_index_is_vrf(sock_net(sk), oif))
+   if (netif_index_is_l3_master(sock_net(sk), oif))
flow_flags |= FLOWI_FLAG_VRFSRC | FLOWI_FLAG_SKIP_NH_OIF;
 
flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
diff --git a/include/net/vrf.h b/include/net/vrf.h
index 593e6094ddd4..34bb3f69def2 100644
--- a/include/net/vrf.h
+++ b/include/net/vrf.h
@@ -43,7 +43,7 @@ static inline int

Re: [PATCH v2] net: sctp: Don't use 64 kilobyte lookup table for four elements

2015-09-28 Thread Denys Vlasenko

On 09/28/2015 05:32 PM, David Laight wrote:
> From: Eric Dumazet
>> Sent: 28 September 2015 15:27
>> On Mon, 2015-09-28 at 14:12 +0000, David Laight wrote:
>>> From: Neil Horman
>>>> Sent: 28 September 2015 14:51
>>>> On Mon, Sep 28, 2015 at 02:34:04PM +0200, Denys Vlasenko wrote:
>>>>> Seemingly innocuous sctp_trans_state_to_prio_map[] array
>>>>> is way bigger than it looks, since
>>>>> "[SCTP_UNKNOWN] = 2" expands into "[0xffff] = 2" !
>>>>>
>>>>> This patch replaces it with switch() statement.
>>>
>>> What about just adding 1 (and masking) before indexing the array?
>>> That might require a static inline function with a local static array.
>>>
>>> Or define the array as (say) [16] and just mask the state before using
>>> it as an index?
>>
>> Just let the compiler do its job, instead of obfuscating source.
>>
>> Compilers can transform a switch into an (optimal) table if it is really
>> a gain.
> 
> The compiler can choose between a jump table and nested ifs for a switch
> statement. I've never seen it convert one into a data array index.

I don't know why people are fixated on a lookup table here.

For just four possible values, the amount of generated code
is less than one Icache cacheline.

Instruction cachelines are efficiently prefetched and branches
are predicted on all modern CPUs.
Possible data access for lookup table can not be prefetched
as efficiently.
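
To make the size argument concrete, here is a minimal sketch comparing the
two forms. The enum names and values are hypothetical stand-ins, not the
real SCTP definitions; the only property carried over from the thread is a
handful of small states plus one outlier, assumed here to be 0xffff, which
is consistent with the 64-kilobyte figure in the subject line.

```c
#include <stddef.h>

/* Hypothetical state values for illustration only. */
enum demo_state {
	DEMO_INACTIVE = 0,
	DEMO_PF       = 1,
	DEMO_ACTIVE   = 2,
	DEMO_UNKNOWN  = 0xffff,
};

/* Table form: the designated initializer for DEMO_UNKNOWN forces the
 * array to span indices 0..0xffff, i.e. 65536 one-byte entries. */
static const unsigned char demo_prio_table[] = {
	[DEMO_ACTIVE]   = 3,
	[DEMO_UNKNOWN]  = 2,
	[DEMO_PF]       = 1,
	[DEMO_INACTIVE] = 0,
};

/* Switch form: the same mapping without the mostly-zero data. */
unsigned char demo_prio_switch(enum demo_state state)
{
	switch (state) {
	case DEMO_ACTIVE:
		return 3;
	case DEMO_UNKNOWN:
		return 2;
	case DEMO_PF:
		return 1;
	default:
		return 0;
	}
}

/* Table lookup, valid only for the four states listed above. */
unsigned char demo_prio_lookup(enum demo_state state)
{
	return demo_prio_table[state];
}

/* Size of the table in bytes. */
size_t demo_table_size(void)
{
	return sizeof(demo_prio_table);
}
```

Both functions return the same priority for each of the four states; only
the table form pays for the unused indices in between.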


[PATCH net-next 06/11] net: Replace calls to vrf_dev_get_rth

2015-09-28 Thread David Ahern

Replace calls to vrf_dev_get_rth with l3mdev_get_rtable.
The check on the flow flags is handled in the l3mdev operation.

Signed-off-by: David Ahern 
---
 include/net/vrf.h | 22 --
 net/ipv4/route.c  |  8 +++-
 2 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/include/net/vrf.h b/include/net/vrf.h
index b05b96646e2a..5bba1535ba73 100644
--- a/include/net/vrf.h
+++ b/include/net/vrf.h
@@ -32,26 +32,4 @@ struct net_vrf {
u32 tb_id;
 };
 
-
-#if IS_ENABLED(CONFIG_NET_VRF)
-/* caller has already checked netif_is_l3_master(dev) */
-static inline struct rtable *vrf_dev_get_rth(const struct net_device *dev)
-{
-   struct rtable *rth = ERR_PTR(-ENETUNREACH);
-   struct net_vrf *vrf = netdev_priv(dev);
-
-   if (vrf) {
-   rth = vrf->rth;
-   atomic_inc(&rth->dst.__refcnt);
-   }
-   return rth;
-}
-
-#else
-static inline struct rtable *vrf_dev_get_rth(const struct net_device *dev)
-{
-   return ERR_PTR(-ENETUNREACH);
-}
-#endif
-
 #endif /* __LINUX_NET_VRF_H */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index cf790aeb7f74..240f5ea02618 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,7 +112,6 @@
 #endif
 #include 
 #include 
-#include 
 #include 
 
 #define RT_FL_TOS(oldflp4) \
@@ -2128,11 +2127,10 @@ struct rtable *__ip_route_output_key(struct net *net, 
struct flowi4 *fl4)
fl4->saddr = inet_select_addr(dev_out, 0,
  RT_SCOPE_HOST);
}
-   if (netif_is_l3_master(dev_out) &&
-   !(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
-   rth = vrf_dev_get_rth(dev_out);
+
+   rth = l3mdev_get_rtable(dev_out, fl4);
+   if (rth)
goto out;
-   }
}
 
if (!fl4->daddr) {
-- 
1.9.1


[PATCH net-next v2 00/11] net: L3 master device

2015-09-28 Thread David Ahern

The VRF device is essentially a Layer 3 master device used to associate
netdevices with a specific routing table and to influence FIB lookups
via 'ip rules' and controlling the oif/iif used for the lookup.

This series generalizes the VRF into L3 master device, l3mdev. Similar
to switchdev it has a Kconfig option and separate set of operations
in net_device allowing it to be completely compiled out if not wanted.
The l3mdev methods rely on the 'master' aspect and use of
netdev_master_upper_dev_get_rcu to retrieve the master device from a
given netdevice if it is enslaved to an L3_MASTER.

The VRF device is converted to use the l3mdev operations. At the end the
vrf_ptr is no longer needed and is removed, as are all direct references to VRF.
The end result is a much simpler implementation for VRF.

Thanks to Nikolay for suggestions (e.g., use of the master linkage which
is the key to making this work) and to Roopa, Andy and Shrijeet for
early reviews.

v2
- rebased to top of net-next

- addressed Nik's comments (checking master, removing extra lines, and
  flipping the order of patches 1 and 2)


Changes since RFC:
- Changed IFF_L3MDEV to IFF_L3MDEV_MASTER after Nikolay pointed out a problem
  with my flag changes (uniquely identifying an L3MDEV master device versus an
  enslaved device like a bond that will also be a master device)
- Rolled in icmp fix for panic when flipping from vrf functions to l3mdev
- Moved netif_is_l3_master check into l3mdev_get_rtable


David Ahern (11):
  net: Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER
  net: Introduce L3 Master device abstraction
  net: Add support for l3mdev ops to VRF driver
  net: Replace vrf_master_ifindex{,_rcu} with l3mdev equivalents
  net: Replace vrf_dev_table and friends
  net: Replace calls to vrf_dev_get_rth
  net: Remove the now unused vrf_ptr
  net: Remove vrf header file
  net: Move netif_index_is_l3_master to l3mdev.h
  net: Rename FLOWI_FLAG_VRFSRC to FLOWI_FLAG_L3MDEV_SRC
  net: Add netif_is_l3_slave

 MAINTAINERS   |   8 ++-
 drivers/net/Kconfig   |   1 +
 drivers/net/vrf.c |  89 +--
 include/linux/netdevice.h |  43 ---
 include/net/flow.h|   2 +-
 include/net/l3mdev.h  | 151 +++
 include/net/route.h   |   5 +-
 include/net/vrf.h | 178 --
 net/Kconfig   |   1 +
 net/Makefile  |   3 +
 net/ipv4/af_inet.c|   4 +-
 net/ipv4/fib_frontend.c   |  12 ++--
 net/ipv4/icmp.c   |   8 +--
 net/ipv4/ip_fragment.c|   6 +-
 net/ipv4/ip_output.c  |   2 +-
 net/ipv4/route.c  |  15 ++--
 net/ipv4/udp.c|   4 +-
 net/ipv4/xfrm4_policy.c   |   8 +--
 net/ipv6/xfrm6_policy.c   |   8 +--
 net/l3mdev/Kconfig|  10 +++
 net/l3mdev/Makefile   |   5 ++
 net/l3mdev/l3mdev.c   |  78 
 22 files changed, 357 insertions(+), 284 deletions(-)
 create mode 100644 include/net/l3mdev.h
 delete mode 100644 include/net/vrf.h
 create mode 100644 net/l3mdev/Kconfig
 create mode 100644 net/l3mdev/Makefile
 create mode 100644 net/l3mdev/l3mdev.c

-- 
1.9.1


[PATCH net-next 04/11] net: Replace vrf_master_ifindex{,_rcu} with l3mdev equivalents

2015-09-28 Thread David Ahern

Replace calls to vrf_master_ifindex_rcu and vrf_master_ifindex with either
l3mdev_master_ifindex_rcu or l3mdev_master_ifindex.

The pattern:
oif = vrf_master_ifindex(dev) ? : dev->ifindex;
is replaced with
oif = l3mdev_fib_oif(dev);

And remove the now unused vrf macros.
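
For readers unfamiliar with the "?:" shorthand, the replaced pattern amounts
to "use the master's ifindex if there is one, otherwise the device's own".
A minimal sketch in standard C follows; struct demo_dev and both helpers are
hypothetical and only illustrate the fallback, they are not the l3mdev
implementation.

```c
#include <stddef.h>

/* Hypothetical minimal device: master_ifindex is 0 when the device is
 * not enslaved to an L3 master. */
struct demo_dev {
	int ifindex;
	int master_ifindex;
};

/* Returns the master's ifindex, or 0 if there is none (or no device). */
int demo_master_ifindex(const struct demo_dev *dev)
{
	return dev ? dev->master_ifindex : 0;
}

/* Equivalent of "master_ifindex(dev) ? : dev->ifindex" without the GNU
 * extension. The caller must pass a non-NULL device. */
int demo_fib_oif(const struct demo_dev *dev)
{
	int master = demo_master_ifindex(dev);

	return master ? master : dev->ifindex;
}
```

Folding the fallback into one helper keeps the call sites to a single
expression, which is the point of the replacement described above.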

Signed-off-by: David Ahern 
---
 include/net/vrf.h   | 41 -
 net/ipv4/fib_frontend.c |  5 +++--
 net/ipv4/icmp.c |  8 
 net/ipv4/ip_fragment.c  |  6 +++---
 net/ipv4/route.c|  7 ---
 net/ipv4/xfrm4_policy.c |  8 +++-
 net/ipv6/xfrm6_policy.c |  8 +++-
 7 files changed, 20 insertions(+), 63 deletions(-)

diff --git a/include/net/vrf.h b/include/net/vrf.h
index 34bb3f69def2..874a6c9e4217 100644
--- a/include/net/vrf.h
+++ b/include/net/vrf.h
@@ -34,37 +34,6 @@ struct net_vrf {
 
 
 #if IS_ENABLED(CONFIG_NET_VRF)
-/* called with rcu_read_lock() */
-static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
-{
-   struct net_vrf_dev *vrf_ptr;
-   int ifindex = 0;
-
-   if (!dev)
-   return 0;
-
-   if (netif_is_l3_master(dev)) {
-   ifindex = dev->ifindex;
-   } else {
-   vrf_ptr = rcu_dereference(dev->vrf_ptr);
-   if (vrf_ptr)
-   ifindex = vrf_ptr->ifindex;
-   }
-
-   return ifindex;
-}
-
-static inline int vrf_master_ifindex(const struct net_device *dev)
-{
-   int ifindex;
-
-   rcu_read_lock();
-   ifindex = vrf_master_ifindex_rcu(dev);
-   rcu_read_unlock();
-
-   return ifindex;
-}
-
 /* called with rcu_read_lock */
 static inline u32 vrf_dev_table_rcu(const struct net_device *dev)
 {
@@ -139,16 +108,6 @@ static inline struct rtable *vrf_dev_get_rth(const struct 
net_device *dev)
 }
 
 #else
-static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
-{
-   return 0;
-}
-
-static inline int vrf_master_ifindex(const struct net_device *dev)
-{
-   return 0;
-}
-
 static inline u32 vrf_dev_table_rcu(const struct net_device *dev)
 {
return 0;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6fcbd215cdbc..b901b344f22d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
@@ -332,7 +333,7 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
bool dev_match;
 
fl4.flowi4_oif = 0;
-   fl4.flowi4_iif = vrf_master_ifindex_rcu(dev);
+   fl4.flowi4_iif = l3mdev_master_ifindex_rcu(dev);
if (!fl4.flowi4_iif)
fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
fl4.daddr = src;
@@ -366,7 +367,7 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
if (nh->nh_dev == dev) {
dev_match = true;
break;
-   } else if (vrf_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+   } else if (l3mdev_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
dev_match = true;
break;
}
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e5eb8ac4089d..6b96dee2800b 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,7 +96,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 /*
  * Build xmit assembly blocks
@@ -309,7 +309,7 @@ static bool icmpv4_xrlim_allow(struct net *net, struct 
rtable *rt,
 
rc = false;
if (icmp_global_allow()) {
-   int vif = vrf_master_ifindex(dst->dev);
+   int vif = l3mdev_master_ifindex(dst->dev);
struct inet_peer *peer;
 
peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, vif, 1);
@@ -427,7 +427,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct 
sk_buff *skb)
fl4.flowi4_mark = mark;
fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
fl4.flowi4_proto = IPPROTO_ICMP;
-   fl4.flowi4_oif = vrf_master_ifindex(skb->dev);
+   fl4.flowi4_oif = l3mdev_master_ifindex(skb->dev);
security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
rt = ip_route_output_key(net, &fl4);
if (IS_ERR(rt))
@@ -461,7 +461,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
fl4->flowi4_proto = IPPROTO_ICMP;
fl4->fl4_icmp_type = type;
fl4->fl4_icmp_code = code;
-   fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev);
+   fl4->flowi4_oif = l3mdev_master_ifindex(skb_in->dev);
 
security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
rt = __ip_route_output_key(net, fl4);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index fa7f15305f9a..9772b789adf3 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -48,7 +48,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 /* NOTE. Logic of IP

[net-next PATCH 1/3] net/ipv4: Pass proto as u8 instead of u16 in ip_check_mc_rcu

2015-09-28 Thread Alexander Duyck

This patch updates ip_check_mc_rcu so that protocol is passed as a u8
instead of a u16.

The motivation is just to avoid any unneeded type transitions since some
systems will require an instruction to zero extend a u8 field to a u16.
It also makes it clearer that protocol is a u8, so no byte-order
conversion is needed when passing it.

Signed-off-by: Alexander Duyck 
---
 include/linux/igmp.h |2 +-
 net/ipv4/igmp.c  |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 908429216d9f..9c9de11549a7 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -110,7 +110,7 @@ struct ip_mc_list {
 #define IGMPV3_QQIC(value) IGMPV3_EXP(0x80, 4, 3, value)
 #define IGMPV3_MRC(value) IGMPV3_EXP(0x80, 4, 3, value)
 
-extern int ip_check_mc_rcu(struct in_device *dev, __be32 mc_addr, __be32 src_addr, u16 proto);
+extern int ip_check_mc_rcu(struct in_device *dev, __be32 mc_addr, __be32 src_addr, u8 proto);
 extern int igmp_rcv(struct sk_buff *);
 extern int ip_mc_join_group(struct sock *sk, struct ip_mreqn *imr);
 extern int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr);
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index d38b8b61eaee..de6d4c8ba600 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2569,7 +2569,7 @@ void ip_mc_drop_socket(struct sock *sk)
 }
 
 /* called with rcu_read_lock() */
-int ip_check_mc_rcu(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 proto)
+int ip_check_mc_rcu(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u8 proto)
 {
struct ip_mc_list *im;
struct ip_mc_list __rcu **mc_hash;


[net-next PATCH 2/3] net: Swap ordering of tests in ip_route_input_mc

2015-09-28 Thread Alexander Duyck

This patch just swaps the ordering of one of the conditional tests in
ip_route_input_mc.  Specifically it swaps the testing for the source
address to see if it is loopback, and the test to see if we allow a
loopback source address.

The reason for swapping these two tests is that it is much faster to
test if an address is loopback than it is to dereference several pointers
to get at the net structure to see if the use of loopback is allowed.
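
The effect of the reordering can be sketched in isolation. Everything below
is a hypothetical stand-in: struct demo_in_dev replaces the real
configuration lookup, and addresses are in host byte order to keep the
example short. The counter only exists to show that the expensive side is
skipped for non-loopback sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device configuration that the real
 * code reaches through several pointer dereferences. */
struct demo_in_dev {
	bool route_localnet;
	int  cfg_lookups;   /* counts how often the setting was consulted */
};

/* 127.0.0.0/8 test on an address in host byte order. */
static bool demo_is_loopback(uint32_t addr_host)
{
	return (addr_host & 0xff000000u) == 0x7f000000u;
}

/* The "expensive" side of the condition. */
static bool demo_route_localnet(struct demo_in_dev *in_dev)
{
	in_dev->cfg_lookups++;
	return in_dev->route_localnet;
}

/* Returns true when the source must be rejected. The cheap address test
 * runs first, so short-circuit evaluation reads the configuration only
 * for loopback sources. */
bool demo_reject_source(struct demo_in_dev *in_dev, uint32_t saddr_host)
{
	return demo_is_loopback(saddr_host) && !demo_route_localnet(in_dev);
}
```

The result is the same for every input as with the tests in the other
order; only the work done in the common (non-loopback) case changes.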

Signed-off-by: Alexander Duyck 
---
 net/ipv4/route.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6bab84503cd9..43508c8d08e2 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1487,9 +1487,8 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 
daddr, __be32 saddr,
skb->protocol != htons(ETH_P_IP))
goto e_inval;
 
-   if (likely(!IN_DEV_ROUTE_LOCALNET(in_dev)))
-   if (ipv4_is_loopback(saddr))
-   goto e_inval;
+   if (ipv4_is_loopback(saddr) && !IN_DEV_ROUTE_LOCALNET(in_dev))
+   goto e_inval;
 
if (ipv4_is_zeronet(saddr)) {
if (!ipv4_is_local_multicast(daddr))


[net-next PATCH 3/3] net: Remove martian_source_keep_err goto label

2015-09-28 Thread Alexander Duyck

From: David Ahern 

err is initialized to -EINVAL when it is declared. It is not reset until
fib_lookup which is well after the 3 users of the martian_source jump. So
resetting err to -EINVAL at martian_source label is not needed.

Removing that line obviates the need for the martian_source_keep_err label
so delete it.
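
The reasoning can be checked on a heavily reduced model of the control
flow. The function below is a hypothetical sketch, not ip_route_input_slow:
"martian" stands for the early tests that jump straight to the label, and
"validate" for the result of a source validation call.

```c
#include <errno.h>

/* Hypothetical, reduced control flow for illustration only. */
int demo_input_slow(int martian, int validate)
{
	int err = -EINVAL;              /* initialized at declaration */

	if (martian)
		goto martian_source;    /* err is still -EINVAL here */

	err = validate;
	if (err < 0)
		goto martian_source;    /* keeps the validator's error */

	return 0;

martian_source:
	/* No "err = -EINVAL" is needed: every path that reaches this
	 * label already carries the value it should return. */
	return err;
}
```

With a single label, the early jumps return -EINVAL from the initializer
and the validation failures return their own code, which is what the two
labels achieved before.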

Signed-off-by: David Ahern 
Signed-off-by: Alexander Duyck 
---
 net/ipv4/route.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 43508c8d08e2..8c84a6664b30 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1759,7 +1759,7 @@ static int ip_route_input_slow(struct sk_buff *skb, 
__be32 daddr, __be32 saddr,
err = fib_validate_source(skb, saddr, daddr, tos,
  0, dev, in_dev, &itag);
if (err < 0)
-   goto martian_source_keep_err;
+   goto martian_source;
goto local_input;
}
 
@@ -1781,7 +1781,7 @@ brd_input:
err = fib_validate_source(skb, saddr, 0, tos, 0, dev,
  in_dev, &itag);
if (err < 0)
-   goto martian_source_keep_err;
+   goto martian_source;
}
flags |= RTCF_BROADCAST;
res.type = RTN_BROADCAST;
@@ -1857,8 +1857,6 @@ e_nobufs:
goto out;
 
 martian_source:
-   err = -EINVAL;
-martian_source_keep_err:
ip_handle_martian_source(dev, in_dev, skb, daddr, saddr);
goto out;
 }


[net-next PATCH 0/3] Minor IPv4 routing cleanups

2015-09-28 Thread Alexander Duyck

These patches contain a few minor cleanups.  The first and the third
mostly just improve readability.  The second patch should improve
performance for multicast destination addresses whose source IP address
is not a loopback address, by avoiding some unnecessary dereferences.

---

Alexander Duyck (2):
  net/ipv4: Pass proto as u8 instead of u16 in ip_check_mc_rcu
  net: Swap ordering of tests in ip_route_input_mc

David Ahern (1):
  net: Remove martian_source_keep_err goto label


 include/linux/igmp.h |2 +-
 net/ipv4/igmp.c  |2 +-
 net/ipv4/route.c |   11 ---
 3 files changed, 6 insertions(+), 9 deletions(-)

--

Re: [RFC PATCH] Fix false positives in can_checksum_protocol()

2015-09-28 Thread David Woodhouse

On Mon, 2015-09-28 at 10:03 -0700, Tom Herbert wrote:

> > +   if ((((features & NETIF_F_V4_CSUM) && protocol == htons(ETH_P_IP)) ||
> > +        ((features & NETIF_F_V6_CSUM) && protocol == htons(ETH_P_IPV6))) &&
> > +       (sk_protocol == IPPROTO_TCP || sk_protocol == IPPROTO_UDP))
> > +           return 1;
> > +
> > +
> Relying on skb->sk->sk_protocol is problematic. This is making the
> assumption that the checksum being offloaded for the packet is the
> same as that of the protocol for the socket-- this may not be the 
> case when we are offloading an outer checksum in encapsulation.

> Currently this wouldn't be a problem since we're probably only
> offloading outer UDP checksums, but if we ever start trying to 
> offload other outer checksums (e.g. GRE) then this probably doesn't
> work so well.

That makes sense.

> Also, this doesn't help those drivers that can offload TCP and
> UDP for IPv6 but only if there are no extension headers, in those
> cases the driver needs to look at the packet to see if it is a
> "simple" UDP/TCP packet.

Hm, are such devices even permitted to set NETIF_F_IPV6_CSUM?

> AFAIK, the only non UDP/TCP transport IP checksum in the stack is GRE
> checksum which as I pointed out we don't attempt to offload. So the
> only way to trip the bug that you are seeing is probably through a
> userspace packet interface like in the test code. I think this
> actually might expose a much more serious issue. Looking at tun.c, I
> don't see anything that validates that the csum_start and csum_offset
> provided by userspace actually refers to a sane checksum offset. 

That's handled in skb_partial_csum_set().

> Not only is this a way to ask the stack to perform checksums for non
> TCP/UDP, but it actually seems like the interface could be used by a
> malicious application to have a device arbitrarily overwrite two 
> bytes anywhere in the packet with its own data far below the stack,
> netfilter, routing. To really fix this we should probably be doing
> validation in tun, if the checksum isn't for TCP or UDP then call
> skb_checksum_help before sending the packet into the stack.

So... if it's never valid to ask for a hardware checksum on anything
but TCP or UDP, why do we bother with NETIF_F_GEN_CSUM at all? Should
we just be removing it entirely? That seems like something of a
retrograde step.

Perhaps a better solution would be a bit in the skbuff which indicates
that it *is* a TCP or UDP checksum. That would be set by our UDP and
TCP sockets, cleared by encapsulation, also set if appropriate by
skb_partial_csum_set().

And then the check in can_checksum_protocol() is trivial and clearly
correct.
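
To illustrate how trivial the check could become, here is a sketch under
stated assumptions. None of the names below exist in the kernel: the
DEMO_F_* flags loosely mirror the feature bits discussed above, and
csum_tcp_udp is the hypothetical per-packet bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, loosely modelled on the flags above. */
#define DEMO_F_GEN_CSUM (1u << 0)  /* device can checksum anything */
#define DEMO_F_V4_CSUM  (1u << 1)  /* TCP/UDP over IPv4 only */
#define DEMO_F_V6_CSUM  (1u << 2)  /* TCP/UDP over IPv6 only */

#define DEMO_ETH_P_IP   0x0800
#define DEMO_ETH_P_IPV6 0x86DD

struct demo_skb {
	uint16_t protocol;      /* host byte order for simplicity */
	bool     csum_tcp_udp;  /* hypothetical "TCP or UDP checksum" bit */
};

/* A generic-checksum device accepts everything; a protocol-specific
 * device accepts only packets flagged as carrying a TCP/UDP checksum
 * over the matching IP version. */
bool demo_can_checksum(uint32_t features, const struct demo_skb *skb)
{
	if (features & DEMO_F_GEN_CSUM)
		return true;
	if (!skb->csum_tcp_udp)
		return false;
	return ((features & DEMO_F_V4_CSUM) &&
		skb->protocol == DEMO_ETH_P_IP) ||
	       ((features & DEMO_F_V6_CSUM) &&
		skb->protocol == DEMO_ETH_P_IPV6);
}
```

The sketch leaves open where the bit would be set and cleared; it only
shows that the device-side test needs no knowledge of the socket.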

-- 
David Woodhouse            Open Source Technology Centre
david.woodho...@intel.com  Intel Corporation




Re: [PATCH] ipvs: Don't protect ip_vs_addr_is_unicast with CONFIG_SYSCTL

2015-09-28 Thread Julian Anastasov


Hello,

On Mon, 28 Sep 2015, Eric W. Biederman wrote:

> I arranged the code so that the compiler can remove the unnecessary bits
> in ip_vs_leave when CONFIG_SYSCTL is unset, and removed an explicit
> CONFIG_SYSCTL.
> 
> Unfortunately when rebasing my work on top of that of Alex Gartrell I
> missed the fact that the newly added function ip_vs_addr_is_unicast was
> surrounded by CONFIG_SYSCTL.
> 
> So remove the now unnecessary CONFIG_SYSCTL guards around
> ip_vs_addr_is_unicast.  It is causing build failures today when
> CONFIG_SYSCTL is not selected and any self respecting compiler will
> notice that sysctl_cache_bypass is always false without CONFIG_SYSCTL
> and not include the logic from the function ip_vs_addr_is_unicast in
> the compiled code.
> 
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Julian Anastasov 

Simon, please apply to ipvs-next

> ---
> 
> This is a build fix for ipvs-next and nf-next.
> 
>  net/netfilter/ipvs/ip_vs_core.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 07a791ecdfba..fba73db81d2f 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -547,7 +547,6 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff 
> *skb,
>   return cp;
>  }
>  
> -#ifdef CONFIG_SYSCTL
>  static inline int ip_vs_addr_is_unicast(struct net *net, int af,
>   union nf_inet_addr *addr)
>  {
> @@ -557,7 +556,6 @@ static inline int ip_vs_addr_is_unicast(struct net *net, 
> int af,
>  #endif
>   return (inet_addr_type(net, addr->ip) == RTN_UNICAST);
>  }
> -#endif
>  
>  /*
>   *  Pass or drop the packet.
> -- 
> 2.2.1

Regards

--
Julian Anastasov 

[net] i40e: fix VLAN inside VXLAN

2015-09-28 Thread Jeff Kirsher

From: Jesse Brandeburg 

Prior to this patch, the hardware was removing
VLAN tags from the inner header of VXLAN packets.  The
hardware configuration can be changed to leave the
packet alone, since that is what the Linux stack
expects for this type of VLAN-in-VXLAN packet.

Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 48a52b3..5bb1e67 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2590,7 +2590,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
rx_ctx.lrxqthresh = 2;
rx_ctx.crcstrip = 1;
rx_ctx.l2tsel = 1;
-   rx_ctx.showiv = 1;
+   /* this controls whether VLAN is stripped from inner headers */
+   rx_ctx.showiv = 0;
 #ifdef I40E_FCOE
rx_ctx.fc_ena = (vsi->type == I40E_VSI_FCOE);
 #endif
-- 
2.4.3


[PATCH 2/5] ntp/pps: replace getnstime_raw_and_real with 64-bit version

2015-09-28 Thread Arnd Bergmann

There is exactly one caller of getnstime_raw_and_real in the kernel,
which is the pps_get_ts function. This changes the caller and
the implementation to work on timespec64 types rather than timespec,
to avoid the time_t overflow on 32-bit architectures.

For consistency with the other new functions (ktime_get_seconds,
ktime_get_real_*, ...), I'm renaming the function to
ktime_get_raw_and_real_ts64.

We still need to convert from the internal 64-bit type to 32 bit
types in the caller, but this conversion is now pushed out from
getnstime_raw_and_real to pps_get_ts. A follow-up patch changes
the remaining pps code to completely avoid the conversion.

Signed-off-by: Arnd Bergmann 
---
 include/linux/pps_kernel.h  |  7 ++-
 include/linux/timekeeping.h |  4 ++--
 kernel/time/timekeeping.c   | 12 ++--
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/pps_kernel.h b/include/linux/pps_kernel.h
index 1d2cd21242e8..b2fbd62ab18d 100644
--- a/include/linux/pps_kernel.h
+++ b/include/linux/pps_kernel.h
@@ -115,7 +115,12 @@ static inline void timespec_to_pps_ktime(struct pps_ktime 
*kt,
 
 static inline void pps_get_ts(struct pps_event_time *ts)
 {
-   getnstime_raw_and_real(&ts->ts_raw, &ts->ts_real);
+   struct timespec64 raw, real;
+
+   ktime_get_raw_and_real_ts64(&raw, &real);
+
+   ts->ts_raw = timespec64_to_timespec(raw);
+   ts->ts_real = timespec64_to_timespec(real);
 }
 
 #else /* CONFIG_NTP_PPS */
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 474331cd1ef8..ca2eaa9077eb 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -269,8 +269,8 @@ extern void timekeeping_inject_sleeptime64(struct 
timespec64 *delta);
 /*
  * PPS accessor
  */
-extern void getnstime_raw_and_real(struct timespec *ts_raw,
-  struct timespec *ts_real);
+extern void ktime_get_raw_and_real_ts64(struct timespec64 *ts_raw,
+   struct timespec64 *ts_real);
 
 /*
  * Persistent clock related interfaces
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3112977dfca0..ed5049ff94c5 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -849,7 +849,7 @@ EXPORT_SYMBOL_GPL(ktime_get_real_seconds);
 #ifdef CONFIG_NTP_PPS
 
 /**
- * getnstime_raw_and_real - get day and raw monotonic time in timespec format
+ * ktime_get_raw_and_real_ts64 - get day and raw monotonic time in timespec format
  * @ts_raw:pointer to the timespec to be set to raw monotonic time
  * @ts_real:   pointer to the timespec to be set to the time of day
  *
@@ -857,7 +857,7 @@ EXPORT_SYMBOL_GPL(ktime_get_real_seconds);
  * same time atomically and stores the resulting timestamps in timespec
  * format.
  */
-void getnstime_raw_and_real(struct timespec *ts_raw, struct timespec *ts_real)
+void ktime_get_raw_and_real_ts64(struct timespec64 *ts_raw, struct timespec64 *ts_real)
 {
struct timekeeper *tk = &tk_core.timekeeper;
unsigned long seq;
@@ -868,7 +868,7 @@ void getnstime_raw_and_real(struct timespec *ts_raw, struct 
timespec *ts_real)
do {
seq = read_seqcount_begin(&tk_core.seq);
 
-   *ts_raw = timespec64_to_timespec(tk->raw_time);
+   *ts_raw = tk->raw_time;
ts_real->tv_sec = tk->xtime_sec;
ts_real->tv_nsec = 0;
 
@@ -877,10 +877,10 @@ void getnstime_raw_and_real(struct timespec *ts_raw, 
struct timespec *ts_real)
 
} while (read_seqcount_retry(&tk_core.seq, seq));
 
-   timespec_add_ns(ts_raw, nsecs_raw);
-   timespec_add_ns(ts_real, nsecs_real);
+   timespec64_add_ns(ts_raw, nsecs_raw);
+   timespec64_add_ns(ts_real, nsecs_real);
 }
-EXPORT_SYMBOL(getnstime_raw_and_real);
+EXPORT_SYMBOL(ktime_get_raw_and_real_ts64);
 
 #endif /* CONFIG_NTP_PPS */
 
-- 
2.1.0.rc2


[PATCH 5/5] net: sfc: avoid using timespec

2015-09-28 Thread Arnd Bergmann

The sfc driver internally uses a time format based on 32-bit (unsigned)
seconds and 32-bit nanoseconds. This means it will overflow in 2106,
but the value we pass into it is a signed 32-bit tv_sec that already
overflows in 2038 to a negative value.

This patch changes the logic to use the lower 32 bits of the timespec64
tv_sec in efx_ptp_ns_to_s_ns, which will have the correct value beyond the
overflow. While this does not change any of the register values, it lets us
keep using the driver after we deprecate the use of the timespec type
in the kernel.
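
The difference between the two overflow points can be shown with plain
integer arithmetic. The helpers below are hypothetical and do not use any
kernel types; the second one computes, in a well-defined way, the value a
signed 32-bit seconds field would hold for the same instant.

```c
#include <stdint.h>

/* Low 32 bits of a 64-bit seconds count: matches the true value up to
 * 2^32 - 1 seconds after the epoch (year 2106), then wraps to 0. */
uint32_t demo_major_from_sec64(int64_t sec)
{
	return (uint32_t)sec;
}

/* Value a signed 32-bit seconds field would hold for the same instant,
 * computed without relying on implementation-defined narrowing. */
int64_t demo_wrapped_s32(int64_t sec)
{
	uint32_t low = (uint32_t)sec;

	return low < 0x80000000u ? (int64_t)low
				 : (int64_t)low - 4294967296LL;
}
```

For example, at 2^31 seconds after the epoch (January 2038) the unsigned
low word is still 2147483648, while the signed 32-bit view has already
wrapped to -2147483648.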

In the efx_ptp_process_times function, the change to use timespec64
is similar, in that the tv_sec portion is ignored anyway and we only
care about the nanosecond portion that remains unchanged.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/sfc/ptp.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
index fe849dbf9f80..c771e0af4e06 100644
--- a/drivers/net/ethernet/sfc/ptp.c
+++ b/drivers/net/ethernet/sfc/ptp.c
@@ -401,8 +401,8 @@ size_t efx_ptp_update_stats(struct efx_nic *efx, u64 *stats)
 /* For Siena platforms NIC time is s and ns */
 static void efx_ptp_ns_to_s_ns(s64 ns, u32 *nic_major, u32 *nic_minor)
 {
-   struct timespec ts = ns_to_timespec(ns);
-   *nic_major = ts.tv_sec;
+   struct timespec64 ts = ns_to_timespec64(ns);
+   *nic_major = (u32)ts.tv_sec;
*nic_minor = ts.tv_nsec;
 }
 
@@ -431,8 +431,8 @@ static ktime_t efx_ptp_s_ns_to_ktime_correction(u32 
nic_major, u32 nic_minor,
  */
 static void efx_ptp_ns_to_s27(s64 ns, u32 *nic_major, u32 *nic_minor)
 {
-   struct timespec ts = ns_to_timespec(ns);
-   u32 maj = ts.tv_sec;
+   struct timespec64 ts = ns_to_timespec64(ns);
+   u32 maj = (u32)ts.tv_sec;
u32 min = (u32)(((u64)ts.tv_nsec * NS_TO_S27_MULT +
 (1ULL << (NS_TO_S27_SHIFT - 1))) >> NS_TO_S27_SHIFT);
 
@@ -737,14 +737,14 @@ efx_ptp_process_times(struct efx_nic *efx, 
MCDI_DECLARE_STRUCT_PTR(synch_buf),
 */
for (i = 0; i < number_readings; i++) {
s32 window, corrected;
-   struct timespec wait;
+   struct timespec64 wait;
 
efx_ptp_read_timeset(
MCDI_ARRAY_STRUCT_PTR(synch_buf,
  PTP_OUT_SYNCHRONIZE_TIMESET, i),
&ptp->timeset[i]);
 
-   wait = ktime_to_timespec(
+   wait = ktime_to_timespec64(
ptp->nic_to_kernel_time(0, ptp->timeset[i].wait, 0));
window = ptp->timeset[i].window;
corrected = window - wait.tv_nsec;
@@ -803,7 +803,7 @@ efx_ptp_process_times(struct efx_nic *efx, 
MCDI_DECLARE_STRUCT_PTR(synch_buf),
  ptp->timeset[last_good].minor, 0);
 
/* Calculate delay from NIC top of second to last_time */
-   delta.tv_nsec += ktime_to_timespec(mc_time).tv_nsec;
+   delta.tv_nsec += ktime_to_timespec64(mc_time).tv_nsec;
 
/* Set PPS timestamp to match NIC top of second */
ptp->host_time_pps = *last_time;
-- 
2.1.0.rc2
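
The truncation argument in the commit message above can be checked in user space. The following is only a sketch, not driver code: the helper names `demo_nic_major` and `demo_as_signed32` are made up for illustration, and the only idea taken from the patch is keeping the low 32 bits of a 64-bit seconds value.

```c
#include <stdint.h>

/* Hypothetical helper: keep the low 32 bits of a 64-bit seconds count,
 * as the (u32) cast on a timespec64 tv_sec does in the patch. */
static uint32_t demo_nic_major(int64_t tv_sec)
{
	return (uint32_t)tv_sec;	/* well-defined: reduces modulo 2^32 */
}

/* Hypothetical helper: the value a signed 32-bit seconds field would
 * represent for the same low 32 bits (negative once bit 31 is set). */
static int64_t demo_as_signed32(int64_t tv_sec)
{
	uint32_t low = (uint32_t)tv_sec;

	return low >= 0x80000000u ? (int64_t)low - 4294967296LL : (int64_t)low;
}
```

For a seconds value of 2^31 (early 2038), the unsigned view is still 2147483648 while the signed 32-bit view has gone negative. The unsigned view only wraps once the count reaches 2^32, which is the 2106 limit the commit message mentions.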


Re: [PATCH net-next 0/6] make non-modular code explicitly non-modular

2015-09-28 Thread Geert Uytterhoeven

Hi Paul,

On Mon, Sep 28, 2015 at 9:51 PM, Paul Gortmaker
 wrote:
> In a previous merge window, we made changes to allow better
> delineation between modular and non-modular code in commit
> 0fd972a7d91d6e15393c449492a04d94c0b89351 ("module: relocate module_init
> from init.h to module.h").  This allows us to now ensure module code
> looks modular and non-modular code does not accidentally look modular
> just to avoid suffering build breakage.
>
> Here we target code that is, by nature of their Makefile and/or
> Kconfig settings, only available to be built-in, but implicitly
> presenting itself as being possibly modular by way of using modular
> headers, macros, and functions.
>
> The goal here is to remove that illusion of modularity from these
> files, but in a way that leaves the actual runtime unchanged.
> In doing so, we remove code that has never been tested and adds
> no value to the tree.  And we continue the process of expecting a
> level of consistency between the Kconfig/Makefile of code and the
> code in use itself.
>
> Fortunately the net subsystem has relatively few instances, given
> the overall amount of code and drivers it contains.  For comparison
> there are over 300 instances tree wide, resulting in a possible net
> removal of on the order of 5000 lines of unused code.
>
> Build tested on net-next 34c2d9fb0498 on m68k, since that is the arch
> where the three ethernet drivers changed here are available.

>   net/ethernet: make amd/hplance.c driver explicitly non-modular
>   net/ethernet: make 8390/mac8390.c driver explicitly non-modular
>   net/ethernet: make apple/macmace.c driver explicitly non-modular

Why did you choose this approach?
What about changing the "bool"s to "tristate"s in Kconfig instead?

I gave it a try, and with some small changes the three m68k ethernet drivers
build fine as modular drivers. I can send patches if you like it.

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

[PATCH 4/5] ntp/pps: use y2038 safe types in pps_event_time

2015-09-28 Thread Arnd Bergmann

The pps_event_time uses two 'timespec' structures internally, which
suffer from the y2038 problem. The uses of this structure are
fairly self-contained in the pps code, so this replaces them all at
once.

Unfortunately, this includes the sfc ethernet driver in addition to the
pps subsystem, so we change that one as well. Both touch the
same data structure, and there probably is no good way to split
the patch into smaller units.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/sfc/ptp.c | 16 
 drivers/pps/kapi.c |  4 ++--
 drivers/pps/kc.c   |  4 +---
 include/linux/pps_kernel.h | 21 -
 4 files changed, 19 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
index ad62615a93dc..fe849dbf9f80 100644
--- a/drivers/net/ethernet/sfc/ptp.c
+++ b/drivers/net/ethernet/sfc/ptp.c
@@ -646,28 +646,28 @@ static void efx_ptp_send_times(struct efx_nic *efx,
   struct pps_event_time *last_time)
 {
struct pps_event_time now;
-   struct timespec limit;
+   struct timespec64 limit;
struct efx_ptp_data *ptp = efx->ptp_data;
-   struct timespec start;
+   struct timespec64 start;
int *mc_running = ptp->start.addr;
 
pps_get_ts(&now);
start = now.ts_real;
limit = now.ts_real;
-   timespec_add_ns(&limit, SYNCHRONISE_PERIOD_NS);
+   timespec64_add_ns(&limit, SYNCHRONISE_PERIOD_NS);
 
/* Write host time for specified period or until MC is done */
-   while ((timespec_compare(&now.ts_real, &limit) < 0) &&
+   while ((timespec64_compare(&now.ts_real, &limit) < 0) &&
   ACCESS_ONCE(*mc_running)) {
-   struct timespec update_time;
+   struct timespec64 update_time;
unsigned int host_time;
 
/* Don't update continuously to avoid saturating the PCIe bus */
update_time = now.ts_real;
-   timespec_add_ns(&update_time, SYNCHRONISATION_GRANULARITY_NS);
+   timespec64_add_ns(&update_time, SYNCHRONISATION_GRANULARITY_NS);
do {
pps_get_ts(&now);
-   } while ((timespec_compare(&now.ts_real, &update_time) < 0) &&
+   } while ((timespec64_compare(&now.ts_real, &update_time) < 0) &&
 ACCESS_ONCE(*mc_running));
 
/* Synchronise NIC with single word of time only */
@@ -723,7 +723,7 @@ efx_ptp_process_times(struct efx_nic *efx, 
MCDI_DECLARE_STRUCT_PTR(synch_buf),
struct efx_ptp_data *ptp = efx->ptp_data;
u32 last_sec;
u32 start_sec;
-   struct timespec delta;
+   struct timespec64 delta;
ktime_t mc_time;
 
if (number_readings == 0)
diff --git a/drivers/pps/kapi.c b/drivers/pps/kapi.c
index cdad4d95b20e..805c749ac1ad 100644
--- a/drivers/pps/kapi.c
+++ b/drivers/pps/kapi.c
@@ -179,8 +179,8 @@ void pps_event(struct pps_device *pps, struct 
pps_event_time *ts, int event,
/* check event type */
BUG_ON((event & (PPS_CAPTUREASSERT | PPS_CAPTURECLEAR)) == 0);
 
-   dev_dbg(pps->dev, "PPS event at %ld.%09ld\n",
-   ts->ts_real.tv_sec, ts->ts_real.tv_nsec);
+   dev_dbg(pps->dev, "PPS event at %lld.%09ld\n",
+   (s64)ts->ts_real.tv_sec, ts->ts_real.tv_nsec);
 
timespec_to_pps_ktime(&ts_real, ts->ts_real);
 
diff --git a/drivers/pps/kc.c b/drivers/pps/kc.c
index a16cea2ba980..e219db1f1c84 100644
--- a/drivers/pps/kc.c
+++ b/drivers/pps/kc.c
@@ -113,12 +113,10 @@ void pps_kc_event(struct pps_device *pps, struct 
pps_event_time *ts,
int event)
 {
unsigned long flags;
-   struct timespec64 real = timespec_to_timespec64(ts->ts_real);
-   struct timespec64 raw = timespec_to_timespec64(ts->ts_raw);
 
/* Pass some events to kernel consumer if activated */
spin_lock_irqsave(&pps_kc_hardpps_lock, flags);
if (pps == pps_kc_hardpps_dev && event & pps_kc_hardpps_mode)
-   hardpps(&real, &raw);
+   hardpps(&ts->ts_real, &ts->ts_raw);
spin_unlock_irqrestore(&pps_kc_hardpps_lock, flags);
 }
diff --git a/include/linux/pps_kernel.h b/include/linux/pps_kernel.h
index b2fbd62ab18d..54bf1484d41f 100644
--- a/include/linux/pps_kernel.h
+++ b/include/linux/pps_kernel.h
@@ -48,9 +48,9 @@ struct pps_source_info {
 
 struct pps_event_time {
 #ifdef CONFIG_NTP_PPS
-   struct timespec ts_raw;
+   struct timespec64 ts_raw;
 #endif /* CONFIG_NTP_PPS */
-   struct timespec ts_real;
+   struct timespec64 ts_real;
 };
 
 /* The main struct */
@@ -105,7 +105,7 @@ extern void pps_event(struct pps_device *pps,
 struct pps_device *pps_lookup_dev(void const *cookie);
 
 static inline void timespec_to_pps_ktime(struct pps_ktime *kt,
-   struct timespec ts)
+   struct timespec64 ts)
 {
kt->sec = ts.tv_sec;
kt->nsec = ts.tv_nsec;
@@ -115,29 +115,24 @@ static inline void

[PATCH 1/5] ntp/pps: use timespec64 for hardpps()

2015-09-28 Thread Arnd Bergmann

There is only one user of the hardpps function in the kernel, so
it makes sense to atomically change it over to using 64-bit
timestamps for y2038 safety. In the hardpps implementation,
we also need to change the pps_normtime structure, which is
similar to struct timespec and also requires a 64-bit
seconds portion.

This introduces two temporary variables in pps_kc_event() to
do the conversion. They will be removed again in the next step,
which seemed preferable to having a larger patch changing it
all at the same time.

Signed-off-by: Arnd Bergmann 
---
 drivers/pps/kc.c   |  4 +++-
 include/linux/timex.h  |  2 +-
 kernel/time/ntp.c  | 12 ++--
 kernel/time/ntp_internal.h |  2 +-
 kernel/time/timekeeping.c  |  2 +-
 5 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/pps/kc.c b/drivers/pps/kc.c
index e219db1f1c84..a16cea2ba980 100644
--- a/drivers/pps/kc.c
+++ b/drivers/pps/kc.c
@@ -113,10 +113,12 @@ void pps_kc_event(struct pps_device *pps, struct 
pps_event_time *ts,
int event)
 {
unsigned long flags;
+   struct timespec64 real = timespec_to_timespec64(ts->ts_real);
+   struct timespec64 raw = timespec_to_timespec64(ts->ts_raw);
 
/* Pass some events to kernel consumer if activated */
spin_lock_irqsave(&pps_kc_hardpps_lock, flags);
if (pps == pps_kc_hardpps_dev && event & pps_kc_hardpps_mode)
-   hardpps(&ts->ts_real, &ts->ts_raw);
+   hardpps(&real, &raw);
spin_unlock_irqrestore(&pps_kc_hardpps_lock, flags);
 }
diff --git a/include/linux/timex.h b/include/linux/timex.h
index 9d3f1a5b6178..39c25dbebfe8 100644
--- a/include/linux/timex.h
+++ b/include/linux/timex.h
@@ -152,7 +152,7 @@ extern unsigned long tick_nsec; /* SHIFTED_HZ 
period (nsec) */
 #define NTP_INTERVAL_LENGTH (NSEC_PER_SEC/NTP_INTERVAL_FREQ)
 
 extern int do_adjtimex(struct timex *);
-extern void hardpps(const struct timespec *, const struct timespec *);
+extern void hardpps(const struct timespec64 *, const struct timespec64 *);
 
 int read_current_timer(unsigned long *timer_val);
 void ntp_notify_cmos_timer(void);
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index df68cb875248..bd4fa6271262 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -99,7 +99,7 @@ static time64_t   ntp_next_leap_sec = 
TIME64_MAX;
 static int pps_valid;  /* signal watchdog counter */
 static long pps_tf[3]; /* phase median filter */
 static long pps_jitter;/* current jitter (ns) */
-static struct timespec pps_fbase; /* beginning of the last freq interval */
+static struct timespec64 pps_fbase; /* beginning of the last freq interval */
 static int pps_shift;  /* current interval duration (s) (shift) */
 static int pps_intcnt; /* interval counter */
 static s64 pps_freq;   /* frequency offset (scaled ns/s) */
@@ -773,13 +773,13 @@ int __do_adjtimex(struct timex *txc, struct timespec64 
*ts, s32 *time_tai)
  * pps_normtime.nsec has a range of ( -NSEC_PER_SEC / 2, NSEC_PER_SEC / 2 ]
  * while timespec.tv_nsec has a range of [0, NSEC_PER_SEC) */
 struct pps_normtime {
-   __kernel_time_t sec;/* seconds */
+   s64 sec;/* seconds */
longnsec;   /* nanoseconds */
 };
 
 /* normalize the timestamp so that nsec is in the
( -NSEC_PER_SEC / 2, NSEC_PER_SEC / 2 ] interval */
-static inline struct pps_normtime pps_normalize_ts(struct timespec ts)
+static inline struct pps_normtime pps_normalize_ts(struct timespec64 ts)
 {
struct pps_normtime norm = {
.sec = ts.tv_sec,
@@ -861,7 +861,7 @@ static long hardpps_update_freq(struct pps_normtime 
freq_norm)
pps_errcnt++;
pps_dec_freq_interval();
printk_deferred(KERN_ERR
-   "hardpps: PPSERROR: interval too long - %ld s\n",
+   "hardpps: PPSERROR: interval too long - %lld s\n",
freq_norm.sec);
return 0;
}
@@ -948,7 +948,7 @@ static void hardpps_update_phase(long error)
  * This code is based on David Mills's reference nanokernel
  * implementation. It was mostly rewritten but keeps the same idea.
  */
-void __hardpps(const struct timespec *phase_ts, const struct timespec *raw_ts)
+void __hardpps(const struct timespec64 *phase_ts, const struct timespec64 
*raw_ts)
 {
struct pps_normtime pts_norm, freq_norm;
 
@@ -969,7 +969,7 @@ void __hardpps(const struct timespec *phase_ts, const 
struct timespec *raw_ts)
}
 
/* ok, now we have a base for frequency calculation */
-   freq_norm = pps_normalize_ts(timespec_sub(*raw_ts, pps_fbase));
+   freq_norm = pps_normalize_ts(timespec64_sub(*raw_ts, pps_fbase));
 
/* check that the signal is in the range
 * [1s - MAXFREQ us, 1s + MAXFREQ us], otherwise reject it */
diff --git a/kernel/time/ntp_internal.h

Re: [PATCH net-next 0/6] make non-modular code explicitly non-modular

2015-09-28 Thread Paul Gortmaker

[Re: [PATCH net-next 0/6] make non-modular code explicitly non-modular] On 
28/09/2015 (Mon 23:09) Geert Uytterhoeven wrote:

> Hi Paul,
> 
> On Mon, Sep 28, 2015 at 9:51 PM, Paul Gortmaker
>  wrote:
> > In a previous merge window, we made changes to allow better
> > delineation between modular and non-modular code in commit
> > 0fd972a7d91d6e15393c449492a04d94c0b89351 ("module: relocate module_init
> > from init.h to module.h").  This allows us to now ensure module code
> > looks modular and non-modular code does not accidentally look modular
> > just to avoid suffering build breakage.
> >
> > Here we target code that is, by nature of their Makefile and/or
> > Kconfig settings, only available to be built-in, but implicitly
> > presenting itself as being possibly modular by way of using modular
> > headers, macros, and functions.
> >
> > The goal here is to remove that illusion of modularity from these
> > files, but in a way that leaves the actual runtime unchanged.
> > In doing so, we remove code that has never been tested and adds
> > no value to the tree.  And we continue the process of expecting a
> > level of consistency between the Kconfig/Makefile of code and the
> > code in use itself.
> >
> > Fortunately the net subsystem has relatively few instances, given
> > the overall amount of code and drivers it contains.  For comparison
> > there are over 300 instances tree wide, resulting in a possible net
> > removal of on the order of 5000 lines of unused code.
> >
> > Build tested on net-next 34c2d9fb0498 on m68k, since that is the arch
> > where the three ethernet drivers changed here are available.
> 
> >   net/ethernet: make amd/hplance.c driver explicitly non-modular
> >   net/ethernet: make 8390/mac8390.c driver explicitly non-modular
> >   net/ethernet: make apple/macmace.c driver explicitly non-modular
> 
> Why did you choose this approach?
> What about changing the "bool"s to "tristate"s in Kconfig instead?

Long answer is here:

https://lkml.org/lkml/2015/8/24/888

To summarize, it adds functionality to code I can't test, and with 300
or so of these, it already has been a large time sink.  Add to that
extending the functionality and testing the new functionality, and it
does not scale.   Plus if something hasn't allowed tristate for over
10 years, where is the value in adding it now?

> I gave it a try, and with some small changes the three m68k ethernet drivers
> build fine as modular drivers. I can send patches if you like it.

Per above, I don't see the value in it, but if you want to do it and
test it and own submitting the patches, then I can drop the corresponding
ones from my queue.  Either way we get the code matching the Kconfig
which is what I'm after out of this.

Note that if you do decide to do this, the one driver really needs more
than just a one-line tristate change; it has super ancient init code that
predates module_init and probably needs an update.

Thanks,
Paul.
--

> 
> Thanks!
> 
> Gr{oetje,eeting}s,
> 
> Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- 
> ge...@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like 
> that.
> -- Linus Torvalds

Re: unregister_netdevice warnings when deleting netns

2015-09-28 Thread Eric W. Biederman

Julian Anastasov  writes:

>   Hello,
>
> On Mon, 28 Sep 2015, Anand Gurram wrote:
>
>> I am currently using kernel version 3.16.7 on a linux switch.
>> While creating and destroying network namespaces I am observing below logs
>> on the console
>> "unregister_netdevice: waiting for lo to become free. Usage count = 1"
>> 
>> Can you please suggest and provide instructions on how to debug this issue.
>> If any fix already available can you please point me to the link.
>
>   There are two commits from Linux 4.2 that may help:
>
> commit e9e4dd3267d0 ("net: do not process device backlog during 
> unregistration")
> commit 2c17d27c36dc ("net: call rcu_read_lock early in process_backlog")
>
>   For now I see them only in 3.2.71+ and 3.12.48+.
> I think, they will appear in other stable versions too...

If that message repeats indefinitely it means there is a leaked
reference to the network namespace's lo device.

If the message just spits out a few times and then goes away it simply
means that something is taking a while to clean up and drop its
reference.

This is slightly complicated by the fact that it is not uncommon when a
network device goes away to redirect all references to itself to the lo
device.

Eric

[PATCH 3/5] ntp: use timespec64 in sync_cmos_clock

2015-09-28 Thread Arnd Bergmann

The sync_cmos_clock has one use of struct timespec, which we want to
eventually replace with timespec64 or similar in the kernel. There
is no way this one can overflow, but the conversion to timespec64
is trivial and has no other dependencies.

Signed-off-by: Arnd Bergmann 
---
 kernel/time/ntp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index bd4fa6271262..149cc8086aea 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -509,7 +509,7 @@ static DECLARE_DELAYED_WORK(sync_cmos_work, 
sync_cmos_clock);
 static void sync_cmos_clock(struct work_struct *work)
 {
struct timespec64 now;
-   struct timespec next;
+   struct timespec64 next;
int fail = 1;
 
/*
@@ -559,7 +559,7 @@ static void sync_cmos_clock(struct work_struct *work)
next.tv_nsec -= NSEC_PER_SEC;
}
queue_delayed_work(system_power_efficient_wq,
-  &sync_cmos_work, timespec_to_jiffies(&next));
+  &sync_cmos_work, timespec64_to_jiffies(&next));
 }
 
 void ntp_notify_cmos_timer(void)
-- 
2.1.0.rc2


[PATCH 0/5] y2038 conversion for ntp/pps and sfc driver

2015-09-28 Thread Arnd Bergmann

When trying to build a kernel with time_t commented out, I found that
the ntp subsystem still relies on timespec for its pps handling.

This series addresses this and converts all the code to use timespec64
instead, step by step. There is one device driver that interacts with
this code directly (rather than only through the ptp subsystem), so
I have to convert that driver at the same time.

The patches should ideally stay together as a series, but they do
span multiple subsystems, so I'm also looking for the right person
to merge them.

Please review.

Thanks,

Arnd

Arnd Bergmann (5):
  ntp/pps: use timespec64 for hardpps()
  ntp/pps: replace getnstime_raw_and_real with 64-bit version
  ntp: use timespec64 in sync_cmos_clock
  ntp/pps: use y2038 safe types in pps_event_time
  net: sfc: avoid using timespec

 drivers/net/ethernet/sfc/ptp.c | 30 +++---
 drivers/pps/kapi.c |  4 ++--
 include/linux/pps_kernel.h | 16 
 include/linux/timekeeping.h|  4 ++--
 include/linux/timex.h  |  2 +-
 kernel/time/ntp.c  | 16 
 kernel/time/ntp_internal.h |  2 +-
 kernel/time/timekeeping.c  | 14 +++---
 8 files changed, 44 insertions(+), 44 deletions(-)

-- 
2.1.0.rc2


Re: [Cluster-devel] [PATCH 17/23] dlm: use per-attribute show and store methods

2015-09-28 Thread David Teigland

On Fri, Sep 25, 2015 at 06:49:54AM -0700, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dlm/config.c | 288 
> +++-
>  1 file changed, 74 insertions(+), 214 deletions(-)

Looks good to me.
Dave

[PATCH] ipvs: Don't protect ip_vs_addr_is_unicast with CONFIG_SYSCTL

2015-09-28 Thread Eric W. Biederman


I arranged the code so that the compiler can remove the unnecessary bits
in ip_vs_leave when CONFIG_SYSCTL is unset, and removed an explicit
CONFIG_SYSCTL.

Unfortunately when rebasing my work on top of that of Alex Gartrell I
missed the fact that the newly added function ip_vs_addr_is_unicast was
surrounded by CONFIG_SYSCTL.

So remove the now unnecessary CONFIG_SYSCTL guards around
ip_vs_addr_is_unicast.  They are causing build failures today when
CONFIG_SYSCTL is not selected.  Any self-respecting compiler will
notice that sysctl_cache_bypass is always false without CONFIG_SYSCTL
and will not include the logic from the function ip_vs_addr_is_unicast
in the compiled code.

Signed-off-by: "Eric W. Biederman" 
---

This is a build fix for ipvs-next and nf-next.

 net/netfilter/ipvs/ip_vs_core.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 07a791ecdfba..fba73db81d2f 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -547,7 +547,6 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff 
*skb,
return cp;
 }
 
-#ifdef CONFIG_SYSCTL
 static inline int ip_vs_addr_is_unicast(struct net *net, int af,
union nf_inet_addr *addr)
 {
@@ -557,7 +556,6 @@ static inline int ip_vs_addr_is_unicast(struct net *net, 
int af,
 #endif
return (inet_addr_type(net, addr->ip) == RTN_UNICAST);
 }
-#endif
 
 /*
  *  Pass or drop the packet.
-- 
2.2.1
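
The reasoning in the commit message above can be sketched outside the kernel. This is only an illustration under stated assumptions: `DEMO_SYSCTL` stands in for CONFIG_SYSCTL, and all `demo_*` names and the multicast test are hypothetical, not the ipvs code. The point is that the helper is defined unconditionally, so the caller builds in both configurations, while the constant-false accessor lets the compiler discard the call.

```c
/* Build with -DDEMO_SYSCTL to model a kernel with CONFIG_SYSCTL=y. */
#ifdef DEMO_SYSCTL
static int demo_cache_bypass_enabled = 1;
static inline int demo_sysctl_cache_bypass(void)
{
	return demo_cache_bypass_enabled;
}
#else
/* Without the option the accessor is a compile-time constant 0. */
static inline int demo_sysctl_cache_bypass(void)
{
	return 0;
}
#endif

static int demo_unicast_calls;	/* counts how often the helper ran */

/* Defined in both configurations, so callers always compile.
 * Treats 224.0.0.0/4 as non-unicast, purely for the demo. */
static inline int demo_addr_is_unicast(unsigned int addr)
{
	demo_unicast_calls++;
	return (addr >> 28) != 0xe;
}

static int demo_should_bypass(unsigned int addr)
{
	/* With a constant-false left operand, the right side is never
	 * evaluated and an optimizing compiler can drop it entirely. */
	return demo_sysctl_cache_bypass() && demo_addr_is_unicast(addr);
}
```

Built without `-DDEMO_SYSCTL`, `demo_should_bypass()` always returns 0 and the helper is never called, which mirrors why the `#ifdef` around the real helper is not needed to keep it out of the compiled code.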


Re: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk()

2015-09-28 Thread Alexander Duyck


On 09/28/2015 05:26 AM, Jesper Dangaard Brouer wrote:

For practical use-cases it is beneficial to prefetch the next freelist
object in bulk allocation loop.

Micro benchmarking shows approx 1 cycle change:

bulk -  prev-patch -  this patch
1 -  49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0
2 -  30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1
3 -  23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2
4 -  20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2
8 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
   16 -  17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1
   30 -  18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1
   32 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
   34 -  23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1
   48 -  21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1
   64 -  20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1
  128 -  27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0
  158 -  30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0
  250 -  37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0

Note, benchmark done with slab_nomerge to keep it stable enough
for accurate comparison.

Signed-off-by: Jesper Dangaard Brouer 
---
  mm/slub.c |2 ++
  1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index c25717ab3b5a..5af75a618b91 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
flags, size_t size,
goto error;
  
  			c = this_cpu_ptr(s->cpu_slab);

+   prefetch_freepointer(s, c->freelist);
continue; /* goto for-loop */
}
  
@@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,

goto error;
  
  		c->freelist = get_freepointer(s, object);

+   prefetch_freepointer(s, c->freelist);
p[i] = object;
  
  		/* kmem_cache debug support */




I can see the prefetch in the last item case being possibly useful since 
you have time between when you call the prefetch and when you are 
accessing the next object.  However, is there any actual benefit to 
prefetching inside the loop itself?  Based on your data above it doesn't 
seem like that is the case since you are now adding one additional cycle 
to the allocation and I am not seeing any actual gain reported here.


- Alex
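
For readers unfamiliar with the pattern under discussion, here is a small user-space sketch of prefetching the next object while the current one is being handled. It is not the SLUB code: `struct demo_obj` and `demo_walk` are made up for illustration, and `__builtin_prefetch` is the GCC/Clang builtin rather than the kernel's prefetch_freepointer(). The sketch shows only the shape of the loop and says nothing about whether the hint pays off, which is exactly the question raised above.

```c
#include <stddef.h>

/* Hypothetical object with an embedded next pointer, loosely
 * resembling a freelist where each free object links to the next. */
struct demo_obj {
	struct demo_obj *next;
	long payload;
};

/* Walk the list, hinting the CPU to fetch the next object while the
 * current one is processed. The prefetch is only a hint and does not
 * fault, so passing a NULL next pointer is harmless. */
static long demo_walk(struct demo_obj *head)
{
	long sum = 0;
	struct demo_obj *o;

	for (o = head; o; o = o->next) {
		__builtin_prefetch(o->next);
		sum += o->payload;
	}
	return sum;
}
```

The hint can only help if there is enough work between issuing it and touching the next object. In a tight loop like this one the gap is small, which is consistent with the benchmark numbers showing no gain.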

Re: [PATCH 4/7] slab: implement bulking for SLAB allocator

2015-09-28 Thread Christoph Lameter

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> +/* Note that interrupts must be enabled when calling this function. */
>  bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> - void **p)
> +void **p)
>  {
> - return __kmem_cache_alloc_bulk(s, flags, size, p);
> + size_t i;
> +
> + local_irq_disable();
> + for (i = 0; i < size; i++) {
> + void *x = p[i] = slab_alloc(s, flags, _RET_IP_, false);
> +
> + if (!x) {
> + __kmem_cache_free_bulk(s, i, p);
> + return false;
> + }
> + }
> + local_irq_enable();
> + return true;
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk);
>

Ok the above could result in excessive times when the interrupts are
kept off.  Let's say someone is freeing 1000 objects?

> +/* Note that interrupts must be enabled when calling this function. */
> +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> + size_t i;
> +
> + local_irq_disable();
> + for (i = 0; i < size; i++)
> + __kmem_cache_free(s, p[i], false);
> + local_irq_enable();
> +}
> +EXPORT_SYMBOL(kmem_cache_free_bulk);

Same concern here. We may just have to accept this for now.

Acked-by: Christoph Lameter 

Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists

2015-09-28 Thread Christoph Lameter

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> Not knowing SLUB as well as you, it took me several hours to realize
> init_object() didn't overwrite the freepointer in the object.  Thus, I
> think these comments make the reader aware of not-so-obvious
> side-effects of SLAB_POISON and SLAB_RED_ZONE.

From the source:

/*
 * Object layout:
 *
 * object address
 *  Bytes of the object to be managed.
 *  If the freepointer may overlay the object then the free
 *  pointer is the first word of the object.
 *
 *  Poisoning uses 0x6b (POISON_FREE) and the last byte is
 *  0xa5 (POISON_END)
 *
 * object + s->object_size
 *  Padding to reach word boundary. This is also used for Redzoning.
 *  Padding is extended by another word if Redzoning is enabled and
 *  object_size == inuse.
 *
 *  We fill with 0xbb (RED_INACTIVE) for inactive objects and with
 *  0xcc (RED_ACTIVE) for objects in use.
 *
 * object + s->inuse
 *  Meta data starts here.
 *
 *  A. Free pointer (if we cannot overwrite object on free)
 *  B. Tracking data for SLAB_STORE_USER
 *  C. Padding to reach required alignment boundary or at mininum
 *  one word if debugging is on to be able to detect writes
 *  before the word boundary.
 *
 *  Padding is done using 0x5a (POISON_INUSE)
 *
 * object + s->size
 *  Nothing is used beyond s->size.
 *
 * If slabcaches are merged then the object_size and inuse boundaries are
 * mostly ignored. And therefore no slab options that rely on these
 * boundaries may be used with merged slabcaches.
 */
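
The layout described in the comment above can be modelled in user space. The following is a sketch only: `demo_init_object` and the `DEMO_*` constants are illustrative names, with the byte values taken from the comment (0x6b, 0xa5, 0xbb). It assumes the free pointer lives at or after `inuse`, which is the case the thread discusses for poisoned objects.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_POISON_FREE	0x6b	/* fill for a free object's bytes */
#define DEMO_POISON_END		0xa5	/* last byte of the object */
#define DEMO_RED_INACTIVE	0xbb	/* redzone value for inactive objects */

/* Poison [0, object_size) and fill the redzone [object_size, inuse).
 * Nothing at or beyond 'inuse' is written, so metadata stored there
 * (such as a relocated free pointer) survives. */
static void demo_init_object(unsigned char *obj, size_t object_size,
			     size_t inuse, unsigned char redzone_val)
{
	memset(obj, DEMO_POISON_FREE, object_size - 1);
	obj[object_size - 1] = DEMO_POISON_END;
	memset(obj + object_size, redzone_val, inuse - object_size);
}
```

With `object_size` = 13 and `inuse` = 16, bytes 0 to 11 become 0x6b, byte 12 becomes 0xa5, bytes 13 to 15 become the redzone value, and byte 16 onward is left alone.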


Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists

2015-09-28 Thread Christoph Lameter

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index 1cf98d89546d..13b5f53e4840 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void 
> *object, u8 val)
>  {
>   u8 *p = object;
>
> + /* Freepointer not overwritten as SLAB_POISON moved it after object */
>   if (s->flags & __OBJECT_POISON) {
>   memset(p, POISON_FREE, s->object_size - 1);
>   p[s->object_size - 1] = POISON_END;
>   }
>
> + /*
> +  * If both SLAB_RED_ZONE and SLAB_POISON are enabled, then
> +  * freepointer is still safe, as then s->offset equals
> +  * s->inuse and below redzone is after s->object_size and only
> +  * area between s->object_size and s->inuse.
> +  */
>   if (s->flags & SLAB_RED_ZONE)
>   memset(p + s->object_size, val, s->inuse - s->object_size);
>  }

Are these comments really adding something? This is basic metadata
handling for SLUB that is commented on elsewhere.

> @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
>   * So we still attempt to reduce cache line usage. Just take the slab
>   * lock and free the item. If there is no additional partial page
>   * handling required then we can return immediately.
> + *
> + * Bulk free of a freelist with several objects (all pointing to the
> + * same page) possible by specifying freelist_head ptr and object as
> + * tail ptr, plus objects count (cnt).
>   */
>  static void __slab_free(struct kmem_cache *s, struct page *page,
> - void *x, unsigned long addr)
> + void *x, unsigned long addr,
> + void *freelist_head, int cnt)

Do you really need separate parameters for freelist_head? If you just want
to deal with one object pass it as freelist_head and set cnt = 1?

RE: [PATCH v2] net: sctp: Don't use 64 kilobyte lookup table for four elements

2015-09-28 Thread David Laight

From: Eric Dumazet
> Sent: 28 September 2015 15:27
> On Mon, 2015-09-28 at 14:12 +, David Laight wrote:
> > From: Neil Horman
> > > Sent: 28 September 2015 14:51
> > > On Mon, Sep 28, 2015 at 02:34:04PM +0200, Denys Vlasenko wrote:
> > > > Seemingly innocuous sctp_trans_state_to_prio_map[] array
> > > > is way bigger than it looks, since
> > > > "[SCTP_UNKNOWN] = 2" expands into "[0xffff] = 2" !
> > > >
> > > > This patch replaces it with switch() statement.
> >
> > What about just adding 1 (and masking) before indexing the array?
> > That might require a static inline function with a local static array.
> >
> > Or define the array as (say) [16] and just mask the state before using
> > it as an index?
> 
> Just let the compiler do its job, instead of obfuscating source.
> 
> Compilers can transform a switch into an (optimal) table if it is really
> a gain.

The compiler can choose between a jump table and nested ifs for a switch
statement. I've never seen it convert one into a data array index.

David
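
The options weighed in this thread can be put side by side in a small
userspace sketch. All names and state values below are hypothetical and
chosen only for illustration; the one thing taken from the thread is that
a designated initializer with a very large index makes the array span
every slot up to that index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state values, chosen only for illustration. */
enum demo_state {
	DEMO_INACTIVE = 0,
	DEMO_PF       = 1,
	DEMO_ACTIVE   = 2,
	DEMO_UNKNOWN  = 0xffff,
};

/* Designated initializers: the array has to reach index 0xffff, so it
 * occupies 0x10000 bytes although only four slots carry a value. */
static const uint8_t demo_prio_map[] = {
	[DEMO_ACTIVE]   = 3,
	[DEMO_UNKNOWN]  = 2,
	[DEMO_PF]       = 1,
	[DEMO_INACTIVE] = 0,
};

/* Switch form: no table in the source, the compiler picks the code shape. */
static inline uint8_t demo_prio_switch(enum demo_state state)
{
	switch (state) {
	case DEMO_ACTIVE:   return 3;
	case DEMO_UNKNOWN:  return 2;
	case DEMO_PF:       return 1;
	case DEMO_INACTIVE: return 0;
	}
	return 0;
}

/* "Add 1 and mask" form: 0xffff + 1 masks down to 0 and the other
 * states move up by one, so a four-entry local table is enough. */
static inline uint8_t demo_prio_masked(enum demo_state state)
{
	static const uint8_t map[4] = { 2, 0, 1, 3 };

	return map[((unsigned int)state + 1) & 3];
}
```

Both replacements return the same priorities as the sparse table; which
one generates better code is exactly what the thread is arguing about and
would have to be checked in the object code.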

Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists

2015-09-28 Thread Jesper Dangaard Brouer

On Mon, 28 Sep 2015 10:16:49 -0500 (CDT)
Christoph Lameter  wrote:

> On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 1cf98d89546d..13b5f53e4840 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void 
> > *object, u8 val)
> >  {
> > u8 *p = object;
> >
> > +   /* Freepointer not overwritten as SLAB_POISON moved it after object */
> > if (s->flags & __OBJECT_POISON) {
> > memset(p, POISON_FREE, s->object_size - 1);
> > p[s->object_size - 1] = POISON_END;
> > }
> >
> > +   /*
> > +* If both SLAB_RED_ZONE and SLAB_POISON are enabled, then
> > +* freepointer is still safe, as then s->offset equals
> > +* s->inuse and below redzone is after s->object_size and only
> > +* area between s->object_size and s->inuse.
> > +*/
> > if (s->flags & SLAB_RED_ZONE)
> > memset(p + s->object_size, val, s->inuse - s->object_size);
> >  }
> 
> Are these comments really adding something? This is basic metadata
> handling for SLUB that is commented on elsewhere.

Not knowing SLUB as well as you, it took me several hours to realize
init_object() didn't overwrite the freepointer in the object.  Thus, I
think these comments make the reader aware of not-so-obvious
side-effects of SLAB_POISON and SLAB_RED_ZONE.


> > @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
> >   * So we still attempt to reduce cache line usage. Just take the slab
> >   * lock and free the item. If there is no additional partial page
> >   * handling required then we can return immediately.
> > + *
> > + * Bulk free of a freelist with several objects (all pointing to the
> > + * same page) possible by specifying freelist_head ptr and object as
> > + * tail ptr, plus objects count (cnt).
> >   */
> >  static void __slab_free(struct kmem_cache *s, struct page *page,
> > -   void *x, unsigned long addr)
> > +   void *x, unsigned long addr,
> > +   void *freelist_head, int cnt)
> 
> Do you really need separate parameters for freelist_head? If you just want
> to deal with one object pass it as freelist_head and set cnt = 1?

Yes, I need it.  We need to know both the head and tail of the list to
splice it.

See:

> @@ -2612,7 +2681,7 @@ static void __slab_free(struct kmem_cache *s, struct 
> page *page,
prior = page->freelist;
counters = page->counters;
>   set_freepointer(s, object, prior);
   ^^ 
Here we update the tail ptr (object) to point to "prior" (page->freelist).

>   new.counters = counters;
>   was_frozen = new.frozen;
> - new.inuse--;
> + new.inuse -= cnt;
>   if ((!new.inuse || !prior) && !was_frozen) {
>  
>   if (kmem_cache_has_cpu_partial(s) && !prior) {
> @@ -2643,7 +2712,7 @@ static void __slab_free(struct kmem_cache *s, struct 
> page *page,
>  
>   } while (!cmpxchg_double_slab(s, page,
>   prior, counters,
> - object, new.counters,
> + new_freelist, new.counters,
>   "__slab_free"));

Here we update page->freelist ("prior") to point to the head. Thus,
splicing the list.
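
The head/tail splice described above can be sketched in userspace C. The
struct and function names are hypothetical stand-ins, not SLUB's own, and
the sketch assumes a plain pointer store where the kernel uses
cmpxchg_double_slab().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for SLUB objects and pages. */
struct demo_obj {
	struct demo_obj *next;   /* plays the role of the in-object freepointer */
};

struct demo_page {
	struct demo_obj *freelist;
	int inuse;               /* objects currently allocated from this page */
};

/* Splice a detached list [head .. tail] of cnt objects onto the page
 * freelist. Only the tail's next pointer and the page's list head are
 * written; the detached list itself is never walked. */
static void demo_free_list(struct demo_page *page,
			   struct demo_obj *head, struct demo_obj *tail,
			   int cnt)
{
	struct demo_obj *prior = page->freelist;

	tail->next = prior;      /* like set_freepointer(s, object, prior) */
	page->freelist = head;   /* assumption: plain store, no cmpxchg here */
	page->inuse -= cnt;
}

/* Helper for checking the result: count objects on the page freelist. */
static int demo_freelist_len(const struct demo_page *page)
{
	int n = 0;

	for (const struct demo_obj *o = page->freelist; o; o = o->next)
		n++;
	return n;
}
```

Freeing a single object is the degenerate case with head == tail and
cnt == 1.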

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk()

2015-09-28 Thread Jesper Dangaard Brouer


On Mon, 28 Sep 2015 07:53:16 -0700 Alexander Duyck  
wrote:

> On 09/28/2015 05:26 AM, Jesper Dangaard Brouer wrote:
> > For practical use-cases it is beneficial to prefetch the next freelist
> > object in bulk allocation loop.
> >
> > Micro benchmarking show approx 1 cycle change:
> >
> > bulk -  prev-patch -  this patch
> > 1 -  49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0
> > 2 -  30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1
> > 3 -  23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2
> > 4 -  20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2
> > 8 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
> >16 -  17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1
> >30 -  18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1
> >32 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
> >34 -  23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1
> >48 -  21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1
> >64 -  20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1
> >   128 -  27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0
> >   158 -  30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0
> >   250 -  37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0
> >
> > Note, benchmark done with slab_nomerge to keep it stable enough
> > for accurate comparison.
> >
> > Signed-off-by: Jesper Dangaard Brouer 
> > ---
> >   mm/slub.c |2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index c25717ab3b5a..5af75a618b91 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, 
> > gfp_t flags, size_t size,
> > goto error;
> >   
> > c = this_cpu_ptr(s->cpu_slab);
> > +   prefetch_freepointer(s, c->freelist);
> > continue; /* goto for-loop */
> > }
> >   
> > @@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, 
> > gfp_t flags, size_t size,
> > goto error;
> >   
> > c->freelist = get_freepointer(s, object);
> > +   prefetch_freepointer(s, c->freelist);
> > p[i] = object;
> >   
> > /* kmem_cache debug support */
> >
> 
> I can see the prefetch in the last item case being possibly useful since 
> you have time between when you call the prefetch and when you are 
> accessing the next object.  However, is there any actual benefit to 
> prefetching inside the loop itself?  Based on your data above it doesn't 
> seem like that is the case since you are now adding one additional cycle 
> to the allocation and I am not seeing any actual gain reported here.

The gain will only show up when bulk alloc is used in real use-cases.

As you know, bulk alloc on the RX path doesn't show any improvement. I
measured L1 misses there (with perf-mem-record), and I could reduce
them by adding a prefetch.  But I cannot remember whether I measured
any PPS improvement from this.

As you hint, the time between my prefetch and the use is very small,
so the question is whether this will show any benefit in real use-cases.

We could drop this patch, and I'll then include it in my network
use-case and measure the effect?  (Although I'll likely be wasting my
time, as we should probably redesign the alloc API instead.)
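
For readers who want to experiment with the pattern outside the kernel,
here is a minimal userspace sketch of prefetching the next freelist entry
inside a bulk-pop loop. The names are hypothetical, and
__builtin_prefetch() (a GCC/Clang builtin) stands in for
prefetch_freepointer(). Whether it helps depends on how much work happens
between the prefetch and the next access, which is the open question
above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free object: the link lives inside the object itself. */
struct demo_node {
	struct demo_node *next;
};

/* Pop up to 'want' nodes from *freelist into out[], prefetching the
 * following node so its cache line may already be loaded when the
 * next iteration dereferences it. Returns the number of nodes popped. */
static size_t demo_alloc_bulk(struct demo_node **freelist,
			      struct demo_node **out, size_t want)
{
	size_t i;

	for (i = 0; i < want; i++) {
		struct demo_node *obj = *freelist;

		if (!obj)
			break;
		*freelist = obj->next;
		/* A hint only; per the GCC documentation a data prefetch
		 * does not fault on an invalid address such as NULL. */
		__builtin_prefetch(*freelist);
		out[i] = obj;
	}
	return i;
}
```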

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH] e1000: fix e1000e_disable_aspm_locked() warning

2015-09-28 Thread Dave Hansen

On 08/31/2015 02:26 PM, Dave Hansen wrote:
> From: Dave Hansen 
> 
> I have a .config with CONFIG_PM disabled.  I get the following whenever
> compiling the e1000 driver:
> 
> ...net/ethernet/intel/e1000e/netdev.c:6450:13: warning: 
> 'e1000e_disable_aspm_locked' defined but not used [-Wunused-function]
>  static void e1000e_disable_aspm_locked(struct pci_dev *pdev, u16 state)
> 
> Looks like we just need to move e1000e_disable_aspm_locked() to
> be underneath the CONFIG_PM #ifdef.

This patch:

[2758f9edb]: e1000e: Fix incorrect ASPM locking

established a new caller for e1000e_disable_aspm_locked() which makes my
patch useless and wrong (it breaks the compile).

I believe we should just revert my patch.
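
The general situation can be reproduced in a few lines of standalone C.
Everything here is a hypothetical stand-in (DEMO_CONFIG_PM for CONFIG_PM,
the demo_* functions for the driver code); the attribute shown is what the
kernel spells __maybe_unused, and it is offered only as one way to silence
-Wunused-function without moving the helper under the #ifdef.

```c
#include <assert.h>

/* Hypothetical config switch standing in for CONFIG_PM; left undefined. */
/* #define DEMO_CONFIG_PM 1 */

/* With the attribute, GCC and Clang stay quiet even if every caller is
 * compiled out. */
static int __attribute__((unused)) demo_disable_aspm(int state)
{
	return state & ~0x3;   /* clear two hypothetical ASPM bits */
}

#ifdef DEMO_CONFIG_PM
/* The original, config-dependent caller. */
static int demo_suspend(int state)
{
	return demo_disable_aspm(state);
}
#endif

/* A second caller outside the #ifdef. Once such a caller exists, moving
 * the helper under the #ifdef breaks the build when the option is off. */
static int demo_probe(int state)
{
	return demo_disable_aspm(state);
}
```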

[net-next PATCH v2] netpoll: Drop budget parameter from NAPI polling call hierarchy

2015-09-28 Thread Alexander Duyck

For some reason we were carrying the budget value around between the
various calls to napi->poll.  If for example one of the drivers called had
a bug in which it returned a non-zero value for work this could result in
the budget value becoming negative.

Rather than carry around a value of budget that is 0 or less we can instead
just loop through and pass 0 to each napi->poll call.  If any driver
returns a value for work done that is non-zero then we can report that
driver and continue rather than allowing a bad actor to make the budget
value negative and pass that negative value to napi->poll.

Note, the only actual change here is that instead of letting budget become
negative we are keeping it at 0 regardless of the value returned for work
since it should not be possible for the polling routine to do any actual
work with a budget of 0.  So if the polling routine returns a non-0 value
we are just reporting it and continuing with a budget of 0 rather than
letting that work value be subtracted from the budget of 0.

Signed-off-by: Alexander Duyck 
---

v2: Rebased patch to incorporate latest changes to poll_one_napi.
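
For illustration only, the intended behaviour can be modelled in userspace
C. The struct and callbacks are hypothetical miniatures rather than the
kernel's napi_struct, and a flag stands in for WARN_ONCE().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a NAPI instance: just a poll callback. */
struct demo_napi {
	int (*poll)(struct demo_napi *napi, int budget);
	int misbehaved;   /* set when poll reports work despite budget 0 */
};

/* Poll once with a fixed budget of 0 and flag drivers that claim work. */
static void demo_poll_one(struct demo_napi *napi)
{
	int work = napi->poll(napi, 0);

	if (work)
		napi->misbehaved = 1;   /* the patch uses WARN_ONCE() here */
}

/* Poll every instance; no value is carried from one call to the next,
 * so one bad driver cannot change what the others are passed. */
static void demo_poll_all(struct demo_napi *napis, size_t n)
{
	for (size_t i = 0; i < n; i++)
		demo_poll_one(&napis[i]);
}

/* Two sample callbacks for exercising the sketch. */
static int demo_good_poll(struct demo_napi *napi, int budget)
{
	(void)napi;
	return budget > 0 ? 1 : 0;   /* honours the budget */
}

static int demo_bad_poll(struct demo_napi *napi, int budget)
{
	(void)napi;
	(void)budget;
	return 5;                    /* ignores the budget */
}
```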

 net/core/netpoll.c |   23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 8bdada242a7d..94acfc89ad97 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -140,7 +140,7 @@ static void queue_process(struct work_struct *work)
  * case. Further, we test the poll_owner to avoid recursion on UP
  * systems where the lock doesn't exist.
  */
-static int poll_one_napi(struct napi_struct *napi, int budget)
+static void poll_one_napi(struct napi_struct *napi)
 {
int work = 0;
 
@@ -149,33 +149,33 @@ static int poll_one_napi(struct napi_struct *napi, int 
budget)
 * holding the napi->poll_lock.
 */
if (!test_bit(NAPI_STATE_SCHED, &napi->state))
-   return budget;
+   return;
 
/* If we set this bit but see that it has already been set,
 * that indicates that napi has been disabled and we need
 * to abort this operation
 */
if (test_and_set_bit(NAPI_STATE_NPSVC, &napi->state))
-   goto out;
+   return;
 
-   work = napi->poll(napi, budget);
-   WARN_ONCE(work > budget, "%pF exceeded budget in poll\n", napi->poll);
+   /* We explicitly pass the polling call a budget of 0 to
+* indicate that we are clearing the Tx path only.
+*/
+   work = napi->poll(napi, 0);
+   WARN_ONCE(work, "%pF exceeded budget in poll\n", napi->poll);
trace_napi_poll(napi);
 
clear_bit(NAPI_STATE_NPSVC, &napi->state);
-
-out:
-   return budget - work;
 }
 
-static void poll_napi(struct net_device *dev, int budget)
+static void poll_napi(struct net_device *dev)
 {
struct napi_struct *napi;
 
list_for_each_entry(napi, &dev->napi_list, dev_list) {
if (napi->poll_owner != smp_processor_id() &&
spin_trylock(&napi->poll_lock)) {
-   budget = poll_one_napi(napi, budget);
+   poll_one_napi(napi);
spin_unlock(&napi->poll_lock);
}
}
@@ -185,7 +185,6 @@ static void netpoll_poll_dev(struct net_device *dev)
 {
const struct net_device_ops *ops;
struct netpoll_info *ni = rcu_dereference_bh(dev->npinfo);
-   int budget = 0;
 
/* Don't do any rx activity if the dev_lock mutex is held
 * the dev_open/close paths use this to block netpoll activity
@@ -208,7 +207,7 @@ static void netpoll_poll_dev(struct net_device *dev)
/* Process pending work on NIC */
ops->ndo_poll_controller(dev);
 
-   poll_napi(dev, budget);
+   poll_napi(dev);
 
up(&ni->dev_lock);
 


RE: [PATCH v2] net: sctp: Don't use 64 kilobyte lookup table for four elements

2015-09-28 Thread David Laight

From: Neil Horman
> Sent: 28 September 2015 14:51
> On Mon, Sep 28, 2015 at 02:34:04PM +0200, Denys Vlasenko wrote:
> > Seemingly innocuous sctp_trans_state_to_prio_map[] array
> > is way bigger than it looks, since
> > "[SCTP_UNKNOWN] = 2" expands into "[0xffff] = 2" !
> >
> > This patch replaces it with switch() statement.

What about just adding 1 (and masking) before indexing the array?
That might require a static inline function with a local static array.

Or define the array as (say) [16] and just mask the state before using
it as an index?

David


Re: [PATCH v2] net: sctp: Don't use 64 kilobyte lookup table for four elements

2015-09-28 Thread Eric Dumazet

On Mon, 2015-09-28 at 14:12 +, David Laight wrote:
> From: Neil Horman
> > Sent: 28 September 2015 14:51
> > On Mon, Sep 28, 2015 at 02:34:04PM +0200, Denys Vlasenko wrote:
> > > Seemingly innocuous sctp_trans_state_to_prio_map[] array
> > > is way bigger than it looks, since
> > > "[SCTP_UNKNOWN] = 2" expands into "[0xffff] = 2" !
> > >
> > > This patch replaces it with switch() statement.
> 
> What about just adding 1 (and masking) before indexing the array?
> That might require a static inline function with a local static array.
> 
> Or define the array as (say) [16] and just mask the state before using
> it as an index?

Just let the compiler do its job, instead of obfuscating source.

Compilers can transform a switch into an (optimal) table if it is really
a gain.



RE: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread David Laight

From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> Sent: 28 September 2015 15:27
> On Mon, 2015-09-28 at 08:58 +, David Laight wrote:
> > From: Rafael J. Wysocki
> > > Sent: 27 September 2015 15:09
> > ...
> > > > > Say you have three adjacent fields in a structure, x, y, z, each one 
> > > > > byte long.
> > > > > Initially, all of them are equal to 0.
> > > > >
> > > > > CPU A writes 1 to x and CPU B writes 2 to y at the same time.
> > > > >
> > > > > What's the result?
> > > >
> > > > I think every CPU's cache architecture guarantees adjacent store
> > > > integrity, even in the face of SMP, so it's x==1 and y==2.  If you're
> > > > thinking of old alpha SMP system where the lowest store width is 32 bits
> > > > and thus you have to do RMW to update a byte, this was usually fixed by
> > > > padding (assuming the structure is not packed).  However, it was such a
> > > > problem that even the later alpha chips had byte extensions.
> >
> > Does linux still support those old Alphas?
> >
> > The x86 cpus will also do 32bit wide rmw cycles for the 'bit' operations.
> 
> That's different: it's an atomic RMW operation.  The problem with the
> alpha was that the operation wasn't atomic (meaning that it can't be
> interrupted and no intermediate output states are visible).

It is only atomic if prefixed by the 'lock' prefix.
Normally the read and write are separate bus cycles.
 
> > You still have to ensure the compiler doesn't do wider rmw cycles.
> > I believe the recent versions of gcc won't do wider accesses for volatile 
> > data.
> 
> I don't understand this comment.  You seem to be implying gcc would do a
> 64 bit RMW for a 32 bit store ... that would be daft when a single
> instruction exists to perform the operation on all architectures.

Read the object code and weep...
It is most likely to happen for operations that are rmw (eg bit set).
For instance the arm cpu has limited offsets for 16bit accesses, for
normal structures the compiler is likely to use a 32bit rmw sequence
for a 16bit field that has a large offset.
The C language allows the compiler to do it for any access (IIRC including
volatiles).

David

Re: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread James Bottomley

On Mon, 2015-09-28 at 14:50 +, David Laight wrote:
> From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> > Sent: 28 September 2015 15:27
> > On Mon, 2015-09-28 at 08:58 +, David Laight wrote:
> > > From: Rafael J. Wysocki
> > > > Sent: 27 September 2015 15:09
> > > ...
> > > > > > Say you have three adjacent fields in a structure, x, y, z, each 
> > > > > > one byte long.
> > > > > > Initially, all of them are equal to 0.
> > > > > >
> > > > > > CPU A writes 1 to x and CPU B writes 2 to y at the same time.
> > > > > >
> > > > > > What's the result?
> > > > >
> > > > > > I think every CPU's cache architecture guarantees adjacent store
> > > > > integrity, even in the face of SMP, so it's x==1 and y==2.  If you're
> > > > > thinking of old alpha SMP system where the lowest store width is 32 
> > > > > bits
> > > > > and thus you have to do RMW to update a byte, this was usually fixed 
> > > > > by
> > > > > padding (assuming the structure is not packed).  However, it was such 
> > > > > a
> > > > > problem that even the later alpha chips had byte extensions.
> > >
> > > Does linux still support those old Alphas?
> > >
> > > The x86 cpus will also do 32bit wide rmw cycles for the 'bit' operations.
> > 
> > That's different: it's an atomic RMW operation.  The problem with the
> > alpha was that the operation wasn't atomic (meaning that it can't be
> > interrupted and no intermediate output states are visible).
> 
> It is only atomic if prefixed by the 'lock' prefix.
> Normally the read and write are separate bus cycles.

The essential point is that x86 has atomic bit ops and byte writes.
Early alpha did not.

> > > You still have to ensure the compiler doesn't do wider rmw cycles.
> > > I believe the recent versions of gcc won't do wider accesses for volatile 
> > > data.
> > 
> > I don't understand this comment.  You seem to be implying gcc would do a
> > 64 bit RMW for a 32 bit store ... that would be daft when a single
> > instruction exists to perform the operation on all architectures.
> 
> Read the object code and weep...
> It is most likely to happen for operations that are rmw (eg bit set).
> For instance the arm cpu has limited offsets for 16bit accesses, for
> normal structures the compiler is likely to use a 32bit rmw sequence
> for a 16bit field that has a large offset.
> The C language allows the compiler to do it for any access (IIRC including
> volatiles).

I think you might be confusing different things.  Most RISC CPUs can't
do 32 bit store immediates because there aren't enough bits in their
arsenal, so they tend to split 32 bit loads into a left and right part
(first the top then the offset).  This (and other things) are mostly
what you see in code.  However, 32 bit register stores are still atomic,
which is all we require.  It's not really the compiler's fault, it's
mostly an architectural limitation.

James



Re: [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist

2015-09-28 Thread Christoph Lameter


Acked-by: Christoph Lameter 


RE: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread David Laight

From: James Bottomley 
> Sent: 28 September 2015 16:12
> > > > The x86 cpus will also do 32bit wide rmw cycles for the 'bit' 
> > > > operations.
> > >
> > > That's different: it's an atomic RMW operation.  The problem with the
> > > alpha was that the operation wasn't atomic (meaning that it can't be
> > > interrupted and no intermediate output states are visible).
> >
> > It is only atomic if prefixed by the 'lock' prefix.
> > Normally the read and write are separate bus cycles.
> 
> The essential point is that x86 has atomic bit ops and byte writes.
> Early alpha did not.

Early alpha didn't have any byte accesses.

On x86 if you have the following:
	struct {
		char  a;
		volatile char b;
	} *foo;
	foo->a |= 4;

The compiler is likely to generate a 'bis #4, 0(rbx)' (or similar)
and the cpu will do two 32bit memory cycles that read and write
the 'volatile' field 'b'.
(gcc definitely used to do this...)

A lot of fields were made 32bit (and probably not bitfields) in the linux
kernel tree a year or two ago to avoid this very problem.
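
Whether two fields can be hit by the same wide read-modify-write comes
down to whether they share an aligned word. The following sketch (all
names hypothetical) checks that with offsetof() for the byte layout above
and for a layout where each field has been widened to 32 bits. It says
nothing about what a particular compiler actually emits; that still has to
be read from the object code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout from the example above: 'a' and 'b' are adjacent bytes. */
struct demo_packed_pair {
	char a;
	volatile char b;
};

/* One possible fix: give each field its own 32-bit word. */
struct demo_split_pair {
	uint32_t a;
	volatile uint32_t b;
};

/* True if the bytes at the two offsets fall in the same aligned 32-bit
 * word, i.e. a 32-bit read-modify-write of one also rewrites the other. */
static inline int demo_same_word32(size_t off_a, size_t off_b)
{
	return (off_a / 4) == (off_b / 4);
}
```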

> > > > You still have to ensure the compiler doesn't do wider rmw cycles.
> > > > I believe the recent versions of gcc won't do wider accesses for 
> > > > volatile data.
> > >
> > > I don't understand this comment.  You seem to be implying gcc would do a
> > > 64 bit RMW for a 32 bit store ... that would be daft when a single
> > > instruction exists to perform the operation on all architectures.
> >
> > Read the object code and weep...
> > It is most likely to happen for operations that are rmw (eg bit set).
> > For instance the arm cpu has limited offsets for 16bit accesses, for
> > normal structures the compiler is likely to use a 32bit rmw sequence
> > for a 16bit field that has a large offset.
> > The C language allows the compiler to do it for any access (IIRC including
> > volatiles).
> 
> I think you might be confusing different things.  Most RISC CPUs can't
> do 32 bit store immediates because there aren't enough bits in their
> arsenal, so they tend to split 32 bit loads into a left and right part
> (first the top then the offset).  This (and other things) are mostly
> what you see in code.  However, 32 bit register stores are still atomic,
> which is all we require.  It's not really the compiler's fault, it's
> mostly an architectural limitation.

No, I'm not talking about how 32bit constants are generated.
I'm talking about structure offsets.

David


Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists

2015-09-28 Thread Christoph Lameter

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> > Do you really need separate parameters for freelist_head? If you just want
> > to deal with one object pass it as freelist_head and set cnt = 1?
>
> Yes, I need it.  We need to know both the head and tail of the list to
> splice it.

Ok so this is to avoid having to scan the list to its end? x is the end
of the list and freelist_head the beginning. That is weird.

Re: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread James Bottomley

On Mon, 2015-09-28 at 08:58 +, David Laight wrote:
> From: Rafael J. Wysocki
> > Sent: 27 September 2015 15:09
> ...
> > > > Say you have three adjacent fields in a structure, x, y, z, each one 
> > > > byte long.
> > > > Initially, all of them are equal to 0.
> > > >
> > > > CPU A writes 1 to x and CPU B writes 2 to y at the same time.
> > > >
> > > > What's the result?
> > >
> > > I think every CPU's cache architecture guarantees adjacent store
> > > integrity, even in the face of SMP, so it's x==1 and y==2.  If you're
> > > thinking of old alpha SMP system where the lowest store width is 32 bits
> > > and thus you have to do RMW to update a byte, this was usually fixed by
> > > padding (assuming the structure is not packed).  However, it was such a
> > > problem that even the later alpha chips had byte extensions.
> 
> Does linux still support those old Alphas?
> 
> The x86 cpus will also do 32bit wide rmw cycles for the 'bit' operations.

That's different: it's an atomic RMW operation.  The problem with the
alpha was that the operation wasn't atomic (meaning that it can't be
interrupted and no intermediate output states are visible).

> > OK, thanks!
> 
> You still have to ensure the compiler doesn't do wider rmw cycles.
> I believe the recent versions of gcc won't do wider accesses for volatile 
> data.

I don't understand this comment.  You seem to be implying gcc would do a
64 bit RMW for a 32 bit store ... that would be daft when a single
instruction exists to perform the operation on all architectures.

James



Re: [PATCH v2 RFC] 8139cp: Fix GSO MSS handling

2015-09-28 Thread David Woodhouse

On Sun, 2015-09-27 at 22:37 -0700, Tom Herbert wrote:
> 
> Which drivers are doing this? It is up to the driver to determine
> whether a particular packet being sent can have checksum offloaded to
> the device. If it cannot offload the checksum it must call
> skb_checksum_help.

Not so.

A driver sets the NETIF_F_IP_CSUM feature to indicate that it can do
the checksum on Legacy IP TCP or UDP frames and *nothing* else.

It most certainly does not expect to be handed any other kind of packet
for checksumming, and bad things will often happen if it is. If drivers
*do* spot that they've been given something they don't handle, I see
BUG() calls and warnings, but I don't see any of them calling
skb_checksum_help() to silently cope. Many of them just feed it to the
hardware and don't even notice at all because it's the *hardware* which
decides whether to do a TCP or a UDP checksum. So who knows what'll
happen.

The check is supposed to be done in can_checksum_protocol(), called
from harmonize_features(). But as noted, that check has false positives
and lets some inappropriate packets through — for NETIF_F_IP_CSUM it
lets through *all* skbuffs with ->protocol == ETH_P_IP instead of only
TCP and UDP.

I originally couldn't see how to deal with this except by looking at
the contents of the packet, which sucked. But I think I've found a
somewhat more acceptable approach now:
http://lists.openwall.net/netdev/2015/09/25/85
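
To make the gap concrete, here is a small standalone sketch contrasting a
check on the ethertype alone with one that also looks at the L4 protocol.
The function names are hypothetical and this is not the approach proposed
in the post linked above; it only illustrates which packets the looser
test lets through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Well-known protocol numbers; the macro names are local to this sketch. */
#define DEMO_ETH_P_IP     0x0800
#define DEMO_IPPROTO_TCP  6
#define DEMO_IPPROTO_UDP  17
#define DEMO_IPPROTO_GRE  47

/* Loose check: anything carried in IPv4 is accepted for offload. */
static inline bool demo_csum_ok_loose(uint16_t ethertype)
{
	return ethertype == DEMO_ETH_P_IP;
}

/* Stricter check: IPv4 and the L4 protocol is TCP or UDP. */
static inline bool demo_csum_ok_strict(uint16_t ethertype, uint8_t l4proto)
{
	return ethertype == DEMO_ETH_P_IP &&
	       (l4proto == DEMO_IPPROTO_TCP || l4proto == DEMO_IPPROTO_UDP);
}
```

A GRE-in-IPv4 packet passes the loose check but fails the strict one,
which is the kind of false positive described above.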

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation


unregister_netdevice warnings when deleting netns

2015-09-28 Thread Anand Gurram

Hi,

I am currently using kernel version 3.16.7 on a Linux switch.
While creating and destroying network namespaces I observe the following
log on the console:
"unregister_netdevice: waiting for lo to become free. Usage count = 1"

Could you please suggest how to debug this issue?
If a fix is already available, could you please point me to it?

Best Regards,
Anand

Re: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread Arnd Bergmann

On Sunday 27 September 2015 16:10:48 Rafael J. Wysocki wrote:
> On Saturday, September 26, 2015 09:33:56 PM Arnd Bergmann wrote:
> > On Saturday 26 September 2015 11:40:00 Viresh Kumar wrote:
> > > On 25 September 2015 at 15:19, Rafael J. Wysocki  
> > > wrote:
> > > > So if you allow something like debugfs to update your structure, how
> > > > do you make sure there is the proper locking?
> > > 
> > > Not really sure at all.. Isn't there some debugfs locking that will
> > > jump in, to avoid updation of fields to the same device?
> > 
> > No, if you need any locking to access a variable, you cannot use the
> > simple debugfs helpers but have to provide your own functions.
> > 
> > > >> Anyway, that problem isn't here for sure as its between two
> > > >> unsigned-longs. So, should I just move it to bool and resend ?
> > > >
> > > > I guess it might be more convenient to fold this into the other patch,
> > > > because we seem to be splitting hairs here.
> > > 
> > > I can and that's what I did. But then Arnd asked me to separate it
> > > out. I can fold it back if that's what you want.
> > 
> > It still makes sense to keep it separate I think, the patch is clearly
> > different from the other parts.
> 
> I just don't see much point in going from unsigned long to u32 and then
> from 32 to bool if we can go directly to bool in one go.

It's only important to keep the 34-file multi-subsystem trivial cleanup
that doesn't change any functionality separate from the bugfix. If you
like to avoid patching one of the files twice, the alternative would
be to first change the API for all other instances from u32 to bool
and leave ACPI alone, and then do the second patch that changes ACPI
from long to bool.

Arnd

[PATCH bluetooth-next 4/4] mac802154: add comments for llsec issues

2015-09-28 Thread Alexander Aring

While doing a little testing with the llsec implementation I saw these
issues. We should move decryption and encryption somewhere else;
otherwise, when capturing with wireshark, the mac header shows security
fields but the payload is plaintext.

A completely separate issue is what to do with HardMAC drivers, where the
payload is always plaintext. I think we then need special handling in
userspace. We currently don't support any HardMAC transceivers, so we
should fix the first issue for SoftMAC transceivers.

Signed-off-by: Alexander Aring 
---
 net/mac802154/rx.c | 4 
 net/mac802154/tx.c | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/mac802154/rx.c b/net/mac802154/rx.c
index d1c33c1..42e9672 100644
--- a/net/mac802154/rx.c
+++ b/net/mac802154/rx.c
@@ -87,6 +87,10 @@ ieee802154_subif_frame(struct ieee802154_sub_if_data *sdata,
 
skb->dev = sdata->dev;
 
+   /* TODO this should be moved after netif_receive_skb call, otherwise
+* wireshark will show a mac header with security fields and the
+* payload is already decrypted.
+*/
rc = mac802154_llsec_decrypt(&sdata->sec, skb);
if (rc) {
pr_debug("decryption failed: %i\n", rc);
diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
index 5ee596e..b205bbe 100644
--- a/net/mac802154/tx.c
+++ b/net/mac802154/tx.c
@@ -129,6 +129,10 @@ ieee802154_subif_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
struct ieee802154_sub_if_data *sdata = IEEE802154_DEV_TO_SUB_IF(dev);
int rc;
 
+   /* TODO we should move it to wpan_dev_hard_header and dev_hard_header
+* functions. The reason is wireshark will show a mac header which is
+* with security fields but the payload is not encrypted.
+*/
rc = mac802154_llsec_encrypt(&sdata->sec, skb);
if (rc) {
netdev_warn(dev, "encryption failed: %i\n", rc);
-- 
2.5.3


[PATCH bluetooth-next 3/4] nl802154: add support for security layer

2015-09-28 Thread Alexander Aring

This patch adds support for accessing the mac802154 llsec implementation
over nl802154. I added a new Kconfig entry,
CONFIG_IEEE802154_NL802154_EXPERIMENTAL, to provide this functionality.
This interface is still in development. It allows changing security
parameters and adding, deleting and dumping entries of the security
tables. Later we can also add a get operation to fetch an entry by its
unique identifier.

Cc: Phoebe Buckheister 
Signed-off-by: Alexander Aring 
---
 include/net/cfg802154.h |  131 
 include/net/ieee802154_netdev.h |   75 ---
 include/net/nl802154.h  |  191 ++
 net/ieee802154/Kconfig  |5 +
 net/ieee802154/core.c   |   12 +
 net/ieee802154/core.h   |1 +
 net/ieee802154/nl802154.c   | 1316 ---
 net/ieee802154/rdev-ops.h   |  109 
 net/mac802154/cfg.c |  205 ++
 9 files changed, 1876 insertions(+), 169 deletions(-)

diff --git a/include/net/cfg802154.h b/include/net/cfg802154.h
index 242273c..171cd76 100644
--- a/include/net/cfg802154.h
+++ b/include/net/cfg802154.h
@@ -27,6 +27,16 @@
 struct wpan_phy;
 struct wpan_phy_cca;
 
+#ifdef CONFIG_IEEE802154_NL802154_EXPERIMENTAL
+struct ieee802154_llsec_device_key;
+struct ieee802154_llsec_seclevel;
+struct ieee802154_llsec_params;
+struct ieee802154_llsec_device;
+struct ieee802154_llsec_table;
+struct ieee802154_llsec_key_id;
+struct ieee802154_llsec_key;
+#endif /* CONFIG_IEEE802154_NL802154_EXPERIMENTAL */
+
 struct cfg802154_ops {
struct net_device * (*add_virtual_intf_deprecated)(struct wpan_phy *wpan_phy,
   const char *name,
@@ -65,6 +75,51 @@ struct cfg802154_ops {
struct wpan_dev *wpan_dev, bool mode);
int (*set_ackreq_default)(struct wpan_phy *wpan_phy,
  struct wpan_dev *wpan_dev, bool ackreq);
+#ifdef CONFIG_IEEE802154_NL802154_EXPERIMENTAL
+   void(*get_llsec_table)(struct wpan_phy *wpan_phy,
+  struct wpan_dev *wpan_dev,
+  struct ieee802154_llsec_table **table);
+   void(*lock_llsec_table)(struct wpan_phy *wpan_phy,
+   struct wpan_dev *wpan_dev);
+   void(*unlock_llsec_table)(struct wpan_phy *wpan_phy,
+ struct wpan_dev *wpan_dev);
+   /* TODO remove locking/get table callbacks, this is part of the
+* nl802154 interface and should be accessible from ieee802154 layer.
+*/
+   int (*get_llsec_params)(struct wpan_phy *wpan_phy,
+   struct wpan_dev *wpan_dev,
+   struct ieee802154_llsec_params *params);
+   int (*set_llsec_params)(struct wpan_phy *wpan_phy,
+   struct wpan_dev *wpan_dev,
+   const struct ieee802154_llsec_params *params,
+   int changed);
+   int (*add_llsec_key)(struct wpan_phy *wpan_phy,
+struct wpan_dev *wpan_dev,
+const struct ieee802154_llsec_key_id *id,
+const struct ieee802154_llsec_key *key);
+   int (*del_llsec_key)(struct wpan_phy *wpan_phy,
+struct wpan_dev *wpan_dev,
+const struct ieee802154_llsec_key_id *id);
+   int (*add_seclevel)(struct wpan_phy *wpan_phy,
+struct wpan_dev *wpan_dev,
+const struct ieee802154_llsec_seclevel *sl);
+   int (*del_seclevel)(struct wpan_phy *wpan_phy,
+struct wpan_dev *wpan_dev,
+const struct ieee802154_llsec_seclevel *sl);
+   int (*add_device)(struct wpan_phy *wpan_phy,
+ struct wpan_dev *wpan_dev,
+ const struct ieee802154_llsec_device *dev);
+   int (*del_device)(struct wpan_phy *wpan_phy,
+ struct wpan_dev *wpan_dev, __le64 extended_addr);
+   int (*add_devkey)(struct wpan_phy *wpan_phy,
+ struct wpan_dev *wpan_dev,
+ __le64 extended_addr,
+ const struct ieee802154_llsec_device_key *key);
+   int (*del_devkey)(struct wpan_phy *wpan_phy,
+ struct wpan_dev *wpan_dev,
+ __le64 extended_addr,
+ const struct ieee802154_llsec_device_key *key);
+#endif /* CONFIG_IEEE802154_NL802154_EXPERIMENTAL */
 };
 
 static inline bool
@@ -176,6 +231,82 @@ struct ieee802154_addr {
};
 };
 
+struct ieee802154_llsec_key_id {
+   u8 mode;
+   u8 id;

[PATCH bluetooth-next 2/4] nl802154: use nla_get_le64 for get extended addr

2015-09-28 Thread Alexander Aring

This patch uses the nla_get_le64 function instead of force-casting the
result of nla_get_u64 to __le64.

Signed-off-by: Alexander Aring 
---
 net/ieee802154/nl802154.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/ieee802154/nl802154.c b/net/ieee802154/nl802154.c
index 3f89c0a..51110a6 100644
--- a/net/ieee802154/nl802154.c
+++ b/net/ieee802154/nl802154.c
@@ -753,10 +753,8 @@ static int nl802154_new_interface(struct sk_buff *skb, struct genl_info *info)
return -EINVAL;
}
 
-   /* TODO add nla_get_le64 to netlink */
if (info->attrs[NL802154_ATTR_EXTENDED_ADDR])
-   extended_addr = (__force __le64)nla_get_u64(
-   info->attrs[NL802154_ATTR_EXTENDED_ADDR]);
+   extended_addr = nla_get_le64(info->attrs[NL802154_ATTR_EXTENDED_ADDR]);
 
if (!rdev->ops->add_virtual_intf)
return -EOPNOTSUPP;
-- 
2.5.3


[PATCH bluetooth-next 1/4] netlink: add nla_get for le32 and le64

2015-09-28 Thread Alexander Aring

This patch adds the missing inline wrappers nla_get_le32 and
nla_get_le64. The 802.15.4 MAC byte order is little endian, and we keep
fields such as the address configuration in the same byte order as they
come from the MAC layer.

To provide these fields for nl802154 userspace applications, we need
these inline wrappers for netlink.
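
As an illustration only (a userspace sketch, not the kernel code), the
following shows the idea behind the wrappers: the accessor hands back the
attribute payload in its on-wire little-endian form, and converting it to
host order is a separate, explicit step. The names demo_attr, demo_get_le64
and demo_le64_to_cpu are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a netlink attribute payload: raw bytes only. */
struct demo_attr {
	uint8_t payload[8];
};

/* Return the payload unchanged, i.e. still in little-endian wire layout.
 * This mirrors the idea of nla_get_le64(): no byte swapping takes place.
 */
static uint64_t demo_get_le64(const struct demo_attr *attr)
{
	uint64_t raw;

	memcpy(&raw, attr->payload, sizeof(raw));
	return raw;
}

/* Convert a little-endian wire value into a host-order integer.
 * Works on any host because it only looks at the byte sequence.
 */
static uint64_t demo_le64_to_cpu(uint64_t le)
{
	uint8_t b[8];
	uint64_t host = 0;
	int i;

	memcpy(b, &le, sizeof(b));
	for (i = 7; i >= 0; i--)
		host = (host << 8) | b[i];
	return host;
}
```

A caller that only stores or forwards the address can keep the value from
demo_get_le64() as it is; only code that does arithmetic or prints the value
needs the explicit conversion.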

Cc: David S. Miller 
Signed-off-by: Alexander Aring 
---
 include/net/netlink.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 2a5dbcc..0e31727 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1004,6 +1004,15 @@ static inline __be32 nla_get_be32(const struct nlattr *nla)
 }
 
 /**
+ * nla_get_le32 - return payload of __le32 attribute
+ * @nla: __le32 netlink attribute
+ */
+static inline __le32 nla_get_le32(const struct nlattr *nla)
+{
+   return *(__le32 *) nla_data(nla);
+}
+
+/**
  * nla_get_u16 - return payload of u16 attribute
  * @nla: u16 netlink attribute
  */
@@ -1066,6 +1075,15 @@ static inline __be64 nla_get_be64(const struct nlattr *nla)
 }
 
 /**
+ * nla_get_le64 - return payload of __le64 attribute
+ * @nla: __le64 netlink attribute
+ */
+static inline __le64 nla_get_le64(const struct nlattr *nla)
+{
+   return *(__le64 *) nla_data(nla);
+}
+
+/**
  * nla_get_s32 - return payload of s32 attribute
  * @nla: s32 netlink attribute
  */
-- 
2.5.3


[PATCH bluetooth-next 0/4] ieee802154: add llsec support over nl802154

2015-09-28 Thread Alexander Aring

Hi,

this patch series will add llsec support for nl802154.

What is "llsec"?

The llsec (I suppose it stands for link-layer security) is part of the SoftMAC
implementation of 802.15.4, "net/mac802154/llsec.c". The 802.15.4 standard
describes a security mechanism based on ACLs. The encryption/decryption is done
by llsec. To access llsec we need an interface in nl802154. The 802.15.4
standard describes the PHY/MAC layer, and we possibly have paradigms similar to
wireless, with SoftMAC and HardMAC drivers. (We don't support HardMAC
transceivers right now. I never had any HardMAC transceivers; they are really
expensive, and only a few of them can also run in a "raw" mode.) Anyway,
nl802154 should access both SoftMAC and HardMAC drivers to provide "one
interface to userspace".
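
To make the "one interface" idea a bit more concrete, here is a minimal
userspace sketch of an ops table with optional callbacks, in the style of the
"if (!rdev->ops->...) return -EOPNOTSUPP;" checks nl802154 already uses. All
names (demo_ops, demo_dev, softmac_add_key, ...) are hypothetical and much
simpler than the real struct cfg802154_ops.

```c
#include <errno.h>
#include <stddef.h>

struct demo_dev;

/* Hypothetical, reduced ops table; a driver fills in what it supports. */
struct demo_ops {
	int (*add_key)(struct demo_dev *dev, int key_id);
	int (*del_key)(struct demo_dev *dev, int key_id);
};

struct demo_dev {
	const struct demo_ops *ops;
	int nkeys;
};

/* Generic-layer wrappers: a missing callback is reported as -EOPNOTSUPP
 * instead of being dereferenced.
 */
static int demo_add_key(struct demo_dev *dev, int key_id)
{
	if (!dev->ops->add_key)
		return -EOPNOTSUPP;
	return dev->ops->add_key(dev, key_id);
}

static int demo_del_key(struct demo_dev *dev, int key_id)
{
	if (!dev->ops->del_key)
		return -EOPNOTSUPP;
	return dev->ops->del_key(dev, key_id);
}

/* A SoftMAC-like backend that implements only add_key in this sketch. */
static int softmac_add_key(struct demo_dev *dev, int key_id)
{
	(void)key_id;
	dev->nkeys++;
	return 0;
}

static const struct demo_ops softmac_ops = {
	.add_key = softmac_add_key,
	/* .del_key intentionally left NULL */
};
```

The generic layer never needs to know which kind of driver sits behind the
table; a HardMAC driver would simply provide its own set of callbacks.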

These ACLs are known as "security tables" inside the mac information base
(MIB) of the 802.15.4 standard, the security MIB.

The final goal of providing these tables to userspace is an "iptables"-like
handling of "store" and "restore" through the userspace application "iwpan",
which contains the general "framework mechanism" like the wireless "iw" tool.
You can then add/del entries in these security tables.

I haven't looked yet at how the iptables userspace application "exactly" does
the store and restore mechanism. The current way is a very KISS handling:

 We add netlink cmds to add/del the table entries. Over the dump callback
 it's possible to get all information, which is printed out as the command
 line string "iwpan dev $WPAN_DEV $TABLE add ...". The restore script will
 simply export the $WPAN_DEV variable to restore this configuration for a
 specific interface.

 I will send the userspace patches to netdev as well, in case somebody wants
 to know what I did there for initial support.
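
A rough sketch of the dump side, assuming a made-up entry layout: each table
entry is printed as one command line that a restore script can replay after
exporting $WPAN_DEV. The struct, the function name and the "seclevel",
"frame" and "levels" keywords are placeholders for this example and not the
actual iwpan syntax.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical table entry; field names are illustrative only. */
struct demo_seclevel {
	uint8_t frame_type;
	uint8_t sec_levels;
};

/* Format one entry as a replayable command line. $WPAN_DEV is left
 * unexpanded on purpose so the restore script can choose the interface.
 * Returns the snprintf() result, i.e. the length of the full string.
 */
static int demo_dump_entry(char *buf, size_t len,
			   const struct demo_seclevel *sl)
{
	return snprintf(buf, len,
			"iwpan dev $WPAN_DEV seclevel add frame %u levels 0x%02x",
			(unsigned int)sl->frame_type,
			(unsigned int)sl->sec_levels);
}
```

Writing one such line per entry into a file gives the "store" half; sourcing
that file with $WPAN_DEV set is the "restore" half.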

This sounds weird, but it is a somewhat acceptable way to support llsec. The
final goal is to look up how iptables works and make a nicer C implementation.
There is currently no "officially supported" userspace tool which supports
accessing the "llsec".

I added several TODOs to the current implementation and added a new config
option:

CONFIG_IEEE802154_NL802154_EXPERIMENTAL

If this option is not set, the nl802154 llsec layer is not built and the
MAX_ATTR value of the nl802154 interface is reduced. With this option I
explicitly say that this interface over nl802154 is still in development and
will be changed later.
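
How a config option can shrink the attribute range can be sketched as below.
This is only an assumption about the general pattern; the enum names and the
DEMO_EXPERIMENTAL macro are invented for this example and do not match the
real nl802154 attribute list.

```c
/* Define DEMO_EXPERIMENTAL at build time (e.g. -DDEMO_EXPERIMENTAL) to
 * emulate enabling the experimental option.
 */
enum demo_attrs {
	DEMO_ATTR_UNSPEC,
	DEMO_ATTR_IFINDEX,
	DEMO_ATTR_EXTENDED_ADDR,
#ifdef DEMO_EXPERIMENTAL
	/* Only visible when the experimental interface is built in. */
	DEMO_ATTR_SEC_LEVEL,
	DEMO_ATTR_SEC_KEY,
#endif
	/* Keep last: used to derive the highest valid attribute number. */
	DEMO_ATTR_AFTER_LAST,
	DEMO_ATTR_MAX = DEMO_ATTR_AFTER_LAST - 1
};

/* An attribute number is accepted only if it lies within the built range. */
static int demo_attr_supported(int attr)
{
	return attr > DEMO_ATTR_UNSPEC && attr <= DEMO_ATTR_MAX;
}
```

Without the macro, the security attributes do not exist at all and anything
above DEMO_ATTR_EXTENDED_ADDR is rejected, so userspace cannot start to depend
on numbers that may still change.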

The 802.15.4 subsystem is still in EXPERIMENTAL state. There was commit
f4671a90c418b5aae14b61a9fc9d79c629403ca0 ("net/ieee802154: remove depends on
CONFIG_EXPERIMENTAL"), which is fine, but no maintainer ever said it's not
experimental anymore.

Checkpatch will complain about some lines wider than 80 characters. In these
places I ignore the warning, because otherwise the code looks awful in my
opinion.

My current working repository is still bluetooth-next/master. David, if
everything is fine, then please ack patch "[PATCH bluetooth-next 1/4]
netlink: add nla_get for le32 and le64", so Marcel can apply it. Thanks.

- Alex

Alexander Aring (4):
  netlink: add nla_get for le32 and le64
  nl802154: use nla_get_le64 for get extended addr
  nl802154: add support for security layer
  mac802154: add comments for llsec issues

 include/net/cfg802154.h |  131 
 include/net/ieee802154_netdev.h |   75 ---
 include/net/netlink.h   |   18 +
 include/net/nl802154.h  |  191 ++
 net/ieee802154/Kconfig  |5 +
 net/ieee802154/core.c   |   12 +
 net/ieee802154/core.h   |1 +
 net/ieee802154/nl802154.c   | 1320 ---
 net/ieee802154/rdev-ops.h   |  109 
 net/mac802154/cfg.c |  205 ++
 net/mac802154/rx.c  |4 +
 net/mac802154/tx.c  |4 +
 12 files changed, 1903 insertions(+), 172 deletions(-)

-- 
2.5.3
