Re: [PATCH net-next] ip: silence udp zerocopy smatch false positive

2018-12-08 Thread David Miller
From: Willem de Bruijn 
Date: Sat,  8 Dec 2018 06:22:46 -0500

> From: Willem de Bruijn 
> 
> extra_uref is used in __ip(6)_append_data only if uarg is set.
> 
> Smatch sees that the variable is passed to sock_zerocopy_put_abort.
> This function accesses it only when uarg is set, but smatch cannot
> infer this.
> 
> Make this dependency explicit.
> 
> Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path")
> Signed-off-by: Willem de Bruijn 

I looked and can't figure out a better way to fix this :)

Applied, thanks Willem.
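
For illustration, the false-positive pattern described in the commit message
above is roughly the following: a flag is only meaningful on the path where
uarg is set and is only consumed behind the same condition, but the condition
lives in the callee, so the analyzer cannot connect the two. This is a
simplified, hypothetical sketch, not the actual __ip_append_data() code.

    #include <stdbool.h>

    static int refs;

    /* Stand-in for sock_zerocopy_put_abort(): it only looks at
     * extra_uref when uarg is non-NULL, but a checker analyzing the
     * caller alone cannot infer that guarantee. */
    static void put_abort(void *uarg, bool extra_uref)
    {
            if (uarg && extra_uref)
                    refs--;                 /* drop the extra reference */
    }

    static void append_data(void *uarg)
    {
            bool extra_uref = false;        /* explicit init, tied to uarg */

            if (uarg)
                    extra_uref = true;

            /* ...error path... */
            put_abort(uarg, extra_uref);
    }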


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread David Miller
From: Ilias Apalodimas 
Date: Sat, 8 Dec 2018 16:57:28 +0200

> The patchset speeds up the mvneta driver on the default network
> stack. The only change that was needed was to adapt the driver to
> using the page_pool API. The speed improvements we are seeing on
> specific workloads (i.e 256b < packet < 400b) are almost 3x.
> 
> Lots of high speed drivers are doing similar recycling tricks themselves (and
> there's no common code, everyone is doing something similar though). All we
> are trying to do is provide a unified API to make that easier for the rest.
> Another advantage is that if some drivers switch to the API, adding XDP
> functionality on them is pretty trivial.

Yeah this is a very important point moving forward.

Jesse Brandeberg brought the following up to me at LPC and I'd like to
develop it further.

Right now we tell driver authors to write a new driver as SKB based,
and once they've done all of that work we tell them to basically
shoe-horn XDP support into that somewhat different framework.

Instead, the model should be the other way around, because with a raw
meta-data free set of data buffers we can always construct an SKB or
pass it to XDP.

So drivers should be targeting some raw data buffer kind of interface
which takes care of all of this stuff.  If the buffers get wrapped
into an SKB and get pushed into the traditional networking stack, the
driver shouldn't know or care.  Likewise if it ends up being processed
with XDP, it should not need to know or care.

All of those details should be behind a common layer.  Then we can
control:

1) Buffer handling, recycling, "fast paths"

2) Statistics

3) XDP feature sets

We can consolidate behavior and semantics across all of the drivers
if we do this.  No more talk about "supporting all XDP features",
and the inconsistencies we have because of that.

The whole common statistics discussion could be resolved with this
common layer as well.

We'd be able to control and properly optimize everything.


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Sat, 8 Dec 2018 12:36:10 +0100

> The annoying part is actually that, depending on the kernel config
> options CONFIG_XFRM, CONFIG_NF_CONNTRACK and CONFIG_BRIDGE_NETFILTER,
> there may be a cache-line split, where mem_info gets moved into the
> next cacheline.

Note that Florian Westphal's work (trying to help MP-TCP) would
eliminate this variability.


Re: [net-next PATCH RFC 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-07 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Fri, 07 Dec 2018 00:25:47 +0100

> @@ -744,6 +745,10 @@ struct sk_buff {
>   head_frag:1,
>   xmit_more:1,
>   pfmemalloc:1;
> + /* TODO: Future idea, extend mem_info with __u8 flags, and
> +  * move bits head_frag and pfmemalloc there.
> +  */
> + struct xdp_mem_info mem_info;

This is 4 bytes right?

I guess I can live with this.

Please do some microbenchmarks to make sure this doesn't show any
obvious regressions.

Thanks.


Re: [net-next PATCH RFC 1/8] page_pool: add helper functions for DMA

2018-12-07 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Fri, 07 Dec 2018 00:25:32 +0100

> From: Ilias Apalodimas 
> 
> Add helper functions for retrieving dma_addr_t stored in page_private and
> unmapping dma addresses, mapped via the page_pool API.
> 
> Signed-off-by: Ilias Apalodimas 
> Signed-off-by: Jesper Dangaard Brouer 

This isn't going to work on 32-bit platforms where dma_addr_t is a u64,
because page private is an unsigned long.

Grep for PHYS_ADDR_T_64BIT under arch/ to see the vast majority of the
cases where this happens, then ARCH_DMA_ADDR_T_64BIT.
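
The width mismatch is easy to demonstrate in isolation: on a 32-bit build
with 64-bit DMA addressing, dma_addr_t is a u64 while page->private is an
unsigned long, so storing the full address there silently drops the upper
32 bits. A small userspace sketch (the typedefs are stand-ins chosen purely
for illustration):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t dma_addr_t_sketch;   /* dma_addr_t with 64-bit DMA */
    typedef uint32_t ulong_32bit_sketch;  /* unsigned long on a 32-bit arch */

    int main(void)
    {
            dma_addr_t_sketch dma = 0x1ffff0000ULL;             /* > 32 bits */
            ulong_32bit_sketch priv = (ulong_32bit_sketch)dma;  /* truncated */

            printf("stored 0x%llx, read back 0x%llx\n",
                   (unsigned long long)dma, (unsigned long long)priv);
            return 0;
    }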


Re: [PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"

2018-12-07 Thread David Miller
From: Benjamin Herrenschmidt 
Date: Fri, 07 Dec 2018 15:05:04 +1100

> This reverts commit 624ca9c33c8a853a4a589836e310d776620f4ab9.
> 
> This commit is completely bogus. The STACR register has two formats, old
> and new, depending on the version of the IP block used. There's a pair of
> device-tree properties that can be used to specify the format used:
> 
>   has-inverted-stacr-oc
>   has-new-stacr-staopc
> 
> What this commit did was to change the bit definition used with the old
> parts to match the new parts. This of course breaks the driver on all
> the old ones.
> 
> Instead, the author should have set the appropriate properties in the
> device-tree for the variant used on his board.
> 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
> 
> Found while setting up some old ppc440 boxes for test/CI

Applied, thanks.


Re: [PATCH] net-udp: deprioritize cpu match for udp socket lookup

2018-12-07 Thread David Miller
From: Maciej Żenczykowski 
Date: Fri, 7 Dec 2018 16:46:36 -0800

>> This doesn't apply to the current net tree.
>>
>> Also "net-udp: " is a weird subsystem prefix, just use "udp: ".
>>
>> Thank you.
> 
> Interesting... this patch was on top of net-next/master, and it still
> rebases cleanly on current net-next/master.
> 
> Would you like it on net/master instead?  It indeed doesn't apply
> cleanly there...

Well, it is a bug fix, isn't it?  Or is this more like a behavioral feature?


Re: [PATCH net-next 0/4] tc-testing: implement command timeouts and better results tracking

2018-12-07 Thread David Miller
From: Lucas Bates 
Date: Thu,  6 Dec 2018 17:42:23 -0500

> Patch 1 adds a timeout feature for any command tdc launches in a subshell.
> This prevents tdc from hanging indefinitely.
> 
> Patches 2-4 introduce a new method for tracking and generating test case
> results, and implements it across the core script and all applicable
> plugins.

Series applied.


Re: [PATCH net v2 0/2] Fix slab out-of-bounds on insufficient headroom for IPv6 packets

2018-12-07 Thread David Miller
From: Stefano Brivio 
Date: Thu,  6 Dec 2018 19:30:35 +0100

> Patch 1/2 fixes a slab out-of-bounds occurring with short SCTP packets over
> IPv4 over L2TP over IPv6 on a configuration with relatively low HEADER_MAX.
> 
> Patch 2/2 makes sure we avoid writing before the allocated buffer in
> neigh_hh_output() in case the headroom is enough for the unaligned hardware
> header size, but not enough for the aligned one, and that we warn if we hit
> this condition.

Series applied and queued up for -stable, thanks.


Re: [PATCH net] tcp: lack of available data can also cause TSO defer

2018-12-07 Thread David Miller
From: Eric Dumazet 
Date: Thu,  6 Dec 2018 09:58:24 -0800

> tcp_tso_should_defer() can return true in three different cases :
> 
>  1) We are cwnd-limited
>  2) We are rwnd-limited
>  3) We are application limited.
> 
> Neal pointed out that my recent fix went too far, since
> it assumed that if we were not in 1) case, we must be rwnd-limited
> 
> Fix this by properly populating the is_cwnd_limited and
> is_rwnd_limited booleans.
> 
> After this change, we can finally move the silly check for FIN
> flag only for the application-limited case.
> 
> The same move for EOR bit will be handled in net-next,
> since commit 1c09f7d073b1 ("tcp: do not try to defer skbs
> with eor mark (MSG_EOR)") is scheduled for linux-4.21
> 
> Tested by running 200 concurrent netperf -t TCP_RR -- -r 6,100
> and checking none of them was rwnd_limited in the chrono_stat
> output from "ss -ti" command.
> 
> Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited")
> Signed-off-by: Eric Dumazet 
> Suggested-by: Neal Cardwell 
> Reviewed-by: Neal Cardwell 
> Acked-by: Soheil Hassas Yeganeh 
> Reviewed-by: Yuchung Cheng 

Applied.


Re: [PATCH] net-udp: deprioritize cpu match for udp socket lookup

2018-12-07 Thread David Miller
From: Maciej Żenczykowski 
Date: Wed,  5 Dec 2018 12:59:17 -0800

> From: Maciej Żenczykowski 
> 
> During udp socket lookup cpu match should be lowest priority,
> hence it should increase score by only 1.
> 
> The next priority is delivering v4 to v4 sockets, and v6 to v6 sockets.
> The v6 code path doesn't have to deal with this so it always gets
> a score of '4'.  The v4 code path uses '4' or '2' depending on
> whether we're delivering to a v4 socket or a dualstack v6 socket.
> 
> This is more important than cpu match, so has to be greater than
> the '1' bump in score from cpu match.
> 
> All other matches (src/dst ip, src port) are even *more* important,
> so need to bump score by 4 for ipv4.
> 
> For ipv6 we could simply bump by 2, but let's keep the two code
> paths as similar as possible.
> 
> (also, while at it, remove two unnecessary unconditional score bumps)
> 
> Signed-off-by: Maciej Żenczykowski 

This doesn't apply to the current net tree.

Also "net-udp: " is a weird subsystem prefix, just use "udp: ".

Thank you.
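
The scoring scheme quoted above is essentially a strict priority encoding: a
lower-priority property must contribute less to the score than any
higher-priority property could. A minimal sketch of that idea (illustrative
only, not the actual compute_score() code; the helper name is made up):

    #include <stdbool.h>

    /* Higher score wins the UDP socket lookup. */
    static int udp_score_sketch(bool addr_port_match, bool exact_family_match,
                                bool cpu_match)
    {
            int score = 0;

            if (!addr_port_match)
                    return -1;      /* not a candidate at all */

            score += 4;             /* src/dst address and port matches */
            if (exact_family_match)
                    score += 2;     /* v4-to-v4 (or v6-to-v6) beats dualstack */
            if (cpu_match)
                    score += 1;     /* lowest-priority tie-breaker */
            return score;
    }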


Re: [Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-07 Thread David Miller
From: yupeng 
Date: Wed,  5 Dec 2018 18:56:28 -0800

> after setting SO_DONTROUTE to 1, the IP layer should not route packets if
> the dest IP address is not in link scope. But if the socket has cached
> the dst_entry, such packets would be routed until the sk_dst_cache
> expires. So we should clean the sk_dst_cache when a user sets the
> SO_DONTROUTE option. Below are server/client python scripts which
> could reproduce this issue:
 ...
> Signed-off-by: yupeng 

Applied.
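
The fix itself is a one-liner in spirit: invalidate the socket's cached route
at the moment SO_DONTROUTE is toggled, so the next transmit re-resolves the
route under the new policy. A rough sketch of the idea (not the literal diff):

    /* Sketch: when SO_DONTROUTE changes, drop any cached dst so that
     * routing is redone with the updated flag. */
    static void set_dontroute_sketch(struct sock *sk, bool on)
    {
            sock_valbool_flag(sk, SOCK_LOCALROUTE, on);
            sk_dst_reset(sk);       /* invalidate sk->sk_dst_cache */
    }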


Re: [PATCH v2 net-next] neighbor: Improve garbage collection

2018-12-07 Thread David Miller
From: David Ahern 
Date: Fri,  7 Dec 2018 12:24:57 -0800

> From: David Ahern 
> 
> The existing garbage collection algorithm has a number of problems:
 ...
> This patch addresses these problems as follows:
> 
> 1. Use of a separate list_head to track entries that can be garbage
>collected along with a separate counter. PERMANENT entries are not
>added to this list.
> 
>The gc_thresh parameters are only compared to the new counter, not the
>total entries in the table. The forced_gc function is updated to only
>walk this new gc_list looking for entries to evict.
> 
> 2. Entries are added to the list head at the tail and removed from the
>front.
> 
> 3. Entries are only evicted if they were last updated more than 5 seconds
>ago, adhering to the original intent of gc_thresh2.
> 
> 4. Forced gc is stopped once the number of gc_entries drops below
>gc_thresh2.
> 
> 5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
>when allocating a new neighbor for a PERMANENT entry. By extension this
>means there are no explicit limits on the number of PERMANENT entries
>that can be created, but this is no different than FIB entries or FDB
>entries.
> 
> Signed-off-by: David Ahern 
> ---
> v2
> - remove on_gc_list boolean in favor of !list_empty
> - fix neigh_alloc to add new entry to tail of list_head

Again, looks great, applied.
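
As a rough illustration of points 1-4 in the commit message, the forced GC
walk now looks conceptually like the sketch below: only entries on the
dedicated gc_list are considered, entries touched within the last 5 seconds
are skipped, and the walk stops once the eligible count drops under
gc_thresh2. Locking is elided and the eviction helper name is made up; this
is not the applied code.

    static int forced_gc_sketch(struct neigh_table *tbl)
    {
            unsigned long too_fresh = jiffies - 5 * HZ;
            struct neighbour *n, *tmp;
            int shrunk = 0;

            list_for_each_entry_safe(n, tmp, &tbl->gc_list, gc_list) {
                    if (atomic_read(&tbl->gc_entries) <= tbl->gc_thresh2)
                            break;          /* pressure relieved, stop */
                    if (time_after(n->updated, too_fresh))
                            continue;       /* updated < 5s ago, keep it */
                    if (evict_neigh_entry(n, tbl))  /* hypothetical helper */
                            shrunk++;
            }
            return shrunk;
    }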


Re: [PATCH V2] net: dsa: ksz: Add reset GPIO handling

2018-12-07 Thread David Miller
From: Marek Vasut 
Date: Fri, 7 Dec 2018 23:59:58 +0100

> On 12/07/2018 11:24 PM, Andrew Lunn wrote:
>> On Fri, Dec 07, 2018 at 10:51:36PM +0100, Marek Vasut wrote:
>>> Add code to handle optional reset GPIO in the KSZ switch driver. The switch
>>> has a reset GPIO line which can be controlled by the CPU, so make sure it is
>>> configured correctly in such setups.
>> 
>> Hi Marek
> 
> Hi Andrew,
> 
>> Please make this a patch series, not two individual patches.
> 
> This actually is an individual patch, it doesn't depend on anything.
> Or do you mean a series with the DT documentation change ?

Yes, but all of this stuff is building up for one single purpose,
and that is to support a new mode of operation with DSA or whatever.

So please group them together in a series with an appropriate
header posting.


Re: [PATCH net-next] neighbor: Add protocol attribute

2018-12-07 Thread David Miller
From: Eric Dumazet 
Date: Fri, 7 Dec 2018 15:03:04 -0800

> On 12/07/2018 02:24 PM, David Ahern wrote:
>> On 12/7/18 3:20 PM, Eric Dumazet wrote:
>> 
>> /* --- cacheline 3 boundary (192 bytes) --- */
>> struct hh_cache hh;   /*   192    48 */
>> 
>> ...
>> 
>> but does not change the actual allocation size which is rounded to 512.
>> 
> 
> I have not talked about the allocation size, but alignment of ->ha field,
> which is kind of assuming long alignment, in a strange way.

Right, neigh->ha[] should probably be kept 8-byte aligned.


Re: [PATCH 1/5] net: dsa: ksz: Add MIB counter reading support

2018-12-07 Thread David Miller


Every patch series should have a header posting with Subject of
the form "[PATCH 0/N] ..." explaining what the series does at
a high level, how it does it, and why it does it that way.


Re: [PATCH v2 net-next 0/4] net: aquantia: add RSS configuration

2018-12-07 Thread David Miller
From: Igor Russkikh 
Date: Fri, 7 Dec 2018 14:00:09 +

> In this patchset a few bugs related to RSS are fixed, and RSS table and
> hash key configuration is added.
> 
> We also increase the max number of HW rings up to 8.
> 
> v2: removed extra arg check

Series applied.


Re: [PATCH net] ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output

2018-12-07 Thread David Miller
From: Shmulik Ladkani 
Date: Fri,  7 Dec 2018 09:50:17 +0200

> In 'seg6_output', stack variable 'struct flowi6 fl6' was missing
> initialization.
> 
> Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and 
> injection with lwtunnels")
> Signed-off-by: Shmulik Ladkani 

Applied and queued up for -stable, thanks.


Re: [PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"

2018-12-06 Thread David Miller


Looks like your posting was empty?


Re: [PATCH net-next] neighbour: Improve garbage collection

2018-12-06 Thread David Miller
From: David Ahern 
Date: Thu,  6 Dec 2018 14:38:44 -0800

> The existing garbage collection algorithm has a number of problems:

Thanks for working on this!

I totally agree with what you are doing, especially the separate
gc_list.

But why do you need the on_gc_list boolean state?  That's equivalent
to "!list_empty(>gc_list)" and seems redundant.


Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads

2018-12-06 Thread David Miller
From: Andrew Lunn 
Date: Thu, 6 Dec 2018 21:48:46 +0100

> David has already accepted the patchset, so i will add a followup
> patch.

Yeah sorry for jumping the gun, the changes looked pretty
straightforward to me. :-/


Re: [PATCH net-next v2 0/8] Pass extack to NETDEV_PRE_UP

2018-12-06 Thread David Miller
From: Petr Machata 
Date: Thu, 6 Dec 2018 17:05:35 +

> Drivers may need to validate configuration of a device that's about to
> be upped. An example is mlxsw, which needs to check the configuration of
> a VXLAN device attached to an offloaded bridge. Should the validation
> fail, there's currently no way to communicate details of the failure to
> the user, beyond an error number.
> 
> Therefore this patch set extends the NETDEV_PRE_UP event to include
> extack, if available.
 ...

Series applied, thank you.


Re: [PATCH net 0/4] mlxsw: Various fixes

2018-12-06 Thread David Miller
From: Ido Schimmel 
Date: Thu, 6 Dec 2018 17:44:48 +

> Patches #1 and #2 fix two VxLAN related issues. The first patch removes
> warnings that can currently be triggered from user space. Second patch
> avoids leaking a FID in an error path.
> 
> Patch #3 fixes a too strict check that causes certain host routes not to
> be promoted to perform GRE decapsulation in hardware.
> 
> Last patch avoids a use-after-free when deleting a VLAN device via an
> ioctl when it is enslaved to a bridge. I have a patchset for net-next
> that reworks this code and makes the driver more robust.

Series applied.


Re: mv88e6060: Turn e6060 driver into e6065 driver

2018-12-06 Thread David Miller
From: Pavel Machek 
Date: Thu, 6 Dec 2018 14:03:45 +0100

> @@ -79,7 +82,7 @@ static enum dsa_tag_protocol 
> mv88e6060_get_tag_protocol(struct dsa_switch *ds,
>  {
>//return DSA_TAG_PROTO_QCA;
>//return DSA_TAG_PROTO_TRAILER;

These C++ style comments are not in any of my tree(s).

Your patch submission really needs to shape up if you want your patches
to be considered seriously.

Thank you.


Re: [PATCH] mv88e6060: Warn about errors

2018-12-06 Thread David Miller


Plain "printk" are never appropriate.

Please explicitly use pr_warn() or similar.  If there is a device context
available, either a generic device or a netdev, use one of the dev_*()
or netdev_*() variants.


Re: [PATCH] tcp: fix code style in tcp_recvmsg()

2018-12-06 Thread David Miller
From: Pedro Tammela 
Date: Thu,  6 Dec 2018 10:45:28 -0200

> 2 goto labels are indented with a tab. Remove the tabs and
> keep the code style consistent.
> 
> Signed-off-by: Pedro Tammela 

Applied to net-next.


Re: [PATCH net-next 0/2] Adjust MTU of DSA master interface

2018-12-06 Thread David Miller
From: Andrew Lunn 
Date: Thu,  6 Dec 2018 11:36:03 +0100

> DSA makes use of additional headers to direct a frame in/out of a
> specific port of the switch. When the slave interfaces uses an MTU of
> 1500, the master interface can be asked to handle frames with an MTU
> of 1504, or 1508 bytes. Some Ethernet interfaces won't
> transmit/receive frames which are bigger than their MTU.
> 
> Automate the increasing of the MTU on the master interface, by adding
> to each tagging driver how much overhead they need, and then calling
> dev_set_mtu() of the master interface to increase its MTU as needed.

Series applied, thanks Andrew.


Re: [PATCH][net-next] tun: align write-heavy flow entry members to a cache line

2018-12-06 Thread David Miller
From: Li RongQing 
Date: Thu,  6 Dec 2018 16:08:17 +0800

> The tun flow entry 'updated' field is written on every received
> packet. Thus if a flow is receiving packets through a particular
> flow entry, it will cause false sharing with all the other CPUs
> that have looked it up, so move it into its own cache line,
> 
> and update the 'queue_index' and 'updated' fields only when
> they have changed, to reduce cache false sharing.
> 
> Signed-off-by: Zhang Yu 
> Signed-off-by: Wang Li 
> Signed-off-by: Li RongQing 

Applied.
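
The two halves of the change map to a layout tweak and a write-avoidance
tweak. A hedged sketch with simplified field names (this is not the real
struct tun_flow_entry):

    /* Keep the write-heavy field on its own cache line so hot writers do
     * not invalidate the line used by lookups on other CPUs, and skip
     * stores that would not change anything. */
    struct flow_entry_sketch {
            u32             rxhash;
            u16             queue_index;
            /* ... read-mostly lookup fields ... */
            unsigned long   updated ____cacheline_aligned_in_smp;
    };

    static void flow_touch_sketch(struct flow_entry_sketch *e, u16 queue)
    {
            if (READ_ONCE(e->updated) != jiffies)
                    WRITE_ONCE(e->updated, jiffies);  /* write only on change */
            if (e->queue_index != queue)
                    e->queue_index = queue;
    }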


Re: [PATCH][net-next] tun: remove unnecessary check in tun_flow_update

2018-12-06 Thread David Miller
From: Li RongQing 
Date: Thu,  6 Dec 2018 16:28:11 +0800

> The caller has guaranteed that rxhash is not zero.
> 
> Signed-off-by: Li RongQing 

Applied.


Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-06 Thread David Miller
From: Jouke Witteveen 
Date: Thu, 6 Dec 2018 09:59:20 +0100

> On Thu, Dec 6, 2018 at 1:34 AM David Miller  wrote:
>>
>> From: Jouke Witteveen 
>> Date: Wed, 5 Dec 2018 23:38:17 +0100
>>
>> > Can you elaborate a bit? I may not be aware of the policy you have in
>> > mind.
>>
>> When we have a user facing interface to do something, we don't create
>> another one unless it is absolutely, positively, unavoidable.
> 
> Obviously, if I would have known this I would not have gone through
> the trouble of investigating and proposing this patch. It was an
> honest attempt at making the kernel better.
> Where could I have found this policy? I have looked on kernel.org/doc,
> but couldn't find it.

It is not formally documented but it is a concern we raise every time
a duplicate piece of user facing functionality is proposed.


Re: [PATCH net] sctp: fix pr_warn max_data argument type mismatch

2018-12-06 Thread David Miller
From: Jakub Audykowicz 
Date: Thu,  6 Dec 2018 08:58:37 +0100

> My previous patch introduced a compilation warning regarding a type
> mismatch (int vs size_t). This is a one-letter fix for good housekeeping.
> 
> Signed-off-by: Jakub Audykowicz 

Still wrong and I fixed it when I applied your patch.

You need to use the 'z' length modifier for size_t, so %zu in this case.
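
For reference, size_t takes the lowercase 'z' length modifier in printf-style
format strings; a minimal standalone example:

    #include <stdio.h>

    int main(void)
    {
            size_t max_data = 1452;   /* value picked purely for illustration */

            printf("max_data=%zu\n", max_data);   /* %zu matches size_t */
            return 0;
    }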


Re: [PATCH net-next] neighbor: Add extack messages for add and delete commands

2018-12-05 Thread David Miller
From: David Ahern 
Date: Wed,  5 Dec 2018 20:02:29 -0800

> From: David Ahern 
> 
> Add extack messages for failures in neigh_add and neigh_delete.
> 
> Signed-off-by: David Ahern 

Looks good, applied, thanks David.


Re: [PATCH net] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes

2018-12-05 Thread David Miller
From: Jiri Wiesner 
Date: Wed, 5 Dec 2018 16:55:29 +0100

> The *_frag_reasm() functions are susceptible to miscalculating the byte
> count of packet fragments in case the truesize of a head buffer changes.
> The truesize member may be changed by the call to skb_unclone(), leaving
> the fragment memory limit counter unbalanced even if all fragments are
> processed. This miscalculation goes unnoticed as long as the network
> namespace which holds the counter is not destroyed.
> 
> Should an attempt be made to destroy a network namespace that holds an
> unbalanced fragment memory limit counter the cleanup of the namespace
> never finishes. The thread handling the cleanup gets stuck in
> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
> thread is usually in running state with a stacktrace similar to:
> 
>  PID: 1073   TASK: 880626711440  CPU: 1   COMMAND: "kworker/u48:4"
>   #5 [880621563d48] _raw_spin_lock at 815f5480
>   #6 [880621563d48] inet_evict_bucket at 8158020b
>   #7 [880621563d80] inet_frags_exit_net at 8158051c
>   #8 [880621563db0] ops_exit_list at 814f5856
>   #9 [880621563dd8] cleanup_net at 814f67c0
>  #10 [880621563e38] process_one_work at 81096f14
> 
> It is not possible to create new network namespaces, and processes
> that call unshare() end up being stuck in uninterruptible sleep state
> waiting to acquire the net_mutex.
> 
> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
> I thank him for his analysis of the problem. The parts of this patch
> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
> 
> Signed-off-by: Jiri Wiesner 
> Reported-by: Per Sundstrom 

Nice catch.

Applied and queued up for -stable, thanks!
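
The essence of the fix is to fold any truesize change made by skb_unclone()
on the head fragment back into the per-netns fragment memory counter, so the
counter can still reach zero when the namespace is torn down. A hedged sketch
of the pattern (not the literal patch; the function name is made up):

    /* Rebalance the frag memory accounting if unclone grew truesize. */
    static int reasm_unclone_sketch(struct sk_buff *head, struct netns_frags *nf)
    {
            unsigned int old_truesize = head->truesize;

            if (skb_unclone(head, GFP_ATOMIC))
                    return -ENOMEM;

            if (head->truesize != old_truesize)
                    add_frag_mem_limit(nf, head->truesize - old_truesize);
            return 0;
    }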


Re: [PATCH net 1/3] flex_array: make FLEX_ARRAY_BASE_SIZE the same value of FLEX_ARRAY_PART_SIZE

2018-12-05 Thread David Miller
From: Xin Long 
Date: Wed,  5 Dec 2018 14:49:40 +0800

> This patch is to separate the base data memory from struct flex_array and
> save it into a page. With this change, total_nr_elements of a flex_array
> can grow or shrink without having the old element's memory changed when
> the new size of the flex_array crosses FLEX_ARRAY_BASE_SIZE, which will
> be added in the next patch.
> 
> Suggested-by: Neil Horman 
> Signed-off-by: Xin Long 

This needs to be reviewed by the flex array hackers and lkml.

It can't just get reviewed on netdev alone.


Re: [PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree

2018-12-05 Thread David Miller
From: Peter Oskolkov 
Date: Tue,  4 Dec 2018 11:55:56 -0800

> When testing high-bandwidth TCP streams with large windows,
> high latency, and low jitter, netem consumes a lot of CPU cycles
> doing rbtree rebalancing.
> 
> This patch uses a linear list/queue in addition to the rbtree:
> if an incoming packet is past the tail of the linear queue, it is
> added there, otherwise it is inserted into the rbtree.
> 
> Without this patch, perf shows netem_enqueue, netem_dequeue,
> and rb_* functions among the top offenders. With this patch,
> only netem_enqueue is noticeable if jitter is low/absent.
> 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Peter Oskolkov 

Applied, thanks.
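
The fast path described in the commit message can be sketched as: compare the
new packet's release time against the tail of the linear queue, append there
if it is not earlier, and only fall back to the rbtree for reordered packets.
The helper names below are hypothetical; this is not the applied code.

    static void netem_enqueue_sketch(struct netem_sched_data *q,
                                     struct sk_buff *skb, u64 time_to_send)
    {
            struct sk_buff *tail = q->t_tail;

            if (!tail || time_to_send >= netem_skb_cb(tail)->time_to_send) {
                    /* low/no jitter: packets arrive in order, O(1) append */
                    linear_queue_append(q, skb);
            } else {
                    /* reordered packet: keep the O(log n) rbtree insert */
                    rbtree_enqueue(q, skb);
            }
    }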


Re: [PATCH net] sctp: frag_point sanity check

2018-12-05 Thread David Miller
From: Jakub Audykowicz 
Date: Tue,  4 Dec 2018 20:27:41 +0100

> If for some reason an association's fragmentation point is zero,
> sctp_datamsg_from_user will try to endlessly try to divide a message
> into zero-sized chunks. This eventually causes kernel panic due to
> running out of memory.
> 
> Although this situation is quite unlikely, it has occurred before as
> reported. I propose to add this simple last-ditch sanity check due to
> the severity of the potential consequences.
> 
> Signed-off-by: Jakub Audykowicz 

Applied.
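
The guard being described is simply "refuse to chunk a message when the
fragmentation point is zero", e.g. (a sketch of the idea, not the literal
patch):

    /* Last-ditch sanity check before splitting msg_len into chunks:
     * a zero frag_point would make the chunking loop never advance. */
    if (unlikely(asoc->frag_point == 0)) {
            pr_warn_once("sctp: association with zero frag_point\n");
            return ERR_PTR(-EINVAL);
    }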


Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Miller
From: David Ahern 
Date: Wed, 5 Dec 2018 17:46:37 -0700

> ok. patches 5-7 are not dependent on 1-4. Should I re-send outside of
> this set?

Yes, please respin.

Thanks David.


Re: [pull request][net-next V2 0/7] Mellanox, mlx5e updates 2018-12-04

2018-12-05 Thread David Miller
From: Saeed Mahameed 
Date: Wed,  5 Dec 2018 16:12:58 -0800

> The following series is for mlx5e netdevice driver, it adds ethtool
> support for RX hash fields configuration and some misc updates, please
> see tag log below.
> 
> Please pull and let me know if there's any problem.
> 
> v1->v2:
>  - Move static const array to c file.
>  - Remove unnecessary blank line
>  - Add #include 
>  - Print priv flag name rather than its hex value

Pulled, thanks Saeed.


Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Miller
From: David Ahern 
Date: Wed,  5 Dec 2018 15:34:09 -0800

> @@ -270,37 +270,25 @@ static inline bool neigh_key_eq128(const struct 
> neighbour *n, const void *pkey)
>   (n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
>  }
>  
> -static inline struct neighbour *___neigh_lookup_noref(
> - struct neigh_table *tbl,
> - bool (*key_eq)(const struct neighbour *n, const void *pkey),
> - __u32 (*hash)(const void *pkey,
> -   const struct net_device *dev,
> -   __u32 *hash_rnd),
> - const void *pkey,
> - struct net_device *dev)
> +static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
> +  const void *pkey,
> +  struct net_device *dev)
>  {

Sorry, we can't do this.

The whole point of how this is laid out is so that the entire hash traversal,
including the hash function, is expanded inline.

This demux is extremely critical on the output side, it must be the
smallest number of cycles possible.  It was the only way I could justify
not caching neigh entries in the routes any more when I wrote this code.

Even before retpoline, putting an indirect call here is painful.  With
retpoline it is deadly.

Please avoid removing the full inline expansion of the neigh lookup in the ipv6
and ipv4 data paths.

Thank you.


Re: [PATCH net] tcp: fix NULL ref in tail loss probe

2018-12-05 Thread David Miller
From: Yuchung Cheng 
Date: Wed,  5 Dec 2018 14:38:38 -0800

> TCP loss probe timer may fire when the retransmission queue is empty but
> has a non-zero tp->packets_out counter. tcp_send_loss_probe will call
> tcp_rearm_rto, which triggers a NULL pointer dereference by fetching the
> retransmission queue head in its sub-routines.
> 
> Add a more detailed warning to help catch the root cause of the inflight
> accounting inconsistency.
> 
> Reported-by: Rafael Tinoco 
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Neal Cardwell 

Applied, thanks for working to diagnose this so quickly.


Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-05 Thread David Miller
From: Jouke Witteveen 
Date: Wed, 5 Dec 2018 23:38:17 +0100

> Can you elaborate a bit? I may not be aware of the policy you have in
> mind.

When we have a user facing interface to do something, we don't create
another one unless it is absolutely, positively, unavoidable.


Re: [PATCH net] tcp: Do not underestimate rwnd_limited

2018-12-05 Thread David Miller
From: Eric Dumazet 
Date: Wed,  5 Dec 2018 14:24:31 -0800

> If available rwnd is too small, tcp_tso_should_defer()
> can decide it is worth waiting before splitting a TSO packet.
> 
> This really means we are rwnd limited.
> 
> Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive 
> window")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.


Re: pull-request: bpf 2018-12-05

2018-12-05 Thread David Miller
From: Alexei Starovoitov 
Date: Wed, 5 Dec 2018 13:23:22 -0800

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) fix bpf uapi pointers for 32-bit architectures, from Daniel.
> 
> 2) improve verifer ability to handle progs with a lot of branches, from 
> Alexei.
> 
> 3) strict btf checks, from Yonghong.
> 
> 4) bpf_sk_lookup api cleanup, from Joe.
> 
> 5) other misc fixes
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thank you.


Re: [PATCH net-next 0/6] u32 to linkmode fixes

2018-12-05 Thread David Miller
From: Andrew Lunn 
Date: Wed,  5 Dec 2018 21:49:39 +0100

> This patchset fixes issues found in the last patchset which converted
> the phydev advertise etc, from a u32 to a linux bitmap. Most of the
> issues are the result of clearing bits which should not of been
> cleared. To make the API clearer, the idea from Heiner Kallweit was
> used, with _mod_ to indicate the function modifies just the bits it
> needs to, or _to_ to clear all bits and just set bit that need to be
> set.

Series applied, thanks Andrew.

Please always list the Fixes tag first in the future.  I fixed if up
for you this time.

Thanks again.


Re: [PATCH net] net: use skb_list_del_init() to remove from RX sublists

2018-12-05 Thread David Miller
From: Edward Cree 
Date: Tue, 4 Dec 2018 17:37:57 +

> list_del() leaves the skb->next pointer poisoned, which can then lead to
>  a crash in e.g. OVS forwarding.  For example, setting up an OVS VXLAN
>  forwarding bridge on sfc as per:
 ...
> So, in all listified-receive handling, instead pull skbs off the lists with
>  skb_list_del_init().
> 
> Fixes: 9af86f933894 ("net: core: fix use-after-free in 
> __netif_receive_skb_list_core")
> Fixes: 7da517a3bc52 ("net: core: Another step of skb receive list processing")
> Fixes: a4ca8b7df73c ("net: ipv4: fix drop handling in ip_list_rcv() and 
> ip_list_rcv_finish()")
> Fixes: d8269e2cbf90 ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
> Signed-off-by: Edward Cree 

Applied and queued up for -stable

> I'm not sure if these are the right Fixes tags, or if I should instead be
>  fingering some commit that made dev_hard_start_xmit() more sensitive to
>  skb->next.
> Also, I only saw a crash from the list_del() in 
> __netif_receive_skb_list_core()
>  but I converted all of them in the listified RX path, in case any others
>  have similar ways to escape into paths that care about skb->next.

I think we should use skb_list_del_init() in all cases of skb->list removal except
where we immediately queue it onto another list in a trivially auditable
way.

Therefore I think what you did is the way to go.

Thanks.
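
For context, the difference between the two helpers is small but crucial for
any code that later inspects skb->next. A simplified sketch of what
skb_list_del_init() does (close to, but not necessarily identical to, the
real inline helper):

    static void skb_list_del_init_sketch(struct sk_buff *skb)
    {
            __list_del_entry(&skb->list);   /* unlink from the RX sublist */
            skb->next = NULL;               /* a plain list_del() would leave
                                             * LIST_POISON here instead */
    }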


Re: [PATCH net] macvlan: remove duplicate check

2018-12-05 Thread David Miller
From: Matteo Croce 
Date: Tue,  4 Dec 2018 18:05:42 +0100

> Following commit 59f997b088d2 ("macvlan: return correct error value"),
> there is a duplicate check for mac addresses both in macvlan_sync_address()
> and macvlan_set_mac_address().
> As the former calls the latter, remove the one in macvlan_set_mac_address()
> and move the one in macvlan_sync_address() before any other check.
> 
> Signed-off-by: Matteo Croce 

Hmmm, doesn't this change behavior?

For the handling of the NETDEV_CHANGEADDR event in macvlan_device_event()
we would make it to macvlan_sync_address(), and if IFF_UP is false,
we would elide the macvlan_addr_busy() check and just copy the MAC address
over and return.

Now, we would always perform the macvlan_addr_busy() check.

Please, if this is OK, explain and document this behavioral change in
the commit message.

Thank you.


Re: [PATCH 0/3] net: macb: DMA race condition fixes

2018-12-05 Thread David Miller
From: Anssi Hannula 
Date: Fri, 30 Nov 2018 20:21:34 +0200

> Here are a couple of race condition fixes for the macb driver. The first
> two are issues observed on real HW.

It looks like there is still an active discussion about the memory
barriers in patch #3 being excessive.

Once that is sorted out to everyone's satisfaction, would you
please repost this series with appropriate ACKs, reviewed-by's,
tested-by's, etc. added?

Thank you.


Re: [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA

2018-12-05 Thread David Miller
From: Anssi Hannula 
Date: Fri, 30 Nov 2018 20:21:35 +0200

> @@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct 
> macb_dma_desc *desc, dma_addr_
>   if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
>   desc_64 = macb_64b_desc(bp, desc);
>   desc_64->addrh = upper_32_bits(addr);
> + /* The low bits of RX address contain the RX_USED bit, clearing
> +  * of which allows packet RX. Make sure the high bits are also
> +  * visible to HW at that point.
> +  */
> + dma_wmb();
>   }

I agree with that dma_wmb() is what should be used here.

We are ordering CPU stores with DMA visibility, which is exactly what
the dma_*() are for.

If it doesn't work properly on some architecture's implementation of dma_*(),
those should be fixed rather than papering over it in the drivers.


Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-05 Thread David Miller
From: Jouke Witteveen 
Date: Wed, 5 Dec 2018 14:50:31 +0100

> For example, I maintain a network manager that delegates the actual
> networking work to specialized programs.

Basically "I've implemented things using separate programs"

> Basically, it is an implementation of network manager logic in shell
> script. For such a shell script, it is easy to respond to uevents
> (via udev, or alternatives), but responding to rtnetlink messages
> would require a separate program.

And "In order to use rtnetlink I'll need a separate program!"

(╯°□°)╯︵ ┻━┻

So it's ok to use the separate program paradigm for dividing up
the tasks, but not for processing events?

I'm not convinced.

Either use the facility we have or extend it to fill a valid missing
need.

I'm not applying these patches, your logic doesn't add up and it's
inconsistent with our clear goals of not duplicating functionality.


Re: [PATCH bpf-next 2/7] ppc: bpf: implement jitting of BPF_ALU | BPF_ARSH | BPF_*

2018-12-05 Thread David Miller
From: Jiong Wang 
Date: Wed, 05 Dec 2018 11:28:32 +

> Indeed. Double-checked the ISA doc: "Bit 32 of RS is replicated to fill
> RA0:31."
> 
> Will fix both places in v2.

See, sparc64 isn't so weird :-)


Re: [PATCH net-next] tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT

2018-12-04 Thread David Miller
From: Eric Dumazet 
Date: Tue,  4 Dec 2018 07:58:17 -0800

> TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
> as a step to enable bigger tcp sndbuf limits.
> 
> It works reasonably well, but the following happens :
> 
> Once the limit is reached, TCP stack generates
> an [E]POLLOUT event for every incoming ACK packet.
> 
> This causes a high number of context switches.
> 
> This patch implements the strategy David Miller added
> in sock_def_write_space() :
> 
>  - If TCP socket has a notsent_lowat constraint of X bytes,
>allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
>only if number of notsent bytes is below X/2
> 
> This considerably reduces TCP_NOTSENT_LOWAT overhead,
> while allowing to keep the pipe full.
 ...
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 

Applied, thanks Eric.
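
The X versus X/2 hysteresis can be pictured as: sendmsg() is allowed to queue
data while the unsent backlog is below the limit, but [E]POLLOUT is only
generated once the backlog has drained to half of it. A standalone sketch
(the struct and field names are illustrative, not struct tcp_sock):

    #include <stdbool.h>
    #include <stdint.h>

    struct tcp_write_state {
            uint32_t write_seq;     /* last byte queued by the application */
            uint32_t snd_nxt;       /* next byte to be sent */
    };

    static bool can_queue_more(const struct tcp_write_state *tp, uint32_t lowat)
    {
            return tp->write_seq - tp->snd_nxt < lowat;       /* fill up to X */
    }

    static bool should_signal_pollout(const struct tcp_write_state *tp,
                                      uint32_t lowat)
    {
            return tp->write_seq - tp->snd_nxt < lowat / 2;   /* wake below X/2 */
    }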


Re: [PATCH v2 2/2] net: mvpp2: fix phylink handling of invalid PHY modes

2018-12-04 Thread David Miller
From: Baruch Siach 
Date: Tue,  4 Dec 2018 16:03:53 +0200

> The .validate phylink callback should empty the supported bitmap when
> the interface mode is invalid.
> 
> Cc: Maxime Chevallier 
> Cc: Antoine Tenart 
> Reported-by: Russell King 
> Signed-off-by: Baruch Siach 

Applied.


Re: [PATCH v2 1/2] net: mvpp2: fix detection of 10G SFP modules

2018-12-04 Thread David Miller
From: Baruch Siach 
Date: Tue,  4 Dec 2018 16:03:52 +0200

> The mvpp2_phylink_validate() relies on the interface field of
> phylink_link_state to determine valid link modes. However, when called
> from phylink_sfp_module_insert() this field in not initialized. The
> default switch case then excludes 10G link modes. This allows 10G SFP
> modules that are detected correctly to be configured at max rate of
> 2.5G.
> 
> Catch the uninitialized PHY mode case, and allow 10G rates.
> 
> Fixes: d97c9f4ab000b ("net: mvpp2: 1000baseX support")
> Cc: Maxime Chevallier 
> Cc: Antoine Tenart 
> Acked-by: Russell King 
> Signed-off-by: Baruch Siach 

Applied.


Re: [PATCH v2 net-next] ip6_tunnel: Adding support of mapping rules for MAP-E tunnel

2018-12-04 Thread David Miller
From: Felix Jia 
Date: Mon,  3 Dec 2018 16:39:31 +1300

> +int
> +ip6_get_addrport(struct iphdr *iph, __be32 *saddr4, __be32 *daddr4,
> +  __be16 *sport4, __be16 *dport4, __u8 *proto, int *icmperr)
> +{

This looks like something the flow dissector can do already, please look into
utilizing that common piece of infrastructure instead of reimplementing it.

> + u8 *ptr;
> + struct iphdr *icmpiph = NULL;
> + struct tcphdr *tcph, *icmptcph;
> + struct udphdr *udph, *icmpudph;
> + struct icmphdr *icmph, *icmpicmph;

Please always order local variables from longest to shortest line.

Please audit your entire submission for this problem.

> +static struct ip6_tnl_rule *ip6_tnl_rule_find(struct net_device *dev,
> +   __be32 _dst)
> +{
> + u32 dst = ntohl(_dst);
> + struct ip6_rule_list *pos = NULL;
> + struct ip6_tnl *t = netdev_priv(dev);
> +
> + list_for_each_entry(pos, &t->rules.list, list) {
> + int mask =
> + 0xffffffff ^ ((1 << (32 - pos->data.ipv4_prefixlen)) - 1);
> + if ((dst & mask) == ntohl(pos->data.ipv4_subnet.s_addr))
> + return &pos->data;
> + }
> + return NULL;
> +}

How will this scale with large numbers of rules?

This rule facility seems to be designed in a way that sophisticated
(at least as fast as "O(log N)") lookup schemes aren't even possible,
and that even worse the ordering matters.


Re: [PATCH net-next V2 0/2] net/sched: act_tunnel_key: support key-less tunnels

2018-12-04 Thread David Miller
From: Or Gerlitz 
Date: Sun,  2 Dec 2018 14:55:19 +0200

> This short series from Adi Nissim allows to support key-less tunnels
> by the tc tunnel key actions, which is needed for some GRE use-cases.
> 
> changes from V0:
>  - addresses build warning spotted by kbuild, make sure to always init
>to zero the tunnel key

Series applied to net-next, thank you.


Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-04 Thread David Miller
From: Jouke Witteveen 
Date: Sat, 1 Dec 2018 17:00:21 +0100

> Make it easy for userspace to respond to acquisition/loss of carrier.
> The uevent is picked up by udev and, on systems with systemd, the
> device unit of the interface announces a configuration reload.
> 
> Signed-off-by: Jouke Witteveen 
> ---
> I did not want to change the commit message into a systemd-howto, but
> subscribing to udev events can be done through a line like
> ReloadPropagatedFrom=sys-subsystem-net-devices-%i.device
> in a systemd unit file.

I want to hear more about "why".

If we have the rtnetlink message that can be listened for, userspace
ought to use that.  That's what it is there for.


Re: [PATCH net] rtnetlink: ndo_dflt_fdb_dump() only work for ARPHRD_ETHER devices

2018-12-04 Thread David Miller
From: Eric Dumazet 
Date: Tue,  4 Dec 2018 09:40:35 -0800

> kmsan was able to trigger a kernel-infoleak using a gre device [1]
> 
> nlmsg_populate_fdb_fill() has a hard coded assumption
> that dev->addr_len is ETH_ALEN, as normally guaranteed
> for ARPHRD_ETHER devices.
> 
> A similar issue was fixed recently in commit da71577545a5
> ("rtnetlink: Disallow FDB configuration for non-Ethernet device")
 ...
> Fixes: d83b06036048 ("net: add fdb generic dump routine")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.


Re: [PATCH net-next v2 0/3] net: bridge: convert multicast to generic rhashtable

2018-12-04 Thread David Miller
From: Nikolay Aleksandrov 
Date: Wed, 5 Dec 2018 01:45:16 +0200

> On a related note I saw Paul's call_rcu patches hit, so I'll wait for those
> to go in and will rebase on top of them before sending the v3 as the bridge
> change will have a conflict with this set.

They aren't going in via my tree, so I wouldn't wait for that before
you respin.


Re: [RFC bpf-next 1/7] bpf: interpreter support BPF_ALU | BPF_ARSH

2018-12-04 Thread David Miller
From: Alexei Starovoitov 
Date: Tue, 4 Dec 2018 12:16:04 -0800

> You already did :)

Amazing, I'll take the rest of the day off, thanks! :)


Re: [RFC bpf-next 1/7] bpf: interpreter support BPF_ALU | BPF_ARSH

2018-12-04 Thread David Miller
From: Jiong Wang 
Date: Tue, 4 Dec 2018 20:14:11 +

> On 04/12/2018 20:10, David Miller wrote:
>> From: Alexei Starovoitov 
>> Date: Tue, 4 Dec 2018 11:29:55 -0800
>>
>>> I guess sparc doesn't really have 32 subregisters.  All registers
>>> are considered 64-bit. It has 32-bit alu ops on 64-bit registers
>>> instead.
>> Right.
>>
>> Anyways, sparc will require two instructions because of this, the
>> 'sra' then a 'srl' by zero bits to clear the top 32-bits.
>>
>> I'll code up the sparc JIT part when this goes in.
> 
> Hmm, I had been going through all JIT backends, and saw there is
> do_alu32_trunc after jitting sra for BPF_ALU. That's what's needed?

Yes, it clears the top 32-bits of a register after a 32-bit
ALU op because BPF's semantics require it.

In fact, we call it too much, we even call it for 32-bit
shift right instructions which automatically clear those
top bits.  I've been meaning to optimize that.

Meanwhile, again the answer to your question is yes.


Re: [RFC bpf-next 1/7] bpf: interpreter support BPF_ALU | BPF_ARSH

2018-12-04 Thread David Miller
From: Alexei Starovoitov 
Date: Tue, 4 Dec 2018 11:29:55 -0800

> I guess sparc doesn't really have 32 subregisters.  All registers
> are considered 64-bit. It has 32-bit alu ops on 64-bit registers
> instead.

Right.

Anyways, sparc will require two instructions because of this, the
'sra' then a 'srl' by zero bits to clear the top 32-bits.

I'll code up the sparc JIT part when this goes in.


Re: [PATCH net-next] net: netem: use a list in addition to rbtree

2018-12-04 Thread David Miller
From: Peter Oskolkov 
Date: Tue, 4 Dec 2018 11:10:55 -0800

> Thanks, Stephen!
> 
> I don't care much about braces either. David, do you want me to send a
> new patch with braces moved around?

Single statement basic blocks definitely must not have curly braces,
please remove them and repost.

Thank you.


Re: [PATCH net-next 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-04 Thread David Miller
From: Paolo Abeni 
Date: Tue, 04 Dec 2018 12:27:51 +0100

> On Mon, 2018-12-03 at 10:04 -0800, Eric Dumazet wrote:
>> On 12/03/2018 03:40 AM, Paolo Abeni wrote:
>> > This header defines a bunch of helpers that allow avoiding the
>> > retpoline overhead when calling builtin functions via function pointers.
>> > It boils down to explicitly comparing the function pointers to
>> > known builtin functions and, on a match, invoking the latter directly.
>> > 
>> > The macros defined here implement the boilerplate for the above schema
>> > and will be used by the next patches.
>> > 
>> > rfc -> v1:
>> >  - use branch prediction hint, as suggested by Eric
>> > 
>> > Suggested-by: Eric Dumazet 
>> > Signed-off-by: Paolo Abeni 
>> > ---
>> >  include/linux/indirect_call_wrapper.h | 77 +++
>> >  1 file changed, 77 insertions(+)
>> >  create mode 100644 include/linux/indirect_call_wrapper.h
>> 
>> This needs to be discussed more broadly, please include lkml 
> 
> Agreed. @David: please let me know if you prefer a repost or a v2 with
> the expanded recipients list.

v2 probably works better and will help me better keep track of things.

Thanks for asking.
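
The core trick of those wrappers is to compare the function pointer against a
known builtin and call the builtin directly on a match, so the hot path gets
a direct (predictable, retpoline-free) call. A simplified sketch of the macro
shape; the merged header is the authoritative version:

    /* Sketch of the idea behind INDIRECT_CALL_1(). */
    #define INDIRECT_CALL_1_SKETCH(f, builtin, ...)                 \
            (likely((f) == (builtin)) ? (builtin)(__VA_ARGS__)      \
                                      : (f)(__VA_ARGS__))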


Re: [RFC bpf-next 1/7] bpf: interpreter support BPF_ALU | BPF_ARSH

2018-12-04 Thread David Miller
From: Jiong Wang 
Date: Tue,  4 Dec 2018 04:56:29 -0500

> This patch implements interpreting BPF_ALU | BPF_ARSH. Do arithmetic right
> shift on low 32-bit sub-register, and zero the high 32 bits.
> 
> Reviewed-by: Jakub Kicinski 
> Signed-off-by: Jiong Wang 

I just want to say that this behavior is interesting because on most
cpus that have a 32-bit and 64-bit variant, the 32-bit arithmetic
right shift typically sign extends to 64-bit rather than zero extends,
which is what is being defined here.

Well, definitely, sparc64 behaves this way.
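
In interpreter terms, the semantics being defined are "arithmetic right shift
of the low 32 bits, then zero-extend into the full 64-bit register", which is
exactly why sparc64 needs the extra shift to clear the top half. A tiny
standalone illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* BPF_ALU | BPF_ARSH on a 64-bit register file: sign-fill within the
     * low 32 bits (arithmetic shift on typical compilers), then zero the
     * high 32 bits. */
    static uint64_t alu32_arsh(uint64_t dst, unsigned int shift)
    {
            int32_t lo = (int32_t)dst;          /* 32-bit subregister */

            return (uint32_t)(lo >> shift);     /* zero-extended result */
    }

    int main(void)
    {
            /* prints 0xf8000000: low word sign-extended, upper word cleared */
            printf("0x%llx\n",
                   (unsigned long long)alu32_arsh(0xffffffff80000000ULL, 4));
            return 0;
    }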


Re: [PATCH net-next 0/4] mlxsw: Add one-armed router support

2018-12-04 Thread David Miller
From: Ido Schimmel 
Date: Tue, 4 Dec 2018 08:15:09 +

> Up until now, when a packet was routed by the ASIC through the same
> router interface (RIF) from which it ingressed from, the ASIC passed the
> sole copy of the packet to the kernel. This allowed the kernel to route
> the packet and also potentially generate an ICMP redirect.
> 
> There are scenarios (e.g., "one-armed router") where packets are
> intentionally routed this way and are therefore not deemed as
> exceptions. In such scenarios the current method of trapping packets to
> the CPU is problematic, as it results in major packet loss.
> 
> This patchset solves the problem by having the ASIC forward the packet,
> but also send a copy to the CPU, which gives the kernel the opportunity
> to generate required exceptions.
> 
> To prevent the kernel from forwarding such packets again, the driver
> marks them with 'offload_l3_fwd_mark', which causes the kernel to
> consume them in ip{,6}_forward_finish().
> 
> Patch #1 renames 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark'. When
> set, the field indicates that a packet was already forwarded in L3
> (unicast / multicast) by a capable device.
> 
> Patch #2 teaches the kernel to consume unicast packets that have
> 'offload_l3_fwd_mark' set.
> 
> Patch #3 changes mlxsw to mirror loopbacked (iRIF == eRIF) packets,
> instead of trapping them.
> 
> Patch #4 adds a test case for above mentioned scenario.

Series applied, thank you.
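
The kernel-side half of this (patch #2) amounts to consuming already-forwarded
unicast packets in the forwarding finish path. Roughly (a sketch, modulo the
exact config guards):

    /* Sketch: in ip_forward_finish()/ip6_forward_finish(), packets that a
     * capable device already forwarded in hardware are consumed instead of
     * being forwarded a second time by the kernel. */
    if (unlikely(skb->offload_l3_fwd_mark)) {
            consume_skb(skb);
            return 0;
    }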


Re: consistency for statistics with XDP mode

2018-12-03 Thread David Miller
From: David Ahern 
Date: Mon, 3 Dec 2018 17:15:03 -0700

> So, instead of a program tag which the program writer controls, how
> about some config knob that an admin controls that says at attach time
> use standard stats?

How about, instead of replacing it is in addition to, and admin can
override?

I'm all for choice so how can I object? :)


Re: [PATCH net-next 0/2] mlx4_core cleanups

2018-12-03 Thread David Miller
From: Tariq Toukan 
Date: Sun,  2 Dec 2018 17:40:24 +0200

> This patchset by Erez contains cleanups to the mlx4_core driver.
> 
> Patch 1 replaces -EINVAL with -EOPNOTSUPP for unsupported operations.
> Patch 2 fixes some coding style issues.
> 
> Series generated against net-next commit:
> 97e6c858a26e net: usb: aqc111: Initialize wol_cfg with memset in 
> aqc111_suspend

Series applied, thanks.


Re: [PATCH net-next v2] net: phy: Also request modules for C45 IDs

2018-12-03 Thread David Miller
From: Jose Abreu 
Date: Sun,  2 Dec 2018 16:33:14 +0100

> The logic of phy_device_create() requests PHY modules according to the PHY
> ID, but for C45 PHYs we use a different field for the IDs.
> 
> Let's also request the modules for these IDs.
> 
> Changes from v1:
> - Only request C22 modules if C45 are not present (Andrew)
> 
> Signed-off-by: Jose Abreu 

Applied, thanks Jose.

Florian, just for the record, I actually like the changelogs to be in
the commit messages.  It can help people understand that something
was deliberately implemented a certain way and alternative approaches
were considered.


Re: [PATCH net-next v2 00/14] octeontx2-af: NIX and NPC enhancements

2018-12-03 Thread David Miller
From: Jerin Jacob 
Date: Sun,  2 Dec 2018 18:17:35 +0530

> This patchset is a continuation of the earlier submitted four patch
> series to add a new driver for Marvell's OcteonTX2 SOC's
> Resource virtualization unit (RVU) admin function driver.
> 
> 1. octeontx2-af: Add RVU Admin Function driver
>https://www.spinics.net/lists/netdev/msg528272.html
> 2. octeontx2-af: NPA and NIX blocks initialization
>https://www.spinics.net/lists/netdev/msg529163.html
> 3. octeontx2-af: NPC parser and NIX blocks initialization
>https://www.spinics.net/lists/netdev/msg530252.html
> 4. octeontx2-af: NPC MCAM support and FLR handling
>https://www.spinics.net/lists/netdev/msg534392.html
> 
> This patch series adds support for below
> 
> NPC block:
> - Add NPC(mkex) profile support for various Key extraction configurations
> 
> NIX block:
> - Enable dynamic RSS flow key algorithm configuration
> - Enhancements on Rx checksum and error checks
> - Add support for Tx packet marking support
> - TL1 schedule queue allocation enhancements
> - Add LSO format configuration mbox
> - VLAN TPID configuration
> - Skip multicast entry init for broadcast tables
 ...

Series applied, thanks.


Re: [PATCH net 0/2] mlx4 fixes for 4.20-rc

2018-12-03 Thread David Miller
From: Tariq Toukan 
Date: Sun,  2 Dec 2018 14:34:35 +0200

> This patchset includes small fixes for the mlx4_en driver.
> 
> First patch by Eran fixes the value used to init the netdevice's
> min_mtu field.
> Please queue it to -stable >= v4.10.
> 
> Second patch by Saeed adds missing Kconfig build dependencies.
> 
> Series generated against net commit:
> 35b827b6d061 tun: forbid iface creation with rtnl ops

Series applied and patch #1 queued up for -stable, thanks.


Re: [PATCH net] macvlan: return correct error value

2018-12-03 Thread David Miller
From: Matteo Croce 
Date: Sat,  1 Dec 2018 00:26:27 +0100

> A MAC address must be unique among all the macvlan devices with the same
> lower device. The only exception is the passthru [sic] mode,
> which shares the lower device address.
> 
> When duplicate addresses are detected, EBUSY is returned when bringing
> the interface up:
> 
> # ip link add macvlan0 link eth0 type macvlan
> # read addr < /sys/class/net/eth0/address
> # ip link set macvlan0 address $addr
> # ip link set macvlan0 up
> RTNETLINK answers: Device or resource busy
> 
> Use correct error code which is EADDRINUSE, and do the check also
> earlier, on address change:
> 
> # ip link set macvlan0 address $addr
> RTNETLINK answers: Address already in use
> 
> Signed-off-by: Matteo Croce 

Applied, thanks Matteo.


Re: consistency for statistics with XDP mode

2018-12-03 Thread David Miller
From: Toke Høiland-Jørgensen 
Date: Mon, 03 Dec 2018 22:00:32 +0200

> I wonder if it would be possible to support both the "give me user
> normal stats" case and the "let me do whatever I want" case by a
> combination of userspace tooling and maybe a helper or two?
> 
> I.e., create a "do_stats()" helper (please pick a better name), which
> will either just increment the desired counters, or set a flag so the
> driver can do it at napi poll exit. With this, the userspace tooling
> could have a "--give-me-normal-stats" switch (or some other interface),
> which would inject a call instruction to that helper at the start of the
> program.
> 
> This would enable the normal counters in a relatively painless way,
> while still letting people opt out if they don't want to pay the cost in
> terms of overhead. And having the userspace tooling inject the helper
> call helps support the case where the admin didn't write the XDP
> programs being loaded.
> 
> Any reason why that wouldn't work?

I think this is a good idea, or even an attribute tag that gets added
to the XDP program that controls stats handling.


Re: [PATCH net-next v4 0/3] udp msg_zerocopy

2018-12-03 Thread David Miller
From: Willem de Bruijn 
Date: Fri, 30 Nov 2018 15:32:38 -0500

> Enable MSG_ZEROCOPY for udp sockets

Series applied, thanks for keeping up with this.


Re: [PATCH 0/3] net: macb: DMA race condition fixes

2018-12-03 Thread David Miller
From: 
Date: Mon, 3 Dec 2018 08:26:52 +

> Can you please delay a bit the acceptance of this series, I would like 
> that we assess these findings with tests on our hardware before applying 
> them.

Sure.


Re: [PATCH net] sctp: kfree_rcu asoc

2018-12-03 Thread David Miller
From: Xin Long 
Date: Sat,  1 Dec 2018 01:36:59 +0800

> In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
> a transport's asoc under rcu_read_lock while asoc is freed not after
> a grace period, which leads to a use-after-free panic.
> 
> This patch fixes it by calling kfree_rcu to make asoc be freed after
> a grace period.
> 
> Note that only the asoc's memory is delayed to free in the patch, it
> won't cause sk to linger longer.
> 
> Thanks Neil and Marcelo to make this clear.
> 
> Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport 
> rhashtable")
> Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new 
> transport")
> Reported-by: syzbot+0b05d8aa7cb185107...@syzkaller.appspotmail.com
> Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
> Suggested-by: Neil Horman 
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks.


Re: [PATCH net] net/ibmvnic: Fix RTNL deadlock during device reset

2018-12-03 Thread David Miller
From: Thomas Falcon 
Date: Fri, 30 Nov 2018 10:59:08 -0600

> Commit a5681e20b541 ("net/ibmnvic: Fix deadlock problem
> in reset") made the change to hold the RTNL lock during
> driver reset but still calls netdev_notify_peers, which
> results in a deadlock. Instead, use call_netdevice_notifiers,
> which is functionally the same except that it does not
> take the RTNL lock again.
> 
> Fixes: a5681e20b541 ("net/ibmnvic: Fix deadlock problem in reset")
> 
> Signed-off-by: Thomas Falcon 

Applied.


Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-12-03 Thread David Miller
From: Ido Schimmel 
Date: Fri, 30 Nov 2018 19:00:24 +0200

> Yes, agree. Patch is good. I'll tag your v2.

This means, I assume, that a new version of this fix is coming.

Eric, is this correct?


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-12-03 Thread David Miller
From: Cong Wang 
Date: Tue, 27 Nov 2018 22:10:13 -0800

> When an ethernet frame is padded to meet the minimum ethernet frame
> size, the padding octets are not covered by the hardware checksum.
> Fortunately the padding octets are usually zeros, which don't affect
> checksum. However, we have a switch which pads non-zero octets, this
> causes kernel hardware checksum fault repeatedly.
> 
> Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE 
> are friends"),
> skb checksum is forced to be CHECKSUM_NONE when padding is detected.
> After it, we need to keep skb->csum updated, like what we do for FCS.
> 
> The logic is a bit complicated when dealing with both FCS and padding,
> so I wrap it up in a helper function mlx5e_csum_padding().
> 
> I tested this patch with RXFCS on and off, it works fine without any
> warning in both cases.
> 
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> friends"),
> Cc: Saeed Mahameed 
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 

Saeed, are you going to take care of this?


Re: [PATCH net-next] net: phy: improve generic EEE ethtool functions

2018-12-03 Thread David Miller
From: Heiner Kallweit 
Date: Tue, 27 Nov 2018 22:30:14 +0100

> So far the two functions consider neither member eee_enabled nor
> eee_active. Therefore network drivers have to do this in some kind
> of glue code. I think this can be avoided.
> 
> Getting EEE parameters:
> When not advertising any EEE mode, we can't consider EEE to be enabled.
> Therefore interpret "EEE enabled" as "we advertise at least one EEE
> mode". It's similar with "EEE active": interpret it as "EEE modes
> advertised by both link partner have at least one mode in common".
> 
> Setting EEE parameters:
> If eee_enabled isn't set, don't advertise any EEE mode and restart
> aneg if needed to switch off EEE. If eee_enabled is set and
> data->advertised is empty (e.g. because EEE was disabled), advertise
> everything we support as default. This way EEE can easily be switched
> on/off by doing ethtool --set-eee <iface> eee on/off, w/o any additional
> parameters.
> 
> The changes to both functions shouldn't break any existing user.
> Once the changes have been applied, at least some users can be
> simplified.
> 
> Signed-off-by: Heiner Kallweit 

Applied.
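
The "getting" half of that interpretation reduces to something like the
following sketch (field names are from struct ethtool_eee; this is not the
exact phylib implementation):

        /* "enabled": we advertise at least one EEE mode.
         * "active":  our advertisement and the link partner's intersect. */
        data->eee_enabled = !!data->advertised;
        data->eee_active  = !!(data->advertised & data->lp_advertised);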


Re: [PATCH v7 0/4] Add VRF support for VXLAN underlay

2018-12-03 Thread David Miller
From: Alexis Bauvin 
Date: Mon,  3 Dec 2018 10:54:37 +0100

> We are trying to isolate the VXLAN traffic from different VMs with VRF as 
> shown
> in the schemas below:
 ...
> We faced some issue in the datapath, here are the details:
> 
> * Egress traffic:
> The vxlan packets are sent directly to the default VRF because it's where the
> socket is bound, therefore the traffic has a default route via eth0. The
> workaround is to force this traffic to VRF green with ip rules.
> 
> * Ingress traffic:
> When receiving the traffic on eth0.2030 the vxlan socket is unreachable from
> VRF green. The workaround is to enable *udp_l3mdev_accept* sysctl, but
> this breaks isolation between overlay and underlay: packets sent from
> blue or red by e.g. a guest VM will be accepted by the socket, allowing
> injection of VXLAN packets from the overlay.
> 
> This patch series fixes the issues described above by allowing the VXLAN
> socket to be bound to a specific VRF device, therefore looking up in the
> correct table.

Series applied to net-next, thanks.


Re: [PATCH v2 net] tun: remove skb access after netif_receive_skb

2018-12-03 Thread David Miller
From: Prashant Bhole 
Date: Mon,  3 Dec 2018 18:09:24 +0900

> In tun.c skb->len was accessed while doing stats accounting after a
> call to netif_receive_skb. We can not access skb after this call
> because buffers may be dropped.
> 
> The fix for this bug would be to store skb->len in a local variable and
> then use it after netif_receive_skb(). IMO using the xdp data size for
> accounting bytes is better, because the input to tun_xdp_one() is an
> xdp_buff.
> 
> Hence this patch:
> - fixes a bug by removing skb access after netif_receive_skb()
> - uses xdp data size for accounting bytes
 ...
> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
> Reviewed-by: Toshiaki Makita 
> Signed-off-by: Prashant Bhole 
> Acked-by: Jason Wang 
> ---
> v1 -> v2:
> No change. Reposted due to patchwork status.

Applied, thanks.
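
The pattern being fixed is the usual "don't touch the skb after handing it
to the stack"; roughly (a sketch with illustrative stats names, not the
exact tun.c hunk):

        /* Take the size from the xdp_buff before the skb can go away. */
        unsigned int datasize = xdp->data_end - xdp->data;

        netif_receive_skb(skb);
        /* skb may already be freed here -- do not dereference it. */

        stats->rx_packets++;
        stats->rx_bytes += datasize;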


Re: [PATCH net-next v2 0/3] mlxsw: Add 'fw_load_policy' devlink parameter

2018-12-03 Thread David Miller
From: Ido Schimmel 
Date: Mon, 3 Dec 2018 07:58:57 +

> Shalom says:
> 
> Currently, drivers do not have the ability to control the firmware
> loading policy and they always use their own fixed policy. This prevents
> drivers from running the device with a different firmware version for
> testing and/or debugging purposes. For example, testing a firmware bug
> fix.
> 
> For these situations, the new devlink generic parameter,
> 'fw_load_policy', gives the ability to control this option and allows
> drivers to run with a different firmware version than required by the
> driver.
> 
> Patch #1 adds the new parameter to devlink. The other two patches, #2
> and #3, add support for this parameter in the mlxsw driver.
> 
> Example:
 ...
> iproute2 patches available here:
> https://github.com/tshalom/iproute2-next
> 
> v2:
> * Change 'fw_version_check' to 'fw_load_policy' with values 'driver' and
>   'flash' (Jakub)

Series applied.
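
On the driver side, exposing a generic parameter like this boils down to a
table entry plus registration; a sketch (the validate callback name is
hypothetical, the macro and helper are the existing devlink params API):

static const struct devlink_param mlxsw_devlink_params[] = {
        DEVLINK_PARAM_GENERIC(FW_LOAD_POLICY,
                              BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
                              NULL, NULL,
                              mlxsw_devlink_fw_load_policy_validate),
};

        err = devlink_params_register(devlink, mlxsw_devlink_params,
                                      ARRAY_SIZE(mlxsw_devlink_params));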


Re: [PATCH net] net: phy: don't allow __set_phy_supported to add unsupported modes

2018-12-03 Thread David Miller
From: Heiner Kallweit 
Date: Mon, 3 Dec 2018 08:19:33 +0100

> Currently __set_phy_supported allows adding modes without checking whether
> the PHY supports them. This is wrong; it should never add modes but
> only remove modes we don't want to support.
> 
> The commit marked as fixed didn't do anything wrong, it just copied
> existing functionality to the helper which is being fixed now.
> 
> Fixes: f3a6bd393c2c ("phylib: Add phy_set_max_speed helper")
> Signed-off-by: Heiner Kallweit 
> ---
> This will cause a merge conflict once net is merged into net-next.
> And the fix will need minor tweaking when being backported to
> older kernel versions.

Applied and queued up for -stable.

I'll let you know if I need help with those backports.
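
The rule being enforced is "mask down, never add"; a sketch against the u32
SUPPORTED_* bitmask used in this kernel version (illustrative, not
necessarily the exact hunk):

static void __set_phy_supported(struct phy_device *phydev, u32 max_speed)
{
        switch (max_speed) {
        case SPEED_10:
                phydev->supported &= ~(SUPPORTED_100baseT_Half |
                                       SUPPORTED_100baseT_Full);
                /* fall through */
        case SPEED_100:
                phydev->supported &= ~(SUPPORTED_1000baseT_Half |
                                       SUPPORTED_1000baseT_Full);
                break;
        default:
                break;
        }
}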


Re: [PATCH net-next] net: phy: don't allow __set_phy_supported to add unsupported modes

2018-12-03 Thread David Miller
From: Heiner Kallweit 
Date: Mon, 3 Dec 2018 08:04:57 +0100

> Currently __set_phy_supported allows adding modes without checking whether
> the PHY supports them. This is wrong; it should never add modes but
> only remove modes we don't want to support.
> 
> Signed-off-by: Heiner Kallweit 

Applied.


Re: consistency for statistics with XDP mode

2018-12-03 Thread David Miller
From: David Ahern 
Date: Mon, 3 Dec 2018 08:56:28 -0700

> I do not want Linux + XDP to require custom anything to debug basic
> functionality - such as isolating a packet drop to a specific point.

Definitely we need to let people choose to arrange things that way
if they want to do so.  It is not our place to make that kind
of decision for them.

If they want custom statistics and a custom representation of objects
in their tools, etc.  That is completely fine.

XDP is not about forcing a model of these things upon people.


Re: consistency for statistics with XDP mode

2018-12-03 Thread David Miller
From: David Ahern 
Date: Mon, 3 Dec 2018 08:45:12 -0700

> On 12/1/18 4:22 AM, Jesper Dangaard Brouer wrote:
>> IMHO XDP_DROP should not be accounted as netdev stats drops; this is a
>> user-installed program, like tc/iptables, that can also choose to drop
>> packets.
> 
> Sure, and both tc and iptables have counters that can see the dropped
> packets. A counter in the driver-level stats ("xdp_drop" is fine with
> me).

Part of the problem I have with this kind of logic is we take the choice
away from the XDP program.

If I feel that the xdp_drop counter bump is too much overhead during a
DDoS attack and I want to avoid it, you don't give me a choice in the
matter.

If I want to represent the statistics for that event differently, you
also give me no choice about it.

Really, if XDP_DROP is returned, zero resources should be devoted to
the frame past that point.

I know you want to live in this magical world where XDP stuff behaves
like the existing stack and gives you all of the visibility into events
and objects.

But that is your choice.

Please give others the choice to not live in that world and allow XDP
programs to live in their own entirely different environment, with
custom statistics and complete control over how counters are
incremented and how objects are used and represented, if they choose
to do so.

XDP is about choice.


quick mailing list test...

2018-12-01 Thread David Miller


Tweaking some things on vger, please just ignore.


Re: [PATCH v1 net-next 05/14] octeontx2-af: Restrict TL1 allocation and configuration

2018-12-01 Thread David Miller
From: Jerin Jacob 
Date: Sat,  1 Dec 2018 14:43:54 +0530

> @@ -987,6 +997,76 @@ static void nix_reset_tx_linkcfg(struct rvu *rvu, int 
> blkaddr,
>   NIX_AF_TL3_TL2X_LINKX_CFG(schq, link), 0x00);
>  }
>  
> +static int
> +rvu_get_tl1_schqs(struct rvu *rvu,
> +   int blkaddr,
> +   u16 pcifunc,
> +   u16 *schq_list,
> +   u16 *schq_cnt)

Please pack these lines as densely with arguments as 80 columns will allow.
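
I.e. something along the lines of:

static int rvu_get_tl1_schqs(struct rvu *rvu, int blkaddr, u16 pcifunc,
                             u16 *schq_list, u16 *schq_cnt)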


Re: [PATCH v1 net-next 02/14] octeontx2-af: Add response for RSS flow key cfg message

2018-12-01 Thread David Miller
From: Jerin Jacob 
Date: Sat,  1 Dec 2018 14:43:51 +0530

> +#define FLOW_KEY_TYPE_PORT   BIT(0)
> +#define FLOW_KEY_TYPE_IPV4   BIT(1)
> +#define FLOW_KEY_TYPE_IPV6   BIT(2)
> +#define FLOW_KEY_TYPE_TCP    BIT(3)
> +#define FLOW_KEY_TYPE_UDP    BIT(4)
> +#define FLOW_KEY_TYPE_SCTP   BIT(5)

This is asking for serious global namespace collisions, as we have
various FLOW_* definitions coming from include/uapi/linux/pkt_cls.h
for example.

Please put some appropriate driver prefix to these macro names.

Thanks.
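
For example (prefix chosen purely for illustration):

#define NIX_FLOW_KEY_TYPE_PORT    BIT(0)
#define NIX_FLOW_KEY_TYPE_IPV4    BIT(1)
#define NIX_FLOW_KEY_TYPE_IPV6    BIT(2)
#define NIX_FLOW_KEY_TYPE_TCP     BIT(3)
#define NIX_FLOW_KEY_TYPE_UDP     BIT(4)
#define NIX_FLOW_KEY_TYPE_SCTP    BIT(5)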



Re: [Bug 201829] New: Failed build kernel with nat

2018-12-01 Thread David Miller


Stephen, please actually read the reports you forward here.

This is a report about arch/x86/lib/inat.c which is a set of CPU
instruction attributes and has nothing to do with networking.


Re: [PATCH v2 bpf-next 0/4] bpf: Improve verifier test coverage on sparc64 et al.

2018-11-30 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 30 Nov 2018 21:55:02 -0800

> The patch 2 didn't apply as-is to bpf-next, since I applied your
> earlier fix "bpf: Fix verifier log string check for bad alignment"
> to bpf tree.  So I applied that fix to bpf-next as well and then
> pushed your series on top.

That seems best, thanks for resolving this.


[PATCH v2 bpf-next 4/4] bpf: Apply F_NEEDS_EFFICIENT_UNALIGNED_ACCESS to more ACCEPT test cases.

2018-11-30 Thread David Miller


If a testcase has alignment problems but is expected to be ACCEPT,
verify it using F_NEEDS_EFFICIENT_UNALIGNED_ACCESS too.

Maybe in the future if we add some architecture specific code to elide
the unaligned memory access warnings during the test, we can execute
these as well.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 5d97fb8..beec10b 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3886,6 +3886,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test21 (x += pkt_ptr, 2)",
@@ -3911,6 +3912,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test22 (x += pkt_ptr, 3)",
@@ -3941,6 +3943,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test23 (x += pkt_ptr, 4)",
@@ -3993,6 +3996,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test25 (marking on <, good access)",
@@ -7675,6 +7679,7 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.retval = 0 /* csum_diff of 64-byte packet */,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"helper access to variable memory: size = 0 not allowed on NULL 
(!ARG_PTR_TO_MEM_OR_NULL)",
@@ -9637,6 +9642,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data' > pkt_end, bad access 1",
@@ -9808,6 +9814,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end < pkt_data', bad access 1",
@@ -9920,6 +9927,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end >= pkt_data', bad access 1",
@@ -9977,6 +9985,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data' <= pkt_end, bad access 1",
@@ -10089,6 +10098,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_meta' > pkt_data, bad access 1",
@@ -10260,6 +10270,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data < pkt_meta', bad access 1",
@@ -10372,6 +10383,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data >= pkt_meta', bad access 1",
@@ -10429,6 +10441,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_meta' <= pkt_data, bad access 1",
@@ -12348,6 +12361,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
.retval = 1,
+   .flags = 

[PATCH v2 bpf-next 3/4] bpf: Make more use of 'any' alignment in test_verifier.c

2018-11-30 Thread David Miller


Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS in more tests where the
expected result is REJECT.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 44 +
 1 file changed, 44 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 428a84d..5d97fb8 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -1823,6 +1823,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid bpf_context access",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SK_MSG,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet read for SK_MSG",
@@ -2187,6 +2188,8 @@ static struct bpf_test tests[] = {
},
.errstr = "invalid bpf_context access",
.result = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"check cb access: half, wrong type",
@@ -3249,6 +3252,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R0 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"raw_stack: skb_load_bytes, spilled regs corruption 2",
@@ -3279,6 +3283,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R3 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"raw_stack: skb_load_bytes, spilled regs + data",
@@ -3778,6 +3783,7 @@ static struct bpf_test tests[] = {
.errstr = "R2 invalid mem access 'inv'",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test16 (arith on data_end)",
@@ -3961,6 +3967,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = REJECT,
.errstr = "invalid access to packet, off=0 size=8, 
R5(id=1,off=0,r=0)",
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test24 (x += pkt_ptr, 5)",
@@ -5117,6 +5124,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 
size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"invalid cgroup storage access 5",
@@ -5233,6 +5241,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 
size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"invalid per-cpu cgroup storage access 5",
@@ -7149,6 +7158,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid mem access 'inv'",
.result = REJECT,
.result_unpriv = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"map element value illegal alu op, 5",
@@ -7171,6 +7181,7 @@ static struct bpf_test tests[] = {
.fixup_map_hash_48b = { 3 },
.errstr = "R0 invalid mem access 'inv'",
.result = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"map element value is preserved across register spilling",
@@ -9663,6 +9674,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end > pkt_data', good access",
@@ -9701,6 +9713,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end > pkt_data', bad access 2",
@@ -9719,6 +9732,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,

[PATCH v2 bpf-next 2/4] bpf: Adjust F_NEEDS_EFFICIENT_UNALIGNED_ACCESS handling in test_verifier.c

2018-11-30 Thread David Miller


Make it set the flag argument to bpf_verify_program() which will relax
the alignment restrictions.

Now all such test cases will go properly through the verifier even on
inefficient unaligned access architectures.

On inefficient unaligned access architectures do not try to run such
programs, instead mark the test case as passing but annotate the
result similarly to how it is done now in the presence of this flag.

So, we get complete full coverage for all REJECT test cases, and at
least verifier level coverage for ACCEPT test cases.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 42 -
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 4224133..428a84d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -14200,13 +14200,14 @@ static int set_admin(bool admin)
 static void do_test_single(struct bpf_test *test, bool unpriv,
   int *passes, int *errors)
 {
-   int fd_prog, expected_ret, reject_from_alignment;
+   int fd_prog, expected_ret, alignment_prevented_execution;
int prog_len, prog_type = test->prog_type;
struct bpf_insn *prog = test->insns;
int map_fds[MAX_NR_MAPS];
const char *expected_err;
uint32_t expected_val;
uint32_t retval;
+   __u32 pflags;
int i, err;
 
for (i = 0; i < MAX_NR_MAPS; i++)
@@ -14217,9 +14218,12 @@ static void do_test_single(struct bpf_test *test, bool 
unpriv,
do_test_fixup(test, prog_type, prog, map_fds);
prog_len = probe_filter_length(prog);
 
-   fd_prog = bpf_verify_program(prog_type, prog, prog_len,
-test->flags & F_LOAD_WITH_STRICT_ALIGNMENT 
?
-BPF_F_STRICT_ALIGNMENT : 0,
+   pflags = 0;
+   if (test->flags & F_LOAD_WITH_STRICT_ALIGNMENT)
+   pflags |= BPF_F_STRICT_ALIGNMENT;
+   if (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)
+   pflags |= BPF_F_ANY_ALIGNMENT;
+   fd_prog = bpf_verify_program(prog_type, prog, prog_len, pflags,
 "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 1);
 
expected_ret = unpriv && test->result_unpriv != UNDEF ?
@@ -14229,28 +14233,27 @@ static void do_test_single(struct bpf_test *test, 
bool unpriv,
expected_val = unpriv && test->retval_unpriv ?
   test->retval_unpriv : test->retval;
 
-   reject_from_alignment = fd_prog < 0 &&
-   (test->flags & 
F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) &&
-   strstr(bpf_vlog, "misaligned");
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-   if (reject_from_alignment) {
-   printf("FAIL\nFailed due to alignment despite having efficient 
unaligned access: '%s'!\n",
-  strerror(errno));
-   goto fail_log;
-   }
-#endif
+   alignment_prevented_execution = 0;
+
if (expected_ret == ACCEPT) {
-   if (fd_prog < 0 && !reject_from_alignment) {
+   if (fd_prog < 0) {
printf("FAIL\nFailed to load prog '%s'!\n",
   strerror(errno));
goto fail_log;
}
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   if (fd_prog >= 0 &&
+   (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)) {
+   alignment_prevented_execution = 1;
+   goto test_ok;
+   }
+#endif
} else {
if (fd_prog >= 0) {
printf("FAIL\nUnexpected success to load!\n");
goto fail_log;
}
-   if (!strstr(bpf_vlog, expected_err) && !reject_from_alignment) {
+   if (!strstr(bpf_vlog, expected_err)) {
printf("FAIL\nUnexpected error message!\n\tEXP: 
%s\n\tRES: %s\n",
  expected_err, bpf_vlog);
goto fail_log;
@@ -14278,9 +14281,12 @@ static void do_test_single(struct bpf_test *test, bool 
unpriv,
goto fail_log;
}
}
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+test_ok:
+#endif
(*passes)++;
-   printf("OK%s\n", reject_from_alignment ?
-  " (NOTE: reject due to unknown alignment)" : "");
+   printf("OK%s\n", alignment_prevented_execution ?
+  " (NOTE: not executed due to unknown alignment)" : "");
 close_fds:
close(fd_prog);
for (i = 0; i < MAX_NR_MAPS; i++)
-- 
2.1.2.532.g19b5d50



[PATCH v2 bpf-next 0/4] bpf: Improve verifier test coverage on sparc64 et al.

2018-11-30 Thread David Miller


On sparc64 a ton of test cases in test_verifier.c fail because
the memory accesses in the test case are unaligned (or cannot
be proven to be aligned by the verifier).

Perhaps we can eventually try to (carefully) modify each test case
which has this problem to not use unaligned accesses but:

1) That is delicate work.

2) The changes might not fully respect the original
   intention of the testcase.

3) In some cases, such a transformation might not even
   be feasible at all.

So add an "any alignment" flag to tell the verifier to forcefully
disable its alignment checks completely.

test_verifier.c is then annotated to use this flag when necessary.

The presence of the flag in each test case is good documentation to
anyone who wants to actually tackle the job of eliminating the
unaligned memory accesses in the test cases.

I've also seen several weird things in test cases, like trying to
access __skb->mark in a packet buffer.

This gets rid of 104 test_verifier.c failures on sparc64.

Changes since v1:

1) Explain the new BPF_PROG_LOAD flag in easier to understand terms.
   Suggested by Alexei.

2) Make bpf_verify_program() just take a __u32 prog_flags instead of
   just accumulating boolean arguments over and over.  Also suggested
   by Alexei.

Changes since RFC:

1) Only the admin can allow the relaxation of alignment restrictions
   on inefficient unaligned access architectures.

2) Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS instead of making a new
   flag.

3) Annotate in the output, when we have a test case that the verifier
   accepted but we did not try to execute because we are on an
   inefficient unaligned access platform.  Maybe with some arch
   machinery we can avoid this in the future.

Signed-off-by: David S. Miller 
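
From a test harness point of view, the new flag is just another prog_flags
bit passed through the reworked libbpf helper; a minimal sketch of loading a
deliberately-unaligned REJECT test with it (log buffer size illustrative):

        char log[16 * 1024];
        int fd;

        fd = bpf_verify_program(BPF_PROG_TYPE_SCHED_CLS, insns, insn_cnt,
                                BPF_F_ANY_ALIGNMENT, "GPL", 0,
                                log, sizeof(log), 1);
        if (fd < 0)
                /* The log now shows the error the test actually targets,
                 * not a premature misalignment complaint. */
                fprintf(stderr, "verifier: %s\n", log);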


[PATCH v2 bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread David Miller


Often we want to write test cases that check things like bad context
offset accesses.  And one way to do this is to use an odd offset on,
for example, a 32-bit load.

This unfortunately triggers the alignment checks first on platforms
that do not set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.  So the test
case sees the alignment failure rather than what it was testing for.

It is often not completely possible to respect the original intention
of the test, or even test the same exact thing, while solving the
alignment issue.

Another option could have been to check the alignment after the
context and other validations are performed by the verifier, but
that is a non-trivial change to the verifier.

Signed-off-by: David S. Miller 
---
 include/uapi/linux/bpf.h| 14 ++
 kernel/bpf/syscall.c|  7 ++-
 kernel/bpf/verifier.c   |  2 ++
 tools/include/uapi/linux/bpf.h  | 14 ++
 tools/lib/bpf/bpf.c |  8 
 tools/lib/bpf/bpf.h |  2 +-
 tools/testing/selftests/bpf/test_align.c|  4 ++--
 tools/testing/selftests/bpf/test_verifier.c |  3 ++-
 8 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 426b5c8..6620c37 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -232,6 +232,20 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
+ * verifier will allow any alignment whatsoever.  On platforms
+ * with strict alignment requirements for loads ands stores (such
+ * as sparc and mips) the verifier validates that all loads and
+ * stores provably follow this requirement.  This flag turns that
+ * checking and enforcement off.
+ *
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the verifier, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT(1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cf5040f..cae65bb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1450,9 +1450,14 @@ static int bpf_prog_load(union bpf_attr *attr)
if (CHECK_ATTR(BPF_PROG_LOAD))
return -EINVAL;
 
-   if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
+   if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
return -EINVAL;
 
+   if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
+   (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
+   !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
/* copy eBPF program license from user space */
if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
  sizeof(license) - 1) < 0)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6dd4195..2fefeae 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6350,6 +6350,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr)
env->strict_alignment = !!(attr->prog_flags & BPF_F_STRICT_ALIGNMENT);
if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
env->strict_alignment = true;
+   if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
+   env->strict_alignment = false;
 
ret = replace_map_fd_with_map_ptr(env);
if (ret < 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 426b5c8..6620c37 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -232,6 +232,20 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
+ * verifier will allow any alignment whatsoever.  On platforms
+ * with strict alignment requirements for loads ands stores (such
+ * as sparc and mips) the verifier validates that all loads and
+ * stores provably follow this requirement.  This flag turns that
+ * checking and enforcement off.
+ *
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the verifier, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT(1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 03f9bcc..b9b8c5a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -231,9 +231,9 @@ int bpf_load_program(enum bpf_prog_type type, const struct 
bpf_insn *insns,
 }
 
 int bpf_verify_program(enum bpf_prog_type 

Re: [PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 30 Nov 2018 13:58:20 -0800

> On Thu, Nov 29, 2018 at 07:32:41PM -0800, David Miller wrote:
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 426b5c8..c9647ea 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -232,6 +232,16 @@ enum bpf_attach_type {
>>   */
>>  #define BPF_F_STRICT_ALIGNMENT  (1U << 0)
>>  
>> +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
>> + * verifier will allow any alignment whatsoever.  This bypasses
>> + * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
> 
> I think the majority of user space folks who read uapi/bpf.h have no idea
> what that kernel config does.
> Could you reword the comment here to say that this flag is only
> effective on architectures like sparc and mips that don't
> have efficient unaligned access, and ignored on x86/arm64?

I just want to point out in passing that your feedback applies also to
the comment above BPF_F_STRICT_ALIGNMENT, which I used as a model for
my comment.


Re: [PATCH net] tun: forbid iface creation with rtnl ops

2018-11-30 Thread David Miller
From: Nicolas Dichtel 
Date: Thu, 29 Nov 2018 14:45:39 +0100

> It's not supported right now (the goal of the initial patch was to support
> 'ip link del' only).
> 
> Before the patch:
> $ ip link add foo type tun
> [  239.632660] BUG: unable to handle kernel NULL pointer dereference at 
> 
> [snip]
> [  239.636410] RIP: 0010:register_netdevice+0x8e/0x3a0
> 
> This panic occurs because dev->netdev_ops is not set by tun_setup(). But to
> have something usable, it will require more than just setting
> netdev_ops.
> 
> Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX")
> CC: Eric W. Biederman 
> Signed-off-by: Nicolas Dichtel 

Super old bug, scary.

Applied, thanks.
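
One straightforward way to forbid creation on this path is to have the
rtnl_link_ops reject newlink up front; a sketch of that approach
(illustrative, the actual patch may differ in detail):

static int tun_validate(struct nlattr *tb[], struct nlattr *data[],
                        struct netlink_ext_ack *extack)
{
        NL_SET_ERR_MSG(extack,
                       "tun/tap creation via rtnetlink is not supported");
        return -EOPNOTSUPP;
}

static struct rtnl_link_ops tun_link_ops __read_mostly = {
        .kind           = DRV_NAME,
        .priv_size      = sizeof(struct tun_struct),
        .setup          = tun_setup,
        .validate       = tun_validate,
};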


Re: [PATCH net-next] net: qualcomm: rmnet: Remove set but not used variable 'cmd'

2018-11-30 Thread David Miller
From: YueHaibing 
Date: Thu, 29 Nov 2018 02:36:32 +

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c: In function 
> 'rmnet_map_do_flow_control':
> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:23:36: warning:
>  variable 'cmd' set but not used [-Wunused-but-set-variable]
>   struct rmnet_map_control_command *cmd;
> 
> 'cmd' is not used anymore now, so it should also be removed.
> 
> Signed-off-by: YueHaibing 

Applied.


Re: [PATCH net 0/3] fixes in timeout and retransmission accounting

2018-11-30 Thread David Miller
From: Yuchung Cheng 
Date: Wed, 28 Nov 2018 16:06:42 -0800

> This patch set has assorted fixes of minor accounting issues in
> timeout, window probe, and retransmission stats.

Looks good, series applied.

