Can NFS work with VRF?
Hello,

I was trying to improve my old series of patches that bind NFS to a particular source IP address so that it could work with VRF in a 4.16 kernel. But it seems a huge tangle to try to make NFS (and rpc, etc.) able to bind to a local netdevice, which I think is what would be needed to make it work with VRF.

Has anyone already worked on VRF support for NFS?

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
Anyone know if strongswan works with vrf?
Hello,

We're trying to create lots of strongswan VPN tunnels on network devices bound to different VRFs. We are using Fedora-24 on the client side, with a 4.16.15+ kernel and updated 'ip' package, etc. So far, no luck getting it to work.

Any idea if this is supported or not?

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/10/2018 10:10 AM, Michał Kazior wrote:
> Ben,
>
> The patch is symptomatic. fq_tin_dequeue() already checks if the list is empty before it tries to access the first entry. I see no point in using the _or_null() + WARN_ON. The 0x3c deref is likely an offset off of a NULL base pointer. Did you check gdb/addr2line of ieee80211_tx_dequeue+0xfb? Where did it point to?

gdb pointed to one line above the flow dereference, which is why I was going to put some debugging in there.

> I suspect there's not enough synchronization between quiescing the device/ath10k after fw crashes and performing mac80211's reconfig procedure.

I am already running this patch, which helps with some of that. That patch never made it upstream, but it fixed problems for me earlier:

https://patchwork.kernel.org/patch/9457639/

Could easily be there are some more issues in that logic. Someone else posted a patch to disable mac80211 tx when FW crashes, I think... I have not tried to backport that:

https://patchwork.kernel.org/patch/10411967/

Thanks,
Ben

> Michał
>
> On 8 June 2018 at 23:40, Arend van Spriel wrote:
>> On 6/8/2018 5:17 PM, Ben Greear wrote:
>>
>> I recalled an email from Michał leaving Tieto, so adding his alternate email he provided back then.
>>
>> Gr. AvS
>>
>>> On 06/07/2018 04:59 PM, Cong Wang wrote:
>>>> On Thu, Jun 7, 2018 at 4:48 PM,  wrote:
>>>>> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
>>>>> index be7c0fa..cb911f0 100644
>>>>> --- a/include/net/fq_impl.h
>>>>> +++ b/include/net/fq_impl.h
>>>>> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>>>>  		return NULL;
>>>>>  	}
>>>>>
>>>>> -	flow = list_first_entry(head, struct fq_flow, flowchain);
>>>>> +	flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
>>>>> +
>>>>> +	if (WARN_ON_ONCE(!flow))
>>>>> +		return NULL;
>>>>
>>>> This does not make sense either. list_first_entry_or_null() returns NULL only when the list is empty, but we already check list_empty() right before this code, and it is protected by fq->lock.
>>>
>>> Hello Michal,
>>>
>>> git blame shows you as the author of the fq_impl.h code. I saw a crash when debugging funky ath10k firmware in a 4.16 + hacks kernel. There was an apparent mostly-NULL deref in the fq_tin_dequeue method. According to gdb, it was within one line of the dereference of 'flow'. My hack above is probably not that useful. Cong thinks maybe the locking is bad.
>>>
>>> If you get a chance, please review this thread and see if you have any ideas for a better fix (or better debugging code). As always, if you would like me to generate you a buggy firmware that will crash in the tx path and cause all sorts of mayhem in the ath10k driver and wifi stack, I will be happy to do so.
>>>
>>> https://www.mail-archive.com/netdev@vger.kernel.org/msg239738.html
>>>
>>> Thanks,
>>> Ben

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 05:13 PM, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 4:48 PM,  wrote:
>> From: Ben Greear
>>
>> While testing an ath10k firmware that often crashed under load, I was seeing kernel crashes as well. One of them appeared to be a dereference of a NULL flow object in fq_tin_dequeue. I have since fixed the firmware flaw, but I think it would be worth adding the WARN_ON in case the problem appears again.
>>
>> BUG: unable to handle kernel NULL pointer dereference at 003c
>> IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
>
> Instead of adding WARN_ON(), you need to think about the locking there, it is suspicious:
>
> fq is from struct ieee80211_local:
>
>     struct fq *fq = &local->fq;
>
> tin is from struct txq_info:
>
>     struct fq_tin *tin = &txqi->tin;
>
> I don't know if fq and tin are supposed to be 1:1; if not, there is a bug in the locking, because ->new_flows and ->old_flows are both inside tin instead of fq, but they are protected by fq->lock.

Maybe whoever put this code together can take a stab at it.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 04:59 PM, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 4:48 PM,  wrote:
>> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
>> index be7c0fa..cb911f0 100644
>> --- a/include/net/fq_impl.h
>> +++ b/include/net/fq_impl.h
>> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>  		return NULL;
>>  	}
>>
>> -	flow = list_first_entry(head, struct fq_flow, flowchain);
>> +	flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
>> +
>> +	if (WARN_ON_ONCE(!flow))
>> +		return NULL;
>
> This does not make sense either. list_first_entry_or_null() returns NULL only when the list is empty, but we already check list_empty() right before this code, and it is protected by fq->lock.

Nevermind then.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 02:52 PM, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear wrote:
>> On 06/07/2018 02:29 PM, Cong Wang wrote:
>>> On Thu, Jun 7, 2018 at 9:06 AM,  wrote:
>>>> --- a/include/net/fq_impl.h
>>>> +++ b/include/net/fq_impl.h
>>>> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>>>  	flow = list_first_entry(head, struct fq_flow, flowchain);
>>>>
>>>> +	if (WARN_ON_ONCE(!flow))
>>>> +		return NULL;
>>>> +
>>>
>>> How could list_first_entry() even possibly return NULL? You need list_first_entry_or_null().
>>
>> I don't know for certain flow was null, but something was NULL in this method near that line and it looked like a likely culprit. I guess possibly tin or fq was passed in as NULL?
>
> A NULL pointer is not always 0. You can trigger a NULL-ptr-deref with 0x3c too, but you are checking against 0 in your patch; that is the problem, and that is why list_first_entry_or_null() exists.

Ahh, I see what you mean, and that is my mistake. In my case, it did seem to be a mostly-null deref, not a 0x0 deref.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
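Cong's 0x3c point is worth spelling out, since it explains why a plain NULL check on the computed pointer can never catch this class of crash. Below is a userspace sketch (the struct layout and names are hypothetical stand-ins, not the real fq_flow; only the container_of() arithmetic mirrors the kernel's list_first_entry()): the struct pointer is derived by subtracting the member offset from a node pointer, so a NULL-ish base faults at the member's offset (here 0x3c), while the pointer itself compares nonzero.

```c
#include <stddef.h>

/* Hypothetical layout: 'deficit' sits at offset 0x3c from the start of
 * the struct, mimicking the faulting address in the reported crash. */
struct fake_flow {
	char pad[0x3c];		/* stand-in for earlier members */
	int deficit;		/* reading this via a NULL base faults
				 * at address 0x3c, not at 0 */
	struct fake_list { struct fake_list *next, *prev; } flowchain;
};

/* container_of(), as used by the kernel's list_entry()/list_first_entry():
 * subtract the member offset from the list-node pointer. */
#define fake_list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static size_t deficit_offset(void)
{
	return offsetof(struct fake_flow, deficit);
}

/* A valid node pointer round-trips back to its containing struct. A
 * corrupted-but-nonzero node pointer would produce a nonzero struct
 * pointer too, so `if (!flow)` passes right before the bad field access. */
static int roundtrip_ok(void)
{
	struct fake_flow f;

	return fake_list_entry(&f.flowchain, struct fake_flow,
			       flowchain) == &f;
}
```

In other words, checking `flow` against 0 after the subtraction tells you nothing about whether the list node it came from was sane.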
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 02:29 PM, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 9:06 AM,  wrote:
>> --- a/include/net/fq_impl.h
>> +++ b/include/net/fq_impl.h
>> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>  	flow = list_first_entry(head, struct fq_flow, flowchain);
>>
>> +	if (WARN_ON_ONCE(!flow))
>> +		return NULL;
>> +
>
> How could list_first_entry() even possibly return NULL? You need list_first_entry_or_null().

I don't know for certain flow was null, but something was NULL in this method near that line and it looked like a likely culprit. I guess possibly tin or fq was passed in as NULL?

Anyway, if the patch seems worthless, just ignore it. I'll leave it in my tree since it should be harmless, and I will let you know if I ever hit it. If someone else hits a similar crash, hopefully they can report it.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 09:17 AM, Eric Dumazet wrote:
> On 06/07/2018 09:06 AM, gree...@candelatech.com wrote:
>> From: Ben Greear
>>
>> While testing an ath10k firmware that often crashed under load, I was seeing kernel crashes as well. One of them appeared to be a dereference of a NULL flow object in fq_tin_dequeue. I have since fixed the firmware flaw, but I think it would be worth adding the WARN_ON in case the problem appears again.
>>
>>  common_interrupt+0xf/0xf
>
> Please find the exact commit that brought this bug, and add a corresponding Fixes: tag.

It will be a total pain to bisect this problem since my test case that causes this is running my modified firmware (and a buggy one at that), a modified ath10k driver (to work with this firmware and support my test case easily), and the failure case appears to cause multiple different-but-probably-related crashes and often hangs or reboots the test system.

Probably this is all caused by some nasty race or buggy logic related to dealing with a crashed ath10k firmware tearing down txq logic from the bottom up. There have been many such bugs in the past; I and others fixed a few, and very likely more remain. For what it is worth, I didn't see this crash in 4.13, and I spent some time testing buggy firmware there occasionally.

If someone else has interest in debugging the ath10k driver, I will be happy to generate a mostly-stock firmware image with the ability to crash in the TX path and give it to them. It will crash the stock upstream code reliably, in my experience.

Thanks,
Ben

>> Signed-off-by: Ben Greear
>> ---
>>  include/net/fq_impl.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
>> index be7c0fa..e40354d 100644
>> --- a/include/net/fq_impl.h
>> +++ b/include/net/fq_impl.h
>> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>  	flow = list_first_entry(head, struct fq_flow, flowchain);
>>
>> +	if (WARN_ON_ONCE(!flow))
>> +		return NULL;
>> +
>>  	if (flow->deficit <= 0) {
>>  		flow->deficit += fq->quantum;
>>  		list_move_tail(&flow->flowchain,

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
Regression bisected to: softirq: Let ksoftirqd do its job
One of my out-of-tree patches is a network impairment tool that acts a lot like an Ethernet bridge with latency, jitter, etc. We noticed recently that we were seeing igb adapter errors when testing with our emulator at high speeds. For whatever reason, it is only easily reproduced when we add jitter to our emulator. This would cause a bit more CPU usage and lock contention in our software, and would increase the skb pkts allocated at any given time.

I bisected the problem to the commit below:

  Author: Eric Dumazet <eduma...@google.com>
  Date:   Wed Aug 31 10:42:29 2016 -0700

      softirq: Let ksoftirqd do its job

      A while back, Paolo and Hannes sent an RFC patch adding threaded-able
      napi poll loop support: (https://patchwork.ozlabs.org/patch/620657/)

If I replace my emulator with a bridge, then I do not see the problem. But I also do not (or very rarely?) see the problem when configuring the emulator with zero latency and jitter, which is how the bridge would act.

Any idea what sort of (bad?) behaviour would be able to cause this tx queue timeout? If you have any interest, I will be happy to email you my out-of-tree patches and instructions to reproduce the problem.

The kernel splat looks like this, and repeats often:

May 17 16:03:09 localhost.localdomain kernel: audit: type=1131 audit(1526598189.492:159): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 17 16:03:39 localhost.localdomain kernel: [ cut here ]
May 17 16:03:39 localhost.localdomain kernel: WARNING: CPU: 5 PID: 0 at /home/greearb/git/linux-bisect/net/sched/sch_generic.c:316 dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel: NETDEV WATCHDOG: eth5 (igb): transmit queue 0 timed out
May 17 16:03:39 localhost.localdomain kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 fuse macvlan wanlink(O) pktgen cfg80211 sunrpc coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass ipmi_ssif iTCO_wdt iTCO_vendor_support joydev i2c_i801 lpc_ich i2c_smbus ioatdma shpchp wmi ipmi_si ipmi_msghandler tpm_tis tpm_tis_core tpm acpi_power_meter acpi_pad sch_fq_codel ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core fjes ipv6 crc_ccitt [last unloaded: nf_conntrack]
May 17 16:03:39 localhost.localdomain kernel: CPU: 5 PID: 0 Comm: swapper/5 Tainted: G O 4.8.0-rc7+ #132
May 17 16:03:39 localhost.localdomain kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
May 17 16:03:39 localhost.localdomain kernel: 88087fd43d78 81417eb1 88087fd43dc8
May 17 16:03:39 localhost.localdomain kernel: 88087fd43db8 81103556 013c7fd43da8
May 17 16:03:39 localhost.localdomain kernel: 880854221940 0005 880854bb8000
May 17 16:03:39 localhost.localdomain kernel: Call Trace:
May 17 16:03:39 localhost.localdomain kernel:  [] dump_stack+0x63/0x82
May 17 16:03:39 localhost.localdomain kernel:  [] __warn+0xc6/0xe0
May 17 16:03:39 localhost.localdomain kernel:  [] warn_slowpath_fmt+0x4a/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel:  [] ? qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] call_timer_fn+0x30/0x150
May 17 16:03:39 localhost.localdomain kernel:  [] ? qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] run_timer_softirq+0x1ea/0x450
May 17 16:03:39 localhost.localdomain kernel:  [] ? ktime_get+0x37/0xa0
May 17 16:03:39 localhost.localdomain kernel:  [] ? lapic_next_deadline+0x21/0x30
May 17 16:03:39 localhost.localdomain kernel:  [] ? clockevents_program_event+0x7d/0x120
May 17 16:03:39 localhost.localdomain kernel:  [] __do_softirq+0xca/0x2d0
May 17 16:03:39 localhost.localdomain kernel:  [] irq_exit+0xb3/0xc0
May 17 16:03:39 localhost.localdomain kernel:  [] smp_apic_timer_interrupt+0x3d/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] apic_timer_interrupt+0x82/0x90
May 17 16:03:39 localhost.localdomain kernel:  [] ? cpuidle_enter_state+0x126/0x300
May 17 16:03:39 localhost.localdomain kernel:  [] cpuidle_enter+0x12/0x20
May 17 16:03:39 localhost.localdomain kernel:  [] call_cpuidle+0x25/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] cpu_startup_entry+0x2ba/0x380
May 17 16:03:39 localhost.localdomain kernel:  [] start_secondary+0x149/0x170
May 17 16:03:39 localhost.localdomain kernel: ---[ end trace f62c6dd947785e8f ]---

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
Re: Performance regression between 4.13 and 4.14
On 05/09/2018 12:02 PM, Ben Greear wrote:
> On 05/09/2018 11:48 AM, Eric Dumazet wrote:
>> On 05/09/2018 11:43 AM, Ben Greear wrote:
>>> On 05/08/2018 10:10 AM, Eric Dumazet wrote:
>>>> On 05/08/2018 09:44 AM, Ben Greear wrote:
>>>>> Hello,
>>>>>
>>>>> I am trying to track down a performance regression that appears to be between 4.13 and 4.14. I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx on 4.14 and about 4.4Gbps on 4.13.
>>>>>
>>>>> Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits crash or do not bring up networking, which makes the bisect difficult. Looking at perf top, it would appear that some lock is probably to blame.
>>>>
>>>> perf record -a -g -e cycles:pp sleep 5
>>>> perf report
>>>>
>>>> Then you'll be able to tell us which lock (or call graph) is killing your perf.
>>>
>>> I seem to be chasing multiple issues. For 4.13, at least part of my problem was that LOCKDEP was enabled during my bisect, though it does NOT appear enabled in 4.16. I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING in 4.16, or something like that? My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:
>>>
>>> [greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
>>> CONFIG_LOCKDEP_SUPPORT=y
>>>
>>> For 4.16, I am disabling RETPOLINE... are there any other such things I need to disable to keep from getting a performance hit from the spectre-related bug fixes? At this point, I do not care about the security implications.
>>>
>>> [greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
>>> # CONFIG_RETPOLINE is not set
>>
>> No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/
>
> I initially saw the problem in 4.16, then bisected, and 4.14 still showed the issue. So, I guess I must have been enabling lockdep the whole time.

This __lock_acquire is from lockdep as far as I can tell, not normal locking. I re-built 4.16 after verifying as best as I could that lockdep was not enabled, and now it performs as expected. I'm going to test a patch to change __lock_acquire to __lock_acquire_lockdep so maybe someone else will not make the same mistake I made.

+   17.78%  17.78%  kpktgend_1  [kernel.kallsyms]  [k] __lock_acquire.isra.3

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
Re: Performance regression between 4.13 and 4.14
On 05/09/2018 11:48 AM, Eric Dumazet wrote:
> On 05/09/2018 11:43 AM, Ben Greear wrote:
>> On 05/08/2018 10:10 AM, Eric Dumazet wrote:
>>> On 05/08/2018 09:44 AM, Ben Greear wrote:
>>>> Hello,
>>>>
>>>> I am trying to track down a performance regression that appears to be between 4.13 and 4.14. I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx on 4.14 and about 4.4Gbps on 4.13.
>>>>
>>>> Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits crash or do not bring up networking, which makes the bisect difficult. Looking at perf top, it would appear that some lock is probably to blame.
>>>
>>> perf record -a -g -e cycles:pp sleep 5
>>> perf report
>>>
>>> Then you'll be able to tell us which lock (or call graph) is killing your perf.
>>
>> I seem to be chasing multiple issues. For 4.13, at least part of my problem was that LOCKDEP was enabled during my bisect, though it does NOT appear enabled in 4.16. I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING in 4.16, or something like that? My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:
>>
>> [greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
>> CONFIG_LOCKDEP_SUPPORT=y
>>
>> For 4.16, I am disabling RETPOLINE... are there any other such things I need to disable to keep from getting a performance hit from the spectre-related bug fixes? At this point, I do not care about the security implications.
>>
>> [greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
>> # CONFIG_RETPOLINE is not set
>
> No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/

I initially saw the problem in 4.16, then bisected, and 4.14 still showed the issue. 4.13 works, but only when I use a .config I originally built for 4.13, not the 4.16 .config that I ended up using with the bisect (make oldconfig, accept all defaults). I originally configured 4.16 with a .config that had lockdep enabled, then manually tried to disable it through 'make xconfig'. I think that must leave "CONFIG_LOCKDEP=y" in the .config, which screws up older builds during bisect, perhaps?

> Before doing a (painful) bisection, the perf output would immediately tell you if something is really wrong in your .config.

I didn't realize lockdep might be an issue at the time, but here is a 'bad' run from a 4.13+ kernel (plus pktgen hacks). I guess lockdep is why this runs slowly, but I see no obvious proof of that in the output:

4.13+, patched pktgen, 6Gbps throughput, on commit 906dde0f355bd97c080c215811ae7db1137c4af8

Samples: 26K of event 'cycles:pp', Event count (approx.): 20119166736
  Children    Self  Command     Shared Object      Symbol
+   87.97%   0.00%  kpktgend_1  [kernel.kallsyms]  [k] ret_from_fork
+   87.97%   0.00%  kpktgend_1  [kernel.kallsyms]  [k] kthread
+   86.89%   5.42%  kpktgend_1  [kernel.kallsyms]  [k] pktgen_thread_worker
+   33.75%   0.18%  kpktgend_1  [kernel.kallsyms]  [k] getnstimeofday64
+   32.77%   4.47%  kpktgend_1  [kernel.kallsyms]  [k] __getnstimeofday64
+   24.60%  10.91%  kpktgend_1  [kernel.kallsyms]  [k] lock_acquire
+   23.59%   0.03%  kpktgend_1  [kernel.kallsyms]  [k] __do_softirq
+   23.55%   0.07%  kpktgend_1  [kernel.kallsyms]  [k] net_rx_action
+   22.29%   0.47%  kpktgend_1  [kernel.kallsyms]  [k] getRelativeCurNs
+   21.33%   1.71%  kpktgend_1  [kernel.kallsyms]  [k] ixgbe_poll
+   15.79%   0.02%  kpktgend_1  [kernel.kallsyms]  [k] ret_from_intr
+   15.78%   0.01%  kpktgend_1  [kernel.kallsyms]  [k] do_IRQ
+   15.34%   0.01%  kpktgend_1  [kernel.kallsyms]  [k] irq_exit
+   13.95%  10.00%  kpktgend_1  [kernel.kallsyms]  [k] ip_send_check
+   13.80%  13.80%  kpktgend_1  [kernel.kallsyms]  [k] __lock_acquire.isra.31
+   12.98%   0.53%  kpktgend_1  [kernel.kallsyms]  [k] pktgen_finalize_skb
+   12.31%   0.20%  kpktgend_1  [kernel.kallsyms]  [k] timestamp_skb.isra.24
+   11.68%   0.13%  kpktgend_1  [kernel.kallsyms]  [k] napi_gro_receive
+   11.36%   0.25%  kpktgend_1  [kernel.kallsyms]  [k] netif_receive_skb_internal
+   10.93%   0.00%  swapper     [kernel.kallsyms]  [k] verify_cpu
+   10.93%   0.00%  swapper     [kernel.kallsyms]  [k] cpu_startup_entry
+   10.92%   0.02%  swapper     [kernel.kallsyms]  [k] do_idle
+   10.71%   0.00%  swapper     [kernel.kallsyms]  [k] cpuidle_enter
+   10.71%   0.00%  swapper     [ke
Re: Performance regression between 4.13 and 4.14
On 05/08/2018 10:10 AM, Eric Dumazet wrote:
> On 05/08/2018 09:44 AM, Ben Greear wrote:
>> Hello,
>>
>> I am trying to track down a performance regression that appears to be between 4.13 and 4.14. I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx on 4.14 and about 4.4Gbps on 4.13.
>>
>> Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits crash or do not bring up networking, which makes the bisect difficult. Looking at perf top, it would appear that some lock is probably to blame.
>
> perf record -a -g -e cycles:pp sleep 5
> perf report
>
> Then you'll be able to tell us which lock (or call graph) is killing your perf.

I seem to be chasing multiple issues. For 4.13, at least part of my problem was that LOCKDEP was enabled during my bisect, though it does NOT appear enabled in 4.16. I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING in 4.16, or something like that? My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:

[greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y

For 4.16, I am disabling RETPOLINE... are there any other such things I need to disable to keep from getting a performance hit from the spectre-related bug fixes? At this point, I do not care about the security implications.

[greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
# CONFIG_RETPOLINE is not set

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
ICMP redirect and VRF
While debugging some other problem today on a system using ip rules instead of VRF, I ran into a case where the remote router was sending back ICMP redirects. That got me thinking... where would these routes get stored in a VRF scenario? Would it magically go to the correct VRF routing table based on the incoming interface for the ICMP redirect response?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
Performance regression between 4.13 and 4.14
Hello,

I am trying to track down a performance regression that appears to be between 4.13 and 4.14. I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits crash or do not bring up networking, which makes the bisect difficult. Looking at perf top, it would appear that some lock is probably to blame.

Any ideas what might have been introduced during this interval that would cause this? Anyone else seen similar? I'm going to attempt some more manual steps to try to find the commit that introduces this...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
Re: The SO_BINDTODEVICE was set to the desired interface, but packets are received from all interfaces.
On 05/07/2018 03:19 AM, Damir Mansurov wrote:
> Greetings,
>
> After a successful call of the setsockopt(SO_BINDTODEVICE) function to set data reception from only one interface, the data is still received from all interfaces. setsockopt() returns 0, but then recv() receives data from all available network interfaces. The problem is reproducible on linux kernels 4.14 - 4.16, but it is not on linux kernels 4.4, 4.13. I have written C code to reproduce this issue (see attached files b2d_send.c and b2d_recv.c). See below for an explanation of the tested configuration.

Hello,

I am not sure if this is your problem or not, but if you are using VRF, then you need to call SO_BINDTODEVICE before you do the 'normal' bind() call.

Thanks,
Ben

>      PC-1                   PC-2
>  ------------           ------------
>  | b2d_send  |          | b2d_recv  |
>  |           |          |           |
>  |  [ eth0 ]-|----------|-[ eth0 ]  |
>  |           |          |           |
>  |  [ eth1 ]-|----------|-[ eth1 ]  |
>  ------------           ------------
>
> Steps:
> 1. Copy b2d_recv.c to PC-2, compile it ("gcc -o b2d_recv b2d_recv.c") and run "./b2d_recv eth0 23777" to receive data only from the eth0 interface. The port number 23777 in this example is only a sample.
> 2. Copy b2d_send.c to PC-1, compile it ("gcc -o b2d_send b2d_send.c") and run "./b2d_send ip1 ip2 23777" where ip1 and ip2 are the IP addresses of interfaces eth0 and eth1 of PC-2.
> 3. Result:
>    - b2d_recv prints out data from eth0 and eth1 on linux kernels from 4.14 up to 4.16.
>    - b2d_recv prints out data from only eth0 on linux kernels below 4.14.
>
> Thanks,
> Damir Mansurov
> dn...@oktetlabs.ru

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
Re: [PATCH] net: Work around crash in ipv6 fib-walk-continue
On 05/04/2018 10:47 AM, David Ahern wrote:
> On 4/19/18 12:01 PM, gree...@candelatech.com wrote:
>> From: Ben Greear <gree...@candelatech.com>
>>
>> This keeps us from crashing in certain test cases where we bring up many (1000, for instance) mac-vlans with IPv6 enabled in the kernel. This bug has been around for a very long time. Until a real fix is found (and for stable), maybe it is better to return an incomplete fib walk instead of crashing.
>>
>> BUG: unable to handle kernel NULL pointer dereference at 8
>> IP: fib6_walk_continue+0x5b/0x140 [ipv6]
>> PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0
>> Oops: [#1] PREEMPT SMP PTI
>> Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf]
>> CPU: 3 PID: 15117 Comm: ip Tainted: G O 4.16.0+ #5
>> Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
>> RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
>> RSP: 0018:c90008c3bc10 EFLAGS: 00010287
>> RAX: 88085ac45050 RBX: 8807e03008a0 RCX:
>> RDX:  RSI: c90008c3bc48 RDI: 8232b240
>> RBP: 880819167600 R08: 0008 R09: 8807dff10071
>> R10: c90008c3bbd0 R11:  R12: 8807e03008a0
>> R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000
>> FS: 7f2f04342700() GS:88087fcc() knlGS:
>> CS: 0010 DS: ES: CR0: 80050033
>> CR2: 0008 CR3: 0007e0556002 CR4: 003606e0
>> DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400
>> Call Trace:
>>  inet6_dump_fib+0x14b/0x2c0 [ipv6]
>>  netlink_dump+0x216/0x2a0
>>  netlink_recvmsg+0x254/0x400
>>  ? copy_msghdr_from_user+0xb5/0x110
>>  ___sys_recvmsg+0xe9/0x230
>>  ? find_held_lock+0x3b/0xb0
>>  ? __handle_mm_fault+0x617/0x1180
>>  ? __audit_syscall_entry+0xb3/0x110
>>  ? __sys_recvmsg+0x39/0x70
>>  __sys_recvmsg+0x39/0x70
>>  do_syscall_64+0x63/0x120
>>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> RIP: 0033:0x7f2f03a72030
>> RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f
>> RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030
>> RDX:  RSI: 7fffab3de570 RDI: 0004
>> RBP:  R08: 7e6c R09: 7fffab3e63a8
>> R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608
>> R13: 0066b460 R14: 7e6c R15:
>> Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 89 53 2c c7 4
>> RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10
>> CR2: 0008
>> ---[ end trace bd03458864eb266c ]---
>>
>> Signed-off-by: Ben Greear <gree...@candelatech.com>
>> ---
>
> Does your use case that triggers this involve replacing routes?
>
> I just noticed the route delete code in fib6_add_rt2node does not have the 'Adjust walkers' code that is in fib6_del_route. Further, the adjust walkers code in fib6_del_route looks suspicious in its timing with route deletes.
>
> If you have a reliable reproducer, we can try a few things with fib6_del_route and the walker code.

Yes, we replace routes, and yes, we can reliably reproduce it and will be happy to test patches.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)
On 04/27/2018 08:11 PM, Steven Rostedt wrote:
> We'd like this email archived in the netdev list, but since netdev is notorious for blocking outlook email as spam, it didn't go through. So I'm replying here to help get it into the archives.
>
> Thanks!
>
> -- Steve
>
> On Fri, 27 Apr 2018 23:05:46 +0000, Michael Wenig <mwe...@vmware.com> wrote:
>> As part of VMware's performance testing with the Linux 4.15 kernel, we identified CPU cost and throughput regressions when comparing to the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM send tests when using small message sizes. The regressions are significant (up to 3x) and were tracked down to be a side effect of Eric Dumazet's RB tree changes that went into the Linux 4.15 kernel. Further investigation showed our use of the TCP_NODELAY flag in conjunction with Eric's change caused the regressions to show, and simply disabling TCP_NODELAY brought performance back to normal. Eric's change also resulted in significant improvements in our TCP_RR test cases.
>>
>> Based on these results, our theory is that Eric's change made the system overall faster (reduced latency), but as a side effect less aggregation is happening (with TCP_NODELAY) and that results in lower throughput. Previously, even though TCP_NODELAY was set, the system was slower and we still got some benefit of aggregation. Aggregation helps in better efficiency and higher throughput, although it can increase latency.
>>
>> If you are seeing a regression in your application throughput after this change, using TCP_NODELAY might help bring performance back; however, that might increase latency.

I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
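For anyone wanting to experiment with the knob under discussion, it is just a per-socket option; a minimal sketch (the socket here is deliberately unconnected, only to show the set/get round-trip):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set TCP_NODELAY and return the value read back (0 or 1), or -1 on
 * error. on=1 disables Nagle aggregation (lower latency); on=0 leaves
 * Nagle enabled, which is the workaround suggested in the thread for
 * the small-message TCP_STREAM throughput regression. */
static int set_nodelay(int fd, int on)
{
	socklen_t len = sizeof(on);

	if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0)
		return -1;
	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, &len) < 0)
		return -1;
	return on != 0;
}

/* Quick self-check on a fresh TCP socket: toggle the option both ways. */
static int nodelay_demo(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int ok;

	if (fd < 0)
		return -1;
	ok = set_nodelay(fd, 1) == 1 && set_nodelay(fd, 0) == 0;
	close(fd);
	return ok ? 0 : -1;
}
```

Applications that never set TCP_NODELAY already get Nagle's aggregation by default; the regression described above only bites senders that explicitly enabled it.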
Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
On 04/22/2018 02:15 PM, Roopa Prabhu wrote:
> On Sun, Apr 22, 2018 at 11:54 AM, David Miller <da...@davemloft.net> wrote:
>> From: Johannes Berg <johan...@sipsolutions.net>
>> Date: Thu, 19 Apr 2018 17:26:57 +0200
>>
>>> On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:
>>>> Maybe this could be in followup patches? It's going to touch a lot of files, and might be hell to get merged all at once, and I've never used spatch, so just maybe someone else will volunteer that part :)
>>>
>>> I guess you'll have to ask davem. :)
>>
>> Well, first of all, I really don't like this.
>>
>> The first reason is that every time I see interface foo become foo2, foo3 is never far behind it. If foo was not extensible enough such that we needed foo2, we better design the new thing with explicitly better extensibility in mind.
>>
>> Furthermore, what you want here is a specific filter. Someone else will want to filter on another criteria, and the next person will want yet another. This needs to be properly generalized.
>>
>> And frankly, if we had moved to ethtool netlink/devlink by now, we could just add a netlink attribute for filtering and not even be having this conversation.
>
> +1. Also, the RTM_GETSTATS api was added to improve stats query efficiency (with filters). We should look at it to see if this fits there. Keeping all stats queries in one place will help.

I like the ethtool API, so I'll be sticking with that for now.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc
http://www.candelatech.com
Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
On 04/22/2018 11:54 AM, David Miller wrote: From: Johannes Berg <johan...@sipsolutions.net> Date: Thu, 19 Apr 2018 17:26:57 +0200 On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote: Maybe this could be in followup patches? It's going to touch a lot of files, and might be hell to get merged all at once, and I've never used spatch, so just maybe someone else will volunteer that part :) I guess you'll have to ask davem. :) Well, first of all, I really don't like this. The first reason is that every time I see interface foo become foo2, foo3 is never far behind it. If foo was not extensible enough such that we needed foo2, we better design the new thing with explicitly better extensibility in mind. Furthermore, what you want here is a specific filter. Someone else will want to filter on another criterion, and the next person will want yet another. This needs to be properly generalized. And frankly if we had moved to ethtool netlink/devlink by now, we could just add a netlink attribute for filtering and not even be having this conversation. Well, since there are un-defined flags, it would be simple enough to extend the API further in the future (flag (1<<31) could mean expect more input members, etc.). And, adding up to 30 more flags to filter on different things won't change the API and should be backwards compatible. But, if you don't want it, that is OK by me; I agree it is a fairly obscure feature. It would have saved me time if you had said you didn't want it at the first RFC patch, though... Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
On 04/18/2018 11:38 PM, Johannes Berg wrote: On Wed, 2018-04-18 at 14:51 -0700, Ben Greear wrote: It'd be pretty hard to know which flags are firmware stats? Yes, it is, but ethtool stats are difficult to understand in a generic manner anyway, so someone using them is already likely aware of low-level details of the driver(s) they are using. Right. Come to think of it though, + * @get_ethtool_stats2: Return extended statistics about the device. + * This is only useful if the device maintains statistics not + * included in rtnl_link_stats64. + * Takes a flags argument: 0 means all (same as get_ethtool_stats), + * 0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats. + * Other flags are reserved for now. + * Same number of stats will be returned, but some of them might + * not be as accurate/refreshed. This is to allow not querying + * firmware or other expensive-to-read stats, for instance. "skip" vs. "don't refresh" is a bit ambiguous - I'd argue better to either really skip and not return the non-refreshed ones (also helps with the identifying), or rename the flag. In order to efficiently parse lots of stats over and over again, I probe the stat names once on startup, map them to the variable I am trying to use (since different drivers may have different names for the same basic stat), and then I store the stat index. On subsequent stat reads, I just grab stats and go right to the index to store the stat. If the stats indexes change, that will complicate my logic quite a bit. Maybe the flag could be called: ETHTOOL_GS2_NO_REFRESH_FW ? Also, wrt. the rest of the patch, I'd argue that it'd be worthwhile to write the spatch and just add the flags argument to "get_ethtool_stats" instead of adding a separate method - internally to the kernel it's not that hard to change. Maybe this could be in followup patches? 
It's going to touch a lot of files, and might be hell to get merged all at once, and I've never used spatch, so just maybe someone else will volunteer that part :) Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
On 04/18/2018 02:26 PM, Johannes Berg wrote: On Tue, 2018-04-17 at 18:49 -0700, gree...@candelatech.com wrote: + * @get_ethtool_stats2: Return extended statistics about the device. + * This is only useful if the device maintains statistics not + * included in rtnl_link_stats64. + * Takes a flags argument: 0 means all (same as get_ethtool_stats), + * 0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats. + * Other flags are reserved for now. It'd be pretty hard to know which flags are firmware stats? Yes, it is, but ethtool stats are difficult to understand in a generic manner anyway, so someone using them is already likely aware of low-level details of the driver(s) they are using. In my case, I have lots of virtual stations (or APs), and I want stats for them as well as for the 'radio', so I would probe the first vdev with flags of 'skip-none' to get all stats, including radio (firmware) stats. And then the rest I would just probe the non-firmware stats. To be honest, I was slightly amused that anyone expressed interest in this patch originally, but maybe other people have similar use case and/or drivers with slow-to-acquire stats. Anyway, there's no way I'm going to take this patch, so you need to float it on netdev first (best CC us here) and get it applied there before we can do anything on the wifi side. I posted the patches to netdev, ath10k and linux-wireless. If I had only posted them individually to different lists I figure I'd be hearing about how the netdev patch is useless because it has no driver support, etc. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 01/24/2018 03:59 PM, Ben Greear wrote: On 06/20/2017 08:03 PM, David Ahern wrote: On 6/20/17 5:41 PM, Ben Greear wrote: On 06/20/2017 11:05 AM, Michal Kubecek wrote: On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote: On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. You might try trace_printk() which should have less impact (don't forget to enable /proc/sys/kernel/ftrace_dump_on_oops). We cannot reproduce with trace_printk() either. I think that suggests the walker state is set to FWS_U in fib6_del_route, and it is the FWS_U case in fib6_walk_continue that triggers the fault -- the null parent (pn = fn->parent). So we have the 2 areas of code that are interacting. I'm on a road trip through the end of this week with little time to focus on this problem. I'll get back to you another suggestion when I can. FYI, problem still happens in 4.16. I'm going to re-enable my hack below for this kernel as well...I had hopes it might be fixed... 
BUG: unable to handle kernel NULL pointer dereference at 8
IP: fib6_walk_continue+0x5b/0x140 [ipv6]
PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0
Oops: [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf]
CPU: 3 PID: 15117 Comm: ip Tainted: G O 4.16.0+ #5
Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
RSP: 0018:c90008c3bc10 EFLAGS: 00010287
RAX: 88085ac45050 RBX: 8807e03008a0 RCX:
RDX: RSI: c90008c3bc48 RDI: 8232b240
RBP: 880819167600 R08: 0008 R09: 8807dff10071
R10: c90008c3bbd0 R11: R12: 8807e03008a0
R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000
FS: 7f2f04342700() GS:88087fcc() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 0008 CR3: 0007e0556002 CR4: 003606e0
DR0: DR1: DR2:
DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x14b/0x2c0 [ipv6]
 netlink_dump+0x216/0x2a0
 netlink_recvmsg+0x254/0x400
 ? copy_msghdr_from_user+0xb5/0x110
 ___sys_recvmsg+0xe9/0x230
 ? find_held_lock+0x3b/0xb0
 ? __handle_mm_fault+0x617/0x1180
 ? __audit_syscall_entry+0xb3/0x110
 ? __sys_recvmsg+0x39/0x70
 __sys_recvmsg+0x39/0x70
 do_syscall_64+0x63/0x120
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f2f03a72030
RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030
RDX: RSI: 7fffab3de570 RDI: 0004
RBP: R08: 7e6c R09: 7fffab3e63a8
R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608
R13: 0066b460 R14: 7e6c R15:
Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 89 53 2c c7 4
RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10
CR2: 0008
---[ end trace bd03458864eb266c ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds.. ACPI MEMORY or I/O RESET_REG.

So, though I don't know the right way to fix it, the patch below appears to make the system not crash.
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
 			pn = fn->parent;
 			w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES
+			if (WARN_ON_ONCE(!pn)) {
+				pr_err("FWS-U, w: %p fn: %p pn: %p\n",
+				       w, fn, pn);
+				/* Attempt to work around crash that has been here forever. --Ben */
+				return 0;
+			}
 			if (FIB6_SUBTREE(pn) == fn) {
 				WARN_ON(!(fn->fn_flags & RTN_ROOT));
 				w->state = FWS_L;

The printout looks like this:
Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.
On 03/20/2018 11:24 AM, Michal Kubecek wrote: On Tue, Mar 20, 2018 at 08:39:33AM -0700, Ben Greear wrote: On 03/20/2018 03:37 AM, Michal Kubecek wrote: IMHO it would be more practical to set "0 means same as GSTATS" as a rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to avoid code duplication (or perhaps use a fall-through in the switch). It would also allow drivers to provide only one of the callbacks. Yes, but that would require changing all drivers at once, and would make backporting and out-of-tree drivers harder to manage. I had low hopes that this feature would make it upstream, so I didn't want to propose any large changes up front. I don't think so. What I mean is: (a) driver implements ->get_ethtool_stats2() callback; then we use it for GSTATS2 (b) driver does not implement get_ethtool_stats2() but implements ->get_ethtool_stats(); then we call it for GSTATS2 if level is zero, otherwise GSTATS2 returns -EINVAL and GSTATS is always translated to GSTATS2 with level 0, either by defining ethtool_get_stats() as a wrapper or by fall-through in the switch statement. This way, most drivers could be left untouched and only those which would implement non-default levels would provide the ->get_ethtool_stats2() callback instead of ->get_ethtool_stats(). OK, that makes sense. I'll wait on feedback about the flags or #defined levels and re-spin the patch accordingly. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns
On 03/20/2018 09:44 AM, Liran Alon wrote: On 20/03/18 18:24, ebied...@xmission.com wrote: I don't believe the current behavior is a bug. I looked through the history. Basically skb_scrub_packet started out as the scrubbing needed for crossing network namespaces. Then tunnels which needed 90% of the functionality started calling it, with the xnet flag added. Because the tunnels needed to preserve their historic behavior. Then dev_forward_skb started calling skb_scrub_packet. A veth pair is supposed to give the same behavior as a cross-over cable plugged into two local nics. A cross over cable won't preserve things like the skb mark. So I don't see why anyone would expect a veth pair to preserve the mark. I disagree with this argument. I think that an skb crossing netns is what simulates a real packet crossing physical computers. Following your argument, why should skb->mark be preserved when crossing netdevs in the same netns via routing? But today that does preserve skb->mark. Therefore, I do think that skb->mark should conceptually only be scrubbed when crossing netns, regardless of the netdev used to cross it. It should be scrubbed in VETH as well. That is one way to make virtual routers. Possibly the newer VRF features will give another better way to do it, but you should not break things that used to work. Now, if you want to add a new feature that allows one to configure the kernel (or VETH) for a new behavior, then that might be something to consider. Right now I don't see the point of handling packets that don't cross network namespace boundaries specially, other than to preserve backwards compatibility. Well, backwards compat is a big deal all by itself! Thanks, Ben Eric -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.
On 03/20/2018 09:11 AM, Steve deRosier wrote: On Tue, Mar 20, 2018 at 8:39 AM, Ben Greear <gree...@candelatech.com> wrote: On 03/20/2018 03:37 AM, Michal Kubecek wrote: On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote: From: Ben Greear <gree...@candelatech.com> This is similar to ETHTOOL_GSTATS, but it allows you to specify a 'level'. This level can be used by the driver to decrease the amount of stats refreshed. In particular, this helps with ath10k since getting the firmware stats can be slow. Signed-off-by: Ben Greear <gree...@candelatech.com> --- NOTE: I know to make it upstream I would need to split the patch and remove the #define for 'backporting' that I added. But, is the feature in general wanted? If so, I'll do the patch split and other tweaks that might be suggested. Yes, but that would require changing all drivers at once, and would make backporting and out-of-tree drivers harder to manage. I had low hopes that this feature would make it upstream, so I didn't want to propose any large changes up front. Hi Ben, I find the feature OK, but I'm not thrilled with the arbitrary scale of "level". Maybe there could be some named values, either on a spectrum as level already is, similar to the kernel log DEBUG, WARN, INFO type levels, or named bit flags like the way the ath drivers do their debug flags for granular results. Thoughts? Yes, that would be easier to code too. If there are any other drivers out there that might take advantage of this, maybe they could chime in with what levels and/or bit-fields they would like to see. For instance, a bit that says 'refresh-stats-from-firmware' would be great for ath10k, but maybe useless for everyone else. Thanks, Ben - Steve -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.
On 03/20/2018 03:37 AM, Michal Kubecek wrote: On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote: From: Ben Greear <gree...@candelatech.com> This is similar to ETHTOOL_GSTATS, but it allows you to specify a 'level'. This level can be used by the driver to decrease the amount of stats refreshed. In particular, this helps with ath10k since getting the firmware stats can be slow. Signed-off-by: Ben Greear <gree...@candelatech.com> --- NOTE: I know to make it upstream I would need to split the patch and remove the #define for 'backporting' that I added. But, is the feature in general wanted? If so, I'll do the patch split and other tweaks that might be suggested. I'm not familiar enough with the technical background of stats collecting to comment on usefulness and desirability of this feature. Adding a new command just to add a numeric parameter certainly doesn't feel right, but it's how the ioctl interface works. I take it as a reminder to find some time to get back to the netlink interface.

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 674b6c9..d3b709f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1947,6 +1947,54 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
 	return ret;
 }

+static int ethtool_get_stats2(struct net_device *dev, void __user *useraddr)
+{
+	struct ethtool_stats stats;
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+	u64 *data;
+	int ret, n_stats;
+	u32 stats_level = 0;
+
+	if (!ops->get_ethtool_stats2 || !ops->get_sset_count)
+		return -EOPNOTSUPP;
+
+	n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
+	if (n_stats < 0)
+		return n_stats;
+	if (n_stats > S32_MAX / sizeof(u64))
+		return -ENOMEM;
+	WARN_ON_ONCE(!n_stats);
+	if (copy_from_user(&stats, useraddr, sizeof(stats)))
+		return -EFAULT;
+
+	/* User can specify the level of stats to query.  How the
+	 * level value is used is up to the driver, but in general,
+	 * 0 means 'all', 1 means least, and higher means more.
+	 * The idea is that some stats may be expensive to query, so user
+	 * space could just ask for the cheap ones...
+	 */
+	stats_level = stats.n_stats;
+
+	stats.n_stats = n_stats;
+	data = vzalloc(n_stats * sizeof(u64));
+	if (n_stats && !data)
+		return -ENOMEM;
+
+	ops->get_ethtool_stats2(dev, &stats, data, stats_level);
+
+	ret = -EFAULT;
+	if (copy_to_user(useraddr, &stats, sizeof(stats)))
+		goto out;
+	useraddr += sizeof(stats);
+	if (n_stats && copy_to_user(useraddr, data, n_stats * sizeof(u64)))
+		goto out;
+	ret = 0;
+
+ out:
+	vfree(data);
+	return ret;
+}
+
 static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 {
 	struct ethtool_stats stats;

IMHO it would be more practical to set "0 means same as GSTATS" as a rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to avoid code duplication (or perhaps use a fall-through in the switch). It would also allow drivers to provide only one of the callbacks. Yes, but that would require changing all drivers at once, and would make backporting and out-of-tree drivers harder to manage. I had low hopes that this feature would make it upstream, so I didn't want to propose any large changes up front. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH net] virtio-net: disable NAPI only when enabled during XDP set
On 02/28/2018 09:22 AM, David Miller wrote: From: Jason Wang <jasow...@redhat.com> Date: Wed, 28 Feb 2018 18:20:04 +0800 We try to disable NAPI to prevent a single XDP TX queue being used by multiple cpus. But we don't check if the device is up (NAPI is enabled); this could result in a stall because of an infinite wait in napi_disable(). Fix this by checking the device state through netif_running() first. Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set") Signed-off-by: Jason Wang <jasow...@redhat.com> Yes, mis-paired NAPI enable/disable are really a pain. Probably, we can do something in the interfaces or mechanisms to make this less error prone and less fragile. Anyways, applied and queued up for -stable, thanks! I just hit a similar bug in ath10k. It seems like napi has plenty of free bit flags so it could keep track of 'is-enabled' state and allow someone to call napi_disable multiple times w/out deadlocking. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH RFC net-next 1/4] ipv4: fib_rules: support match on sport, dport and ip proto
On 02/12/2018 04:03 PM, David Miller wrote: From: Eric Dumazet <eric.duma...@gmail.com> Date: Mon, 12 Feb 2018 13:54:59 -0800 We had project/teams using different routing tables for each vlan they setup :/ Indeed, people use FIB rules and think they can scale in software. As currently implemented, they can't. The example you give sounds possibly like a great VRF use case btw :-) I'm one of those people with lots of FIB rules wishing it would scale better, and wanting a routing table per netdev. If there is a relatively easy suggestion to make this work better, I'd like to give it a try. I have not looked at VRF at all to date... Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/20/2017 08:03 PM, David Ahern wrote: On 6/20/17 5:41 PM, Ben Greear wrote: On 06/20/2017 11:05 AM, Michal Kubecek wrote: On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote: On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. You might try trace_printk() which should have less impact (don't forget to enable /proc/sys/kernel/ftrace_dump_on_oops). We cannot reproduce with trace_printk() either. I think that suggests the walker state is set to FWS_U in fib6_del_route, and it is the FWS_U case in fib6_walk_continue that triggers the fault -- the null parent (pn = fn->parent). So we have the 2 areas of code that are interacting. I'm on a road trip through the end of this week with little time to focus on this problem. I'll get back to you another suggestion when I can. So, though I don't know the right way to fix it, the patch below appears to make the system not crash.

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
 			pn = fn->parent;
 			w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES
+			if (WARN_ON_ONCE(!pn)) {
+				pr_err("FWS-U, w: %p fn: %p pn: %p\n",
+				       w, fn, pn);
+				/* Attempt to work around crash that has been here forever. --Ben */
+				return 0;
+			}
 			if (FIB6_SUBTREE(pn) == fn) {
 				WARN_ON(!(fn->fn_flags & RTN_ROOT));
 				w->state = FWS_L;

The printout looks like this (when adding 4000 mac-vlans, so it is pretty rare). PN is definitely NULL sometimes:

[root@2u-6n ~]# journalctl -f|grep FWS
Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: 8807ea121ba0 fn: 880856a09260 pn: (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 8807e3963de0 fn: 880856a09260 pn: (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 88081ac22de0 fn: 880856a09260 pn: (null)
Jan 24 15:53:13 2u-6n kernel: IPv6: FWS-U, w: 8808290c69c0 fn: 8807e369f920 pn: (null)
Jan 24 15:53:24 2u-6n kernel: IPv6: FWS-U, w: 8807ea3156c0 fn: 88082d1eeb60 pn: (null)

Jan 24 15:48:04 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1006
Jan 24 15:48:05 2u-6n kernel: [ cut here ]
Jan 24 15:48:05 2u-6n kernel: WARNING: CPU: 5 PID: 3346 at /home/greearb/git/linux-4.13.dev.y/net/ipv6/ip6_fib.c:1617 fib6_walk_continue+0x154/0x1b0 [ipv6]
Jan 24 15:48:05 2u-6n kernel: Modules linked in: 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen ipmi_ssif coretemp intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm ath9k irqbypass iTCO_wdt ath9k_common iTCO_vendor_support ath9k_hw ath i2c_i801 mac80211 joydev lpc_ich cfg80211 ioatdma shpchp tpm_tis tpm_tis_core wmi tpm ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl sch_fq_codel lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core ipv6 crc_ccitt
Jan 24 15:48:05 2u-6n kernel: CPU: 5 PID: 3346 Comm: ip Tainted: G O 4.13.16+ #22
Jan 24 15:48:05 2u-6n kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
Jan 24 15:48:05 2u-6n kernel: task: 8807e9ef1dc0 task.stack: c9002083c000
Jan 24 15:48:05 2u-6n kernel: RIP: 0010:fib6_walk_continue+0x154/0x1b0 [ipv6]
Jan 24 15:48:05 2u-6n kernel: RSP: 0018:c9002083fbc0 EFLAGS: 00010246
Jan 24 15:48:05 2u-6n kernel: RAX: RBX: 8807ea121ba0 RCX:
Jan 24 15:48:05 2u-6n kernel: RDX: 880856a09260 RSI: c9002083fc00 RDI: 81ef2140
Jan 24 15:48:05 2u-6n kernel: RBP: c9002083fbc8 R08: 0008 R09: 8807e36f6b25
Jan 24 15:48:05 2u-6n kernel: R10: c9002083fb70 R11: R12: 0002
Re: e1000e hardware unit hangs
On 01/24/2018 10:38 AM, Denys Fedoryshchenko wrote: On 2018-01-24 20:31, Ben Greear wrote: On 01/24/2018 08:34 AM, Neftin, Sasha wrote: On 1/24/2018 18:11, Alexander Duyck wrote: On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <gree...@candelatech.com> wrote: Hello, Anyone have any more suggestions for making e1000e work better? This is from a 4.9.65+ kernel, with these additional e1000e patches applied: e1000e: Fix error path in link detection e1000e: Fix wrong comment related to link detection e1000e: Fix return value test e1000e: Separate signaling for link check/link up e1000e: Avoid receiver overrun interrupt bursts Most of these patches shouldn't address anything that would trigger Tx hangs. They are mostly related to just link detection. Test case is simply to run 3 tcp connections each trying to send 56Kbps of bi-directional data between a pair of e1000e interfaces :) No OOM related issues are seen on this kernel...similar test on 4.13 showed some OOM issues, but I have not debugged that yet... Really a question like this probably belongs on e1000-devel or intel-wired-lan so I have added those lists and the e1000e maintainer to the thread. It would be useful if you could provide more information about the device itself such as the ID and the kind of test you are running. Keep in mind the e1000e driver supports a pretty broad swath of devices so we need to narrow things down a bit. Please also re-check that your kernel includes: e1000e: fix buffer overrun while the I219 is processing DMA transactions e1000e: fix the use of magic numbers for buffer overrun issue Where did you take the fresh version of the kernel from? Hello, I tried adding those two patches, but I still see this splat shortly after starting my test. The kernel I am using is here: https://github.com/greearb/linux-ct-4.13 I've seen similar issues at least back to the 4.0 kernel, including stock kernels and my own kernels with additional patches.
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: [ cut here ]
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 PID: 0 at /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.13.16+ #22
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: 81e104c0 task.stack: 81e0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:88042fc03e50 EFLAGS: 00010282
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0086 RBX: RCX:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 88042fc15b40 RSI: 88042fc0dbf8 RDI: 88042fc0dbf8
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: 88042fc03e98 R08: 0001 R09: 03c4
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: R11: 03c4 R12: 1388
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: 000100050dc3 R14: 88041767 R15: 000100052400
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: () GS:88042fc0() knlGS:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: ES: CR0: 80050033
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: 01d14000 CR3: 01e09000 CR4: 001406f0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_timer_fn+0x30/0x160
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: run_timer_softirq+0x1f0/0x450
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? lapic_next_deadline+0x21/0x30
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? clockevents_program_event+0x78/0xf0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: __do_softirq+0xc1/0x2c0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: smp_apic_timer_interrupt+0x38/0x50
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64
Re: e1000e hardware unit hangs
On 01/24/2018 08:34 AM, Neftin, Sasha wrote: On 1/24/2018 18:11, Alexander Duyck wrote: On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <gree...@candelatech.com> wrote: Hello, Anyone have any more suggestions for making e1000e work better? This is from a 4.9.65+ kernel, with these additional e1000e patches applied: e1000e: Fix error path in link detection e1000e: Fix wrong comment related to link detection e1000e: Fix return value test e1000e: Separate signaling for link check/link up e1000e: Avoid receiver overrun interrupt bursts Most of these patches shouldn't address anything that would trigger Tx hangs. They are mostly related to just link detection. Test case is simply to run 3 tcp connections each trying to send 56Kbps of bi-directional data between a pair of e1000e interfaces :) No OOM related issues are seen on this kernel...similar test on 4.13 showed some OOM issues, but I have not debugged that yet... Really a question like this probably belongs on e1000-devel or intel-wired-lan so I have added those lists and the e1000e maintainer to the thread. It would be useful if you could provide more information about the device itself such as the ID and the kind of test you are running. Keep in mind the e1000e driver supports a pretty broad swath of devices so we need to narrow things down a bit. Please also re-check that your kernel includes: e1000e: fix buffer overrun while the I219 is processing DMA transactions e1000e: fix the use of magic numbers for buffer overrun issue Where did you take the fresh version of the kernel from? Hello, I tried adding those two patches, but I still see this splat shortly after starting my test. The kernel I am using is here: https://github.com/greearb/linux-ct-4.13 I've seen similar issues at least back to the 4.0 kernel, including stock kernels and my own kernels with additional patches.
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: [ cut here ]
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 PID: 0 at /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.13.16+ #22
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: 81e104c0 task.stack: 81e0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:88042fc03e50 EFLAGS: 00010282
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0086 RBX: RCX:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 88042fc15b40 RSI: 88042fc0dbf8 RDI: 88042fc0dbf8
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: 88042fc03e98 R08: 0001 R09: 03c4
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: R11: 03c4 R12: 1388
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: 000100050dc3 R14: 88041767 R15: 000100052400
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: () GS:88042fc0() knlGS:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: ES: CR0: 80050033
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: 01d14000 CR3: 01e09000 CR4: 001406f0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_timer_fn+0x30/0x160
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: run_timer_softirq+0x1f0/0x450
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? lapic_next_deadline+0x21/0x30
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? clockevents_program_event+0x78/0xf0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: __do_softirq+0xc1/0x2c0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: smp_apic_timer_interrupt+0x38/0x50
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: apic_timer_interrupt+0x89/0x90
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/23/2018 03:27 PM, Ben Greear wrote: On 01/23/2018 03:21 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote: On 01/23/2018 02:29 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote: On 01/23/2018 02:07 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. 
In the failing case, I get a max of around 16k connections on the two physical ports. The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD) = 0
fcntl(2075, F_SETFD, FD_CLOEXEC) = 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD) = 0
fcntl(2076, F_SETFD, FD_CLOEXEC) = 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)

ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet <eduma...@google.com>
Date: Thu Feb 11 16:28:50 2016 -0800

tcp/dccp: better use of ephemeral ports in bind()

Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance.

Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: David S. Miller <da...@davemloft.net>

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net

I will be happy to test patches or try to get any other results that might help diagnose this problem better.
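The strace above shows the exact socket setup used per connection: SO_BINDTODEVICE, bind to a fixed local IP with port 0 (letting the kernel pick the ephemeral port), then a nonblocking connect that normally returns EINPROGRESS and, in the failing case, EADDRNOTAVAIL. A minimal Python sketch of that sequence (not from the thread; function names are mine, and the device-bind step is optional since it requires CAP_NET_RAW):

```python
import errno
import select
import socket

def bound_nonblocking_connect(local_ip, remote, device=None):
    """Reproduce the syscall sequence from the strace: socket(),
    optional SO_BINDTODEVICE (root only), SO_REUSEADDR, bind() to a
    fixed local IP with port 0 so the kernel picks the ephemeral
    port, then a nonblocking connect()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if device is not None:
        # Matches setsockopt(..., SO_BINDTODEVICE, "eth4\0...", 16)
        # in the trace; needs CAP_NET_RAW, so it is optional here.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                     device.encode() + b"\0")
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((local_ip, 0))
    s.setblocking(False)
    rc = s.connect_ex(remote)      # connect_ex returns errno, no exception
    if rc not in (0, errno.EINPROGRESS):
        # This is where the trace shows EADDRNOTAVAIL once the
        # ephemeral ports for this address pair are exhausted.
        raise OSError(rc, errno.errorcode.get(rc, "unknown errno"))
    return s

def wait_connected(s, timeout=5.0):
    """Wait for the async connect to complete; returns 0 on success."""
    select.select([], [s], [], timeout)
    return s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
```

Running thousands of these against the same remote IP:port from the same bound local IP consumes one ephemeral port each, which is what makes the eventual EADDRNOTAVAIL plausible.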
e1000e hardware unit hangs
, trans_start: 4294748730, wd-timeout: 5000 jiffies: 4294759424 tx-queues: 1
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: Reset adapter unexpectedly
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout: 5000 jiffies: 4294771200 tx-queues: 1
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout: 5000 jiffies: 4294771200 tx-queues: 1
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: Reset adapter unexpectedly
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: Detected Hardware Unit Hang: TDH TDT ...
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Thanks,
Ben

-- 
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/23/2018 03:21 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote: On 01/23/2018 02:29 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote: On 01/23/2018 02:07 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. 
In the failing case, I get a max of around 16k connections on the two physical ports. The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing. connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075 fcntl(2075, F_GETFD)= 0 fcntl(2075, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076 fcntl(2076, F_GETFD)= 0 fcntl(2076, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit commit ea8add2b190395408b22a9127bed2c0912aecbc8 Author: Eric Dumazet <eduma...@google.com> Date: Thu Feb 11 16:28:50 
2016 -0800 tcp/dccp: better use of ephemeral ports in bind() Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet <eduma...@google.com> Signed-off-by: David S. Miller <da...@davemloft.net> :04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net I will be happy to test patches or try to get any other results that might help diagnose this problem better.
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/23/2018 02:29 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote: On 01/23/2018 02:07 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. In the failing case, I get a max of around 16k connections on the two physical ports. 
The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing. connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075 fcntl(2075, F_GETFD)= 0 fcntl(2075, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076 fcntl(2076, F_GETFD)= 0 fcntl(2076, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit commit ea8add2b190395408b22a9127bed2c0912aecbc8 Author: Eric Dumazet <eduma...@google.com> Date: Thu Feb 11 16:28:50 2016 -0800 tcp/dccp: better use of ephemeral ports in bind() Implement strategy used 
in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet <eduma...@google.com> Signed-off-by: David S. Miller <da...@davemloft.net> :04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net I will be happy to test patches or try to get any other results that might help diagnose this problem better. Problem is I do not see anything obvious here. Please provide /proc/sys/net/ipv4/ip_local_port_range
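The bisected commit's strategy (connect()-style allocation prefers even ports, bind()-style allocation tries odd ports first, each falling back to the opposite parity) can be illustrated with a toy port picker. This is not the kernel code, just a simplified sketch of the parity split the commit message describes:

```python
def pick_local_port(low, high, in_use, connect_side):
    """Toy illustration of the odd/even split: connect()-style
    allocation prefers even ports, bind()-style allocation prefers
    odd ports, so the two allocators mostly draw from disjoint
    halves of the ephemeral range and only fall back to the other
    parity when their own half is exhausted."""
    preferred = 0 if connect_side else 1        # even first vs odd first
    for parity in (preferred, 1 - preferred):
        start = low + ((parity - low) % 2)      # first port of that parity
        for port in range(start, high + 1, 2):
            if port not in in_use:
                return port
    return None                                 # whole range exhausted
```

With this scheme, explicit bind(port=0) calls (as in Ben's strace) and plain connect() calls interfere with each other far less than if both scanned the range linearly.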
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/23/2018 02:07 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. In the failing case, I get a max of around 16k connections on the two physical ports. The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing. 
connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075 fcntl(2075, F_GETFD)= 0 fcntl(2075, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076 fcntl(2076, F_GETFD)= 0 fcntl(2076, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit commit ea8add2b190395408b22a9127bed2c0912aecbc8 Author: Eric Dumazet <eduma...@google.com> Date: Thu Feb 11 16:28:50 2016 -0800 tcp/dccp: better use of ephemeral ports in bind() Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. 
We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet <eduma...@google.com> Signed-off-by: David S. Miller <da...@davemloft.net> :04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net I will be happy to test patches or try to get any other results that might help diagnose this problem better. Problem is I do not see anything obvious here. Please provide /proc/sys/net/ipv4/ip_local_port_range [root@lf1003-e3v2-13100124-f20x64 ~]#
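The range Eric asks for matters because, with every socket bound to the same local IP and connecting to the same remote IP:port, each connection consumes one ephemeral port, so the range size is a hard ceiling on concurrent connections for that address pair. A back-of-the-envelope helper (mine, not from the thread) for the value read from /proc/sys/net/ipv4/ip_local_port_range:

```python
def ephemeral_capacity(port_range_text):
    """Given the contents of /proc/sys/net/ipv4/ip_local_port_range
    (two numbers, e.g. "32768\t60999"), return how many ephemeral
    ports are available. When all sockets share one local IP and one
    remote IP:port, this bounds the concurrent connection count for
    that address pair."""
    low, high = map(int, port_range_text.split())
    return high - low + 1
```

For example, the common default range 32768-60999 yields 28232 usable ports; a narrower range would cap connect() correspondingly lower, which is presumably why Eric wants to see the setting before digging further.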
Re: TCP many-connection regression between 4.7 and 4.13 kernels.
On 01/22/2018 10:46 AM, Josh Hunt wrote: On Mon, Jan 22, 2018 at 10:30 AM, Ben Greear <gree...@candelatech.com> wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? I am sending to self, but over external network interfaces, by using routing tables and rules and such. On 4.13.16+, I see the Intel driver bouncing when I try to start 20k connections. 
In this case, I have a pair of 10G ports doing 15k, and then I try to start 5k on two of the 1G ports.

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly

Ben

We had an interface doing this, and grabbing these commits resolved it for us:

4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts
19110cfbb34d e1000e: Separate signaling for link check/link up
d3509f8bc7b0 e1000e: Fix return value test
65a29da1f5fd e1000e: Fix wrong comment related to link detection
c4c40e51f9c3 e1000e: Fix error path in link detection

They are in the LTS kernels now, but I don't believe they were when we first hit this problem. Thanks a lot for the suggestions; I can confirm that these patches applied to my 4.13.16+ tree do indeed seem to fix the problem. Thanks, Ben Josh

-- 
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. In the failing case, I get a max of around 16k connections on the two physical ports. The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing. 
connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075 fcntl(2075, F_GETFD)= 0 fcntl(2075, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076 fcntl(2076, F_GETFD)= 0 fcntl(2076, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit commit ea8add2b190395408b22a9127bed2c0912aecbc8 Author: Eric Dumazet <eduma...@google.com> Date: Thu Feb 11 16:28:50 2016 -0800 tcp/dccp: better use of ephemeral ports in bind() Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. 
We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet <eduma...@google.com> Signed-off-by: David S. Miller <da...@davemloft.net> :04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net I will be happy to test patches or try to get any other results that might help diagnose this problem better. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
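Ben's send-to-self setup (routing table rules plus SO_BINDTODEVICE forcing traffic between two same-subnet IPs out the physical ports and across the looped cable, instead of short-circuiting through loopback) is described but never spelled out in the thread. A hypothetical sketch of the kind of policy-routing commands involved; the table numbers and exact rule forms are placeholders of mine, not taken from the thread:

```python
def send_to_self_commands(dev_a, ip_a, dev_b, ip_b,
                          table_a=10001, table_b=10002):
    """Generate ip(8) commands for a send-to-self pair: traffic
    sourced from ip_a and destined to ip_b is forced out dev_a via a
    dedicated routing table, and vice versa, so packets really cross
    the looped cable. Table numbers are arbitrary placeholders."""
    return [
        f"ip route add {ip_b}/32 dev {dev_a} table {table_a}",
        f"ip rule add from {ip_a} lookup {table_a}",
        f"ip route add {ip_a}/32 dev {dev_b} table {table_b}",
        f"ip rule add from {ip_b} lookup {table_b}",
    ]
```

On the client side this is combined with SO_BINDTODEVICE and binding to the specific source IP, as the strace excerpts in the thread show.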
Re: TCP many-connection regression between 4.7 and 4.13 kernels.
On 01/22/2018 10:30 AM, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? I am sending to self, but over external network interfaces, by using routing tables and rules and such. On 4.13.16+, I see the Intel driver bouncing when I try to start 20k connections. 
In this case, I have a pair of 10G ports doing 15k, and then I try to start 5k on two of the 1G ports.

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly

System reports 10+GB RAM free in this case, btw. Actually, maybe the good kernel was even older than 4.7...I see same resets and inability to do a full 20k connections on 4.7 too. I double-checked with system-test and it seems 4.4 was a good kernel. I'll test that next.
Here is splat from 4.7:

[ 238.921679] [ cut here ]
[ 238.921689] WARNING: CPU: 0 PID: 3 at /home/greearb/git/linux-bisect/net/sched/sch_generic.c:272 dev_watchdog+0xd4/0x12f
[ 238.921690] NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out
[ 238.921691] Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 macvlan pktgen bnep bluetooth fuse coretemp intel_rapl ftdi_sio x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt iTCO_vendor_support joydev ie31200_edac ipmi_devintf irqbypass serio_raw ipmi_si edac_core shpchp fjes video i2c_i801 tpm_tis lpc_ich ipmi_msghandler tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core e1000e ixgbe mdio hwmon dca ptp pps_core ipv6 [last unloaded: nf_conntrack]
[ 238.921720] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.7.0 #62
[ 238.921721] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
[ 238.921723] 88041cdd7cd8 81352a23 88041cdd7d28
[ 238.921725] 88041cdd7d18 810ea5dd 01101cdd7d90
[ 238.921727] 880417a84000 0100 8163ecff 880417a84440
[ 238.921728] Call Trace:
[ 238.921733] [] dump_stack+0x61/0x7d
[ 238.921736] [] __warn+0xbd/0xd8
[ 238.921738] [] ? netif_tx_lock+0x81/0x81
[ 238.921740] [] warn_slowpath_fmt+0x46/0x4e
[ 238.921741] [] ? netif_tx_lock+0x74/0x81
[ 238.921743] [] dev_watchdog+0xd4/0x12f
[ 238.921746] [] call_timer_fn+0x65/0x11b
[ 238.921748] [] ? netif_tx_lock+0x81/0x81
[ 238.921749] [] run_timer_softirq+0x1ad/0x1d7
[ 238.921751] [] __do_softirq+0xfb/0x25c
[ 238.921752] [] run_ksoftirqd+0x19/0x35
[ 238.921755] [] smpboot_thread_fn+0x169/0x1a9
[ 238.921756] [] ? sort_range+0x1d/0x1d
[ 238.921759] [] kthread+0xa0/0xa8
[ 238.921763] [] ret_from_fork+0x1f/0x40
[ 238.921764] [] ? init_completion+0x24/0x24
[ 238.921765] ---[ end trace 933912956c6ee5ff ]---
[ 238.961672] e1000e :07:00.0 eth3: Reset adapter unexpectedly

So, on 4.4.8+, I see this and other splats related to e1000e.
I guess that is a separate issue. I can easily start 40k connections, however: 30k across the two 10G ports, and 10k more across a pair of mac-vlans on the 10G ports (since I was out of address space to add a full 40k on the two physical ports). Looks like the e1000e problem is a separate issue.
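For reference, a mac-vlan pair like the one mentioned above can be created with iproute2; this is a minimal sketch (interface names and addresses here are hypothetical, not taken from the test rig):

```shell
# create a mac-vlan on top of a 10G port to gain extra address space
ip link add macvlan0 link eth4 type macvlan mode bridge
ip addr add 10.2.0.1/16 dev macvlan0
ip link set macvlan0 up
```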
Re: TCP many-connection regression between 4.7 and 4.13 kernels.
On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? I am sending to self, but over external network interfaces, by using routing tables and rules and such. On 4.13.16+, I see the Intel driver bouncing when I try to start 20k connections. 
In this case, I have a pair of 10G ports doing 15k, and then I try to start 5k on two of the 1G ports Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1 Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly System reports 10+GB RAM free in this case, btw. Actually, maybe the good kernel was even older than 4.7...I see same resets and inability to do a full 20k connections on 4.7 too. I double-checked with system-test and it seems 4.4 was a good kernel. I'll test that next. 
Here is splat from 4.7: [ 238.921679] [ cut here ] [ 238.921689] WARNING: CPU: 0 PID: 3 at /home/greearb/git/linux-bisect/net/sched/sch_generic.c:272 dev_watchdog+0xd4/0x12f [ 238.921690] NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out [ 238.921691] Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 macvlan pktgen bnep bluetooth fuse coretemp intel_rapl ftdi_sio x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt iTCO_vendor_support joydev ie31200_edac ipmi_devintf irqbypass serio_raw ipmi_si edac_core shpchp fjes video i2c_i801 tpm_tis lpc_ich ipmi_msghandler tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core e1000e ixgbe mdio hwmon dca ptp pps_core ipv6 [last unloaded: nf_conntrack] [ 238.921720] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.7.0 #62 [ 238.921721] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 [ 238.921723] 88041cdd7cd8 81352a23 88041cdd7d28 [ 238.921725] 88041cdd7d18 810ea5dd 01101cdd7d90 [ 238.921727] 880417a84000 0100 8163ecff 880417a84440 [ 238.921728] Call Trace: [ 238.921733] [] dump_stack+0x61/0x7d [ 238.921736] [] __warn+0xbd/0xd8 [ 238.921738] [] ? netif_tx_lock+0x81/0x81 [ 238.921740] [] warn_slowpath_fmt+0x46/0x4e [ 238.921741] [] ? netif_tx_lock+0x74/0x81 [ 238.921743] [] dev_watchdog+0xd4/0x12f [ 238.921746] [] call_timer_fn+0x65/0x11b [ 238.921748] [] ? netif_tx_lock+0x81/0x81 [ 238.921749] [] run_timer_softirq+0x1ad/0x1d7 [ 238.921751] [] __do_softirq+0xfb/0x25c [ 238.921752] [] run_ksoftirqd+0x19/0x35 [ 238.921755] [] smpboot_thread_fn+0x169/0x1a9 [ 238.921756] [] ? sort_range+0x1d/0x1d [ 238.921759] [] kthread+0xa0/0xa8 [ 238.921763] [] ret_from_fork+0x1f/0x40 [ 238.921764] [] ? 
init_completion+0x24/0x24 [ 238.921765] ---[ end trace 933912956c6ee5ff ]--- [ 238.961672] e1000e :07:00.0 eth3: Reset adapter unexpectedly Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
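For anyone trying to reproduce the send-to-self setup described above, here is a rough sketch of the usual policy-routing recipe. Addresses, interface names, and rule priorities are hypothetical; exact sysctl behavior varies by kernel version, and the sockets must also bind to the per-port source addresses so the rules match.

```shell
# Assumed topology: 10.1.1.1 on eth2 and 10.1.1.2 on eth3, cabled together.
# Per-source rules push locally-owned destinations out the physical ports
# instead of the kernel's internal loopback delivery path.
sysctl -w net.ipv4.conf.eth2.accept_local=1
sysctl -w net.ipv4.conf.eth3.accept_local=1
sysctl -w net.ipv4.conf.all.rp_filter=2    # loose reverse-path filtering

ip rule add pref 100 from 10.1.1.1 lookup 100
ip route add 10.1.1.2 dev eth2 src 10.1.1.1 table 100
ip rule add pref 101 from 10.1.1.2 lookup 101
ip route add 10.1.1.1 dev eth3 src 10.1.1.2 table 101
```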
TCP many-connection regression between 4.7 and 4.13 kernels.
My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
fm10k cannot get link
Hello, We're trying to get an Intel 100G NIC to work, and so far, cannot get it to link. The cable is: X0016I4AO3 QSFP28 10Gtek (any suggestions for a better/different one?) [5.022681] fm10k :05:00.0: PCI Express bandwidth of 64GT/s available [5.022683] fm10k :05:00.0: (Speed:8.0GT/s, Width: x8, Encoding Loss:<2%, Payload:256B) [5.022684] fm10k :05:00.0: 00:e0:ed:54:78:f2 [5.027864] fm10k :06:00.0: PCI Express bandwidth of 64GT/s available [5.027865] fm10k :06:00.0: (Speed:8.0GT/s, Width: x8, Encoding Loss:<2%, Payload:256B) [5.027866] fm10k :06:00.0: 00:e0:ed:54:78:f3 [6.057950] Modules linked in: ioatdma(+) shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm igb drm i2c_algo_bit i2c_core ixgbe mdio hwmon fm10k ptp pps_core dca fjes ipv6 crc_ccitt [7.294441] fm10k :05:00.0 eth0.r: renamed from eth0 [ 14.044914] fm10k :05:00.0 eth2: renamed from eth0.r [ 14.107798] fm10k :06:00.0 eth1.r: renamed from eth1 [ 14.178217] fm10k :06:00.0 eth3: renamed from eth1.r [root@lf1005c-is14120020 ~]# ethtool eth3 Settings for eth3: Current message level: 0x0007 (7) drv probe link Link detected: no [root@lf1005c-is14120020 ~]# uname -a Linux lf1005c-is14120020 4.9.29+ #46 SMP PREEMPT Wed Jul 26 17:48:57 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux [root@lf1005c-is14120020 ~]# ethtool -i eth3 driver: fm10k version: 0.21.2-k firmware-version: bus-info: :06:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: yes supports-priv-flags: no [root@lf1005c-is14120020 ~]# lspci|grep 06 06:00.0 Ethernet controller: Intel Corporation Device 15a4 Please let me know if you have any suggestions. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Ethtool question
On 10/12/2017 03:00 PM, Roopa Prabhu wrote: On Thu, Oct 12, 2017 at 2:45 PM, Ben Greear <gree...@candelatech.com> wrote: On 10/11/2017 01:49 PM, David Miller wrote: From: "John W. Linville" <linvi...@tuxdriver.com> Date: Wed, 11 Oct 2017 16:44:07 -0400 On Wed, Oct 11, 2017 at 09:51:56AM -0700, Ben Greear wrote: I noticed today that setting some ethtool settings to the same value returns an error code. I would think this should silently return success instead? Makes it easier to call it from scripts this way: [root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1 combined unmodified, ignoring no channel parameters changed, aborting current values: tx 0 rx 0 other 1 combined 1 [root@lf0313-6477 lanforge]# echo $? 1 I just had this discussion a couple of months ago with someone. My initial feeling was like you, a no-op is not a failure. But someone convinced me otherwise...I will now endeavour to remember who that was and how they convinced me... Anyone else have input here? I guess this usually happens when drivers don't support changing the settings at all. So they just make their ethtool operation for the 'set' always return an error. We could have a generic ethtool helper that does "get" and then if the "set" request is identical just return zero. But from another perspective, the error returned from the "set" in this situation also indicates to the user that the driver does not support the "set" operation which has value and meaning in and of itself. And we'd lose that with the given suggestion. In my case, the driver (igb) does support the set, my program just made the same ethtool call several times and it fails after the initial change (that actually changes something), as best as I can figure. This error is returned by ethtool user-space. It does a get, check and then set if user has requested changes. So, should we fix ethtool to return 0 in this case instead of an error code? I think so. 
If the driver itself returns an error, then probably return that error code and/or fix the driver, as seems appropriate. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Ethtool question
On 10/11/2017 01:49 PM, David Miller wrote: From: "John W. Linville" <linvi...@tuxdriver.com> Date: Wed, 11 Oct 2017 16:44:07 -0400 On Wed, Oct 11, 2017 at 09:51:56AM -0700, Ben Greear wrote: I noticed today that setting some ethtool settings to the same value returns an error code. I would think this should silently return success instead? Makes it easier to call it from scripts this way: [root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1 combined unmodified, ignoring no channel parameters changed, aborting current values: tx 0 rx 0 other 1 combined 1 [root@lf0313-6477 lanforge]# echo $? 1 I just had this discussion a couple of months ago with someone. My initial feeling was like you, a no-op is not a failure. But someone convinced me otherwise...I will now endeavour to remember who that was and how they convinced me... Anyone else have input here? I guess this usually happens when drivers don't support changing the settings at all. So they just make their ethtool operation for the 'set' always return an error. We could have a generic ethtool helper that does "get" and then if the "set" request is identical just return zero. But from another perspective, the error returned from the "set" in this situation also indicates to the user that the driver does not support the "set" operation which has value and meaning in and of itself. And we'd lose that with the given suggestion. In my case, the driver (igb) does support the set, my program just made the same ethtool call several times and it fails after the initial change (that actually changes something), as best as I can figure. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Ethtool question
I noticed today that setting some ethtool settings to the same value returns an error code. I would think this should silently return success instead? Makes it easier to call it from scripts this way: [root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1 combined unmodified, ignoring no channel parameters changed, aborting current values: tx 0 rx 0 other 1 combined 1 [root@lf0313-6477 lanforge]# echo $? 1 Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?
On 09/12/2017 01:26 PM, Michal Kubecek wrote: On Tue, Sep 12, 2017 at 11:54:43AM -0700, Ben Greear wrote: It does not appear to work on Fedora-26, and I'm curious if someone knows what needs doing to get this support working? It's rather complicated. The "vlan" and "vlan " filters didn't handle the case when vlan information is passed in metadata until commit 04660eb1e561 ("Use BPF extensions in compiled filters"), i.e. libpcap 1.7.0. Unfortunately that commit made libpcap always check only metadata for the first outermost vlan tag so that it broke the case when vlan information is passed in packet itself (which is less frequent today). To handle both cases correctly, you would need libpcap with commits d739b068ac29 ("Make VLAN filter handle both metadata and inline tags") and 7c7a19fbd9af ("Fix logic of combined VLAN test") and also the optimizer fix from https://github.com/the-tcpdump-group/libpcap/pull/582/commits/075015a3d17a (without it the filters generate incorrect BPF in some cases unless the optimizer is disabled). As far as I can see, these commits are not in any release yet. Michal Kubecek So, I cloned the latest libpcap, and I'm going to start poking at this. Do you happen to know if I need to do anything special other than 'pcap_compile()'? I'm curious how the library would know if it can use newer kernel API or not...or maybe it is somehow magically backwards/forward compatible? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?
On 09/12/2017 11:54 AM, Ben Greear wrote: It does not appear to work on Fedora-26, and I'm curious if someone knows what needs doing to get this support working? Thanks, Ben Gah, I spoke too soon. system-test guy says it works on cmd-line, but not when we try to make it work in another way...could be local bug, I'll poke at this more. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Can libpcap filter on vlan tags when vlans are hardware-accelerated?
It does not appear to work on Fedora-26, and I'm curious if someone knows what needs doing to get this support working? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] Fix build on fedora-14 (and other older systems)
On 09/03/2017 08:50 AM, Stephen Hemminger wrote: On Sat, 2 Sep 2017 07:15:02 -0700 gree...@candelatech.com wrote: diff --git a/include/linux/sysinfo.h b/include/linux/sysinfo.h index 934335a..3596b02 100644 --- a/include/linux/sysinfo.h +++ b/include/linux/sysinfo.h @@ -3,6 +3,14 @@ #include +/* So we can compile on older OSs, hopefully this is correct. --Ben */ +#ifndef __kernel_long_t +typedef long __kernel_long_t; +#endif +#ifndef __kernel_ulong_t +typedef unsigned long __kernel_ulong_t; +#endif + #define SI_LOAD_SHIFT 16 struct sysinfo { __kernel_long_t uptime; /* Seconds since boot */ I am not accepting this patch because all files in include/linux are automatically regenerated from kernel 'make install_headers'. No exceptions. If you want to change a header in include/linux it has to go through upstream kernel inclusion. It would be wrong to add this to the actual kernel header I think. Do you have another suggestion for fixing iproute2 compile? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Problem compiling iproute2 on older systems
On 09/02/2017 12:55 AM, Michal Kubecek wrote: On Fri, Sep 01, 2017 at 04:52:20PM -0700, Ben Greear wrote: In the patch below, usage of __kernel_ulong_t and __kernel_long_t is introduced, but that is not available on older systems (fedora-14, at least). It is not a #define, so I am having trouble finding a quick hack around this. Any ideas on how to make this work better on older OSs running modern kernels? Author: Stephen Hemminger <step...@networkplumber.org> 2017-01-12 17:54:39 Committer: Stephen Hemminger <step...@networkplumber.org> 2017-01-12 17:54:39 Child: c7ec7697e3f000359aa317394e6dd972e35c1f84 (Fix build on fedora-14 (and other older systems)) Branches: master, remotes/origin/master Follows: v3.10.0 Precedes: add more uapi header files In order to ensure no backward/forward compatibility problems, make sure that all kernel headers used come from the local copy. Signed-off-by: Stephen Hemminger <step...@networkplumber.org> --- include/linux/sysinfo.h --- new file mode 100644 index 000..934335a @@ -0,0 +1,24 @@ +#ifndef _LINUX_SYSINFO_H +#define _LINUX_SYSINFO_H + +#include <linux/types.h> + +#define SI_LOAD_SHIFT 16 +struct sysinfo { + __kernel_long_t uptime; /* Seconds since boot */ + __kernel_ulong_t loads[3]; /* 1, 5, and 15 minute load averages */ + __kernel_ulong_t totalram; /* Total usable main memory size */ + __kernel_ulong_t freeram; /* Available memory size */ + __kernel_ulong_t sharedram; /* Amount of shared memory */ + __kernel_ulong_t bufferram; /* Memory used by buffers */ + __kernel_ulong_t totalswap; /* Total swap space size */ + __kernel_ulong_t freeswap; /* swap space still available */ + __u16 procs; /* Number of current processes */ + __u16 pad; /* Explicit padding for m68k */ + __kernel_ulong_t totalhigh; /* Total high memory size */ + __kernel_ulong_t freehigh; /* Available high memory size */ + __u32 mem_unit; /* Memory unit size in bytes */ + char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)]; /* Padding: libc5 uses this..
*/ +}; + +#endif /* _LINUX_SYSINFO_H */ I've already been thinking about this a bit. Normally, we would simply add the file where __kernel_long_t and __kernel_ulong_t are defined. The problem is this is <asm/posix_types.h>, which depends on architecture - which is the point of these types. The good thing is iproute2 doesn't actually use struct sysinfo anywhere, so we don't need to have them defined correctly. One possible workaround would therefore be defining them as long and unsigned long. As long as we don't use the types anywhere, we would be fine. Another option would be to replace include/linux/sysinfo.h with an empty file. The problem I can see with this is that if someone uses a script to refresh all copies of uapi headers automatically, the script would have to be aware that it must not update this file and preserve the fake empty one. I just sent a patch that appears to compile on all of my build systems, which are generally fedora-14 to fedora-24 currently. I haven't actually tested functionality yet, but if you say it is unused, then it is very likely to be OK, and even if not, I think it will be fine unless someone is trying to cross-compile. And in that case, there is probably more than one issue involved... Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Problem compiling iproute2 on older systems
In the patch below, usage of __kernel_ulong_t and __kernel_long_t is introduced, but that is not available on older systems (fedora-14, at least). It is not a #define, so I am having trouble finding a quick hack around this. Any ideas on how to make this work better on older OSs running modern kernels? Author: Stephen Hemminger <step...@networkplumber.org> 2017-01-12 17:54:39 Committer: Stephen Hemminger <step...@networkplumber.org> 2017-01-12 17:54:39 Child: c7ec7697e3f000359aa317394e6dd972e35c1f84 (Fix build on fedora-14 (and other older systems)) Branches: master, remotes/origin/master Follows: v3.10.0 Precedes: add more uapi header files In order to ensure no backward/forward compatibility problems, make sure that all kernel headers used come from the local copy. Signed-off-by: Stephen Hemminger <step...@networkplumber.org> --- include/linux/sysinfo.h --- new file mode 100644 index 000..934335a @@ -0,0 +1,24 @@ +#ifndef _LINUX_SYSINFO_H +#define _LINUX_SYSINFO_H + +#include <linux/types.h> + +#define SI_LOAD_SHIFT 16 +struct sysinfo { + __kernel_long_t uptime; /* Seconds since boot */ + __kernel_ulong_t loads[3]; /* 1, 5, and 15 minute load averages */ + __kernel_ulong_t totalram; /* Total usable main memory size */ + __kernel_ulong_t freeram; /* Available memory size */ + __kernel_ulong_t sharedram; /* Amount of shared memory */ + __kernel_ulong_t bufferram; /* Memory used by buffers */ + __kernel_ulong_t totalswap; /* Total swap space size */ + __kernel_ulong_t freeswap; /* swap space still available */ + __u16 procs; /* Number of current processes */ + __u16 pad; /* Explicit padding for m68k */ + __kernel_ulong_t totalhigh; /* Total high memory size */ + __kernel_ulong_t freehigh; /* Available high memory size */ + __u32 mem_unit; /* Memory unit size in bytes */ + char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)]; /* Padding: libc5 uses this.. */ +}; + +#endif /* _LINUX_SYSINFO_H */ -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers
On 08/16/2017 08:18 PM, Dan Williams wrote: On Wed, 2017-08-16 at 19:36 -0700, Ben Greear wrote: On 08/16/2017 07:11 PM, Dan Williams wrote: On Wed, 2017-08-16 at 14:31 -0700, David Miller wrote: From: Dan Williams <d...@redhat.com> Date: Wed, 16 Aug 2017 16:22:41 -0500 My biggest suggestion is that perhaps bonding should grow hysteresis for link speeds. Since WiFi can change speed every packet, you probably don't want the bond characteristics changing every couple seconds just in case your WiFi link is jumping around. Ethernet won't bounce around that much, so the hysteresis would have no effect there. Or, if people are concerned about response time to speed changes on ethernet (where you probably do want an instant switch-over) some new flag to indicate that certain devices don't have stable speeds over time. Or just report the average of the range the wireless link can hit, and be done with it. I think you guys are overcomplicating things. That range can be from 1 to > 800Mb/s. No, it won't usually be all over that range, but it won't be uncommon to fluctuate by hundreds of Mb/s. I'm not sure a simple average is really the answer here. Even doing that would require new knobs to ethtool, since the rate depends heavily on card capabilities and also what AP you're connected to *at that moment*. If you roam to another AP, then the max speed can certainly change. You'll probably say "aim for the 75% case" or something like that, which is fine, but then you're depending on your 75% case to be (a) single AP, (b) never move (eg, only bond wifi + ethernet), (c) little radio interference. I'm not sure I'd buy that. If I've put words in your mouth, forgive me. If you keep ethtool API simple and just return the last (rx-rate + tx-rate) / 2, or the rate averaged over the last 100 frames or 10 seconds, then the caller can do longer term averaging as it sees fit. Probably no need for lots of averaging complexity in the kernel. 
Yeah, that works too, but I was thinking it was better to present the actual data through ethtool so that things other than bonding could use it, and since bonding is the thing that actually cares about the fluctuation, make it do the more extensive processing. What do you mean by 'actual data'? If you want to know the most accurate transmit/rx rate info, then you need to pay attention to each and every frame's tx/rx rate, as well as its ampdu/amsdu, retries, etc. It is virtually impossible. So, you will have to settle for something less... I suggest something simple to calculate, similar to existing values that are available via debugfs and/or 'iw dev foo station dump', etc. Let higher layers manipulate the raw data as they see fit (they can query ethtool as often as they like). Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers
On 08/16/2017 07:11 PM, Dan Williams wrote: On Wed, 2017-08-16 at 14:31 -0700, David Miller wrote: From: Dan Williams <d...@redhat.com> Date: Wed, 16 Aug 2017 16:22:41 -0500 My biggest suggestion is that perhaps bonding should grow hysteresis for link speeds. Since WiFi can change speed every packet, you probably don't want the bond characteristics changing every couple seconds just in case your WiFi link is jumping around. Ethernet won't bounce around that much, so the hysteresis would have no effect there. Or, if people are concerned about response time to speed changes on ethernet (where you probably do want an instant switch-over) some new flag to indicate that certain devices don't have stable speeds over time. Or just report the average of the range the wireless link can hit, and be done with it. I think you guys are overcomplicating things. That range can be from 1 to > 800Mb/s. No, it won't usually be all over that range, but it won't be uncommon to fluctuate by hundreds of Mb/s. I'm not sure a simple average is really the answer here. Even doing that would require new knobs to ethtool, since the rate depends heavily on card capabilities and also what AP you're connected to *at that moment*. If you roam to another AP, then the max speed can certainly change. You'll probably say "aim for the 75% case" or something like that, which is fine, but then you're depending on your 75% case to be (a) single AP, (b) never move (eg, only bond wifi + ethernet), (c) little radio interference. I'm not sure I'd buy that. If I've put words in your mouth, forgive me. If you keep ethtool API simple and just return the last (rx-rate + tx-rate) / 2, or the rate averaged over the last 100 frames or 10 seconds, then the caller can do longer term averaging as it sees fit. Probably no need for lots of averaging complexity in the kernel. rate-ctrl for wifi basically doesn't happen until you transmit or receive a fairly steady stream, so it will fluctuate a lot. 
Thanks, Ben Dan -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/20/2017 11:05 AM, Michal Kubecek wrote: On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote: On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. You might try trace_printk() which should have less impact (don't forget to enable /proc/sys/kernel/ftrace_dump_on_oops). We cannot reproduce with trace_printk() either. Thanks, Ben Michal Kubecek -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/13/2017 01:28 PM, David Ahern wrote: On 6/13/17 2:16 PM, Ben Greear wrote: On 06/09/2017 02:25 PM, Eric Dumazet wrote: On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote: On 6/8/17 11:55 PM, Cong Wang wrote: On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear <gree...@candelatech.com> wrote: As far as I can tell, the patch did not help, or at least we still reproduce the crash easily. netlink dump is serialized by nlk->cb_mutex so I don't think that patch makes any sense w.r.t. race condition. From what I can see fn_sernum should be accessed under the table lock, so when saving and checking it during a walk make sure the lock is held. That has nothing to do with the netlink dump, but the table changing during a walk. Yes, your patch makes total sense, of course. I guess someone should go ahead and make an official patch and submit it, even if it doesn't fix my problem. I can do that; was hoping to root cause the problem first. (gdb) l *(fib6_walk_continue+0x76) 0x188c6 is in fib6_walk_continue (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593). 1588 if (fn == w->root) 1589 return 0; 1590 pn = fn->parent; 1591 w->node = pn; 1592 #ifdef CONFIG_IPV6_SUBTREES 1593 if (FIB6_SUBTREE(pn) == fn) { Apparently fn->parent is NULL here for some reason, but I don't know if that is expected or not. If a simple NULL check is not enough here, we have to trace why it is NULL. From my understanding, parent should not be null, hence the attempts to fix access to table nodes under a lock, i.e., figuring out why it is null here. If someone has more suggestions, I'll be happy to test. I have looked at the code again and nothing is jumping out. Will look again later today. I noticed there is some code to help fix up the walkers when nodes are deleted. They use the lock: read_lock(&net->ipv6.fib6_walker_lock); The code you were tweaking uses a different lock: read_lock_bh(&table->tb6_lock); It is certainly not simple code, so I don't know if that is correct or not, but might possibly be a place to start looking.
I'm going to re-test with a WARN_ON to see if that triggers, since the previous suggestion was that fn->parent was NULL. diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 51cd637..86295df 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1571,6 +1571,10 @@ static int fib6_walk_continue(struct fib6_walker *w) case FWS_U: if (fn == w->root) return 0; + if (!fn->parent) { + WARN_ON_ONCE(1); + return 0; + } pn = fn->parent; w->node = pn; #ifdef CONFIG_IPV6_SUBTREES Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/09/2017 02:25 PM, Eric Dumazet wrote: On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote: On 6/8/17 11:55 PM, Cong Wang wrote: On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear <gree...@candelatech.com> wrote: As far as I can tell, the patch did not help, or at least we still reproduce the crash easily. netlink dump is serialized by nlk->cb_mutex so I don't think that patch makes any sense w.r.t. race condition. From what I can see fn_sernum should be accessed under the table lock, so when saving and checking it during a walk make sure the lock is held. That has nothing to do with the netlink dump, but the table changing during a walk. Yes, your patch makes total sense, of course. I guess someone should go ahead and make an official patch and submit it, even if it doesn't fix my problem. (gdb) l *(fib6_walk_continue+0x76) 0x188c6 is in fib6_walk_continue (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593). 1588 if (fn == w->root) 1589 return 0; 1590 pn = fn->parent; 1591 w->node = pn; 1592 #ifdef CONFIG_IPV6_SUBTREES 1593 if (FIB6_SUBTREE(pn) == fn) { Apparently fn->parent is NULL here for some reason, but I don't know if that is expected or not. If a simple NULL check is not enough here, we have to trace why it is NULL. From my understanding, parent should not be null, hence the attempts to fix access to table nodes under a lock, i.e., figuring out why it is null here. If someone has more suggestions, I'll be happy to test. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
(gdb) l *(inet6_dump_fib+0x1ab)
0x1939b is in inet6_dump_fib (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:392).
387				w->skip = w->count;
388			} else
389				w->skip = 0;
390
391			res = fib6_walk_continue(w);
392			read_unlock_bh(&table->tb6_lock);
393			if (res <= 0) {
394				fib6_walker_unlink(net, w);
395				cb->args[4] = 0;
396			}
(gdb)

[greearb@ben-dt3 linux-2.6]$ git diff
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d4bf2c6..4e32a16 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -372,12 +372,13 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk(net, w);
-		read_unlock_bh(&table->tb6_lock);
 		if (res > 0) {
 			cb->args[4] = 1;
 			cb->args[5] = w->root->fn_sernum;
 		}
+		read_unlock_bh(&table->tb6_lock);
 	} else {
+		read_lock_bh(&table->tb6_lock);
 		if (cb->args[5] != w->root->fn_sernum) {
 			/* Begin at the root if the tree changed */
 			cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 		} else
 			w->skip = 0;
-		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk_continue(w);
 		read_unlock_bh(&table->tb6_lock);
 		if (res <= 0) {

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/06/2017 05:27 PM, Eric Dumazet wrote:
On Tue, 2017-06-06 at 18:00 -0600, David Ahern wrote:
On 6/6/17 3:06 PM, Ben Greear wrote:

This bug has been around forever, and we recently got an intern and stuck him with trying to reproduce it on the latest kernel. It is still here. I'm not super excited about trying to fix this, but we can easily test patches if someone has a patch to try.

Can you try this (whitespace damaged on paste, but it is moving the lock ahead of the fn_sernum check):

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index deea901746c8..7a44c49055c0 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -378,6 +378,7 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 			cb->args[5] = w->root->fn_sernum;
 		}
 	} else {
+		read_lock_bh(&table->tb6_lock);
 		if (cb->args[5] != w->root->fn_sernum) {
 			/* Begin at the root if the tree changed */
 			cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 		} else
 			w->skip = 0;
-		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk_continue(w);
 		read_unlock_bh(&table->tb6_lock);
 		if (res <= 0) {

Good catch, but it looks like a similar fix is needed a few lines before.

We will test this tomorrow.
Thanks,
Ben

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index deea901746c8570c5e801e40592c91e3b62812e0..b214443dc8346cef3690df7f27cc48a864028865 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -372,12 +372,13 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk(net, w);
-		read_unlock_bh(&table->tb6_lock);
 		if (res > 0) {
 			cb->args[4] = 1;
 			cb->args[5] = w->root->fn_sernum;
 		}
+		read_unlock_bh(&table->tb6_lock);
 	} else {
+		read_lock_bh(&table->tb6_lock);
 		if (cb->args[5] != w->root->fn_sernum) {
 			/* Begin at the root if the tree changed */
 			cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 		} else
 			w->skip = 0;
-		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk_continue(w);
 		read_unlock_bh(&table->tb6_lock);
 		if (res <= 0) {

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
Hello,

This bug has been around forever, and we recently got an intern and stuck him with trying to reproduce it on the latest kernel. It is still here. I'm not super excited about trying to fix this, but we can easily test patches if someone has a patch to try.

The test case is to create 1000 mac-vlans and bring them up, with user-space tools running lots of 'dump' related commands as part of bringing up the interfaces and configuring some special source-based routing tables.

(gdb) l *(inet6_dump_fib+0x109)
0x192f9 is in inet6_dump_fib (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:392).
387			} else
388				w->skip = 0;
389
390			read_lock_bh(&table->tb6_lock);
391			res = fib6_walk_continue(w);
392			read_unlock_bh(&table->tb6_lock);
393			if (res <= 0) {
394				fib6_walker_unlink(net, w);
395				cb->args[4] = 0;
396			}
(gdb) l *(fib6_walk_continue+0x76)
0x188c6 is in fib6_walk_continue (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593).
1588			if (fn == w->root)
1589				return 0;
1590			pn = fn->parent;
1591			w->node = pn;
1592	#ifdef CONFIG_IPV6_SUBTREES
1593			if (FIB6_SUBTREE(pn) == fn) {
1594				WARN_ON(!(fn->fn_flags & RTN_ROOT));
1595				w->state = FWS_L;
1596				continue;
1597			}

[root@ct524-ffb0 ~]#
BUG: unable to handle kernel NULL pointer dereference at 0018
IP: fib6_walk_continue+0x76/0x180 [ipv6]
PGD 3d9226067 P4D 3d9226067 PUD 3d9020067 PMD 0
Oops: [#1] PREEMPT SMP
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c bnep fuse macvlan pktgen cfg80211 ipmi_ssif iTCO_wdt iTCO_vendor_support coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass joydev i2c_i801 ie31200_edac intel_pch_thermal shpchp hci_uart ipmi_si btbcm btqca ipmi_devintf btintel ipmi_msghandler bluetooth pinctrl_sunrisepoint acpi_als pinctrl_intel video tpm_tis intel_lpss_acpi kfifo_buf tpm_tis_core intel_lpss industrialio tpm acpi_pad acpi_power_meter sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_hid
i2c_core ipv6 crc_ccitt [last unloaded: nf_conntrack] CPU: 1 PID: 996 Comm: ip Not tainted 4.12.0-rc4+ #32 Hardware name: Supermicro Super Server/X11SSM-F, BIOS 1.0b 12/29/2015 task: 8803d4d61dc0 task.stack: c9000970c000 RIP: 0010:fib6_walk_continue+0x76/0x180 [ipv6] RSP: 0018:c9000970fbb8 EFLAGS: 00010283 RAX: 8803de84b020 RBX: 8803e0756f00 RCX: RDX: RSI: c9000970fc00 RDI: 81eee280 RBP: c9000970fbc0 R08: 0008 R09: 8803d4fbbf31 R10: c9000970fb68 R11: R12: 0001 R13: 0001 R14: 8803e0756f00 R15: 8803d9345b18 FS: 7f32ca4ec700() GS:88047784() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0018 CR3: 0003ddacc000 CR4: 003406e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: inet6_dump_fib+0x109/0x290 [ipv6] netlink_dump+0x11d/0x290 netlink_recvmsg+0x260/0x3f0 sock_recvmsg+0x38/0x40 ___sys_recvmsg+0xe9/0x230 ? alloc_pages_vma+0x9d/0x260 ? page_add_new_anon_rmap+0x88/0xc0 ? lru_cache_add_active_or_unevictable+0x31/0xb0 ? __handle_mm_fault+0xce3/0xf70 __sys_recvmsg+0x3d/0x70 ? __sys_recvmsg+0x3d/0x70 SyS_recvmsg+0xd/0x20 do_syscall_64+0x56/0xc0 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f32c9e21050 RSP: 002b:7fff96401de8 EFLAGS: 0246 ORIG_RAX: 002f RAX: ffda RBX: RCX: 7f32c9e21050 RDX: RSI: 7fff96401e50 RDI: 0004 RBP: 7fff96405e74 R08: 3fe4 R09: R10: 7fff96401e90 R11: 0246 R12: 0064f3a0 R13: 7fff96405ee0 R14: 3fe4 R15: Code: f6 40 2a 04 74 11 8b 53 30 85 d2 0f 84 02 01 00 00 83 ea 01 89 53 30 c7 43 28 04 00 00 00 48 39 43 10 74 33 48 8b 10 48 89 53 18 <48> 39 42 18 0f 84 a3 00 00 00 48 39 42 08 0f 84 ae 00 00 00 48 RIP: fib6_walk_continue+0x76/0x180 [ipv6] RSP: c9000970fbb8 CR2: 0018 ---[ end trace 5ebbc4ee97bea64e ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled Rebooting in 10 seconds.. ACPI MEMORY or I/O RESET_REG. -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: 'iw events' stops receiving events after a while on 4.9 + hacks
On 05/31/2017 01:18 AM, Bastian Bittorf wrote: * Johannes Berg <johan...@sipsolutions.net> [31.05.2017 10:09]: Is there any way to dump out the socket information if we reproduce the problem? I have no idea, sorry. If you or Bastian can tell me how to reproduce the problem, I can try to investigate it. there was an interesting fix regarding the shell-builtin 'read' in busybox[1]. I will retest again and report if this changes anything. bye, bastian PS: @ben: are you also using 'iw event | while read -r LINE ...'? I'm using a perl script to read the output, and not using busybox. I have not seen the problem again, so it is not easy for me to reproduce. If you reproduce it, maybe check 'strace' on the 'iw' process to see if it is hung on writing output to the pipe or reading input? In my case, it appeared to be hung reading input from netlink, input that never arrived. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: 'iw events' stops receiving events after a while on 4.9 + hacks
On 05/17/2017 06:30 AM, Johannes Berg wrote:
On Wed, 2017-05-17 at 12:08 +0200, Bastian Bittorf wrote:
* Ben Greear <gree...@candelatech.com> [17.05.2017 11:51]:

I have been keeping an 'iw events' program running with a perl script gathering its output and post-processing it. This has been working for several years on 4.7 and earlier kernels, but when testing on 4.9 overnight, I notice that 'iw events' is not showing any input. 'strace' shows that it is waiting on recvmsg. If I start a second 'iw events' then it will get wifi events as expected.

me too, also seen on 4.4 - I'm happy for debug ideas.

I've never seen this. Does it happen when it's very long-running? Or when there are lots of events? Perhaps something in the socket buffer accounting is going wrong, so that it's slowly decreasing to 0?

I saw it exactly once so far, and it happened overnight, but we have not been doing a lot of work with the 4.9 kernel until recently. I don't think there were many messages on this system, and certainly others have run much longer on systems that should be generating many more events without trouble.

Is there any way to dump out the socket information if we reproduce the problem?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
'iw events' stops receiving events after a while on 4.9 + hacks
I have been keeping an 'iw events' program running with a perl script gathering its output and post-processing it. This has been working for several years on 4.7 and earlier kernels, but when testing on 4.9 overnight, I notice that 'iw events' is not showing any input. 'strace' shows that it is waiting on recvmsg. If I start a second 'iw events' then it will get wifi events as expected. Are there any known issues in this area? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: How to debug DMAR errors?
On 04/14/2017 09:24 AM, Alexander Duyck wrote:
On Fri, Apr 14, 2017 at 9:19 AM, Ben Greear <gree...@candelatech.com> wrote:
On 04/14/2017 08:45 AM, Alexander Duyck wrote:
On Thu, Apr 13, 2017 at 11:12 AM, Ben Greear <gree...@candelatech.com> wrote:

Hello,

I have been seeing a regular occurrence of DMAR errors, looking something like this when testing my ath10k driver/firmware under some specific loads (maximum receive of 512-byte frames in AP mode):

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [05:00.0] fault addr fd99f000 [fault reason 06] PTE Read access is not set
ath10k_pci :05:00.0: firmware crashed! (uuid 594b1393-ae35-42b5-9dec-74ff0c6791ff)

So, I am wondering if there is any way I can get more information about what this fd99f000 address is? Once this problem hits, the entire OS locks hard (not even sysrq-boot will do anything), so I guess I would need the DMAR logic to print out more info on that address somehow.

Thanks,
Ben

There isn't much more info to give you. The problem is that the device at 5:00.0 attempted to read at fd99f000 even though it didn't have permissions. In response this should trigger a PCI Master Abort message to that function. It looks like the firmware for the device doesn't handle that, and so that is likely why things got hung. Really you would need to interrogate ath10k_pci to see if there is/was a mapping somewhere for that address and what it was supposed to be used for.

I'm working on a hook in the DMAR logic to call into ath10k_pci when the error is seen, so that ath10k can dump debug info, including recent DMA addresses. My code is an awful hack so far, but if someone could add a clean way to register DMAR error callbacks, I think that would be very welcome. It could tie into the automated DMA map/unmap debugging logic, and at the least, someone could write custom debugging callbacks for the driver(s) in question.
Thanks, Ben You might look at coding up something to add pci_error_handlers for the pci_driver in the ath10k_pci driver. The PCI Master Abort should trigger an error that you could then capture in the driver and handle at least dumping it via your own implementation of the error handlers. If nothing else I suspect there are probably some sort of descriptor rings you could probably dump. I'm suspecting this is some sort of Tx issue since the problem was a read fault, but I suppose there are other paths in the driver that might trigger DMA read requests. This is a thick firmware driver, so the firmware could also be screwing up and accessing something it should not. There are some existing work-arounds in it to deal with sketchy behaviour already, maybe more are needed. Anyway, once I added the debugging code, I didn't see it crash again, so might be a while before I know more. Thanks, Ben - Alex -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: How to debug DMAR errors?
On 04/14/2017 08:45 AM, Alexander Duyck wrote:
On Thu, Apr 13, 2017 at 11:12 AM, Ben Greear <gree...@candelatech.com> wrote:

Hello,

I have been seeing a regular occurrence of DMAR errors, looking something like this when testing my ath10k driver/firmware under some specific loads (maximum receive of 512-byte frames in AP mode):

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [05:00.0] fault addr fd99f000 [fault reason 06] PTE Read access is not set
ath10k_pci :05:00.0: firmware crashed! (uuid 594b1393-ae35-42b5-9dec-74ff0c6791ff)

So, I am wondering if there is any way I can get more information about what this fd99f000 address is? Once this problem hits, the entire OS locks hard (not even sysrq-boot will do anything), so I guess I would need the DMAR logic to print out more info on that address somehow.

Thanks,
Ben

There isn't much more info to give you. The problem is that the device at 5:00.0 attempted to read at fd99f000 even though it didn't have permissions. In response this should trigger a PCI Master Abort message to that function. It looks like the firmware for the device doesn't handle that, and so that is likely why things got hung. Really you would need to interrogate ath10k_pci to see if there is/was a mapping somewhere for that address and what it was supposed to be used for.

I'm working on a hook in the DMAR logic to call into ath10k_pci when the error is seen, so that ath10k can dump debug info, including recent DMA addresses. My code is an awful hack so far, but if someone could add a clean way to register DMAR error callbacks, I think that would be very welcome. It could tie into the automated DMA map/unmap debugging logic, and at the least, someone could write custom debugging callbacks for the driver(s) in question.

Thanks,
Ben

- Alex

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
How to debug DMAR errors?
Hello, I have been seeing a regular occurrence of DMAR errors, looking something like this when testing my ath10k driver/firmware under some specific loads (maximum receive of 512 byte frames in AP mode): DMAR: DRHD: handling fault status reg 3 DMAR: [DMA Read] Request device [05:00.0] fault addr fd99f000 [fault reason 06] PTE Read access is not set ath10k_pci :05:00.0: firmware crashed! (uuid 594b1393-ae35-42b5-9dec-74ff0c6791ff) So, I am wondering if there is any way I can get more information about what this fd99f000 address is? Once this problem hits, the entire OS locks hard (not even sysrq-boot will do anything), so I guess I would need the DMAR logic to print out more info on that address somehow. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Horrid balance-rr bonding udp throughput
On 04/10/2017 11:50 AM, Jarod Wilson wrote: On 2017-04-08 7:33 PM, Jarod Wilson wrote: I'm digging into some bug reports covering performance issues with balance-rr, and discovered something even worse than the reporter. My test setup has a pair of NICs, one e1000e, one e1000 (but dual e1000e seems the same). When I do a test run in LNST with bonding mode balance-rr and either miimon or arpmon, the throughput of the UDP_STREAM netperf test is absolutely horrible: TCP: 941.19 +-0.88 mbits/sec UDP: 45.42 +-4.59 mbits/sec I figured I'd try LNST's packet capture mode, so exact same test, add the -p flag and I get: TCP: 941.21 +-0.82 mbits/sec UDP: 961.54 +-0.01 mbits/sec Uh. What? So yeah. I can't capture the traffic in the bad case, but I guess that gives some potential insight into what's not happening correctly in either the bonding driver or the NIC drivers... More digging forthcoming, but first I have a flooded basement to deal with, so if in the interim, anyone has some insight, I'd be happy to hear it. :) Okay, ignore the bit about bonding, I should have eliminated the bond from the picture entirely. I think the traffic simply ended up on the e1000 on the non-capture test and on the e1000e for the capture test, as those numbers match perfectly with straight NIC to NIC testing, no bond involved. That said, really odd that the e1000 is so severely crippled for UDP, while TCP is still respectable. Not sure if I have a flaky NIC or what... For reference, e1000 to e1000e netperf: TCP_STREAM: Measured rate was 849.95 +-1.32 mbits/sec UDP_STREAM: Measured rate was 44.73 +-5.73 mbits/sec Maybe check that you have re-ordering issues? I ran into that with igb recently and it took a while to realize my problem! Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [RFC 2/3] genetlink: pass extended error report down
On 04/07/2017 12:12 PM, Johannes Berg wrote: On Fri, 2017-04-07 at 11:37 -0700, Ben Greear wrote: I guess the error string must be constant and always available in memory in this implementation? Yes. I think it would be nice to dynamically create strings (malloc, snprintf, etc) and have the err_str logic free it when done? We can think about that later, but I don't actually think it makes a lot of sense - if we point to the attribute and/or offset you really ought to have enough info to figure out what's up. We can think about it later, but lots of things in the wifi stack could use a descriptive message specific to the failure. Often these messages are much more useful if you explain why the failure conflicts with regulatory, channel, virtual-dev combination, etc info, so that needs to be dynamic. The code that is failing knows, so I'd like to pass it back to user-space. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [RFC 2/3] genetlink: pass extended error report down
On 04/07/2017 11:26 AM, Johannes Berg wrote: From: Johannes Berg <johannes.b...@intel.com> Signed-off-by: Johannes Berg <johannes.b...@intel.com> --- include/net/genetlink.h | 27 +++ net/netlink/genetlink.c | 6 -- 2 files changed, 31 insertions(+), 2 deletions(-) diff --git a/include/net/genetlink.h b/include/net/genetlink.h index a34275be3600..67ad2326cfa6 100644 --- a/include/net/genetlink.h +++ b/include/net/genetlink.h @@ -84,6 +84,7 @@ struct nlattr **genl_family_attrbuf(const struct genl_family *family); * @attrs: netlink attributes * @_net: network namespace * @user_ptr: user pointers + * @exterr: extended error report struct */ struct genl_info { u32 snd_seq; @@ -94,6 +95,7 @@ struct genl_info { struct nlattr **attrs; possible_net_t _net; void * user_ptr[2]; + struct netlink_ext_err *exterr; }; static inline struct net *genl_info_net(struct genl_info *info) @@ -106,6 +108,31 @@ static inline void genl_info_net_set(struct genl_info *info, struct net *net) write_pnet(>_net, net); } +static inline int genl_err_str(struct genl_info *info, int err, + const char *msg) +{ + info->exterr->msg = msg; + + return err; +} I guess the error string must be constant and always available in memory in this implementation? I think it would be nice to dynamically create strings (malloc, snprintf, etc) and have the err_str logic free it when done? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] igb: add module param to set max-rss-queues.
On 03/24/2017 04:14 PM, Stephen Hemminger wrote: On Fri, 24 Mar 2017 14:20:56 -0700 Ben Greear <gree...@candelatech.com> wrote: On 03/24/2017 02:12 PM, David Miller wrote: From: gree...@candelatech.com Date: Fri, 24 Mar 2017 13:58:47 -0700 From: Ben Greear <gree...@candelatech.com> In systems where you may have a very large number of network adapters, certain drivers may consume an unfair amount of IRQ resources. So, allow a module param that will limit the number of IRQs at driver load time. This way, other drivers (40G Ethernet, for instance), which probably will need the multiple IRQs more, will not be starved of IRQ resources. Signed-off-by: Ben Greear <gree...@candelatech.com> Sorry, no module params. Use generic run-time facilities such as ethtool to configure such things. You cannot call ethtool before module load time, and that is when the IRQs are first acquired. It may be way more useful to give each of 20 network adapters 2 irqs than have the first few grab 16 and the rest get lumped into legacy crap. Almost all network devices do not acquire interrupts until device is brought up. I.e request_irq is called from open not probe. This is done so that configuration can be done and also so that unused ports don't consume interrupt space. If I ever have to deal with this on stock kernels again I'll keep that in mind. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] igb: add module param to set max-rss-queues.
On 03/24/2017 02:12 PM, David Miller wrote: From: gree...@candelatech.com Date: Fri, 24 Mar 2017 13:58:47 -0700 From: Ben Greear <gree...@candelatech.com> In systems where you may have a very large number of network adapters, certain drivers may consume an unfair amount of IRQ resources. So, allow a module param that will limit the number of IRQs at driver load time. This way, other drivers (40G Ethernet, for instance), which probably will need the multiple IRQs more, will not be starved of IRQ resources. Signed-off-by: Ben Greear <gree...@candelatech.com> Sorry, no module params. Use generic run-time facilities such as ethtool to configure such things. You cannot call ethtool before module load time, and that is when the IRQs are first acquired. It may be way more useful to give each of 20 network adapters 2 irqs than have the first few grab 16 and the rest get lumped into legacy crap. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Performance issue with igb with lots of different src-ip addrs.
On 03/16/2017 08:51 PM, Ben Greear wrote: I think we can, might take us a day or two to get time to do it. Thanks, Ben On 03/16/2017 08:05 PM, Alexander Duyck wrote: I'm not really interested in installing a custom version of pktgen. Any chance you can recreate the issue with standard pktgen? You might try running perf to get a snapshot of what is using CPU time on the system. It will probably give you a pretty good idea where the code is that is eating up all your CPU time. So, I had time to dig into this today. Turns out that our tool was reporting drops because of sequence number gaps, which in turn were caused by out-of-order frames...not actually dropping frames. If I force the rss_queues to one, the problem goes away. Sorry for mis-reporting a bug. Thanks, Ben - Alex On Thu, Mar 16, 2017 at 7:46 PM, Ben Greear <gree...@candelatech.com> wrote: I'm actually using a hacked up version of pktgen nicely driven by our GUI tool, but the crux is that you need to set min and max src IP to some large range. We are driving pktgen from a separate machine. Stock pktgen isn't good at reporting received pkts last I checked, so it may be more difficult to easily view the problem. I'll be happy to set up my tool on your Fedora 24 or similar VM or machine if you want. Thanks, Ben On 03/16/2017 07:35 PM, Alexander Duyck wrote: Can you include the pktgen script you are running? Also when you say you are driving traffic through the bridge are you sending from something external on the system or are you actually directing the traffic from pktgen into the bridge directly? - Alex On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear <gree...@candelatech.com> wrote: Hello, We notice that when using two igb ports as a bridge, if we use pktgen to drive traffic through the bridge and randomize (or use a very large range) for the source IP addr in pktgen, then performance of igb is very poor (like 150Mbps throughput instead of 1Gbps). 
It runs right at line speed if we use same src/dest IP addr in pktgen. So, seems it is related to lots of src/dest IP addresses. We see same problem when using pktgen to send to itself, and we see this in several different kernels. We specifically tested bridge mode in this stock Fedora kernel: Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux e1000e does not show this problem in our testing. Any ideas what the issue might be and how to fix it? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Performance issue with igb with lots of different src-ip addrs.
I think we can, might take us a day or two to get time to do it. Thanks, Ben On 03/16/2017 08:05 PM, Alexander Duyck wrote: I'm not really interested in installing a custom version of pktgen. Any chance you can recreate the issue with standard pktgen? You might try running perf to get a snapshot of what is using CPU time on the system. It will probably give you a pretty good idea where the code is that is eating up all your CPU time. - Alex On Thu, Mar 16, 2017 at 7:46 PM, Ben Greear <gree...@candelatech.com> wrote: I'm actually using a hacked up version of pktgen nicely driven by our GUI tool, but the crux is that you need to set min and max src IP to some large range. We are driving pktgen from a separate machine. Stock pktgen isn't good at reporting received pkts last I checked, so it may be more difficult to easily view the problem. I'll be happy to set up my tool on your Fedora 24 or similar VM or machine if you want. Thanks, Ben On 03/16/2017 07:35 PM, Alexander Duyck wrote: Can you include the pktgen script you are running? Also when you say you are driving traffic through the bridge are you sending from something external on the system or are you actually directing the traffic from pktgen into the bridge directly? - Alex On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear <gree...@candelatech.com> wrote: Hello, We notice that when using two igb ports as a bridge, if we use pktgen to drive traffic through the bridge and randomize (or use a very large range) for the source IP addr in pktgen, then performance of igb is very poor (like 150Mbps throughput instead of 1Gbps). It runs right at line speed if we use same src/dest IP addr in pktgen. So, seems it is related to lots of src/dest IP addresses. We see same problem when using pktgen to send to itself, and we see this in several different kernels. 
We specifically tested bridge mode in this stock Fedora kernel: Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux e1000e does not show this problem in our testing. Any ideas what the issue might be and how to fix it? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Performance issue with igb with lots of different src-ip addrs.
I'm actually using a hacked up version of pktgen nicely driven by our GUI tool, but the crux is that you need to set min and max src IP to some large range. We are driving pktgen from a separate machine. Stock pktgen isn't good at reporting received pkts last I checked, so it may be more difficult to easily view the problem. I'll be happy to set up my tool on your Fedora 24 or similar VM or machine if you want. Thanks, Ben On 03/16/2017 07:35 PM, Alexander Duyck wrote: Can you include the pktgen script you are running? Also when you say you are driving traffic through the bridge are you sending from something external on the system or are you actually directing the traffic from pktgen into the bridge directly? - Alex On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear <gree...@candelatech.com> wrote: Hello, We notice that when using two igb ports as a bridge, if we use pktgen to drive traffic through the bridge and randomize (or use a very large range) for the source IP addr in pktgen, then performance of igb is very poor (like 150Mbps throughput instead of 1Gbps). It runs right at line speed if we use same src/dest IP addr in pktgen. So, seems it is related to lots of src/dest IP addresses. We see same problem when using pktgen to send to itself, and we see this in several different kernels. We specifically tested bridge mode in this stock Fedora kernel: Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux e1000e does not show this problem in our testing. Any ideas what the issue might be and how to fix it? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: netdev level filtering? perhaps pushing socket filters down?
On 03/16/2017 03:33 PM, Johannes Berg wrote: Hi all, Occasionally - we just had another case - people want to hook into packets received and processed by the mac80211 stack, but because they don't need all of them (e.g. not data packets), even adding a monitor interface and bringing it up has too high a cost because SKBs need to be prepared to send them to the monitor interface, even if no socket is consuming them. Ideally, we'd be able to detect that there are filter programs attached to the socket(s) that are looking at the frames coming in on the monitor interface, and we could somehow magically run those before we create a new SKB. One problem here is that we wouldn't really want to prepare all the radiotap header just to throw it away, so we'd have to be able to analyse the filter program to make sure it doesn't access anything but the radiotap header length, and that only in order to jump over it. That seems ... difficult, but we don't even know the header length - although we could fudge that and make a very long constant-size header, which might make it possible to do such analysis, or handle it by trapping on such access. But it seems rather difficult to implement this. The next best thing would be to install a filter program on the virtual monitor *interface* (netdev), but say that it doesn't get frames with radiotap, but pure 802.11 frames. We already have those in SKB format at this point, so it'd be simple to run such a program and only pass the SKB to the monitor netdev's RX when the program asked to do that. This now seems a bit like XDP, but for XDP this header difference doesn't seem appropriate either. Anyone have any other thoughts? Attach at just above the driver, before it ever gets to stations/vdevs, and ignore radiotap headers and/or add special processing for metadata like rx-info? Thanks, Ben Thanks, johannes -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Performance issue with igb with lots of different src-ip addrs.
Hello, We notice that when using two igb ports as a bridge, if we use pktgen to drive traffic through the bridge and randomize (or use a very large range) for the source IP addr in pktgen, then performance of igb is very poor (like 150Mbps throughput instead of 1Gbps). It runs right at line speed if we use same src/dest IP addr in pktgen. So, seems it is related to lots of src/dest IP addresses. We see same problem when using pktgen to send to itself, and we see this in several different kernels. We specifically tested bridge mode in this stock Fedora kernel: Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux e1000e does not show this problem in our testing. Any ideas what the issue might be and how to fix it? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 1/3] ath10k: remove ath10k_vif_to_arvif()
On 02/09/2017 11:03 PM, Valo, Kalle wrote: Ben Greear <gree...@candelatech.com> writes: On 02/07/2017 01:14 AM, Valo, Kalle wrote: Adrian Chadd <adr...@freebsd.org> writes: Removing this method makes the diff to FreeBSD larger, as "vif" in FreeBSD is a different pointer. (Yes, I have ath10k on freebsd working and I'd like to find a way to reduce the diff moving forward.) I don't like this "(void *) vif->drv_priv" style that much either but apparently it's commonly used in Linux wireless code and already parts of ath10k. So this patch just unifies the coding style. Surely the code compiles to the same thing, so why add a patch that makes it more difficult for Adrian and makes the code no easier to read for the rest of us? Because that's the coding style used already in Linux. It's great to see that parts of ath10k can be used also in other systems but in principle I'm not very fond of the idea starting to reject valid upstream patches because of driver forks. There are lots of people trying to maintain out-of-tree or backported patches to ath10k, and every time there is a meaningless style change, that just makes us waste more time on useless work instead of having time to work on more important matters. Thanks, Ben I think backports project is doing it right, it's not limiting upstream development in any way and handles all the API changes internally. Maybe FreeBSD could do something similar? -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 1/3] ath10k: remove ath10k_vif_to_arvif()
On 02/07/2017 01:14 AM, Valo, Kalle wrote: Adrian Chadd <adr...@freebsd.org> writes: Removing this method makes the diff to FreeBSD larger, as "vif" in FreeBSD is a different pointer. (Yes, I have ath10k on freebsd working and I'd like to find a way to reduce the diff moving forward.) I don't like this "(void *) vif->drv_priv" style that much either but apparently it's commonly used in Linux wireless code and already parts of ath10k. So this patch just unifies the coding style. Surely the code compiles to the same thing, so why add a patch that makes it more difficult for Adrian and makes the code no easier to read for the rest of us? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Commit 1fe8e0... (include more headers in if_tunnel.h) breaks my user-space build.
On 01/13/2017 02:08 PM, Stephen Hemminger wrote: On Fri, 13 Jan 2017 11:50:32 -0800 Ben Greear <gree...@candelatech.com> wrote: On 01/13/2017 11:41 AM, Stephen Hemminger wrote: On Fri, 13 Jan 2017 11:12:32 -0800 Ben Greear <gree...@candelatech.com> wrote: I am including netinet/ip.h, and also linux/if_tunnel.h, and the linux/ip.h conflicts with netinet/ip.h. Maybe my build environment is screwed up, but maybe also it would be better to just let the user include appropriate headers before including if_tunnel.h and revert this patch? include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h Fixes userspace compilation errors like: error: field ‘iph’ has incomplete type error: field ‘prefix’ has incomplete type Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi> Signed-off-by: David S. Miller <da...@davemloft.net> Thanks, Ben What I ended up doing for iproute2 was including all headers used by the source based on sanitized kernel headers. Basically $ git grep '^#include .*$//' | \ sort -u >linux.headers $ for f in $(cat linux.headers) do cp ~/kernel/net-next/usr/include/$f include/$f done You can't take only some of the headers, once you decide to diverge from glibc provided headers, you got to take them all. I do grab a copy of the linux kernel headers and compile against that, but netinet/ip.h is coming from the OS. Do you mean I should not include netinet/ip.h and instead use linux/ip.h? I don't think you can mix netinet/ip.h and linux/ip.h, yes that is a mess. Well, I still like the idea of reverting this patch...that way user-space does not have to use linux/ip.h, and that lets them use netinet/ip.h and if_tunnel.h. Anyway, I'll let Dave and/or the original committer decide. I've reverted it in my local tree so I am able to build again... Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Commit 1fe8e0... (include more headers in if_tunnel.h) breaks my user-space build.
On 01/13/2017 11:41 AM, Stephen Hemminger wrote: On Fri, 13 Jan 2017 11:12:32 -0800 Ben Greear <gree...@candelatech.com> wrote: I am including netinet/ip.h, and also linux/if_tunnel.h, and the linux/ip.h conflicts with netinet/ip.h. Maybe my build environment is screwed up, but maybe also it would be better to just let the user include appropriate headers before including if_tunnel.h and revert this patch? include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h Fixes userspace compilation errors like: error: field ‘iph’ has incomplete type error: field ‘prefix’ has incomplete type Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi> Signed-off-by: David S. Miller <da...@davemloft.net> Thanks, Ben What I ended up doing for iproute2 was including all headers used by the source based on sanitized kernel headers. Basically $ git grep '^#include .*$//' | \ sort -u >linux.headers $ for f in $(cat linux.headers) do cp ~/kernel/net-next/usr/include/$f include/$f done You can't take only some of the headers, once you decide to diverge from glibc provided headers, you got to take them all. I do grab a copy of the linux kernel headers and compile against that, but netinet/ip.h is coming from the OS. Do you mean I should not include netinet/ip.h and instead use linux/ip.h? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Commit 1fe8e0... (include more headers in if_tunnel.h) breaks my user-space build.
On 01/13/2017 11:12 AM, Ben Greear wrote: I am including netinet/ip.h, and also linux/if_tunnel.h, and the linux/ip.h conflicts with netinet/ip.h. Maybe my build environment is screwed up, but maybe also it would be better to just let the user include appropriate headers before including if_tunnel.h and revert this patch? include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h Fixes userspace compilation errors like: error: field ‘iph’ has incomplete type error: field ‘prefix’ has incomplete type Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi> Signed-off-by: David S. Miller <da...@davemloft.net> Thanks, Ben I forgot the full commit ID, my abbreviation was not sufficient to be unique it seems: 1fe8e0f074c77aa41aaa579345a9e675acbebfa9 Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Commit 1fe8e0... (include more headers in if_tunnel.h) breaks my user-space build.
I am including netinet/ip.h, and also linux/if_tunnel.h, and the linux/ip.h conflicts with netinet/ip.h. Maybe my build environment is screwed up, but maybe also it would be better to just let the user include appropriate headers before including if_tunnel.h and revert this patch? include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h Fixes userspace compilation errors like: error: field ‘iph’ has incomplete type error: field ‘prefix’ has incomplete type Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi> Signed-off-by: David S. Miller <da...@davemloft.net> Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
ixgbe Port cannot load, "failed to register GSI"
We put 3 10-g dual-port ixgbe NICs and 4 4-port I350 NICs in a 2U rackmount, and one of the ixgbe ports fails to come up. This previously worked before reboot, so maybe it is a race somehow. Kernel is 4.4.11+, but no hacks to ixgbe or I350 drivers. Anyone know if there is some sort of way to make this work reliably?

dmesg | grep ixgbe
[5.803307] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 4.2.1-k
[5.803309] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
[5.952119] ixgbe 0000:04:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
[5.952245] ixgbe 0000:04:00.0: PCI Express bandwidth of 32GT/s available
[5.952246] ixgbe 0000:04:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[5.952328] ixgbe 0000:04:00.0: MAC: 2, PHY: 15, SFP+: 5, PBA No: FF-0FF
[5.952330] ixgbe 0000:04:00.0: 00:e0:ed:77:09:16
[5.954004] ixgbe 0000:04:00.0: Intel(R) 10 Gigabit Network Connection
[6.102346] ixgbe 0000:04:00.1: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
[6.102475] ixgbe 0000:04:00.1: PCI Express bandwidth of 32GT/s available
[6.102478] ixgbe 0000:04:00.1: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[6.102562] ixgbe 0000:04:00.1: MAC: 2, PHY: 15, SFP+: 6, PBA No: FF-0FF
[6.102564] ixgbe 0000:04:00.1: 00:e0:ed:77:09:17
[6.104869] ixgbe 0000:04:00.1: Intel(R) 10 Gigabit Network Connection
[6.253429] ixgbe 0000:05:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
[6.253558] ixgbe 0000:05:00.0: PCI Express bandwidth of 32GT/s available
[6.253560] ixgbe 0000:05:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[6.253644] ixgbe 0000:05:00.0: MAC: 2, PHY: 15, SFP+: 5, PBA No: FF-0FF
[6.253646] ixgbe 0000:05:00.0: 00:e0:ed:79:06:50
[6.255855] ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection
[6.404128] ixgbe 0000:05:00.1: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
[6.404254] ixgbe 0000:05:00.1: PCI Express bandwidth of 32GT/s available
[6.404255] ixgbe 0000:05:00.1: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[6.404337] ixgbe 0000:05:00.1: MAC: 2, PHY: 15, SFP+: 6, PBA No: FF-0FF
[6.404339] ixgbe 0000:05:00.1: 00:e0:ed:79:06:51
[6.405914] ixgbe 0000:05:00.1: Intel(R) 10 Gigabit Network Connection
[6.554373] ixgbe 0000:06:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
[6.554501] ixgbe 0000:06:00.0: PCI Express bandwidth of 32GT/s available
[6.554504] ixgbe 0000:06:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[6.554588] ixgbe 0000:06:00.0: MAC: 2, PHY: 15, SFP+: 5, PBA No: FF-0FF
[6.554590] ixgbe 0000:06:00.0: 00:e0:ed:79:06:56
[6.556994] ixgbe 0000:06:00.0: Intel(R) 10 Gigabit Network Connection
[6.557160] ixgbe 0000:06:00.1: PCI INT B: failed to register GSI
[6.557169] ixgbe: probe of 0000:06:00.1 failed with error -28
Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption
On 10/19/2016 08:08 AM, Ard Biesheuvel wrote: On 19 October 2016 at 08:43, Johannes Berg <johan...@sipsolutions.net> wrote: On Wed, 2016-10-19 at 11:31 +0800, Herbert Xu wrote: We could probably make mac80211 do that too, but can we guarantee in- order processing? Anyway, it's pretty low priority, maybe never happening, since hardly anyone really uses "software" crypto, the wifi devices mostly have it built in anyway. Indeed. The code is now correct in terms of API requirements, so let's just wait for someone to complain about any performance regressions. Do you actually expect performance regressions? I'll be complaining if so, but will test first :) Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
ethtool.h compile warning on C++
I am getting warnings about sign mismatch. Maybe make SPEED_UNKNOWN be ((__u32)(0xffffffff)) ? from ethtool.h: #define SPEED_UNKNOWN -1 static inline int ethtool_validate_speed(__u32 speed) { return speed <= INT_MAX || speed == SPEED_UNKNOWN; } Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] ath10k: fix system hang at qca99x0 probe on x86 platform (DMA32 issue)
On 07/20/2016 10:02 AM, Adrian Chadd wrote: Hi, The "right" way for the target CPU to interact with host CPU memory (and vice versa, for mostly what it's worth) is to have the copy engine copy (ie, "DMA") the pieces between them. This may be for diagnostic purposes, but it's not supposed to be used like this for doing wifi data exchange, right? :-P Now, there /may/ be some alignment hilarity in various bits of code and/or hardware. Eg, Merlin (AR9280) requires its descriptors to be within a 4k block - the code to iterate through the descriptor physical address space didn't do a "val = val + offset", it did something in verilog like "val = (val & 0xc000) | (offset & 0x3fff)". This meant if you allocated a descriptor that started just before the end of a 4k physmem aligned block, you'd end up with exciting results. I don't know if there are any situations like this in the ath10k hardware, but I'm sure there will be some gotchas somewhere. In any case, if ath10k is consuming too much bounce buffers, the calls to allocate memory aren't working right and should be restricted to 32 bit addresses. Whether that's by using the DMA memory API (before it's mapped) or passing in GFP_DMA32 is a fun debate. (My test hardware arrived, so I'll test this all out today on Peregrine-v2 and see if the driver works.) I have been running this patch for a while: ath10k: Use GPF_DMA32 for firmware swap memory. This fixes OS crash when using QCA 9984 NIC on x86-64 system without vt-d enabled. Also tested on ea8500 with 9980, and x86-64 with 9980 and 9880. All tests were with CT firmware. 
Signed-off-by: Ben Greear <gree...@candelatech.com> drivers/net/wireless/ath/ath10k/wmi.c index e20aa39..727b3aa 100644 @@ -4491,7 +4491,7 @@ static int ath10k_wmi_alloc_chunk(struct ath10k *ar, u32 req_id, if (!pool_size) return -EINVAL; - vaddr = kzalloc(pool_size, GFP_KERNEL | __GFP_NOWARN); + vaddr = kzalloc(pool_size, GFP_KERNEL | __GFP_NOWARN | GFP_DMA32); if (!vaddr) num_units /= 2; } It mostly seems to work, but then sometimes I get a splat like this below. It appears it is invalid to actually do kzalloc with GFP_DMA32 (based on that BUG_ON that hit in the new_slab method)?? Any idea for a more proper way to do this? gfp: 4 [ cut here ] kernel BUG at /home/greearb/git/linux-4.7.dev.y/mm/slub.c:1508! invalid opcode: [#1] PREEMPT SMP Modules linked in: coretemp hwmon ath9k intel_rapl ath10k_pci x86_pkg_temp_thermal ath9k_common ath10k_core intel_powerclamp ath9k_hw ath kvm iTCO_wdt mac80211 iTCO_vendor_support irqbypass snd_hda_codec_hdmi 6 CPU: 2 PID: 268 Comm: kworker/u8:5 Not tainted 4.7.2+ #16 Hardware name: To be filled by O.E.M. To be filled by O.E.M./ChiefRiver, BIOS 4.6.5 06/07/2013 Workqueue: ath10k_aux_wq ath10k_wmi_event_service_ready_work [ath10k_core] task: 880036433a00 ti: 88003644 task.ti: 88003644 RIP: 0010:[] [] new_slab+0x39a/0x410 RSP: 0018:880036443b58 EFLAGS: 00010092 RAX: 0006 RBX: 024082c4 RCX: RDX: 0006 RSI: 88021e30dd08 RDI: 88021e30dd08 RBP: 880036443b90 R08: R09: R10: R11: 0372 R12: 88021dc01200 R13: 88021dc00cc0 R14: 88021dc01200 R15: 0001 FS: () GS:88021e30() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f3e65c1c730 CR3: 01e06000 CR4: 001406e0 Stack: 8127a4fc 0a01ff10 024082c4 88021dc01200 88021dc00cc0 88021dc01200 0001 880036443c58 81247ac6 88021e31b360 880036433a00 880036433a00 Call Trace: [] ? __d_lookup+0x9c/0x160 [] ___slab_alloc+0x396/0x4a0 [] ? ath10k_wmi_event_service_ready_work+0x5ad/0x800 [ath10k_core] [] ? alloc_kmem_pages+0x9/0x10 [] ? kmalloc_order+0x13/0x40 [] ? 
ath10k_wmi_event_service_ready_work+0x5ad/0x800 [ath10k_core] [] __slab_alloc.isra.72+0x26/0x40 [] __kmalloc+0x147/0x1b0 [] ath10k_wmi_event_service_ready_work+0x5ad/0x800 [ath10k_core] [] ? dequeue_entity+0x261/0xac0 [] process_one_work+0x148/0x420 [] worker_thread+0x49/0x480 [] ? rescuer_thread+0x330/0x330 [] kthread+0xc4/0xe0 [] ret_from_fork+0x1f/0x40 [] ? kthread_create_on_node+0x170/0x170 Code: e9 65 fd ff ff 49 8b 57 20 48 8d 42 ff 83 e2 01 49 0f 44 c7 f0 80 08 40 e9 6f fd ff ff 89 c6 48 c7 c7 01 36 c7 81 e8 e8 40 fa ff <0f> 0b ba 00 10 00 00 be 5a 00 00 00 48 89 c7 48 d3 e2 e8 bf 18 RIP [] new_slab+0x39a/0x410 RSP ---[ end trace ea3b0043b2911d93 ]--- static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node) { if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
Fwd: Re: nfs broken on Fedora-24, 32-bit?
This is probably not an NFS specific issue, though I guess possibly it is. Forwarding to netdev in case someone wants to take a look at it. Thanks, Ben Forwarded Message Subject: Re: nfs broken on Fedora-24, 32-bit? Date: Fri, 16 Sep 2016 16:31:51 -0700 From: Ben Greear <gree...@candelatech.com> Organization: Candela Technologies To: Trond Myklebust <tron...@primarydata.com> CC: List Linux NFS Mailing <linux-...@vger.kernel.org> On 09/15/2016 04:06 PM, Ben Greear wrote: On 09/15/2016 04:00 PM, Trond Myklebust wrote: Hi Ben, On Sep 15, 2016, at 18:32, Ben Greear <gree...@candelatech.com> wrote: I have a Fedora-24 machine mounting an NFS server running Fedora-13 (kernel 2.6.34.9-69.fc13.x86_64). F24 machine has this in /etc/fstab: 192.168.100.3:/mnt/d2 /mnt/d2 nfs nfsvers=3 0 0 When I copy a file from f24-32 to the F-13 machine, the file size is the same, but the file is corrupted on the file server. I see a different md5sum each time. Various other systems (F21, F19, etc) can all copy to the F13 machine fine. And, F24-64 machine can copy to the F13 machine fine. Anyone seen something similar? Do you know if the corruption is happening on the read()s or on the write()s? Do you, for instance get the same corruption if you copy from a local file on the F-24 client to the server? ..or if you copy from a file on the server to a local directory on the F-24 client? 
Cheers Trond Seems to be a write issue: # This is the nfs server: [greearb@fs3 candela_cdrom.5.3.5]$ md5sum gua-f21-32 ad4073fa8b806bb82b85a645e21f5e67 gua-f21-32 [greearb@fs3 candela_cdrom.5.3.5]$ md5sum ../greearb/tmp/gua-f21-32 582bfea0cc8cc52aa38dc0f5048d0156 ../greearb/tmp/gua-f21-32 [greearb@fs3 candela_cdrom.5.3.5]$ # This is the v-f24-32 client: greearb@v-f24-32 ~]$ cp /mnt/d2/pub/candela_cdrom.5.3.5/gua-f21-32 ./ [greearb@v-f24-32 ~]$ md5sum gua-f21-32 ad4073fa8b806bb82b85a645e21f5e67 gua-f21-32 [greearb@v-f24-32 ~]$ cp gua-f21-32 /mnt/d2/pub/greearb/tmp/ [greearb@v-f24-32 ~]$ md5sum /mnt/d2/pub/greearb/tmp/gua-f21-32 ad4073fa8b806bb82b85a645e21f5e67 /mnt/d2/pub/greearb/tmp/gua-f21-32 Interesting that the client reads back the file it copied over as if it were correct, but it shows up wrong on the nfs server. Maybe it is just reading a local cache? Thanks, Ben Here is some more info on this: We can only reproduce this on virtual machines using the KVM infrastructure, and only when we use the rtl8139 virtual hardware (in bridge mode). With the e1000 virtual hardware we cannot reproduce the problem. Also, multiple different nfs servers (including much newer kernels) all show the same behaviour with this broken nfs client. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Buggy rhashtable walking
On 08/05/2016 03:50 AM, Johannes Berg wrote: On Fri, 2016-08-05 at 18:48 +0800, Herbert Xu wrote: On Fri, Aug 05, 2016 at 08:16:53AM +0200, Johannes Berg wrote: Hm. Would you rather allocate a separate head entry for the hashtable, or chain the entries? My plan is to build support for this directly into rhashtable. So I'm adding a struct rhlist_head that would be used in place of rhash_head for these cases and it'll carry an extra pointer for the list of identical entries. I will then add an additional layer of insert/lookup interfaces for rhlist_head. Herbert, thank you for fixing this! It would not be fun to have to revert to the old way of hashing stations in mac80211... I'll be happy to test the patches when you have them ready. Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
pktgen issue with "Observe needed_headroom of the device"
Regarding this commit: 879c7220e828af8bd82ea9d774c7e45c46b976e4 net: pktgen: Observe needed_headroom of the device Allocate enough space so as not to force the outgoing net device to do skb_realloc_headroom(). Signed-off-by: Bogdan Hamciuc <bogdan.hamc...@freescale.com> Signed-off-by: David S. Miller <da...@davemloft.net> I think it may be incorrect. It seems that pkt_overhead is meant to be the data-portion of the skb, not lower-level padding? For instance: int pkt_overhead; /* overhead for MPLS, VLANs, IPSEC etc */ ... /* Eth + IPh + UDPh + mpls */ datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 - pkt_dev->pkt_overhead; So, maybe we need to add that LL_RESERVED_SPACE to the size when allocating the skb in pktgen_alloc_skb and leave it out of pkt_overhead? And for that matter, what is that '+ 64 +' for in the size calculation? Looks a lot like some fudge factor from long ago? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Make TCP work better with re-ordered frames?
On 05/18/2016 08:25 AM, Eric Dumazet wrote: On Wed, 2016-05-18 at 08:07 -0700, Ben Greear wrote: On 05/18/2016 07:29 AM, Eric Dumazet wrote: On Wed, 2016-05-18 at 07:00 -0700, Ben Greear wrote: We are investigating a system that has fairly poor TCP throughput with the 3.17 and 4.0 kernels, but evidently it worked pretty well with 3.14 (I should be able to verify 3.14 later today). One thing I notice is that a UDP download test shows lots of reordered frames, so I am thinking maybe TCP is running slow because of this. (We see about 800Mbps UDP download, but only 500Mbps TCP, even when using 100 concurrent TCP streams.) Is there some way to tune the TCP stack to better handle reordered frames? Nothing yet. Are you the sender or the receiver? You really want to avoid reorders as much as possible. Are you telling us something broke in networking layers between 3.14 and 3.17 leading to reorders? I am both sender and receiver, through an access-controller and wifi AP as DUT. The sender is an Intel 1G NIC, so I suspect it is not causing reordering, which indicates the DUT is most likely to blame. Using several off-the-shelf APs in our lab we do not see this problem. I am not certain yet what the difference is, but the customer reports 600+Mbps with their older code, and the best I can get is around 500Mbps with newer stuff. Lots of stuff changed though (ath10k firmware, user-space at least slightly, kernel, etc), so possibly the regression is elsewhere. You possibly could send me some pcap (limited to the headers, using -s 128 for example) and limited to a few flows, not the whole of them ;) TCP reorders are tricky for the receiver: it sends a lot of SACKs (one for every incoming packet, instead of the normal rule of sending one ACK for two incoming packets). Increasing the number of ACKs might impact half-duplex networks, but also considerably increases cpu processing time. I will work on captures...do you care if it is from the transmitter or receiver's perspective?
Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: Make TCP work better with re-ordered frames?
On 05/18/2016 07:29 AM, Eric Dumazet wrote: On Wed, 2016-05-18 at 07:00 -0700, Ben Greear wrote: We are investigating a system that has fairly poor TCP throughput with the 3.17 and 4.0 kernels, but evidently it worked pretty well with 3.14 (I should be able to verify 3.14 later today). One thing I notice is that a UDP download test shows lots of reordered frames, so I am thinking maybe TCP is running slow because of this. (We see about 800Mbps UDP download, but only 500Mbps TCP, even when using 100 concurrent TCP streams.) Is there some way to tune the TCP stack to better handle reordered frames? Nothing yet. Are you the sender or the receiver? You really want to avoid reorders as much as possible. Are you telling us something broke in networking layers between 3.14 and 3.17 leading to reorders? I am both sender and receiver, through an access-controller and wifi AP as DUT. The sender is an Intel 1G NIC, so I suspect it is not causing reordering, which indicates the DUT is most likely to blame. Using several off-the-shelf APs in our lab we do not see this problem. I am not certain yet what the difference is, but the customer reports 600+Mbps with their older code, and the best I can get is around 500Mbps with newer stuff. Lots of stuff changed though (ath10k firmware, user-space at least slightly, kernel, etc), so possibly the regression is elsewhere. Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Make TCP work better with re-ordered frames?
We are investigating a system that has fairly poor TCP throughput with the 3.17 and 4.0 kernels, but evidently it worked pretty well with 3.14 (I should be able to verify 3.14 later today). One thing I notice is that a UDP download test shows lots of reordered frames, so I am thinking maybe TCP is running slow because of this. (We see about 800Mbps UDP download, but only 500Mbps TCP, even when using 100 concurrent TCP streams.) Is there some way to tune the TCP stack to better handle reordered frames? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 05/13/2016 11:21 AM, David Miller wrote: From: Ben Greear <gree...@candelatech.com> Date: Fri, 13 May 2016 09:57:19 -0700 How do you feel about a new socket-option to allow a socket to request the old veth behaviour? I depend upon the opinions of the experts who work upstream on and maintain these components, since it is an area I am not so familiar with. Generally speaking asking me directly for opinions on matters like this isn't the way to go, in fact I kind of find it irritating. It can't all be on me. Fair enough, thanks for your time. Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
Mr Miller: How do you feel about a new socket-option to allow a socket to request the old veth behaviour? Thanks, Ben On 04/30/2016 10:30 PM, Willy Tarreau wrote: On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote: On 04/30/2016 03:01 PM, Vijay Pandurangan wrote: Consider: - App A sends out corrupt packets 50% of the time and discards inbound data. (...) How can you make a generic app C know how to do this? The path could be, for instance: eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> vethD <-> appC There are no sockets on vethB, but it does need to have special behaviour to elide csums. Even if appC is hacked to know how to twiddle something on its veth port, mucking with vethD will have no effect on vethB. With regard to your example above, why would A corrupt packets? My guess: 1) It has bugs (so, fix the bugs, it could equally create incorrect data with proper checksums, so just enabling checksumming adds no useful protection.) I agree with Ben here, what he needs is the ability for userspace to be trusted when *forwarding* a packet. Ideally you'd only want to receive the csum status per packet on the packet socket and pass the same value on the vethA interface, with this status being kept when the packet reaches vethB. If A purposely corrupts packets, it's A's problem. It's similar to designing a NIC which intentionally corrupts packets and reports "checksum good". The real issue is that in order to do things right, the userspace bridge (here, "A") would really need to pass this status. In Ben's case as he says, bad checksum packets are dropped before reaching A, so that simplifies the process quite a bit and that might be what causes some confusion, but ideally we'd rather have recvmsg() and sendmsg() with these flags.
I faced the exact same issue 3 years ago when playing with netmap, it was slow as hell because it would lose all checksum information when packets were passing through userland, resulting in GRO/GSO etc being disabled, and had to modify it to let userland preserve it. That's especially important when you have to deal with possibly corrupted packets not yet detected in the chain because the NIC did not validate their checksums. Willy -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Performance suggestions for bridging module?
Hello! I have a network emulator module that acts a lot like an ethernet bridge. It is implemented roughly like this: Hook into the rx logic and steal packets in the rx-all logic, similar to how sniffers work. Then, it puts the packet onto a queue for transmit. A kernel thread services this queue transmitting frames on a different NIC. I am using spin-locks to protect this queue. I am disabling LRO/GRO etc on the ixgbe NICs so that I don't have to deal with linearization when trying to do corruptions and such. Re-enabling LRO/GRO makes the transmit logic use less CPU, but the RX logic is the bottleneck anyway it seems. The code, which is GPL, is here, in case someone wants to take a look: http://www.candelatech.com/downloads/wanlink/ What I see is that this is very sensitive to which CPU core does what. If I run the transmitter thread on cpu-0, performance is awful. If I run it on 1, then it is good. Sometimes, though hard to reproduce, I can run right at 10Gbps bi-directional throughput. More often, it is stuck at around 7Gbps bi-directional throughput. I tried adding some prefetch logic, and that helped when emulating very long latency (like, 10 seconds worth), but not sure I am really doing that optimally either. My basic question is: Any suggestion for an optimal CPU core configuration (most likely including binding a NIC's irqs to a particular core)?? Any other suggestions for things to look for? Thanks, Ben -- Ben Greear <gree...@candelatech.com> Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:

On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear <gree...@candelatech.com> wrote:

Good point, so if you had:

eth0 <-> raw <-> user-space-bridge <-> raw <-> vethA <-> vethB <-> userspace-stub <-> eth1

and the user-space hub enabled this elide flag, things would work, right? Then it seems like what we need is a way to tell the kernel router/bridge logic to follow elide signals in packets coming from a veth. I'm not sure what the best way to do this is because I'm less familiar with conventions in that part of the kernel, but assuming there's a way to do this, would it be acceptable?

You cannot receive on one veth without transmitting on the other, so I think the elide-csum logic can go on the raw socket and apply to packets in the transmit-from-user-space direction. Just allowing the socket to make the veth behave like it used to before the patch in question should be good enough, since that worked for us for years. So, just an option to modify ip_summed for packets sent on a socket is probably sufficient.

I don't think this is right. Consider:

- App A sends out corrupt packets 50% of the time and discards inbound data.
- App B doesn't care about corrupt packets and is happy to receive them, and has some way of dealing with them (special case).
- App C is a regular app, say nc or something.

In your world, where A decides what happens to data it transmits, then A <--veth--> B and A <--wire--> B will have the same behaviour, but A <--veth--> C and A <--wire--> C will have _different_ behaviour: C will behave incorrectly if it's connected over veth but correctly if connected with a wire. That is a bug. Since A cannot know what the app it's talking to will desire, I argue that both sides of a message must be opted in to this optimization.

How can you make a generic app C know how to do this? The path could be, for instance:

eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> vethD <-> appC

There are no sockets on vethB, but it does need to have special behaviour to elide csums. Even if appC is hacked to know how to twiddle some thing on its veth port, mucking with vethD will have no effect on vethB.

With regard to your example above, why would A corrupt packets? My guess:

1) It has bugs. So, fix the bugs; it could equally create incorrect data with proper checksums, so just enabling checksumming adds no useful protection.

2) It means to corrupt frames. In that case, someone must expect that C should receive incorrect frames, otherwise why bother making App-A corrupt them in the first place?

3) You are explicitly trying to test the kernel checksum logic, so you want the kernel to detect the bad checksum and throw away the packet. In this case, just don't set the socket option in appA to elide checksums, and the packet will be thrown away.

Any other cases you can think of?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 02:36 PM, Vijay Pandurangan wrote:

On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear <gree...@candelatech.com> wrote:

On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:

On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <gree...@candelatech.com> wrote:

On 04/30/2016 12:54 PM, Tom Herbert wrote:

We've put considerable effort into cleaning up the checksum interface to make it as unambiguous as possible; please be very careful to follow it. Broken checksum processing is really hard to detect and debug. CHECKSUM_UNNECESSARY means that some number of _specific_ checksums (indicated by csum_level) have been verified to be correct in a packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is never right. If CHECKSUM_UNNECESSARY is set in such a manner but the checksum it would refer to has not been verified and is incorrect, this is a major bug.

Suppose I know that the packet received on a packet-socket has already been verified by a NIC that supports hardware checksumming. Then I want to transmit it on a veth interface using a second packet socket. I do not want veth to recalculate the checksum on transmit, nor to validate it on the peer veth on receive, because I do not want to waste the CPU cycles. I am assuming that my app is not accidentally corrupting frames, so the checksum can never be bad. How should the checksumming be configured for the packets going into the packet-socket from user-space?

It seems that only the receiver should decide whether or not to checksum packets on the veth, not the sender. How about: we could add a receiving socket option for "don't checksum packets received from a veth when the other side has marked them as elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending socket option for "mark all data sent via this socket to a veth as elide-checksum-suggested". So the process would be:

Writer:
1. open read socket
2. open write socket, with option elide-checksum-for-veth-suggested
3. write data

Reader:
1. open read socket with "follow-elide-checksum-suggestions-on-veth"
2. read data

The kernel / module would then need to persist the flag on all packets that traverse a veth, and drop these data when they leave the veth module.

I'm not sure this works completely. In my app, the packet flow might be:

eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB <-> [kernel router/bridge logic ...] <-> eth1

Good point, so if you had:

eth0 <-> raw <-> user-space-bridge <-> raw <-> vethA <-> vethB <-> userspace-stub <-> eth1

and the user-space hub enabled this elide flag, things would work, right? Then it seems like what we need is a way to tell the kernel router/bridge logic to follow elide signals in packets coming from a veth. I'm not sure what the best way to do this is because I'm less familiar with conventions in that part of the kernel, but assuming there's a way to do this, would it be acceptable?

You cannot receive on one veth without transmitting on the other, so I think the elide-csum logic can go on the raw socket and apply to packets in the transmit-from-user-space direction. Just allowing the socket to make the veth behave like it used to before the patch in question should be good enough, since that worked for us for years. So, just an option to modify ip_summed for packets sent on a socket is probably sufficient.

There may be no sockets on the vethB port. And reader/writer is not a good way to look at it, since I am implementing a bi-directional bridge in user-space and each packet-socket is for both rx and tx.

Sure, but we could model a bidirectional connection as two unidirectional sockets for our discussions here, right?

Best not to, I think: you want to make sure that one socket can correctly handle tx and rx. As long as that works, then using uni-directional sockets should work too.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:

On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <gree...@candelatech.com> wrote:

On 04/30/2016 12:54 PM, Tom Herbert wrote:

We've put considerable effort into cleaning up the checksum interface to make it as unambiguous as possible; please be very careful to follow it. Broken checksum processing is really hard to detect and debug. CHECKSUM_UNNECESSARY means that some number of _specific_ checksums (indicated by csum_level) have been verified to be correct in a packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is never right. If CHECKSUM_UNNECESSARY is set in such a manner but the checksum it would refer to has not been verified and is incorrect, this is a major bug.

Suppose I know that the packet received on a packet-socket has already been verified by a NIC that supports hardware checksumming. Then I want to transmit it on a veth interface using a second packet socket. I do not want veth to recalculate the checksum on transmit, nor to validate it on the peer veth on receive, because I do not want to waste the CPU cycles. I am assuming that my app is not accidentally corrupting frames, so the checksum can never be bad. How should the checksumming be configured for the packets going into the packet-socket from user-space?

It seems that only the receiver should decide whether or not to checksum packets on the veth, not the sender. How about: we could add a receiving socket option for "don't checksum packets received from a veth when the other side has marked them as elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending socket option for "mark all data sent via this socket to a veth as elide-checksum-suggested". So the process would be:

Writer:
1. open read socket
2. open write socket, with option elide-checksum-for-veth-suggested
3. write data

Reader:
1. open read socket with "follow-elide-checksum-suggestions-on-veth"
2. read data

The kernel / module would then need to persist the flag on all packets that traverse a veth, and drop these data when they leave the veth module.

I'm not sure this works completely. In my app, the packet flow might be:

eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB <-> [kernel router/bridge logic ...] <-> eth1

There may be no sockets on the vethB port. And reader/writer is not a good way to look at it, since I am implementing a bi-directional bridge in user-space and each packet-socket is for both rx and tx.

Also, I might want to send raw frames that do have broken checksums (let's assume a real NIC, not veth), and I want them to hit the wire with those bad checksums. How do I configure the checksumming in this case?

Correct me if I'm wrong, but I think this is already possible now. You can have packets with incorrect checksums hitting the wire as is. What you cannot do is instruct the receiving end to ignore the checksum from the sending end when using a physical device (and something I think we should mimic on the sending device).

Yes, it does work currently (or, last I checked)...I just want to make sure it keeps working.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 12:54 PM, Tom Herbert wrote:

We've put considerable effort into cleaning up the checksum interface to make it as unambiguous as possible; please be very careful to follow it. Broken checksum processing is really hard to detect and debug. CHECKSUM_UNNECESSARY means that some number of _specific_ checksums (indicated by csum_level) have been verified to be correct in a packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is never right. If CHECKSUM_UNNECESSARY is set in such a manner but the checksum it would refer to has not been verified and is incorrect, this is a major bug.

Suppose I know that the packet received on a packet-socket has already been verified by a NIC that supports hardware checksumming. Then I want to transmit it on a veth interface using a second packet socket. I do not want veth to recalculate the checksum on transmit, nor to validate it on the peer veth on receive, because I do not want to waste the CPU cycles. I am assuming that my app is not accidentally corrupting frames, so the checksum can never be bad. How should the checksumming be configured for the packets going into the packet-socket from user-space?

Also, I might want to send raw frames that do have broken checksums (let's assume a real NIC, not veth), and I want them to hit the wire with those bad checksums. How do I configure the checksumming in this case?

Thanks,
Ben

Tom

On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear <gree...@candelatech.com> wrote:

On 04/30/2016 11:33 AM, Ben Hutchings wrote:

On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:

Hello,
http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a

Actually, no, this is not really a regression. [...]

It really is. Even though the old behaviour was a bug (raw packets should not be changed), if there are real applications that depend on that then we have to keep those applications working somehow.

To be honest, I fail to see why the old behaviour is a bug when sending raw packets from user-space. If raw packets should not be changed, then we need some way to specify what the checksum setting is to begin with; otherwise, user-space does not have enough control. A socket option for new programs, and sysctl-configurable defaults for raw sockets for old binary programs, would be sufficient I think.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com