Can NFS work with VRF?

2018-11-05 Thread Ben Greear

Hello,

I was trying to improve my old series of patches that bind NFS to a
particular source IP address, so that it could work with VRF in a 4.16
kernel.  But it seems a huge tangle to make NFS (and rpc, etc.) able to
bind to a local netdevice, which I think is what would be needed to make
it work with VRF.

Has anyone already worked on VRF support for NFS?
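For context: for ordinary userspace processes, the usual way to scope traffic to a VRF is `ip vrf exec`, which attaches a cgroup/BPF hook that device-binds every socket the command creates. The open question above is precisely that NFS's sunrpc sockets are created in the kernel and do not get that treatment. A hypothetical setup sketch (device and table names are invented for illustration; requires root):

```shell
# Create a VRF and enslave the interface that reaches the NFS server.
ip link add vrf-blue type vrf table 10
ip link set vrf-blue up
ip link set eth1 master vrf-blue

# This works for ordinary userspace clients...
ip vrf exec vrf-blue ping -c1 192.0.2.10

# ...but an NFS mount creates kernel-side RPC sockets, which is why
# 'ip vrf exec vrf-blue mount -t nfs ...' is not expected to help.
```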

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Anyone know if strongswan works with vrf?

2018-06-29 Thread Ben Greear

Hello,

We're trying to create lots of strongswan VPN tunnels on network devices
bound to different VRFs.  We are using Fedora-24 on the client side, with a 
4.16.15+ kernel
and updated 'ip' package, etc.

So far, no luck getting it to work.

Any idea if this is supported or not?

Thanks,
Ben
--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-11 Thread Ben Greear

On 06/10/2018 10:10 AM, Michał Kazior wrote:

Ben,

The patch treats a symptom. fq_tin_dequeue() already checks whether the list
is empty before it tries to access the first entry. I see no point in
using the _or_null() + WARN_ON.

The 0x3c deref is likely an offset off of a NULL base pointer. Did you
check gdb/addr2line on ieee80211_tx_dequeue+0xfb? Where did it
point to?


gdb pointed to one line above the flow dereference, which is why I was
going to put some debugging in there.



I suspect there's not enough synchronization between quiescing the
device/ath10k after the firmware crashes and performing mac80211's reconfig
procedure.


I am already running this patch which helps with some of that.  That
patch never made it upstream, but it fixed problems for me earlier.

https://patchwork.kernel.org/patch/9457639/

Could easily be there are some more issues in that logic.

Someone else posted a patch to disable mac-80211 tx when FW crashes,
I think...I have not tried to backport that.

https://patchwork.kernel.org/patch/10411967/

Thanks,
Ben





Michał

On 8 June 2018 at 23:40, Arend van Spriel  wrote:

On 6/8/2018 5:17 PM, Ben Greear wrote:

I recalled an email from Michał about leaving Tieto, so I'm adding the
alternate email he provided back then.

Gr. AvS



On 06/07/2018 04:59 PM, Cong Wang wrote:


On Thu, Jun 7, 2018 at 4:48 PM,   wrote:


diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..cb911f0 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
return NULL;
}

-   flow = list_first_entry(head, struct fq_flow, flowchain);
+   flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
+
+   if (WARN_ON_ONCE(!flow))
+   return NULL;



This does not make sense either. list_first_entry_or_null()
returns NULL only when the list is empty, but we already check
list_empty() right before this code, and it is protected by fq->lock.



Hello Michal,

git blame shows you as the author of the fq_impl.h code.

I saw a crash when debugging funky ath10k firmware in a 4.16 + hacks
kernel.  There was an apparent
mostly-null deref in the fq_tin_dequeue method.  According to gdb, it
was within
1 line of the dereference of 'flow'.

My hack above is probably not that useful.  Cong thinks maybe the
locking is bad.

If you get a chance, please review this thread and see if you have any
ideas for
a better fix (or better debugging code).

As always, if you would like me to generate you a buggy firmware that
will crash
in the tx path and cause all sorts of mayhem in the ath10k driver and
wifi stack,
I will be happy to do so.

https://www.mail-archive.com/netdev@vger.kernel.org/msg239738.html

Thanks,
Ben







--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-08 Thread Ben Greear




On 06/07/2018 05:13 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 4:48 PM,   wrote:

From: Ben Greear 

While testing an ath10k firmware that often crashed under load,
I was seeing kernel crashes as well.  One of them appeared to
be a dereference of a NULL flow object in fq_tin_dequeue.

I have since fixed the firmware flaw, but I think it would be
worth adding the WARN_ON in case the problem appears again.

BUG: unable to handle kernel NULL pointer dereference at 003c
IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]


Instead of adding WARN_ON(), you need to think about
the locking there, it is suspicious:

fq is from struct ieee80211_local:

struct fq *fq = &local->fq;

tin is from struct txq_info:

struct fq_tin *tin = &txqi->tin;

I don't know if fq and tin are supposed to be 1:1, if not there is
a bug in the locking, because ->new_flows and ->old_flows are
both inside tin instead of fq, but they are protected by fq->lock


Maybe whoever put this code together can take a stab at it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-08 Thread Ben Greear




On 06/07/2018 04:59 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 4:48 PM,   wrote:

diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..cb911f0 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
return NULL;
}

-   flow = list_first_entry(head, struct fq_flow, flowchain);
+   flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
+
+   if (WARN_ON_ONCE(!flow))
+   return NULL;


This does not make sense either. list_first_entry_or_null()
returns NULL only when the list is empty, but we already check
list_empty() right before this code, and it is protected by fq->lock.



Nevermind then.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 02:52 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear  wrote:

On 06/07/2018 02:29 PM, Cong Wang wrote:


On Thu, Jun 7, 2018 at 9:06 AM,   wrote:


--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+



How could list_first_entry() possibly return NULL?
You need list_first_entry_or_null().



I don't know for certain that flow was NULL, but something was NULL in this
method near that line, and it looked like a likely culprit.

I guess possibly tin or fq was passed in as NULL?


A NULL pointer deref is not always at address 0. You can trigger a
NULL-ptr-deref at 0x3c too, but you are checking against 0 in your patch;
that is the problem, and that is why list_first_entry_or_null() exists.



Ahh, I see what you mean, and that is my mistake.  In my case, it did seem to
be a mostly-null deref, not a 0x0 deref.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 02:29 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 9:06 AM,   wrote:

--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+


How could list_first_entry() possibly return NULL?
You need list_first_entry_or_null().



I don't know for certain that flow was NULL, but something was NULL in this
method near that line, and it looked like a likely culprit.

I guess possibly tin or fq was passed in as NULL?

Anyway, if the patch seems worthless just ignore it.  I'll leave it in my tree
since it should be harmless and will let you know if I ever hit it.

If someone else hits a similar crash, hopefully they can report it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 09:17 AM, Eric Dumazet wrote:



On 06/07/2018 09:06 AM, gree...@candelatech.com wrote:

From: Ben Greear 

While testing an ath10k firmware that often crashed under load,
I was seeing kernel crashes as well.  One of them appeared to
be a dereference of a NULL flow object in fq_tin_dequeue.

I have since fixed the firmware flaw, but I think it would be
worth adding the WARN_ON in case the problem appears again.

 common_interrupt+0xf/0xf
 



Please find the exact commit that brought this bug,
and add a corresponding Fixes: tag


It will be a total pain to bisect this problem since my test
case that causes this is running my modified firmware (and a buggy one at that),
modified ath10k driver (to work with this firmware and support my test case 
easily),
and the failure case appears to cause multiple different-but-probably-related
crashes and often hangs or reboots the test system.

Probably this is all caused by some nasty race or buggy logic related to
dealing with a crashed ath10k firmware tearing down txq logic from the
bottom up.  There have been many such bugs in the past, I and others fixed a 
few,
and very likely more remain.

For what it is worth, I didn't see this crash in 4.13, and I spent some time
testing buggy firmware there occasionally.

If someone else has interest in debugging the ath10k driver, I will be happy to 
generate
a mostly-stock firmware image with ability to crash in the TX path and give it 
to them.
It will crash the stock upstream code reliably in my experience.

Thanks,
Ben





Signed-off-by: Ben Greear 
---
 include/net/fq_impl.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..e40354d 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+
if (flow->deficit <= 0) {
flow->deficit += fq->quantum;
list_move_tail(&flow->flowchain,






--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Regression bisected to: softirq: Let ksoftirqd do its job

2018-05-17 Thread Ben Greear

One of my out-of-tree patches is a network impairment tool that acts a lot like
an Ethernet bridge with latency, jitter, etc.

We noticed recently that we were seeing igb adapter errors when testing with 
our emulator
at high speeds.  For whatever reason, it is only easily reproduced when
we add jitter in our emulator.  The jitter causes a bit more CPU usage
and lock contention in our software, and increases the number of skbs
allocated at any given time.

I bisected the problem to the commit below:

Author: Eric Dumazet <eduma...@google.com>
Date:   Wed Aug 31 10:42:29 2016 -0700

softirq: Let ksoftirqd do its job

A while back, Paolo and Hannes sent an RFC patch adding threaded-able
napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)


If I replace my emulator with a bridge, then I do not see the problem.  But, I 
also do not
(or very rarely?) see the problem when configuring the emulator with zero 
latency and jitter,
which is how the bridge would act.

Any idea what sort of (bad?) behaviour could cause this tx queue timeout?

If you have any interest, I will be happy to email you my out-of-tree patches 
and
instructions to reproduce the problem.


The kernel splat looks like this, and repeats often:


May 17 16:03:09 localhost.localdomain kernel: audit: type=1131 audit(1526598189.492:159): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed 
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

May 17 16:03:39 localhost.localdomain kernel: [ cut here 
]
May 17 16:03:39 localhost.localdomain kernel: WARNING: CPU: 5 PID: 0 at 
/home/greearb/git/linux-bisect/net/sched/sch_generic.c:316 
dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel: NETDEV WATCHDOG: eth5 (igb): 
transmit queue 0 timed out
May 17 16:03:39 localhost.localdomain kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 fuse macvlan wanlink(O) pktgen 
cfg80211 sunrpc coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass ipmi_ssif iTCO_wdt iTCO_vendor_support joydev i2c_i801 lpc_ich 
i2c_smbus ioatdma shpchp wmi ipmi_si ipmi_msghandler tpm_tis tpm_tis_core tpm acpi_power_meter acpi_pad sch_fq_codel ast drm_kms_helper ttm drm igb hwmon ptp 
pps_core dca i2c_algo_bit i2c_core fjes ipv6 crc_ccitt [last unloaded: nf_conntrack]

May 17 16:03:39 localhost.localdomain kernel: CPU: 5 PID: 0 Comm: swapper/5 
Tainted: G   O4.8.0-rc7+ #132
May 17 16:03:39 localhost.localdomain kernel: Hardware name: Iron_Systems,Inc 
CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
May 17 16:03:39 localhost.localdomain kernel:   
88087fd43d78 81417eb1 88087fd43dc8
May 17 16:03:39 localhost.localdomain kernel:   
88087fd43db8 81103556 013c7fd43da8
May 17 16:03:39 localhost.localdomain kernel:   
880854221940 0005 880854bb8000
May 17 16:03:39 localhost.localdomain kernel: Call Trace:
May 17 16:03:39 localhost.localdomain kernel:[] 
dump_stack+0x63/0x82
May 17 16:03:39 localhost.localdomain kernel:  [] 
__warn+0xc6/0xe0
May 17 16:03:39 localhost.localdomain kernel:  [] 
warn_slowpath_fmt+0x4a/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] 
dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
call_timer_fn+0x30/0x150
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
run_timer_softirq+0x1ea/0x450
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
ktime_get+0x37/0xa0
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
lapic_next_deadline+0x21/0x30
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
clockevents_program_event+0x7d/0x120
May 17 16:03:39 localhost.localdomain kernel:  [] 
__do_softirq+0xca/0x2d0
May 17 16:03:39 localhost.localdomain kernel:  [] 
irq_exit+0xb3/0xc0
May 17 16:03:39 localhost.localdomain kernel:  [] 
smp_apic_timer_interrupt+0x3d/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] 
apic_timer_interrupt+0x82/0x90
May 17 16:03:39 localhost.localdomain kernel:[] ? 
cpuidle_enter_state+0x126/0x300
May 17 16:03:39 localhost.localdomain kernel:  [] 
cpuidle_enter+0x12/0x20
May 17 16:03:39 localhost.localdomain kernel:  [] 
call_cpuidle+0x25/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
cpu_startup_entry+0x2ba/0x380
May 17 16:03:39 localhost.localdomain kernel:  [] 
start_secondary+0x149/0x170
May 17 16:03:39 localhost.localdomain kernel: ---[ end trace f62c6dd947785e8f 
]---


Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Performance regression between 4.13 and 4.14

2018-05-09 Thread Ben Greear

On 05/09/2018 12:02 PM, Ben Greear wrote:

On 05/09/2018 11:48 AM, Eric Dumazet wrote:



On 05/09/2018 11:43 AM, Ben Greear wrote:

On 05/08/2018 10:10 AM, Eric Dumazet wrote:



On 05/08/2018 09:44 AM, Ben Greear wrote:

Hello,

I am trying to track down a performance regression that appears to be between 
4.13
and 4.14.

I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about 
6Gbps.

I also tried with user-space UDP traffic on a stock kernel, and I can get about 
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots 
of commits
crash or do not bring up networking, which makes the bisect difficult.

Looking at perf top, it would appear that some lock is probably to blame.



perf record -a -g -e cycles:pp sleep 5
perf report

Then you'll be able to tell us which lock (or call graph) is killing your perf.



I seem to be chasing multiple issues.  For 4.13, at least part of my problem 
was that LOCKDEP was enabled,
during my bisect, though it does NOT appear enabled in 4.16.  I think maybe 
CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
in 4.16, or something like that?  My 4.16 .config does have 
CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:

[greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y


For 4.16, I am disabling RETPOLINE...are there any other such things I need
to disable to keep from getting a performance hit from the spectre-related bug
fixes?  At this point, I do not care about the security implications.

greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
# CONFIG_RETPOLINE is not set


Thanks,
Ben



No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/


I initially saw the problem in 4.16, then bisected, and 4.14 still showed the
issue.


So, I guess I must have been enabling lockdep the whole time.  This 
__lock_acquire
is from lockdep as far as I can tell, not normal locking.  I re-built 4.16 after
verifying as best as I could that lockdep was not enabled, and now it performs
as expected.

I'm going to test a patch to change __lock_acquire to __lock_acquire_lockdep so
maybe someone else will not make the same mistake I made.


+   17.78%17.78%  kpktgend_1   [kernel.kallsyms] [k] 
__lock_acquire.isra.3



Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Performance regression between 4.13 and 4.14

2018-05-09 Thread Ben Greear

On 05/09/2018 11:48 AM, Eric Dumazet wrote:



On 05/09/2018 11:43 AM, Ben Greear wrote:

On 05/08/2018 10:10 AM, Eric Dumazet wrote:



On 05/08/2018 09:44 AM, Ben Greear wrote:

Hello,

I am trying to track down a performance regression that appears to be between 
4.13
and 4.14.

I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about 
6Gbps.

I also tried with user-space UDP traffic on a stock kernel, and I can get about 
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots 
of commits
crash or do not bring up networking, which makes the bisect difficult.

Looking at perf top, it would appear that some lock is probably to blame.



perf record -a -g -e cycles:pp sleep 5
perf report

Then you'll be able to tell us which lock (or call graph) is killing your perf.



I seem to be chasing multiple issues.  For 4.13, at least part of my problem 
was that LOCKDEP was enabled,
during my bisect, though it does NOT appear enabled in 4.16.  I think maybe 
CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
in 4.16, or something like that?  My 4.16 .config does have 
CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:

[greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y


For 4.16, I am disabling RETPOLINE...are there any other such things I need
to disable to keep from getting a performance hit from the spectre-related bug
fixes?  At this point, I do not care about the security implications.

greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
# CONFIG_RETPOLINE is not set


Thanks,
Ben



No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/


I initially saw the problem in 4.16, then bisected, and 4.14 still showed the
issue.

4.13 works, but only when I use a .config I originally built for 4.13, not the 
4.16 .config
that I ended up using with the bisect (make oldconfig, accept all defaults).  I 
originally
configured 4.16 with a .config that had lockdep enabled, then manually tried to 
disable it
through 'make xconfig'.  I think that must leave "CONFIG_LOCKDEP=y" in the 
.config, which
screws up older builds during bisect, perhaps?


Before doing a (painful) bisection, the perf output would immediately tell you
if
something is really wrong on your .config.


I didn't realize lockdep might be an issue at the time, but here is a 'bad' run 
from
a 4.13+ (plus pktgen hacks).  I guess lockdep is why this runs slowly, but I 
see no obvious
proof of that in the output:

4.13+, patched pktgen, 6Gbps throughput, on commit 
906dde0f355bd97c080c215811ae7db1137c4af8

Samples: 26K of event 'cycles:pp', Event count (approx.): 20119166736
  Children  Self  Command  Shared ObjectSymbol
+   87.97% 0.00%  kpktgend_1   [kernel.kallsyms][k] 
ret_from_fork
+   87.97% 0.00%  kpktgend_1   [kernel.kallsyms][k] kthread
+   86.89% 5.42%  kpktgend_1   [kernel.kallsyms][k] 
pktgen_thread_worker
+   33.75% 0.18%  kpktgend_1   [kernel.kallsyms][k] 
getnstimeofday64
+   32.77% 4.47%  kpktgend_1   [kernel.kallsyms][k] 
__getnstimeofday64
+   24.60%10.91%  kpktgend_1   [kernel.kallsyms][k] 
lock_acquire
+   23.59% 0.03%  kpktgend_1   [kernel.kallsyms][k] 
__do_softirq
+   23.55% 0.07%  kpktgend_1   [kernel.kallsyms][k] 
net_rx_action
+   22.29% 0.47%  kpktgend_1   [kernel.kallsyms][k] 
getRelativeCurNs
+   21.33% 1.71%  kpktgend_1   [kernel.kallsyms][k] 
ixgbe_poll
+   15.79% 0.02%  kpktgend_1   [kernel.kallsyms][k] 
ret_from_intr
+   15.78% 0.01%  kpktgend_1   [kernel.kallsyms][k] do_IRQ
+   15.34% 0.01%  kpktgend_1   [kernel.kallsyms][k] irq_exit
+   13.95%10.00%  kpktgend_1   [kernel.kallsyms][k] 
ip_send_check
+   13.80%13.80%  kpktgend_1   [kernel.kallsyms][k] 
__lock_acquire.isra.31
+   12.98% 0.53%  kpktgend_1   [kernel.kallsyms][k] 
pktgen_finalize_skb
+   12.31% 0.20%  kpktgend_1   [kernel.kallsyms][k] 
timestamp_skb.isra.24
+   11.68% 0.13%  kpktgend_1   [kernel.kallsyms][k] 
napi_gro_receive
+   11.36% 0.25%  kpktgend_1   [kernel.kallsyms][k] 
netif_receive_skb_internal
+   10.93% 0.00%  swapper  [kernel.kallsyms][k] 
verify_cpu
+   10.93% 0.00%  swapper  [kernel.kallsyms][k] 
cpu_startup_entry
+   10.92% 0.02%  swapper  [kernel.kallsyms][k] do_idle
+   10.71% 0.00%  swapper  [kernel.kallsyms][k] 
cpuidle_enter
+   10.71% 0.00%  swapper  [ke

Re: Performance regression between 4.13 and 4.14

2018-05-09 Thread Ben Greear

On 05/08/2018 10:10 AM, Eric Dumazet wrote:



On 05/08/2018 09:44 AM, Ben Greear wrote:

Hello,

I am trying to track down a performance regression that appears to be between 
4.13
and 4.14.

I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about 
6Gbps.

I also tried with user-space UDP traffic on a stock kernel, and I can get about 
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots 
of commits
crash or do not bring up networking, which makes the bisect difficult.

Looking at perf top, it would appear that some lock is probably to blame.



perf record -a -g -e cycles:pp sleep 5
perf report

Then you'll be able to tell us which lock (or call graph) is killing your perf.



I seem to be chasing multiple issues.  For 4.13, at least part of my problem 
was that LOCKDEP was enabled,
during my bisect, though it does NOT appear enabled in 4.16.  I think maybe 
CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
in 4.16, or something like that?  My 4.16 .config does have 
CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:

[greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y


For 4.16, I am disabling RETPOLINE...are there any other such things I need
to disable to keep from getting a performance hit from the spectre-related bug
fixes?  At this point, I do not care about the security implications.

greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
# CONFIG_RETPOLINE is not set


Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



ICMP redirect and VRF

2018-05-08 Thread Ben Greear

While debugging some other problem today on a system using ip rules instead of
VRF, I ran into a case where the remote router was sending back ICMP redirects.

That got me thinking...where would these routes get stored in a VRF scenario?

Would it magically go to the correct VRF routing table based on the incoming 
interface
for the ICMP redirect response?

Thanks,
Ben
--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Performance regression between 4.13 and 4.14

2018-05-08 Thread Ben Greear

Hello,

I am trying to track down a performance regression that appears to be between 
4.13
and 4.14.

I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about 
6Gbps.

I also tried with user-space UDP traffic on a stock kernel, and I can get about 
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots 
of commits
crash or do not bring up networking, which makes the bisect difficult.

Looking at perf top, it would appear that some lock is probably to blame.

Any ideas what might have been introduced during this interval that
would cause this?

Anyone else seen similar?

I'm going to attempt some more manual steps to try to find the commit that
introduces this...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: The SO_BINDTODEVICE was set to the desired interface, but packets are received from all interfaces.

2018-05-07 Thread Ben Greear

On 05/07/2018 03:19 AM, Damir Mansurov wrote:


Greetings,

After successful call of the setsockopt(SO_BINDTODEVICE) function to set data 
reception from only one interface, the data is still received from all 
interfaces.
Function setsockopt() returns 0 but then recv() receives data from all 
available network interfaces.

The problem is reproducible on linux kernels 4.14 - 4.16, but it does not on 
linux kernels 4.4, 4.13.

I have written C-code to reproduce this issue (see attached files b2d_send.c 
and b2d_recv.c). See below explanation of tested configuration.


Hello,

I am not sure if this is your problem or not, but if you are using VRF, then 
you need
to call SO_BINDTODEVICE before you do the 'normal' bind() call.

Thanks,
Ben




PC-1  PC-2
 ---   ---
 | b2d_send|   | b2d_recv|
 | |   | |
 |   --|   |--   |
 |  | eth0 |---| eth0 |  |
 |   --|   |--   |
 | |   | |
 |   --|   |--   |
 |  | eth1 |---| eth1 |  |
 |   --|   |--   |
 | |   | |
 ---   ---

Steps:
1. Copy b2d_recv.c to PC-2, compile it ("gcc -o b2d_recv b2d_recv.c") and run 
"./b2d_recv eth0 23777" to get derived data only from eth0 interface. Port number
in this example is 23777 only for sample.

2. Copy b2d_send.c to PC-1, compile it ("gcc -o b2d_send b2d_send.c") and run 
"./b2d_send ip1 ip2 23777" where ip1 and ip2 are ip addresses of interfaces eth0
and eth1 of PC-2.

3. Result:
- b2d_recv prints out data from eth0 and eth1 on linux kernels from 4.14 up to 
4.16.
- b2d_recv prints out data from only eth0 on linux kernels below 4.14.


**
Thanks,
Damir Mansurov
dn...@oktetlabs.ru



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net: Work around crash in ipv6 fib-walk-continue

2018-05-04 Thread Ben Greear

On 05/04/2018 10:47 AM, David Ahern wrote:

On 4/19/18 12:01 PM, gree...@candelatech.com wrote:

From: Ben Greear <gree...@candelatech.com>

This keeps us from crashing in certain test cases where we
bring up many (1000, for instance) mac-vlans with IPv6
enabled in the kernel.  This bug has been around for a
very long time.

Until a real fix is found (and for stable), maybe it
is better to return an incomplete fib walk instead
of crashing.

BUG: unable to handle kernel NULL pointer dereference at 8
IP: fib6_walk_continue+0x5b/0x140 [ipv6]
PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0
Oops:  [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 
libcrc32c vrf]
CPU: 3 PID: 15117 Comm: ip Tainted: G   O 4.16.0+ #5
Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
RSP: 0018:c90008c3bc10 EFLAGS: 00010287
RAX: 88085ac45050 RBX: 8807e03008a0 RCX: 
RDX:  RSI: c90008c3bc48 RDI: 8232b240
RBP: 880819167600 R08: 0008 R09: 8807dff10071
R10: c90008c3bbd0 R11:  R12: 8807e03008a0
R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000
FS:  7f2f04342700() GS:88087fcc() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0008 CR3: 0007e0556002 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x14b/0x2c0 [ipv6]
 netlink_dump+0x216/0x2a0
 netlink_recvmsg+0x254/0x400
 ? copy_msghdr_from_user+0xb5/0x110
 ___sys_recvmsg+0xe9/0x230
 ? find_held_lock+0x3b/0xb0
 ? __handle_mm_fault+0x617/0x1180
 ? __audit_syscall_entry+0xb3/0x110
 ? __sys_recvmsg+0x39/0x70
 __sys_recvmsg+0x39/0x70
 do_syscall_64+0x63/0x120
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f2f03a72030
RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030
RDX:  RSI: 7fffab3de570 RDI: 0004
RBP:  R08: 7e6c R09: 7fffab3e63a8
R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608
R13: 0066b460 R14: 7e6c R15: 
Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 
89 53 2c c7 4
RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10
CR2: 0008
---[ end trace bd03458864eb266c ]---

Signed-off-by: Ben Greear <gree...@candelatech.com>
---



Does your use case that triggers this involve replacing routes? I just
noticed the route delete code in fib6_add_rt2node does not have the
'Adjust walkers' code that is in fib6_del_route.

Further, the adjust walkers code in fib6_del_route looks suspicious in
its timing with route deletes. If you have a reliable reproducer we can
try a few things with fib6_del_route and the walker code.


Yes, we replace routes, and yes we can reliably reproduce it and will
be happy to test patches.

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-30 Thread Ben Greear

On 04/27/2018 08:11 PM, Steven Rostedt wrote:


We'd like this email archived in netdev list, but since netdev is
notorious for blocking outlook email as spam, it didn't go through. So
I'm replying here to help get it into the archives.

Thanks!

-- Steve


On Fri, 27 Apr 2018 23:05:46 +
Michael Wenig <mwe...@vmware.com> wrote:


As part of VMware's performance testing with the Linux 4.15 kernel,
we identified CPU cost and throughput regressions when comparing to
the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
send tests when using small message sizes. The regressions are
significant (up to 3x) and were tracked down to be a side effect of Eric
Dumazet's RB tree changes that went into the Linux 4.15 kernel.
Further investigation showed our use of the TCP_NODELAY flag in
conjunction with Eric's change caused the regressions to show and
simply disabling TCP_NODELAY brought performance back to normal.
Eric's change also resulted into significant improvements in our
TCP_RR test cases.



Based on these results, our theory is that Eric's change made the
system overall faster (reduced latency) but as a side effect less
aggregation is happening (with TCP_NODELAY) and that results in lower
throughput. Previously even though TCP_NODELAY was set, system was
slower and we still got some benefit of aggregation. Aggregation
helps in better efficiency and higher throughput although it can
increase the latency. If you are seeing a regression in your
application throughput after this change, using TCP_NODELAY might
help bring performance back however that might increase latency.


I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-23 Thread Ben Greear

On 04/22/2018 02:15 PM, Roopa Prabhu wrote:

On Sun, Apr 22, 2018 at 11:54 AM, David Miller <da...@davemloft.net> wrote:

From: Johannes Berg <johan...@sipsolutions.net>
Date: Thu, 19 Apr 2018 17:26:57 +0200


On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)


I guess you'll have to ask davem. :)


Well, first of all, I really don't like this.

The first reason is that every time I see interface foo become foo2,
foo3 is never far behind it.

If foo was not extensible enough such that we needed foo2, we better
design the new thing with explicitly better extensibility in mind.

Furthermore, what you want here is a specific filter.  Someone else
will want to filter on another criteria, and the next person will
want yet another.

This needs to be properly generalized.

And frankly if we had moved to ethtool netlink/devlink by now, we
could just add a netlink attribute for filtering and not even be
having this conversation.



+1.

Also, the RTM_GETSTATS api was added to improve stats query efficiency
(with filters).
 we should look at it  to see if this fits there. Keeping all stats
queries in one place will help.


I like the ethtool API, so I'll be sticking with that for now.

Thanks,
Ben



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-23 Thread Ben Greear

On 04/22/2018 11:54 AM, David Miller wrote:

From: Johannes Berg <johan...@sipsolutions.net>
Date: Thu, 19 Apr 2018 17:26:57 +0200


On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)


I guess you'll have to ask davem. :)


Well, first of all, I really don't like this.

The first reason is that every time I see interface foo become foo2,
foo3 is never far behind it.

If foo was not extensible enough such that we needed foo2, we better
design the new thing with explicitly better extensibility in mind.

Furthermore, what you want here is a specific filter.  Someone else
will want to filter on another criteria, and the next person will
want yet another.

This needs to be properly generalized.

And frankly if we had moved to ethtool netlink/devlink by now, we
could just add a netlink attribute for filtering and not even be
having this conversation.


Well, since there are un-defined flags, it would be simple enough to
extend the API further in the future (flag (1<<31) could mean expect
more input members, etc.).  And, adding up to 30 more flags to filter on
different things won't change the API and should be backwards compatible.

But, if you don't want it, that is OK by me, I agree it is a fairly
obscure feature.  It would have saved me time if you had said you didn't
want it at the first RFC patch though...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-19 Thread Ben Greear



On 04/18/2018 11:38 PM, Johannes Berg wrote:

On Wed, 2018-04-18 at 14:51 -0700, Ben Greear wrote:


It'd be pretty hard to know which flags are firmware stats?


Yes, it is, but ethtool stats are difficult to understand in a generic
manner anyway, so someone using them is already likely aware of low-level
details of the driver(s) they are using.


Right. Come to think of it though,


+ * @get_ethtool_stats2: Return extended statistics about the device.
+ * This is only useful if the device maintains statistics not
+ * included in  rtnl_link_stats64.
+ *  Takes a flags argument:  0 means all (same as get_ethtool_stats),
+ *  0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
+ *  Other flags are reserved for now.
+ *  Same number of stats will be returned, but some of them might
+ *  not be as accurate/refreshed.  This is to allow not querying
+ *  firmware or other expensive-to-read stats, for instance.


"skip" vs. "don't refresh" is a bit ambiguous - I'd argue better to
either really skip and not return the non-refreshed ones (also helps
with the identifying), or rename the flag.


In order to efficiently parse lots of stats over and over again, I probe
the stat names once on startup, map them to the variable I am trying to use
(since different drivers may have different names for the same basic stat),
and then I store the stat index.

On subsequent stat reads, I just grab stats and go right to the index to
store the stat.

If the stats indexes change, that will complicate my logic quite a bit.

Maybe the flag could be called:  ETHTOOL_GS2_NO_REFRESH_FW ?



Also, wrt. the rest of the patch, I'd argue that it'd be worthwhile to
write the spatch and just add the flags argument to "get_ethtool_stats"
instead of adding a separate method - internally to the kernel it's not
that hard to change.


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-18 Thread Ben Greear

On 04/18/2018 02:26 PM, Johannes Berg wrote:

On Tue, 2018-04-17 at 18:49 -0700, gree...@candelatech.com wrote:


+ * @get_ethtool_stats2: Return extended statistics about the device.
+ * This is only useful if the device maintains statistics not
+ * included in  rtnl_link_stats64.
+ *  Takes a flags argument:  0 means all (same as get_ethtool_stats),
+ *  0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
+ *  Other flags are reserved for now.


It'd be pretty hard to know which flags are firmware stats?


Yes, it is, but ethtool stats are difficult to understand in a generic
manner anyway, so someone using them is already likely aware of low-level
details of the driver(s) they are using.

In my case, I have lots of virtual stations (or APs), and I want stats
for them as well as for the 'radio', so I would probe the first vdev with
flags of 'skip-none' to get all stats, including radio (firmware) stats.

And then the rest I would just probe the non-firmware stats.

To be honest, I was slightly amused that anyone expressed interest in
this patch originally, but maybe other people have similar use case
and/or drivers with slow-to-acquire stats.


Anyway, there's no way I'm going to take this patch, so you need to
float it on netdev first (best CC us here) and get it applied there
before we can do anything on the wifi side.


I posted the patches to netdev, ath10k and linux-wireless.  If I had only
posted them individually to different lists I figure I'd be hearing about how
the netdev patch is useless because it has no driver support, etc.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2018-04-17 Thread Ben Greear

On 01/24/2018 03:59 PM, Ben Greear wrote:

On 06/20/2017 08:03 PM, David Ahern wrote:

On 6/20/17 5:41 PM, Ben Greear wrote:

On 06/20/2017 11:05 AM, Michal Kubecek wrote:

On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:

On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no
longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.


You might try trace_printk() which should have less impact (don't forget
to enable /proc/sys/kernel/ftrace_dump_on_oops).


We cannot reproduce with trace_printk() either.


I think that suggests the walker state is set to FWS_U in
fib6_del_route, and it is the FWS_U case in fib6_walk_continue that
triggers the fault -- the null parent (pn = fn->parent). So we have the
2 areas of code that are interacting.

I'm on a road trip through the end of this week with little time to
focus on this problem. I'll get back to you another suggestion when I can.


FYI, problem still happens in 4.16.  I'm going to re-enable my hack below
for this kernel as well...I had hopes it might be fixed...

BUG: unable to handle kernel NULL pointer dereference at 8
IP: fib6_walk_continue+0x5b/0x140 [ipv6]
PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0
Oops:  [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 
libcrc32c vrf]
CPU: 3 PID: 15117 Comm: ip Tainted: G   O 4.16.0+ #5
Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
RSP: 0018:c90008c3bc10 EFLAGS: 00010287
RAX: 88085ac45050 RBX: 8807e03008a0 RCX: 
RDX:  RSI: c90008c3bc48 RDI: 8232b240
RBP: 880819167600 R08: 0008 R09: 8807dff10071
R10: c90008c3bbd0 R11:  R12: 8807e03008a0
R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000
FS:  7f2f04342700() GS:88087fcc() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0008 CR3: 0007e0556002 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x14b/0x2c0 [ipv6]
 netlink_dump+0x216/0x2a0
 netlink_recvmsg+0x254/0x400
 ? copy_msghdr_from_user+0xb5/0x110
 ___sys_recvmsg+0xe9/0x230
 ? find_held_lock+0x3b/0xb0
 ? __handle_mm_fault+0x617/0x1180
 ? __audit_syscall_entry+0xb3/0x110
 ? __sys_recvmsg+0x39/0x70
 __sys_recvmsg+0x39/0x70
 do_syscall_64+0x63/0x120
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f2f03a72030
RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030
RDX:  RSI: 7fffab3de570 RDI: 0004
RBP:  R08: 7e6c R09: 7fffab3e63a8
R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608
R13: 0066b460 R14: 7e6c R15: 
Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 
89 53 2c c7 4
RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10
CR2: 0008
---[ end trace bd03458864eb266c ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.



So, though I don't know the right way to fix it, the patch below appears
to make the system not crash.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
pn = fn->parent;
w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES
+   if (WARN_ON_ONCE(!pn)) {
+   pr_err("FWS-U, w: %p  fn: %p  pn: %p\n",
+  w, fn, pn);
+   /* Attempt to work around crash that has been 
here forever. --Ben */
+   return 0;
+   }
if (FIB6_SUBTREE(pn) == fn) {
WARN_ON(!(fn->fn_flags & RTN_ROOT));
w->state = FWS_L;



The printout looks like this:

Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 11:24 AM, Michal Kubecek wrote:

On Tue, Mar 20, 2018 at 08:39:33AM -0700, Ben Greear wrote:

On 03/20/2018 03:37 AM, Michal Kubecek wrote:


IMHO it would be more practical to set "0 means same as GSTATS" as a
rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to
avoid code duplication (or perhaps use a fall-through in the switch). It
would also allow drivers to provide only one of the callbacks.


Yes, but that would require changing all drivers at once, and would make 
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature 
would
make it upstream, so I didn't want to propose any large changes up front.


I don't think so. What I mean is:

(a) driver implements ->get_ethtool_stats2() callback; then we use it
for GSTATS2
(b) driver does not implement get_ethtool_stats2() but implements
->get_ethtool_stats(); then we call it for GSTATS2 if level is zero,
otherwise GSTATS2 returns -EINVAL

and GSTATS is always translated to GSTATS2 with level 0, either by
defining ethtool_get_stats() as a wrapper or by fall-through in the
switch statement.

This way, most drivers could be left untouched and only those which
would implement non-default levels would provide ->get_ethtool_stats2()
callback instead of ->get_ethtool_stats().


OK, that makes sense.  I'll wait on feedback from the flags or #defined levels
and re-spin the patch accordingly.

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns

2018-03-20 Thread Ben Greear

On 03/20/2018 09:44 AM, Liran Alon wrote:



On 20/03/18 18:24, ebied...@xmission.com wrote:


I don't believe the current behavior is a bug.

I looked through the history.  Basically skb_scrub_packet
started out as the scrubbing needed for crossing network
namespaces.

Then tunnels which needed 90% of the functionality started
calling it, with the xnet flag added.  Because the tunnels
needed to preserve their historic behavior.

Then dev_forward_skb started calling skb_scrub_packet.

A veth pair is supposed to give the same behavior as a cross-over
cable plugged into two local nics.  A cross over cable won't
preserve things like the skb mark.  So I don't see why anyone would
expect a veth pair to preserve the mark.


I disagree with this argument.

I think that an skb crossing netns is what simulates a real packet crossing
physical computers. Following your argument, why should skb->mark be preserved
when crossing netdevs in the same netns via routing? But today it is preserved.

Therefore, I do think that skb->mark should conceptually only be scrubbed when 
crossing netns. Regardless of the netdev used to cross it.


It should be scrubbed in VETH as well.  That is one way to make virtual 
routers.  Possibly
the newer VRF features will give another better way to do it, but you should 
not break
things that used to work.

Now, if you want to add a new feature that allows one to configure the kernel 
(or VETH) for
a new behavior, then that might be something to consider.


Right now I don't see the point of handling packets that don't cross
network namespace boundaries specially, other than to preserve backwards
compatibility.


Well, backwards compat is a big deal all by itself!

Thanks,
Ben



Eric







--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 09:11 AM, Steve deRosier wrote:

On Tue, Mar 20, 2018 at 8:39 AM, Ben Greear <gree...@candelatech.com> wrote:

On 03/20/2018 03:37 AM, Michal Kubecek wrote:


On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote:


From: Ben Greear <gree...@candelatech.com>

This is similar to ETHTOOL_GSTATS, but it allows you to specify
a 'level'.  This level can be used by the driver to decrease the
amount of stats refreshed.  In particular, this helps with ath10k
since getting the firmware stats can be slow.

Signed-off-by: Ben Greear <gree...@candelatech.com>
---

NOTE:  I know to make it upstream I would need to split the patch and
remove the #define for 'backporting' that I added.  But, is the
feature in general wanted?  If so, I'll do the patch split and
other tweaks that might be suggested.





Yes, but that would require changing all drivers at once, and would make
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature
would
make it upstream, so I didn't want to propose any large changes up front.



Hi Ben,

I find the feature OK, but I'm not thrilled with the arbitrary scale
of "level". Maybe there could be some named values, either on a
spectrum as level already is, similar to the kernel log DEBUG, WARN,
INFO type levels, or named bit flags like the way the ath drivers
do their debug flags for granular results.  Thoughts?


Yes, that would be easier to code too.  If there are any other drivers
out there that might take advantage of this, maybe they could chime in with
what levels and/or bit-fields they would like to see.

For instance a bit that says 'refresh-stats-from-firmware' would be great for 
ath10k,
but maybe useless for everyone else.

Thanks,
Ben



- Steve




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 03:37 AM, Michal Kubecek wrote:

On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote:

From: Ben Greear <gree...@candelatech.com>

This is similar to ETHTOOL_GSTATS, but it allows you to specify
a 'level'.  This level can be used by the driver to decrease the
amount of stats refreshed.  In particular, this helps with ath10k
since getting the firmware stats can be slow.

Signed-off-by: Ben Greear <gree...@candelatech.com>
---

NOTE:  I know to make it upstream I would need to split the patch and
remove the #define for 'backporting' that I added.  But, is the
feature in general wanted?  If so, I'll do the patch split and
other tweaks that might be suggested.


I'm not familiar enough with the technical background of stats
collecting to comment on usefulness and desirability of this feature.
Adding a new command just to add a numeric parameter certainly doesn't
feel right but it's how the ioctl interface works. I take it as
a reminder to find some time to get back to the netlink interface.


diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 674b6c9..d3b709f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1947,6 +1947,54 @@ static int ethtool_get_stats(struct net_device *dev, 
void __user *useraddr)
return ret;
 }

+static int ethtool_get_stats2(struct net_device *dev, void __user *useraddr)
+{
+   struct ethtool_stats stats;
+   const struct ethtool_ops *ops = dev->ethtool_ops;
+   u64 *data;
+   int ret, n_stats;
+   u32 stats_level = 0;
+
+   if (!ops->get_ethtool_stats2 || !ops->get_sset_count)
+   return -EOPNOTSUPP;
+
+   n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
+   if (n_stats < 0)
+   return n_stats;
+   if (n_stats > S32_MAX / sizeof(u64))
+   return -ENOMEM;
+   WARN_ON_ONCE(!n_stats);
+   if (copy_from_user(&stats, useraddr, sizeof(stats)))
+   return -EFAULT;
+
+   /* User can specify the level of stats to query.  How the
+* level value is used is up to the driver, but in general,
+* 0 means 'all', 1 means least, and higher means more.
+* The idea is that some stats may be expensive to query, so user
+* space could just ask for the cheap ones...
+*/
+   stats_level = stats.n_stats;
+
+   stats.n_stats = n_stats;
+   data = vzalloc(n_stats * sizeof(u64));
+   if (n_stats && !data)
+   return -ENOMEM;
+
+   ops->get_ethtool_stats2(dev, &stats, data, stats_level);
+
+   ret = -EFAULT;
+   if (copy_to_user(useraddr, &stats, sizeof(stats)))
+   goto out;
+   useraddr += sizeof(stats);
+   if (n_stats && copy_to_user(useraddr, data, n_stats * sizeof(u64)))
+   goto out;
+   ret = 0;
+
+ out:
+   vfree(data);
+   return ret;
+}
+
 static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 {
struct ethtool_stats stats;


IMHO it would be more practical to set "0 means same as GSTATS" as a
rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to
avoid code duplication (or perhaps use a fall-through in the switch). It
would also allow drivers to provide only one of the callbacks.


Yes, but that would require changing all drivers at once, and would make 
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature 
would
make it upstream, so I didn't want to propose any large changes up front.

Thanks,
Ben



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH net] virtio-net: disable NAPI only when enabled during XDP set

2018-02-28 Thread Ben Greear

On 02/28/2018 09:22 AM, David Miller wrote:

From: Jason Wang <jasow...@redhat.com>
Date: Wed, 28 Feb 2018 18:20:04 +0800


We try to disable NAPI to prevent a single XDP TX queue being used by
multiple cpus. But we don't check if the device is up (NAPI is enabled),
which could result in a stall because of an infinite wait in
napi_disable(). Fix this by checking the device state through
netif_running() first.

Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasow...@redhat.com>


Yes, mis-paired NAPI enable/disable are really a pain.

Probably, we can do something in the interfaces or mechanisms to make
this less error prone and less fragile.

Anyways, applied and queued up for -stable, thanks!



I just hit a similar bug in ath10k.  It seems like napi has plenty
of free bit flags so it could keep track of 'is-enabled' state and
allow someone to call napi_disable multiple times w/out deadlocking.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH RFC net-next 1/4] ipv4: fib_rules: support match on sport, dport and ip proto

2018-02-13 Thread Ben Greear

On 02/12/2018 04:03 PM, David Miller wrote:

From: Eric Dumazet <eric.duma...@gmail.com>
Date: Mon, 12 Feb 2018 13:54:59 -0800


We had project/teams using different routing tables for each vlan they
setup :/


Indeed, people use FIB rules and think they can scale in software.  As
currently implemented, they can't.

The example you give sounds possibly like a great VRF use case btw :-)


I'm one of those people with lots of FIB rules wishing it would scale
better, and wanting a routing table per netdev.

If there is a relatively easy suggestion to make this work better, I'd
like to give it a try.  I have not looked at VRF at all to date...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2018-01-24 Thread Ben Greear

On 06/20/2017 08:03 PM, David Ahern wrote:

On 6/20/17 5:41 PM, Ben Greear wrote:

On 06/20/2017 11:05 AM, Michal Kubecek wrote:

On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:

On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no
longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.


You might try trace_printk() which should have less impact (don't forget
to enable /proc/sys/kernel/ftrace_dump_on_oops).


We cannot reproduce with trace_printk() either.


I think that suggests the walker state is set to FWS_U in
fib6_del_route, and it is the FWS_U case in fib6_walk_continue that
triggers the fault -- the null parent (pn = fn->parent). So we have the
2 areas of code that are interacting.

I'm on a road trip through the end of this week with little time to
focus on this problem. I'll get back to you another suggestion when I can.


So, though I don't know the right way to fix it, the patch below appears
to make the system not crash.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
pn = fn->parent;
w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES
+   if (WARN_ON_ONCE(!pn)) {
+   pr_err("FWS-U, w: %p  fn: %p  pn: %p\n",
+  w, fn, pn);
+   /* Attempt to work around crash that has been 
here forever. --Ben */
+   return 0;
+   }
if (FIB6_SUBTREE(pn) == fn) {
WARN_ON(!(fn->fn_flags & RTN_ROOT));
w->state = FWS_L;



The printout looks like this (when adding 4000 mac-vlans, so it is pretty 
rare).  PN is definitely NULL sometimes:

[root@2u-6n ~]# journalctl -f|grep FWS
Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: 8807ea121ba0  fn: 
880856a09260  pn:   (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 8807e3963de0  fn: 
880856a09260  pn:   (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 88081ac22de0  fn: 
880856a09260  pn:   (null)
Jan 24 15:53:13 2u-6n kernel: IPv6: FWS-U, w: 8808290c69c0  fn: 
8807e369f920  pn:   (null)
Jan 24 15:53:24 2u-6n kernel: IPv6: FWS-U, w: 8807ea3156c0  fn: 
88082d1eeb60  pn:   (null)



8066 Jan 24 15:48:04 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device 
eth2#1006
 8067 Jan 24 15:48:05 2u-6n kernel: [ cut here ]
 8068 Jan 24 15:48:05 2u-6n kernel: WARNING: CPU: 5 PID: 3346 at /home/greearb/git/linux-4.13.dev.y/net/ipv6/ip6_fib.c:1617 fib6_walk_continue+ 
0x154/0x1b0 [ipv6]
 8069 Jan 24 15:48:05 2u-6n kernel: Modules linked in: 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen ipmi_ssif coretemp intel_raplsb_edac 
x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm ath9k irqbypass iTCO_wdt ath9k_common iTCO_vendor_support ath9k_hw ath  i2c_i801 mac80211 joydev 
lpc_ich cfg80211 ioatdma shpchp tpm_tis tpm_tis_core wmi tpm ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl 
sch_fq_codel lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core ipv6 crc_ccitt

 8070 Jan 24 15:48:05 2u-6n kernel: CPU: 5 PID: 3346 Comm: ip Tainted: G
   O4.13.16+ #22
 8071 Jan 24 15:48:05 2u-6n kernel: Hardware name: Iron_Systems,Inc 
CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
 8072 Jan 24 15:48:05 2u-6n kernel: task: 8807e9ef1dc0 task.stack: 
c9002083c000
 8073 Jan 24 15:48:05 2u-6n kernel: RIP: 0010:fib6_walk_continue+0x154/0x1b0 
[ipv6]
 8074 Jan 24 15:48:05 2u-6n kernel: RSP: 0018:c9002083fbc0 EFLAGS: 00010246
 8075 Jan 24 15:48:05 2u-6n kernel: RAX:  RBX: 8807ea121ba0 
RCX: 
 8076 Jan 24 15:48:05 2u-6n kernel: RDX: 880856a09260 RSI: c9002083fc00 
RDI: 81ef2140
 8077 Jan 24 15:48:05 2u-6n kernel: RBP: c9002083fbc8 R08: 0008 
R09: 8807e36f6b25
 8078 Jan 24 15:48:05 2u-6n kernel: R10: c9002083fb70 R11:  
R12: 0002
 8079

Re: e1000e hardware unit hangs

2018-01-24 Thread Ben Greear

On 01/24/2018 10:38 AM, Denys Fedoryshchenko wrote:

On 2018-01-24 20:31, Ben Greear wrote:

On 01/24/2018 08:34 AM, Neftin, Sasha wrote:

On 1/24/2018 18:11, Alexander Duyck wrote:

On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <gree...@candelatech.com> wrote:

Hello,

Anyone have any more suggestions for making e1000e work better?  This is
from a 4.9.65+ kernel,
with these additional e1000e patches applied:

e1000e: Fix error path in link detection
e1000e: Fix wrong comment related to link detection
e1000e: Fix return value test
e1000e: Separate signaling for link check/link up
e1000e: Avoid receiver overrun interrupt bursts


Most of these patches shouldn't address anything that would trigger Tx
hangs. They are mostly related to just link detection.


Test case is simply to run 3 tcp connections each trying to send 56Kbps
of bi-directional
data between a pair of e1000e interfaces :)

No OOM related issues are seen on this kernel...similar test on 4.13 showed
some OOM
issues, but I have not debugged that yet...


Really a question like this probably belongs on e1000-devel or
intel-wired-lan so I have added those lists and the e1000e maintainer
to the thread.

It would be useful if you could provide more information about the
device itself such as the ID and the kind of test you are running.
Keep in mind the e1000e driver supports a pretty broad swath of
devices so we need to narrow things down a bit.


Please also re-check whether your kernel includes:
e1000e: fix buffer overrun while the I219 is processing DMA transactions
e1000e: fix the use of magic numbers for buffer overrun issue
Where did you get your version of the kernel?


Hello,

I tried adding those two patches, but I still see this splat shortly
after starting
my test.  The kernel I am using is here:

https://github.com/greearb/linux-ct-4.13

I've seen similar issues at least back to the 4.0 kernel, including
stock kernels and my
own kernels with additional patches.

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499,
wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: [ cut
here ]
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0
PID: 0 at
/home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322
dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in:
nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c
cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0
Re: e1000e hardware unit hangs

2018-01-24 Thread Ben Greear

On 01/24/2018 08:34 AM, Neftin, Sasha wrote:

On 1/24/2018 18:11, Alexander Duyck wrote:

On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <gree...@candelatech.com> wrote:

Hello,

Anyone have any more suggestions for making e1000e work better?  This is from a 4.9.65+ kernel, with these additional e1000e patches applied:

e1000e: Fix error path in link detection
e1000e: Fix wrong comment related to link detection
e1000e: Fix return value test
e1000e: Separate signaling for link check/link up
e1000e: Avoid receiver overrun interrupt bursts


Most of these patches shouldn't address anything that would trigger Tx hangs. They are mostly related to just link detection.


Test case is simply to run 3 tcp connections each trying to send 56Kbps of bi-directional data between a pair of e1000e interfaces :)

No OOM related issues are seen on this kernel...similar test on 4.13 showed some OOM issues, but I have not debugged that yet...


Really a question like this probably belongs on e1000-devel or intel-wired-lan so I have added those lists and the e1000e maintainer to the thread.

It would be useful if you could provide more information about the device itself such as the ID and the kind of test you are running. Keep in mind the e1000e driver supports a pretty broad swath of devices so we need to narrow things down a bit.


Please also re-check that your kernel includes:
e1000e: fix buffer overrun while the I219 is processing DMA transactions
e1000e: fix the use of magic numbers for buffer overrun issue
Also, where did you get your kernel version from?


Hello,

I tried adding those two patches, but I still see this splat shortly after starting my test.  The kernel I am using is here:

https://github.com/greearb/linux-ct-4.13

I've seen similar issues at least back to the 4.0 kernel, including stock kernels and my own kernels with additional patches.

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: [ cut here ]
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 PID: 0 at /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G   O 4.13.16+ #22
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: 81e104c0 task.stack: 81e0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:88042fc03e50 EFLAGS: 00010282
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0086 RBX:  RCX: 
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 88042fc15b40 RSI: 88042fc0dbf8 RDI: 88042fc0dbf8
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: 88042fc03e98 R08: 0001 R09: 03c4
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10:  R11: 03c4 R12: 1388
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: 000100050dc3 R14: 88041767 R15: 000100052400
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS:  () GS:88042fc0() knlGS:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: 01d14000 CR3: 01e09000 CR4: 001406f0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  call_timer_fn+0x30/0x160
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  run_timer_softirq+0x1f0/0x450
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? lapic_next_deadline+0x21/0x30
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? clockevents_program_event+0x78/0xf0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  __do_softirq+0xc1/0x2c0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  irq_exit+0xb1/0xc0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  smp_apic_timer_interrupt+0x38/0x50
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  apic_timer_interrupt+0x89/0x90
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 

Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/23/2018 03:27 PM, Ben Greear wrote:

On 01/23/2018 03:21 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote:

On 01/23/2018 02:29 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote:

On 01/23/2018 02:07 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 kernel, but will not work at all on a 4.13.  The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



Hello Eric, looks like it is one of your commits that causes the issue I see.

Here are some more details on my specific test case I used to bisect:

I have two ixgbe ports looped back, configured on same subnet, but with different IPs.  Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable.
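The routing setup described above (per-source rules plus per-table routes forcing traffic over the cable instead of the kernel's internal loopback path) can be sketched roughly as follows. The interface names, addresses, and table numbers are assumptions for illustration, not the poster's exact configuration:

```shell
# Example: eth2 = 10.1.1.4/24 and eth3 = 10.1.1.5/24, cabled back to back.
# Without these rules the kernel would short-circuit 10.1.1.4 -> 10.1.1.5
# internally; per-source rules steer each flow out the physical port.
ip route add 10.1.1.5 dev eth2 table 102   # reach the peer via eth2's wire
ip route add 10.1.1.4 dev eth3 table 103
ip rule add from 10.1.1.4 table 102        # select table by source address
ip rule add from 10.1.1.5 table 103

# Send-to-self also generally needs the receiver to accept locally-sourced
# packets and the reverse-path filter relaxed:
sysctl -w net.ipv4.conf.eth2.accept_local=1
sysctl -w net.ipv4.conf.eth3.accept_local=1
sysctl -w net.ipv4.conf.eth2.rp_filter=0
sysctl -w net.ipv4.conf.eth3.rp_filter=0
```

Sockets additionally use SO_BINDTODEVICE and bind() to the local IP, as in the strace later in this message.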

I have 2 mac-vlans on each physical interface.

I created 5 server-side connections on one physical port, and two more on one of the mac-vlans.

On the client-side, I create a process that spawns 5000 connections to the corresponding server side.

End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports.

In the passing case, I get very close to all 5000 connections on all endpoints quickly.

In the failing case, I get a max of around 16k connections on the two physical ports.  The two mac-vlans have 10k connections across them working reliably.  It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD) = 0
fcntl(2075, F_SETFD, FD_CLOEXEC) = 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD) = 0
fcntl(2076, F_SETFD, FD_CLOEXEC) = 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
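The key detail in the trace is that each socket calls bind() to a fixed source IP with port 0 before connect(), so the local port is allocated at bind() time (the code path changed by the bisected commit) rather than by connect()-time autobind. A minimal sketch of that behavior (assumptions: Linux, loopback address; the trace also sets SO_REUSEADDR, omitted here so distinct ports are guaranteed):

```python
import socket

def bind_batch(ip, n):
    """Bind n TCP sockets to (ip, port 0); the kernel allocates an
    ephemeral port per socket at bind() time, before any connect()."""
    socks = []
    ports = set()
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((ip, 0))                # port 0: ephemeral port chosen now
        ports.add(s.getsockname()[1])  # record the port we were given
        socks.append(s)                # keep it open so the port stays used
    return socks, ports

socks, ports = bind_batch("127.0.0.1", 50)
print(len(ports))  # 50 distinct local ports consumed before any connect()
for s in socks:
    s.close()
```

With tens of thousands of sockets bound to one source IP this drains the whole ip_local_port_range, which is consistent with the EADDRNOTAVAIL seen above.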



ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet <eduma...@google.com>
Date:   Thu Feb 11 16:28:50 2016 -0800

 tcp/dccp: better use of ephemeral ports in bind()

 Implement strategy used in __inet_hash_connect() in opposite way :

 Try to find a candidate using odd ports, then fallback to even ports.

 We no longer disable BH for whole traversal, but one bucket at a time.
 We also use cond_resched() to yield cpu to other tasks if needed.

 I removed one indentation level and tried to mirror the loop we have
 in __inet_hash_connect() and variable names to ease code maintenance.

 Signed-off-by: Eric Dumazet <eduma...@google.com>
 Signed-off-by: David S. Miller <da...@davemloft.net>

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M  net
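The odd/even split the commit message describes can be shown with a toy allocator. This is an illustration of the strategy only, not the kernel code: explicit bind() walks odd candidates first and falls back to even ones, leaving even ports mostly to connect()-time autobind (__inet_hash_connect) so the two allocation paths collide less often.

```python
# Toy model of bind()-side ephemeral port selection: prefer odd ports,
# fall back to even ports once the odd ones are gone.
def bind_pick(free_ports):
    for parity in (1, 0):              # odd candidates first, then even
        for p in sorted(free_ports):
            if p % 2 == parity:
                free_ports.discard(p)  # port is now in use
                return p
    return None                        # range exhausted

free = set(range(10000, 10010))
picks = [bind_pick(free) for _ in range(4)]
print(picks)  # -> [10001, 10003, 10005, 10007]
```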


I will be happy to te

e1000e hardware unit hangs

2018-01-23 Thread Ben Greear
, trans_start: 4294748730, wd-timeout: 5000 
jiffies: 4294759424 tx-queues: 1

Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: 
Reset adapter unexpectedly
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: 
Reset adapter unexpectedly
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout: 5000 
jiffies: 4294771200 tx-queues: 1
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout: 5000 
jiffies: 4294771200 tx-queues: 1

Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: 
Reset adapter unexpectedly
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: 
Reset adapter unexpectedly
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: 
Detected Hardware Unit Hang:
  TDH  
  TDT  
...
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx


Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/23/2018 03:21 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote:

On 01/23/2018 02:29 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote:

On 01/23/2018 02:07 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:



I will be happy to test patches or try to get any other r

Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/23/2018 02:29 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote:

On 01/23/2018 02:07 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:



I will be happy to test patches or try to get any other results that might help diagnose this problem better.


Problem is I do not see anything obvious here.

P

Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/23/2018 02:07 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:



I will be happy to test patches or try to get any other results that might help diagnose this problem better.


Problem is I do not see anything obvious here.

Please provide /proc/sys/net/ipv4/ip_local_port_range


[root@lf1003-e3v2-13100124-f20x64 ~]#
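The command output above was truncated by the archive. For reference, a sketch of reading (and, as root, widening) the range the allocator draws from; the widened values below are examples only, not a recommendation:

```shell
# Current ephemeral port range used for bind(0)/autobind; prints two
# numbers, the low and high end of the range.
cat /proc/sys/net/ipv4/ip_local_port_range

# Widening the range increases the port pool available per source IP
# (root required); example values only:
#   sysctl -w net.ipv4.ip_local_port_range="10000 61001"
```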

Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-23 Thread Ben Greear

On 01/22/2018 10:46 AM, Josh Hunt wrote:

On Mon, Jan 22, 2018 at 10:30 AM, Ben Greear <gree...@candelatech.com> wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:


On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:




Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



I am sending to self, but over external network interfaces, by using routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k connections.  In this case, I have a pair of 10G ports doing 15k, and then I try to start 5k on two of the 1G ports:

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly



Ben

We had an interface doing this and grabbing these commits resolved it for us:

4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts
19110cfbb34d e1000e: Separate signaling for link check/link up
d3509f8bc7b0 e1000e: Fix return value test
65a29da1f5fd e1000e: Fix wrong comment related to link detection
c4c40e51f9c3 e1000e: Fix error path in link detection

They are in the LTS kernels now, but don't believe they were when we first hit this problem.
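One way to check whether a given tree already carries those fixes is `git merge-base --is-ancestor`. A small sketch; the kernel tree path, the helper name, and the output format are assumptions, while the commit IDs are the ones listed above:

```shell
# Report whether each fix commit is an ancestor of HEAD in a git tree.
has_fix() {
    # $1 = path to a kernel git tree, $2 = commit id
    if git -C "$1" merge-base --is-ancestor "$2" HEAD 2>/dev/null; then
        echo "$2: present"
    else
        echo "$2: missing"
    fi
}

# /path/to/linux is a placeholder for your checked-out kernel tree.
for c in 4aea7a5c5e94 19110cfbb34d d3509f8bc7b0 65a29da1f5fd c4c40e51f9c3; do
    has_fix /path/to/linux "$c"
done
```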


Thanks a lot for the suggestions, I can confirm that these patches applied to my 4.13.16+ tree do indeed seem to fix the problem.

Thanks,
Ben



Josh




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



Hello Eric, looks like it is one of your commits that causes the issue
I see.

Here are some more details on my specific test case I used to bisect:

I have two ixgbe ports looped back, configured on same subnet, but with 
different IPs.
Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client 
and server
side let me send-to-self over the external looped cable.

I have 2 mac-vlans on each physical interface.

I created 5 server-side connections on one physical port, and two more on one 
of the mac-vlans.

On the client-side, I create a process that spawns 5000 connections to the 
corresponding server side.

End result is 25,000 connections on one pair of real interfaces, and 10,000 
connections on the
mac-vlan ports.

In the passing case, I get very close to all 5000 connections on all endpoints 
quickly.

In the failing case, I get a max of around 16k connections on the two physical 
ports.  The two mac-vlans have 10k connections
across them working reliably.  It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD)= 0
fcntl(2075, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD)= 0
fcntl(2076, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign 
requested address)
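
The per-socket setup in the strace above (optional SO_BINDTODEVICE, SO_REUSEADDR, bind to a fixed source IP with port 0, then non-blocking connect) can be sketched in Python. This is a minimal sketch: `eth4` and the addresses come from the trace, SO_BINDTODEVICE needs CAP_NET_RAW so the default path skips it, and the constant fallback value 25 is the Linux one.

```python
import socket

# Linux-only option; older Python builds don't expose it, 25 is the
# value from <asm/socket.h> on Linux.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)

def make_client_socket(src_ip, dev=None):
    """Recreate the per-connection setup from the strace: optionally
    bind to a device (e.g. "eth4", requires CAP_NET_RAW), set
    SO_REUSEADDR, bind to a fixed source IP with port 0, then switch
    to non-blocking so connect() returns EINPROGRESS."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if dev is not None:
        s.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE,
                     dev.encode() + b"\x00")
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((src_ip, 0))  # port 0: kernel assigns an ephemeral port now
    s.setblocking(False)
    return s
```

Note that with this pattern the local port is chosen at bind() time, before the kernel knows the destination, so every connection consumes a distinct local port even though they all target the same remote endpoint; once usable local ports run out, the sequence fails with EADDRNOTAVAIL as in the trace. (Newer kernels offer IP_BIND_ADDRESS_NO_PORT to defer port selection to connect() for exactly this bind-then-connect pattern.)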



ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet <eduma...@google.com>
Date:   Thu Feb 11 16:28:50 2016 -0800

tcp/dccp: better use of ephemeral ports in bind()

Implement strategy used in __inet_hash_connect() in opposite way :

Try to find a candidate using odd ports, then fallback to even ports.

We no longer disable BH for whole traversal, but one bucket at a time.
We also use cond_resched() to yield cpu to other tasks if needed.

I removed one indentation level and tried to mirror the loop we have
in __inet_hash_connect() and variable names to ease code maintenance.

Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: David S. Miller <da...@davemloft.net>

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 
e0c014ae8b7e2867256eff60f6210821d36eacef M  net
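
The odd-then-even walk described in the commit message can be sketched as follows. This is a simplification: the real inet_csk_get_port() hashes ports into buckets, takes per-bucket locks, and calls cond_resched(); only the parity ordering from the commit message is modeled here, and the default range is the usual Linux ip_local_port_range.

```python
def pick_ephemeral_port(in_use, low=32768, high=60999):
    """Sketch of the bind()-side allocation order from commit
    ea8add2b1903: walk odd ports first, then fall back to even ones
    (the opposite order from the connect()-side __inet_hash_connect(),
    per the commit message), returning the first free port."""
    for parity in (1, 0):                  # odd ports first, then even
        port = low + ((low ^ parity) & 1)  # first port >= low with that parity
        while port <= high:
            if port not in in_use:
                return port
            port += 2
    return None                            # whole range exhausted
```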


I will be happy to test patches or try to get any other results that might help 
diagnose
this problem better.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Ben Greear

On 01/22/2018 10:30 AM, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



I am sending to self, but over external network interfaces, by using
routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
connections.  In this case, I have a pair of 10G ports doing 15k, and then
I try to start 5k on two of the 1G ports

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 
(e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: 
Reset adapter unexpectedly


System reports 10+GB RAM free in this case, btw.

Actually, maybe the good kernel was even older than 4.7...I see same resets and 
inability to do a full 20k
connections on 4.7 too.   I double-checked with system-test and it seems 4.4 
was a good kernel.  I'll test
that next.  Here is splat from 4.7:

[  238.921679] [ cut here ]
[  238.921689] WARNING: CPU: 0 PID: 3 at 
/home/greearb/git/linux-bisect/net/sched/sch_generic.c:272 
dev_watchdog+0xd4/0x12f
[  238.921690] NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out
[  238.921691] Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink 
nf_defrag_ipv4 cfg80211 macvlan pktgen bnep bluetooth fuse coretemp intel_rapl
ftdi_sio x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt 
iTCO_vendor_support joydev ie31200_edac ipmi_devintf irqbypass serio_raw 
ipmi_si edac_core
shpchp fjes video i2c_i801 tpm_tis lpc_ich ipmi_msghandler tpm nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core
e1000e ixgbe mdio hwmon dca ptp pps_core ipv6 [last unloaded: nf_conntrack]
[  238.921720] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.7.0 #62
[  238.921721] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 
09/17/2012
[  238.921723]   88041cdd7cd8 81352a23 
88041cdd7d28
[  238.921725]   88041cdd7d18 810ea5dd 
01101cdd7d90
[  238.921727]  880417a84000 0100 8163ecff 
880417a84440
[  238.921728] Call Trace:
[  238.921733]  [] dump_stack+0x61/0x7d
[  238.921736]  [] __warn+0xbd/0xd8
[  238.921738]  [] ? netif_tx_lock+0x81/0x81
[  238.921740]  [] warn_slowpath_fmt+0x46/0x4e
[  238.921741]  [] ? netif_tx_lock+0x74/0x81
[  238.921743]  [] dev_watchdog+0xd4/0x12f
[  238.921746]  [] call_timer_fn+0x65/0x11b
[  238.921748]  [] ? netif_tx_lock+0x81/0x81
[  238.921749]  [] run_timer_softirq+0x1ad/0x1d7
[  238.921751]  [] __do_softirq+0xfb/0x25c
[  238.921752]  [] run_ksoftirqd+0x19/0x35
[  238.921755]  [] smpboot_thread_fn+0x169/0x1a9
[  238.921756]  [] ? sort_range+0x1d/0x1d
[  238.921759]  [] kthread+0xa0/0xa8
[  238.921763]  [] ret_from_fork+0x1f/0x40
[  238.921764]  [] ? init_completion+0x24/0x24
[  238.921765] ---[ end trace 933912956c6ee5ff ]---
[  238.961672] e1000e :07:00.0 eth3: Reset adapter unexpectedly


So, on 4.4.8+, I see this and other splats related to e1000e.  I guess that is 
a separate
issue.  I can easily start 40k connections however, 30k across the two 10G 
ports,
and 10k more across a pair of mac-vlans on the 10G ports (since I was out of
address space to add a full 40k on the two physical ports).


Looks like the e1000e problem is a separate issue, so

Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Ben Greear

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



I am sending to self, but over external network interfaces, by using
routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
connections.  In this case, I have a pair of 10G ports doing 15k, and then
I try to start 5k on two of the 1G ports

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 
(e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: 
Reset adapter unexpectedly


System reports 10+GB RAM free in this case, btw.

Actually, maybe the good kernel was even older than 4.7...I see same resets and 
inability to do a full 20k
connections on 4.7 too.   I double-checked with system-test and it seems 4.4 
was a good kernel.  I'll test
that next.  Here is splat from 4.7:

[  238.921679] [ cut here ]
[  238.921689] WARNING: CPU: 0 PID: 3 at 
/home/greearb/git/linux-bisect/net/sched/sch_generic.c:272 
dev_watchdog+0xd4/0x12f
[  238.921690] NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out
[  238.921691] Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 macvlan pktgen bnep bluetooth fuse coretemp intel_rapl 
ftdi_sio x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt iTCO_vendor_support joydev ie31200_edac ipmi_devintf irqbypass serio_raw ipmi_si edac_core 
shpchp fjes video i2c_i801 tpm_tis lpc_ich ipmi_msghandler tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core 
e1000e ixgbe mdio hwmon dca ptp pps_core ipv6 [last unloaded: nf_conntrack]

[  238.921720] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.7.0 #62
[  238.921721] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 
09/17/2012
[  238.921723]   88041cdd7cd8 81352a23 
88041cdd7d28
[  238.921725]   88041cdd7d18 810ea5dd 
01101cdd7d90
[  238.921727]  880417a84000 0100 8163ecff 
880417a84440
[  238.921728] Call Trace:
[  238.921733]  [] dump_stack+0x61/0x7d
[  238.921736]  [] __warn+0xbd/0xd8
[  238.921738]  [] ? netif_tx_lock+0x81/0x81
[  238.921740]  [] warn_slowpath_fmt+0x46/0x4e
[  238.921741]  [] ? netif_tx_lock+0x74/0x81
[  238.921743]  [] dev_watchdog+0xd4/0x12f
[  238.921746]  [] call_timer_fn+0x65/0x11b
[  238.921748]  [] ? netif_tx_lock+0x81/0x81
[  238.921749]  [] run_timer_softirq+0x1ad/0x1d7
[  238.921751]  [] __do_softirq+0xfb/0x25c
[  238.921752]  [] run_ksoftirqd+0x19/0x35
[  238.921755]  [] smpboot_thread_fn+0x169/0x1a9
[  238.921756]  [] ? sort_range+0x1d/0x1d
[  238.921759]  [] kthread+0xa0/0xa8
[  238.921763]  [] ret_from_fork+0x1f/0x40
[  238.921764]  [] ? init_completion+0x24/0x24
[  238.921765] ---[ end trace 933912956c6ee5ff ]---
[  238.961672] e1000e :07:00.0 eth3: Reset adapter unexpectedly


Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Ben Greear

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



fm10k cannot get link

2017-10-31 Thread Ben Greear

Hello,

We're trying to get an Intel 100G NIC to work, and so far, cannot get it to 
link.

The cable is:  X0016I4AO3 QSFP28 10Gtek  (any suggestions for a 
better/different one?)

[5.022681] fm10k :05:00.0: PCI Express bandwidth of 64GT/s available
[5.022683] fm10k :05:00.0: (Speed:8.0GT/s, Width: x8, Encoding 
Loss:<2%, Payload:256B)
[5.022684] fm10k :05:00.0: 00:e0:ed:54:78:f2
[5.027864] fm10k :06:00.0: PCI Express bandwidth of 64GT/s available
[5.027865] fm10k :06:00.0: (Speed:8.0GT/s, Width: x8, Encoding 
Loss:<2%, Payload:256B)
[5.027866] fm10k :06:00.0: 00:e0:ed:54:78:f3
[6.057950] Modules linked in: ioatdma(+) shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm igb drm i2c_algo_bit i2c_core ixgbe mdio 
hwmon fm10k ptp pps_core dca fjes ipv6 crc_ccitt

[7.294441] fm10k :05:00.0 eth0.r: renamed from eth0
[   14.044914] fm10k :05:00.0 eth2: renamed from eth0.r
[   14.107798] fm10k :06:00.0 eth1.r: renamed from eth1
[   14.178217] fm10k :06:00.0 eth3: renamed from eth1.r


[root@lf1005c-is14120020 ~]# ethtool eth3
Settings for eth3:
Current message level: 0x0007 (7)
   drv probe link
Link detected: no


[root@lf1005c-is14120020 ~]# uname -a
Linux lf1005c-is14120020 4.9.29+ #46 SMP PREEMPT Wed Jul 26 17:48:57 PDT 2017 
x86_64 x86_64 x86_64 GNU/Linux

[root@lf1005c-is14120020 ~]# ethtool -i eth3
driver: fm10k
version: 0.21.2-k
firmware-version:
bus-info: :06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

[root@lf1005c-is14120020 ~]# lspci|grep 06
06:00.0 Ethernet controller: Intel Corporation Device 15a4


Please let me know if you have any suggestions.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Ethtool question

2017-10-16 Thread Ben Greear

On 10/12/2017 03:00 PM, Roopa Prabhu wrote:

On Thu, Oct 12, 2017 at 2:45 PM, Ben Greear <gree...@candelatech.com> wrote:

On 10/11/2017 01:49 PM, David Miller wrote:


From: "John W. Linville" <linvi...@tuxdriver.com>
Date: Wed, 11 Oct 2017 16:44:07 -0400


On Wed, Oct 11, 2017 at 09:51:56AM -0700, Ben Greear wrote:


I noticed today that setting some ethtool settings to the same value
returns an error code.  I would think this should silently return
success instead?  Makes it easier to call it from scripts this way:

[root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1
combined unmodified, ignoring
no channel parameters changed, aborting
current values: tx 0 rx 0 other 1 combined 1
[root@lf0313-6477 lanforge]# echo $?
1



I just had this discussion a couple of months ago with someone. My
initial feeling was like you, a no-op is not a failure. But someone
convinced me otherwise...I will now endeavour to remember who that
was and how they convinced me...

Anyone else have input here?



I guess this usually happens when drivers don't support changing the
settings at all.  So they just make their ethtool operation for the
'set' always return an error.

We could have a generic ethtool helper that does "get" and then if the
"set" request is identical just return zero.

But from another perspective, the error returned from the "set" in this
situation also indicates to the user that the driver does not support
the "set" operation which has value and meaning in and of itself.  And
we'd lose that with the given suggestion.



In my case, the driver (igb) does support the set, my program just made the
same
ethtool call several times and it fails after the initial change (that
actually
changes something), as best as I can figure.



This error is returned by ethtool user-space. It does a get, check and
then set if user has requested changes.



So, should we fix ethtool to return 0 in this case instead of an error code?

I think so.  If the driver itself returns an error, then probably return the
error code and/or fix the driver as seems appropriate.
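
Until ethtool itself changes, a script can classify the "nothing changed" failure by its message and treat it as success. A sketch, assuming ethtool's current output strings ("unmodified" / "no channel parameters changed") stay stable:

```python
import subprocess

def is_noop_failure(output):
    """True when ethtool failed only because the requested values
    already matched, e.g. 'combined unmodified, ignoring' or
    'no channel parameters changed, aborting'."""
    return ("unmodified" in output
            or "no channel parameters changed" in output)

def set_channels(dev, combined):
    """Run `ethtool -L <dev> combined <n>`; return True on success or
    on a no-op, raise on real errors."""
    proc = subprocess.run(
        ["ethtool", "-L", dev, "combined", str(combined)],
        capture_output=True, text=True)
    if proc.returncode == 0 or is_noop_failure(proc.stdout + proc.stderr):
        return True
    raise RuntimeError((proc.stdout + proc.stderr).strip())
```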

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Ethtool question

2017-10-12 Thread Ben Greear

On 10/11/2017 01:49 PM, David Miller wrote:

From: "John W. Linville" <linvi...@tuxdriver.com>
Date: Wed, 11 Oct 2017 16:44:07 -0400


On Wed, Oct 11, 2017 at 09:51:56AM -0700, Ben Greear wrote:

I noticed today that setting some ethtool settings to the same value
returns an error code.  I would think this should silently return
success instead?  Makes it easier to call it from scripts this way:

[root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1
combined unmodified, ignoring
no channel parameters changed, aborting
current values: tx 0 rx 0 other 1 combined 1
[root@lf0313-6477 lanforge]# echo $?
1


I just had this discussion a couple of months ago with someone. My
initial feeling was like you, a no-op is not a failure. But someone
convinced me otherwise...I will now endeavour to remember who that
was and how they convinced me...

Anyone else have input here?


I guess this usually happens when drivers don't support changing the
settings at all.  So they just make their ethtool operation for the
'set' always return an error.

We could have a generic ethtool helper that does "get" and then if the
"set" request is identical just return zero.

But from another perspective, the error returned from the "set" in this
situation also indicates to the user that the driver does not support
the "set" operation which has value and meaning in and of itself.  And
we'd lose that with the given suggestion.


In my case, the driver (igb) does support the set, my program just made the same
ethtool call several times and it fails after the initial change (that actually
changes something), as best as I can figure.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Ethtool question

2017-10-11 Thread Ben Greear

I noticed today that setting some ethtool settings to the same value
returns an error code.  I would think this should silently return
success instead?  Makes it easier to call it from scripts this way:

[root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1
combined unmodified, ignoring
no channel parameters changed, aborting
current values: tx 0 rx 0 other 1 combined 1
[root@lf0313-6477 lanforge]# echo $?
1

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-26 Thread Ben Greear

On 09/12/2017 01:26 PM, Michal Kubecek wrote:

On Tue, Sep 12, 2017 at 11:54:43AM -0700, Ben Greear wrote:

It does not appear to work on Fedora-26, and I'm curious if someone
knows what needs doing to get this support working?


It's rather complicated. The "vlan" and "vlan " filters didn't
handle the case when vlan information is passed in metadata until commit
04660eb1e561 ("Use BPF extensions in compiled filters"), i.e. libpcap
1.7.0. Unfortunately that commit made libpcap always check only metadata
for the first outermost vlan tag so that it broke the case when vlan
information is passed in packet itself (which is less frequent today).

To handle both cases correctly, you would need libpcap with commits
d739b068ac29 ("Make VLAN filter handle both metadata and inline tags")
and 7c7a19fbd9af ("Fix logic of combined VLAN test") and also the
optimizer fix from

  https://github.com/the-tcpdump-group/libpcap/pull/582/commits/075015a3d17a

(without it the filters generate incorrect BPF in some cases unless the
optimizer is disabled). As far as I can see, these commits are not in
any release yet.

   Michal Kubecek



So, I cloned the latest libpcap, and I'm going to start poking at this.

Do you happen to know if I need to do anything special other than
'pcap_compile()'?  I'm curious how the library would know if it can use
newer kernel API or not...or maybe it is somehow magically backwards/forward
compatible?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-12 Thread Ben Greear

On 09/12/2017 11:54 AM, Ben Greear wrote:

It does not appear to work on Fedora-26, and I'm curious if someone knows what 
needs
doing to get this support working?

Thanks,
Ben




Gah, I spoke too soon.  system-test guy says it works on cmd-line, but
not when we try to make it work in another way...could be local bug,
I'll poke at this more.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-12 Thread Ben Greear

It does not appear to work on Fedora-26, and I'm curious if someone knows what 
needs
doing to get this support working?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] Fix build on fedora-14 (and other older systems)

2017-09-03 Thread Ben Greear



On 09/03/2017 08:50 AM, Stephen Hemminger wrote:

On Sat,  2 Sep 2017 07:15:02 -0700
gree...@candelatech.com wrote:


diff --git a/include/linux/sysinfo.h b/include/linux/sysinfo.h
index 934335a..3596b02 100644
--- a/include/linux/sysinfo.h
+++ b/include/linux/sysinfo.h
@@ -3,6 +3,14 @@

 #include 

+/* So we can compile on older OSs, hopefully this is correct. --Ben */
+#ifndef __kernel_long_t
+typedef long __kernel_long_t;
+#endif
+#ifndef __kernel_ulong_t
+typedef unsigned long __kernel_ulong_t;
+#endif
+
 #define SI_LOAD_SHIFT  16
 struct sysinfo {
__kernel_long_t uptime; /* Seconds since boot */


I am not accepting this patch because all files in include/linux are 
automatically
regenerated from kernel 'make install_headers'. No exceptions. If you want to 
change
a header in include/linux it has to go through upstream kernel inclusion.


It would be wrong to add this to the actual kernel header I think.

Do you have another suggestion for fixing iproute2 compile?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: Problem compiling iproute2 on older systems

2017-09-02 Thread Ben Greear



On 09/02/2017 12:55 AM, Michal Kubecek wrote:

On Fri, Sep 01, 2017 at 04:52:20PM -0700, Ben Greear wrote:

In the patch below, usage of __kernel_ulong_t and __kernel_long_t is
introduced, but that is not available on older system (fedora-14, at least).

It is not a #define, so I am having trouble finding a quick hack
around this.

Any ideas on how to make this work better on older OSs running
modern kernels?


Author: Stephen Hemminger <step...@networkplumber.org>  2017-01-12 17:54:39
Committer: Stephen Hemminger <step...@networkplumber.org>  2017-01-12 17:54:39
Child:  c7ec7697e3f000359aa317394e6dd972e35c1f84 (Fix build on fedora-14 (and 
other older systems))
Branches: master, remotes/origin/master
Follows: v3.10.0
Precedes:

add more uapi header files

In order to ensure no backward/forward compatibility problems,
make sure that all kernel headers used come from the local copy.

Signed-off-by: Stephen Hemminger <step...@networkplumber.org>

--- include/linux/sysinfo.h ---
new file mode 100644
index 000..934335a
@@ -0,0 +1,24 @@
+#ifndef _LINUX_SYSINFO_H
+#define _LINUX_SYSINFO_H
+
+#include 
+
+#define SI_LOAD_SHIFT  16
+struct sysinfo {
+   __kernel_long_t uptime; /* Seconds since boot */
+   __kernel_ulong_t loads[3];  /* 1, 5, and 15 minute load averages */
+   __kernel_ulong_t totalram;  /* Total usable main memory size */
+   __kernel_ulong_t freeram;   /* Available memory size */
+   __kernel_ulong_t sharedram; /* Amount of shared memory */
+   __kernel_ulong_t bufferram; /* Memory used by buffers */
+   __kernel_ulong_t totalswap; /* Total swap space size */
+   __kernel_ulong_t freeswap;  /* swap space still available */
+   __u16 procs;/* Number of current processes */
+   __u16 pad;  /* Explicit padding for m68k */
+   __kernel_ulong_t totalhigh; /* Total high memory size */
+   __kernel_ulong_t freehigh;  /* Available high memory size */
+   __u32 mem_unit; /* Memory unit size in bytes */
+   char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)];   /* Padding: 
libc5 uses this.. */
+};
+
+#endif /* _LINUX_SYSINFO_H */


I've already been thinking about this a bit. Normally, we would simply
add the file where __kernel_long_t and __kernel_ulong_t are defined.
The problem is this is  which depends on
architecture - which is the point of these types.

Good thing is iproute2 doesn't actually use struct sysinfo anywhere so
we don't need to have them defined correctly. One possible workaround
would therefore be defining them as long and unsigned long. As long as
we don't use the types anywhere, we would be fine.

Another option would be to replace include/linux/sysinfo.h with an empty
file. The problem I can see with this is that if someone uses a script
to refresh all copies of uapi headers automatically, the script would
have to be aware that it must not update this file and preserve the fake
empty one.


I just sent a patch that appears to compile on all of my build systems, which 
are
generally fedora-14 to fedora-24 currently.

I haven't actually tested functionality yet, but if you say it is unused, then
it is very likely to be OK, and even if not, I think it will be fine unless
someone is trying to cross-compile.  And in that case, probably more than one
issue involved...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Problem compiling iproute2 on older systems

2017-09-01 Thread Ben Greear

In the patch below, usage of __kernel_ulong_t and __kernel_long_t is
introduced, but that is not available on older system (fedora-14, at least).

It is not a #define, so I am having trouble finding a quick hack
around this.

Any ideas on how to make this work better on older OSs running
modern kernels?


Author: Stephen Hemminger <step...@networkplumber.org>  2017-01-12 17:54:39
Committer: Stephen Hemminger <step...@networkplumber.org>  2017-01-12 17:54:39
Child:  c7ec7697e3f000359aa317394e6dd972e35c1f84 (Fix build on fedora-14 (and 
other older systems))
Branches: master, remotes/origin/master
Follows: v3.10.0
Precedes:

add more uapi header files

In order to ensure no backward/forward compatibility problems,
make sure that all kernel headers used come from the local copy.

Signed-off-by: Stephen Hemminger <step...@networkplumber.org>

--- include/linux/sysinfo.h ---
new file mode 100644
index 000..934335a
@@ -0,0 +1,24 @@
+#ifndef _LINUX_SYSINFO_H
+#define _LINUX_SYSINFO_H
+
+#include 
+
+#define SI_LOAD_SHIFT  16
+struct sysinfo {
+   __kernel_long_t uptime; /* Seconds since boot */
+   __kernel_ulong_t loads[3];  /* 1, 5, and 15 minute load averages */
+   __kernel_ulong_t totalram;  /* Total usable main memory size */
+   __kernel_ulong_t freeram;   /* Available memory size */
+   __kernel_ulong_t sharedram; /* Amount of shared memory */
+   __kernel_ulong_t bufferram; /* Memory used by buffers */
+   __kernel_ulong_t totalswap; /* Total swap space size */
+   __kernel_ulong_t freeswap;  /* swap space still available */
+   __u16 procs;/* Number of current processes */
+   __u16 pad;  /* Explicit padding for m68k */
+   __kernel_ulong_t totalhigh; /* Total high memory size */
+   __kernel_ulong_t freehigh;  /* Available high memory size */
+   __u32 mem_unit; /* Memory unit size in bytes */
+   char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)];   /* Padding: 
libc5 uses this.. */
+};
+
+#endif /* _LINUX_SYSINFO_H */


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers

2017-08-16 Thread Ben Greear

On 08/16/2017 08:18 PM, Dan Williams wrote:

On Wed, 2017-08-16 at 19:36 -0700, Ben Greear wrote:

On 08/16/2017 07:11 PM, Dan Williams wrote:

On Wed, 2017-08-16 at 14:31 -0700, David Miller wrote:

From: Dan Williams <d...@redhat.com>
Date: Wed, 16 Aug 2017 16:22:41 -0500


My biggest suggestion is that perhaps bonding should grow


hysteresis

for link speeds. Since WiFi can change speed every packet, you


probably

don't want the bond characteristics changing every couple
seconds


just

in case your WiFi link is jumping around.  Ethernet won't
bounce


around

that much, so the hysteresis would have no effect there.  Or,
if


people

are concerned about response time to speed changes on ethernet


(where

you probably do want an instant switch-over) some new flag to


indicate

that certain devices don't have stable speeds over time.


Or just report the average of the range the wireless link can
hit,
and
be done with it.

I think you guys are overcomplicating things.


That range can be from 1 to > 800Mb/s.  No, it won't usually be all
over that range, but it won't be uncommon to fluctuate by hundreds
of
Mb/s.  I'm not sure a simple average is really the answer
here.  Even
doing that would require new knobs to ethtool, since the rate
depends
heavily on card capabilities and also what AP you're connected to
*at
that moment*.  If you roam to another AP, then the max speed can
certainly change.

You'll probably say "aim for the 75% case" or something like that,
which is fine, but then you're depending on your 75% case to be (a)
single AP, (b) never move (eg, only bond wifi + ethernet), (c)
little
radio interference.  I'm not sure I'd buy that.  If I've put words
in
your mouth, forgive me.


If you keep ethtool API simple and just return the last (rx-rate +
tx-rate) / 2, or the rate averaged
over the last 100 frames or 10 seconds, then the caller can do longer
term averaging
as it sees fit.  Probably no need for lots of averaging complexity in
the kernel.


Yeah, that works too, but I was thinking it was better to present the
actual data through ethtool so that things other than bonding could use
it, and since bonding is the thing that actually cares about the
fluctuation, make it do the more extensive processing.


What do you mean by 'actual data'?  If you want to know the most accurate
transmit/rx rate info, then you need to pay attention to each and every frame's 
tx/rx rate, as
well as its ampdu/amsdu, retries, etc.  It is virtually impossible.

So, you will have to settle for something less...  I suggest something simple
to calculate, similar to existing values that are available via debugfs and/or
'iw dev foo station dump', etc.  Let higher layers manipulate the raw data
as they see fit (they can query ethtool as often as they like).

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers

2017-08-16 Thread Ben Greear

On 08/16/2017 07:11 PM, Dan Williams wrote:

On Wed, 2017-08-16 at 14:31 -0700, David Miller wrote:

From: Dan Williams <d...@redhat.com>
Date: Wed, 16 Aug 2017 16:22:41 -0500


My biggest suggestion is that perhaps bonding should grow

hysteresis

for link speeds. Since WiFi can change speed every packet, you

probably

don't want the bond characteristics changing every couple seconds

just

in case your WiFi link is jumping around.  Ethernet won't bounce

around

that much, so the hysteresis would have no effect there.  Or, if

people

are concerned about response time to speed changes on ethernet

(where

you probably do want an instant switch-over) some new flag to

indicate

that certain devices don't have stable speeds over time.


Or just report the average of the range the wireless link can hit,
and
be done with it.

I think you guys are overcomplicating things.


That range can be from 1 to > 800Mb/s.  No, it won't usually be all
over that range, but it won't be uncommon to fluctuate by hundreds of
Mb/s.  I'm not sure a simple average is really the answer here.  Even
doing that would require new knobs to ethtool, since the rate depends
heavily on card capabilities and also what AP you're connected to *at
that moment*.  If you roam to another AP, then the max speed can
certainly change.

You'll probably say "aim for the 75% case" or something like that,
which is fine, but then you're depending on your 75% case to be (a)
single AP, (b) never move (eg, only bond wifi + ethernet), (c) little
radio interference.  I'm not sure I'd buy that.  If I've put words in
your mouth, forgive me.


If you keep the ethtool API simple and just return the last (rx-rate + tx-rate) / 2,
or the rate averaged over the last 100 frames or 10 seconds, then the caller can
do longer term averaging as it sees fit.  Probably no need for lots of averaging
complexity in the kernel.

rate-ctrl for wifi basically doesn't happen until you transmit or receive a
fairly steady stream, so it will fluctuate a lot.

Thanks,
Ben



Dan




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-20 Thread Ben Greear

On 06/20/2017 11:05 AM, Michal Kubecek wrote:

On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:

On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.


You might try trace_printk() which should have less impact (don't forget
to enable /proc/sys/kernel/ftrace_dump_on_oops).


We cannot reproduce with trace_printk() either.

Thanks,
Ben



Michal Kubecek




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-20 Thread Ben Greear



On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-14 Thread Ben Greear

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no longer
reproduce the problem.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-13 Thread Ben Greear

On 06/13/2017 01:28 PM, David Ahern wrote:

On 6/13/17 2:16 PM, Ben Greear wrote:

On 06/09/2017 02:25 PM, Eric Dumazet wrote:

On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote:

On 6/8/17 11:55 PM, Cong Wang wrote:

On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear <gree...@candelatech.com>
wrote:


As far as I can tell, the patch did not help, or at least we still
reproduce
the
crash easily.


netlink dump is serialized by nlk->cb_mutex so I don't think that
patch makes any sense w.r.t race condition.


From what I can see fn_sernum should be accessed under table lock, so
when saving and checking it during a walk make sure it the lock is held.
That has nothing to do with the netlink dump, but the table changing
during a walk.



Yes, your patch makes total sense, of course.


I guess someone should go ahead and make an official patch and
submit it, even if it doesn't fix my problem.


I can do that; was hoping to root cause the problem first.





(gdb) l *(fib6_walk_continue+0x76)
0x188c6 is in fib6_walk_continue
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593).
1588                    if (fn == w->root)
1589                            return 0;
1590                    pn = fn->parent;
1591                    w->node = pn;
1592    #ifdef CONFIG_IPV6_SUBTREES
1593                    if (FIB6_SUBTREE(pn) == fn) {


Apparently fn->parent is NULL here for some reason, but
I don't know if that is expected or not. If a simple NULL check
is not enough here, we have to trace why it is NULL.


From my understanding, parent should not be null hence the attempts to
fix access to table nodes under a lock. ie., figuring out why it is null
here.


If someone has more suggestions, I'll be happy to test.


I have looked at the code again and nothing is jumping out. Will look
again later today.



I noticed there is some code to help fix up the walkers when nodes are deleted.
They use the lock:  read_lock(&net->ipv6.fib6_walker_lock);

The code you were tweaking uses a different lock:  read_lock_bh(&table->tb6_lock);

It is certainly not simple code, so I don't know if that is correct or not, but
it might possibly be a place to start looking.

I'm going to re-test with a WARN_ON to see if that triggers, since the previous
suggestion was that fn->parent was NULL.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 51cd637..86295df 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1571,6 +1571,10 @@ static int fib6_walk_continue(struct fib6_walker *w)
 	case FWS_U:
 		if (fn == w->root)
 			return 0;
+		if (!fn->parent) {
+			WARN_ON_ONCE(1);
+			return 0;
+		}
 		pn = fn->parent;
 		w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES


Thanks,
Ben

Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-13 Thread Ben Greear

On 06/09/2017 02:25 PM, Eric Dumazet wrote:

On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote:

On 6/8/17 11:55 PM, Cong Wang wrote:

On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear <gree...@candelatech.com> wrote:


As far as I can tell, the patch did not help, or at least we still reproduce
the
crash easily.


netlink dump is serialized by nlk->cb_mutex so I don't think that
patch makes any sense w.r.t race condition.


From what I can see fn_sernum should be accessed under table lock, so
when saving and checking it during a walk make sure it the lock is held.
That has nothing to do with the netlink dump, but the table changing
during a walk.



Yes, your patch makes total sense, of course.


I guess someone should go ahead and make an official patch and
submit it, even if it doesn't fix my problem.


(gdb) l *(fib6_walk_continue+0x76)
0x188c6 is in fib6_walk_continue
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593).
1588                    if (fn == w->root)
1589                            return 0;
1590                    pn = fn->parent;
1591                    w->node = pn;
1592    #ifdef CONFIG_IPV6_SUBTREES
1593                    if (FIB6_SUBTREE(pn) == fn) {


Apparently fn->parent is NULL here for some reason, but
I don't know if that is expected or not. If a simple NULL check
is not enough here, we have to trace why it is NULL.


From my understanding, parent should not be null hence the attempts to
fix access to table nodes under a lock. ie., figuring out why it is null
here.


If someone has more suggestions, I'll be happy to test.

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-08 Thread Ben Greear
(gdb) l *(inet6_dump_fib+0x1ab)
0x1939b is in inet6_dump_fib
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:392).
387                             w->skip = w->count;
388                     } else
389                             w->skip = 0;
390
391                     res = fib6_walk_continue(w);
392                     read_unlock_bh(&table->tb6_lock);
393                     if (res <= 0) {
394                             fib6_walker_unlink(net, w);
395                             cb->args[4] = 0;
396                     }
(gdb)

[greearb@ben-dt3 linux-2.6]$ git diff
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d4bf2c6..4e32a16 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -372,12 +372,13 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 
 		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk(net, w);
-		read_unlock_bh(&table->tb6_lock);
 		if (res > 0) {
 			cb->args[4] = 1;
 			cb->args[5] = w->root->fn_sernum;
 		}
+		read_unlock_bh(&table->tb6_lock);
 	} else {
+		read_lock_bh(&table->tb6_lock);
 		if (cb->args[5] != w->root->fn_sernum) {
 			/* Begin at the root if the tree changed */
 			cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 		} else
 			w->skip = 0;
 
-		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk_continue(w);
 		read_unlock_bh(&table->tb6_lock);
 		if (res <= 0) {


Thanks,
Ben



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-06 Thread Ben Greear

On 06/06/2017 05:27 PM, Eric Dumazet wrote:

On Tue, 2017-06-06 at 18:00 -0600, David Ahern wrote:

On 6/6/17 3:06 PM, Ben Greear wrote:

This bug has been around forever, and we recently got an intern and
stuck him with
trying to reproduce it on the latest kernel.  It is still here.  I'm not
super excited
about trying to fix this, but we can easily test patches if someone has a
patch to try.


Can you try this (whitespace damaged on paste, but it is moving the lock
ahead of the fn_sernum check):

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index deea901746c8..7a44c49055c0 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -378,6 +378,7 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 			cb->args[5] = w->root->fn_sernum;
 		}
 	} else {
+		read_lock_bh(&table->tb6_lock);
 		if (cb->args[5] != w->root->fn_sernum) {
 			/* Begin at the root if the tree changed */
 			cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 		} else
 			w->skip = 0;
 
-		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk_continue(w);
 		read_unlock_bh(&table->tb6_lock);
 		if (res <= 0) {



Good catch, but it looks like similar fix is needed a few lines before.


We will test this tomorrow.

Thanks,
Ben




diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index deea901746c8570c5e801e40592c91e3b62812e0..b214443dc8346cef3690df7f27cc48a864028865 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -372,12 +372,13 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 
 		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk(net, w);
-		read_unlock_bh(&table->tb6_lock);
 		if (res > 0) {
 			cb->args[4] = 1;
 			cb->args[5] = w->root->fn_sernum;
 		}
+		read_unlock_bh(&table->tb6_lock);
 	} else {
+		read_lock_bh(&table->tb6_lock);
 		if (cb->args[5] != w->root->fn_sernum) {
 			/* Begin at the root if the tree changed */
 			cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 		} else
 			w->skip = 0;
 
-		read_lock_bh(&table->tb6_lock);
 		res = fib6_walk_continue(w);
 		read_unlock_bh(&table->tb6_lock);
 		if (res <= 0) {





--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-06 Thread Ben Greear

Hello,

This bug has been around forever, and we recently got an intern and stuck him 
with
trying to reproduce it on the latest kernel.  It is still here.  I'm not super 
excited
about trying to fix this, but we can easily test patches if someone has a
patch to try.

Test case is to create 1000 mac-vlans and bring them up, with user-space
tools running lots of 'dump' related commands as part of bringing up the
interfaces and configuring some special source-based routing tables.

(gdb) l *(inet6_dump_fib+0x109)
0x192f9 is in inet6_dump_fib 
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:392).
387                     } else
388                             w->skip = 0;
389
390                     read_lock_bh(&table->tb6_lock);
391                     res = fib6_walk_continue(w);
392                     read_unlock_bh(&table->tb6_lock);
393                     if (res <= 0) {
394                             fib6_walker_unlink(net, w);
395                             cb->args[4] = 0;
396                     }

(gdb) l *(fib6_walk_continue+0x76)
0x188c6 is in fib6_walk_continue 
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593).
1588                    if (fn == w->root)
1589                            return 0;
1590                    pn = fn->parent;
1591                    w->node = pn;
1592    #ifdef CONFIG_IPV6_SUBTREES
1593                    if (FIB6_SUBTREE(pn) == fn) {
1594                            WARN_ON(!(fn->fn_flags & RTN_ROOT));
1595                            w->state = FWS_L;
1596                            continue;
1597                    }

[root@ct524-ffb0 ~]# BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: fib6_walk_continue+0x76/0x180 [ipv6]
PGD 3d9226067
P4D 3d9226067
PUD 3d9020067
PMD 0

Oops:  [#1] PREEMPT SMP
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c bnep fuse macvlan pktgen cfg80211 ipmi_ssif iTCO_wdt iTCO_vendor_support 
coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass joydev i2c_i801 ie31200_edac intel_pch_thermal shpchp hci_uart ipmi_si btbcm 
btqca ipmi_devintf btintel ipmi_msghandler bluetooth pinctrl_sunrisepoint acpi_als pinctrl_intel video tpm_tis intel_lpss_acpi kfifo_buf tpm_tis_core intel_lpss 
industrialio tpm acpi_pad acpi_power_meter sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca 
i2c_algo_bit i2c_hid i2c_core ipv6 crc_ccitt [last unloaded: nf_conntrack]

CPU: 1 PID: 996 Comm: ip Not tainted 4.12.0-rc4+ #32
Hardware name: Supermicro Super Server/X11SSM-F, BIOS 1.0b 12/29/2015
task: 8803d4d61dc0 task.stack: c9000970c000
RIP: 0010:fib6_walk_continue+0x76/0x180 [ipv6]
RSP: 0018:c9000970fbb8 EFLAGS: 00010283
RAX: 8803de84b020 RBX: 8803e0756f00 RCX: 
RDX:  RSI: c9000970fc00 RDI: 81eee280
RBP: c9000970fbc0 R08: 0008 R09: 8803d4fbbf31
R10: c9000970fb68 R11:  R12: 0001
R13: 0001 R14: 8803e0756f00 R15: 8803d9345b18
FS:  7f32ca4ec700() GS:88047784() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0018 CR3: 0003ddacc000 CR4: 003406e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x109/0x290 [ipv6]
 netlink_dump+0x11d/0x290
 netlink_recvmsg+0x260/0x3f0
 sock_recvmsg+0x38/0x40
 ___sys_recvmsg+0xe9/0x230
 ? alloc_pages_vma+0x9d/0x260
 ? page_add_new_anon_rmap+0x88/0xc0
 ? lru_cache_add_active_or_unevictable+0x31/0xb0
 ? __handle_mm_fault+0xce3/0xf70
 __sys_recvmsg+0x3d/0x70
 ? __sys_recvmsg+0x3d/0x70
 SyS_recvmsg+0xd/0x20
 do_syscall_64+0x56/0xc0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f32c9e21050
RSP: 002b:7fff96401de8 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX:  RCX: 7f32c9e21050
RDX:  RSI: 7fff96401e50 RDI: 0004
RBP: 7fff96405e74 R08: 3fe4 R09: 
R10: 7fff96401e90 R11: 0246 R12: 0064f3a0
R13: 7fff96405ee0 R14: 3fe4 R15: 
Code: f6 40 2a 04 74 11 8b 53 30 85 d2 0f 84 02 01 00 00 83 ea 01 89 53 30 c7 43 28 04 00 00 00 48 39 43 10 74 33 48 8b 10 48 89 53 18 <48> 39 42 18 0f 84 a3 00 
00 00 48 39 42 08 0f 84 ae 00 00 00 48

RIP: fib6_walk_continue+0x76/0x180 [ipv6] RSP: c9000970fbb8
CR2: 0018
---[ end trace 5ebbc4ee97bea64e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: 'iw events' stops receiving events after a while on 4.9 + hacks

2017-05-31 Thread Ben Greear



On 05/31/2017 01:18 AM, Bastian Bittorf wrote:

* Johannes Berg <johan...@sipsolutions.net> [31.05.2017 10:09]:

Is there any way to dump out the socket information if we reproduce
the problem?


I have no idea, sorry.

If you or Bastian can tell me how to reproduce the problem, I can try
to investigate it.


there was an interesting fix regarding the shell-builtin 'read' in
busybox[1]. I will retest again and report if this changes anything.

bye, bastian

PS: @ben: are you also using 'iw event | while read -r LINE ...'?


I'm using a perl script to read the output, and not using busybox.

I have not seen the problem again, so it is not easy for me to reproduce.

If you reproduce it, maybe check 'strace' on the 'iw' process to see if it is
hung on writing output to the pipe or reading input?  In my case, it appeared
to be hung reading input from netlink, input that never arrived.

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: 'iw events' stops receiving events after a while on 4.9 + hacks

2017-05-17 Thread Ben Greear

On 05/17/2017 06:30 AM, Johannes Berg wrote:

On Wed, 2017-05-17 at 12:08 +0200, Bastian Bittorf wrote:

* Ben Greear <gree...@candelatech.com> [17.05.2017 11:51]:

I have been keeping an 'iw events' program running with a perl
script gathering its
output and post-processing it.  This has been working for several
years on 4.7 and earlier
kernels, but when testing on 4.9 overnight, I notice that 'iw
events' is not showing any input.  'strace' shows
that it is waiting on recvmsg.  If I start a second 'iw events'
then it will get
wifi events as expected.


me too, also seen on 4.4 - i'am happy for debug ideas.


I've never seen this.

Does it happen when it's very long-running? Or when there are lots of
events?

Perhaps something in the socket buffer accounting is going wrong, so
that it's slowly decreasing to 0?


I saw it exactly once so far, and it happened overnight,
but we have not been doing a lot of work with the 4.9 kernel until recently.

I don't think there were many messages on this system, and certainly
others have run much longer on systems that should be generating many more
events without trouble.

Is there any way to dump out the socket information if we reproduce
the problem?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



'iw events' stops receiving events after a while on 4.9 + hacks

2017-05-16 Thread Ben Greear

I have been keeping an 'iw events' program running with a perl script gathering 
its
output and post-processing it.  This has been working for several years on 4.7 
and earlier
kernels, but when testing on 4.9 overnight, I notice that 'iw events' is not 
showing any input.  'strace' shows
that it is waiting on recvmsg.  If I start a second 'iw events' then it will get
wifi events as expected.

Are there any known issues in this area?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: How to debug DMAR errors?

2017-04-14 Thread Ben Greear



On 04/14/2017 09:24 AM, Alexander Duyck wrote:

On Fri, Apr 14, 2017 at 9:19 AM, Ben Greear <gree...@candelatech.com> wrote:



On 04/14/2017 08:45 AM, Alexander Duyck wrote:


On Thu, Apr 13, 2017 at 11:12 AM, Ben Greear <gree...@candelatech.com>
wrote:


Hello,

I have been seeing a regular occurrence of DMAR errors, looking something
like this when testing my ath10k driver/firmware under some specific
loads
(maximum receive of 512 byte frames in AP mode):

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [05:00.0] fault addr fd99f000 [fault
reason
06] PTE Read access is not set
ath10k_pci :05:00.0: firmware crashed! (uuid
594b1393-ae35-42b5-9dec-74ff0c6791ff)

So, I am wondering if there is any way I can get more information about
what
this fd99f000 address
is?

Once this problem hits, the entire OS locks hard (not even sysrq-boot
will
do anything),
so I guess I would need the DMAR logic to print out more info on that
address somehow.

Thanks,
Ben



There isn't much more info to give you. The problem is that the device
at 5:00.0 attempted to read at fd99f000 even though it didn't have
permissions. In response this should trigger a PCI Master Abort
message to that function. It looks like the firmware for the device
doesn't handle that and so that is likely why things got hung.

Really you would need to interrogate the ath10k_pci to see if there
is/was a mapping somewhere for that address and what it was supposed
to be used for.



I'm working on a hook in DMAR logic to call into ath10k_pci when the
error is seen, so the ath10k can dump debug info, including recent DMA
addresses.

My code is an awful hack so far, but if someone could add a clean way to
register DMAR error callbacks, I think that would be very welcome.  It might
tie into automated DMA map/unmap debugging logic, and at the least, someone
could write custom debugging callbacks for the driver(s) in question.

Thanks,
Ben



You might look at coding up something to add pci_error_handlers for
the pci_driver in the ath10k_pci driver. The PCI Master Abort should
trigger an error that you could then capture in the driver and handle
at least dumping it via your own implementation of the error handlers.
If nothing else I suspect there are probably some sort of descriptor
rings you could probably dump. I'm suspecting this is some sort of Tx
issue since the problem was a read fault, but I suppose there are
other paths in the driver that might trigger DMA read requests.


This is a thick firmware driver, so the firmware could also be screwing up
and accessing something it should not.  There are some existing work-arounds
in it to deal with sketchy behaviour already, maybe more are needed.

Anyway, once I added the debugging code, I didn't see it crash again, so
might be a while before I know more.

Thanks,
Ben



- Alex



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: How to debug DMAR errors?

2017-04-14 Thread Ben Greear



On 04/14/2017 08:45 AM, Alexander Duyck wrote:

On Thu, Apr 13, 2017 at 11:12 AM, Ben Greear <gree...@candelatech.com> wrote:

Hello,

I have been seeing a regular occurrence of DMAR errors, looking something
like this when testing my ath10k driver/firmware under some specific loads
(maximum receive of 512 byte frames in AP mode):

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [05:00.0] fault addr fd99f000 [fault reason
06] PTE Read access is not set
ath10k_pci :05:00.0: firmware crashed! (uuid
594b1393-ae35-42b5-9dec-74ff0c6791ff)

So, I am wondering if there is any way I can get more information about what
this fd99f000 address
is?

Once this problem hits, the entire OS locks hard (not even sysrq-boot will
do anything),
so I guess I would need the DMAR logic to print out more info on that
address somehow.

Thanks,
Ben


There isn't much more info to give you. The problem is that the device
at 5:00.0 attempted to read at fd99f000 even though it didn't have
permissions. In response this should trigger a PCI Master Abort
message to that function. It looks like the firmware for the device
doesn't handle that and so that is likely why things got hung.

Really you would need to interrogate the ath10k_pci to see if there
is/was a mapping somewhere for that address and what it was supposed
to be used for.


I'm working on a hook in DMAR logic to call into ath10k_pci when the
error is seen, so the ath10k can dump debug info, including recent DMA
addresses.

My code is an awful hack so far, but if someone could add a clean way to
register DMAR error callbacks, I think that would be very welcome.  It might
tie into automated DMA map/unmap debugging logic, and at the least, someone
could write custom debugging callbacks for the driver(s) in question.

Thanks,
Ben



- Alex



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


How to debug DMAR errors?

2017-04-13 Thread Ben Greear

Hello,

I have been seeing a regular occurrence of DMAR errors, looking something
like this when testing my ath10k driver/firmware under some specific loads
(maximum receive of 512 byte frames in AP mode):

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [05:00.0] fault addr fd99f000 [fault reason 06] 
PTE Read access is not set
ath10k_pci :05:00.0: firmware crashed! (uuid 
594b1393-ae35-42b5-9dec-74ff0c6791ff)

So, I am wondering if there is any way I can get more information about what 
this fd99f000 address
is?

Once this problem hits, the entire OS locks hard (not even sysrq-boot will do 
anything),
so I guess I would need the DMAR logic to print out more info on that address 
somehow.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Horrid balance-rr bonding udp throughput

2017-04-10 Thread Ben Greear

On 04/10/2017 11:50 AM, Jarod Wilson wrote:

On 2017-04-08 7:33 PM, Jarod Wilson wrote:

I'm digging into some bug reports covering performance issues with balance-rr, 
and discovered something even worse than the reporter. My test setup has a pair
of NICs, one e1000e, one e1000 (but dual e1000e seems the same). When I do a 
test run in LNST with bonding mode balance-rr and either miimon or arpmon, the
throughput of the UDP_STREAM netperf test is absolutely horrible:

TCP: 941.19 +-0.88 mbits/sec
UDP: 45.42 +-4.59 mbits/sec

I figured I'd try LNST's packet capture mode, so exact same test, add the -p 
flag and I get:

TCP: 941.21 +-0.82 mbits/sec
UDP: 961.54 +-0.01 mbits/sec

Uh. What? So yeah. I can't capture the traffic in the bad case, but I guess 
that gives some potential insight into what's not happening correctly in either
the bonding driver or the NIC drivers... More digging forthcoming, but first I 
have a flooded basement to deal with, so if in the interim, anyone has some
insight, I'd be happy to hear it. :)


Okay, ignore the bit about bonding, I should have eliminated the bond from the 
picture entirely. I think the traffic simply ended up on the e1000 on the
non-capture test and on the e1000e for the capture test, as those numbers match 
perfectly with straight NIC to NIC testing, no bond involved. That said, really
odd that the e1000 is so severely crippled for UDP, while TCP is still 
respectable. Not sure if I have a flaky NIC or what...

For reference, e1000 to e1000e netperf:

TCP_STREAM: Measured rate was 849.95 +-1.32 mbits/sec
UDP_STREAM: Measured rate was 44.73 +-5.73 mbits/sec


Maybe check whether you have re-ordering issues?  I ran into that with igb
recently, and it took a while to figure out what my problem was!

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC 2/3] genetlink: pass extended error report down

2017-04-07 Thread Ben Greear

On 04/07/2017 12:12 PM, Johannes Berg wrote:

On Fri, 2017-04-07 at 11:37 -0700, Ben Greear wrote:


I guess the error string must be constant and always available in
memory in this implementation?


Yes.


I think it would be nice to dynamically create strings (malloc,
snprintf, etc) and have the err_str logic free it when done?


We can think about that later, but I don't actually think it makes a
lot of sense - if we point to the attribute and/or offset you really
ought to have enough info to figure out what's up.


We can think about it later, but lots of things in the wifi stack
could use a descriptive message specific to the failure.  Often these
messages are much more useful if you explain why the failure conflicts
with regulatory, channel, virtual-dev combination, etc info, so that needs
to be dynamic.  The code that is failing knows, so I'd like to pass it
back to user-space.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC 2/3] genetlink: pass extended error report down

2017-04-07 Thread Ben Greear

On 04/07/2017 11:26 AM, Johannes Berg wrote:

From: Johannes Berg <johannes.b...@intel.com>

Signed-off-by: Johannes Berg <johannes.b...@intel.com>
---
 include/net/genetlink.h | 27 +++
 net/netlink/genetlink.c |  6 --
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index a34275be3600..67ad2326cfa6 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -84,6 +84,7 @@ struct nlattr **genl_family_attrbuf(const struct genl_family *family);
  * @attrs: netlink attributes
  * @_net: network namespace
  * @user_ptr: user pointers
+ * @exterr: extended error report struct
  */
 struct genl_info {
 	u32 snd_seq;
@@ -94,6 +95,7 @@ struct genl_info {
 	struct nlattr **attrs;
 	possible_net_t	_net;
 	void *		user_ptr[2];
+	struct netlink_ext_err *exterr;
 };
 
 static inline struct net *genl_info_net(struct genl_info *info)
@@ -106,6 +108,31 @@ static inline void genl_info_net_set(struct genl_info *info, struct net *net)
 	write_pnet(&info->_net, net);
 }
 
+static inline int genl_err_str(struct genl_info *info, int err,
+			       const char *msg)
+{
+	info->exterr->msg = msg;
+
+	return err;
+}


I guess the error string must be constant and always available in memory
in this implementation?

I think it would be nice to dynamically create strings (malloc, snprintf, etc)
and have the err_str logic free it when done?
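For reference, the constraint can be seen in miniature: the helper in the patch only parks a pointer, so whatever it points at must outlive the request; string literals trivially do, while a stack snprintf() buffer would not.  A self-contained illustration with made-up names (not the genetlink code):

```c
/* Miniature of the static-vs-dynamic error-string tradeoff: setting
 * a pointer is free, but the pointed-to storage must stay valid
 * after the handler returns. */
#include <string.h>

struct req {
        int err;
        const char *msg;        /* must remain valid after return */
};

static int req_err_str(struct req *r, int err, const char *msg)
{
        r->msg = msg;           /* no copy made: msg must be static */
        return err;
}

/* Example handler: reject 2.4GHz channels above 14 with a
 * human-readable reason, roughly as an nl80211 op might want to. */
static int set_channel(struct req *r, int chan)
{
        if (chan > 14)
                return req_err_str(r, -22 /* -EINVAL */,
                                   "channel not valid in the 2.4GHz band");
        return 0;
}
```

Supporting dynamic strings would mean either copying into storage owned by the request or tracking an ownership flag so the error path knows what to free, which is the complexity the thread is weighing.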

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] igb: add module param to set max-rss-queues.

2017-03-24 Thread Ben Greear



On 03/24/2017 04:14 PM, Stephen Hemminger wrote:

On Fri, 24 Mar 2017 14:20:56 -0700
Ben Greear <gree...@candelatech.com> wrote:


On 03/24/2017 02:12 PM, David Miller wrote:

From: gree...@candelatech.com
Date: Fri, 24 Mar 2017 13:58:47 -0700


From: Ben Greear <gree...@candelatech.com>

In systems where you may have a very large number of network
adapters, certain drivers may consume an unfair amount of
IRQ resources.  So, allow a module param that will limit the
number of IRQs at driver load time.  This way, other drivers
(40G Ethernet, for instance), which probably will need the
multiple IRQs more, will not be starved of IRQ resources.

Signed-off-by: Ben Greear <gree...@candelatech.com>


Sorry, no module params.

Use generic run-time facilities such as ethtool to configure
such things.


You cannot call ethtool before module load time, and that is when
the IRQs are first acquired.  It may be way more useful to give each
of 20 network adapters 2 irqs than have the first few grab 16 and the rest
get lumped into legacy crap.


Almost all network devices do not acquire interrupts until device is brought up.
I.e request_irq is called from open not probe. This is done so that 
configuration
can be done and also so that unused ports don't consume interrupt space.


If I ever have to deal with this on stock kernels again I'll keep that in mind.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH] igb: add module param to set max-rss-queues.

2017-03-24 Thread Ben Greear

On 03/24/2017 02:12 PM, David Miller wrote:

From: gree...@candelatech.com
Date: Fri, 24 Mar 2017 13:58:47 -0700


From: Ben Greear <gree...@candelatech.com>

In systems where you may have a very large number of network
adapters, certain drivers may consume an unfair amount of
IRQ resources.  So, allow a module param that will limit the
number of IRQs at driver load time.  This way, other drivers
(40G Ethernet, for instance), which probably will need the
multiple IRQs more, will not be starved of IRQ resources.

Signed-off-by: Ben Greear <gree...@candelatech.com>


Sorry, no module params.

Use generic run-time facilities such as ethtool to configure
such things.


You cannot call ethtool before module load time, and that is when
the IRQs are first acquired.  It may be way more useful to give each
of 20 network adapters 2 irqs than have the first few grab 16 and the rest
get lumped into legacy crap.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Performance issue with igb with lots of different src-ip addrs.

2017-03-24 Thread Ben Greear

On 03/16/2017 08:51 PM, Ben Greear wrote:

I think we can, might take us a day or two to get time to do it.

Thanks,
Ben

On 03/16/2017 08:05 PM, Alexander Duyck wrote:

I'm not really interested in installing a custom version of pktgen.
Any chance you can recreate the issue with standard pktgen?

You might try running perf to get a snapshot of what is using CPU time
on the system.  It will probably give you a pretty good idea where the
code is that is eating up all your CPU time.


So, I had time to dig into this today.

Turns out that our tool was reporting drops because of sequence number
gaps, which in turn were caused by out-of-order frames...not actually
dropping frames.

If I force the rss_queues to one, the problem goes away.

Sorry for mis-reporting a bug.

Thanks,
Ben




- Alex

On Thu, Mar 16, 2017 at 7:46 PM, Ben Greear <gree...@candelatech.com> wrote:

I'm actually using a hacked up version of pktgen nicely driven by our
GUI tool, but the crux is that you need to set min and max src IP to some
large
range.

We are driving pktgen from a separate machine.  Stock pktgen isn't good at
reporting
received pkts last I checked, so it may be more difficult to easily view the
problem.

I'll be happy to set up my tool on your Fedora 24 or similar VM or machine
if you
want.

Thanks,
Ben


On 03/16/2017 07:35 PM, Alexander Duyck wrote:


Can you include the pktgen script you are running?

Also when you say you are driving traffic through the bridge are you
sending from something external on the system or are you actually
directing the traffic from pktgen into the bridge directly?

- Alex

On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear <gree...@candelatech.com>
wrote:


Hello,

We notice that when using two igb ports as a bridge, if we use pktgen to
drive traffic through the bridge and randomize (or use a very large
range)
for the source IP addr in pktgen, then performance of igb is very poor
(like
150Mbps
throughput instead of 1Gbps).  It runs right at line speed if we use same
src/dest
IP addr in pktgen.  So, seems it is related to lots of src/dest IP
addresses.

We see same problem when using pktgen to send to itself, and we see this
in
several different kernels.  We specifically tested bridge mode in this
stock
Fedora kernel:

 Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC
2017
x86_64 x86_64 x86_64 GNU/Linux

e1000e does not show this problem in our testing.

Any ideas what the issue might be and how to fix it?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com





--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com







--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Performance issue with igb with lots of different src-ip addrs.

2017-03-16 Thread Ben Greear

I think we can, might take us a day or two to get time to do it.

Thanks,
Ben

On 03/16/2017 08:05 PM, Alexander Duyck wrote:

I'm not really interested in installing a custom version of pktgen.
Any chance you can recreate the issue with standard pktgen?

You might try running perf to get a snapshot of what is using CPU time
on the system.  It will probably give you a pretty good idea where the
code is that is eating up all your CPU time.

- Alex

On Thu, Mar 16, 2017 at 7:46 PM, Ben Greear <gree...@candelatech.com> wrote:

I'm actually using a hacked up version of pktgen nicely driven by our
GUI tool, but the crux is that you need to set min and max src IP to some
large
range.

We are driving pktgen from a separate machine.  Stock pktgen isn't good at
reporting
received pkts last I checked, so it may be more difficult to easily view the
problem.

I'll be happy to set up my tool on your Fedora 24 or similar VM or machine
if you
want.

Thanks,
Ben


On 03/16/2017 07:35 PM, Alexander Duyck wrote:


Can you include the pktgen script you are running?

Also when you say you are driving traffic through the bridge are you
sending from something external on the system or are you actually
directing the traffic from pktgen into the bridge directly?

- Alex

On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear <gree...@candelatech.com>
wrote:


Hello,

We notice that when using two igb ports as a bridge, if we use pktgen to
drive traffic through the bridge and randomize (or use a very large
range)
for the source IP addr in pktgen, then performance of igb is very poor
(like
150Mbps
throughput instead of 1Gbps).  It runs right at line speed if we use same
src/dest
IP addr in pktgen.  So, seems it is related to lots of src/dest IP
addresses.

We see same problem when using pktgen to send to itself, and we see this
in
several different kernels.  We specifically tested bridge mode in this
stock
Fedora kernel:

 Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC
2017
x86_64 x86_64 x86_64 GNU/Linux

e1000e does not show this problem in our testing.

Any ideas what the issue might be and how to fix it?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com





--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: Performance issue with igb with lots of different src-ip addrs.

2017-03-16 Thread Ben Greear

I'm actually using a hacked up version of pktgen nicely driven by our
GUI tool, but the crux is that you need to set min and max src IP to some large
range.

We are driving pktgen from a separate machine.  Stock pktgen isn't good at 
reporting
received pkts last I checked, so it may be more difficult to easily view the 
problem.

I'll be happy to set up my tool on your Fedora 24 or similar VM or machine if 
you
want.

Thanks,
Ben

On 03/16/2017 07:35 PM, Alexander Duyck wrote:

Can you include the pktgen script you are running?

Also when you say you are driving traffic through the bridge are you
sending from something external on the system or are you actually
directing the traffic from pktgen into the bridge directly?

- Alex

On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear <gree...@candelatech.com> wrote:

Hello,

We notice that when using two igb ports as a bridge, if we use pktgen to
drive traffic through the bridge and randomize (or use a very large range)
for the source IP addr in pktgen, then performance of igb is very poor (like
150Mbps
throughput instead of 1Gbps).  It runs right at line speed if we use same
src/dest
IP addr in pktgen.  So, seems it is related to lots of src/dest IP
addresses.

We see same problem when using pktgen to send to itself, and we see this in
several different kernels.  We specifically tested bridge mode in this stock
Fedora kernel:

 Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017
x86_64 x86_64 x86_64 GNU/Linux

e1000e does not show this problem in our testing.

Any ideas what the issue might be and how to fix it?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com





--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: netdev level filtering? perhaps pushing socket filters down?

2017-03-16 Thread Ben Greear

On 03/16/2017 03:33 PM, Johannes Berg wrote:

Hi all,

Occasionally - we just had another case - people want to hook into
packets received and processed by the mac80211 stack, but because they
don't need all of them (e.g. not data packets), even adding a monitor
interface and bringing it up has too high a cost because SKBs need to
be prepared to send them to the monitor interface, even if no socket is
consuming them.

Ideally, we'd be able to detect that there are filter programs attached
to the socket(s) that are looking at the frames coming in on the
monitor interface, and we could somehow magically run those before we
create a new SKB.
One problem here is that we wouldn't really want to prepare all the
radiotap header just to throw it away, so we'd have to be able to
analyse the filter program to make sure it doesn't access anything but
the radiotap header length, and that only in order to jump over it.
That seems ... difficult, but we don't even know the header length -
although we could fudge that and make a very long constant-size header,
which might make it possible to do such analysis, or handle it by
trapping on such access. But it seems rather difficult to implement
this.

The next best thing would be to install a filter program on the virtual
monitor *interface* (netdev), but say that it doesn't get frames with
radiotap, but pure 802.11 frames. We already have those in SKB format
at this point, so it'd be simple to run such a program and only pass
the SKB to the monitor netdev's RX when the program asked to do that.

This now seems a bit like XDP, but for XDP this header difference
doesn't seem appropriate either.

Anyone have any other thoughts?


Attach just above the driver, before frames ever reach stations/vdevs,
and ignore radiotap headers and/or add special processing for metadata like
rx-info?

Thanks,
Ben



Thanks,
johannes




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Performance issue with igb with lots of different src-ip addrs.

2017-03-16 Thread Ben Greear

Hello,

We notice that when using two igb ports as a bridge, if we use pktgen to
drive traffic through the bridge and randomize (or use a very large range)
for the source IP addr in pktgen, then performance of igb is very poor (like 
150Mbps
throughput instead of 1Gbps).  It runs right at line speed if we use same 
src/dest
IP addr in pktgen.  So, seems it is related to lots of src/dest IP addresses.

We see same problem when using pktgen to send to itself, and we see this in
several different kernels.  We specifically tested bridge mode in this stock
Fedora kernel:

 Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

e1000e does not show this problem in our testing.

Any ideas what the issue might be and how to fix it?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ath10k: remove ath10k_vif_to_arvif()

2017-02-10 Thread Ben Greear

On 02/09/2017 11:03 PM, Valo, Kalle wrote:

Ben Greear <gree...@candelatech.com> writes:


On 02/07/2017 01:14 AM, Valo, Kalle wrote:

Adrian Chadd <adr...@freebsd.org> writes:


Removing this method makes the diff to FreeBSD larger, as "vif" in
FreeBSD is a different pointer.

(Yes, I have ath10k on freebsd working and I'd like to find a way to
reduce the diff moving forward.)


I don't like this "(void *) vif->drv_priv" style that much either but
apparently it's commonly used in Linux wireless code and already parts
of ath10k. So this patch just unifies the coding style.


Surely the code compiles to the same thing, so why add a patch that
makes it more difficult for Adrian and makes the code no easier to read
for the rest of us?


Because that's the coding style used already in Linux. It's great to see
that parts of ath10k can be used also in other systems but in principle
I'm not very fond of the idea starting to reject valid upstream patches
because of driver forks.


There are lots of people trying to maintain out-of-tree or backported patches 
to ath10k,
and every time there is a meaningless style change, that just makes us
waste more time on useless work instead of having time to work on more important
matters.

Thanks,
Ben


I think backports project is doing it right, it's not limiting upstream
development in any way and handles all the API changes internally. Maybe
FreeBSD could do something similar?




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ath10k: remove ath10k_vif_to_arvif()

2017-02-07 Thread Ben Greear



On 02/07/2017 01:14 AM, Valo, Kalle wrote:

Adrian Chadd <adr...@freebsd.org> writes:


Removing this method makes the diff to FreeBSD larger, as "vif" in
FreeBSD is a different pointer.

(Yes, I have ath10k on freebsd working and I'd like to find a way to
reduce the diff moving forward.)


I don't like this "(void *) vif->drv_priv" style that much either but
apparently it's commonly used in Linux wireless code and already parts
of ath10k. So this patch just unifies the coding style.


Surely the code compiles to the same thing, so why add a patch that
makes it more difficult for Adrian and makes the code no easier to read
for the rest of us?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: Commit 1fe8e0... (include more headers in if_tunnel.h) breaks my user-space build.

2017-01-13 Thread Ben Greear

On 01/13/2017 02:08 PM, Stephen Hemminger wrote:

On Fri, 13 Jan 2017 11:50:32 -0800
Ben Greear <gree...@candelatech.com> wrote:


On 01/13/2017 11:41 AM, Stephen Hemminger wrote:

On Fri, 13 Jan 2017 11:12:32 -0800
Ben Greear <gree...@candelatech.com> wrote:


I am including netinet/ip.h, and also linux/if_tunnel.h, and the linux/ip.h 
conflicts with
netinet/ip.h.

Maybe my build environment is screwed up, but maybe also it would be better to
just let the user include appropriate headers before including if_tunnel.h
and revert this patch?


include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h

 Fixes userspace compilation errors like:

 error: field ‘iph’ has incomplete type
 error: field ‘prefix’ has incomplete type

 Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi>
 Signed-off-by: David S. Miller <da...@davemloft.net>

Thanks,
Ben



What I ended up doing for iproute2 was including all headers used by the source
based on sanitized kernel headers.  Basically
  $ git grep '^#include <' | sed 's/.*<//;s/>.*$//' | \
sort -u >linux.headers
  $ for f in $(cat linux.headers)
do cp ~/kernel/net-next/usr/include/$f include/$f
done

You can't take only some of the headers; once you decide to diverge from the
glibc-provided headers, you have to take them all.



I do grab a copy of the linux kernel headers and compile against that, but 
netinet/ip.h is
coming from the OS.  Do you mean I should not include netinet/ip.h and instead 
use linux/ip.h?


I don't think you can mix netinet/ip.h and linux/ip.h, yes that is a mess.



Well, I still like the idea of reverting this patch...that way user-space does
not have to use linux/ip.h, and that lets them use netinet/ip.h and if_tunnel.h.

Anyway, I'll let Dave and/or the original committer decide...I've reverted it
in my local tree so I am able to build again.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Commit 1fe8e0... (include more headers in if_tunnel.h) breaks my user-space build.

2017-01-13 Thread Ben Greear

On 01/13/2017 11:41 AM, Stephen Hemminger wrote:

On Fri, 13 Jan 2017 11:12:32 -0800
Ben Greear <gree...@candelatech.com> wrote:


I am including netinet/ip.h, and also linux/if_tunnel.h, and the linux/ip.h 
conflicts with
netinet/ip.h.

Maybe my build environment is screwed up, but maybe also it would be better to
just let the user include appropriate headers before including if_tunnel.h
and revert this patch?


include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h

 Fixes userspace compilation errors like:

 error: field ‘iph’ has incomplete type
 error: field ‘prefix’ has incomplete type

 Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi>
 Signed-off-by: David S. Miller <da...@davemloft.net>

Thanks,
Ben



What I ended up doing for iproute2 was including all headers used by the source
based on sanitized kernel headers.  Basically
  $ git grep '^#include <' | sed 's/.*<//;s/>.*$//' | \
sort -u >linux.headers
  $ for f in $(cat linux.headers)
do cp ~/kernel/net-next/usr/include/$f include/$f
done

You can't take only some of the headers; once you decide to diverge from the
glibc-provided headers, you have to take them all.



I do grab a copy of the linux kernel headers and compile against that, but 
netinet/ip.h is
coming from the OS.  Do you mean I should not include netinet/ip.h and instead 
use linux/ip.h?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Commit 1fe8e0... (include more headers in if_tunnel.h) breaks my user-space build.

2017-01-13 Thread Ben Greear

On 01/13/2017 11:12 AM, Ben Greear wrote:

I am including netinet/ip.h, and also linux/if_tunnel.h, and the linux/ip.h 
conflicts with
netinet/ip.h.

Maybe my build environment is screwed up, but maybe also it would be better to
just let the user include appropriate headers before including if_tunnel.h
and revert this patch?


include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h

Fixes userspace compilation errors like:

error: field ‘iph’ has incomplete type
error: field ‘prefix’ has incomplete type

Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi>
Signed-off-by: David S. Miller <da...@davemloft.net>

Thanks,
Ben



I forgot the full commit ID; my abbreviation was not sufficient to be unique,
it seems:

1fe8e0f074c77aa41aaa579345a9e675acbebfa9

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Commit 1fe8e0... (include more headers in if_tunnel.h) breaks my user-space build.

2017-01-13 Thread Ben Greear

I am including netinet/ip.h, and also linux/if_tunnel.h, and the linux/ip.h 
conflicts with
netinet/ip.h.

Maybe my build environment is screwed up, but maybe also it would be better to
just let the user include appropriate headers before including if_tunnel.h
and revert this patch?


include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h

Fixes userspace compilation errors like:

error: field ‘iph’ has incomplete type
error: field ‘prefix’ has incomplete type

Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi>
Signed-off-by: David S. Miller <da...@davemloft.net>

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



ixgbe Port cannot load, "failed to register GSI"

2016-12-06 Thread Ben Greear

We put three dual-port 10G ixgbe NICs and four 4-port I350 NICs in a 2U rackmount,
and one of the ixgbe ports fails to come up.  This worked before the reboot, so
maybe it is a race somehow.  Kernel is 4.4.11+, with no hacks to the ixgbe or
I350 drivers.

Anyone know if there is some sort of way to make this work reliably?

dmesg | grep ixgbe

[5.803307] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 
4.2.1-k
[5.803309] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
[5.952119] ixgbe 0000:04:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx 
Queue count = 8
[5.952245] ixgbe 0000:04:00.0: PCI Express bandwidth of 32GT/s available
[5.952246] ixgbe 0000:04:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[5.952328] ixgbe 0000:04:00.0: MAC: 2, PHY: 15, SFP+: 5, PBA No: FFFFFF-0FF
[5.952330] ixgbe 0000:04:00.0: 00:e0:ed:77:09:16
[5.954004] ixgbe 0000:04:00.0: Intel(R) 10 Gigabit Network Connection
[6.102346] ixgbe 0000:04:00.1: Multiqueue Enabled: Rx Queue count = 8, Tx 
Queue count = 8
[6.102475] ixgbe 0000:04:00.1: PCI Express bandwidth of 32GT/s available
[6.102478] ixgbe 0000:04:00.1: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[6.102562] ixgbe 0000:04:00.1: MAC: 2, PHY: 15, SFP+: 6, PBA No: FFFFFF-0FF
[6.102564] ixgbe 0000:04:00.1: 00:e0:ed:77:09:17
[6.104869] ixgbe 0000:04:00.1: Intel(R) 10 Gigabit Network Connection
[6.253429] ixgbe 0000:05:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx 
Queue count = 8
[6.253558] ixgbe 0000:05:00.0: PCI Express bandwidth of 32GT/s available
[6.253560] ixgbe 0000:05:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[6.253644] ixgbe 0000:05:00.0: MAC: 2, PHY: 15, SFP+: 5, PBA No: FFFFFF-0FF
[6.253646] ixgbe 0000:05:00.0: 00:e0:ed:79:06:50
[6.255855] ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection
[6.404128] ixgbe 0000:05:00.1: Multiqueue Enabled: Rx Queue count = 8, Tx 
Queue count = 8
[6.404254] ixgbe 0000:05:00.1: PCI Express bandwidth of 32GT/s available
[6.404255] ixgbe 0000:05:00.1: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[6.404337] ixgbe 0000:05:00.1: MAC: 2, PHY: 15, SFP+: 6, PBA No: FFFFFF-0FF
[6.404339] ixgbe 0000:05:00.1: 00:e0:ed:79:06:51
[6.405914] ixgbe 0000:05:00.1: Intel(R) 10 Gigabit Network Connection
[6.554373] ixgbe 0000:06:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx 
Queue count = 8
[6.554501] ixgbe 0000:06:00.0: PCI Express bandwidth of 32GT/s available
[6.554504] ixgbe 0000:06:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[6.554588] ixgbe 0000:06:00.0: MAC: 2, PHY: 15, SFP+: 5, PBA No: FFFFFF-0FF
[6.554590] ixgbe 0000:06:00.0: 00:e0:ed:79:06:56
[6.556994] ixgbe 0000:06:00.0: Intel(R) 10 Gigabit Network Connection
[6.557160] ixgbe 0000:06:00.1: PCI INT B: failed to register GSI
[6.557169] ixgbe: probe of 0000:06:00.1 failed with error -28

Thanks,
Ben
--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-19 Thread Ben Greear



On 10/19/2016 08:08 AM, Ard Biesheuvel wrote:

On 19 October 2016 at 08:43, Johannes Berg <johan...@sipsolutions.net> wrote:

On Wed, 2016-10-19 at 11:31 +0800, Herbert Xu wrote:



We could probably make mac80211 do that too, but can we guarantee in-
order processing? Anyway, it's pretty low priority, maybe never
happening, since hardly anyone really uses "software" crypto, the wifi
devices mostly have it built in anyway.



Indeed. The code is now correct in terms of API requirements, so let's
just wait for someone to complain about any performance regressions.


Do you actually expect performance regressions?  I'll be complaining if
so, but will test first :)

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


ethtool.h compile warning on c++

2016-10-14 Thread Ben Greear

I am getting warnings about a sign mismatch.

Maybe make SPEED_UNKNOWN be ((__u32)(0xffffffff)) ?

from ethtool.h:

#define SPEED_UNKNOWN   -1

static inline int ethtool_validate_speed(__u32 speed)
{
return speed <= INT_MAX || speed == SPEED_UNKNOWN;
}

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] ath10k: fix system hang at qca99x0 probe on x86 platform (DMA32 issue)

2016-09-26 Thread Ben Greear

On 07/20/2016 10:02 AM, Adrian Chadd wrote:

Hi,

The "right" way for the target CPU to interact with host CPU memory
(and vice versa, for mostly what it's worth) is to have the copy
engine copy (ie, "DMA") the pieces between them. This may be for
diagnostic purposes, but it's not supposed to be used like this for
doing wifi data exchange, right? :-P

Now, there /may/ be some alignment hilarity in various bits of code
and/or hardware. Eg, Merlin (AR9280) requires its descriptors to be
within a 4k block - the code to iterate through the descriptor
physical address space didn't do a "val = val + offset", it did
something in verilog like "val = (val & 0xffffc000) | (offset &
0x3fff)". This meant if you allocated a descriptor that started just
before the end of a 4k physmem aligned block, you'd end up with
exciting results. I don't know if there are any situations like this
in the ath10k hardware, but I'm sure there will be some gotchas
somewhere.

In any case, if ath10k is consuming too much bounce buffers, the calls
to allocate memory aren't working right and should be restricted to 32
bit addresses. Whether that's by using the DMA memory API (before it's
mapped) or passing in GFP_DMA32 is a fun debate.

(My test hardware arrived, so I'll test this all out today on
Peregrine-v2 and see if the driver works.)


I have been running this patch for a while:

ath10k:  Use GFP_DMA32 for firmware swap memory.

This fixes OS crash when using QCA 9984 NIC on x86-64 system
without vt-d enabled.

Also tested on ea8500 with 9980, and x86-64 with 9980 and 9880.

All tests were with CT firmware.

Signed-off-by: Ben Greear <gree...@candelatech.com>

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
index e20aa39..727b3aa 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -4491,7 +4491,7 @@ static int ath10k_wmi_alloc_chunk(struct ath10k *ar, u32 req_id,
 		if (!pool_size)
 			return -EINVAL;
 
-		vaddr = kzalloc(pool_size, GFP_KERNEL | __GFP_NOWARN);
+		vaddr = kzalloc(pool_size, GFP_KERNEL | __GFP_NOWARN | GFP_DMA32);
 		if (!vaddr)
 			num_units /= 2;
 	}


It mostly seems to work, but then sometimes I get a splat like this below.  It 
appears
it is invalid to actually do kzalloc with GFP_DMA32 (based on that BUG_ON that
hit in the new_slab method)??

Any idea for a more proper way to do this?



gfp: 4
[ cut here ]
kernel BUG at /home/greearb/git/linux-4.7.dev.y/mm/slub.c:1508!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: coretemp hwmon ath9k intel_rapl ath10k_pci x86_pkg_temp_thermal ath9k_common ath10k_core intel_powerclamp ath9k_hw ath kvm iTCO_wdt mac80211 
iTCO_vendor_support irqbypass snd_hda_codec_hdmi 6

CPU: 2 PID: 268 Comm: kworker/u8:5 Not tainted 4.7.2+ #16
Hardware name: To be filled by O.E.M. To be filled by O.E.M./ChiefRiver, BIOS 
4.6.5 06/07/2013
Workqueue: ath10k_aux_wq ath10k_wmi_event_service_ready_work [ath10k_core]
task: ffff880036433a00 ti: ffff880036440000 task.ti: ffff880036440000
RIP: 0010:[]  [] new_slab+0x39a/0x410
RSP: 0018:ffff880036443b58  EFLAGS: 00010092
RAX: 0000000000000006 RBX: 00000000024082c4 RCX: 0000000000000000
RDX: 0000000000000006 RSI: ffff88021e30dd08 RDI: ffff88021e30dd08
RBP: ffff880036443b90 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000372 R12: ffff88021dc01200
R13: ffff88021dc00cc0 R14: ffff88021dc01200 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88021e300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3e65c1c730 CR3: 0000000001e06000 CR4: 00000000001406e0
Stack:
 ffffffff8127a4fc 000000000a01ff10 00000000024082c4 ffff88021dc01200
 ffff88021dc00cc0 ffff88021dc01200 0000000000000001 ffff880036443c58
 ffffffff81247ac6 ffff88021e31b360 ffff880036433a00 ffff880036433a00
Call Trace:
 [] ? __d_lookup+0x9c/0x160
 [] ___slab_alloc+0x396/0x4a0
 [] ? ath10k_wmi_event_service_ready_work+0x5ad/0x800 
[ath10k_core]
 [] ? alloc_kmem_pages+0x9/0x10
 [] ? kmalloc_order+0x13/0x40
 [] ? ath10k_wmi_event_service_ready_work+0x5ad/0x800 
[ath10k_core]
 [] __slab_alloc.isra.72+0x26/0x40
 [] __kmalloc+0x147/0x1b0
 [] ath10k_wmi_event_service_ready_work+0x5ad/0x800 
[ath10k_core]
 [] ? dequeue_entity+0x261/0xac0
 [] process_one_work+0x148/0x420
 [] worker_thread+0x49/0x480
 [] ? rescuer_thread+0x330/0x330
 [] kthread+0xc4/0xe0
 [] ret_from_fork+0x1f/0x40
 [] ? kthread_create_on_node+0x170/0x170
Code: e9 65 fd ff ff 49 8b 57 20 48 8d 42 ff 83 e2 01 49 0f 44 c7 f0 80 08 40 e9 6f fd ff ff 89 c6 48 c7 c7 01 36 c7 81 e8 e8 40 fa ff <0f> 0b ba 00 10 00 00 be 
5a 00 00 00 48 89 c7 48 d3 e2 e8 bf 18

RIP  [] new_slab+0x39a/0x410
 RSP <ffff880036443b58>
---[ end trace ea3b0043b2911d93 ]---


static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
{
if (unlikely(flags & GFP_SLAB_BUG_MASK)) {

Fwd: Re: nfs broken on Fedora-24, 32-bit?

2016-09-22 Thread Ben Greear


This is probably not an NFS specific issue, though I guess possibly it is.

Forwarding to netdev in case someone wants to take a look at it.

Thanks,
Ben


 Forwarded Message 
Subject: Re: nfs broken on Fedora-24, 32-bit?
Date: Fri, 16 Sep 2016 16:31:51 -0700
From: Ben Greear <gree...@candelatech.com>
Organization: Candela Technologies
To: Trond Myklebust <tron...@primarydata.com>
CC: List Linux NFS Mailing <linux-...@vger.kernel.org>

On 09/15/2016 04:06 PM, Ben Greear wrote:

On 09/15/2016 04:00 PM, Trond Myklebust wrote:

Hi Ben,


On Sep 15, 2016, at 18:32, Ben Greear <gree...@candelatech.com> wrote:

I have a Fedora-24 machine mounting an NFS server running Fedora-13 (kernel 
2.6.34.9-69.fc13.x86_64).

F24 machine has this in /etc/fstab:

192.168.100.3:/mnt/d2 /mnt/d2   nfs nfsvers=3   0 0

When I copy a file from f24-32 to the F-13 machine, the file size is the same,
but the file is corrupted on the file server.  I see a different md5sum each 
time.

Various other systems (F21, F19, etc) can all copy to the F13 machine fine.

And, F24-64 machine can copy to the F13 machine fine.

Anyone seen something similar?


Do you know if the corruption is happening on the read()s or on the write()s? 
Do you, for instance, get the same corruption if you copy from a local file on
the F-24 client to the server? ...or if you copy from a file on the server to a
local directory on the F-24 client?

Cheers
  Trond



Seems to be a write issue:

# This is the nfs server:

[greearb@fs3 candela_cdrom.5.3.5]$ md5sum gua-f21-32
ad4073fa8b806bb82b85a645e21f5e67  gua-f21-32
[greearb@fs3 candela_cdrom.5.3.5]$ md5sum ../greearb/tmp/gua-f21-32
582bfea0cc8cc52aa38dc0f5048d0156  ../greearb/tmp/gua-f21-32
[greearb@fs3 candela_cdrom.5.3.5]$


# This is the v-f24-32 client:

greearb@v-f24-32 ~]$ cp /mnt/d2/pub/candela_cdrom.5.3.5/gua-f21-32 ./
[greearb@v-f24-32 ~]$ md5sum gua-f21-32
ad4073fa8b806bb82b85a645e21f5e67  gua-f21-32
[greearb@v-f24-32 ~]$ cp gua-f21-32 /mnt/d2/pub/greearb/tmp/
[greearb@v-f24-32 ~]$ md5sum /mnt/d2/pub/greearb/tmp/gua-f21-32
ad4073fa8b806bb82b85a645e21f5e67  /mnt/d2/pub/greearb/tmp/gua-f21-32


Interesting that the client reads back the file it copied over as if it were
correct, but it shows up wrong on the NFS server.  Maybe it is just reading
from a local cache?

Thanks,
Ben



Here is some more info on this:

We can only reproduce this on virtual machines using the KVM infrastructure, 
and only
when we use the rtl8139 virtual hardware (in bridge mode).  With the e1000 
virtual hardware
we cannot reproduce the problem.

Also, multiple different nfs servers (including much newer kernels) all show 
the same
behaviour with this broken nfs client.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Buggy rhashtable walking

2016-08-05 Thread Ben Greear



On 08/05/2016 03:50 AM, Johannes Berg wrote:

On Fri, 2016-08-05 at 18:48 +0800, Herbert Xu wrote:

On Fri, Aug 05, 2016 at 08:16:53AM +0200, Johannes Berg wrote:


Hm. Would you rather allocate a separate head entry for the
hashtable,
or chain the entries?


My plan is to build support for this directly into rhashtable.
So I'm adding a struct rhlist_head that would be used in place
of rhash_head for these cases and it'll carry an extra pointer
for the list of identical entries.

I will then add an additional layer of insert/lookup interfaces
for rhlist_head.


Herbert, thank you for fixing this!

It would not be fun to have to revert to the old way of hashing
stations in mac80211...

I'll be happy to test the patches when you have them ready.

Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


pktgen issue with "Observe needed_headroom of the device"

2016-05-23 Thread Ben Greear

Regarding this commit:

879c7220e828af8bd82ea9d774c7e45c46b976e4

net: pktgen: Observe needed_headroom of the device

Allocate enough space so as not to force the outgoing net device to do
skb_realloc_headroom().

Signed-off-by: Bogdan Hamciuc <bogdan.hamc...@freescale.com>
Signed-off-by: David S. Miller <da...@davemloft.net>


I think it may be incorrect.  It seems that pkt_overhead is meant to be
the data-portion of the skb, not lower-level padding?

For instance:

int pkt_overhead;   /* overhead for MPLS, VLANs, IPSEC etc */
...

/* Eth + IPh + UDPh + mpls */
datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
  pkt_dev->pkt_overhead;

So, maybe we need to add that LL_RESERVED_SPACE to the size when allocating
the skb in pktgen_alloc_skb and leave it out of pkt_overhead?

And for that matter, what is that '+ 64 +' for in the size calculation?
Looks a lot like some fudge factor from long ago?

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Make TCP work better with re-ordered frames?

2016-05-18 Thread Ben Greear

On 05/18/2016 08:25 AM, Eric Dumazet wrote:

On Wed, 2016-05-18 at 08:07 -0700, Ben Greear wrote:


On 05/18/2016 07:29 AM, Eric Dumazet wrote:

On Wed, 2016-05-18 at 07:00 -0700, Ben Greear wrote:

We are investigating a system that has fairly poor TCP throughput
with the 3.17 and 4.0 kernels, but evidently it worked pretty well
with 3.14 (I should be able to verify 3.14 later today).

One thing I notice is that a UDP download test shows lots of reordered
frames, so I am thinking maybe TCP is running slow because of this.

(We see about 800Mbps UDP download, but only 500Mbps TCP, even when
using 100 concurrent TCP streams.)

Is there some way to tune the TCP stack to better handle reordered frames?


Nothing yet. Are you the sender or the receiver ?

You really want to avoid reorders as much as possible.

Are you telling us something broke in networking layers between 3.14 and
3.17 leading to reorders?


I am both sender and receiver, through an access-controller and wifi AP as DUT.
The sender is Intel 1G NIC, so I suspect it is not causing reordering, which
indicates most likely DUT is to blame.

Using several off-the-shelf APs in our lab we do not see this problem.

I am not certain yet what is the difference, but customer reports 600+Mbps
with their older code, and best I can get is around 500Mbps with newer stuff.

Lots of stuff changed though (ath10k firmware, user-space at least slightly,
kernel, etc), so possibly the regression is elsewhere.



You possibly could send me some pcap (limited to the headers, using -s
128 for example) and limited to few flows, not the whole of them ;)

TCP reorders are tricky for the receiver : It sends a lot of SACK (one
for every incoming packet, instead of the normal rule of sending one ACK
for two incoming packets)

Increasing number of ACK might impact half-duplex networks, but also
considerably increase cpu processing time.


I will work on captures...do you care if it is from transmitter or receiver's 
perspective?

Thanks,
Ben








--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Make TCP work better with re-ordered frames?

2016-05-18 Thread Ben Greear



On 05/18/2016 07:29 AM, Eric Dumazet wrote:

On Wed, 2016-05-18 at 07:00 -0700, Ben Greear wrote:

We are investigating a system that has fairly poor TCP throughput
with the 3.17 and 4.0 kernels, but evidently it worked pretty well
with 3.14 (I should be able to verify 3.14 later today).

One thing I notice is that a UDP download test shows lots of reordered
frames, so I am thinking maybe TCP is running slow because of this.

(We see about 800Mbps UDP download, but only 500Mbps TCP, even when
   using 100 concurrent TCP streams.)

Is there some way to tune the TCP stack to better handle reordered frames?


Nothing yet. Are you the sender or the receiver ?

You really want to avoid reorders as much as possible.

Are you telling us something broke in networking layers between 3.14 and
3.17 leading to reorders?


I am both sender and receiver, through an access-controller and wifi AP as DUT.
The sender is Intel 1G NIC, so I suspect it is not causing reordering, which
indicates most likely DUT is to blame.

Using several off-the-shelf APs in our lab we do not see this problem.

I am not certain yet what is the difference, but customer reports 600+Mbps
with their older code, and best I can get is around 500Mbps with newer stuff.

Lots of stuff changed though (ath10k firmware, user-space at least slightly,
kernel, etc), so possibly the regression is elsewhere.

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Make TCP work better with re-ordered frames?

2016-05-18 Thread Ben Greear

We are investigating a system that has fairly poor TCP throughput
with the 3.17 and 4.0 kernels, but evidently it worked pretty well
with 3.14 (I should be able to verify 3.14 later today).

One thing I notice is that a UDP download test shows lots of reordered
frames, so I am thinking maybe TCP is running slow because of this.

(We see about 800Mbps UDP download, but only 500Mbps TCP, even when
 using 100 concurrent TCP streams.)

Is there some way to tune the TCP stack to better handle reordered frames?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-05-13 Thread Ben Greear

On 05/13/2016 11:21 AM, David Miller wrote:

From: Ben Greear <gree...@candelatech.com>
Date: Fri, 13 May 2016 09:57:19 -0700


How do you feel about a new socket-option to allow a socket to
request the old veth behaviour?


I depend upon the opinions of the experts who work upstream on and
maintain these components, since it is an area I am not so familiar
with.

Generally speaking asking me directly for opinions on matters like
this isn't the way to go, in fact I kind of find it irritating.  It
can't all be on me.



Fair enough, thanks for your time.

Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-05-13 Thread Ben Greear

Mr Miller:

How do you feel about a new socket-option to allow a socket to
request the old veth behaviour?

Thanks,
Ben

On 04/30/2016 10:30 PM, Willy Tarreau wrote:

On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote:

On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:

Consider:

- App A  sends out corrupt packets 50% of the time and discards inbound data.

(...)

How can you make a generic app C know how to do this?  The path could be,
for instance:

eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> 
vethD <-> appC

There are no sockets on vethB, but it does need to have special behaviour to
elide csums.  Even if appC is hacked to know how to twiddle something on its
veth port, mucking with vethD will have no effect on vethB.

With regard to your example above, why would A corrupt packets?  My guess:

1)  It has bugs (so, fix the bugs, it could equally create incorrect data with 
proper checksums,
 so just enabling checksumming adds no useful protection.)


I agree with Ben here, what he needs is the ability for userspace to be
trusted when *forwarding* a packet. Ideally you'd only want to receive
the csum status per packet on the packet socket and pass the same value
on the vethA interface, with this status being kept when the packet
reaches vethB.

If A purposely corrupts packet, it's A's problem. It's similar to designing
a NIC which intentionally corrupts packets and reports "checksum good".

The real issue is that in order to do things right, the userspace bridge
(here, "A") would really need to pass this status. In Ben's case as he
says, bad checksum packets are dropped before reaching A, so that
simplifies the process quite a bit and that might be what causes some
confusion, but ideally we'd rather have recvmsg() and sendmsg() with
these flags.

I faced the exact same issue 3 years ago when playing with netmap, it was
slow as hell because it would lose all checksum information when packets
were passing through userland, resulting in GRO/GSO etc being disabled,
and had to modify it to let userland preserve it. That's especially
important when you have to deal with possibly corrupted packets not yet
detected in the chain because the NIC did not validate their checksums.

Willy




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Performance suggestions for bridging module?

2016-05-02 Thread Ben Greear

Hello!

I have a network emulator module that acts a lot like an ethernet bridge.

It is implemented roughly like this:

Hook into the rx logic and steal packets in the rx-all logic, similar to how
sniffers work.

Then, it puts the packet onto a queue for transmit.

A kernel thread services this queue transmitting frames on a different NIC.

I am using spin-locks to protect this queue.

I am disabling LRO/GRO etc on the ixgbe NICs so that I don't have
to deal with linearization when trying to do corruptions and such.  Re-enabling
LRO/GRO makes the transmit logic use less CPU, but the RX logic is the
bottleneck anyway, it seems.

The code, which is GPL, is here, in case someone wants to take a look:

http://www.candelatech.com/downloads/wanlink/

What I see is that this is very sensitive to which CPU core does what.
If I run the transmitter thread on cpu-0, performance is awful.  If I run
it on 1, then it is good.  Sometimes, though hard to reproduce, I can run
right at 10Gbps bi-directional throughput.  More often, it is stuck at
around 7Gbps bi-directional throughput.

I tried adding some prefetch logic, and that helped when emulating very long
latency (like, 10 seconds worth), but not sure I am really doing that optimally
either.

My basic question is:  Any suggestion for an optimal CPU core configuration
(most likely including binding a NIC's irqs to a particular core)??

Any other suggestions for things to look for?

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Ben Greear



On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:

On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear <gree...@candelatech.com> wrote:


Good point, so if you had:

eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> veth B <->
userspace-stub <->eth1

and user-space hub enabled this elide flag, things would work, right?
Then, it seems like what we need is a way to tell the kernel
router/bridge logic to follow elide signals in packets coming from
veth. I'm not sure what the best way to do this is because I'm less
familiar with conventions in that part of the kernel, but assuming
there's a way to do this, would it be acceptable?



You cannot receive on one veth without transmitting on the other, so
I think the elide csum logic can go on the raw-socket, and apply to packets
in the transmit-from-user-space direction.  Just allowing the socket to make
the veth behave like it used to before this patch in question should be good
enough, since that worked for us for years.  So, just an option to modify
the
ip_summed for pkts sent on a socket is probably sufficient.


I don't think this is right. Consider:

- App A  sends out corrupt packets 50% of the time and discards inbound data.
- App B doesn't care about corrupt packets and is happy to receive
them and has some way of dealing with them (special case)
- App C is a regular app, say nc or something.

In your world, where A decides what happens to data it transmits,
then
A<--veth-->B and A<---wire-->B will have the same behaviour

but

A<-- veth --> C and A<-- wire --> C will have _different_ behaviour: C
will behave incorrectly if it's connected over veth but correctly if
connected with a wire. That is a bug.

Since A cannot know what the app it's talking to will desire, I argue
that both sides of a message must be opted in to this optimization.


How can you make a generic app C know how to do this?  The path could be,
for instance:

eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> 
vethD <-> appC

There are no sockets on vethB, but it does need to have special behaviour to
elide csums.  Even if appC is hacked to know how to twiddle something on its
veth port, mucking with vethD will have no effect on vethB.

With regard to your example above, why would A corrupt packets?  My guess:

1)  It has bugs (so, fix the bugs, it could equally create incorrect data with 
proper checksums,
so just enabling checksumming adds no useful protection.)

2)  It means to corrupt frames.  In that case, someone must expect that C 
should receive incorrect
frames, otherwise why bother making App-A corrupt them in the first place?

3)  You are explicitly trying to test the kernel checksum logic, so you want 
the kernel to
detect the bad checksum and throw away the packet.  In this case, just 
don't set the socket
option in appA to elide checksums and the packet will be thrown away.

Any other cases you can think of?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Ben Greear



On 04/30/2016 02:36 PM, Vijay Pandurangan wrote:

On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear <gree...@candelatech.com> wrote:



On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:


On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <gree...@candelatech.com>
wrote:




On 04/30/2016 12:54 PM, Tom Herbert wrote:



We've put considerable effort into cleaning up the checksum interface
to make it as unambiguous as possible, please be very careful to
follow it. Broken checksum processing is really hard to detect and
debug.

CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
(indicated by csum_level) have been verified to be correct in a
packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
checksum it would refer to has not been verified and is incorrect this
is a major bug.




Suppose I know that the packet received on a packet-socket has
already been verified by a NIC that supports hardware checksumming.

Then, I want to transmit it on a veth interface using a second
packet socket.  I do not want veth to recalculate the checksum on
transmit, nor to validate it on the peer veth on receive, because I do
not want to waste the CPU cycles.  I am assuming that my app is not
accidentally corrupting frames, so the checksum can never be bad.

How should the checksumming be configured for the packets going into
the packet-socket from user-space?




It seems like that only the receiver should decide whether or not to
checksum packets on the veth, not the sender.

How about:

We could add a receiving socket option for "don't checksum packets
received from a veth when the other side has marked them as
elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
socket option for "mark all data sent via this socket to a veth as
elide-checksum-suggested".

So the process would be:

Writer:
1. open read socket
2. open write socket, with option elide-checksum-for-veth-suggested
3. write data

Reader:
1. open read socket with "follow-elide-checksum-suggestions-on-veth"
2. read data

The kernel / module would then need to persist the flag on all packets
that traverse a veth, and drop these data when they leave the veth
module.



I'm not sure this works completely.  In my app, the packet flow might be:

eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB
<-> [kernel router/bridge logic ...] <-> eth1


Good point, so if you had:

eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> veth B <->
userspace-stub <->eth1

and user-space hub enabled this elide flag, things would work, right?
Then, it seems like what we need is a way to tell the kernel
router/bridge logic to follow elide signals in packets coming from
veth. I'm not sure what the best way to do this is because I'm less
familiar with conventions in that part of the kernel, but assuming
there's a way to do this, would it be acceptable?


You cannot receive on one veth without transmitting on the other, so
I think the elide csum logic can go on the raw-socket, and apply to packets
in the transmit-from-user-space direction.  Just allowing the socket to make
the veth behave like it used to before this patch in question should be good
enough, since that worked for us for years.  So, just an option to modify the
ip_summed for pkts sent on a socket is probably sufficient.


There may be no sockets on the vethB port.  And reader/writer is not
a good way to look at it since I am implementing a bi-directional bridge in
user-space and each packet-socket is for both rx and tx.


Sure, but we could model a bidrectional connection as two
unidirectional sockets for our discussions here, right?


Best not to I think, you want to make sure that one socket can
correctly handle tx and rx.  As long as that works, then using
uni-directional sockets should work too.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Ben Greear



On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:

On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <gree...@candelatech.com> wrote:



On 04/30/2016 12:54 PM, Tom Herbert wrote:


We've put considerable effort into cleaning up the checksum interface
to make it as unambiguous as possible, please be very careful to
follow it. Broken checksum processing is really hard to detect and
debug.

CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
(indicated by csum_level) have been verified to be correct in a
packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
checksum it would refer to has not been verified and is incorrect this
is a major bug.



Suppose I know that the packet received on a packet-socket has
already been verified by a NIC that supports hardware checksumming.

Then, I want to transmit it on a veth interface using a second
packet socket.  I do not want veth to recalculate the checksum on
transmit, nor to validate it on the peer veth on receive, because I do
not want to waste the CPU cycles.  I am assuming that my app is not
accidentally corrupting frames, so the checksum can never be bad.

How should the checksumming be configured for the packets going into
the packet-socket from user-space?



It seems like that only the receiver should decide whether or not to
checksum packets on the veth, not the sender.

How about:

We could add a receiving socket option for "don't checksum packets
received from a veth when the other side has marked them as
elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
socket option for "mark all data sent via this socket to a veth as
elide-checksum-suggested".

So the process would be:

Writer:
1. open read socket
2. open write socket, with option elide-checksum-for-veth-suggested
3. write data

Reader:
1. open read socket with "follow-elide-checksum-suggestions-on-veth"
2. read data

The kernel / module would then need to persist the flag on all packets
that traverse a veth, and drop these data when they leave the veth
module.


I'm not sure this works completely.  In my app, the packet flow might be:

eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB <-> 
[kernel router/bridge logic ...] <-> eth1

There may be no sockets on the vethB port.  And reader/writer is not
a good way to look at it since I am implementing a bi-directional bridge in
user-space and each packet-socket is for both rx and tx.


Also, I might want to send raw frames that do have
broken checksums (lets assume a real NIC, not veth), and I want them
to hit the wire with those bad checksums.


How do I configure the checksumming in this case?



Correct me if I'm wrong but I think this is already possible now. You
can have packets with incorrect checksum hitting the wire as is. What
you cannot do is instruct the receiving end to ignore the checksum
from the sending end when using a physical device (and something I
think we should mimic on the sending device).


Yes, it does work currently (or, last I checked)...I just want to make sure it 
keeps working.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Ben Greear


On 04/30/2016 12:54 PM, Tom Herbert wrote:

We've put considerable effort into cleaning up the checksum interface
to make it as unambiguous as possible, please be very careful to
follow it. Broken checksum processing is really hard to detect and
debug.

CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
(indicated by csum_level) have been verified to be correct in a
packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
checksum it would refer to has not been verified and is incorrect this
is a major bug.


Suppose I know that the packet received on a packet-socket has
already been verified by a NIC that supports hardware checksumming.

Then, I want to transmit it on a veth interface using a second
packet socket.  I do not want veth to recalculate the checksum on
transmit, nor to validate it on the peer veth on receive, because I do
not want to waste the CPU cycles.  I am assuming that my app is not
accidentally corrupting frames, so the checksum can never be bad.

How should the checksumming be configured for the packets going into
the packet-socket from user-space?

Also, I might want to send raw frames that do have
broken checksums (lets assume a real NIC, not veth), and I want them
to hit the wire with those bad checksums.

How do I configure the checksumming in this case?


Thanks,
Ben




Tom

On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear <gree...@candelatech.com> wrote:



On 04/30/2016 11:33 AM, Ben Hutchings wrote:


On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:


Hello,





http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a


Actually, no, this is not really a regression.


[...]

It really is.  Even though the old behaviour was a bug (raw packets
should not be changed), if there are real applications that depend on
that then we have to keep those applications working somehow.



To be honest, I fail to see why the old behaviour is a bug when sending
raw packets from user-space.  If raw packets should not be changed, then
we need some way to specify what the checksum setting is to begin with,
otherwise, user-space has not enough control.

A socket option for new programs, and sysctl configurable defaults for raw
sockets for old binary programs, would be sufficient I think.


Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


  1   2   3   4   5   >