Re: [RFC 2/7] ath10k: Add support to process rx packet in thread

2021-03-22 Thread Ben Greear

On 3/22/21 6:20 PM, Brian Norris wrote:

On Mon, Mar 22, 2021 at 4:58 PM Ben Greear  wrote:

On 7/22/20 6:00 AM, Felix Fietkau wrote:

On 2020-07-22 14:55, Johannes Berg wrote:

On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:


I'm considering testing a different approach (with mt76 initially):
- Add a mac80211 rx function that puts processed skbs into a list
instead of handing them to the network stack directly.


Would this be *after* all the mac80211 processing, i.e. in place of the
rx-up-to-stack?

Yes, it would run all the rx handlers normally and then put the
resulting skbs into a list instead of calling netif_receive_skb or
napi_gro_frags.


Whatever came of this?  I realized I'm running Felix's patch since his mt76
driver needs it.  Any chance it will go upstream?


If you're asking about $subject (moving NAPI/RX to a thread), this
landed upstream recently:
http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715

It needs a bit of coaxing to work on a WiFi driver (including: WiFi
drivers tend to have a different netdev for NAPI than they expose to
/sys/class/net/), but it's there.

I'm not sure if people had something else in mind in the stuff you're
quoting though.


No, I got it confused with something Felix did:

https://github.com/greearb/mt76/blob/master/patches/0001-net-add-support-for-threaded-NAPI-polling.patch

Maybe the NAPI/RX to a thread thing superceded Felix's patch?

Thanks,
Ben



Brian




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [RFC 2/7] ath10k: Add support to process rx packet in thread

2021-03-22 Thread Ben Greear

On 7/22/20 6:00 AM, Felix Fietkau wrote:

On 2020-07-22 14:55, Johannes Berg wrote:

On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:


I'm considering testing a different approach (with mt76 initially):
- Add a mac80211 rx function that puts processed skbs into a list
instead of handing them to the network stack directly.


Would this be *after* all the mac80211 processing, i.e. in place of the
rx-up-to-stack?

Yes, it would run all the rx handlers normally and then put the
resulting skbs into a list instead of calling netif_receive_skb or
napi_gro_frags.


Whatever came of this?  I realized I'm running Felix's patch since his mt76
driver needs it.  Any chance it will go upstream?

Thanks,
Ben



- Felix




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: VRF: ssh port forwarding between non-vrf and vrf interface.

2021-01-26 Thread Ben Greear

On 1/22/21 8:02 AM, David Ahern wrote:

On 1/22/21 8:45 AM, Ben Greear wrote:

Hello,

I have a system with a management interface that is not in any VRF, and
then I have
a port that *is* in a VRF.  I'd like to be able to set up ssh port
forwarding so that
when I log into the system on the management interface it will
automatically forward to
an IP accessible through the VRF interface.

Is there a way to do such a thing?



For a while I had a system setup with eth0 in a management VRF and setup
to do NAT and port forwarding of incoming ssh connections, redirecting
to VMs running in a different namespace. Crossing VRFs with netfilter
most likely will not work without some development. You might be able to
do it with XDP - rewrite packet headers and redirect. That too might
need a bit of development depending on the netdevs involved.



Maybe easier to improve ssh so that it could specify a netdev to bind to when
making the call to the redirected destination?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH net] iwlwifi: provide gso_type to GSO packets

2021-01-25 Thread Ben Greear

On 1/25/21 7:09 AM, Eric Dumazet wrote:

From: Eric Dumazet 

net/core/tso.c got recent support for USO, and this broke iwlfifi
because the driver implemented a limited form of GSO.

Providing ->gso_type allows for skb_is_gso_tcp() to provide
a correct result.

Fixes: 3d5b459ba0e3 ("net: tso: add UDP segmentation support")
Signed-off-by: Eric Dumazet 
Reported-by: Ben Greear 
Bisected-by: Ben Greear 


I appreciate the credit, but the bisect and some other initial bug hunting was
done by people on this thread:

https://bugzilla.kernel.org/show_bug.cgi?id=209913

Thanks,
Ben


Tested-by: Ben Greear 
Cc: Luca Coelho 
Cc: linux-wirel...@vger.kernel.org
Cc: Johannes Berg 
---
  drivers/net/wireless/intel/iwlwifi/mvm/tx.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c 
b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index 
a983c215df310776ffe67f3b3ffa203eab609bfc..3712adc3ccc2511d46bcc855efbfba41c487d8e6
 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -773,6 +773,7 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb, unsigned int 
num_subframes,
  
  	next = skb_gso_segment(skb, netdev_flags);

skb_shinfo(skb)->gso_size = mss;
+   skb_shinfo(skb)->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
if (WARN_ON_ONCE(IS_ERR(next)))
return -EINVAL;
else if (next)
@@ -795,6 +796,8 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb, unsigned int 
num_subframes,
  
  		if (tcp_payload_len > mss) {

skb_shinfo(tmp)->gso_size = mss;
+   skb_shinfo(tmp)->gso_type = ipv4 ? SKB_GSO_TCPV4 :
+  SKB_GSO_TCPV6;
} else {
if (qos) {
            u8 *qc;




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


VRF: ssh port forwarding between non-vrf and vrf interface.

2021-01-22 Thread Ben Greear

Hello,

I have a system with a management interface that is not in any VRF, and then I 
have
a port that *is* in a VRF.  I'd like to be able to set up ssh port forwarding 
so that
when I log into the system on the management interface it will automatically 
forward to
an IP accessible through the VRF interface.

Is there a way to do such a thing?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: 5.10.4+ hang with 'rmmod nf_conntrack'

2021-01-08 Thread Ben Greear

On 1/7/21 10:16 PM, Florian Westphal wrote:

Ben Greear  wrote:

I noticed my system has a hung process trying to 'rmmod nf_conntrack'.

I've generally been doing the script that calls rmmod forever,
but only extensively tested on 5.4 kernel and earlier.

If anyone has any ideas, please let me know.  This is from 'sysrq t'.  I don't 
see
any hung-task splats in dmesg.


rmmod on conntrack loops forever until the active conntrack object count 
reaches 0.
(plus a walk of the conntrack table to evict/put all entries).


I'll see if it is reproducible and if so will try
with lockdep enabled...


No idea, there was a regression in 5.6, but that was fixed by the time
5.7 was released.

Can't reproduce hangs with a script that injects a few dummy entries
and then removes the module:

added=0

add_and_rmmod()
{
 while [ $added -lt 1000 ]; do
 conntrack -I -s 
$(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%255+1)) \
 -d 
$(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%255+1)) \
  --protonum 6 --timeout $(((RANDOM%120) + 240)) --state 
ESTABLISHED --sport $RANDOM --dport $RANDOM 2> /dev/null || break

 added=$((added + 1))
 if [ $((added % 1000)) -eq 0 ];then
 echo $added
 fi
 done

 echo rmmod after adding $added entries
 conntrack -C
 rmmod nf_conntrack_netlink
 rmmod nf_conntrack
}

add_and_rmmod

I don't see how it would make a difference, but do you have any special 
conntrack features enabled
at run time, e.g. reliable netlink events? (If you don't know what I mean the 
answer is no).


Not that I know of, but I am using lots of VRF devices, each with their own 
routing table, as well
as some wifi stations and AP netdevs.

I'll let you know if I can reproduce it again..

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


5.10.4+ hang with 'rmmod nf_conntrack'

2021-01-07 Thread Ben Greear

I noticed my system has a hung process trying to 'rmmod nf_conntrack'.

I've generally been doing the script that calls rmmod forever,
but only extensively tested on 5.4 kernel and earlier.

If anyone has any ideas, please let me know.  This is from 'sysrq t'.  I don't 
see
any hung-task splats in dmesg.  I'll see if it is reproducible and if so will 
try
with lockdep enabled...

21497 Jan 07 16:12:05 TR-398 kernel: task:rmmod   state:R  running task 
stack:0 pid: 4107 ppid:  4054 flags:0x4084
21498 Jan 07 16:12:05 TR-398 kernel: Call Trace:
21499 Jan 07 16:12:05 TR-398 kernel:  ? do_softirq_own_stack+0x32/0x40
21500 Jan 07 16:12:05 TR-398 kernel:  ? irq_exit_rcu+0x39/0x90
21501 Jan 07 16:12:05 TR-398 kernel:  ? sysvec_apic_timer_interrupt+0x34/0x80
21502 Jan 07 16:12:05 TR-398 kernel:  ? 
asm_sysvec_apic_timer_interrupt+0x12/0x20
21503 Jan 07 16:12:05 TR-398 kernel:  ? nf_conntrack_attach+0x30/0x30 
[nf_conntrack]
21504 Jan 07 16:12:05 TR-398 kernel:  ? _raw_spin_lock+0x12/0x20
21505 Jan 07 16:12:05 TR-398 kernel:  ? do_softirq_own_stack+0x32/0x40
21506 Jan 07 16:12:05 TR-398 kernel:  ? nf_conntrack_lock+0x9/0x40 
[nf_conntrack]
21507 Jan 07 16:12:05 TR-398 kernel:  ? nf_ct_iterate_cleanup+0x88/0x140 
[nf_conntrack]
21508 Jan 07 16:12:05 TR-398 kernel:  ? nf_conntrack_cleanup_net_list+0x36/0xc0 
[nf_conntrack]
21509 Jan 07 16:12:05 TR-398 kernel:  ? unregister_pernet_operations+0xcc/0x130
21510 Jan 07 16:12:05 TR-398 kernel:  ? unregister_pernet_subsys+0x18/0x30
21511 Jan 07 16:12:05 TR-398 kernel:  ? nf_conntrack_standalone_fini+0x11/0x425 
[nf_conntrack]
21512 Jan 07 16:12:05 TR-398 kernel:  ? __x64_sys_delete_module+0x131/0x270
21513 Jan 07 16:12:05 TR-398 kernel:  ? syscall_trace_enter.isra.21+0xf9/0x190
21514 Jan 07 16:12:05 TR-398 kernel:  ? do_syscall_64+0x2d/0x70
21515 Jan 07 16:12:05 TR-398 kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: net: tso: add UDP segmentation support: adds regression for ax200 upload

2020-12-24 Thread Ben Greear

On 12/21/20 12:01 PM, Rainer Suhm wrote:

Am 21.12.20 um 20:14 schrieb Eric Dumazet:

On Mon, Dec 21, 2020 at 8:04 PM Eric Dumazet  wrote:


On Mon, Dec 21, 2020 at 7:46 PM Eric Dumazet  wrote:


On Sat, Dec 19, 2020 at 5:55 PM Ben Greear  wrote:


On 12/19/20 7:18 AM, Johannes Berg wrote:

On Fri, 2020-12-18 at 12:16 -0800, Jakub Kicinski wrote:

On Thu, 17 Dec 2020 12:40:26 -0800 Ben Greear wrote:

On 12/17/20 10:20 AM, Eric Dumazet wrote:

On Thu, Dec 17, 2020 at 7:13 PM Ben Greear  wrote:

It is the iwlwifi/mvm logic that supports ax200.


Let me ask again :

I see two different potential call points :

drivers/net/wireless/intel/iwlwifi/pcie/tx.c:1529:
tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len);
drivers/net/wireless/intel/iwlwifi/queue/tx.c:427:
tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len);

To the best of your knowledge, which one would be used in your case ?

Both are horribly complex, I do not want to spend time studying two
implementations.


It is the queue/tx.c code that executes on my system, verified with
printk.


Not sure why Intel's not on CC here.


Heh :)

Let's also add linux-wireless.


Luca, is the ax200 TSO performance regression with recent kernel on your
radar?


It wasn't on mine for sure, so far. But it's supposed to be Christmas
vacation, so haven't checked our bug tracker etc. I see Emmanuel was at
least looking at the bug report, but not sure what else happened yet.


Not to bitch and moan too much, but even the most basic of testing would
have shown this, how can testing be so poor on the ax200 driver?

It even shows up with the out-of-tree ax200 driver.


Off the top of my head, I don't really see the issue. Does anyone have
the ability to capture the frames over the air (e.g. with another AX200
in monitor mode, load the driver with amsdu_size=3 module parameter to
properly capture A-MSDUs)?


I can do that at some point, and likely it could be reproduced with an /n or /ac
AP and those are a lot easier to sniff.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


It seems the problem comes from some skbs reaching the driver with
gso_type == 0,
meaning skb_is_gso_tcp() is fuzzy. (net/core/tso.c is only one of the
skb_is_gso_tcp() users)

Local TCP stack should provide either SKB_GSO_TCPV4 or SKB_GSO_TCPV6
for GSO packets.

So maybe the issue is coming from traffic coming from a VM through a
tun device or something,
and our handling of GSO_ROBUST / DODGY never cared about setting
SKB_GSO_TCPV4 or SKB_GSO_TCPV6 if not already given by user space ?

Or a plain bug somewhere, possibly overwriting  gso_type with 0 or garbage...


Oh well, iwl_mvm_tx_tso_segment() 'builds' a fake gso packet.

I suspect this will fix the issue :

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index 
a983c215df310776ffe67f3b3ffa203eab609bfc..e7ad6367c88de4aff700c630d850760d1d3bf011
100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -773,6 +773,7 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb,
unsigned int num_subframes,

 next = skb_gso_segment(skb, netdev_flags);
 skb_shinfo(skb)->gso_size = mss;
+   skb_shinfo(skb)->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
 if (WARN_ON_ONCE(IS_ERR(next)))
 return -EINVAL;
 else if (next)



Or more precisely :

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index 
a983c215df310776ffe67f3b3ffa203eab609bfc..11145bf29f3cbeefcce1a05cc81fd90978f2cbfe
100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -773,6 +773,7 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb,
unsigned int num_subframes,

 next = skb_gso_segment(skb, netdev_flags);
 skb_shinfo(skb)->gso_size = mss;
+   skb_shinfo(skb)->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
 if (WARN_ON_ONCE(IS_ERR(next)))
 return -EINVAL;
 else if (next)
@@ -795,6 +796,7 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb,
unsigned int num_subframes,

 if (tcp_payload_len > mss) {
 skb_shinfo(tmp)->gso_size = mss;
+   skb_shinfo(tmp)->gso_type = ipv4 ?
SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
 } else {
 if (qos) {
 u8 *qc;




This looks good to me.
Transmission rate is in the expected range. iperf3 shows no retries anymore.

Here is my kernel log with the above changes applied, and the debug patches 
from Eric.


I tested this successfully as well.

Eric:  Thanks for the patch!

--Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: net: tso: add UDP segmentation support: adds regression for ax200 upload

2020-12-19 Thread Ben Greear

On 12/19/20 7:18 AM, Johannes Berg wrote:

On Fri, 2020-12-18 at 12:16 -0800, Jakub Kicinski wrote:

On Thu, 17 Dec 2020 12:40:26 -0800 Ben Greear wrote:

On 12/17/20 10:20 AM, Eric Dumazet wrote:

On Thu, Dec 17, 2020 at 7:13 PM Ben Greear  wrote:

It is the iwlwifi/mvm logic that supports ax200.


Let me ask again :

I see two different potential call points :

drivers/net/wireless/intel/iwlwifi/pcie/tx.c:1529:
tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len);
drivers/net/wireless/intel/iwlwifi/queue/tx.c:427:
tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len);

To the best of your knowledge, which one would be used in your case ?

Both are horribly complex, I do not want to spend time studying two
implementations.


It is the queue/tx.c code that executes on my system, verified with
printk.


Not sure why Intel's not on CC here.


Heh :)

Let's also add linux-wireless.


Luca, is the ax200 TSO performance regression with recent kernel on your
radar?


It wasn't on mine for sure, so far. But it's supposed to be Christmas
vacation, so haven't checked our bug tracker etc. I see Emmanuel was at
least looking at the bug report, but not sure what else happened yet.


Not to bitch and moan too much, but even the most basic of testing would
have shown this, how can testing be so poor on the ax200 driver?

It even shows up with the out-of-tree ax200 driver.


Off the top of my head, I don't really see the issue. Does anyone have
the ability to capture the frames over the air (e.g. with another AX200
in monitor mode, load the driver with amsdu_size=3 module parameter to
properly capture A-MSDUs)?


I can do that at some point, and likely it could be reproduced with an /n or /ac
AP and those are a lot easier to sniff.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery

2020-12-17 Thread Ben Greear

On 12/17/20 2:24 PM, Brian Norris wrote:

On Tue, Dec 15, 2020 at 10:23:33AM -0800, Ben Greear wrote:

On 12/15/20 9:21 AM, Youghandhar Chintala wrote:

From: Rakesh Pillai 

Currently in case of target hardware restart ,we just reconfig and
re-enable the security keys and enable the network queues to start
data traffic back from where it was interrupted.


Are there any known mac80211 radios/drivers that *can* support seamless 
restarts?

If not, then just could always enable this feature in mac80211?


I'm quite sure that iwlwifi intentionally supports a seamless restart.
 From my experience with dealing with user reports, I don't recall any
issues where restart didn't function as expected, unless there was some
deeper underlying failure (e.g., hardware/power failure; driver bugs /
lockups).

I don't have very good stats for ath10k/QCA6174, but it survives
our testing OK and I again don't recall any user-reported complaints in
this area. I'd say this is a weaker example though, as I don't have as
clear of data. (By contrast, ath10k/WCN399x, which Rakesh, et al, are
patching here, does not pass our tests at all, and clearly fails to
recover from "seamless" restarts, as noted in patch 3.)

I'd also note that we don't operate in AP mode -- only STA -- and IIRC
Ben, you've complained about AP mode in the past.


I complain about all sorts of things, but I'm usually running
station mode :)

Do you actually see iwlwifi stations stay associated through
firmware crashes?

Anyway, happy to hear some have seamless recovery, and in that case,
I have no objections to the patch.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: net: tso: add UDP segmentation support: adds regression for ax200 upload

2020-12-17 Thread Ben Greear

On 12/17/20 10:20 AM, Eric Dumazet wrote:

On Thu, Dec 17, 2020 at 7:13 PM Ben Greear  wrote:






It is the iwlwifi/mvm logic that supports ax200.


Let me ask again :

I see two different potential call points :

drivers/net/wireless/intel/iwlwifi/pcie/tx.c:1529:
tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len);
drivers/net/wireless/intel/iwlwifi/queue/tx.c:427:
tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len);

To the best of your knowledge, which one would be used in your case ?

Both are horribly complex, I do not want to spend time studying two
implementations.


It is the queue/tx.c code that executes on my system, verified with
printk.

Thanks,
Ben



Thanks.




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: net: tso: add UDP segmentation support: adds regression for ax200 upload

2020-12-17 Thread Ben Greear



On 12/17/2020 10:07 AM, Eric Dumazet wrote:

On Thu, Dec 17, 2020 at 6:56 PM Ben Greear  wrote:

On 12/17/20 2:11 AM, Eric Dumazet wrote:

On Thu, Dec 17, 2020 at 12:59 AM Ben Greear  wrote:

On 12/16/20 3:09 PM, Ben Greear wrote:

Hello Eric,

The patch below evidently causes TCP throughput to be about 50Mbps instead of 
700Mbps
when using ax200 to upload tcp traffic.

When I disable TSO, performance goes back up to around 700Mbps.

As a followup, when I revert the patch, upload speed goes to ~900Mbps,
so even better than just disabling TSO (I left TSO enabled after reverting the 
patch).

Thanks,
Ben


Thanks for the report !

It seems drivers/net/wireless/intel/iwlwifi/pcie/tx.c:iwl_fill_data_tbs_amsdu()
calls tso_build_hdr() with extra bytes (SNAP header),
it is not yet clear to me what is broken :/

Your patch is guessing tcp vs udp by looking at header length
from what I could tell.  So if something uses a different size,
it probably gets confused?

I do not think so, my patch selects TCP vs UDP by using standard GSO
helper skb_is_gso_tcp(skb)

tso->tlen is initialized from tso_start() :

int tlen = skb_is_gso_tcp(skb) ? tcp_hdrlen(skb) : sizeof(struct udphdr);

tso->tlen = tlen;

Maybe for some reason skb_is_gso_tcp(skb) returns false in your case,
some debugging would help.


Can you confirm which driver is used for ax200 ?

I see tso_build_hdr() also being used from
drivers/net/wireless/intel/iwlwifi/queue/tx.c

I tested against the un-modified ax200 5.10.0 kernel driver, and it has the 
issue.

The ax200 backports release/core56 driver acts a bit different (poorer 
performance over all than
in-kernel driver), but has similar upstream issues that are mitigated by
disabling TSO.

Sorry, I can not find ax200 driver.


It is the iwlwifi/mvm logic that supports ax200.

Thanks,

Ben




Re: net: tso: add UDP segmentation support: adds regression for ax200 upload

2020-12-17 Thread Ben Greear

On 12/17/20 2:11 AM, Eric Dumazet wrote:

On Thu, Dec 17, 2020 at 12:59 AM Ben Greear  wrote:


On 12/16/20 3:09 PM, Ben Greear wrote:

Hello Eric,

The patch below evidently causes TCP throughput to be about 50Mbps instead of 
700Mbps
when using ax200 to upload tcp traffic.

When I disable TSO, performance goes back up to around 700Mbps.


As a followup, when I revert the patch, upload speed goes to ~900Mbps,
so even better than just disabling TSO (I left TSO enabled after reverting the 
patch).

Thanks,
Ben



Thanks for the report !

It seems drivers/net/wireless/intel/iwlwifi/pcie/tx.c:iwl_fill_data_tbs_amsdu()
calls tso_build_hdr() with extra bytes (SNAP header),
it is not yet clear to me what is broken :/


Your patch is guessing tcp vs udp by looking at header length
from what I could tell.  So if something uses a different size,
it probably gets confused?



Can you confirm which driver is used for ax200 ?

I see tso_build_hdr() also being used from
drivers/net/wireless/intel/iwlwifi/queue/tx.c


I tested against the un-modified ax200 5.10.0 kernel driver, and it has the 
issue.

The ax200 backports release/core56 driver acts a bit different (poorer 
performance over all than
in-kernel driver), but has similar upstream issues that are mitigated by
disabling TSO.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: net: tso: add UDP segmentation support: adds regression for ax200 upload

2020-12-16 Thread Ben Greear

On 12/16/20 3:09 PM, Ben Greear wrote:

Hello Eric,

The patch below evidently causes TCP throughput to be about 50Mbps instead of 
700Mbps
when using ax200 to upload tcp traffic.

When I disable TSO, performance goes back up to around 700Mbps.


As a followup, when I revert the patch, upload speed goes to ~900Mbps,
so even better than just disabling TSO (I left TSO enabled after reverting the 
patch).

Thanks,
Ben



I recall ~5 years ago we had similar TCP related performance issues with ath10k.
I vaguely recall that there might be some driver-level socket pacing tuning 
value, but I cannot
find the right thing to search for.  Is this really a thing?  If so, maybe it 
will
be a way to resolve this issue?

See this more thorough bug report:

https://bugzilla.kernel.org/show_bug.cgi?id=209913

Patch description:
net: tso: add UDP segmentation support
Note that like TCP, we do not support additional encapsulations,
and that checksums must be offloaded to the NIC.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 

Thanks,
Ben





net: tso: add UDP segmentation support: adds regression for ax200 upload

2020-12-16 Thread Ben Greear

Hello Eric,

The patch below evidently causes TCP throughput to be about 50Mbps instead of 
700Mbps
when using ax200 to upload tcp traffic.

When I disable TSO, performance goes back up to around 700Mbps.

I recall ~5 years ago we had similar TCP related performance issues with ath10k.
I vaguely recall that there might be some driver-level socket pacing tuning 
value, but I cannot
find the right thing to search for.  Is this really a thing?  If so, maybe it 
will
be a way to resolve this issue?

See this more thorough bug report:

https://bugzilla.kernel.org/show_bug.cgi?id=209913

Patch description:
net: tso: add UDP segmentation support
Note that like TCP, we do not support additional encapsulations,
and that checksums must be offloaded to the NIC.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery

2020-12-15 Thread Ben Greear

On 12/15/20 9:21 AM, Youghandhar Chintala wrote:

From: Rakesh Pillai 

Currently in case of target hardware restart ,we just reconfig and
re-enable the security keys and enable the network queues to start
data traffic back from where it was interrupted.


Are there any known mac80211 radios/drivers that *can* support seamless 
restarts?

If not, then just could always enable this feature in mac80211?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2 1/3] ath10k: Add history for tracking certain events

2020-07-31 Thread Ben Greear

On 7/31/20 11:27 AM, Rakesh Pillai wrote:

Add history for tracking the below events
- register read
- register write
- IRQ trigger
- NAPI poll
- CE service
- WMI cmd
- WMI event
- WMI tx completion

This will help in debugging any crash or any
improper behaviour.

Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1-01040-QCAHLSWMTPLZ-1

Signed-off-by: Rakesh Pillai 
---
  drivers/net/wireless/ath/ath10k/ce.c  |   1 +
  drivers/net/wireless/ath/ath10k/core.h|  74 +
  drivers/net/wireless/ath/ath10k/debug.c   | 133 ++
  drivers/net/wireless/ath/ath10k/debug.h   |  74 +
  drivers/net/wireless/ath/ath10k/snoc.c|  15 +++-
  drivers/net/wireless/ath/ath10k/wmi-tlv.c |   1 +
  drivers/net/wireless/ath/ath10k/wmi.c |  10 +++
  7 files changed, 307 insertions(+), 1 deletion(-)




+void ath10k_record_wmi_event(struct ath10k *ar, enum ath10k_wmi_type type,
+u32 id, unsigned char *data)
+{
+   struct ath10k_wmi_event_entry *entry;
+   u32 idx;
+
+   if (type == ATH10K_WMI_EVENT) {
+   if (!ar->wmi_event_history.record)
+   return;


This check above is duplicated below, add it once at top of the method
instead.


+
+   spin_lock_bh(&ar->wmi_event_history.hist_lock);
+   idx = ath10k_core_get_next_idx(&ar->reg_access_history.index,
+  
ar->wmi_event_history.max_entries);
+   spin_unlock_bh(&ar->wmi_event_history.hist_lock);
+   entry = &ar->wmi_event_history.record[idx];
+   } else {
+   if (!ar->wmi_cmd_history.record)
+   return;
+
+   spin_lock_bh(&ar->wmi_cmd_history.hist_lock);
+   idx = ath10k_core_get_next_idx(&ar->reg_access_history.index,
+  ar->wmi_cmd_history.max_entries);
+   spin_unlock_bh(&ar->wmi_cmd_history.hist_lock);
+   entry = &ar->wmi_cmd_history.record[idx];
+   }
+
+   entry->timestamp = ath10k_core_get_timestamp();
+   entry->cpu_id = smp_processor_id();
+   entry->type = type;
+   entry->id = id;
+   memcpy(&entry->data, data + 4, ATH10K_WMI_DATA_LEN);
+}
+EXPORT_SYMBOL(ath10k_record_wmi_event);



@@ -1660,6 +1668,11 @@ static int ath10k_snoc_probe(struct platform_device 
*pdev)
ar->ce_priv = &ar_snoc->ce;
msa_size = drv_data->msa_size;
  
+	ath10k_core_reg_access_history_init(ar, ATH10K_REG_ACCESS_HISTORY_MAX);

+   ath10k_core_wmi_event_history_init(ar, ATH10K_WMI_EVENT_HISTORY_MAX);
+   ath10k_core_wmi_cmd_history_init(ar, ATH10K_WMI_CMD_HISTORY_MAX);
+   ath10k_core_ce_event_history_init(ar, ATH10K_CE_EVENT_HISTORY_MAX);


Maybe only enable this once user turns it on?  It sucks up a bit of memory?


+
ath10k_snoc_quirks_init(ar);
  
  	ret = ath10k_snoc_resource_init(ar);

diff --git a/drivers/net/wireless/ath/ath10k/wmi-tlv.c 
b/drivers/net/wireless/ath/ath10k/wmi-tlv.c
index 932266d..9df5748 100644
--- a/drivers/net/wireless/ath/ath10k/wmi-tlv.c
+++ b/drivers/net/wireless/ath/ath10k/wmi-tlv.c
@@ -627,6 +627,7 @@ static void ath10k_wmi_tlv_op_rx(struct ath10k *ar, struct 
sk_buff *skb)
if (skb_pull(skb, sizeof(struct wmi_cmd_hdr)) == NULL)
goto out;
  
+	ath10k_record_wmi_event(ar, ATH10K_WMI_EVENT, id, skb->data);

trace_ath10k_wmi_event(ar, id, skb->data, skb->len);
  
  	consumed = ath10k_tm_event_wmi(ar, id, skb);

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
b/drivers/net/wireless/ath/ath10k/wmi.c
index a81a1ab..8ebd05c 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -1802,6 +1802,15 @@ struct sk_buff *ath10k_wmi_alloc_skb(struct ath10k *ar, 
u32 len)
  
  static void ath10k_wmi_htc_tx_complete(struct ath10k *ar, struct sk_buff *skb)

  {
+   struct wmi_cmd_hdr *cmd_hdr;
+   enum wmi_tlv_event_id id;
+
+   cmd_hdr = (struct wmi_cmd_hdr *)skb->data;
+   id = MS(__le32_to_cpu(cmd_hdr->cmd_id), WMI_CMD_HDR_CMD_ID);
+
+   ath10k_record_wmi_event(ar, ATH10K_WMI_TX_COMPL, id,
+   skb->data + sizeof(struct wmi_cmd_hdr));
+
dev_kfree_skb(skb);
  }


I think guard the above new code with if 
(unlikely(ar->ce_event_history.record)) { ... }

All in all, I think I'd want to compile this out (while leaving other debug 
compiled
in) since it seems this stuff would be rarely used and it adds method calls to 
hot
paths.

That is a decision for Kalle though, so see what he says...

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes

2020-05-28 Thread Ben Greear




On 05/28/2020 07:27 AM, Luis Chamberlain wrote:

On Wed, May 27, 2020 at 02:36:42PM -0700, Jakub Kicinski wrote:

On Wed, 27 May 2020 03:19:18 + Luis Chamberlain wrote:

I read your patch, and granted, I will accept I was under the incorrect
assumption that this can only be used by networking devices, however it
the devlink approach achieves getting userspace the ability with
iproute2 devlink util to query a device health, on to which we can peg
firmware health. But *this* patch series is not about health status and
letting users query it, its about a *critical* situation which has come up
with firmware requiring me to reboot my system, and the lack of *any*
infrastructure in the kernel today to inform userspace about it.

So say we use netlink to report a critical health situation, how are we
informing userspace with your patch series about requring a reboot?


One of main features of netlink is pub/sub model of notifications.

Whatever you imagine listening to your uevent can listen to
devlink-health notifications via devlink.

In fact I've shown this off in the RFC patches I sent to you, see
the devlink mon health command being used.


Yes but I looked at iputils2 devlink and seems I made an incorrect
assumption this can only be used for a network device rather than
a struct device.

I'll take a second look.


Hello Jakub,

I'm thinking about something similar to what Luis is proposing, but in
my case I'd like to report just when the driver knows the hardware is gone
and cannot be recovered, like when this is reported:

[ 2548.851832] WARNING: CPU: 3 PID: 98 at 
backports-4.19.98-1/net/mac80211/util.c:2040 ieee80211_reconfig+0x98/0xb64 
[mac80211]
[ 2548.856020] Hardware became unavailable during restart.

I'd like to be able to tie this into a watch-dog program to allow automatic 
reboot
of the system soon after this event is seen, for instance.

Could you post your devlink RFC patches somewhere public?

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [RFC 1/2] devlink: add simple fw crash helpers

2020-05-25 Thread Ben Greear




On 05/25/2020 02:07 AM, Andy Shevchenko wrote:

On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote:

On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain  wrote:



I had to go RTFM re: kernel taints because it has been a very long
time since I looked at them. It had always seemed to me that most were
caused by "kernel-unfriendly" user actions.  The most famous of course
is loading proprietary modules, out-of-tree modules, forced module
loads, etc...  Honestly, I had forgotten the large variety of uses of
the taint flags. For anyone who hasn't looked at taints recently, I
recommend: 
https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html

In light of this I don't object to setting a taint on this anymore.
I'm a little uneasy, but I've softened on it now, and now I feel it
depends on implementation.

Specifically, I don't think we should set a taint flag when a driver
easily handles a routine firmware crash and is confident that things
have come up just fine again. In other words, triggering the taint in
every driver module where it spits out a log comment that it had a
firmware crash and had to recover seems too much. Sure, firmware
shouldn't crash, sure it should be open source so we can fix it,
whatever...


While it may sound idealistic the firmware for the end-user, and even for mere
kernel developer like me, is a complete blackbox which has more access than
root user in the kernel. We have tons of firmwares and each of them potentially
dangerous beast. As a user I really care about my data and privacy (hacker can
oops a firmware in order to set a specific vector attack). So, tainting kernel
is _a least_ we can do there, the strict rules would be to reboot immediately.


those sort of wishful comments simply ignore reality and
our ability to affect effective change.


We can encourage users not to buy cheap crap for the starter.


There is no stable wifi firmware for any price.

There is also no obvious feedback from even name-brand NICs like ath10k or AX200
when you report a crash.

That said, at least in my experience with ath10k-ct, the OS normally recovers 
fine
from firmware crashes.  ath10k already reports full crash reports on udev, so
easy for user-space to notice and report bug reports upstream if it cares to.  
Probably
other NICs do the same, and if not, they certainly could.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

2020-05-18 Thread Ben Greear




On 05/18/2020 10:09 AM, Luis Chamberlain wrote:

On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:



On 05/18/2020 09:51 AM, Luis Chamberlain wrote:

On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:

On Fri, 2020-05-15 at 21:28 +, Luis Chamberlain wrote:> 
module_firmware_crashed

You didn't CC me or the wireless list on the rest of the patches, so I'm
replying to a random one, but ...

What is the point here?

This should in no way affect the integrity of the system/kernel, for
most devices anyway.


Keyword you used here is "most device". And in the worst case, *who*
knows what other odd things may happen afterwards.


So what if ath10k's firmware crashes? If there's a driver bug it will
not handle it right (and probably crash, WARN_ON, or something else),
but if the driver is working right then that will not affect the kernel
at all.


Sometimes the device can go into a state which requires driver removal
and addition to get things back up.


It would be lovely to be able to detect this case in the driver/system
somehow!  I haven't seen any such cases recently,


I assure you that I have run into it. Once it does again I'll report
the crash, but the problem with some of this is that unless you scrape
the log you won't know. Eventually, a uevent would indeed tell inform
me.


but in case there is
some common case you see, maybe we can think of a way to detect it?


ath10k is just one case, this patch series addresses a simple way to
annotate this tree-wide.


So maybe I can understand that maybe you want an easy way to discover -
per device - that the firmware crashed, but that still doesn't warrant a
complete kernel taint.


That is one reason, another is that a taint helps support cases *fast*
easily detect if the issue was a firmware crash, instead of scraping
logs for driver specific ways to say the firmware has crashed.


You can listen for udev events (I think that is the right term),
and find crashes that way.  You get the actual crash info as well.


My follow up to this was to add uevent to add_taint() as well, this way
these could generically be processed by userspace.


I'm not opposed to the taint, though I have not thought much on it.

But, if you can already get the crash info from uevent, and it automatically
comes without polling or scraping logs, then what benefit beyond that does
the taint give you?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

2020-05-18 Thread Ben Greear




On 05/18/2020 09:51 AM, Luis Chamberlain wrote:

On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:

On Fri, 2020-05-15 at 21:28 +, Luis Chamberlain wrote:> 
module_firmware_crashed

You didn't CC me or the wireless list on the rest of the patches, so I'm
replying to a random one, but ...

What is the point here?

This should in no way affect the integrity of the system/kernel, for
most devices anyway.


Keyword you used here is "most device". And in the worst case, *who*
knows what other odd things may happen afterwards.


So what if ath10k's firmware crashes? If there's a driver bug it will
not handle it right (and probably crash, WARN_ON, or something else),
but if the driver is working right then that will not affect the kernel
at all.


Sometimes the device can go into a state which requires driver removal
and addition to get things back up.


It would be lovely to be able to detect this case in the driver/system
somehow!  I haven't seen any such cases recently, but in case there is
some common case you see, maybe we can think of a way to detect it?




So maybe I can understand that maybe you want an easy way to discover -
per device - that the firmware crashed, but that still doesn't warrant a
complete kernel taint.


That is one reason, another is that a taint helps support cases *fast*
easily detect if the issue was a firmware crash, instead of scraping
logs for driver specific ways to say the firmware has crashed.


You can listen for udev events (I think that is the right term),
and find crashes that way.  You get the actual crash info as well.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH] ath10k: increase rx buffer size to 2048

2020-04-28 Thread Ben Greear




On 04/28/2020 05:01 AM, Kalle Valo wrote:

Sven Eckelmann  writes:


On Wednesday, 1 April 2020 09:00:49 CEST Sven Eckelmann wrote:

On Wednesday, 5 February 2020 20:10:43 CEST Linus Lüssing wrote:

From: Linus Lüssing 

Before, only frames with a maximum size of 1528 bytes could be
transmitted between two 802.11s nodes.

For batman-adv for instance, which adds its own header to each frame,
we typically need an MTU of at least 1532 bytes to be able to transmit
without fragmentation.

This patch now increases the maxmimum frame size from 1528 to 1656
bytes.

[...]

@Kalle, I saw that this patch was marked as deferred [1] but I couldn't find
any mail why it was done so. It seems like this currently creates real world
problems - so would be nice if you could explain shortly what is currently
blocking its acceptance.


Ping?


Sorry for the delay, my plan was to first write some documentation about
different hardware families but haven't managed to do that yet.

My problem with this patch is that I don't know what hardware and
firmware versions were tested, so it needs analysis before I feel safe
to apply it. The ath10k hardware families are very different that even
if a patch works perfectly on one ath10k hardware it could still break
badly on another one.

What makes me faster to apply ath10k patches is to have comprehensive
analysis in the commit log. This shows me the patch author has
considered about all hardware families, not just the one he is testing
on, and that I don't need to do the analysis myself.


It has been in ath10k-ct for a while, and that has some fairly wide coverage
in OpenWrt, so likely if there were problems we would have seen it already.

I did not make any specific changes to firmware to support this, so upstream
firmware should behave similarly.

Seems like upstream ath10k could really benefit from having some test beds
so you can actually test code on different chips and have confidence
in your changes!

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: Strange routing with VRF and 5.2.7+

2019-10-14 Thread Ben Greear

On 9/30/19 11:45 AM, Ben Greear wrote:

On 9/22/19 12:23 PM, David Ahern wrote:

On 9/20/19 9:57 AM, Ben Greear wrote:

On 9/10/19 6:08 PM, Ben Greear wrote:

On 9/10/19 3:17 PM, Ben Greear wrote:

Today we were testing creating 200 virtual station vdevs on ath9k,
and using
VRF for the routing.


Looks like the same issue happens w/out VRF, but there I have oodles
of routing
rules, so it is an area ripe for failure.

Will upgrade to 5.2.14+ and retest, and try 4.20 as well


Turns out, this was ipsec (strongswan) inserting a rule that pointed to
a table
that we then used for a vrf w/out realizing the rule was added.

Stopping strongswan and/or reconfiguring how routing tables are assigned
resolved the issue.



Hi Ben:

Since you are the pioneer with vrf and ipsec, can you add an ipsec
section with some notes to Documentation/networking/vrf.txt?


I need to to some more testing, an initial attempt to reproduce my working
config on another system did not work properly, and I have not yet dug into
it.


I'm still grinding out the bugs...  Here is my current quandry.

In the VRF I have the 'real' device, say eth4 with IP 192.168.5.5.  This talks 
to
the VPN gateway device at 192.168.5.1.

When I add the xfrm, it is given the address 192.168.10.100.

I need all traffic routing out the vrf to use the xfrm as source IP,
except the eth4 still needs to be able to talk to the 5.1 device
(I think?)

Evidently, adding this type of route below will do the trick, at least in
non-vrf setup, and with this route in its own table that is queried after
'local' routing table, but before the others via use of a fairly generic 
rule

default via 192.168.5.1 dev enp1s0 proto static src 192.168.10.100

I am guessing that in VRF world, I can get rid of the rule, and replace the
existing default route (given to eth4 when it does DHCP or is statically 
assigned)
with something like the above.  And, maybe I need a special route for the VPN
gateway itself as destination so that ipsec logic on eth4 can still talk to it?

(I am thinking of the case where the VPN gateway is not on the local subnet
and so we have to route to it special???)

Any insight is welcome.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: IPv6 addr and route is gone after adding port to vrf (5.2.0+)

2019-10-14 Thread Ben Greear

On 10/11/19 1:35 PM, David Ahern wrote:

On 10/11/19 7:57 AM, Ben Greear wrote:

The down-up cycling is done on purpose - to clear out neigh entries and
routes associated with the device under the old VRF. All entries must be
created with the device in the new VRF.


I believe I found another thing to be aware of relating to this.

My logic has been to do supplicant, then do DHCP, and only when DHCP
responds do I set up the networking for the wifi station.

It is at this time that I would be creating a VRF (or using routing rules
if not using VRF).

But, when I add the station to the newly created vrf, then it bounces it,
and that causes supplicant to have to re-associate  (I think, lots of
moving
pieces, so I could be missing something).

Any chance you could just clear the neighbor entries and routes w/out
bouncing
the interface?


yes, it is annoying. I have been meaning to fix that, but never found
the motivation to do it. If you have the time, it would be worth
avoiding the overhead.


I changed my code so that it adds to the vrf first, so I too am lacking
motivation and time to dig into the kernel at the moment.  I'll let you know
if I find time to work on it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: IPv6 addr and route is gone after adding port to vrf (5.2.0+)

2019-10-11 Thread Ben Greear

On 8/16/19 2:48 PM, David Ahern wrote:

On 8/16/19 3:28 PM, Ben Greear wrote:

On 8/16/19 12:15 PM, David Ahern wrote:

On 8/16/19 1:13 PM, Ben Greear wrote:

I have a problem with a VETH port when setting up a somewhat complicated
VRF setup. I am loosing the global IPv6 addr, and also the route,
apparently
when I add the veth device to a vrf.  From my script's output:


Either enslave the device before adding the address or enable the
retention of addresses:

sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1



Thanks, I added it to the vrf first just in case some other logic was
expecting the routes to go away on network down.

That part now seems to be working.



The down-up cycling is done on purpose - to clear out neigh entries and
routes associated with the device under the old VRF. All entries must be
created with the device in the new VRF.


I believe I found another thing to be aware of relating to this.

My logic has been to do supplicant, then do DHCP, and only when DHCP
responds do I set up the networking for the wifi station.

It is at this time that I would be creating a VRF (or using routing rules
if not using VRF).

But, when I add the station to the newly created vrf, then it bounces it,
and that causes supplicant to have to re-associate  (I think, lots of moving
pieces, so I could be missing something).

Any chance you could just clear the neighbor entries and routes w/out bouncing
the interface?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Strange routing with VRF and 5.2.7+

2019-09-30 Thread Ben Greear

On 9/22/19 12:23 PM, David Ahern wrote:

On 9/20/19 9:57 AM, Ben Greear wrote:

On 9/10/19 6:08 PM, Ben Greear wrote:

On 9/10/19 3:17 PM, Ben Greear wrote:

Today we were testing creating 200 virtual station vdevs on ath9k,
and using
VRF for the routing.


Looks like the same issue happens w/out VRF, but there I have oodles
of routing
rules, so it is an area ripe for failure.

Will upgrade to 5.2.14+ and retest, and try 4.20 as well


Turns out, this was ipsec (strongswan) inserting a rule that pointed to
a table
that we then used for a vrf w/out realizing the rule was added.

Stopping strongswan and/or reconfiguring how routing tables are assigned
resolved the issue.



Hi Ben:

Since you are the pioneer with vrf and ipsec, can you add an ipsec
section with some notes to Documentation/networking/vrf.txt?


I need to to some more testing, an initial attempt to reproduce my working
config on another system did not work properly, and I have not yet dug into
it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Strange routing with VRF and 5.2.7+

2019-09-20 Thread Ben Greear

On 9/10/19 6:08 PM, Ben Greear wrote:

On 9/10/19 3:17 PM, Ben Greear wrote:

Today we were testing creating 200 virtual station vdevs on ath9k, and using
VRF for the routing.


Looks like the same issue happens w/out VRF, but there I have oodles of routing
rules, so it is an area ripe for failure.

Will upgrade to 5.2.14+ and retest, and try 4.20 as well


Turns out, this was ipsec (strongswan) inserting a rule that pointed to a table
that we then used for a vrf w/out realizing the rule was added.

Stopping strongswan and/or reconfiguring how routing tables are assigned
resolved the issue.

Thanks,
Ben



Thanks,
Ben



This really slows down the machine in question.

During the minutes that it takes to bring these up and configure them,
we loose network connectivity on the management port.

If I do 'ip route show', it just shows the default route out of eth0, and
the subnet route.  But, if I try to ping the gateway, I get an ICMP error
coming back from the gateway of one of the virtual stations (which should be
safely using VRFs and so not in use when I do a plain 'ping' from the shell).

I tried running tshark on eth0 in the background and running ping, and it 
captures
no packets leaving eth0.

After some time (and during this time, my various scripts will be 
(re)configuring
vrfs and stations and related vrf routing tables and such,
but should *not* be messing with the main routing table, then suddenly
things start working again.

I am curious if anyone has seen anything similar or has suggestions for more
ways to debug this.  It seems reproducible, but it is a pain to
debug.

Thanks,
Ben







--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Strange routing with VRF and 5.2.7+

2019-09-10 Thread Ben Greear

On 9/10/19 3:17 PM, Ben Greear wrote:

Today we were testing creating 200 virtual station vdevs on ath9k, and using
VRF for the routing.


Looks like the same issue happens w/out VRF, but there I have oodles of routing
rules, so it is an area ripe for failure.

Will upgrade to 5.2.14+ and retest, and try 4.20 as well

Thanks,
Ben



This really slows down the machine in question.

During the minutes that it takes to bring these up and configure them,
we loose network connectivity on the management port.

If I do 'ip route show', it just shows the default route out of eth0, and
the subnet route.  But, if I try to ping the gateway, I get an ICMP error
coming back from the gateway of one of the virtual stations (which should be
safely using VRFs and so not in use when I do a plain 'ping' from the shell).

I tried running tshark on eth0 in the background and running ping, and it 
captures
no packets leaving eth0.

After some time (and during this time, my various scripts will be 
(re)configuring
vrfs and stations and related vrf routing tables and such,
but should *not* be messing with the main routing table, then suddenly
things start working again.

I am curious if anyone has seen anything similar or has suggestions for more
ways to debug this.  It seems reproducible, but it is a pain to
debug.

Thanks,
Ben




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Strange routing with VRF and 5.2.7+

2019-09-10 Thread Ben Greear

Today we were testing creating 200 virtual station vdevs on ath9k, and using
VRF for the routing.

This really slows down the machine in question.

During the minutes that it takes to bring these up and configure them,
we loose network connectivity on the management port.

If I do 'ip route show', it just shows the default route out of eth0, and
the subnet route.  But, if I try to ping the gateway, I get an ICMP error
coming back from the gateway of one of the virtual stations (which should be
safely using VRFs and so not in use when I do a plain 'ping' from the shell).

I tried running tshark on eth0 in the background and running ping, and it 
captures
no packets leaving eth0.

After some time (and during this time, my various scripts will be 
(re)configuring
vrfs and stations and related vrf routing tables and such,
but should *not* be messing with the main routing table, then suddenly
things start working again.

I am curious if anyone has seen anything similar or has suggestions for more
ways to debug this.  It seems reproducible, but it is a pain to
debug.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: VRF notes when using ipv6 and flushing tables.

2019-08-21 Thread Ben Greear

On 08/20/2019 08:02 PM, David Ahern wrote:

On 8/20/19 2:27 PM, Ben Greear wrote:

I recently spend a few days debugging what in the end was user error on
my part.

Here are my notes in hope they help someone else.

First, 'ip -6 route show vrf vrfX' will not show some of the
routes (like local routes) that will show up with
'ip -6 route show table X', where X == vrfX's table-id

If you run 'ip -6 route flush table X', then you will loose all of the auto
generated routes, including anycast, ff00::/8, and local routes.

ff00::/8 is needed for neigh discovery to work (probably among other
things)

local route is needed or packets won't actually be accepted up the stack
(I think that is the symptom at least)

Not sure exactly what anycast does, but I'm guessing it is required for
something useful.

You must manually re-add those to the table unless you for certain know
that
you do not need them for whatever reason.



sorry you went through such a long and painful debugging session.


No problem.  I learned some details of IPv6 I never realized before,
sure to come in useful some day!

Thanks,
Ben


yes, the kernel doc for VRF needs to be updated that 'ip route show vrf
X' and 'ip route show table X' are different ('show vrf' mimics the main
table in not showing local, broadcast, anycast; 'table vrf' shows all).

A suggestion for others: the documentation and selftests directory have
a lot of VRF examples now. If something basic is not working (e.g., arp
or neigh discovery), see if it works there and if so compare the outputs
of the route table along the way.




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



VRF notes when using ipv6 and flushing tables.

2019-08-20 Thread Ben Greear

I recently spend a few days debugging what in the end was user error on my part.

Here are my notes in hope they help someone else.

First, 'ip -6 route show vrf vrfX' will not show some of the
routes (like local routes) that will show up with
'ip -6 route show table X', where X == vrfX's table-id

If you run 'ip -6 route flush table X', then you will loose all of the auto
generated routes, including anycast, ff00::/8, and local routes.

ff00::/8 is needed for neigh discovery to work (probably among other things)

local route is needed or packets won't actually be accepted up the stack
(I think that is the symptom at least)

Not sure exactly what anycast does, but I'm guessing it is required for
something useful.

You must manually re-add those to the table unless you for certain know that
you do not need them for whatever reason.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: IPv6 addr and route is gone after adding port to vrf (5.2.0+)

2019-08-16 Thread Ben Greear

On 8/16/19 12:15 PM, David Ahern wrote:

On 8/16/19 1:13 PM, Ben Greear wrote:

I have a problem with a VETH port when setting up a somewhat complicated
VRF setup. I am loosing the global IPv6 addr, and also the route,
apparently
when I add the veth device to a vrf.  From my script's output:


Either enslave the device before adding the address or enable the
retention of addresses:

sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1



Thanks, I added it to the vrf first just in case some other logic was
expecting the routes to go away on network down.

That part now seems to be working.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



IPv6 addr and route is gone after adding port to vrf (5.2.0+)

2019-08-16 Thread Ben Greear

Hello,

I have a problem with a VETH port when setting up a somewhat complicated
VRF setup. I am loosing the global IPv6 addr, and also the route, apparently
when I add the veth device to a vrf.  From my script's output:

### commands to set up the veth 'rddVR0'

./local/sbin/ip link set rddVR0 down
./local/sbin/ip -4 addr flush dev rddVR0
./local/sbin/ip -6 addr flush dev rddVR0
echo 1 > /proc/sys/net/ipv4/conf/rddVR0/forwarding
echo 1 > /proc/sys/net/ipv6/conf/rddVR0/forwarding
./local/sbin/ip link set rddVR0 up
./local/sbin/ip -4 addr add 10.2.127.1/24 broadcast 10.2.127.255 dev rddVR0
./local/sbin/ip -6 addr add 2001:3::1/64 scope global dev rddVR0
./local/sbin/ip -6 addr add fe80::d0f8:6fff:fe06:8ae/64 scope link dev rddVR0
RTNETLINK answers: File exists
./local/sbin/ip -6 route add 2001:3::1/64 dev rddVR0 table 10001
./local/sbin/ip -6 route add fe80::d0f8:6fff:fe06:8ae/64 dev rddVR0 table 10001
./local/sbin/ip route add 10.2.127.0/24 dev rddVR0 table 10001
echo 1 > /proc/sys/net/ipv4/conf/rddVR0/arp_filter

#printRoutes for table 10001
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
local 10.2.1.1 dev eth1 proto kernel scope host src 10.2.1.1
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
broadcast 10.2.8.0 dev vap proto kernel scope link src 10.2.8.1 linkdown
10.2.8.0/24 dev vap proto kernel scope link src 10.2.8.1 linkdown
local 10.2.8.1 dev vap proto kernel scope host src 10.2.8.1
broadcast 10.2.8.255 dev vap proto kernel scope link src 10.2.8.1 linkdown
broadcast 10.2.9.0 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.9.0/24 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
local 10.2.9.1 dev vap0100 proto kernel scope host src 10.2.9.1
broadcast 10.2.9.255 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.127.0/24 dev rddVR0 scope link
2001:3::/64 dev rddVR0 metric 1024 pref medium
fe80::/64 dev rddVR0 metric 1024 pref medium

 some other commands, route/ip is still there 

#printRoutes for table 10001
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
local 10.2.1.1 dev eth1 proto kernel scope host src 10.2.1.1
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
broadcast 10.2.8.0 dev vap proto kernel scope link src 10.2.8.1 linkdown
10.2.8.0/24 dev vap proto kernel scope link src 10.2.8.1 linkdown
local 10.2.8.1 dev vap proto kernel scope host src 10.2.8.1
broadcast 10.2.8.255 dev vap proto kernel scope link src 10.2.8.1 linkdown
broadcast 10.2.9.0 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.9.0/24 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
local 10.2.9.1 dev vap0100 proto kernel scope host src 10.2.9.1
broadcast 10.2.9.255 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.127.0/24 dev rddVR0 scope link
2001:3::/64 dev rddVR0 metric 1024 pref medium
fe80::/64 dev rddVR0 metric 1024 pref medium


./local/sbin/ip link set rddVR0 vrf vrf10001

#printRoutes for table 10001
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
local 10.2.1.1 dev eth1 proto kernel scope host src 10.2.1.1
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
broadcast 10.2.8.0 dev vap proto kernel scope link src 10.2.8.1 linkdown
10.2.8.0/24 dev vap proto kernel scope link src 10.2.8.1 linkdown
local 10.2.8.1 dev vap proto kernel scope host src 10.2.8.1
broadcast 10.2.8.255 dev vap proto kernel scope link src 10.2.8.1 linkdown
broadcast 10.2.9.0 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.9.0/24 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
local 10.2.9.1 dev vap0100 proto kernel scope host src 10.2.9.1
broadcast 10.2.9.255 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
broadcast 10.2.127.0 dev rddVR0 proto kernel scope link src 10.2.127.1
10.2.127.0/24 dev rddVR0 proto kernel scope link src 10.2.127.1
local 10.2.127.1 dev rddVR0 proto kernel scope host src 10.2.127.1
broadcast 10.2.127.255 dev rddVR0 proto kernel scope link src 10.2.127.1
fe80::/64 dev rddVR0 proto kernel metric 256 pref medium
ff00::/8 dev rddVR0 metric 256 pref medium


 Route is gone...
 2001:3::/64 dev rddVR0 metric 1024 pref medium


As far as I can tell, the same actions for a wifi AP interface do not hit this 
problem,
but not sure if that is luck or not at this point.

Any ideas what might be going on here?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



lockup in hacked 4.20.17+ kernel, maybe addrconf_verify_work related?

2019-07-10 Thread Ben Greear
[67044.714944]  sock_sendmsg+0x2b/0x40
[67044.714946]  ___sys_sendmsg+0x28a/0x2f0
[67044.714947]  ? ___sys_recvmsg+0x156/0x1d0
[67044.714950]  ? __alloc_pages_nodemask+0x111/0x280
[67044.714954]  ? alloc_pages_vma+0x6f/0x1c0
[67044.714957]  ? page_add_new_anon_rmap+0x72/0xb0
[67044.714958]  ? __handle_mm_fault+0x7db/0x12c0
[67044.714961]  __sys_sendmsg+0x52/0xa0
[67044.714964]  do_syscall_64+0x4a/0xf0
[67044.714967]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.714969] RIP: 0033:0x7fa9c4af15a7
[67044.714972] Code: Bad RIP value.
[67044.714973] RSP: 002b:7fffdd7ac818 EFLAGS: 0246 ORIG_RAX: 
002e
[67044.714974] RAX: ffda RBX: 021ae990 RCX: 7fa9c4af15a7
[67044.714975] RDX:  RSI: 7fffdd7ac8b0 RDI: 0008
[67044.714976] RBP: 021b3d80 R08: 0004 R09: 7fa9c4dabf20
[67044.714976] R10: 0170 R11: 0246 R12: 021b3ec0
[67044.714977] R13: 7fffdd7ac8b0 R14: 021b3ec0 R15: 7fffdd7acb18
[67044.714980] INFO: task sshd:1763 blocked for more than 180 seconds.
[67044.720810]   Tainted: GW  O  4.20.17+ #30
[67044.725186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[67044.732027] sshdD0  1763   1355 0x0080
[67044.732029] Call Trace:
[67044.732038]  ? __schedule+0x29e/0x880
[67044.732040]  schedule+0x2a/0x80
[67044.732042]  schedule_preempt_disabled+0xc/0x20
[67044.732043]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.732046]  ? netlink_lookup+0x111/0x160
[67044.732048]  __netlink_dump_start+0x4f/0x1d0
[67044.732051]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.732052]  rtnetlink_rcv_msg+0x25c/0x390
[67044.732054]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.732055]  ? rtnl_calcit.isra.31+0x110/0x110
[67044.732057]  netlink_rcv_skb+0x44/0x120
[67044.732059]  netlink_unicast+0x18b/0x220
[67044.732060]  netlink_sendmsg+0x1ff/0x3d0
[67044.732064]  sock_sendmsg+0x2b/0x40
[67044.732066]  __sys_sendto+0xe9/0x150
[67044.732070]  ? __audit_syscall_exit+0x216/0x280
[67044.732071]  __x64_sys_sendto+0x1f/0x30
[67044.732075]  do_syscall_64+0x4a/0xf0
[67044.732077]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.732079] RIP: 0033:0x7f16e29c765a
[67044.732082] Code: Bad RIP value.
[67044.732083] RSP: 002b:7ffe57e52e88 EFLAGS: 0246 ORIG_RAX: 
002c
[67044.732084] RAX: ffda RBX: 7ffe57e53f80 RCX: 7f16e29c765a
[67044.732085] RDX: 0014 RSI: 7ffe57e53f80 RDI: 0003
[67044.732085] RBP: 7ffe57e53fd0 R08: 7ffe57e53f24 R09: 000c
[67044.732086] R10:  R11: 0246 R12: 7ffe57e53f24
[67044.732087] R13: 7ffe57e54160 R14:  R15: 0003

Thanks,
Ben
--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



mgmt-tx issues with off-channel neighbor response on channel 100

2019-03-19 Thread Ben Greear

Hello,

I'm not sure if the fault is hostapd or the wireless stack (or something else),
but this is what I see:

I put an AP on channel 100, configured for RRM.

STA associates to it and sends a channel report request.

hostapd reports tx of the response frame failed with EBUSY (-16).

Debugging in the kernel (4.20.8+ hacks) shows it fails because
of the offchannel check.  This appears to be because hostapd marks
the frame as off-channel-OK, and nl80211 fails because of the
CAC logic (I think):

static bool cfg80211_off_channel_oper_allowed(struct wireless_dev *wdev)
{
ASSERT_WDEV_LOCK(wdev);

if (!cfg80211_beaconing_iface_active(wdev))
return true;

if (!(wdev->chandef.chan->flags & IEEE80211_CHAN_RADAR))
return true;

return regulatory_pre_cac_allowed(wdev->wiphy);
}

In this case, the packet is not actually off-channel, and CAC has already
completed successfully.

Any opinions on where to fix this?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Waiting for vrf to become free on rmmod of bridge...

2019-02-08 Thread Ben Greear

On 2/6/19 5:50 PM, David Ahern wrote:

On 2/6/19 3:20 PM, Ben Greear wrote:

Hello,

I just saw this warning on a system running a hacked 4.20.2+ kernel.
Any known bugs
of this nature in this (upstream) kernel?  The command that is blocked is:
'rmmod bridge llc'

[17069.299135] unregister_netdevice: waiting for _vrf13 to become free.
Usage count = 1
[17079.306438] unregister_netdevice: waiting for _vrf13 to become free.
Usage count = 1
[17089.314656] unregister_netdevice: waiting for _vrf13 to become free.
Usage count = 1
[17099.322870] unregister_netdevice: waiting for _vrf13 to become free.
Usage count = 1

Thanks,
Ben



No known refcount issues with vrf.

I use namespaces for testing which creates devices, adds routes, runs
traffic and deletes the device and namespace. That series in the tests
has been known to trigger refcount problems in the past.


I'm not using namespaces in my test, but it is fairly convoluted.  If I
figure out how to reproduce the issue I'll let you know.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Waiting for vrf to become free on rmmod of bridge...

2019-02-06 Thread Ben Greear

Hello,

I just saw this warning on a system running a hacked 4.20.2+ kernel.  Any known 
bugs
of this nature in this (upstream) kernel?  The command that is blocked is:
'rmmod bridge llc'

[17069.299135] unregister_netdevice: waiting for _vrf13 to become free. Usage 
count = 1
[17079.306438] unregister_netdevice: waiting for _vrf13 to become free. Usage 
count = 1
[17089.314656] unregister_netdevice: waiting for _vrf13 to become free. Usage 
count = 1
[17099.322870] unregister_netdevice: waiting for _vrf13 to become free. Usage 
count = 1

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Can NFS work with VRF?

2018-11-05 Thread Ben Greear

Hello,

I was trying to improve my old series of patches that binds NFS to
a particular source IP address so that it could work with VRF in a 4.16
kernel.  But, it seems a huge tangle to try to make NFS (and rpc, etc) able to 
bind to
a local netdevice, which I think is what would be needed to make it work with 
VRF.

Has anyone already worked on VRF support for NFS?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Anyone know if strongswan works with vrf?

2018-06-29 Thread Ben Greear

Hello,

We're trying to create lots of strongswan VPN tunnels on network devices
bound to different VRFs.  We are using Fedora-24 on the client side, with a 
4.16.15+ kernel
and updated 'ip' package, etc.

So far, no luck getting it to work.

Any idea if this is supported or not?

Thanks,
Ben
--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-11 Thread Ben Greear

On 06/10/2018 10:10 AM, Michał Kazior wrote:

Ben,

The patch is symptomatic. fq_tin_dequeue() already checks if the list
is empty before it tries to access first entry. I see no point in
using the _or_null() + WARN_ON.

The 0x3c deref is likely an offset off of NULL base pointer. Did you
check gdb/addr2line of the ieee80211_tx_dequeue+0xfb? Where did it
point to?


gdb pointed to one line above the flow dereference, which is why I was
going to put some debugging in there.



I suspect there's not enough synchronization between quescing the
device/ath10k after fw crashes and performing mac80211's reconfig
procedure.


I am already running this patch which helps with some of that.  That
patch never made it upstream, but it fixed problems for me earlier.

https://patchwork.kernel.org/patch/9457639/

Could easily be there are some more issues in that logic.

Someone else posted a patch to disable mac-80211 tx when FW crashes,
I think...I have not tried to backport that.

https://patchwork.kernel.org/patch/10411967/

Thanks,
Ben





Michał

On 8 June 2018 at 23:40, Arend van Spriel  wrote:

On 6/8/2018 5:17 PM, Ben Greear wrote:

I recalled an email from Michał leaving tieto so adding his alternate email
he provided back then.

Gr. AvS



On 06/07/2018 04:59 PM, Cong Wang wrote:


On Thu, Jun 7, 2018 at 4:48 PM,   wrote:


diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..cb911f0 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
return NULL;
}

-   flow = list_first_entry(head, struct fq_flow, flowchain);
+   flow = list_first_entry_or_null(head, struct fq_flow,
flowchain);
+
+   if (WARN_ON_ONCE(!flow))
+   return NULL;



This does not make sense either. list_first_entry_or_null()
returns NULL only when the list is empty, but we already check
list_empty() right before this code, and it is protected by fq->lock.



Hello Michal,

git blame shows you as the author of the fq_impl.h code.

I saw a crash when debugging funky ath10k firmware in a 4.16 + hacks
kernel.  There was an apparent
mostly-null deref in the fq_tin_dequeue method.  According to gdb, it
was within
1 line of the dereference of 'flow'.

My hack above is probably not that useful.  Cong thinks maybe the
locking is bad.

If you get a chance, please review this thread and see if you have any
ideas for
a better fix (or better debugging code).

As always, if you would like me to generate you a buggy firmware that
will crash
in the tx path and cause all sorts of mayhem in the ath10k driver and
wifi stack,
I will be happy to do so.

https://www.mail-archive.com/netdev@vger.kernel.org/msg239738.html

Thanks,
Ben







--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-08 Thread Ben Greear




On 06/07/2018 05:13 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 4:48 PM,   wrote:

From: Ben Greear 

While testing an ath10k firmware that often crashed under load,
I was seeing kernel crashes as well.  One of them appeared to
be a dereference of a NULL flow object in fq_tin_dequeue.

I have since fixed the firmware flaw, but I think it would be
worth adding the WARN_ON in case the problem appears again.

BUG: unable to handle kernel NULL pointer dereference at 003c
IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]


Instead of adding WARN_ON(), you need to think about
the locking there, it is suspicious:

fq is from struct ieee80211_local:

struct fq *fq = &local->fq;

tin is from struct txq_info:

struct fq_tin *tin = &txqi->tin;

I don't know if fq and tin are supposed to be 1:1, if not there is
a bug in the locking, because ->new_flows and ->old_flows are
both inside tin instead of fq, but they are protected by fq->lock


Maybe whoever put this code together can take a stab at it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-08 Thread Ben Greear




On 06/07/2018 04:59 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 4:48 PM,   wrote:

diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..cb911f0 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
return NULL;
}

-   flow = list_first_entry(head, struct fq_flow, flowchain);
+   flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
+
+   if (WARN_ON_ONCE(!flow))
+   return NULL;


This does not make sense either. list_first_entry_or_null()
returns NULL only when the list is empty, but we already check
list_empty() right before this code, and it is protected by fq->lock.



Nevermind then.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 02:52 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear  wrote:

On 06/07/2018 02:29 PM, Cong Wang wrote:


On Thu, Jun 7, 2018 at 9:06 AM,   wrote:


--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+



How could even possibly list_first_entry() returns NULL?
You need list_first_entry_or_null().



I don't know for certain flow as null, but something was NULL in this method
near that line and it looked like a likely culprit.

I guess possibly tin or fq was passed in as NULL?


A NULL pointer is not always 0. You can trigger a NULL-ptr-def with 0x3c
too, but you are checking against 0 in your patch, that is the problem and
that is why list_first_entry_or_null() exists.



Ahh, I see what you mean, and that is my mistake.  In my case, it did seem to
be a mostly-null deref, not a 0x0 deref.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 02:29 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 9:06 AM,   wrote:

--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+


How could even possibly list_first_entry() returns NULL?
You need list_first_entry_or_null().



I don't know for certain flow as null, but something was NULL in this method
near that line and it looked like a likely culprit.

I guess possibly tin or fq was passed in as NULL?

Anyway, if the patch seems worthless just ignore it.  I'll leave it in my tree
since it should be harmless and will let you know if I ever hit it.

If someone else hits a similar crash, hopefully they can report it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 09:17 AM, Eric Dumazet wrote:



On 06/07/2018 09:06 AM, gree...@candelatech.com wrote:

From: Ben Greear 

While testing an ath10k firmware that often crashed under load,
I was seeing kernel crashes as well.  One of them appeared to
be a dereference of a NULL flow object in fq_tin_dequeue.

I have since fixed the firmware flaw, but I think it would be
worth adding the WARN_ON in case the problem appears again.

 common_interrupt+0xf/0xf
 



Please find the exact commit that brought this bug,
and add a corresponding Fixes: tag


It will be a total pain to bisect this problem since my test
case that causes this is running my modified firmware (and a buggy one at that),
modified ath10k driver (to work with this firmware and support my test case 
easily),
and the failure case appears to cause multiple different-but-probably-related
crashes and often hangs or reboots the test system.

Probably this is all caused by some nasty race or buggy logic related to
dealing with a crashed ath10k firmware tearing down txq logic from the
bottom up.  There have been many such bugs in the past, I and others fixed a 
few,
and very likely more remain.

For what it is worth, I didn't see this crash in 4.13, and I spent some time
testing buggy firmware there occasionally.

If someone else has interest in debugging the ath10k driver, I will be happy to 
generate
a mostly-stock firmware image with ability to crash in the TX path and give it 
to them.
It will crash the stock upstream code reliably in my experience.

Thanks,
Ben





Signed-off-by: Ben Greear 
---
 include/net/fq_impl.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..e40354d 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+
if (flow->deficit <= 0) {
flow->deficit += fq->quantum;
list_move_tail(&flow->flowchain,






--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Regression bisected to: softirq: Let ksoftirqd do its job

2018-05-17 Thread Ben Greear

One of my out-of-tree patches is a network impairment tool that acts a lot like
an Ethernet bridge with latency, jitter, etc.

We noticed recently that we were seeing igb adapter errors when testing with 
our emulator
at high speeds.  For whatever reason, it is only easily reproduced when we add 
jitter
to our emulator.  This would cause a bit more CPU usage and lock contention in 
our software,
and would increase the skb pkts allocated at any given time.

I bisected the problem to the commit below:

Author: Eric Dumazet 
Date:   Wed Aug 31 10:42:29 2016 -0700

softirq: Let ksoftirqd do its job

A while back, Paolo and Hannes sent an RFC patch adding threaded-able
napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)


If I replace my emulator with a bridge, then I do not see the problem.  But, I 
also do not
(or very rarely?) see the problem when configuring the emulator with zero 
latency and jitter,
which is how the bridge would act.

Any idea what sort of (bad?) behaviour would be able to cause this tx q timeout?

If you have any interest, I will be happy to email you my out-of-tree patches 
and
instructions to reproduce the problem.


The kernel splat looks like this, and repeats often:


May 17 16:03:09 localhost.localdomain kernel: audit: type=1131 audit(1526598189.492:159): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed 
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

May 17 16:03:39 localhost.localdomain kernel: [ cut here 
]
May 17 16:03:39 localhost.localdomain kernel: WARNING: CPU: 5 PID: 0 at 
/home/greearb/git/linux-bisect/net/sched/sch_generic.c:316 
dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel: NETDEV WATCHDOG: eth5 (igb): 
transmit queue 0 timed out
May 17 16:03:39 localhost.localdomain kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 fuse macvlan wanlink(O) pktgen 
cfg80211 sunrpc coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass ipmi_ssif iTCO_wdt iTCO_vendor_support joydev i2c_i801 lpc_ich 
i2c_smbus ioatdma shpchp wmi ipmi_si ipmi_msghandler tpm_tis tpm_tis_core tpm acpi_power_meter acpi_pad sch_fq_codel ast drm_kms_helper ttm drm igb hwmon ptp 
pps_core dca i2c_algo_bit i2c_core fjes ipv6 crc_ccitt [last unloaded: nf_conntrack]

May 17 16:03:39 localhost.localdomain kernel: CPU: 5 PID: 0 Comm: swapper/5 
Tainted: G   O4.8.0-rc7+ #132
May 17 16:03:39 localhost.localdomain kernel: Hardware name: Iron_Systems,Inc 
CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
May 17 16:03:39 localhost.localdomain kernel:   
88087fd43d78 81417eb1 88087fd43dc8
May 17 16:03:39 localhost.localdomain kernel:   
88087fd43db8 81103556 013c7fd43da8
May 17 16:03:39 localhost.localdomain kernel:   
880854221940 0005 880854bb8000
May 17 16:03:39 localhost.localdomain kernel: Call Trace:
May 17 16:03:39 localhost.localdomain kernel:[] 
dump_stack+0x63/0x82
May 17 16:03:39 localhost.localdomain kernel:  [] 
__warn+0xc6/0xe0
May 17 16:03:39 localhost.localdomain kernel:  [] 
warn_slowpath_fmt+0x4a/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] 
dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
call_timer_fn+0x30/0x150
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
run_timer_softirq+0x1ea/0x450
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
ktime_get+0x37/0xa0
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
lapic_next_deadline+0x21/0x30
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
clockevents_program_event+0x7d/0x120
May 17 16:03:39 localhost.localdomain kernel:  [] 
__do_softirq+0xca/0x2d0
May 17 16:03:39 localhost.localdomain kernel:  [] 
irq_exit+0xb3/0xc0
May 17 16:03:39 localhost.localdomain kernel:  [] 
smp_apic_timer_interrupt+0x3d/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] 
apic_timer_interrupt+0x82/0x90
May 17 16:03:39 localhost.localdomain kernel:[] ? 
cpuidle_enter_state+0x126/0x300
May 17 16:03:39 localhost.localdomain kernel:  [] 
cpuidle_enter+0x12/0x20
May 17 16:03:39 localhost.localdomain kernel:  [] 
call_cpuidle+0x25/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
cpu_startup_entry+0x2ba/0x380
May 17 16:03:39 localhost.localdomain kernel:  [] 
start_secondary+0x149/0x170
May 17 16:03:39 localhost.localdomain kernel: ---[ end trace f62c6dd947785e8f 
]---


Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Performance regression between 4.13 and 4.14

2018-05-09 Thread Ben Greear

On 05/09/2018 12:02 PM, Ben Greear wrote:

On 05/09/2018 11:48 AM, Eric Dumazet wrote:



On 05/09/2018 11:43 AM, Ben Greear wrote:

On 05/08/2018 10:10 AM, Eric Dumazet wrote:



On 05/08/2018 09:44 AM, Ben Greear wrote:

Hello,

I am trying to track down a performance regression that appears to be between 
4.13
and 4.14.

I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about 
6Gbps.

I also tried with user-space UDP traffic on a stock kernel, and I can get about 
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots 
of commits
crash or do not bring up networking, which makes the bisect difficult.

Looking at perf top, it would appear that some lock is probably to blame.



perf record -a -g -e cycles:pp sleep 5
perf report

Then you'll be able to tell us which lock (or call graph) is killing your perf.



I seem to be chasing multiple issues.  For 4.13, at least part of my problem 
was that LOCKDEP was enabled,
during my bisect, though it does NOT appear enabled in 4.16.  I think maybe 
CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
in 4.16, or something like that?  My 4.16 .config does have 
CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:

[greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y


For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need
to disable to keep from getting a performance hit from the spectre-related bug
fixes?  At this point, I do not care about the security implications.

greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
# CONFIG_RETPOLINE is not set


Thanks,
Ben



No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/


I initially saw the problem in 4.16, then bisected, and 4.14 still showed the
issue.


So, I guess I must have been enabling lockdep the whole time.  This 
__lock_acquire
is from lockdep as far as I can tell, not normal locking.  I re-built 4.16 after
verifying as best as I could that lockdep was not enabled, and now it performs
as expected.

I'm going to test a patch to change __lock_acquire to __lock_acquire_lockdep so
maybe someone else will not make the same mistake I made.


+   17.78%17.78%  kpktgend_1   [kernel.kallsyms] [k] 
__lock_acquire.isra.3



Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Performance regression between 4.13 and 4.14

2018-05-09 Thread Ben Greear

On 05/09/2018 11:48 AM, Eric Dumazet wrote:



On 05/09/2018 11:43 AM, Ben Greear wrote:

On 05/08/2018 10:10 AM, Eric Dumazet wrote:



On 05/08/2018 09:44 AM, Ben Greear wrote:

Hello,

I am trying to track down a performance regression that appears to be between 
4.13
and 4.14.

I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about 
6Gbps.

I also tried with user-space UDP traffic on a stock kernel, and I can get about 
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots 
of commits
crash or do not bring up networking, which makes the bisect difficult.

Looking at perf top, it would appear that some lock is probably to blame.



perf record -a -g -e cycles:pp sleep 5
perf report

Then you'll be able to tell us which lock (or call graph) is killing your perf.



I seem to be chasing multiple issues.  For 4.13, at least part of my problem 
was that LOCKDEP was enabled,
during my bisect, though it does NOT appear enabled in 4.16.  I think maybe 
CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
in 4.16, or something like that?  My 4.16 .config does have 
CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:

[greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y


For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need
to disable to keep from getting a performance hit from the spectre-related bug
fixes?  At this point, I do not care about the security implications.

greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
# CONFIG_RETPOLINE is not set


Thanks,
Ben



No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/


I initially saw the problem in 4.16, then bisected, and 4.14 still showed the
issue.

4.13 works, but only when I use a .config I originally built for 4.13, not the 
4.16 .config
that I ended up using with the bisect (make oldconfig, accept all defaults).  I 
originally
configured 4.16 with a .config that had lockdep enabled, then manually tried to 
disable it
through 'make xconfig'.  I think that must leave "CONFIG_LOCKDEP=y" in the 
.config, which
screws up older builds during bisect, perhaps?


Before doing a (painful) dissection, the perf output would immediately tell you 
if
something is really wrong on your .config.


I didn't realize lockdep might be an issue at the time, but here is a 'bad' run 
from
a 4.13+ (plus pktgen hacks).  I guess lockdep is why this runs slowly, but I 
see no obvious
proof of that in the output:

4.13+, patched pktgen, 6Gbps throughput, on commit 
906dde0f355bd97c080c215811ae7db1137c4af8

Samples: 26K of event 'cycles:pp', Event count (approx.): 20119166736
  Children  Self  Command  Shared ObjectSymbol
+   87.97% 0.00%  kpktgend_1   [kernel.kallsyms][k] 
ret_from_fork
+   87.97% 0.00%  kpktgend_1   [kernel.kallsyms][k] kthread
+   86.89% 5.42%  kpktgend_1   [kernel.kallsyms][k] 
pktgen_thread_worker
+   33.75% 0.18%  kpktgend_1   [kernel.kallsyms][k] 
getnstimeofday64
+   32.77% 4.47%  kpktgend_1   [kernel.kallsyms][k] 
__getnstimeofday64
+   24.60%10.91%  kpktgend_1   [kernel.kallsyms][k] 
lock_acquire
+   23.59% 0.03%  kpktgend_1   [kernel.kallsyms][k] 
__do_softirq
+   23.55% 0.07%  kpktgend_1   [kernel.kallsyms][k] 
net_rx_action
+   22.29% 0.47%  kpktgend_1   [kernel.kallsyms][k] 
getRelativeCurNs
+   21.33% 1.71%  kpktgend_1   [kernel.kallsyms][k] 
ixgbe_poll
+   15.79% 0.02%  kpktgend_1   [kernel.kallsyms][k] 
ret_from_intr
+   15.78% 0.01%  kpktgend_1   [kernel.kallsyms][k] do_IRQ
+   15.34% 0.01%  kpktgend_1   [kernel.kallsyms][k] irq_exit
+   13.95%10.00%  kpktgend_1   [kernel.kallsyms][k] 
ip_send_check
+   13.80%13.80%  kpktgend_1   [kernel.kallsyms][k] 
__lock_acquire.isra.31
+   12.98% 0.53%  kpktgend_1   [kernel.kallsyms][k] 
pktgen_finalize_skb
+   12.31% 0.20%  kpktgend_1   [kernel.kallsyms][k] 
timestamp_skb.isra.24
+   11.68% 0.13%  kpktgend_1   [kernel.kallsyms][k] 
napi_gro_receive
+   11.36% 0.25%  kpktgend_1   [kernel.kallsyms][k] 
netif_receive_skb_internal
+   10.93% 0.00%  swapper  [kernel.kallsyms][k] 
verify_cpu
+   10.93% 0.00%  swapper  [kernel.kallsyms][k] 
cpu_startup_entry
+   10.92% 0.02%  swapper  [kernel.kallsyms][k] do_idle
+   10.71% 0.00%  swapper  [kernel.kallsyms][k] 
cpuidle_enter
+ 

Re: Performance regression between 4.13 and 4.14

2018-05-09 Thread Ben Greear

On 05/08/2018 10:10 AM, Eric Dumazet wrote:



On 05/08/2018 09:44 AM, Ben Greear wrote:

Hello,

I am trying to track down a performance regression that appears to be between 
4.13
and 4.14.

I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about 
6Gbps.

I also tried with user-space UDP traffic on a stock kernel, and I can get about 
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots 
of commits
crash or do not bring up networking, which makes the bisect difficult.

Looking at perf top, it would appear that some lock is probably to blame.



perf record -a -g -e cycles:pp sleep 5
perf report

Then you'll be able to tell us which lock (or call graph) is killing your perf.



I seem to be chasing multiple issues.  For 4.13, at least part of my problem 
was that LOCKDEP was enabled,
during my bisect, though it does NOT appear enabled in 4.16.  I think maybe 
CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
in 4.16, or something like that?  My 4.16 .config does have 
CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:

[greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y


For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need
to disable to keep from getting a performance hit from the spectre-related bug
fixes?  At this point, I do not care about the security implications.

greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
# CONFIG_RETPOLINE is not set


Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



ICMP redirect and VRF

2018-05-08 Thread Ben Greear

While debugging some other problem today on a system using ip rules instead of
VRF, I ran into a case where the remote router was sending back ICMP redirects.

That got me thinking...where would these routes get stored in a VRF scenario?

Would it magically go to the correct VRF routing table based on the incoming 
interface
for the ICMP redirect response?

Thanks,
Ben
--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Performance regression between 4.13 and 4.14

2018-05-08 Thread Ben Greear

Hello,

I am trying to track down a performance regression that appears to be between 
4.13
and 4.14.

I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about 
6Gbps.

I also tried with user-space UDP traffic on a stock kernel, and I can get about 
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.

Attempting to bisect seems to be triggering a weirdness in git, and also lots 
of commits
crash or do not bring up networking, which makes the bisect difficult.

Looking at perf top, it would appear that some lock is probably to blame.

Any ideas what might have been introduced during this interval that
would cause this?

Anyone else seen similar?

I'm going to attempt some more manual steps to try to find the commit that
introduces this...

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: The SO_BINDTODEVICE was set to the desired interface, but packets are received from all interfaces.

2018-05-07 Thread Ben Greear

On 05/07/2018 03:19 AM, Damir Mansurov wrote:


Greetings,

After successful call of the setsockopt(SO_BINDTODEVICE) function to set data 
reception from only one interface, the data is still received from all 
interfaces.
Function setsockopt() returns 0 but then recv() receives data from all 
available network interfaces.

The problem is reproducible on linux kernels 4.14 - 4.16, but it does not on 
linux kernels 4.4, 4.13.

I have written C-code to reproduce this issue (see attached files b2d_send.c 
and b2d_recv.c). See below explanation of tested configuration.


Hello,

I am not sure if this is your problem or not, but if you are using VRF, then 
you need
to call SO_BINDTODEVICE before you do the 'normal' bind() call.

Thanks,
Ben




PC-1  PC-2
 ---   ---
 | b2d_send|   | b2d_recv|
 | |   | |
 |   --|   |--   |
 |  | eth0 |---| eth0 |  |
 |   --|   |--   |
 | |   | |
 |   --|   |--   |
 |  | eth1 |---| eth1 |  |
 |   --|   |--   |
 | |   | |
 ---   ---

Steps:
1. Copy b2d_recv.c to PC-2, compile it ("gcc -o b2d_recv b2d_recv.c") and run 
"./b2d_recv eth0 23777" to get derived data only from eth0 interface. Port number
in this example is 23777 only for sample.

2. Copy b2d_send.c to PC-1, compile it ("gcc -o b2d_send b2d_send.c") and run 
"./b2d_send ip1 ip2 23777" where ip1 and ip2 are ip addresses of interfaces eth0
and eth1 of PC-2.

3. Result:
- b2d_recv prints out data from eth0 and eth1 on linux kernels from 4.14 up to 
4.16.
- b2d_recv prints out data from only eth0 on linux kernels below 4.14.


**
Thanks,
Damir Mansurov
dn...@oktetlabs.ru



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net: Work around crash in ipv6 fib-walk-continue

2018-05-04 Thread Ben Greear

On 05/04/2018 10:47 AM, David Ahern wrote:

On 4/19/18 12:01 PM, gree...@candelatech.com wrote:

From: Ben Greear 

This keeps us from crashing in certain test cases where we
bring up many (1000, for instance) mac-vlans with IPv6
enabled in the kernel.  This bug has been around for a
very long time.

Until a real fix is found (and for stable), maybe it
is better to return an incomplete fib walk instead
of crashing.

BUG: unable to handle kernel NULL pointer dereference at 8
IP: fib6_walk_continue+0x5b/0x140 [ipv6]
PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0
Oops:  [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 
libcrc32c vrf]
CPU: 3 PID: 15117 Comm: ip Tainted: G   O 4.16.0+ #5
Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
RSP: 0018:c90008c3bc10 EFLAGS: 00010287
RAX: 88085ac45050 RBX: 8807e03008a0 RCX: 
RDX:  RSI: c90008c3bc48 RDI: 8232b240
RBP: 880819167600 R08: 0008 R09: 8807dff10071
R10: c90008c3bbd0 R11:  R12: 8807e03008a0
R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000
FS:  7f2f04342700() GS:88087fcc() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0008 CR3: 0007e0556002 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x14b/0x2c0 [ipv6]
 netlink_dump+0x216/0x2a0
 netlink_recvmsg+0x254/0x400
 ? copy_msghdr_from_user+0xb5/0x110
 ___sys_recvmsg+0xe9/0x230
 ? find_held_lock+0x3b/0xb0
 ? __handle_mm_fault+0x617/0x1180
 ? __audit_syscall_entry+0xb3/0x110
 ? __sys_recvmsg+0x39/0x70
 __sys_recvmsg+0x39/0x70
 do_syscall_64+0x63/0x120
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f2f03a72030
RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030
RDX:  RSI: 7fffab3de570 RDI: 0004
RBP:  R08: 7e6c R09: 7fffab3e63a8
R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608
R13: 0066b460 R14: 7e6c R15: 
Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 
89 53 2c c7 4
RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10
CR2: 0008
---[ end trace bd03458864eb266c ]---

Signed-off-by: Ben Greear 
---



Does your use case that triggers this involve replacing routes? I just
noticed the route delete code in fib6_add_rt2node does not have the
'Adjust walkers' code that is in fib6_del_route.

Further, the adjust walkers code in fib6_del_route looks suspicious in
its timing with route deletes. If you have a reliable reproducer we can
try a few things with fib6_del_route and the walker code.


Yes, we replace routes, and yes we can reliably reproduce it and will
be happy to test patches.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-30 Thread Ben Greear

On 04/27/2018 08:11 PM, Steven Rostedt wrote:


We'd like this email archived in netdev list, but since netdev is
notorious for blocking outlook email as spam, it didn't go through. So
I'm replying here to help get it into the archives.

Thanks!

-- Steve


On Fri, 27 Apr 2018 23:05:46 +
Michael Wenig  wrote:


As part of VMware's performance testing with the Linux 4.15 kernel,
we identified CPU cost and throughput regressions when comparing to
the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
send tests when using small message sizes. The regressions are
significant (up 3x) and were tracked down to be a side effect of Eric
Dumazat's RB tree changes that went into the Linux 4.15 kernel.
Further investigation showed our use of the TCP_NODELAY flag in
conjunction with Eric's change caused the regressions to show and
simply disabling TCP_NODELAY brought performance back to normal.
Eric's change also resulted into significant improvements in our
TCP_RR test cases.



Based on these results, our theory is that Eric's change made the
system overall faster (reduced latency) but as a side effect less
aggregation is happening (with TCP_NODELAY) and that results in lower
throughput. Previously even though TCP_NODELAY was set, system was
slower and we still got some benefit of aggregation. Aggregation
helps in better efficiency and higher throughput although it can
increase the latency. If you are seeing a regression in your
application throughput after this change, using TCP_NODELAY might
help bring performance back however that might increase latency.


I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-23 Thread Ben Greear

On 04/22/2018 02:15 PM, Roopa Prabhu wrote:

On Sun, Apr 22, 2018 at 11:54 AM, David Miller  wrote:

From: Johannes Berg 
Date: Thu, 19 Apr 2018 17:26:57 +0200


On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)


I guess you'll have to ask davem. :)


Well, first of all, I really don't like this.

The first reason is that every time I see interface foo become foo2,
foo3 is never far behind it.

If foo was not extensible enough such that we needed foo2, we beter
design the new thing with explicitly better extensibility in mind.

Furthermore, what you want here is a specific filter.  Someone else
will want to filter on another criteria, and the next person will
want yet another.

This needs to be properly generalized.

And frankly if we had moved to ethtool netlink/devlink by now, we
could just add a netlink attribute for filtering and not even be
having this conversation.



+1.

Also, the RTM_GETSTATS api was added to improve stats query efficiency
(with filters).
 we should look at it  to see if this fits there. Keeping all stats
queries in one place will help.


I like the ethtool API, so I'll be sticking with that for now.

Thanks,
Ben



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-23 Thread Ben Greear

On 04/22/2018 11:54 AM, David Miller wrote:

From: Johannes Berg 
Date: Thu, 19 Apr 2018 17:26:57 +0200


On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)


I guess you'll have to ask davem. :)


Well, first of all, I really don't like this.

The first reason is that every time I see interface foo become foo2,
foo3 is never far behind it.

If foo was not extensible enough such that we needed foo2, we beter
design the new thing with explicitly better extensibility in mind.

Furthermore, what you want here is a specific filter.  Someone else
will want to filter on another criteria, and the next person will
want yet another.

This needs to be properly generalized.

And frankly if we had moved to ethtool netlink/devlink by now, we
could just add a netlink attribute for filtering and not even be
having this conversation.


Well, since there are un-defined flags, it would be simple enough to
extend the API further in the future (flag (1<<31) could mean expect
more input members, etc.  And, adding up to 30 more flags to filter on different
things won't change the API and should be backwards compatible.

But, if you don't want it, that is OK by me, I agree it is a fairly
obscure feature.  It would have saved me time if you had said you didn't
want it at the first RFC patch though...

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-19 Thread Ben Greear



On 04/18/2018 11:38 PM, Johannes Berg wrote:

On Wed, 2018-04-18 at 14:51 -0700, Ben Greear wrote:


It'd be pretty hard to know which flags are firmware stats?


Yes, it is, but ethtool stats are difficult to understand in a generic
manner anyway, so someone using them is already likely aware of low-level
details of the driver(s) they are using.


Right. Come to think of it though,


+ * @get_ethtool_stats2: Return extended statistics about the device.
+ * This is only useful if the device maintains statistics not
+ * included in &struct rtnl_link_stats64.
+ *  Takes a flags argument:  0 means all (same as get_ethtool_stats),
+ *  0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
+ *  Other flags are reserved for now.
+ *  Same number of stats will be returned, but some of them might
+ *  not be as accurate/refreshed.  This is to allow not querying
+ *  firmware or other expensive-to-read stats, for instance.


"skip" vs. "don't refresh" is a bit ambiguous - I'd argue better to
either really skip and not return the non-refreshed ones (also helps
with the identifying), or rename the flag.


In order to efficiently parse lots of stats over and over again, I probe
the stat names once on startup, map them to the variable I am trying to use
(since different drivers may have different names for the same basic stat),
and then I store the stat index.

On subsequent stat reads, I just grab stats and go right to the index to
store the stat.

If the stats indexes change, that will complicate my logic quite a bit.

Maybe the flag could be called:  ETHTOOL_GS2_NO_REFRESH_FW ?



Also, wrt. the rest of the patch, I'd argue that it'd be worthwhile to
write the spatch and just add the flags argument to "get_ethtool_stats"
instead of adding a separate method - internally to the kernel it's not
that hard to change.


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-18 Thread Ben Greear

On 04/18/2018 02:26 PM, Johannes Berg wrote:

On Tue, 2018-04-17 at 18:49 -0700, gree...@candelatech.com wrote:


+ * @get_ethtool_stats2: Return extended statistics about the device.
+ * This is only useful if the device maintains statistics not
+ * included in &struct rtnl_link_stats64.
+ *  Takes a flags argument:  0 means all (same as get_ethtool_stats),
+ *  0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
+ *  Other flags are reserved for now.


It'd be pretty hard to know which flags are firmware stats?


Yes, it is, but ethtool stats are difficult to understand in a generic
manner anyway, so someone using them is already likely aware of low-level
details of the driver(s) they are using.

In my case, I have lots of virtual stations (or APs), and I want stats
for them as well as for the 'radio', so I would probe the first vdev with
flags of 'skip-none' to get all stats, including radio (firmware) stats.

And then the rest I would just probe the non-firmware stats.

To be honest, I was slightly amused that anyone expressed interest in
this patch originally, but maybe other people have similar use case
and/or drivers with slow-to-acquire stats.


Anyway, there's no way I'm going to take this patch, so you need to
float it on netdev first (best CC us here) and get it applied there
before we can do anything on the wifi side.


I posted the patches to netdev, ath10k and linux-wireless.  If I had only
posted them individually to different lists I figure I'd be hearing about how
the netdev patch is useless because it has no driver support, etc.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2018-04-17 Thread Ben Greear

On 01/24/2018 03:59 PM, Ben Greear wrote:

On 06/20/2017 08:03 PM, David Ahern wrote:

On 6/20/17 5:41 PM, Ben Greear wrote:

On 06/20/2017 11:05 AM, Michal Kubecek wrote:

On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:

On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no
longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.


You might try trace_printk() which should have less impact (don't forget
to enable /proc/sys/kernel/ftrace_dump_on_oops).


We cannot reproduce with trace_printk() either.


I think that suggests the walker state is set to FWS_U in
fib6_del_route, and it is the FWS_U case in fib6_walk_continue that
triggers the fault -- the null parent (pn = fn->parent). So we have the
2 areas of code that are interacting.

I'm on a road trip through the end of this week with little time to
focus on this problem. I'll get back to you another suggestion when I can.


FYI, problem still happens in 4.16.  I'm going to re-enable my hack below
for this kernel as well...I had hopes it might be fixed...

BUG: unable to handle kernel NULL pointer dereference at 8
IP: fib6_walk_continue+0x5b/0x140 [ipv6]
PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0
Oops:  [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 
libcrc32c vrf]
CPU: 3 PID: 15117 Comm: ip Tainted: G   O 4.16.0+ #5
Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
RSP: 0018:c90008c3bc10 EFLAGS: 00010287
RAX: 88085ac45050 RBX: 8807e03008a0 RCX: 
RDX:  RSI: c90008c3bc48 RDI: 8232b240
RBP: 880819167600 R08: 0008 R09: 8807dff10071
R10: c90008c3bbd0 R11:  R12: 8807e03008a0
R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000
FS:  7f2f04342700() GS:88087fcc() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0008 CR3: 0007e0556002 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x14b/0x2c0 [ipv6]
 netlink_dump+0x216/0x2a0
 netlink_recvmsg+0x254/0x400
 ? copy_msghdr_from_user+0xb5/0x110
 ___sys_recvmsg+0xe9/0x230
 ? find_held_lock+0x3b/0xb0
 ? __handle_mm_fault+0x617/0x1180
 ? __audit_syscall_entry+0xb3/0x110
 ? __sys_recvmsg+0x39/0x70
 __sys_recvmsg+0x39/0x70
 do_syscall_64+0x63/0x120
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f2f03a72030
RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030
RDX:  RSI: 7fffab3de570 RDI: 0004
RBP:  R08: 7e6c R09: 7fffab3e63a8
R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608
R13: 0066b460 R14: 7e6c R15: 
Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 
89 53 2c c7 4
RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10
CR2: 0008
---[ end trace bd03458864eb266c ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.



So, though I don't know the right way to fix it, the patch below appears
to make the system not crash.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
pn = fn->parent;
w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES
+   if (WARN_ON_ONCE(!pn)) {
+   pr_err("FWS-U, w: %p  fn: %p  pn: %p\n",
+  w, fn, pn);
+   /* Attempt to work around crash that has been 
here forever. --Ben */
+   return 0;
+   }
if (FIB6_SUBTREE(pn) == fn) {
WARN_ON(!(fn->fn_flags & RTN_ROOT));
w->stat

Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 11:24 AM, Michal Kubecek wrote:

On Tue, Mar 20, 2018 at 08:39:33AM -0700, Ben Greear wrote:

On 03/20/2018 03:37 AM, Michal Kubecek wrote:


IMHO it would be more practical to set "0 means same as GSTATS" as a
rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to
avoid code duplication (or perhaps a use fall-through in the switch). It
would also allow drivers to provide only one of the callbacks.


Yes, but that would require changing all drivers at once, and would make 
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature 
would
make it upstream, so I didn't want to propose any large changes up front.


I don't think so. What I mean is:

(a) driver implements ->get_ethtool_stats2() callback; then we use it
for GSTATS2
(b) driver does not implement get_ethtool_stats2() but implements
->get_ethtool_stats(); then we call for GSTATS2 if level is zero,
otherwise GSTATS2 returns -EINVAL

and GSTATS is always translated to GSTATS2 with level 0, either by
defining ethtool_get_stats() as a wrapper or by fall-through in the
switch statement.

This way, most drivers could be left untouched and only those which
would implement non-default levels would provide ->get_ethtool_stats2()
callback instead of ->get_ethtool_stats().


OK, that makes sense.  I'll wait on feedback from the flags or #defined levels
and re-spin the patch accordingly.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns

2018-03-20 Thread Ben Greear

On 03/20/2018 09:44 AM, Liran Alon wrote:



On 20/03/18 18:24, ebied...@xmission.com wrote:


I don't believe the current behavior is a bug.

I looked through the history.  Basically skb_scrub_packet
started out as the scrubbing needed for crossing network
namespaces.

Then tunnels which needed 90% of the functionality started
calling it, with the xnet flag added.  Because the tunnels
needed to preserve their historic behavior.

Then dev_forward_skb started calling skb_scrub_packet.

A veth pair is supposed to give the same behavior as a cross-over
cable plugged into two local nics.  A cross over cable won't
preserve things like the skb mark.  So I don't see why anyone would
expect a veth pair to preserve the mark.


I disagree with this argument.

I think that a skb crossing netns is what simulates a real packet crossing 
physical computers. Following your argument, why would skb->mark should be 
preserved
when crossing netdevs on same netns via routing? But this does today preserve 
skb->mark.

Therefore, I do think that skb->mark should conceptually only be scrubbed when 
crossing netns. Regardless of the netdev used to cross it.


It should be scrubbed in VETH as well.  That is one way to make virtual 
routers.  Possibly
the newer VRF features will give another better way to do it, but you should 
not break
things that used to work.

Now, if you want to add a new feature that allows one to configure the kernel 
(or VETH) for
a new behavior, then that might be something to consider.


Right now I don't see the point of handling packets that don't cross
network namespace boundaries specially, other than to preserve backwards
compatibility.


Well, backwards compat is a big deal all by itself!

Thanks,
Ben



Eric







--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 09:11 AM, Steve deRosier wrote:

On Tue, Mar 20, 2018 at 8:39 AM, Ben Greear  wrote:

On 03/20/2018 03:37 AM, Michal Kubecek wrote:


On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote:


From: Ben Greear 

This is similar to ETHTOOL_GSTATS, but it allows you to specify
a 'level'.  This level can be used by the driver to decrease the
amount of stats refreshed.  In particular, this helps with ath10k
since getting the firmware stats can be slow.

Signed-off-by: Ben Greear 
---

NOTE:  I know to make it upstream I would need to split the patch and
remove the #define for 'backporting' that I added.  But, is the
feature in general wanted?  If so, I'll do the patch split and
other tweaks that might be suggested.





Yes, but that would require changing all drivers at once, and would make
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature
would
make it upstream, so I didn't want to propose any large changes up front.



Hi Ben,

I find the feature OK, but I'm not thrilled with the arbitrary scale
of "level". Maybe there could be some named values, either on a
spectrum as level already is, similar to the kernel log DEBUG, WARN,
INFO  type levels. Or named bit flags like the way the ath drivers
do their debug flags for granular results.  Thoughts?


Yes, that would be easier to code too.  If there are any other drivers
out there that might take advantage of this, maybe they could chime in with
what levels and/or bit-fields they would like to see.

For instance a bit that says 'refresh-stats-from-firmware' would be great for 
ath10k,
but maybe useless for everyone else

Thanks,
Ben



- Steve




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 03:37 AM, Michal Kubecek wrote:

On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote:

From: Ben Greear 

This is similar to ETHTOOL_GSTATS, but it allows you to specify
a 'level'.  This level can be used by the driver to decrease the
amount of stats refreshed.  In particular, this helps with ath10k
since getting the firmware stats can be slow.

Signed-off-by: Ben Greear 
---

NOTE:  I know to make it upstream I would need to split the patch and
remove the #define for 'backporting' that I added.  But, is the
feature in general wanted?  If so, I'll do the patch split and
other tweaks that might be suggested.


I'm not familiar enough with the technical background of stats
collecting to comment on usefulness and desirability of this feature.
Adding a new command just to add a numeric parameter certainly doesn't
feel right but it's how the ioctl interface works. I take it as
a reminder to find some time to get back to the netlink interface.


diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 674b6c9..d3b709f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1947,6 +1947,54 @@ static int ethtool_get_stats(struct net_device *dev, 
void __user *useraddr)
return ret;
 }

+static int ethtool_get_stats2(struct net_device *dev, void __user *useraddr)
+{
+   struct ethtool_stats stats;
+   const struct ethtool_ops *ops = dev->ethtool_ops;
+   u64 *data;
+   int ret, n_stats;
+   u32 stats_level = 0;
+
+   if (!ops->get_ethtool_stats2 || !ops->get_sset_count)
+   return -EOPNOTSUPP;
+
+   n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
+   if (n_stats < 0)
+   return n_stats;
+   if (n_stats > S32_MAX / sizeof(u64))
+   return -ENOMEM;
+   WARN_ON_ONCE(!n_stats);
+   if (copy_from_user(&stats, useraddr, sizeof(stats)))
+   return -EFAULT;
+
+   /* User can specify the level of stats to query.  How the
+* level value is used is up to the driver, but in general,
+* 0 means 'all', 1 means least, and higher means more.
+* The idea is that some stats may be expensive to query, so user
+* space could just ask for the cheap ones...
+*/
+   stats_level = stats.n_stats;
+
+   stats.n_stats = n_stats;
+   data = vzalloc(n_stats * sizeof(u64));
+   if (n_stats && !data)
+   return -ENOMEM;
+
+   ops->get_ethtool_stats2(dev, &stats, data, stats_level);
+
+   ret = -EFAULT;
+   if (copy_to_user(useraddr, &stats, sizeof(stats)))
+   goto out;
+   useraddr += sizeof(stats);
+   if (n_stats && copy_to_user(useraddr, data, n_stats * sizeof(u64)))
+   goto out;
+   ret = 0;
+
+ out:
+   vfree(data);
+   return ret;
+}
+
 static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 {
struct ethtool_stats stats;


IMHO it would be more practical to set "0 means same as GSTATS" as a
rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to
avoid code duplication (or perhaps a use fall-through in the switch). It
would also allow drivers to provide only one of the callbacks.


Yes, but that would require changing all drivers at once, and would make 
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature 
would
make it upstream, so I didn't want to propose any large changes up front.

Thanks,
Ben



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH net] virtio-net: disable NAPI only when enabled during XDP set

2018-02-28 Thread Ben Greear

On 02/28/2018 09:22 AM, David Miller wrote:

From: Jason Wang 
Date: Wed, 28 Feb 2018 18:20:04 +0800


We try to disable NAPI to prevent a single XDP TX queue being used by
multiple cpus. But we don't check if device is up (NAPI is enabled),
this could result stall because of infinite wait in
napi_disable(). Fixing this by checking device state through
netif_running() before.

Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang 


Yes, mis-paired NAPI enable/disable are really a pain.

Probably, we can do something in the interfaces or mechanisms to make
this less error prone and less fragile.

Anyways, applied and queued up for -stable, thanks!



I just hit a similar bug in ath10k.  It seems like napi has plenty
of free bit flags so it could keep track of 'is-enabled' state and
allow someone to call napi_disable multiple times w/out deadlocking.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH RFC net-next 1/4] ipv4: fib_rules: support match on sport, dport and ip proto

2018-02-13 Thread Ben Greear

On 02/12/2018 04:03 PM, David Miller wrote:

From: Eric Dumazet 
Date: Mon, 12 Feb 2018 13:54:59 -0800


We had project/teams using different routing tables for each vlan they
setup :/


Indeed, people use FIB rules and think they can scale in software.  As
currently implemented, they can't.

The example you give sounds possibly like a great VRF use case btw :-)


I'm one of those people with lots of FIB rules wishing it would scale
better, and wanting a routing table per netdev.

If there is a relatively easy suggestion to make this work better, I'd
like to give it a try.  I have not looked at VRF at all to date...

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2018-01-24 Thread Ben Greear

On 06/20/2017 08:03 PM, David Ahern wrote:

On 6/20/17 5:41 PM, Ben Greear wrote:

On 06/20/2017 11:05 AM, Michal Kubecek wrote:

On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:

On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no
longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.


You might try trace_printk() which should have less impact (don't forget
to enable /proc/sys/kernel/ftrace_dump_on_oops).


We cannot reproduce with trace_printk() either.


I think that suggests the walker state is set to FWS_U in
fib6_del_route, and it is the FWS_U case in fib6_walk_continue that
triggers the fault -- the null parent (pn = fn->parent). So we have the
2 areas of code that are interacting.

I'm on a road trip through the end of this week with little time to
focus on this problem. I'll get back to you another suggestion when I can.


So, though I don't know the right way to fix it, the patch below appears
to make the system not crash.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
pn = fn->parent;
w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES
+   if (WARN_ON_ONCE(!pn)) {
+   pr_err("FWS-U, w: %p  fn: %p  pn: %p\n",
+  w, fn, pn);
+   /* Attempt to work around crash that has been 
here forever. --Ben */
+   return 0;
+   }
if (FIB6_SUBTREE(pn) == fn) {
WARN_ON(!(fn->fn_flags & RTN_ROOT));
w->state = FWS_L;



The printout looks like this (when adding 4000 mac-vlans, so it is pretty 
rare).  PN is definitely NULL sometimes:

[root@2u-6n ~]# journalctl -f|grep FWS
Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: 8807ea121ba0  fn: 
880856a09260  pn:   (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 8807e3963de0  fn: 
880856a09260  pn:   (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 88081ac22de0  fn: 
880856a09260  pn:   (null)
Jan 24 15:53:13 2u-6n kernel: IPv6: FWS-U, w: 8808290c69c0  fn: 
8807e369f920  pn:   (null)
Jan 24 15:53:24 2u-6n kernel: IPv6: FWS-U, w: 8807ea3156c0  fn: 
88082d1eeb60  pn:   (null)



8066 Jan 24 15:48:04 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device 
eth2#1006
 8067 Jan 24 15:48:05 2u-6n kernel: [ cut here ]
 8068 Jan 24 15:48:05 2u-6n kernel: WARNING: CPU: 5 PID: 3346 at /home/greearb/git/linux-4.13.dev.y/net/ipv6/ip6_fib.c:1617 fib6_walk_continue+ 
0x154/0x1b0 [ipv6]
 8069 Jan 24 15:48:05 2u-6n kernel: Modules linked in: 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen ipmi_ssif coretemp intel_raplsb_edac 
x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm ath9k irqbypass iTCO_wdt ath9k_common iTCO_vendor_support ath9k_hw ath  i2c_i801 mac80211 joydev 
lpc_ich cfg80211 ioatdma shpchp tpm_tis tpm_tis_core wmi tpm ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl 
sch_fq_codel lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core ipv6 crc_ccitt

 8070 Jan 24 15:48:05 2u-6n kernel: CPU: 5 PID: 3346 Comm: ip Tainted: G
   O4.13.16+ #22
 8071 Jan 24 15:48:05 2u-6n kernel: Hardware name: Iron_Systems,Inc 
CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
 8072 Jan 24 15:48:05 2u-6n kernel: task: 8807e9ef1dc0 task.stack: 
c9002083c000
 8073 Jan 24 15:48:05 2u-6n kernel: RIP: 0010:fib6_walk_continue+0x154/0x1b0 
[ipv6]
 8074 Jan 24 15:48:05 2u-6n kernel: RSP: 0018:c9002083fbc0 EFLAGS: 00010246
 8075 Jan 24 15:48:05 2u-6n kernel: RAX:  RBX: 8807ea121ba0 
RCX: 
 8076 Jan 24 15:48:05 2u-6n kernel: RDX: 880856a09260 RSI: c9002083fc00 
RDI: 81ef2140
 8077 Jan 24 15:48:05 2u-6n kernel: RBP: c9002083fbc8 R08: 0008 
R09: 8807e36f6b25
 8078 Jan 24 15:48:05 2u-6n kernel: R10: c9002083fb70 R11: 000

Re: e1000e hardware unit hangs

2018-01-24 Thread Ben Greear

On 01/24/2018 10:38 AM, Denys Fedoryshchenko wrote:

On 2018-01-24 20:31, Ben Greear wrote:

On 01/24/2018 08:34 AM, Neftin, Sasha wrote:

On 1/24/2018 18:11, Alexander Duyck wrote:

On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear  wrote:

Hello,

Anyone have any more suggestions for making e1000e work better?  This is
from a 4.9.65+ kernel,
with these additional e1000e patches applied:

e1000e: Fix error path in link detection
e1000e: Fix wrong comment related to link detection
e1000e: Fix return value test
e1000e: Separate signaling for link check/link up
e1000e: Avoid receiver overrun interrupt bursts


Most of these patches shouldn't address anything that would trigger Tx
hangs. They are mostly related to just link detection.


Test case is simply to run 3 tcp connections each trying to send 56Kbps
of bi-directional
data between a pair of e1000e interfaces :)

No OOM related issues are seen on this kernel...similar test on 4.13 showed
some OOM
issues, but I have not debugged that yet...


Really a question like this probably belongs on e1000-devel or
intel-wired-lan so I have added those lists and the e1000e maintainer
to the thread.

It would be useful if you could provide more information about the
device itself such as the ID and the kind of test you are running.
Keep in mind the e1000e driver supports a pretty broad swath of
devices so we need to narrow things down a bit.


please, also re-check if your kernel include:
e1000e: fix buffer overrun while the I219 is processing DMA transactions
e1000e: fix the use of magic numbers for buffer overrun issue
where you take fresh version of kernel?


Hello,

I tried adding those two patches, but I still see this splat shortly
after starting
my test.  The kernel I am using is here:

https://github.com/greearb/linux-ct-4.13

I've seen similar issues at least back to the 4.0 kernel, including
stock kernels and my
own kernels with additional patches.

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499,
wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: [ cut
here ]
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0
PID: 0 at
/home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322
dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in:
nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c
cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0
Comm: swapper/0 Tainted: G   O4.13.16+ #22
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name:
Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task:
81e104c0 task.stack: 81e0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP:
0010:dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP:
0018:88042fc03e50 EFLAGS: 00010282
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX:
0086 RBX:  RCX: 
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX:
88042fc15b40 RSI: 88042fc0dbf8 RDI: 88042fc0dbf8
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP:
88042fc03e98 R08: 0001 R09: 03c4
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10:
 R11: 03c4 R12: 1388
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13:
000100050dc3 R14: 88041767 R15: 000100052400
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS:
() GS:88042fc0()
knlGS:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS:  0010 DS: 
ES:  CR0: 80050033
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2:
01d14000 CR3: 01e09000 CR4: 001406f0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  call_timer_fn+0x30/0x160
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
run_timer_softirq+0x1f0/0x450
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ?
lapic_next_deadline+0x21/0x30
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ?
clockevents_program_event+0x78/0xf0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  __do_softirq+0xc1/0x2c0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  irq_exit+0xb1/0xc0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
smp_apic_timer_interrupt+0x38/0x50
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
apic_timer_interrupt+0x89/0x90
Ja

Re: e1000e hardware unit hangs

2018-01-24 Thread Ben Greear

On 01/24/2018 08:34 AM, Neftin, Sasha wrote:

On 1/24/2018 18:11, Alexander Duyck wrote:

On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear  wrote:

Hello,

Anyone have any more suggestions for making e1000e work better?  This is
from a 4.9.65+ kernel,
with these additional e1000e patches applied:

e1000e: Fix error path in link detection
e1000e: Fix wrong comment related to link detection
e1000e: Fix return value test
e1000e: Separate signaling for link check/link up
e1000e: Avoid receiver overrun interrupt bursts


Most of these patches shouldn't address anything that would trigger Tx
hangs. They are mostly related to just link detection.


Test case is simply to run 3 tcp connections each trying to send 56Kbps
of bi-directional
data between a pair of e1000e interfaces :)

No OOM related issues are seen on this kernel...similar test on 4.13 showed
some OOM
issues, but I have not debugged that yet...


Really a question like this probably belongs on e1000-devel or
intel-wired-lan so I have added those lists and the e1000e maintainer
to the thread.

It would be useful if you could provide more information about the
device itself such as the ID and the kind of test you are running.
Keep in mind the e1000e driver supports a pretty broad swath of
devices so we need to narrow things down a bit.


please, also re-check if your kernel include:
e1000e: fix buffer overrun while the I219 is processing DMA transactions
e1000e: fix the use of magic numbers for buffer overrun issue
where you take fresh version of kernel?


Hello,

I tried adding those two patches, but I still see this splat shortly after 
starting
my test.  The kernel I am using is here:

https://github.com/greearb/linux-ct-4.13

I've seen similar issues at least back to the 4.0 kernel, including stock 
kernels and my
own kernels with additional patches.

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, wd-timeout: 5000 
jiffies: 4295304192 tx-queues: 1

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: [ cut here 
]
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 PID: 0 at /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 
dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c cfg80211 macvlan 
wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 Comm: 
swapper/0 Tainted: G   O4.13.16+ #22
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro 
X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: 81e104c0 
task.stack: 81e0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 
0010:dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:88042fc03e50 
EFLAGS: 00010282
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0086 RBX: 
 RCX: 
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 88042fc15b40 RSI: 
88042fc0dbf8 RDI: 88042fc0dbf8
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: 88042fc03e98 R08: 
0001 R09: 03c4
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10:  R11: 
03c4 R12: 1388
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: 000100050dc3 R14: 
88041767 R15: 000100052400
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS:  () 
GS:88042fc0() knlGS:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS:  0010 DS:  ES:  
CR0: 80050033
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: 01d14000 CR3: 
01e09000 CR4: 001406f0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  call_timer_fn+0x30/0x160
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  
run_timer_softirq+0x1f0/0x450
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? 
lapic_next_deadline+0x21/0x30
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? 
clockevents_program_event+0x78/0xf0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  __do_softirq+0xc1/0x2c0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  irq_exit+0xb1/0xc0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  
smp_apic_timer_interrupt+0x38/0x50
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  
apic_timer_interrupt+0x89/0x90
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: 

Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/23/2018 03:27 PM, Ben Greear wrote:

On 01/23/2018 03:21 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote:

On 01/23/2018 02:29 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote:

On 01/23/2018 02:07 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



Hello Eric, looks like it is one of your commits that causes the issue
I see.

Here are some more details on my specific test case I used to bisect:

I have two ixgbe ports looped back, configured on same subnet, but with 
different IPs.
Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client 
and server
side let me send-to-self over the external looped cable.

I have 2 mac-vlans on each physical interface.

I created 5 server-side connections on one physical port, and two more on one 
of the mac-vlans.

On the client-side, I create a process that spawns 5000 connections to the 
corresponding server side.

End result is 25,000 connections on one pair of real interfaces, and 10,000 
connections on the
mac-vlan ports.

In the passing case, I get very close to all 5000 connections on all endpoints 
quickly.

In the failing case, I get a max of around 16k connections on the two physical 
ports.  The two mac-vlans have 10k connections
across them working reliably.  It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD)= 0
fcntl(2075, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD)= 0
fcntl(2076, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign 
requested address)



ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet 
Date:   Thu Feb 11 16:28:50 2016 -0800

 tcp/dccp: better use of ephemeral ports in bind()

 Implement strategy used in __inet_hash_connect() in opposite way :

 Try to find a candidate using odd ports, then fallback to even ports.

 We no longer disable BH for whole traversal, but one bucket at a time.
 We also use cond_resched() to yield cpu to other tasks if needed.

 I removed one indentation level and tried to mirror the loop we have
 in __inet_hash_connect() and variable names to ease code maintenance.

 Signed-off-by: Eric Dumazet 
 Signed-off-by: David S. Miller 

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 
e0c014ae8b7e2867256eff60f6210821d36eacef Mnet


I will be happy to test patches or try to get any other results that might help 
diagno

e1000e hardware unit hangs

2018-01-23 Thread Ben Greear
, trans_start: 4294748730, wd-timeout: 5000 
jiffies: 4294759424 tx-queues: 1

Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: 
Reset adapter unexpectedly
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: 
Reset adapter unexpectedly
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout: 5000 
jiffies: 4294771200 tx-queues: 1
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout: 5000 
jiffies: 4294771200 tx-queues: 1

Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: 
Reset adapter unexpectedly
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: 
Reset adapter unexpectedly
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: 
Detected Hardware Unit Hang:
  TDH  
  TDT  
...
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx


Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/23/2018 03:21 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote:

On 01/23/2018 02:29 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote:

On 01/23/2018 02:07 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



Hello Eric, looks like it is one of your commits that causes the issue
I see.

Here are some more details on my specific test case I used to bisect:

I have two ixgbe ports looped back, configured on same subnet, but with 
different IPs.
Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client 
and server
side let me send-to-self over the external looped cable.

I have 2 mac-vlans on each physical interface.

I created 5 server-side connections on one physical port, and two more on one 
of the mac-vlans.

On the client-side, I create a process that spawns 5000 connections to the 
corresponding server side.

End result is 25,000 connections on one pair of real interfaces, and 10,000 
connections on the
mac-vlan ports.

In the passing case, I get very close to all 5000 connections on all endpoints 
quickly.

In the failing case, I get a max of around 16k connections on the two physical 
ports.  The two mac-vlans have 10k connections
across them working reliably.  It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD)= 0
fcntl(2075, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD)= 0
fcntl(2076, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign 
requested address)



ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet 
Date:   Thu Feb 11 16:28:50 2016 -0800

 tcp/dccp: better use of ephemeral ports in bind()

 Implement strategy used in __inet_hash_connect() in opposite way :

 Try to find a candidate using odd ports, then fallback to even ports.

 We no longer disable BH for whole traversal, but one bucket at a time.
 We also use cond_resched() to yield cpu to other tasks if needed.

 I removed one indentation level and tried to mirror the loop we have
 in __inet_hash_connect() and variable names to ease code maintenance.

 Signed-off-by: Eric Dumazet 
 Signed-off-by: David S. Miller 

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 
e0c014ae8b7e2867256eff60f6210821d36eacef M  net


I will be happy to test patches or try to get any other results that might help 
diagnose
this problem better.


Problem is I do not s

Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/23/2018 02:29 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote:

On 01/23/2018 02:07 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



Hello Eric, looks like it is one of your commits that causes the issue
I see.

Here are some more details on my specific test case I used to bisect:

I have two ixgbe ports looped back, configured on same subnet, but with 
different IPs.
Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client 
and server
side let me send-to-self over the external looped cable.

I have 2 mac-vlans on each physical interface.

I created 5 server-side connections on one physical port, and two more on one 
of the mac-vlans.

On the client-side, I create a process that spawns 5000 connections to the 
corresponding server side.

End result is 25,000 connections on one pair of real interfaces, and 10,000 
connections on the
mac-vlan ports.

In the passing case, I get very close to all 5000 connections on all endpoints 
quickly.

In the failing case, I get a max of around 16k connections on the two physical 
ports.  The two mac-vlans have 10k connections
across them working reliably.  It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD)= 0
fcntl(2075, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD)= 0
fcntl(2076, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign 
requested address)



ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet 
Date:   Thu Feb 11 16:28:50 2016 -0800

 tcp/dccp: better use of ephemeral ports in bind()

 Implement strategy used in __inet_hash_connect() in opposite way :

 Try to find a candidate using odd ports, then fallback to even ports.

 We no longer disable BH for whole traversal, but one bucket at a time.
 We also use cond_resched() to yield cpu to other tasks if needed.

 I removed one indentation level and tried to mirror the loop we have
 in __inet_hash_connect() and variable names to ease code maintenance.

 Signed-off-by: Eric Dumazet 
 Signed-off-by: David S. Miller 

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 
e0c014ae8b7e2867256eff60f6210821d36eacef M  net


I will be happy to test patches or try to get any other results that might help 
diagnose
this problem better.


Problem is I do not see anything obvious here.

Please provide /proc/sys/net/ipv4/ip_local_port_range


[root@lf1003-e

Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/23/2018 02:07 PM, Eric Dumazet wrote:

On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



Hello Eric, looks like it is one of your commits that causes the issue
I see.

Here are some more details on my specific test case I used to bisect:

I have two ixgbe ports looped back, configured on same subnet, but with 
different IPs.
Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client 
and server
side let me send-to-self over the external looped cable.

I have 2 mac-vlans on each physical interface.

I created 5 server-side connections on one physical port, and two more on one 
of the mac-vlans.

On the client-side, I create a process that spawns 5000 connections to the 
corresponding server side.

End result is 25,000 connections on one pair of real interfaces, and 10,000 
connections on the
mac-vlan ports.

In the passing case, I get very close to all 5000 connections on all endpoints 
quickly.

In the failing case, I get a max of around 16k connections on the two physical 
ports.  The two mac-vlans have 10k connections
across them working reliably.  It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD)= 0
fcntl(2075, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD)= 0
fcntl(2076, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign 
requested address)



ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet 
Date:   Thu Feb 11 16:28:50 2016 -0800

 tcp/dccp: better use of ephemeral ports in bind()

 Implement strategy used in __inet_hash_connect() in opposite way :

 Try to find a candidate using odd ports, then fallback to even ports.

 We no longer disable BH for whole traversal, but one bucket at a time.
 We also use cond_resched() to yield cpu to other tasks if needed.

 I removed one indentation level and tried to mirror the loop we have
 in __inet_hash_connect() and variable names to ease code maintenance.

 Signed-off-by: Eric Dumazet 
 Signed-off-by: David S. Miller 

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 
e0c014ae8b7e2867256eff60f6210821d36eacef M  net


I will be happy to test patches or try to get any other results that might help 
diagnose
this problem better.


Problem is I do not see anything obvious here.

Please provide /proc/sys/net/ipv4/ip_local_port_range


[root@lf1003-e3v2-13100124-f20x64 ~]# cat /proc/sys/net/ipv4/ip_local_port_range
1   61001



Also you probab

Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-23 Thread Ben Greear

On 01/22/2018 10:46 AM, Josh Hunt wrote:

On Mon, Jan 22, 2018 at 10:30 AM, Ben Greear  wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:


On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:


My test case is to have 6 processes each create 5000 TCP IPv4 connections
to each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a
4.7 kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running
out of tcp memory,
but even after forcing those values higher, the max connections we can
get is around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is
my fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in
more recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



I am sending to self, but over external network interfaces, by using
routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
connections.  In this case, I have a pair of 10G ports doing 15k, and then
I try to start 5k on two of the 1G ports

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3
(e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0
eth3: Reset adapter unexpectedly



Ben

We had an interface doing this and grabbing these commits resolved it for us:

4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts
19110cfbb34d e1000e: Separate signaling for link check/link up
d3509f8bc7b0 e1000e: Fix return value test
65a29da1f5fd e1000e: Fix wrong comment related to link detection
c4c40e51f9c3 e1000e: Fix error path in link detection

They are in the LTS kernels now, but don't believe they were when we
first hit this problem.


Thanks a lot for the suggestions, I can confirm that these patches applied to 
my 4.13.16+
tree does indeed seem to fix the problem.

Thanks,
Ben



Josh




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

2018-01-23 Thread Ben Greear

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



Hello Eric, looks like it is one of your commits that causes the issue
I see.

Here are some more details on my specific test case I used to bisect:

I have two ixgbe ports looped back, configured on same subnet, but with 
different IPs.
Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client 
and server
side let me send-to-self over the external looped cable.

I have 2 mac-vlans on each physical interface.

I created 5 server-side connections on one physical port, and two more on one 
of the mac-vlans.

On the client-side, I create a process that spawns 5000 connections to the 
corresponding server side.

End result is 25,000 connections on one pair of real interfaces, and 10,000 
connections on the
mac-vlan ports.

In the passing case, I get very close to all 5000 connections on all endpoints 
quickly.

In the failing case, I get a max of around 16k connections on the two physical 
ports.  The two mac-vlans have 10k connections
across them working reliably.  It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD)= 0
fcntl(2075, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD)= 0
fcntl(2076, F_SETFD, FD_CLOEXEC)= 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 
16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), 
sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign 
requested address)



ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet 
Date:   Thu Feb 11 16:28:50 2016 -0800

tcp/dccp: better use of ephemeral ports in bind()

Implement strategy used in __inet_hash_connect() in opposite way :

Try to find a candidate using odd ports, then fallback to even ports.

We no longer disable BH for whole traversal, but one bucket at a time.
We also use cond_resched() to yield cpu to other tasks if needed.

I removed one indentation level and tried to mirror the loop we have
in __inet_hash_connect() and variable names to ease code maintenance.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 
e0c014ae8b7e2867256eff60f6210821d36eacef M  net


I will be happy to test patches or try to get any other results that might help 
diagnose
this problem better.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Ben Greear

On 01/22/2018 10:30 AM, Ben Greear wrote:

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



I am sending to self, but over external network interfaces, by using
routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
connections.  In this case, I have a pair of 10G ports doing 15k, and then
I try to start 5k on two of the 1G ports

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 
(e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: 
Reset adapter unexpectedly


System reports 10+GB RAM free in this case, btw.

Actually, maybe the good kernel was even older than 4.7...I see same resets and 
inability to do a full 20k
connections on 4.7 too.   I double-checked with system-test and it seems 4.4 
was a good kernel.  I'll test
that next.  Here is splat from 4.7:

[  238.921679] [ cut here ]
[  238.921689] WARNING: CPU: 0 PID: 3 at 
/home/greearb/git/linux-bisect/net/sched/sch_generic.c:272 
dev_watchdog+0xd4/0x12f
[  238.921690] NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out
[  238.921691] Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink 
nf_defrag_ipv4 cfg80211 macvlan pktgen bnep bluetooth fuse coretemp intel_rapl
ftdi_sio x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt 
iTCO_vendor_support joydev ie31200_edac ipmi_devintf irqbypass serio_raw 
ipmi_si edac_core
shpchp fjes video i2c_i801 tpm_tis lpc_ich ipmi_msghandler tpm nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core
e1000e ixgbe mdio hwmon dca ptp pps_core ipv6 [last unloaded: nf_conntrack]
[  238.921720] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.7.0 #62
[  238.921721] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 
09/17/2012
[  238.921723]   88041cdd7cd8 81352a23 
88041cdd7d28
[  238.921725]   88041cdd7d18 810ea5dd 
01101cdd7d90
[  238.921727]  880417a84000 0100 8163ecff 
880417a84440
[  238.921728] Call Trace:
[  238.921733]  [] dump_stack+0x61/0x7d
[  238.921736]  [] __warn+0xbd/0xd8
[  238.921738]  [] ? netif_tx_lock+0x81/0x81
[  238.921740]  [] warn_slowpath_fmt+0x46/0x4e
[  238.921741]  [] ? netif_tx_lock+0x74/0x81
[  238.921743]  [] dev_watchdog+0xd4/0x12f
[  238.921746]  [] call_timer_fn+0x65/0x11b
[  238.921748]  [] ? netif_tx_lock+0x81/0x81
[  238.921749]  [] run_timer_softirq+0x1ad/0x1d7
[  238.921751]  [] __do_softirq+0xfb/0x25c
[  238.921752]  [] run_ksoftirqd+0x19/0x35
[  238.921755]  [] smpboot_thread_fn+0x169/0x1a9
[  238.921756]  [] ? sort_range+0x1d/0x1d
[  238.921759]  [] kthread+0xa0/0xa8
[  238.921763]  [] ret_from_fork+0x1f/0x40
[  238.921764]  [] ? init_completion+0x24/0x24
[  238.921765] ---[ end trace 933912956c6ee5ff ]---
[  238.961672] e1000e :07:00.0 eth3: Reset adapter unexpectedly


So, on 4.4.8+, I see this and other splats related to e1000e.  I guess that is 
a separate
issue.  I can easily start 40k connections however, 30k across the two 10G 
ports,
and 10k more across a pair of mac-vlans on the 10G ports (since I was out of
address space to add a full 40k on the two physical ports).


Looks like the e1000e problem is a separate issu

Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Ben Greear

On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...



Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?



I am sending to self, but over external network interfaces, by using
routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
connections.  In this case, I have a pair of 10G ports doing 15k, and then
I try to start 5k on two of the 1G ports

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is 
Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 
1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 
(e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: 
Reset adapter unexpectedly


System reports 10+GB RAM free in this case, btw.

Actually, maybe the good kernel was even older than 4.7...I see same resets and 
inability to do a full 20k
connections on 4.7 too.   I double-checked with system-test and it seems 4.4 
was a good kernel.  I'll test
that next.  Here is splat from 4.7:

[  238.921679] [ cut here ]
[  238.921689] WARNING: CPU: 0 PID: 3 at 
/home/greearb/git/linux-bisect/net/sched/sch_generic.c:272 
dev_watchdog+0xd4/0x12f
[  238.921690] NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out
[  238.921691] Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 macvlan pktgen bnep bluetooth fuse coretemp intel_rapl 
ftdi_sio x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt iTCO_vendor_support joydev ie31200_edac ipmi_devintf irqbypass serio_raw ipmi_si edac_core 
shpchp fjes video i2c_i801 tpm_tis lpc_ich ipmi_msghandler tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core 
e1000e ixgbe mdio hwmon dca ptp pps_core ipv6 [last unloaded: nf_conntrack]

[  238.921720] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.7.0 #62
[  238.921721] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 
09/17/2012
[  238.921723]   88041cdd7cd8 81352a23 
88041cdd7d28
[  238.921725]   88041cdd7d18 810ea5dd 
01101cdd7d90
[  238.921727]  880417a84000 0100 8163ecff 
880417a84440
[  238.921728] Call Trace:
[  238.921733]  [] dump_stack+0x61/0x7d
[  238.921736]  [] __warn+0xbd/0xd8
[  238.921738]  [] ? netif_tx_lock+0x81/0x81
[  238.921740]  [] warn_slowpath_fmt+0x46/0x4e
[  238.921741]  [] ? netif_tx_lock+0x74/0x81
[  238.921743]  [] dev_watchdog+0xd4/0x12f
[  238.921746]  [] call_timer_fn+0x65/0x11b
[  238.921748]  [] ? netif_tx_lock+0x81/0x81
[  238.921749]  [] run_timer_softirq+0x1ad/0x1d7
[  238.921751]  [] __do_softirq+0xfb/0x25c
[  238.921752]  [] run_ksoftirqd+0x19/0x35
[  238.921755]  [] smpboot_thread_fn+0x169/0x1a9
[  238.921756]  [] ? sort_range+0x1d/0x1d
[  238.921759]  [] kthread+0xa0/0xa8
[  238.921763]  [] ret_from_fork+0x1f/0x40
[  238.921764]  [] ? init_completion+0x24/0x24
[  238.921765] ---[ end trace 933912956c6ee5ff ]---
[  238.961672] e1000e :07:00.0 eth3: Reset adapter unexpectedly


Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Ben Greear

My test case is to have 6 processes each create 5000 TCP IPv4 connections to 
each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 
kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running out of 
tcp memory,
but even after forcing those values higher, the max connections we can get is 
around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my 
fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more 
recent kernels?

I will start bisecting in the meantime...

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



fm10k cannot get link

2017-10-31 Thread Ben Greear

Hello,

We're trying to get an Intel 100G NIC to work, and so far, cannot get it to 
link.

The cable is:  X0016I4AO3 QSFP28 10Gtek  (any suggestions for a 
better/different one?)

[5.022681] fm10k :05:00.0: PCI Express bandwidth of 64GT/s available
[5.022683] fm10k :05:00.0: (Speed:8.0GT/s, Width: x8, Encoding 
Loss:<2%, Payload:256B)
[5.022684] fm10k :05:00.0: 00:e0:ed:54:78:f2
[5.027864] fm10k :06:00.0: PCI Express bandwidth of 64GT/s available
[5.027865] fm10k :06:00.0: (Speed:8.0GT/s, Width: x8, Encoding 
Loss:<2%, Payload:256B)
[5.027866] fm10k :06:00.0: 00:e0:ed:54:78:f3
[6.057950] Modules linked in: ioatdma(+) shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm igb drm i2c_algo_bit i2c_core ixgbe mdio 
hwmon fm10k ptp pps_core dca fjes ipv6 crc_ccitt

[7.294441] fm10k :05:00.0 eth0.r: renamed from eth0
[   14.044914] fm10k :05:00.0 eth2: renamed from eth0.r
[   14.107798] fm10k :06:00.0 eth1.r: renamed from eth1
[   14.178217] fm10k :06:00.0 eth3: renamed from eth1.r


[root@lf1005c-is14120020 ~]# ethtool eth3
Settings for eth3:
Current message level: 0x0007 (7)
   drv probe link
Link detected: no


[root@lf1005c-is14120020 ~]# uname -a
Linux lf1005c-is14120020 4.9.29+ #46 SMP PREEMPT Wed Jul 26 17:48:57 PDT 2017 
x86_64 x86_64 x86_64 GNU/Linux

[root@lf1005c-is14120020 ~]# ethtool -i eth3
driver: fm10k
version: 0.21.2-k
firmware-version:
bus-info: :06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

[root@lf1005c-is14120020 ~]# lspci|grep 06
06:00.0 Ethernet controller: Intel Corporation Device 15a4


Please let me know if you have any suggestions.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Ethtool question

2017-10-16 Thread Ben Greear

On 10/12/2017 03:00 PM, Roopa Prabhu wrote:

On Thu, Oct 12, 2017 at 2:45 PM, Ben Greear  wrote:

On 10/11/2017 01:49 PM, David Miller wrote:


From: "John W. Linville" 
Date: Wed, 11 Oct 2017 16:44:07 -0400


On Wed, Oct 11, 2017 at 09:51:56AM -0700, Ben Greear wrote:


I noticed today that setting some ethtool settings to the same value
returns an error code.  I would think this should silently return
success instead?  Makes it easier to call it from scripts this way:

[root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1
combined unmodified, ignoring
no channel parameters changed, aborting
current values: tx 0 rx 0 other 1 combined 1
[root@lf0313-6477 lanforge]# echo $?
1



I just had this discussion a couple of months ago with someone. My
initial feeling was like you, a no-op is not a failure. But someone
convinced me otherwise...I will now endeavour to remember who that
was and how they convinced me...

Anyone else have input here?



I guess this usually happens when drivers don't support changing the
settings at all.  So they just make their ethtool operation for the
'set' always return an error.

We could have a generic ethtool helper that does "get" and then if the
"set" request is identical just return zero.

But from another perspective, the error returned from the "set" in this
situation also indicates to the user that the driver does not support
the "set" operation which has value and meaning in and of itself.  And
we'd lose that with the given suggestion.



In my case, the driver (igb) does support the set, my program just made the
same
ethtool call several times and it fails after the initial change (that
actually
changes something), as best as I can figure.



This error is returned by ethtool user-space. It does a get, check and
then set if user has requested changes.



So, should we fix ethtool to return 0 in this case instead of an error code?

I think so.  If the driver itself returns an error, then probably return the
error code and/or fix the driver as seems appropriate.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Ethtool question

2017-10-12 Thread Ben Greear

On 10/11/2017 01:49 PM, David Miller wrote:

From: "John W. Linville" 
Date: Wed, 11 Oct 2017 16:44:07 -0400


On Wed, Oct 11, 2017 at 09:51:56AM -0700, Ben Greear wrote:

I noticed today that setting some ethtool settings to the same value
returns an error code.  I would think this should silently return
success instead?  Makes it easier to call it from scripts this way:

[root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1
combined unmodified, ignoring
no channel parameters changed, aborting
current values: tx 0 rx 0 other 1 combined 1
[root@lf0313-6477 lanforge]# echo $?
1


I just had this discussion a couple of months ago with someone. My
initial feeling was like you, a no-op is not a failure. But someone
convinced me otherwise...I will now endeavour to remember who that
was and how they convinced me...

Anyone else have input here?


I guess this usually happens when drivers don't support changing the
settings at all.  So they just make their ethtool operation for the
'set' always return an error.

We could have a generic ethtool helper that does "get" and then if the
"set" request is identical just return zero.

But from another perspective, the error returned from the "set" in this
situation also indicates to the user that the driver does not support
the "set" operation which has value and meaning in and of itself.  And
we'd lose that with the given suggestion.


In my case, the driver (igb) does support the set, my program just made the same
ethtool call several times and it fails after the initial change (that actually
changes something), as best as I can figure.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Ethtool question

2017-10-11 Thread Ben Greear

I noticed today that setting some ethtool settings to the same value
returns an error code.  I would think this should silently return
success instead?  Makes it easier to call it from scripts this way:

[root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1
combined unmodified, ignoring
no channel parameters changed, aborting
current values: tx 0 rx 0 other 1 combined 1
[root@lf0313-6477 lanforge]# echo $?
1

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-26 Thread Ben Greear

On 09/12/2017 01:26 PM, Michal Kubecek wrote:

On Tue, Sep 12, 2017 at 11:54:43AM -0700, Ben Greear wrote:

It does not appear to work on Fedora-26, and I'm curious if someone
knows what needs doing to get this support working?


It's rather complicated. The "vlan" and "vlan " filters didn't
handle the case when vlan information is passed in metadata until commit
04660eb1e561 ("Use BPF extensions in compiled filters"), i.e. libpcap
1.7.0. Unfortunately that commit made libpcap always check only metadata
for the first outermost vlan tag so that it broke the case when vlan
information is passed in packet itself (which is less frequent today).

To handle both cases correctly, you would need libpcap with commits
d739b068ac29 ("Make VLAN filter handle both metadata and inline tags")
and 7c7a19fbd9af ("Fix logic of combined VLAN test") and also the
optimizer fix from

  https://github.com/the-tcpdump-group/libpcap/pull/582/commits/075015a3d17a

(without it the filters generate incorrect BPF in some cases unless the
optimizer is disabled). As far as I can see, these commits are not in
any release yet.

   Michal Kubecek



So, I cloned the latest libpcap, and I'm going to start poking at this.

Do you happen to know if I need to do anything special other than
'pcap_compile()'?  I'm curious how the library would know if it can use
newer kernel API or not...or maybe it is somehow magically backwards/forward
compatible?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-12 Thread Ben Greear

On 09/12/2017 11:54 AM, Ben Greear wrote:

It does not appear to work on Fedora-26, and I'm curious if someone knows what 
needs
doing to get this support working?

Thanks,
Ben




Gah, I spoke too soon.  system-test guy says it works on cmd-line, but
not when we try to make it work in another way...could be local bug,
I'll poke at this more.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-12 Thread Ben Greear

It does not appear to work on Fedora-26, and I'm curious if someone knows what 
needs
doing to get this support working?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] Fix build on fedora-14 (and other older systems)

2017-09-03 Thread Ben Greear



On 09/03/2017 08:50 AM, Stephen Hemminger wrote:

On Sat,  2 Sep 2017 07:15:02 -0700
gree...@candelatech.com wrote:


diff --git a/include/linux/sysinfo.h b/include/linux/sysinfo.h
index 934335a..3596b02 100644
--- a/include/linux/sysinfo.h
+++ b/include/linux/sysinfo.h
@@ -3,6 +3,14 @@

 #include 

+/* So we can compile on older OSs, hopefully this is correct. --Ben */
+#ifndef __kernel_long_t
+typedef long __kernel_long_t;
+#endif
+#ifndef __kernel_ulong_t
+typedef unsigned long __kernel_ulong_t;
+#endif
+
 #define SI_LOAD_SHIFT  16
 struct sysinfo {
__kernel_long_t uptime; /* Seconds since boot */


I am not accepting this patch because all files in include/linux are 
automatically
regenerated from kernel 'make install_headers'. No exceptions. If you want to 
change
a header in include/linux it has to go through upstream kernel inclusion.


It would be wrong to add this to the actual kernel header I think.

Do you have another suggestion for fixing iproute2 compile?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: Problem compiling iproute2 on older systems

2017-09-02 Thread Ben Greear



On 09/02/2017 12:55 AM, Michal Kubecek wrote:

On Fri, Sep 01, 2017 at 04:52:20PM -0700, Ben Greear wrote:

In the patch below, usage of __kernel_ulong_t and __kernel_long_t is
introduced, but that is not available on older system (fedora-14, at least).

It is not a #define, so I am having trouble finding a quick hack
around this.

Any ideas on how to make this work better on older OSs running
modern kernels?


Author: Stephen Hemminger   2017-01-12 17:54:39
Committer: Stephen Hemminger   2017-01-12 17:54:39
Child:  c7ec7697e3f000359aa317394e6dd972e35c1f84 (Fix build on fedora-14 (and 
other older systems))
Branches: master, remotes/origin/master
Follows: v3.10.0
Precedes:

add more uapi header files

In order to ensure no backward/forward compatiablity problems,
make sure that all kernel headers used come from the local copy.

Signed-off-by: Stephen Hemminger 

--- include/linux/sysinfo.h ---
new file mode 100644
index 000..934335a
@@ -0,0 +1,24 @@
+#ifndef _LINUX_SYSINFO_H
+#define _LINUX_SYSINFO_H
+
+#include 
+
+#define SI_LOAD_SHIFT  16
+struct sysinfo {
+   __kernel_long_t uptime; /* Seconds since boot */
+   __kernel_ulong_t loads[3];  /* 1, 5, and 15 minute load averages */
+   __kernel_ulong_t totalram;  /* Total usable main memory size */
+   __kernel_ulong_t freeram;   /* Available memory size */
+   __kernel_ulong_t sharedram; /* Amount of shared memory */
+   __kernel_ulong_t bufferram; /* Memory used by buffers */
+   __kernel_ulong_t totalswap; /* Total swap space size */
+   __kernel_ulong_t freeswap;  /* swap space still available */
+   __u16 procs;/* Number of current processes */
+   __u16 pad;  /* Explicit padding for m68k */
+   __kernel_ulong_t totalhigh; /* Total high memory size */
+   __kernel_ulong_t freehigh;  /* Available high memory size */
+   __u32 mem_unit; /* Memory unit size in bytes */
+   char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)];   /* Padding: 
libc5 uses this.. */
+};
+
+#endif /* _LINUX_SYSINFO_H */


I've been already thinking about this a bit. Normally, we would simply
add the file where __kernel_long_t and __kernel_ulong_t are defined.
The problem is this is  which depends on
architecture - which is the point of these types.

Good thing is iproute2 doesn't actually use struct sysinfo anywhere so
we don't need to have them defined correctly. One possible workaround
would therefore be defining them as long and unsigned long. As long as
we don't use the types anywhere, we would be fine.

Another option would be to replace include/linux/sysinfo.h with an empty
file. The problem I can see with this is that if someone uses a script
to refresh all copies of uapi headers automatically, the script would
have to be aware that it must not update this file and preserve the fake
empty one.


I just sent a patch that appears to compile on all of my build systems, which 
are
generally fedora-14 to fedora-24 currently.

I haven't actually tested functionality yet, but if you say it is unused, then
it is very likely to be OK, and even if not, I think it will be fine unless
someone is trying to cross-compile.  And in that case, probably more than one
issue involved...

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Problem compiling iproute2 on older systems

2017-09-01 Thread Ben Greear

In the patch below, usage of __kernel_ulong_t and __kernel_long_t is
introduced, but that is not available on older system (fedora-14, at least).

It is not a #define, so I am having trouble finding a quick hack
around this.

Any ideas on how to make this work better on older OSs running
modern kernels?


Author: Stephen Hemminger   2017-01-12 17:54:39
Committer: Stephen Hemminger   2017-01-12 17:54:39
Child:  c7ec7697e3f000359aa317394e6dd972e35c1f84 (Fix build on fedora-14 (and 
other older systems))
Branches: master, remotes/origin/master
Follows: v3.10.0
Precedes:

add more uapi header files

In order to ensure no backward/forward compatiablity problems,
make sure that all kernel headers used come from the local copy.

Signed-off-by: Stephen Hemminger 

--- include/linux/sysinfo.h ---
new file mode 100644
index 000..934335a
@@ -0,0 +1,24 @@
+#ifndef _LINUX_SYSINFO_H
+#define _LINUX_SYSINFO_H
+
+#include 
+
+#define SI_LOAD_SHIFT  16
+struct sysinfo {
+   __kernel_long_t uptime; /* Seconds since boot */
+   __kernel_ulong_t loads[3];  /* 1, 5, and 15 minute load averages */
+   __kernel_ulong_t totalram;  /* Total usable main memory size */
+   __kernel_ulong_t freeram;   /* Available memory size */
+   __kernel_ulong_t sharedram; /* Amount of shared memory */
+   __kernel_ulong_t bufferram; /* Memory used by buffers */
+   __kernel_ulong_t totalswap; /* Total swap space size */
+   __kernel_ulong_t freeswap;  /* swap space still available */
+   __u16 procs;/* Number of current processes */
+   __u16 pad;  /* Explicit padding for m68k */
+   __kernel_ulong_t totalhigh; /* Total high memory size */
+   __kernel_ulong_t freehigh;  /* Available high memory size */
+   __u32 mem_unit; /* Memory unit size in bytes */
+   char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)];   /* Padding: 
libc5 uses this.. */
+};
+
+#endif /* _LINUX_SYSINFO_H */


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers

2017-08-16 Thread Ben Greear

On 08/16/2017 08:18 PM, Dan Williams wrote:

On Wed, 2017-08-16 at 19:36 -0700, Ben Greear wrote:

On 08/16/2017 07:11 PM, Dan Williams wrote:

On Wed, 2017-08-16 at 14:31 -0700, David Miller wrote:

From: Dan Williams 
Date: Wed, 16 Aug 2017 16:22:41 -0500


My biggest suggestion is that perhaps bonding should grow


hysteresis

for link speeds. Since WiFi can change speed every packet, you


probably

don't want the bond characteristics changing every couple
seconds


just

in case your WiFi link is jumping around.  Ethernet won't
bounce


around

that much, so the hysteresis would have no effect there.  Or,
if


people

are concerned about response time to speed changes on ethernet


(where

you probably do want an instant switch-over) some new flag to


indicate

that certain devices don't have stable speeds over time.


Or just report the average of the range the wireless link can
hit,
and
be done with it.

I think you guys are overcomplicating things.


That range can be from 1 to > 800Mb/s.  No, it won't usually be all
over that range, but it won't be uncommon to fluctuate by hundreds
of
Mb/s.  I'm not sure a simple average is really the answer
here.  Even
doing that would require new knobs to ethtool, since the rate
depends
heavily on card capabilities and also what AP you're connected to
*at
that moment*.  If you roam to another AP, then the max speed can
certainly change.

You'll probably say "aim for the 75% case" or something like that,
which is fine, but then you're depending on your 75% case to be (a)
single AP, (b) never move (eg, only bond wifi + ethernet), (c)
little
radio interference.  I'm not sure I'd buy that.  If I've put words
in
your mouth, forgive me.


If you keep ethtool API simple and just return the last (rx-rate +
tx-rate) / 2, or the rate averaged
over the last 100 frames or 10 seconds, then the caller can do longer
term averaging
as it sees fit.  Probably no need for lots of averaging complexity in
the kernel.


Yeah, that works too, but I was thinking it was better to present the
actual data through ethtool so that things other than bonding could use
it, and since bonding is the thing that actually cares about the
fluctuation, make it do the more extensive processing.


What do you mean by 'actual data'?  If you want to know the most accurate
transmit/rx rate info, then you need to pay attention to each and every frame's 
tx/rx rate, as
well as it's ampdu/amsdu, retries, etc.  It is virtually impossible.

So, you will have to settle for something less...  I suggest something simple
to calculate, similar to existing values that are available via debugfs and/or
'iw dev foo station dump', etc.  Let higher layers manipulate the raw data
as they see fit (they can query ethtool as often as they like).

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers

2017-08-16 Thread Ben Greear

On 08/16/2017 07:11 PM, Dan Williams wrote:

On Wed, 2017-08-16 at 14:31 -0700, David Miller wrote:

From: Dan Williams 
Date: Wed, 16 Aug 2017 16:22:41 -0500


My biggest suggestion is that perhaps bonding should grow

hysteresis

for link speeds. Since WiFi can change speed every packet, you

probably

don't want the bond characteristics changing every couple seconds

just

in case your WiFi link is jumping around.  Ethernet won't bounce

around

that much, so the hysteresis would have no effect there.  Or, if

people

are concerned about response time to speed changes on ethernet

(where

you probably do want an instant switch-over) some new flag to

indicate

that certain devices don't have stable speeds over time.


Or just report the average of the range the wireless link can hit,
and
be done with it.

I think you guys are overcomplicating things.


That range can be from 1 to > 800Mb/s.  No, it won't usually be all
over that range, but it won't be uncommon to fluctuate by hundreds of
Mb/s.  I'm not sure a simple average is really the answer here.  Even
doing that would require new knobs to ethtool, since the rate depends
heavily on card capabilities and also what AP you're connected to *at
that moment*.  If you roam to another AP, then the max speed can
certainly change.

You'll probably say "aim for the 75% case" or something like that,
which is fine, but then you're depending on your 75% case to be (a)
single AP, (b) never move (eg, only bond wifi + ethernet), (c) little
radio interference.  I'm not sure I'd buy that.  If I've put words in
your mouth, forgive me.


If you keep ethtool API simple and just return the last (rx-rate + tx-rate) / 
2, or the rate averaged
over the last 100 frames or 10 seconds, then the caller can do longer term 
averaging
as it sees fit.  Probably no need for lots of averaging complexity in the 
kernel.

rate-ctrl for wifi basically doesn't happen until you transmit or receive a
fairly steady stream, so it will fluctuate a lot.

Thanks,
Ben



Dan




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-20 Thread Ben Greear

On 06/20/2017 11:05 AM, Michal Kubecek wrote:

On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:

On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.


You might try trace_printk() which should have less impact (don't forget
to enable /proc/sys/kernel/ftrace_dump_on_oops).


We cannot reproduce with trace_printk() either.

Thanks,
Ben



Michal Kubecek




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-20 Thread Ben Greear



On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-14 Thread Ben Greear

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no longer
reproduce the problem.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-13 Thread Ben Greear

On 06/13/2017 01:28 PM, David Ahern wrote:

On 6/13/17 2:16 PM, Ben Greear wrote:

On 06/09/2017 02:25 PM, Eric Dumazet wrote:

On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote:

On 6/8/17 11:55 PM, Cong Wang wrote:

On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear 
wrote:


As far as I can tell, the patch did not help, or at least we still
reproduce
the
crash easily.


netlink dump is serialized by nlk->cb_mutex so I don't think that
patch makes any sense w.r.t race condition.


From what I can see fn_sernum should be accessed under table lock, so
when saving and checking it during a walk make sure it the lock is held.
That has nothing to do with the netlink dump, but the table changing
during a walk.



Yes, your patch makes total sense, of course.


I guess someone should go ahead and make an official patch and
submit it, even if it doesn't fix my problem.


I can do that; was hoping to root cause the problem first.





(gdb) l *(fib6_walk_continue+0x76)
0x188c6 is in fib6_walk_continue
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593).
1588if (fn == w->root)
1589return 0;
1590pn = fn->parent;
1591w->node = pn;
1592#ifdef CONFIG_IPV6_SUBTREES
1593if (FIB6_SUBTREE(pn) == fn) {


Apparently fn->parent is NULL here for some reason, but
I don't know if that is expected or not. If a simple NULL check
is not enough here, we have to trace why it is NULL.


From my understanding, parent should not be null hence the attempts to
fix access to table nodes under a lock. ie., figuring out why it is null
here.


If someone has more suggestions, I'll be happy to test.


I have looked at the code again and nothing is jumping out. Will look
again later today.



I noticed there is some code to help fix up the walkers when nodes are deleted. 
 They
use lock:   read_lock(&net->ipv6.fib6_walker_lock);

The code you were tweaking uses a different lock:  
read_lock_bh(&table->tb6_lock);

In is certainly not simple code, so I don't know if that is correct or not, but
might possibly be a place to start looking.

I'm going to re-test with a WARN_ON to see if that triggers since previous 
suggestion
was that f->parent was NULL.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 51cd637..86295df 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1571,6 +1571,10 @@ static int fib6_walk_continue(struct fib6_walker *w)
case FWS_U:
if (fn == w->root)
return 0;
+   if (!fn->parent) {
+   WARN_ON_ONCE(0);
+   return 0;
+   }
pn = fn->parent;
            w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES


Thanks,
Ben

Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-13 Thread Ben Greear

On 06/09/2017 02:25 PM, Eric Dumazet wrote:

On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote:

On 6/8/17 11:55 PM, Cong Wang wrote:

On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear  wrote:


As far as I can tell, the patch did not help, or at least we still reproduce
the
crash easily.


netlink dump is serialized by nlk->cb_mutex so I don't think that
patch makes any sense w.r.t race condition.


From what I can see fn_sernum should be accessed under table lock, so
when saving and checking it during a walk make sure it the lock is held.
That has nothing to do with the netlink dump, but the table changing
during a walk.



Yes, your patch makes total sense, of course.


I guess someone should go ahead and make an official patch and
submit it, even if it doesn't fix my problem.


(gdb) l *(fib6_walk_continue+0x76)
0x188c6 is in fib6_walk_continue
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593).
1588if (fn == w->root)
1589return 0;
1590pn = fn->parent;
1591w->node = pn;
1592#ifdef CONFIG_IPV6_SUBTREES
1593if (FIB6_SUBTREE(pn) == fn) {


Apparently fn->parent is NULL here for some reason, but
I don't know if that is expected or not. If a simple NULL check
is not enough here, we have to trace why it is NULL.


From my understanding, parent should not be null hence the attempts to
fix access to table nodes under a lock. ie., figuring out why it is null
here.


If someone has more suggestions, I'll be happy to test.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-08 Thread Ben Greear
;
1597}
(gdb) l *(inet6_dump_fib+0x1ab)
0x1939b is in inet6_dump_fib 
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:392).
387 w->skip = w->count;
388 } else
389 w->skip = 0;
390 
391 res = fib6_walk_continue(w);
392 read_unlock_bh(&table->tb6_lock);
393 if (res <= 0) {
394 fib6_walker_unlink(net, w);
395 cb->args[4] = 0;
396 }
(gdb)

[greearb@ben-dt3 linux-2.6]$ git diff
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d4bf2c6..4e32a16 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -372,12 +372,13 @@ static int fib6_dump_table(struct fib6_table *table, 
struct sk_buff *skb,

read_lock_bh(&table->tb6_lock);
res = fib6_walk(net, w);
-   read_unlock_bh(&table->tb6_lock);
if (res > 0) {
cb->args[4] = 1;
cb->args[5] = w->root->fn_sernum;
}
+   read_unlock_bh(&table->tb6_lock);
} else {
+   read_lock_bh(&table->tb6_lock);
if (cb->args[5] != w->root->fn_sernum) {
/* Begin at the root if the tree changed */
cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct 
sk_buff *skb,
} else
w->skip = 0;

-   read_lock_bh(&table->tb6_lock);
res = fib6_walk_continue(w);
read_unlock_bh(&table->tb6_lock);
if (res <= 0) {


Thanks,
Ben



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-06 Thread Ben Greear

On 06/06/2017 05:27 PM, Eric Dumazet wrote:

On Tue, 2017-06-06 at 18:00 -0600, David Ahern wrote:

On 6/6/17 3:06 PM, Ben Greear wrote:

This bug has been around forever, and we recently got an intern and
stuck him with
trying to reproduce it on the latest kernel.  It is still here.  I'm not
super excited
about trying to fix this, but we can easily test patches if someone has a
patch to try.


Can you try this (whitespace damaged on paste, but it is moving the lock
ahead of the fn_sernum check):

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index deea901746c8..7a44c49055c0 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -378,6 +378,7 @@ static int fib6_dump_table(struct fib6_table *table,
struct sk_buff *skb,
cb->args[5] = w->root->fn_sernum;
}
} else {
+   read_lock_bh(&table->tb6_lock);
if (cb->args[5] != w->root->fn_sernum) {
/* Begin at the root if the tree changed */
cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table,
struct sk_buff *skb,
} else
w->skip = 0;

-   read_lock_bh(&table->tb6_lock);
res = fib6_walk_continue(w);
read_unlock_bh(&table->tb6_lock);
if (res <= 0) {



Good catch, but it looks like similar fix is needed a few lines before.


We will test this tomorrow.

Thanks,
Ben




diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 
deea901746c8570c5e801e40592c91e3b62812e0..b214443dc8346cef3690df7f27cc48a864028865
 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -372,12 +372,13 @@ static int fib6_dump_table(struct fib6_table *table, 
struct sk_buff *skb,

read_lock_bh(&table->tb6_lock);
res = fib6_walk(net, w);
-   read_unlock_bh(&table->tb6_lock);
if (res > 0) {
cb->args[4] = 1;
cb->args[5] = w->root->fn_sernum;
}
+   read_unlock_bh(&table->tb6_lock);
} else {
+   read_lock_bh(&table->tb6_lock);
if (cb->args[5] != w->root->fn_sernum) {
/* Begin at the root if the tree changed */
cb->args[5] = w->root->fn_sernum;
@@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct 
sk_buff *skb,
} else
w->skip = 0;

-   read_lock_bh(&table->tb6_lock);
        res = fib6_walk_continue(w);
read_unlock_bh(&table->tb6_lock);
if (res <= 0) {





--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2017-06-06 Thread Ben Greear

Hello,

This bug has been around forever, and we recently got an intern and stuck him 
with
trying to reproduce it on the latest kernel.  It is still here.  I'm not super 
excited
about trying to fix this, but we can easily test patches if someone has a
patch to try.

Test case is to create 1000 mac-vlans and bring them up, with user-space
tools running lots of 'dump' related commands as part of bringing up the
interfaces and configuring some special source-based routing tables.

(gdb) l *(inet6_dump_fib+0x109)
0x192f9 is in inet6_dump_fib 
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:392).
387 } else
388 w->skip = 0;
389 
390 read_lock_bh(&table->tb6_lock);
391 res = fib6_walk_continue(w);
392 read_unlock_bh(&table->tb6_lock);
393 if (res <= 0) {
394 fib6_walker_unlink(net, w);
395 cb->args[4] = 0;
396 }

(gdb) l *(fib6_walk_continue+0x76)
0x188c6 is in fib6_walk_continue 
(/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593).
1588if (fn == w->root)
1589return 0;
1590pn = fn->parent;
1591w->node = pn;
1592#ifdef CONFIG_IPV6_SUBTREES
1593if (FIB6_SUBTREE(pn) == fn) {
1594WARN_ON(!(fn->fn_flags & RTN_ROOT));
1595w->state = FWS_L;
1596continue;
1597}

[root@ct524-ffb0 ~]# BUG: unable to handle kernel NULL pointer dereference at 
0018
IP: fib6_walk_continue+0x76/0x180 [ipv6]
PGD 3d9226067
P4D 3d9226067
PUD 3d9020067
PMD 0

Oops:  [#1] PREEMPT SMP
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c bnep fuse macvlan pktgen cfg80211 ipmi_ssif iTCO_wdt iTCO_vendor_support 
coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass joydev i2c_i801 ie31200_edac intel_pch_thermal shpchp hci_uart ipmi_si btbcm 
btqca ipmi_devintf btintel ipmi_msghandler bluetooth pinctrl_sunrisepoint acpi_als pinctrl_intel video tpm_tis intel_lpss_acpi kfifo_buf tpm_tis_core intel_lpss 
industrialio tpm acpi_pad acpi_power_meter sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca 
i2c_algo_bit i2c_hid i2c_core ipv6 crc_ccitt [last unloaded: nf_conntrack]

CPU: 1 PID: 996 Comm: ip Not tainted 4.12.0-rc4+ #32
Hardware name: Supermicro Super Server/X11SSM-F, BIOS 1.0b 12/29/2015
task: 8803d4d61dc0 task.stack: c9000970c000
RIP: 0010:fib6_walk_continue+0x76/0x180 [ipv6]
RSP: 0018:c9000970fbb8 EFLAGS: 00010283
RAX: 8803de84b020 RBX: 8803e0756f00 RCX: 
RDX:  RSI: c9000970fc00 RDI: 81eee280
RBP: c9000970fbc0 R08: 0008 R09: 8803d4fbbf31
R10: c9000970fb68 R11:  R12: 0001
R13: 0001 R14: 8803e0756f00 R15: 8803d9345b18
FS:  7f32ca4ec700() GS:88047784() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0018 CR3: 0003ddacc000 CR4: 003406e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x109/0x290 [ipv6]
 netlink_dump+0x11d/0x290
 netlink_recvmsg+0x260/0x3f0
 sock_recvmsg+0x38/0x40
 ___sys_recvmsg+0xe9/0x230
 ? alloc_pages_vma+0x9d/0x260
 ? page_add_new_anon_rmap+0x88/0xc0
 ? lru_cache_add_active_or_unevictable+0x31/0xb0
 ? __handle_mm_fault+0xce3/0xf70
 __sys_recvmsg+0x3d/0x70
 ? __sys_recvmsg+0x3d/0x70
 SyS_recvmsg+0xd/0x20
 do_syscall_64+0x56/0xc0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f32c9e21050
RSP: 002b:7fff96401de8 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX:  RCX: 7f32c9e21050
RDX:  RSI: 7fff96401e50 RDI: 0004
RBP: 7fff96405e74 R08: 3fe4 R09: 
R10: 7fff96401e90 R11: 0246 R12: 0064f3a0
R13: 7fff96405ee0 R14: 3fe4 R15: 
Code: f6 40 2a 04 74 11 8b 53 30 85 d2 0f 84 02 01 00 00 83 ea 01 89 53 30 c7 43 28 04 00 00 00 48 39 43 10 74 33 48 8b 10 48 89 53 18 <48> 39 42 18 0f 84 a3 00 
00 00 48 39 42 08 0f 84 ae 00 00 00 48

RIP: fib6_walk_continue+0x76/0x180 [ipv6] RSP: c9000970fbb8
CR2: 0018
---[ end trace 5ebbc4ee97bea64e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: 'iw events' stops receiving events after a while on 4.9 + hacks

2017-05-31 Thread Ben Greear



On 05/31/2017 01:18 AM, Bastian Bittorf wrote:

* Johannes Berg  [31.05.2017 10:09]:

Is there any way to dump out the socket information if we reproduce
the problem?


I have no idea, sorry.

If you or Bastian can tell me how to reproduce the problem, I can try
to investigate it.


there was an interesting fix regarding the shell-builtin 'read' in
busybox[1]. I will retest again and report if this changes anything.

bye, bastian

PS: @ben: are you also using 'iw event | while read -r LINE ...'?


I'm using a perl script to read the output, and not using busybox.

I have not seen the problem again, so it is not easy for me to reproduce.

If you reproduce it, maybe check 'strace' on the 'iw' process to see if it is
hung on writing output to the pipe or reading input?  In my case, it appeared
to be hung reading input from netlink, input that never arrived.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: 'iw events' stops receiving events after a while on 4.9 + hacks

2017-05-17 Thread Ben Greear

On 05/17/2017 06:30 AM, Johannes Berg wrote:

On Wed, 2017-05-17 at 12:08 +0200, Bastian Bittorf wrote:

* Ben Greear  [17.05.2017 11:51]:

I have been keeping an 'iw events' program running with a perl
script gathering its
output and post-processing it.  This has been working for several
years on 4.7 and earlier
kernels, but when testing on 4.9 overnight, I notice that 'iw
events' is not showing any input.  'strace' shows
that it is waiting on recvmsg.  If I start a second 'iw events'
then it will get
wifi events as expected.


me too, also seen on 4.4 - i'am happy for debug ideas.


I've never seen this.

Does it happen when it's very long-running? Or when there are lots of
events?

Perhaps something in the socket buffer accounting is going wrong, so
that it's slowly decreasing to 0?


I saw it exactly once so far, and it happened overnight,
but we have not been doing a lot of work with the 4.9 kernel until recently.

I don't think there were many messages on this system, and certainly
others have run much longer on systems that should be generating many more
events without trouble.

Is there any way to dump out the socket information if we reproduce
the problem?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



  1   2   3   4   5   6   >