Re: ipv6 sysctl

2017-03-02 Thread Ani Sinha
Hey netdev guys,

Any feedback on this? :-)

thanks
ani


On Tue, Feb 28, 2017 at 11:22 AM, Ani Sinha  wrote:
> Hi guys,
>
> Commit a79ca223e029 ('ipv6: fix bad free of addrconf_init_net'),
> introduced in linux 3.9, tries to fix an issue involving freeing
> statically allocated memory. Additionally, it subtly changes how
> certain ipv6 sysctl values are inherited by child namespaces from the
> default net namespace. Before a79ca223e029, the default namespace
> would directly modify the values in the statically allocated struct
> ipv6_devconf, for example, and all child namespaces would inherit
> these values upon creation (their own private copy was initialized
> from the statically allocated ipv6_devconf). After this change, any
> sysctl value changes in the default net namespace are not seen by
> child namespaces created afterwards. This is because all network
> namespaces, including the default namespace, have their own private
> copy of struct ipv6_devconf, which is initialized with certain fixed
> values. This is in contrast to ipv4, where child namespaces continue
> to inherit values from the default namespace upon creation.
>
> I see that there was a previous discussion here:
> https://patchwork.kernel.org/patch/4639391/
>
> Was the above inconsistency between ipv4 and ipv6 sysctl
> initialization intentional, or was it an unintended effect of the
> above change? It would be nice to have symmetric behavior between
> ipv4 and ipv6. Please share your thoughts on this.
>
> thanks,
> ani


ipv6 sysctl

2017-02-28 Thread Ani Sinha
Hi guys,

Commit a79ca223e029 ('ipv6: fix bad free of addrconf_init_net'),
introduced in linux 3.9, tries to fix an issue involving freeing
statically allocated memory. Additionally, it subtly changes how
certain ipv6 sysctl values are inherited by child namespaces from the
default net namespace. Before a79ca223e029, the default namespace
would directly modify the values in the statically allocated struct
ipv6_devconf, for example, and all child namespaces would inherit
these values upon creation (their own private copy was initialized
from the statically allocated ipv6_devconf). After this change, any
sysctl value changes in the default net namespace are not seen by
child namespaces created afterwards. This is because all network
namespaces, including the default namespace, have their own private
copy of struct ipv6_devconf, which is initialized with certain fixed
values. This is in contrast to ipv4, where child namespaces continue
to inherit values from the default namespace upon creation.

I see that there was a previous discussion here:
https://patchwork.kernel.org/patch/4639391/

Was the above inconsistency between ipv4 and ipv6 sysctl
initialization intentional, or was it an unintended effect of the
above change? It would be nice to have symmetric behavior between
ipv4 and ipv6. Please share your thoughts on this.

thanks,
ani
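
The inheritance difference described in this message can be modelled
in plain userspace C. This is only a sketch under stated assumptions:
struct devconf_model, its two fields, their values, and both init
helpers are hypothetical stand-ins, not the kernel's struct
ipv6_devconf or its addrconf init code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct ipv6_devconf. */
struct devconf_model {
	int forwarding;
	int accept_ra;
};

/* Plays the role of the fixed values every namespace now starts from. */
static const struct devconf_model fixed_defaults = {
	.forwarding = 0,
	.accept_ra = 1,
};

/* Plays the role of the statically allocated struct that the default
 * namespace modified directly before a79ca223e029. */
static struct devconf_model static_devconf = {
	.forwarding = 0,
	.accept_ra = 1,
};

/* Old model: the child's private copy is taken from the live static
 * struct, so it sees earlier writes made by the default namespace. */
static void child_init_inherit(struct devconf_model *child)
{
	assert(child != NULL);
	*child = static_devconf;
}

/* New model: the child's private copy is taken from fixed defaults,
 * so writes made in the default namespace are not visible to it. */
static void child_init_fixed(struct devconf_model *child)
{
	assert(child != NULL);
	*child = fixed_defaults;
}
```

If the default namespace sets static_devconf.forwarding to 1 and a
child is created afterwards, child_init_inherit() yields forwarding 1
while child_init_fixed() yields forwarding 0, which is the asymmetry
with ipv4 asked about above.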


IPv6 routing behavior difference between linux 3.4 and linux 3.18

2016-02-26 Thread Ani Sinha
Hi guys,

I am a little puzzled by a behavior difference I see between linux
3.4 and linux 3.18. Here's my setup, where the hex numbers are the
ipv6 addresses of the interfaces named in parentheses:

fd7a:629f:52a4:fffd::1 (lo0)
  |
  |
 fd7a:629f:52a4:fffe::1 (vlan_dev1)
 | linux box 2 (unit under test)
 ---
  |   linux box1 (Test Driver)
  |
 fd7a:629f:52a4:fffe::2 (e0)

Linux box2 is running linux kernel 3.4. Linux box1 is running linux
kernel 3.18.

I am running a small test script on box1 where I try to ping the
loopback interface lo0 on box2. Before I do that, I set up a static
route on box1 for that loopback address, something like this:

fd7a:629f:52a4:fffd::1 via fd7a:629f:52a4:fffe::1 dev e0  metric 1024

Then I bring down the real device under the vlan_dev1 interface on
box2. The ping to the loopback fails. So far so good.

Now I bring the real device under vlan_dev1 back up. This time, the
ping6 from box1 to lo0 keeps failing with "destination unreachable: no
route". I don't understand why the ping would fail even with a static
route programmed. I have also noticed that when I ping6 vlan_dev1 from
box1 and then ping6 lo0 from box1, the ping6 to lo0 then succeeds.
Alternatively, if I ping6 e0 from box2, then ping6 from box1 to lo0,
it succeeds.

As another data point, I ran linux kernel 3.4 on box1. The behavior
is slightly different: the moment I bring the underlying device for
vlan_dev1 back up, the pings succeed right away without any
tinkering. I don't understand why there is this subtle difference in
behavior between the two kernels.

Any pointers would be greatly appreciated.

thanks
ani


Re: tbl->lock not taken in neigh_lookup() ?

2016-01-24 Thread Ani Sinha
Hi All:

Can I get some insights into this? I am sure I am missing something.

thanks
ani


On Thu, Jan 21, 2016 at 9:05 PM, Ani Sinha  wrote:
> hi guys
>
> As per the comment at the top of net/core/neighbour.c, we should be
> taking this lock even for scanning the hash buckets. I do see that
> this lock is taken in pneigh_lookup() but not in neigh_lookup(). Am I
> missing something?
>
> For context, I am investigating the following crash, which happened
> on one of our boxes. This is in the 3.18.19 kernel, where the
> following code in neigh_lookup() returns a corrupt value of neighbour*:
>
> for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
>
>
>
> [ 4918.117044] BUG: unable to handle kernel paging request at 14000270
> [ 4918.200203] IP: [] neigh_lookup+0x64/0xdf
> [ 4918.266790] PGD 7e383067 PUD 7e202067 PMD 0
> [ 4918.266795] Oops:  [#1] PREEMPT SMP
> [ 4918.266798] Modules linked in: macvlan l2mod_dma(PO)
> arptable_filter arp_tables nf_log_ipv6 nf_conntrack_ipv6
> nf_defrag_ipv6 ip6t_REJECT nf_reject_ipv6 ip6table_mangle nf_log_ipv4
> nf_log_common nf_conntrack_ipv4 nf_defrag_ipv4 xt_LOG xt_limit xt_hl
> xt_conntrack ipt_REJECT nf_reject_ipv4 xt_multiport xt_tcpudp
> iptable_mangle sch_prio strata_dma(PO) arista_strata_bde(PO) msr
> kbfd(O) 8021q garp stp llc tun scd_em_driver(O) plxnt_nl(O)
> plxnt_ll(O) nf_conntrack_tftp iptable_raw iptable_filter ip_tables
> xt_CT nf_conntrack xt_mark ip6table_raw ip6table_filter ip6_tables
> x_tables coretemp microcode intel_ips scd(O) sb_e3_edac kvm_intel kvm
> [last unloaded: plxnt_ll]
> [ 4918.266837] CPU: 5 PID: 3315 Comm: Arp Tainted: P   O   3.18.19 #1
> [ 4918.266840] task: 8803c6dd3720 ti: 88007b98c000 task.ti:
> 88007b98c000
> [ 4918.266841] RIP: 0010:[]  []
> neigh_lookup+0x64/0xdf
> [ 4918.266847] RSP: 0018:88007b98faa8  EFLAGS: 00210206
> [ 4918.266849] RAX: 8803203bf278 RBX: 81866990 RCX: 
> 0014
> [ 4918.266851] RDX: 88032a46624c RSI: 88005ff82000 RDI: 
> 880319ee6020
> [ 4918.266852] RBP: 88007b98fad8 R08: 0a00 R09: 
> 880319ee6000
> [ 4918.266854] R10: 88007b98faf8 R11: 88043d001900 R12: 
> 14000100
> [ 4918.266855] R13: 880319ee6020 R14: 88005ff82000 R15: 
> 0010
> [ 4918.266857] FS:  () GS:88044f54(0063)
> knlGS:f73cb8e0
> [ 4918.266859] CS:  0010 DS: 002b ES: 002b CR0: 80050033
> [ 4918.266860] CR2: 14000270 CR3: 7e371000 CR4: 
> 07e0
> [ 4918.266862] Stack:
> [ 4918.266863]  0180 8185cd00 81866990
> 880319ee6010
> [ 4918.266866]  880319ee601c 88005ff82000 88007b98fb28
> 813b97c2
> [ 4918.266869]  88007b98faf8 818282c0 88007b98fb18
> 880377411600
> [ 4918.266872] Call Trace:
> [ 4918.266875] [] neigh_delete+0x113/0x17f
> [ 4918.266878] [] rtnetlink_rcv_msg+0x18a/0x1a0
> [ 4918.266882] [] ? get_parent_ip+0x11/0x42
> [ 4918.266885] [] ? rhashtable_lookup_compare+0x4b/0x71
> [ 4918.266887] [] ? rtnetlink_rcv_msg+0x0/0x1a0
> [ 4918.266890] [] netlink_rcv_skb+0x3e/0x94
> [ 4918.266891] [] rtnetlink_rcv+0x21/0x28
> [ 4918.266893] [] netlink_unicast+0x10b/0x1ad
> [ 4918.266895] [] netlink_sendmsg+0x2e6/0x327
> [ 4918.266898] [] sock_sendmsg+0x6d/0x86
> [ 4918.266901] [] ? sock_poll+0x10e/0x11c
> [ 4918.266903] [] ? sockfd_lookup_light+0x12/0x5d
> [ 4918.266906] [] SyS_sendto+0xf3/0x11b
> [ 4918.266909] [] ? __mutex_lock_slowpath+0x2de/0x2fe
> [ 4918.266911] [] ? get_parent_ip+0x11/0x42
> [ 4918.266914] [] SyS_send+0xf/0x11
> [ 4918.266916] [] compat_SyS_socketcall+0x122/0x1e3
> [ 4918.266919] [] sysenter_dispatch+0x7/0x1e
> [ 4918.266920] Code: 00 00 4c 89 f6 4c 89 ef 49 8d 54 24 0c ff 53 18
> b9 20 00 00 00 41 2b 4c 24 08 d3 e8 89 c0 48 c1 e0 03 49 03 04 24 4c
> 8b 20 eb 55 <4d> 3b b4 24 70 01 00 00 75 47 49 8d bc 24 78 01 00 00 4c
> 89 fa
> [ 4918.266947] RIP  [] neigh_lookup+0x64/0xdf
> [ 4918.334502]  RSP 
> [ 4918.334505] Kernel version: 3.18.19 #1 SMP PREEMPT Mon Jan 4
> 12:34:37 PST 2016
>
> [ 4918.455186] CR2: 14000270


tbl->lock not taken in neigh_lookup() ?

2016-01-21 Thread Ani Sinha
hi guys

As per the comment at the top of net/core/neighbour.c, we should be
taking this lock even for scanning the hash buckets. I do see that
this lock is taken in pneigh_lookup() but not in neigh_lookup(). Am I
missing something?

For context, I am investigating the following crash, which happened
on one of our boxes. This is in the 3.18.19 kernel, where the
following code in neigh_lookup() returns a corrupt value of neighbour*:

for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);



[ 4918.117044] BUG: unable to handle kernel paging request at 14000270
[ 4918.200203] IP: [] neigh_lookup+0x64/0xdf
[ 4918.266790] PGD 7e383067 PUD 7e202067 PMD 0
[ 4918.266795] Oops:  [#1] PREEMPT SMP
[ 4918.266798] Modules linked in: macvlan l2mod_dma(PO)
arptable_filter arp_tables nf_log_ipv6 nf_conntrack_ipv6
nf_defrag_ipv6 ip6t_REJECT nf_reject_ipv6 ip6table_mangle nf_log_ipv4
nf_log_common nf_conntrack_ipv4 nf_defrag_ipv4 xt_LOG xt_limit xt_hl
xt_conntrack ipt_REJECT nf_reject_ipv4 xt_multiport xt_tcpudp
iptable_mangle sch_prio strata_dma(PO) arista_strata_bde(PO) msr
kbfd(O) 8021q garp stp llc tun scd_em_driver(O) plxnt_nl(O)
plxnt_ll(O) nf_conntrack_tftp iptable_raw iptable_filter ip_tables
xt_CT nf_conntrack xt_mark ip6table_raw ip6table_filter ip6_tables
x_tables coretemp microcode intel_ips scd(O) sb_e3_edac kvm_intel kvm
[last unloaded: plxnt_ll]
[ 4918.266837] CPU: 5 PID: 3315 Comm: Arp Tainted: P   O   3.18.19 #1
[ 4918.266840] task: 8803c6dd3720 ti: 88007b98c000 task.ti:
88007b98c000
[ 4918.266841] RIP: 0010:[]  []
neigh_lookup+0x64/0xdf
[ 4918.266847] RSP: 0018:88007b98faa8  EFLAGS: 00210206
[ 4918.266849] RAX: 8803203bf278 RBX: 81866990 RCX: 0014
[ 4918.266851] RDX: 88032a46624c RSI: 88005ff82000 RDI: 880319ee6020
[ 4918.266852] RBP: 88007b98fad8 R08: 0a00 R09: 880319ee6000
[ 4918.266854] R10: 88007b98faf8 R11: 88043d001900 R12: 14000100
[ 4918.266855] R13: 880319ee6020 R14: 88005ff82000 R15: 0010
[ 4918.266857] FS:  () GS:88044f54(0063)
knlGS:f73cb8e0
[ 4918.266859] CS:  0010 DS: 002b ES: 002b CR0: 80050033
[ 4918.266860] CR2: 14000270 CR3: 7e371000 CR4: 07e0
[ 4918.266862] Stack:
[ 4918.266863]  0180 8185cd00 81866990
880319ee6010
[ 4918.266866]  880319ee601c 88005ff82000 88007b98fb28
813b97c2
[ 4918.266869]  88007b98faf8 818282c0 88007b98fb18
880377411600
[ 4918.266872] Call Trace:
[ 4918.266875] [] neigh_delete+0x113/0x17f
[ 4918.266878] [] rtnetlink_rcv_msg+0x18a/0x1a0
[ 4918.266882] [] ? get_parent_ip+0x11/0x42
[ 4918.266885] [] ? rhashtable_lookup_compare+0x4b/0x71
[ 4918.266887] [] ? rtnetlink_rcv_msg+0x0/0x1a0
[ 4918.266890] [] netlink_rcv_skb+0x3e/0x94
[ 4918.266891] [] rtnetlink_rcv+0x21/0x28
[ 4918.266893] [] netlink_unicast+0x10b/0x1ad
[ 4918.266895] [] netlink_sendmsg+0x2e6/0x327
[ 4918.266898] [] sock_sendmsg+0x6d/0x86
[ 4918.266901] [] ? sock_poll+0x10e/0x11c
[ 4918.266903] [] ? sockfd_lookup_light+0x12/0x5d
[ 4918.266906] [] SyS_sendto+0xf3/0x11b
[ 4918.266909] [] ? __mutex_lock_slowpath+0x2de/0x2fe
[ 4918.266911] [] ? get_parent_ip+0x11/0x42
[ 4918.266914] [] SyS_send+0xf/0x11
[ 4918.266916] [] compat_SyS_socketcall+0x122/0x1e3
[ 4918.266919] [] sysenter_dispatch+0x7/0x1e
[ 4918.266920] Code: 00 00 4c 89 f6 4c 89 ef 49 8d 54 24 0c ff 53 18
b9 20 00 00 00 41 2b 4c 24 08 d3 e8 89 c0 48 c1 e0 03 49 03 04 24 4c
8b 20 eb 55 <4d> 3b b4 24 70 01 00 00 75 47 49 8d bc 24 78 01 00 00 4c
89 fa
[ 4918.266947] RIP  [] neigh_lookup+0x64/0xdf
[ 4918.334502]  RSP 
[ 4918.334505] Kernel version: 3.18.19 #1 SMP PREEMPT Mon Jan 4
12:34:37 PST 2016

[ 4918.455186] CR2: 14000270


Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-11-04 Thread Ani Sinha
(removed a bunch of people from CC list)

On Mon, Oct 26, 2015 at 1:06 PM, Pablo Neira Ayuso  wrote:

> Then we can review and, if no major concerns, I can submit this to
> -stable.

Now that Neal has sufficiently tested the patches, is it OK to apply
to -stable or do you guys want me to do anything more?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -stable 3.18, backport] ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context.

2015-11-02 Thread Ani Sinha
On Mon, Nov 2, 2015 at 4:50 PM, Eric Dumazet  wrote:
> On Mon, 2015-11-02 at 16:40 -0800, Ani Sinha wrote:
>> [ Upstream commit 44f49dd8b5a606870a1f2 ]
>
> Please carefully read Documentation/networking/netdev-FAQ.txt

I don't see any recent releases of 3.18 version series in Greg KH's
tree. Is the 3.18 stable train being maintained by Sasha here?

http://git.kernel.org/cgit/linux/kernel/git/sashal/linux-stable.git/

If that is the case, should 3.18-specific backport patches be sent
directly to him?

>
> A: Normally Greg Kroah-Hartman collects stable commits himself, but
>for networking, Dave collects up patches he deems critical for the
>networking subsystem, and then hands them off to Greg.
>
>There is a patchworks queue that you can see here:
> http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
>
>It contains the patches which Dave has selected, but not yet handed
>off to Greg.  If Greg already has the patch, then it will be here:
> http://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git
>
>A quick way to find whether the patch is in this stable-queue is
>to simply clone the repo, and then git grep the mainline commit ID, e.g.
>
> stable-queue$ git grep -l 284041ef21fdf2e
> releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
> releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
> releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
> stable/stable-queue$
>
>
>


Re: [PATCH -stable 3.18, backport] ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context.

2015-11-02 Thread Ani Sinha
On Mon, Nov 2, 2015 at 4:50 PM, Eric Dumazet  wrote:
> On Mon, 2015-11-02 at 16:40 -0800, Ani Sinha wrote:
>> [ Upstream commit 44f49dd8b5a606870a1f2 ]
>
> Please carefully read Documentation/networking/netdev-FAQ.txt
>
> A: Normally Greg Kroah-Hartman collects stable commits himself, but
>for networking, Dave collects up patches he deems critical for the
>networking subsystem, and then hands them off to Greg.
>
>There is a patchworks queue that you can see here:
> http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
>
>It contains the patches which Dave has selected, but not yet handed
>off to Greg.  If Greg already has the patch, then it will be here:
> http://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git
>
>A quick way to find whether the patch is in this stable-queue is
>to simply clone the repo, and then git grep the mainline commit ID, e.g.
>
> stable-queue$ git grep -l 284041ef21fdf2e
> releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
> releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
> releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
> stable/stable-queue$

Ah cool! Thanks for the pointer. Seems it's already queued up for other
stable kernel version trains :-)


[PATCH -stable 3.18, backport] ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context.

2015-11-02 Thread Ani Sinha
[ Upstream commit 44f49dd8b5a606870a1f2 ] 

Fixes the following kernel BUG :

BUG: using __this_cpu_add() in preemptible [] code: bash/2758
caller is __this_cpu_preempt_check+0x13/0x15
CPU: 0 PID: 2758 Comm: bash Tainted: P   O   3.18.19 #2
 8170eaca 880110d1b788 81482b2a 
  880110d1b7b8 812010ae 880007cab800
 88001a060800 88013a899108 880108b84240 880110d1b7c8
Call Trace:
[] dump_stack+0x52/0x80
[] check_preemption_disabled+0xce/0xe1
[] __this_cpu_preempt_check+0x13/0x15
[] ipmr_queue_xmit+0x647/0x70c
[] ip_mr_forward+0x32f/0x34e
[] ip_mroute_setsockopt+0xe03/0x108c
[] ? get_parent_ip+0x11/0x42
[] ? pollwake+0x4d/0x51
[] ? default_wake_function+0x0/0xf
[] ? get_parent_ip+0x11/0x42
[] ? __wake_up_common+0x45/0x77
[] ? _raw_spin_unlock_irqrestore+0x1d/0x32
[] ? __wake_up_sync_key+0x4a/0x53
[] ? sock_def_readable+0x71/0x75
[] do_ip_setsockopt+0x9d/0xb55
[] ? unix_seqpacket_sendmsg+0x3f/0x41
[] ? sock_sendmsg+0x6d/0x86
[] ? sockfd_lookup_light+0x12/0x5d
[] ? SyS_sendto+0xf3/0x11b
[] ? new_sync_read+0x82/0xaa
[] compat_ip_setsockopt+0x3b/0x99
[] compat_raw_setsockopt+0x11/0x32
[] compat_sock_common_setsockopt+0x18/0x1f
[] compat_SyS_setsockopt+0x1a9/0x1cf
[] compat_SyS_socketcall+0x180/0x1e3
[] cstar_dispatch+0x7/0x1e

Signed-off-by: Ani Sinha 
---
 net/ipv4/ipmr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index c803458..a1fc97a 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1674,8 +1674,8 @@ static inline int ipmr_forward_finish(struct sk_buff *skb)
 {
 	struct ip_options *opt = &(IPCB(skb)->opt);
 
-	IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);
+	IP_INC_STATS(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_ADD_STATS(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -1737,7 +1737,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 		 * to blackhole.
 		 */
 
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 		ip_rt_put(rt);
 		goto out_free;
 	}
-- 
1.8.1.4



Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-11-02 Thread Ani Sinha
> On Thu, Oct 29, 2015 at 6:21 PM, Neal P. Murphy
>  wrote:
> > On Thu, 29 Oct 2015 17:01:24 -0700
> > Ani Sinha  wrote:
> >
> >> On Wed, Oct 28, 2015 at 11:40 PM, Neal P. Murphy
> >>  wrote:
> >> > On Wed, 28 Oct 2015 02:36:50 -0400
> >> > "Neal P. Murphy"  wrote:
> >> >
> >> >> On Mon, 26 Oct 2015 21:06:33 +0100
> >> >> Pablo Neira Ayuso  wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
> >> >> > > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >> >> >
> >> >> > Please, no need to Cc everyone here. Please, submit your Netfilter
> >> >> > patches to netfilter-de...@vger.kernel.org.
> >> >> >
> >> >> > Moreover, it would be great if the subject includes something
> >> >> > descriptive on what you need, for this I'd suggest:
> >> >> >
> >> >> > [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in 
> >> >> > nf_conntrack_find_get
> >> >> >
> >> >> > I'm including Neal P. Murphy, he said he would help testing these
> >> >> > backports, getting a Tested-by: tag usually speeds up things too.
> >> >>
> >> >
> >> > I've probably done about as much seat-of-the-pants testing as I can. All 
> >> > opening/closing the same destination IP/port.
> >> >
> >> > Host: Debian Jessie, 8-core Vishera 8350 at 4.4 GHz, 16GiB RAM at (I 
> >> > think) 2100MHz.
> >> >
> >> > Traffic generator 1: 6-CPU KVM running 64-bit Smoothwall Express 3.1 
> >> > (linux 3.4.109 without these patches), with 8GiB RAM and 9GiB swap. 
> >> > Packets sent across PURPLE (to bypass NAT and firewall).
> >> >
> >> > Traffic generator 2: 32-bit KVM running Smoothwall Express 3.1 (linux 
> >> > 3.4.110 with these patches), 3GiB RAM and minimal swap.
> >> >
> >> > In the first set of tests, generator 1's traffic passed through 
> >> > Generator 2 as a NATting firewall, to the host's web server. In the 
> >> > second set of tests, generator 2's traffic went through NAT to the 
> >> > host's web server.
> >> >
> >> > The load tests:
> >> >   - 2500 processes using 2500 addresses and random src ports
> >> >   - 2500 processes using 2500 addresses and the same src port
> >> >   - 2500 processes using the same src address and port
> >> >
> >> > I also tested using stock NF timeouts and using 1 second timeouts.
> >> >
> >> > Bandwidth used got as high as 16Mb/s for some tests.
> >> >
> >> > Conntracks got up to 200 000 or so or bounced between 1 and 2, depending 
> >> > on the test and the timeouts.
> >> >
> >> > I did not reproduce the problem these patches solve. But more 
> >> > importantly, I saw no problems at all. Each time I terminated a test, 
> >> > RAM usage returned to about that of post-boot; so there were no apparent 
> >> > memory leaks. No kernel messages and no netfilter messages appeared 
> >> > during the tests.
> >> >
> >> > If I have time, I suppose I could run another set of tests: 2500 source 
> >> > processes using 2500 addresses times 200 ports to connect to 2500 
> >> > addresses times 200 ports on a destination system. Each process opens 
> >> > 200 sockets, then closes them. And repeats ad infinitum. But I might 
> >> > have to be clever since I can't run 500 000 processes; but I could run 
> >> > 20 VMs; that would get it down to about 12 000 processes per VM. And I 
> >> > might have to figure out how to allow processes on the destination 
> >> > system to open hundreds or thousands of sockets.
> >>
> >> Should I resend the patch with a Tested-by: tag?
> >
> > ... Oh, wait. Not yet. The dawn just broke over ol' Marblehead here. I only 
> > tested TCP; I need to hammer UDP, too.
> >
> > Can I set the timeouts to zero? Or is one as low as I can go?
>
> Any progress with testing ?

I applied the 'hammer' through a firewall with the patch. I used TCP,
UDP and ICMP.

I don't know if the patch fixes the problem. But I'm reasonably sure
that it did not break normal operations.

To test a different problem I fixed (a memory leak in my 64-bit
counter patch for xt_ACCOUNT), I tested 60,000 addresses (most of a
/16) through the firewall. Again, no troubles.

I only observed two odd things which are likely completely unrelated
to your patch. When I started the TCP test, then added the UDP test,
only TCP would come through. If I stopped and restarted the TCP test,
only UDP would come through. I suspect this is due to buffering. It's
just a behaviour I haven't encountered since I started using Linux
many years ago (around '98). The second, when I started the test, the
firewall would lose contact with the upstream F/W's apcupsd daemon;
again, this is likely due to the nature of the test: it likely floods
input and output queues.


I'd say you can probably resend with Tested-by.

Neal


[PATCH -stable 3.4,backport] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-11-02 Thread Ani Sinha
netfilter: nf_conntrack: don't release a conntrack with non-zero
refcnt

With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the lists, so we fulfill
Eric's golden rule, which is that all released objects always have a
refcount that equals zero.

Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().

A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.

CPU1                                    CPU2
nf_conntrack_find()
                                        nf_ct_put()
                                         destroy_conntrack()
                                        ...
                                        init_conntrack
                                         __nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
                                         if (!l4proto->new(ct, skb, dataoff, timeouts))
                                          nf_conntrack_free(ct); (use = 2 !!!)
                                        ...
                                        __nf_conntrack_alloc (set use = 1)
 if (!nf_ct_key_equal(h, tuple, zone))
  nf_ct_put(ct); (use = 0)
   destroy_conntrack()
                                        /* continue to work with CT */
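
The "take a reference only if the count is non-zero" step in the
sequence above can be sketched in userspace with C11 atomics. This is
an illustration under assumptions, not the kernel implementation:
struct ct_model, ref_get_unless_zero() and ref_put() are hypothetical
names standing in for the conntrack, atomic_inc_not_zero() and
nf_ct_put().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a conntrack: only the reference count. */
struct ct_model {
	atomic_int use;
};

/* Userspace analogue of atomic_inc_not_zero(): take a reference only
 * if the count is currently non-zero. */
static bool ref_get_unless_zero(struct ct_model *ct)
{
	int old = atomic_load(&ct->use);

	while (old != 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&ct->use, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; returns true if this was the last one. */
static bool ref_put(struct ct_model *ct)
{
	int old = atomic_fetch_sub(&ct->use, 1);

	assert(old > 0); /* dropping a reference nobody holds is a bug */
	return old == 1;
}
```

In this model an object whose count is zero can never be pinned by a
lockless reader, which is why the rule that released objects always
have a zero refcount matters: freeing an object while its count is
still 2 leaves a window in which ref_get_unless_zero() succeeds on
memory that is about to be reused.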

After applying the patch "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():

<4>[67096.759334] [ cut here ]
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C 
---2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[]  [] 
destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255]  [] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255]  [] nf_conntrack_find_get+0x85/0x130 
[nf_conntrack]
<4>[67096.760255]  [] nf_conntrack_in+0x352/0xb60 
[nf_conntrack]
<4>[67096.760255]  [] ipv4_conntrack_local+0x51/0x60 
[nf_conntrack_ipv4]
<4>[67096.760255]  [] nf_iterate+0x69/0xb0
<4>[67096.760255]  [] ? dst_output+0x0/0x20
<4>[67096.760255]  [] nf_hook_slow+0x74/0x110
<4>[67096.760255]  [] ? dst_output+0x0/0x20
<4>[67096.760255]  [] raw_sendmsg+0x775/0x910
<4>[67096.760255]  [] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] inet_sendmsg+0x4a/0xb0
<4>[67096.760255]  [] ? sock_sendmsg+0x13/0x140
<4>[67096.760255]  [] sock_sendmsg+0x117/0x140
<4>[67096.760255]  [] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255]  [] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255]  [] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255]  [] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] sys_sendto+0x139/0x190
<4>[67096.760255]  [] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255]  [] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255]  [] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255]  [] ia32_sysret+0x0/0x5

I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.

Signed-off-by: Ani Sinha 
Tested-by: "Neal P. Murphy"  
---
 net/netfilter/nf_conntrack_core.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a171b2..9a46908 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -441,7 +441,9 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
goto out;
 
add_timer(&ct->timeout);
-   nf_conntrack_get(&ct->ct_general);
+   smp_wmb();
+   /* The caller holds a reference to this object */
+   atomic_set(&ct->ct_general.use, 2);
__nf_conntrack_hash_insert(ct, hash, repl_hash);
NF_CT_STAT_INC(net, insert);
spin_unlock_bh(&nf_conntrack_lock);
@@ -732,11 +734,10 @@ __nf_conntrack_alloc(struct net *net, u16 zone,
nf_ct_zone->id = zone;
}
 #endif
-   /*
-* changes to lookup keys must be done before setting refcnt to 1
+   /* Because we use RCU lookups, we set ct_general.use to zero before
+* this is inserted in any list.
 */
-   smp_wmb();
-   atomic_set(&ct->ct_general.use, 1);
+   at

[PATCH -stable 3.4,backport] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-11-02 Thread Ani Sinha
netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by RCU, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but at this moment a few readers
can still be using the conntrack. Then this conntrack is released and
another thread creates a conntrack at the same address with an equal
tuple. After this, a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it cannot be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested checking the confirmed bit too. I think
that's right.

task 1                  task 2                  task 3
                        nf_conntrack_find_get
                         nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
                                                __nf_conntrack_alloc
                                                 kmem_cache_alloc
                                                 memset(&ct->tuplehash[IP_CT_DIR_MAX],
                         if (nf_ct_is_dying(ct))
                         if (!nf_ct_tuple_equal()

I'm not sure that I have ever seen this race condition in real life.
Currently we are investigating a bug which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently;
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[]  [] 
nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [] alloc_null_binding+0x5b/0xa0 
[iptable_nat]
<4>[46267.085697]  [] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [] nf_iterate+0x69/0xb0
<4>[46267.085991]  [] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [] ? dst_output+0x0/0x20
<4>[46267.086277]  [] ip_output+0xa4/0xc0
<4>[46267.086346]  [] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [] sys_sendto+0x139/0x190
<4>[46267.087229]  [] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 
c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 
0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [] nf_nat_setup_info+0x564/0x590

Signed-off-by: Ani Sinha 
Tested-by: Neal P. Murphy 
---
 net/netfilter/nf_conntrack_core.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a46908..fd0f7a3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -309,6 +309,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
nf_ct_put(ct);
 }
 
+static inline bool
+nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
+   const struct nf_conntrack_tuple *tuple,
+   u16 zone)
+{
+   struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+   /* A conntrack can be recreated with the equal tuple,
+* so we need to check that the conntrack is confirmed
+*/
+   return nf_ct_tuple_equal(tuple, &h->tuple) &&
+   nf_ct_zone(ct) == zone &&
+   nf_ct_is_confirmed(ct);
+}
+
 /*
  * Warning :
  * - Caller must take a

[PATCH 1/1] ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context.

2015-10-30 Thread Ani Sinha
Fixes the following kernel BUG:

BUG: using __this_cpu_add() in preemptible [] code: bash/2758
caller is __this_cpu_preempt_check+0x13/0x15
CPU: 0 PID: 2758 Comm: bash Tainted: P   O   3.18.19 #2
 8170eaca 880110d1b788 81482b2a 
  880110d1b7b8 812010ae 880007cab800
 88001a060800 88013a899108 880108b84240 880110d1b7c8
Call Trace:
[] dump_stack+0x52/0x80
[] check_preemption_disabled+0xce/0xe1
[] __this_cpu_preempt_check+0x13/0x15
[] ipmr_queue_xmit+0x647/0x70c
[] ip_mr_forward+0x32f/0x34e
[] ip_mroute_setsockopt+0xe03/0x108c
[] ? get_parent_ip+0x11/0x42
[] ? pollwake+0x4d/0x51
[] ? default_wake_function+0x0/0xf
[] ? get_parent_ip+0x11/0x42
[] ? __wake_up_common+0x45/0x77
[] ? _raw_spin_unlock_irqrestore+0x1d/0x32
[] ? __wake_up_sync_key+0x4a/0x53
[] ? sock_def_readable+0x71/0x75
[] do_ip_setsockopt+0x9d/0xb55
[] ? unix_seqpacket_sendmsg+0x3f/0x41
[] ? sock_sendmsg+0x6d/0x86
[] ? sockfd_lookup_light+0x12/0x5d
[] ? SyS_sendto+0xf3/0x11b
[] ? new_sync_read+0x82/0xaa
[] compat_ip_setsockopt+0x3b/0x99
[] compat_raw_setsockopt+0x11/0x32
[] compat_sock_common_setsockopt+0x18/0x1f
[] compat_SyS_setsockopt+0x1a9/0x1cf
[] compat_SyS_socketcall+0x180/0x1e3
[] cstar_dispatch+0x7/0x1e

Signed-off-by: Ani Sinha 
---
 net/ipv4/ipmr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 866ee89..8e8203d 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1682,8 +1682,8 @@ static inline int ipmr_forward_finish(struct sock *sk, struct sk_buff *skb)
 {
 	struct ip_options *opt = &(IPCB(skb)->opt);
 
-	IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);
+	IP_INC_STATS(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_ADD_STATS(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -1745,7 +1745,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 		 * to blackhole.
 		 */
 
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 		ip_rt_put(rt);
 		goto out_free;
 	}
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
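
As a reading aid for the patch above, here is a minimal userspace sketch of
why the debug check in the trace fires and why switching to the non-_BH
macros avoids it. Everything prefixed model_ is hypothetical and exists only
for this illustration; none of it is a kernel API, and the real
IP_INC_STATS()/IP_INC_STATS_BH() macros are implemented differently. The
sketch only captures the contract: the _BH flavour trusts the caller to have
excluded preemption, while the plain flavour protects itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state (not kernel APIs). */
static int model_preempt_count;      /* > 0 means preemption is disabled */
static int model_bug_reports;        /* times the debug check complained */
static unsigned long model_counter;  /* one "per-CPU" statistics slot */

static bool model_preemptible(void)
{
    return model_preempt_count == 0;
}

/* Models the _BH flavour: assumes the caller already excluded preemption,
 * so a debug build complains when that assumption does not hold. */
static void model_inc_stats_bh(void)
{
    if (model_preemptible())
        model_bug_reports++;         /* corresponds to the BUG line above */
    model_counter++;
}

/* Models the plain flavour: usable from any context because it guards
 * the update itself instead of trusting the caller. */
static void model_inc_stats(void)
{
    model_preempt_count++;           /* model of "exclude preemption" */
    model_counter++;
    model_preempt_count--;
}
```

Calling model_inc_stats_bh() with model_preempt_count at zero mirrors the
setsockopt path in the trace; calling model_inc_stats() from the same state
updates the counter without a report.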


Re: kernel BUG in ipmr_queue_xmit()

2015-10-30 Thread Ani Sinha
On Fri, Oct 30, 2015 at 12:12 PM, Eric Dumazet  wrote:
> On Fri, 2015-10-30 at 10:47 -0700, Ani Sinha wrote:
>
>> for 32 bit archs, it does in SNMP_ADD_STATS64_USER()
>
> Sure. But x86 these days is 64bit, at 99 % maybe.
>
> We do not make changes that looks 'maybe better' for i486 or i586
>
> Just do the same that multiple similar patches did.
>
> Example :
>
> 757efd32d5ce31f67193cc0e6a56e4dffcc42fb1

OK, thanks for pointing me to this. It seems we have a precedent for this.
I will go ahead and send a patch as per your suggestion.


Re: kernel BUG in ipmr_queue_xmit()

2015-10-30 Thread Ani Sinha
On Fri, Oct 30, 2015 at 4:00 AM, Eric Dumazet  wrote:
> On Fri, 2015-10-30 at 11:48 +0100, Florian Westphal wrote:
>> Hannes Frederic Sowa  wrote:
>> > > > > @@ -936,7 +936,9 @@ static void ipmr_cache_resolve(struct net *net, 
>> > > > > struct mr_table *mrt,
>> > > > >
>> > > > >   rtnl_unicast(skb, net, NETLINK_CB(skb).portid);
>> > > > >   } else {
>> > > > > + preempt_disable();
>> > > > >   ip_mr_forward(net, mrt, skb, c, 0);
>> > > > > + preempt_enable();
>> > > > >   }
>> > > > >   }
>> > > > >  }
>> > > >
>> > > > I do not believe this fix is correct.
>> > >
>> > > Yes, sorry.  I should have suggested local_bh_disable instead.
>> > >
>> > > > Better replace the
>> > > > IP_INC_STATS_BH() by IP_INC_STATS()
>> > > >
>> > > > and IP_ADD_STATS_BH() by IP_ADD_STATS()
>> > >
>> > > Hmm, whats the rationale for this?
>> > >
>> > > Note that IP_ADD_STATS_BH in question is unconditional (not in
>> > > error path).  It seems that its virtually always called from softirq
>> > > except in the setsockopt case.
>> >
>> > The naming of the functions is bad if you compare them to e.g.
>> > spin_lock_bh.
>> >
>> > STATS_BH can only be used from bottom half and the normal ones (without
>> > _BH) can be called from everywhere. It is a common pattern in the
>> > kernel.
>> >
>> > Eric's proposal is correct.
>>
>> Yes, its correct but it results in 4 additonal bh on/off calls
>> for the common case, hence my question.
>>
>> Moving the one ip_mr_forward into bh-off keeps the bh-disable thing
>> in the setsockopt path.
>
> I have no idea how long is the ip_mr_forward(net, mrt, skb, c, 0)
> section, and if GFP_KERNEL allocations were attempted in this path.
>
> The proposed fix might add other regressions.
>
> I do not want to spend time auditing this code that nobody uses.
>
> While on x86, IP_INC_STATS() does not use additional bh on/off calls
>

For 32-bit archs, it does, in SNMP_ADD_STATS64_USER().


> In general, we should disable interrupts (even if soft) for limited
> amount of times.
>
>
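
To make the 32-bit remark above concrete, here is a hedged userspace sketch.
My understanding (an assumption, not something spelled out in this thread) is
that on 32-bit builds a 64-bit counter cannot be updated in one store, so it
is guarded by a sequence count, and a writer must not be interrupted by
another writer on the same CPU; that would be why the process-context variant
turns bottom halves off first. The model_ names below are hypothetical, the
code is single-threaded, and it omits the memory barriers real code needs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit counter guarded by a sequence count. */
struct model_stat64 {
    unsigned int seq;   /* odd while an update is in progress */
    uint32_t lo, hi;    /* the two halves a 32-bit CPU stores separately */
};

/* Writer: must not be interrupted by another writer on the same CPU. */
static void model_add(struct model_stat64 *s, uint32_t addend)
{
    uint32_t old = s->lo;

    s->seq++;                   /* begin: readers now see an odd value */
    s->lo += addend;
    if (s->lo < old)            /* low half wrapped: carry into high half */
        s->hi++;
    s->seq++;                   /* end: sequence is even again */
}

/* Reader: returns 1 and stores the value if no update was in flight,
 * otherwise returns 0 and the caller is expected to retry. */
static int model_try_read(const struct model_stat64 *s, uint64_t *out)
{
    unsigned int start = s->seq;

    if (start & 1)
        return 0;
    *out = ((uint64_t)s->hi << 32) | s->lo;
    return s->seq == start;
}
```

If a second writer ran between the two seq++ steps of the first, the parity
of seq would no longer tell readers whether an update is in progress, which
is the situation that excluding bottom halves is meant to rule out.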


Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-10-30 Thread Ani Sinha
On Thu, Oct 29, 2015 at 6:21 PM, Neal P. Murphy
 wrote:
> On Thu, 29 Oct 2015 17:01:24 -0700
> Ani Sinha  wrote:
>
>> On Wed, Oct 28, 2015 at 11:40 PM, Neal P. Murphy
>>  wrote:
>> > On Wed, 28 Oct 2015 02:36:50 -0400
>> > "Neal P. Murphy"  wrote:
>> >
>> >> On Mon, 26 Oct 2015 21:06:33 +0100
>> >> Pablo Neira Ayuso  wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
>> >> > > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >> >
>> >> > Please, no need to Cc everyone here. Please, submit your Netfilter
>> >> > patches to netfilter-de...@vger.kernel.org.
>> >> >
>> >> > Moreover, it would be great if the subject includes something
>> >> > descriptive on what you need, for this I'd suggest:
>> >> >
>> >> > [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in 
>> >> > nf_conntrack_find_get
>> >> >
>> >> > I'm including Neal P. Murphy, he said he would help testing these
>> >> > backports, getting a Tested-by: tag usually speeds up things too.
>> >>
>> >
>> > I've probably done about as much seat-of-the-pants testing as I can. All 
>> > opening/closing the same destination IP/port.
>> >
>> > Host: Debian Jessie, 8-core Vishera 8350 at 4.4 GHz, 16GiB RAM at (I 
>> > think) 2100MHz.
>> >
>> > Traffic generator 1: 6-CPU KVM running 64-bit Smoothwall Express 3.1 
>> > (linux 3.4.109 without these patches), with 8GiB RAM and 9GiB swap. 
>> > Packets sent across PURPLE (to bypass NAT and firewall).
>> >
>> > Traffic generator 2: 32-bit KVM running Smoothwall Express 3.1 (linux 
>> > 3.4.110 with these patches), 3GiB RAM and minimal swap.
>> >
>> > In the first set of tests, generator 1's traffic passed through Generator 
>> > 2 as a NATting firewall, to the host's web server. In the second set of 
>> > tests, generator 2's traffic went through NAT to the host's web server.
>> >
>> > The load tests:
>> >   - 2500 processes using 2500 addresses and random src ports
>> >   - 2500 processes using 2500 addresses and the same src port
>> >   - 2500 processes using the same src address and port
>> >
>> > I also tested using stock NF timeouts and using 1 second timeouts.
>> >
>> > Bandwidth used got as high as 16Mb/s for some tests.
>> >
>> > Conntracks got up to 200 000 or so or bounced between 1 and 2, depending 
>> > on the test and the timeouts.
>> >
>> > I did not reproduce the problem these patches solve. But more importantly, 
>> > I saw no problems at all. Each time I terminated a test, RAM usage 
>> > returned to about that of post-boot; so there were no apparent memory 
>> > leaks. No kernel messages and no netfilter messages appeared during the 
>> > tests.
>> >
>> > If I have time, I suppose I could run another set of tests: 2500 source 
>> > processes using 2500 addresses times 200 ports to connect to 2500 
>> > addresses times 200 ports on a destination system. Each process opens 200 
>> > sockets, then closes them. And repeats ad infinitum. But I might have to 
>> > be clever since I can't run 500 000 processes; but I could run 20 VMs; 
>> > that would get it down to about 12 000 processes per VM. And I might have 
>> > to figure out how to allow allow processes on the destination system to 
>> > open hundreds or thousands of sockets.
>>
>> Should I resend the patch with a Tested-by: tag?
>
> ... Oh, wait. Not yet. The dawn just broke over ol' Marblehead here. I only 
> tested TCP; I need to hammer UDP, too.
>
> Can I set the timeouts to zero? Or is one as low as I can go?

I don't see any assertion or check against 0-second timeouts. You can
try, but your conntrack entries will be flushed constantly.

>
> N


Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-10-29 Thread Ani Sinha
On Wed, Oct 28, 2015 at 11:40 PM, Neal P. Murphy
 wrote:
> On Wed, 28 Oct 2015 02:36:50 -0400
> "Neal P. Murphy"  wrote:
>
>> On Mon, 26 Oct 2015 21:06:33 +0100
>> Pablo Neira Ayuso  wrote:
>>
>> > Hi,
>> >
>> > On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
>> > > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >
>> > Please, no need to Cc everyone here. Please, submit your Netfilter
>> > patches to netfilter-de...@vger.kernel.org.
>> >
>> > Moreover, it would be great if the subject includes something
>> > descriptive on what you need, for this I'd suggest:
>> >
>> > [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in 
>> > nf_conntrack_find_get
>> >
>> > I'm including Neal P. Murphy, he said he would help testing these
>> > backports, getting a Tested-by: tag usually speeds up things too.
>>
>
> I've probably done about as much seat-of-the-pants testing as I can. All 
> opening/closing the same destination IP/port.
>
> Host: Debian Jessie, 8-core Vishera 8350 at 4.4 GHz, 16GiB RAM at (I think) 
> 2100MHz.
>
> Traffic generator 1: 6-CPU KVM running 64-bit Smoothwall Express 3.1 (linux 
> 3.4.109 without these patches), with 8GiB RAM and 9GiB swap. Packets sent 
> across PURPLE (to bypass NAT and firewall).
>
> Traffic generator 2: 32-bit KVM running Smoothwall Express 3.1 (linux 3.4.110 
> with these patches), 3GiB RAM and minimal swap.
>
> In the first set of tests, generator 1's traffic passed through Generator 2 
> as a NATting firewall, to the host's web server. In the second set of tests, 
> generator 2's traffic went through NAT to the host's web server.
>
> The load tests:
>   - 2500 processes using 2500 addresses and random src ports
>   - 2500 processes using 2500 addresses and the same src port
>   - 2500 processes using the same src address and port
>
> I also tested using stock NF timeouts and using 1 second timeouts.
>
> Bandwidth used got as high as 16Mb/s for some tests.
>
> Conntracks got up to 200 000 or so or bounced between 1 and 2, depending on 
> the test and the timeouts.
>
> I did not reproduce the problem these patches solve. But more importantly, I 
> saw no problems at all. Each time I terminated a test, RAM usage returned to 
> about that of post-boot; so there were no apparent memory leaks. No kernel 
> messages and no netfilter messages appeared during the tests.
>
> If I have time, I suppose I could run another set of tests: 2500 source 
> processes using 2500 addresses times 200 ports to connect to 2500 addresses 
> times 200 ports on a destination system. Each process opens 200 sockets, then 
> closes them. And repeats ad infinitum. But I might have to be clever since I 
> can't run 500 000 processes; but I could run 20 VMs; that would get it down 
> to about 12 000 processes per VM. And I might have to figure out how to allow 
> allow processes on the destination system to open hundreds or thousands of 
> sockets.

Should I resend the patch with a Tested-by: tag?


[PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-10-26 Thread Ani Sinha
netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by RCU, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but at this moment a few readers
can still be using it. Then this conntrack is released and another
thread creates a conntrack at the same address with an equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it cannot be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested checking the confirm bit too. I think that's
right.

task 1                  task 2                  task 3
                        nf_conntrack_find_get
                         nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
                                                __nf_conntrack_alloc
                                                 kmem_cache_alloc
                                                 memset(&ct->tuplehash[IP_CT_DIR_MAX],
                         if (nf_ct_is_dying(ct))
                         if (!nf_ct_tuple_equal()

I'm not sure that I have ever seen this race condition in real life.
Currently we are investigating a bug which reproduces on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently;
we don't have any other explanation for this.
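
The idea of the fix described above can be sketched in userspace as follows.
This is only an illustration under simplifying assumptions: struct
model_tuple, struct model_conn and the model_ functions are hypothetical
stand-ins, not the kernel's structures or helpers. The point it shows is that
a recycled object whose tuple and zone happen to match is still rejected
until it has been confirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct model_tuple {
    uint32_t src, dst;
    uint16_t sport, dport;
};

struct model_conn {
    struct model_tuple tuple;
    uint16_t zone;
    bool confirmed;   /* set only once the entry has been fully set up */
};

static bool model_tuple_equal(const struct model_tuple *a,
                              const struct model_tuple *b)
{
    return a->src == b->src && a->dst == b->dst &&
           a->sport == b->sport && a->dport == b->dport;
}

/* A lookup hit counts only if tuple and zone match AND the entry is
 * confirmed; a freshly re-allocated, half-initialized entry is not. */
static bool model_key_equal(const struct model_conn *ct,
                            const struct model_tuple *tuple,
                            uint16_t zone)
{
    return model_tuple_equal(tuple, &ct->tuple) &&
           ct->zone == zone &&
           ct->confirmed;
}
```

With only the tuple comparison, the reader in task 2 would accept the entry
that task 3 is still initializing; the extra confirmed test makes that lookup
fail so the reader does not touch it.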

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[]  [] 
nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [] alloc_null_binding+0x5b/0xa0 
[iptable_nat]
<4>[46267.085697]  [] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [] nf_iterate+0x69/0xb0
<4>[46267.085991]  [] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [] ? dst_output+0x0/0x20
<4>[46267.086277]  [] ip_output+0xa4/0xc0
<4>[46267.086346]  [] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [] sys_sendto+0x139/0x190
<4>[46267.087229]  [] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 
c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 
0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Pablo Neira Ayuso 
Cc: Patrick McHardy 
Cc: Jozsef Kadlecsik 
Cc: "David S. Miller" 
Cc: Cyrill Gorcunov 
Signed-off-by: Andrey Vagin 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Ani Sinha 
---
 net/netfilter/nf_conntrack_core.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a46908..fd0f7a3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -309,6 +309,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_put(ct);
 }
 
+static inline bool
+nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
+		const struct nf_conntrack_tuple *tuple,
+		u16 zone)
+{
+	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+	/* A conntrack can be recreated with the equal tuple,
+	 * so we need to check that the conntrack is confirmed
+	 */

Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-10-24 Thread Ani Sinha
Please refer to the thread "linux 3.4.43 : kernel crash at
__nf_conntrack_confirm" on netdev for context.

thanks

On Sat, Oct 24, 2015 at 11:27 AM, Ani Sinha  wrote:
> netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>
> Lets look at destroy_conntrack:
>
> hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> ...
> nf_conntrack_free(ct)
> kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
>
> net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
>
> The hash is protected by rcu, so readers look up conntracks without
> locks.
> A conntrack is removed from the hash, but in this moment a few readers
> still can use the conntrack. Then this conntrack is released and another
> thread creates conntrack with the same address and the equal tuple.
> After this a reader starts to validate the conntrack:
> * It's not dying, because a new conntrack was created
> * nf_ct_tuple_equal() returns true.
>
> But this conntrack is not initialized yet, so it can not be used by two
> threads concurrently. In this case BUG_ON may be triggered from
> nf_nat_setup_info().
>
> Florian Westphal suggested to check the confirm bit too. I think it's
> right.
>
> task 1                  task 2                  task 3
>                         nf_conntrack_find_get
>                          nf_conntrack_find
> destroy_conntrack
>  hlist_nulls_del_rcu
>  nf_conntrack_free
>  kmem_cache_free
>                                                 __nf_conntrack_alloc
>                                                  kmem_cache_alloc
>                                                  memset(&ct->tuplehash[IP_CT_DIR_MAX],
>                          if (nf_ct_is_dying(ct))
>                          if (!nf_ct_tuple_equal()
>
> I'm not sure, that I have ever seen this race condition in a real life.
> Currently we are investigating a bug, which is reproduced on a few nodes.
> In our case one conntrack is initialized from a few tasks concurrently,
> we don't have any other explanation for this.
>
> <2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
> ...
> <4>[46267.083951] RIP: 0010:[]  [] 
> nf_nat_setup_info+0x564/0x590 [nf_nat]
> ...
> <4>[46267.085549] Call Trace:
> <4>[46267.085622]  [] alloc_null_binding+0x5b/0xa0 
> [iptable_nat]
> <4>[46267.085697]  [] nf_nat_rule_find+0x5c/0x80 
> [iptable_nat]
> <4>[46267.085770]  [] nf_nat_fn+0x111/0x260 [iptable_nat]
> <4>[46267.085843]  [] nf_nat_out+0x48/0xd0 [iptable_nat]
> <4>[46267.085919]  [] nf_iterate+0x69/0xb0
> <4>[46267.085991]  [] ? ip_finish_output+0x0/0x2f0
> <4>[46267.086063]  [] nf_hook_slow+0x74/0x110
> <4>[46267.086133]  [] ? ip_finish_output+0x0/0x2f0
> <4>[46267.086207]  [] ? dst_output+0x0/0x20
> <4>[46267.086277]  [] ip_output+0xa4/0xc0
> <4>[46267.086346]  [] raw_sendmsg+0x8b4/0x910
> <4>[46267.086419]  [] inet_sendmsg+0x4a/0xb0
> <4>[46267.086491]  [] ? sock_update_classid+0x3a/0x50
> <4>[46267.086562]  [] sock_sendmsg+0x117/0x140
> <4>[46267.086638]  [] ? _spin_unlock_bh+0x1b/0x20
> <4>[46267.086712]  [] ? autoremove_wake_function+0x0/0x40
> <4>[46267.086785]  [] ? do_ip_setsockopt+0x90/0xd80
> <4>[46267.086858]  [] ? call_function_interrupt+0xe/0x20
> <4>[46267.086936]  [] ? ub_slab_ptr+0x20/0x90
> <4>[46267.087006]  [] ? ub_slab_ptr+0x20/0x90
> <4>[46267.087081]  [] ? kmem_cache_alloc+0xd8/0x1e0
> <4>[46267.087151]  [] sys_sendto+0x139/0x190
> <4>[46267.087229]  [] ? sock_setsockopt+0x16d/0x6f0
> <4>[46267.087303]  [] ? audit_syscall_entry+0x1d7/0x200
> <4>[46267.087378]  [] ? __audit_syscall_exit+0x265/0x290
> <4>[46267.087454]  [] ? compat_sys_setsockopt+0x75/0x210
> <4>[46267.087531]  [] compat_sys_socketcall+0x13f/0x210
> <4>[46267.087607]  [] ia32_sysret+0x0/0x5
> <4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 
> c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 
> <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
> <1>[46267.088023] RIP  [] nf_nat_setup_info+0x564/0x590
>
> Cc: Eric Dumazet 
> Cc: Florian Westphal 
> Cc: Pablo Neira Ayuso 
> Cc: Patrick McHardy 
> Cc: Jozsef Kadlecsik 
> Cc: "David S. Miller" 
> Cc: Cyrill Gorcunov 
> Signed-off-by: Andrey Vagin 
> Acked-by: Eric Dumazet 
> Signed-off-by: Pablo Neira Ayuso 
> Signed-off-by: Ani Sinha 
> ---
>  net/netfilter/nf_conntrack_core.c | 21 +
>  1 file changed, 17 insertions(+), 4 delet

Re: [PATCH 1/1] commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da upstream.

2015-10-24 Thread Ani Sinha
Please refer to the thread "linux 3.4.43 : kernel crash at
__nf_conntrack_confirm" on netdev for context.

thanks

On Sat, Oct 24, 2015 at 10:27 AM, Ani Sinha  wrote:
> netfilter: nf_conntrack: don't release a conntrack with non-zero
> refcnt
>
> With this patch, the conntrack refcount is initially set to zero and
> it is bumped once it is added to any of the list, so we fulfill
> Eric's golden rule which is that all released objects always have a
> refcount that equals zero.
>
> Andrey Vagin reports that nf_conntrack_free can't be called for a
> conntrack with non-zero ref-counter, because it can race with
> nf_conntrack_find_get().
>
> A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
> ref-counter says that this conntrack is used. So when we release
> a conntrack with non-zero counter, we break this assumption.
>
> CPU1                                    CPU2
> nf_conntrack_find()
>                                         nf_ct_put()
>                                          destroy_conntrack()
>                                         ...
>                                         init_conntrack
>                                          __nf_conntrack_alloc (set use = 1)
> atomic_inc_not_zero(&ct->use) (use = 2)
>                                          if (!l4proto->new(ct, skb, dataoff, timeouts))
>                                           nf_conntrack_free(ct); (use = 2 !!!)
>                                         ...
>                                         __nf_conntrack_alloc (set use = 1)
>  if (!nf_ct_key_equal(h, tuple, zone))
>   nf_ct_put(ct); (use = 0)
>    destroy_conntrack()
>                                         /* continue to work with CT */
>
> After applying the patch "[PATCH] netfilter: nf_conntrack: fix RCU
> race in nf_conntrack_find_get" another bug was triggered in
> destroy_conntrack():
>
> <4>[67096.759334] [ cut here ]
> <2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
> ...
> <4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C 
> ---2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
> <4>[67096.759932] RIP: 0010:[]  [] 
> destroy_conntrack+0x15c/0x190 [nf_conntrack]
> <4>[67096.760255] Call Trace:
> <4>[67096.760255]  [] nf_conntrack_destroy+0x17/0x30
> <4>[67096.760255]  [] nf_conntrack_find_get+0x85/0x130 
> [nf_conntrack]
> <4>[67096.760255]  [] nf_conntrack_in+0x352/0xb60 
> [nf_conntrack]
> <4>[67096.760255]  [] ipv4_conntrack_local+0x51/0x60 
> [nf_conntrack_ipv4]
> <4>[67096.760255]  [] nf_iterate+0x69/0xb0
> <4>[67096.760255]  [] ? dst_output+0x0/0x20
> <4>[67096.760255]  [] nf_hook_slow+0x74/0x110
> <4>[67096.760255]  [] ? dst_output+0x0/0x20
> <4>[67096.760255]  [] raw_sendmsg+0x775/0x910
> <4>[67096.760255]  [] ? flush_tlb_others_ipi+0x128/0x130
> <4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
> <4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
> <4>[67096.760255]  [] inet_sendmsg+0x4a/0xb0
> <4>[67096.760255]  [] ? sock_sendmsg+0x13/0x140
> <4>[67096.760255]  [] sock_sendmsg+0x117/0x140
> <4>[67096.760255]  [] ? native_smp_send_reschedule+0x49/0x60
> <4>[67096.760255]  [] ? _spin_unlock_bh+0x1b/0x20
> <4>[67096.760255]  [] ? autoremove_wake_function+0x0/0x40
> <4>[67096.760255]  [] ? do_ip_setsockopt+0x90/0xd80
> <4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
> <4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
> <4>[67096.760255]  [] sys_sendto+0x139/0x190
> <4>[67096.760255]  [] ? audit_syscall_entry+0x1d7/0x200
> <4>[67096.760255]  [] ? __audit_syscall_exit+0x265/0x290
> <4>[67096.760255]  [] compat_sys_socketcall+0x13f/0x210
> <4>[67096.760255]  [] ia32_sysret+0x0/0x5
>
> I have reused the original title for the RFC patch that Andrey posted and
> most of the original patch description.
>
> Cc: Eric Dumazet 
> Cc: Andrew Vagin 
> Cc: Florian Westphal 
> Cc: Zefan Li 
> Signed-off-by: Ani Sinha 
> Reported-by: Andrew Vagin 
> Signed-off-by: Pablo Neira Ayuso 
> Reviewed-by: Eric Dumazet 
> Acked-by: Andrew Vagin 
> ---
>  net/netfilter/nf_conntrack_core.c | 18 +-
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 9a171b2..9a46908 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -441,7 +441,9 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct

[PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.

2015-10-24 Thread Ani Sinha
netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by RCU, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but at this moment a few readers
can still be using it. Then this conntrack is released and another
thread creates a conntrack at the same address with an equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it cannot be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested checking the confirm bit too. I think that's
right.

task 1                  task 2                  task 3
                        nf_conntrack_find_get
                         nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
                                                __nf_conntrack_alloc
                                                 kmem_cache_alloc
                                                 memset(&ct->tuplehash[IP_CT_DIR_MAX],
                         if (nf_ct_is_dying(ct))
                         if (!nf_ct_tuple_equal()

I'm not sure that I have ever seen this race condition in real life.
Currently we are investigating a bug which reproduces on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently;
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[]  [] 
nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [] alloc_null_binding+0x5b/0xa0 
[iptable_nat]
<4>[46267.085697]  [] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [] nf_iterate+0x69/0xb0
<4>[46267.085991]  [] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [] ? dst_output+0x0/0x20
<4>[46267.086277]  [] ip_output+0xa4/0xc0
<4>[46267.086346]  [] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [] sys_sendto+0x139/0x190
<4>[46267.087229]  [] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 
c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 
0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Pablo Neira Ayuso 
Cc: Patrick McHardy 
Cc: Jozsef Kadlecsik 
Cc: "David S. Miller" 
Cc: Cyrill Gorcunov 
Signed-off-by: Andrey Vagin 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Ani Sinha 
---
 net/netfilter/nf_conntrack_core.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a46908..fd0f7a3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -309,6 +309,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_put(ct);
 }
 
+static inline bool
+nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
+		const struct nf_conntrack_tuple *tuple,
+		u16 zone)
+{
+	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+	/* A conntrack can be recreated with the equal tuple,
+	 * so we need to check that the conntrack is confirmed
+	 */

[PATCH 1/1] commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da upstream.

2015-10-24 Thread Ani Sinha
netfilter: nf_conntrack: don't release a conntrack with non-zero
refcnt

With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the lists, so we fulfill
Eric's golden rule, which is that all released objects always have a
refcount that equals zero.

Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with a non-zero ref-counter, because it can race with
nf_conntrack_find_get().

A conntrack slab is created with SLAB_DESTROY_BY_RCU. A non-zero
ref-counter says that this conntrack is in use. So when we release
a conntrack with a non-zero counter, we break this assumption.

CPU1                                    CPU2
nf_conntrack_find()
                                        nf_ct_put()
                                         destroy_conntrack()
                                        ...
                                        init_conntrack
                                         __nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
                                         if (!l4proto->new(ct, skb, dataoff, timeouts))
                                          nf_conntrack_free(ct); (use = 2 !!!)
                                        ...
                                        __nf_conntrack_alloc (set use = 1)
 if (!nf_ct_key_equal(h, tuple, zone))
  nf_ct_put(ct); (use = 0)
   destroy_conntrack()
                                        /* continue to work with CT */
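
The rule that a released object must have a zero refcount can be shown with a
small userspace sketch. It uses C11 atomics rather than the kernel's
atomic_t, and struct model_obj, model_inc_not_zero() and model_put() are
hypothetical names chosen for this illustration. A lockless reader may take a
reference only while the count is non-zero, so an object that is freed with
its count at zero can never be picked up again by a racing lookup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object with a reference count, modelled with C11 atomics. */
struct model_obj {
    atomic_int use;
};

/* Take a reference only if the object is still live (count != 0). */
static bool model_inc_not_zero(struct model_obj *o)
{
    int old = atomic_load(&o->use);

    while (old != 0) {
        /* On failure the CAS reloads 'old', so the loop re-checks it. */
        if (atomic_compare_exchange_weak(&o->use, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when this was the last one. */
static bool model_put(struct model_obj *o)
{
    return atomic_fetch_sub(&o->use, 1) == 1;
}
```

In the sequence shown in the diagram, the object is freed while use is 2, so
the reader's reference refers to memory that is being reused. Starting the
count at zero and raising it only on insertion means a free always happens at
zero, and model_inc_not_zero() then refuses the object.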

After applying the patch "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():

<4>[67096.759334] [ cut here ]
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C ---2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[]  [] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255]  [] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255]  [] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255]  [] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255]  [] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255]  [] nf_iterate+0x69/0xb0
<4>[67096.760255]  [] ? dst_output+0x0/0x20
<4>[67096.760255]  [] nf_hook_slow+0x74/0x110
<4>[67096.760255]  [] ? dst_output+0x0/0x20
<4>[67096.760255]  [] raw_sendmsg+0x775/0x910
<4>[67096.760255]  [] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] inet_sendmsg+0x4a/0xb0
<4>[67096.760255]  [] ? sock_sendmsg+0x13/0x140
<4>[67096.760255]  [] sock_sendmsg+0x117/0x140
<4>[67096.760255]  [] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255]  [] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255]  [] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255]  [] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] sys_sendto+0x139/0x190
<4>[67096.760255]  [] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255]  [] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255]  [] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255]  [] ia32_sysret+0x0/0x5

I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.

Cc: Eric Dumazet 
Cc: Andrew Vagin 
Cc: Florian Westphal 
Cc: Zefan Li 
Signed-off-by: Ani Sinha 
Reported-by: Andrew Vagin 
Signed-off-by: Pablo Neira Ayuso 
Reviewed-by: Eric Dumazet 
Acked-by: Andrew Vagin 
---
 net/netfilter/nf_conntrack_core.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a171b2..9a46908 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -441,7 +441,9 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
goto out;
 
add_timer(&ct->timeout);
-   nf_conntrack_get(&ct->ct_general);
+   smp_wmb();
+   /* The caller holds a reference to this object */
+   atomic_set(&ct->ct_general.use, 2);
__nf_conntrack_hash_insert(ct, hash, repl_hash);
NF_CT_STAT_INC(net, insert);
spin_unlock_bh(&nf_conntrack_lock);
@@ -732,11 +734,10 @@ __nf_conntrack_alloc(struct net *net, u16 zone,
nf_ct_zone->id = zone;
}
 #endif
-   /*
-* changes to lookup keys must be done before setting refcnt to 1
+   /* Because we use RCU lookups, we set ct_general.us

Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm

2015-10-21 Thread Ani Sinha
On Wed, Oct 21, 2015 at 2:19 PM, Florian Westphal  wrote:
> Ani Sinha  wrote:
>> >> > commit c6825c0976fa7893692e0e43b09740b419b23c09
>> >> > Author: Andrey Vagin 
>> >> > Date:   Wed Jan 29 19:34:14 2014 +0100
>> >> >  netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >> >
>> >> > and a followup patch :
>> >> >
>> >> > commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
>> >> > Author: Pablo Neira Ayuso 
>> >> > Date:   Mon Feb 3 20:01:53 2014 +0100
>> >> > netfilter: nf_conntrack: don't release a conntrack with 
>> >> > non-zero refcnt
>> >> >
>> >
>> > These for instance fix such bugs.
>>
>> Since neither of these patches was backported to the 3.4 series, and
>> since we now have evidence of a crash that points to issues the
>> patches fix, should we consider backporting them to 3.4?
>
> Yes.

Ok cool. I will send out backport patches for 3.4 corresponding to
both the above patches.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm

2015-10-21 Thread Ani Sinha
On Sun, Oct 18, 2015 at 1:07 AM, Florian Westphal  wrote:
> Ani Sinha  wrote:
>> Coming back to this crash, I see something interesting in the
>> conntrack code in linux 3.4.109 (a supported kernel version). I see
>> that the hash table manipulations are protected by a spinlock. Also
>> lookups/reads are protected by RCU. However allocation and
>> deallocation of conntrack objects happen outside of both the locks.
>> It seems to me that a conntrack object can be deallocated and a new
>> object can be allocated and initialized within the same RCU grace
>> period, while the hash table is being read.
>
> Yes.  We need to use SLAB_DESTROY_BY_RCU instead of kfree_rcu because
> there could be hundreds of thousands of alloc/free pairs within a short
> time period.
>
>> It looks like a bug to me.
>
> No, as long as readers detect object reuse.
>
>> > Looking upstream, I see a couple of patches which fix race conditions
>> > around the use of the conntrack hash table with RCU (lock-free read)
>> > primitives:
>> >
>> > commit c6825c0976fa7893692e0e43b09740b419b23c09
>> > Author: Andrey Vagin 
>> > Date:   Wed Jan 29 19:34:14 2014 +0100
>> >  netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >
>> > and a followup patch :
>> >
>> > commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
>> > Author: Pablo Neira Ayuso 
>> > Date:   Mon Feb 3 20:01:53 2014 +0100
>> > netfilter: nf_conntrack: don't release a conntrack with non-zero 
>> > refcnt
>> >
>
> These for instance fix such bugs.

Since neither of these patches was backported to the 3.4 series, and
since we now have evidence of a crash that points to issues the
patches fix, should we consider backporting them to 3.4?


Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm

2015-10-19 Thread Ani Sinha
On Mon, Oct 19, 2015 at 1:33 PM, Florian Westphal  wrote:
> Ani Sinha  wrote:
>> On Sun, Oct 18, 2015 at 2:40 PM, Florian Westphal  wrote:
>> > Ani Sinha  wrote:
>> >> Indeed. So it seems to me that we have run into another such case.
>> >> In patch c6825c0976fa7893692, I see we have added an additional check
>> >> (along with comparing tuple and zone) to verify that the conntrack is
>> >> confirmed.
>> >>
>> >> +   return nf_ct_tuple_equal(tuple, &h->tuple) &&
>> >> +   nf_ct_zone(ct) == zone &&
>> >> +   nf_ct_is_confirmed(ct);
>> >>
>> >> This is necessary since it's possible that a conntrack can be recreated
>> >> with the same tuple and zone.
>> >> Unfortunately, we leave a hole open in __nf_conntrack_confirm() because
>> >> this routine _is_ responsible for confirming the conntrack. We cannot
>> >> use the same logic here.
>> >
>> > Hmm, why?
>> >
>> > I don't understand why we need to change __nf_conntrack_confirm(), can
>> > you elaborate?
>>
>> ok, let's take a step back. The fundamental question I am trying to
>> answer is whether it is possible for another thread to deallocate
>> and then reallocate and initialize the conntrack object while
>> __nf_conntrack_confirm() is running concurrently.
>
> Not unless something is broken.

With or without e53376bef2cd97d3e3f61fdc6 ?

>
>> crash), we do not have the patch
>>
>> e53376bef2cd97d3e3f61fdc6
>>
>> applied. This patch bumps the refcount before adding the conntrack
>> entry into the unconfirmed list.
>
> Yes, that patch fixes such bug.
>
>> + /* Now it is inserted into the unconfirmed list, bump refcount */
>> + nf_conntrack_get(&ct->ct_general);
>>
>> and if we assume the invariant that nf_conntrack_free() is never
>> called when refcount is !=0, then this would seem to indicate that the
>> above patch should fix the crash I mentioned in the thread.
>
> nf_conntrack_free must only be invoked after refcount becomes zero, right.
>
>> One curious hunk is:
>>
>> + /* A freed object has refcnt == 0, that's
>> + * the golden rule for SLAB_DESTROY_BY_RCU
>> + */
>> + NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);
>> +
>> First, this assertion only puts a warning log at best when it fails.
>> Second, if this assertion is false, at some point we will get into a
>> kernel crash like the one I mentioned. So this assertion effectively
>> does nothing other than perhaps help in debugging.
>
> Right.


Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm

2015-10-19 Thread Ani Sinha
On Sun, Oct 18, 2015 at 2:40 PM, Florian Westphal  wrote:
> Ani Sinha  wrote:
>> Indeed. So it seems to me that we have run into another such case.
>> In patch c6825c0976fa7893692, I see we have added an additional check (along
>> with comparing tuple and zone) to verify that the conntrack is confirmed.
>>
>> +   return nf_ct_tuple_equal(tuple, &h->tuple) &&
>> +   nf_ct_zone(ct) == zone &&
>> +   nf_ct_is_confirmed(ct);
>>
>> This is necessary since it's possible that a conntrack can be recreated with
>> the same tuple and zone.
>> Unfortunately, we leave a hole open in __nf_conntrack_confirm() because this
>> routine _is_ responsible for confirming the conntrack. We cannot use the
>> same logic here.
>
> Hmm, why?
>
> I don't understand why we need to change __nf_conntrack_confirm(), can
> you elaborate?

ok, let's take a step back. The fundamental question I am trying to
answer is whether it is possible for another thread to deallocate
and then reallocate and initialize the conntrack object while
__nf_conntrack_confirm() is running concurrently. The crash below
seems to indicate that this can happen.

However, in the current 3.4 release (and the image which generated the
crash), we do not have the patch

e53376bef2cd97d3e3f61fdc6

applied. This patch bumps the refcount before adding the conntrack
entry into the unconfirmed list.

+ /* Now it is inserted into the unconfirmed list, bump refcount */
+ nf_conntrack_get(&ct->ct_general);

and if we assume the invariant that nf_conntrack_free() is never
called when refcount is !=0, then this would seem to indicate that the
above patch should fix the crash I mentioned in the thread.

One curious hunk is:

+ /* A freed object has refcnt == 0, that's
+ * the golden rule for SLAB_DESTROY_BY_RCU
+ */
+ NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);
+

First, this assertion only puts a warning log at best when it fails.
Second, if this assertion is false, at some point we will get into a
kernel crash like the one I mentioned. So this assertion effectively
does nothing other than perhaps help in debugging. Third, the very
fact that this assertion was placed seems to indicate that there might
be cases where we can free a conntrack object with a non-zero refcount.

Does all this make sense?
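
To illustrate the first two points, here is a minimal userspace sketch of a
warn-only assertion guarding a free path. MODEL_SOFT_ASSERT, struct model_obj
and model_free() are hypothetical names; the macro only imitates the
log-and-continue behaviour described above and is not the actual NF_CT_ASSERT
definition.

```c
#include <stdbool.h>
#include <stdio.h>

static int model_warnings; /* number of soft assertions that failed */

/* Hypothetical warn-only assertion: logs, counts, and carries on. */
#define MODEL_SOFT_ASSERT(cond)                                         \
	do {                                                            \
		if (!(cond)) {                                          \
			model_warnings++;                               \
			fprintf(stderr, "soft assert failed: %s\n",     \
				#cond);                                 \
		}                                                       \
	} while (0)

/* Hypothetical object with a plain reference count. */
struct model_obj {
	int use;
	bool freed;
};

static void model_free(struct model_obj *o)
{
	/* "Golden rule": a freed object should have use == 0. */
	MODEL_SOFT_ASSERT(o->use == 0);

	/* The free proceeds regardless of the check above, so a violated
	 * invariant is only reported, never prevented. */
	o->freed = true;
}
```

Calling model_free() on an object whose use is still 2 logs one warning and
frees it anyway, which is why such a check helps debugging but does not stop
the later crash.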


Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm

2015-10-18 Thread Ani Sinha


> 
> On Sun, Oct 18, 2015 at 1:07 AM, Florian Westphal  wrote:
> > Ani Sinha  wrote:
> >> Coming back to this crash, I see something interesting in the
> >> conntrack code in linux 3.4.109 (a supported kernel version). I see
> >> that the hash table manipulations are protected by a spinlock. Also
> >> lookups/reads are protected by RCU. However allocation and
> >> deallocation of conntrack objects happen outside of both the locks.
> >> It seems to me that a conntrack object can be deallocated and a new
> >> object can be allocated and initialized within the same RCU grace
> >> period, while the hash table is being read.
> >
> > Yes.  We need to use SLAB_DESTROY_BY_RCU instead of kfree_rcu because
> > there could be hundreds of thousands of alloc/free pairs within a short
> > time period.
> >
> >> It looks like a bug to me.
> >
> > No, as long as readers detect object reuse.
 
Right.
 
> >
> >> > Looking upstream, I see a couple of patches which fix race conditions
> >> > around the use of the conntrack hash table with RCU (lock-free read)
> >> > primitives:
> >> >
> >> > commit c6825c0976fa7893692e0e43b09740b419b23c09
> >> > Author: Andrey Vagin 
> >> > Date:   Wed Jan 29 19:34:14 2014 +0100
> >> >  netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >> >
> >> > and a followup patch :
> >> >
> >> > commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
> >> > Author: Pablo Neira Ayuso 
> >> > Date:   Mon Feb 3 20:01:53 2014 +0100
> >> > netfilter: nf_conntrack: don't release a conntrack with non-zero 
> >> > refcnt
> >> >
> >
> > These for instance fix such bugs.
> 
Indeed. So it seems to me that we have run into another such case.
In patch c6825c0976fa7893692, I see we have added an additional check (along
with comparing tuple and zone) to verify that the conntrack is confirmed.
 
+   return nf_ct_tuple_equal(tuple, &h->tuple) &&
+   nf_ct_zone(ct) == zone &&
+   nf_ct_is_confirmed(ct);
 
 
This is necessary since it's possible that a conntrack can be recreated with
the same tuple and zone.
Unfortunately, we leave a hole open in __nf_conntrack_confirm() because this
routine _is_ responsible for confirming the conntrack. We cannot use the same
logic here.
 
Should I send a patch along the lines of :
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 71935fc..6ff4088 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -535,6 +535,12 @@ __nf_conntrack_confirm(struct sk_buff *skb)
zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
goto out;
 
+   /* we might be racing against a case where the conntrack was deleted 
+  and a new conntrack was initialized with the exact same zone. We
+  need to make sure that the conntrack node is in the hashtable */
+   if (hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode))
+ goto out;
+
/* Remove from unconfirmed list */
hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
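
The guard proposed in the hunk above can be modelled in userspace C as
follows. This is a simplified sketch under stated assumptions: struct
model_node and the model_* helpers are hypothetical, the list is terminated
by plain NULL rather than the kernel's nulls markers, and the delete clears
pprev instead of poisoning it. It only shows why a node whose pprev is NULL
must not be unlinked.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified list node (no nulls markers, no RCU). */
struct model_node {
	struct model_node *next;
	struct model_node **pprev; /* address of the pointer that points here */
};

/* A node that is on no list has pprev == NULL. */
static bool model_unhashed(const struct model_node *n)
{
	return n->pprev == NULL;
}

static void model_add_head(struct model_node *n, struct model_node **head)
{
	n->next = *head;
	n->pprev = head;
	if (*head)
		(*head)->pprev = &n->next;
	*head = n;
}

/* Unlink n only if it is on a list; returns false if it was skipped. */
static bool model_del_if_hashed(struct model_node *n)
{
	if (model_unhashed(n))
		return false; /* unlinking would write through a NULL pprev */

	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->pprev = NULL; /* simplification: mark as unhashed again */
	return true;
}
```

Without the model_unhashed() test, unlinking a freshly zeroed node would
store through a NULL pprev, the kind of write fault the guard is trying to
avoid. Whether that matches the reported oops is not established here.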
 




Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm

2015-10-17 Thread Ani Sinha
Hi guys,

Coming back to this crash, I see something interesting in the
conntrack code in linux 3.4.109 (a supported kernel version). I see
that the hash table manipulations are protected by a spinlock. Also
lookups/reads are protected by RCU. However allocation and
deallocation of conntrack objects happen outside of both the locks.
It seems to me that a conntrack object can be deallocated and a new
object can be allocated and initialized within the same RCU grace
period, while the hash table is being read. It looks like a bug to me.
Do you guys have any thoughts on this? Situations like the one I
described can result in the crash I sent below.

thanks
ani

On Wed, Oct 7, 2015 at 12:57 PM, Ani Sinha  wrote:
> Hi guys :
>
> We encountered a kernel crash on one of our boxes running 3.4.43
> kernel in the conntrack code. We are using dnsmasq as a proxy to relay
> our dns requests to the real dns server. We verified that the
> conntrack tables were not full. Running conntrack -L around the time
> of the crash showed that it had more than 2100 entries for dnsmasq.
>
> Looking upstream, I see a couple of patches which fix race conditions
> around the use of the conntrack hash table with RCU (lock-free read)
> primitives:
>
> commit c6825c0976fa7893692e0e43b09740b419b23c09
> Author: Andrey Vagin 
> Date:   Wed Jan 29 19:34:14 2014 +0100
>  netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>
> and a followup patch :
>
> commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
> Author: Pablo Neira Ayuso 
> Date:   Mon Feb 3 20:01:53 2014 +0100
> netfilter: nf_conntrack: don't release a conntrack with non-zero 
> refcnt
>
>
> We are trying to reproduce the crash again but it is very rare.
> Meanwhile, I have two questions:
>
> - Do you guys think the race condition described in the above two
> patches has anything to do with the crash I mention below?
> - If the answer to the above is no, have you guys had any other
> reports of a similar crash or any idea what could be going on?
>
> We are still investigating and I will update this thread if I can get
> additional info.
>
> Thanks
> Ani
>
> <1>[10618591.817967] BUG: unable to handle kernel NULL pointer
> dereference at   (null)
> <1>[10618591.914483] IP: []
> __nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
> <4>[10618592.012027] PGD 5aa67067 PUD 5b4f4067 PMD 0
> <4>[10618592.012035] Oops: 0002 [#1] PREEMPT SMP
> <4>[10618592.012041] CPU 1
> <4>[10618592.012043] Modules linked in: xt_comment sch_prio fpdma(PO)
> msr nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT ip6table_mangle
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_LOG xt_limit xt_hl xt_state
> ipt_REJECT xt_multiport xt_tcpudp iptable_mangle kbfd(O) 8021q garp
> stp llc tun nf_conntrack_tftp iptable_raw iptable_filter ip_tables
> xt_NOTRACK nf_conntrack xt_mark ip6table_raw ip6table_filter
> ip6_tables x_tables k10temp hwmon amd64_edac_mod scd(O) microcode
> kvm_amd kvm
> <4>[10618592.012092]
> <4>[10618592.012096] Pid: 5586, comm: dnsmasq Tainted: P   O 3.4.43 #1
> <4>[10618592.012102] RIP: 0010:[]
> [] __nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
> <4>[10618592.012112] RSP: 0018:88005aa1fb98  EFLAGS: 00010202
> <4>[10618592.012116] RAX: 2769 RBX: 880063d58658 RCX:
> 1cc74948
> <4>[10618592.012120] RDX:  RSI: 88010cd8 RDI:
> 4000
> <4>[10618592.012123] RBP: 88005aa1fbc8 R08: 872541be R09:
> 7aa31682
> <4>[10618592.012127] R10: 880063d586d8 R11: 88005aa1fb68 R12:
> 81648180
> <4>[10618592.012130] R13: 17ef R14: bf78 R15:
> 9da0
> <4>[10618592.012135] FS:  ()
> GS:88013fb0(0063) knlGS:f74126d0
> <4>[10618592.012139] CS:  0010 DS: 002b ES: 002b CR0: 80050033
> <4>[10618592.012142] CR2:  CR3: 5b412000 CR4:
> 07e0
> <4>[10618592.012146] DR0:  DR1:  DR2:
> 
> <4>[10618592.012149] DR3:  DR6: 0ff0 DR7:
> 0400
> <4>[10618592.012154] Process dnsmasq (pid: 5586, threadinfo
> 88005aa1e000, task 8800727d6050)
> <4>[10618592.012156] Stack:
> <4>[10618592.012159]   8800889050c0
> 8800889050c0 880063d58658
> <4>[10618592.012166]  0004 0002
> 88005aa1fc38 a00e3c54
> <4>[10618592.012172]  0004 
> 88005aa1fc38 a0078168
> <4>[10618592.

linux 3.4.43 : kernel crash at __nf_conntrack_confirm

2015-10-07 Thread Ani Sinha
Hi guys :

We encountered a kernel crash on one of our boxes running 3.4.43
kernel in the conntrack code. We are using dnsmasq as a proxy to relay
our dns requests to the real dns server. We verified that the
conntrack tables were not full. Running conntrack -L around the time
of the crash showed that it had more than 2100 entries for dnsmasq.

Looking upstream, I see a couple of patches which fix race conditions
around the use of the conntrack hash table with RCU (lock-free read)
primitives:

commit c6825c0976fa7893692e0e43b09740b419b23c09
Author: Andrey Vagin 
Date:   Wed Jan 29 19:34:14 2014 +0100
 netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

and a followup patch :

commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
Author: Pablo Neira Ayuso 
Date:   Mon Feb 3 20:01:53 2014 +0100
netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt


We are trying to reproduce the crash again but it is very rare.
Meanwhile, I have two questions:

- Do you guys think the race condition described in the above two
patches has anything to do with the crash I mention below?
- If the answer to the above is no, have you guys had any other
reports of a similar crash or any idea what could be going on?

We are still investigating and I will update this thread if I can get
additional info.

Thanks
Ani

<1>[10618591.817967] BUG: unable to handle kernel NULL pointer
dereference at   (null)
<1>[10618591.914483] IP: []
__nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
<4>[10618592.012027] PGD 5aa67067 PUD 5b4f4067 PMD 0
<4>[10618592.012035] Oops: 0002 [#1] PREEMPT SMP
<4>[10618592.012041] CPU 1
<4>[10618592.012043] Modules linked in: xt_comment sch_prio fpdma(PO)
msr nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT ip6table_mangle
nf_conntrack_ipv4 nf_defrag_ipv4 xt_LOG xt_limit xt_hl xt_state
ipt_REJECT xt_multiport xt_tcpudp iptable_mangle kbfd(O) 8021q garp
stp llc tun nf_conntrack_tftp iptable_raw iptable_filter ip_tables
xt_NOTRACK nf_conntrack xt_mark ip6table_raw ip6table_filter
ip6_tables x_tables k10temp hwmon amd64_edac_mod scd(O) microcode
kvm_amd kvm
<4>[10618592.012092]
<4>[10618592.012096] Pid: 5586, comm: dnsmasq Tainted: P   O 3.4.43 #1
<4>[10618592.012102] RIP: 0010:[]
[] __nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
<4>[10618592.012112] RSP: 0018:88005aa1fb98  EFLAGS: 00010202
<4>[10618592.012116] RAX: 2769 RBX: 880063d58658 RCX:
1cc74948
<4>[10618592.012120] RDX:  RSI: 88010cd8 RDI:
4000
<4>[10618592.012123] RBP: 88005aa1fbc8 R08: 872541be R09:
7aa31682
<4>[10618592.012127] R10: 880063d586d8 R11: 88005aa1fb68 R12:
81648180
<4>[10618592.012130] R13: 17ef R14: bf78 R15:
9da0
<4>[10618592.012135] FS:  ()
GS:88013fb0(0063) knlGS:f74126d0
<4>[10618592.012139] CS:  0010 DS: 002b ES: 002b CR0: 80050033
<4>[10618592.012142] CR2:  CR3: 5b412000 CR4:
07e0
<4>[10618592.012146] DR0:  DR1:  DR2:

<4>[10618592.012149] DR3:  DR6: 0ff0 DR7:
0400
<4>[10618592.012154] Process dnsmasq (pid: 5586, threadinfo
88005aa1e000, task 8800727d6050)
<4>[10618592.012156] Stack:
<4>[10618592.012159]   8800889050c0
8800889050c0 880063d58658
<4>[10618592.012166]  0004 0002
88005aa1fc38 a00e3c54
<4>[10618592.012172]  0004 
88005aa1fc38 a0078168
<4>[10618592.012179] Call Trace:
<4>[10618592.012186] [] ipv4_confirm+0x17e/0x1a5
[nf_conntrack_ipv4]
<4>[10618592.012192] [] ?
iptable_mangle_hook+0xfa/0x116 [iptable_mangle]
<4>[10618592.012199] [] ? ip_finish_output+0x0/0x36f
<4>[10618592.012205] [] nf_iterate+0x43/0x78
<4>[10618592.012210] [] ? ip_finish_output+0x0/0x36f
<4>[10618592.012214] [] nf_hook_slow+0x6e/0x106
<4>[10618592.012219] [] ? ip_finish_output+0x0/0x36f
<4>[10618592.012224] [] ? dst_output+0x0/0x11
<4>[10618592.012229] [] ip_output+0x83/0x97
<4>[10618592.012234] [] ? __ip_local_out+0x9c/0x9e
<4>[10618592.012239] [] ip_local_out+0x24/0x28
<4>[10618592.012244] [] ip_queue_xmit+0x2e4/0x322
<4>[10618592.012249] [] tcp_transmit_skb+0x766/0x7a7
<4>[10618592.012254] [] tcp_send_active_reset+0xd8/0x104
<4>[10618592.012258] [] tcp_close+0x101/0x335
<4>[10618592.012264] [] inet_release+0x7b/0x82
<4>[10618592.012269] [] sock_release+0x1a/0x72
<4>[10618592.012273] [] sock_close+0x22/0x26
<4>[10618592.012278] [] fput+0x117/0x1f8
<4>[10618592.012283] [] filp_close+0x6d/0x78
<4>[10618592.012288] [] sys_close+0x8e/0xc8
<4>[10618592.012293] [] cstar_dispatch+0x7/0x1e
<4>[10618592.012296] Code: 31 d2 0f b6 d2 85 d2 0f 85 61 01 00 00 48
8b 00 a8 01 75 0d 8b 53 68 3b 50 10 75 94 e9 6a ff ff ff 48 8b 43 20
48 8b 53 28 a8 01 <48> 89 02 75 04 48 89 50 08 49 bd 00 02 20 00 00
00 ad

Re: iproute2 compatibility

2015-10-01 Thread Ani Sinha
On Thu, Oct 1, 2015 at 1:48 PM, Eric Dumazet  wrote:
> On Thu, 2015-10-01 at 13:06 -0700, Ani Sinha wrote:
>> Hi Stephen :
>>
>> I was looking around but could not find clear evidence that a later
>> version of iproute2 is compatible with an older kernel. Specifically,
>> we are wondering if iproute2 v3.6 is compatible with linux kernel
>> 3.4. Any pointers on this would be highly appreciated.
>
> We try hard to keep iproute2 compatible with all kernels.
>

Thanks!


iproute2 compatibility

2015-10-01 Thread Ani Sinha
Hi Stephen :

I was looking around but could not find clear evidence that a later
version of iproute2 is compatible with an older kernel. Specifically,
we are wondering if iproute2 v3.6 is compatible with linux kernel
3.4. Any pointers on this would be highly appreciated.

thanks
ani


Re: list of all network namespaces

2015-09-17 Thread Ani Sinha
On Thu, Sep 17, 2015 at 2:51 AM, Rosen, Rami  wrote:

> Network namespaces which were created in other ways (like userspace
> applications using the clone() system call) will *not* be reflected by
> either of them.

Will there be any interest if I cook up a kernel patch that lists all
network namespaces through /proc?
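
For comparison, a userspace approximation is possible without any kernel
change: walk /proc/<pid>/ns/net for every process and collect the distinct
inode numbers. The sketch below is only that, a sketch. list_netns_inodes()
is a hypothetical helper name, it assumes a kernel where two processes in the
same namespace see the same inode on that file (Linux 3.8 or later), it
usually needs root to read other users' processes, and it misses namespaces
that have no process in them (for example ones kept alive only by a bind
mount or an open file descriptor).

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: store the distinct inode numbers of
 * /proc/<pid>/ns/net in out[0..max) and return how many were found,
 * or -1 if /proc cannot be opened.
 */
static int list_netns_inodes(ino_t *out, int max)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	int n = 0;

	if (!proc)
		return -1;

	while ((de = readdir(proc)) != NULL) {
		char path[288];
		struct stat st;
		int i;

		/* Only numeric directory names are process IDs. */
		if (!isdigit((unsigned char)de->d_name[0]))
			continue;

		snprintf(path, sizeof(path), "/proc/%s/ns/net", de->d_name);
		if (stat(path, &st) != 0)
			continue; /* process exited or permission denied */

		/* Keep each namespace inode only once. */
		for (i = 0; i < n; i++)
			if (out[i] == st.st_ino)
				break;
		if (i == n && n < max)
			out[n++] = st.st_ino;
	}
	closedir(proc);
	return n;
}
```

A caller would pass an array of ino_t and print each entry; the blind spot
for process-less namespaces is the gap a kernel-side listing would close.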


list of all network namespaces

2015-09-16 Thread Ani Sinha
Hi guys

just a stupid question. Is it possible to get a list of all active
network namespaces in the kernel through /proc or some other
interface?

thanks
ani