On 1 August 2016 at 13:07, pravin shelar <[email protected]> wrote:
> On Mon, Aug 1, 2016 at 11:46 AM, Joe Stringer <[email protected]> wrote:
>> On 1 August 2016 at 10:21, pravin shelar <[email protected]> wrote:
>>> On Tue, Jul 12, 2016 at 3:26 PM, Joe Stringer <[email protected]> wrote:
>>>> The core fragmentation handling logic is exported on all supported
>>>> kernels, so it's not necessary to backport the latest version of it.
>>>> This greatly simplifies the code, since the backport no longer has to
>>>> reconcile the old per-lookup garbage collection with the newer
>>>> workqueue-based garbage collection.
>>>>
>>>> As a result of simplifying and removing unnecessary backport code, a
>>>> few corner-case bugs are fixed, such as fragments remaining in the
>>>> fragment cache when the openvswitch module is unloaded.
>>>>
>>>> Due to this, some backported IP functions need a little extra logic
>>>> beyond what is seen in the latest code, for instance on kernels <3.17:
>>>> * Call inet_frag_evictor() before defrag
>>>> * Limit hashsize in the ip{,6}_fragment logic
>>>>
>>>> The pernet init/exit logic also differs a little from upstream. Upstream
>>>> ipv[46]_defrag logic initializes the various pernet fragment parameters
>>>> and its own global fragments cache. In the OVS backport, the pernet
>>>> parameters are shared while the fragments cache is separate. The
>>>> backport relies upon upstream pernet initialization to perform the
>>>> shared setup, and performs no pernet initialization of its own. When it
>>>> comes to pernet exit, however, the backport must ensure that all
>>>> OVS-specific fragment state is cleared, while the shared state remains
>>>> untouched so that the regular ipv[46] logic may do its own cleanup. In
>>>> practice this means that OVS must have its own divergent implementation
>>>> of inet_frags_exit_net().
>>>>
>>>> Fixes the following crash:
>>>>
>>>> Call Trace:
>>>> <IRQ>
>>>> [<ffffffff810744f6>] ? call_timer_fn+0x36/0x100
>>>> [<ffffffff8107548f>] run_timer_softirq+0x1ef/0x2f0
>>>> [<ffffffff8106cccc>] __do_softirq+0xec/0x2c0
>>>> [<ffffffff8106d215>] irq_exit+0x105/0x110
>>>> [<ffffffff81737095>] smp_apic_timer_interrupt+0x45/0x60
>>>> [<ffffffff81735a1d>] apic_timer_interrupt+0x6d/0x80
>>>> <EOI>
>>>> [<ffffffff8104f596>] ? native_safe_halt+0x6/0x10
>>>> [<ffffffff8101cb2f>] default_idle+0x1f/0xc0
>>>> [<ffffffff8101d406>] arch_cpu_idle+0x26/0x30
>>>> [<ffffffff810bf3a5>] cpu_startup_entry+0xc5/0x290
>>>> [<ffffffff810415ed>] start_secondary+0x21d/0x2d0
>>>> Code: Bad RIP value.
>>>> RIP [<ffffffffa0177480>] 0xffffffffa0177480
>>>> RSP <ffff88003f703e78>
>>>> CR2: ffffffffa0177480
>>>> ---[ end trace eb98ca80ba07bd9c ]---
>>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>>
>>>> Signed-off-by: Joe Stringer <[email protected]>
>>>> ---
>>>> I've tested this on CentOS kernel 3.10.0-327 and Ubuntu kernels
>>>> 3.13.0-68, 3.16.0-70, 3.19.0-58, and 4.2.0-35. Some earlier kernel
>>>> versions may still trigger fragmentation-related crashes due to upstream
>>>> bugs, for instance on Ubuntu's 3.13.0-24.
>>>> ---
>>>> acinclude.m4 | 1 +
>>>> datapath/linux/compat/include/net/inet_frag.h | 58 ++-
>>>> datapath/linux/compat/inet_fragment.c | 486 ++------------------------
>>>> datapath/linux/compat/ip_fragment.c | 36 +-
>>>> datapath/linux/compat/nf_conntrack_reasm.c | 17 +
>>>> 5 files changed, 91 insertions(+), 507 deletions(-)
>>>>
>>> ...
>>>
>>>> -void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
>>>> - const char *prefix)
>>>> -{
>>>> - static const char msg[] = "inet_frag_find: Fragment hash bucket"
>>>> - " list length grew over limit " __stringify(INETFRAGS_MAXDEPTH)
>>>> - ". Dropping fragment.\n";
>>>> + local_bh_disable();
>>>> + inet_frag_evictor(nf, f, true);
>>>> + local_bh_enable();
>>>>
>>>> - if (PTR_ERR(q) == -ENOBUFS)
>>>> - net_dbg_ratelimited("%s%s", prefix, msg);
>>>> + nf->low_thresh = thresh;
>>>> }
>>>
>>> The upstream inet_frags_exit_net() and the backported one look the
>>> same; we can just use the exported symbol.
>>
>> We share nf->mem with the upstream frag code, so we shouldn't
>> percpu_counter_destroy(&nf->mem) when we exit the namespace; the
>> upstream frag code will handle that.
>>
>> My other concern was that we use the shared nf->low_thresh to clear
>> out our fragments cache, but if we are exiting our module and the
>> upstream frag code remains, we should restore the nf->low_thresh
>> afterwards so that we don't cause flushing of the upstream frag cache.
>>
> I think we can restore both objects in the compat code after calling
> the default exit function.
>
> But the exit code implementation itself is not that complicated, so I
> am fine with the code.
>
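Right. For reference, the divergent exit path on kernels that still
have inet_frag_evictor() is shaped roughly as follows (a condensed
sketch; the rpl_ prefix and local names here are illustrative rather
than verbatim from the patch):

static void rpl_inet_frags_exit_net(struct netns_frags *nf,
                                    struct inet_frags *f)
{
        int thresh = nf->low_thresh;

        /* Drop the threshold so every OVS-owned queue is over-limit,
         * then force-evict them all. */
        nf->low_thresh = 0;

        local_bh_disable();
        inet_frag_evictor(nf, f, true);
        local_bh_enable();

        /* Unlike upstream inet_frags_exit_net(), restore the shared
         * threshold and leave nf->mem alone; the regular ipv[46]
         * pernet exit will percpu_counter_destroy() it. */
        nf->low_thresh = thresh;
}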
>>>> +#endif /* HAVE_INET_FRAGS_WITH_FRAGS_WORK */
>>>>
>>>> #endif /* !HAVE_CORRECT_MRU_HANDLING */
>>>> diff --git a/datapath/linux/compat/ip_fragment.c b/datapath/linux/compat/ip_fragment.c
>>>> index 8d01088abc0a..64e2cf23c327 100644
>>>> --- a/datapath/linux/compat/ip_fragment.c
>>>> +++ b/datapath/linux/compat/ip_fragment.c
>>>> @@ -76,12 +76,7 @@ struct ipfrag_skb_cb
>>>>
>>>> /* Describe an entry in the "incomplete datagrams" queue. */
>>>> struct ipq {
>>>> - union {
>>>> - struct inet_frag_queue q;
>>>> -#ifndef HAVE_INET_FRAG_QUEUE_WITH_LIST_EVICTOR
>>>> - struct ovs_inet_frag_queue oq;
>>>> -#endif
>>>> - };
>>>> + struct inet_frag_queue q;
>>>>
>>>> u32 user;
>>>> __be32 saddr;
>>>> @@ -119,6 +114,12 @@ static unsigned int ipqhashfn(__be16 id, __be32 saddr, __be32 daddr, u8 prot)
>>>> (__force u32)saddr, (__force u32)daddr,
>>>> ip4_frags.rnd);
>>>> }
>>>> +/* fb3cfe6e75b9 ("inet: frag: remove hash size assumptions from callers")
>>>> + * shifted this logic into inet_fragment, but prior kernels still need this.
>>>> + */
>>>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,17,0)
>>>> +#define ipqhashfn(a, b, c, d) (ipqhashfn(a, b, c, d) & (INETFRAGS_HASHSZ - 1))
>>>> +#endif
>>>>
>>>> #ifdef HAVE_INET_FRAGS_CONST
>>>> static unsigned int ip4_hashfn(const struct inet_frag_queue *q)
>>>> @@ -267,6 +268,23 @@ out:
>>>> ipq_put(qp);
>>>> }
>>>>
>>>> +/* Memory limiting on fragments. Evictor trashes the oldest
>>>> + * fragment queue until we are back under the threshold.
>>>> + *
>>>> + * Necessary for kernels earlier than v3.17. Replaced in commit
>>>> + * b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue").
>>>> + */
>>>> +static void ip_evictor(struct net *net)
>>>> +{
>>>> +#ifdef HAVE_INET_FRAG_EVICTOR
>>>> + int evicted;
>>>> +
>>>> + evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false);
>>>> + if (evicted)
>>>> + IP_ADD_STATS_BH(net, IPSTATS_MIB_REASMFAILS, evicted);
>>>> +#endif
>>>> +}
>>>> +
>>>> /* Find the correct entry in the "incomplete datagrams" queue for
>>>> * this IP datagram, and create new one, if nothing is found.
>>>> */
>>>> @@ -281,6 +299,11 @@ static struct ipq *ip_find(struct net *net, struct iphdr *iph,
>>>> arg.user = user;
>>>> arg.vif = vif;
>>>>
>>>> + ip_evictor(net);
>>>> +
>>> IP fragment eviction can be done after this lookup, so that there is
>>> a better chance of finding the fragment.
>>
>> I think I was trying to follow upstream more closely, although the
>> call is actually in ip_defrag() in v3.16. I could shift it there
>> instead; I don't remember specifically why I put it here in the first
>> place.
>>
>> This seems like a reasonable suggestion at face value. Affected
>> kernels maintain an LRU of frag queues, so if the current fragment
>> lands on the oldest queue, that queue is bumped to the
>> most-recently-used end of the LRU and saved; if it /would/ have been
>> deleted by ip_evictor(), it no longer is. The fragment could also
>> complete reassembly of the message, releasing the memory and reducing
>> the amount of work for ip_evictor() to do.
>>
>> The tradeoff is that if it's actually a fragment from a brand new
>> message and the fragment cache is full, then without cleaning out old
>> fragments first we'll fail to allocate a fragqueue for the new
>> message. We'll clean out the frags list afterwards and subsequently
>> receive later fragments for the same message, but they won't be able
>> to reassemble it because we dropped the first one.
>>
>> I'm inclined to think that the latter is more problematic than the
>> former: most likely fragments arrive in quick succession, so the
>> oldest frag queue is the one most likely to be stale and no longer
>> receiving its remaining fragments. So we should prioritize ensuring
>> that the new fragment can create a new fragqueue.
>>
>> Thoughts?
>
> I agree the later seems bigger issue. we could insert the evict
> between search and create, but that will diverge compat code a lot
> compared to upstream. So lets keep it as it is.
>
> At this point I am fine with patch as it is.
>
> Acked-by: Pravin B Shelar <[email protected]>
Thanks for the reviews.
To bring the ip_evictor() handling closer to upstream, I plan to also
apply the following incremental to this series and push soon:
diff --git a/datapath/linux/compat/ip_fragment.c b/datapath/linux/compat/ip_fragment.c
index 3b737f304287..e707c607fb30 100644
--- a/datapath/linux/compat/ip_fragment.c
+++ b/datapath/linux/compat/ip_fragment.c
@@ -268,6 +268,7 @@ out:
ipq_put(qp);
}
+#ifdef HAVE_INET_FRAG_EVICTOR
/* Memory limiting on fragments. Evictor trashes the oldest
* fragment queue until we are back under the threshold.
*
@@ -276,14 +277,13 @@ out:
*/
static void ip_evictor(struct net *net)
{
-#ifdef HAVE_INET_FRAG_EVICTOR
int evicted;
evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false);
if (evicted)
IP_ADD_STATS_BH(net, IPSTATS_MIB_REASMFAILS, evicted);
-#endif
}
+#endif
/* Find the correct entry in the "incomplete datagrams" queue for
* this IP datagram, and create new one, if nothing is found.
@@ -299,8 +299,6 @@ static struct ipq *ip_find(struct net *net, struct iphdr *iph,
arg.user = user;
arg.vif = vif;
- ip_evictor(net);
-
#ifdef HAVE_INET_FRAGS_WITH_RWLOCK
read_lock(&ip4_frags.lock);
#endif
@@ -706,6 +704,11 @@ int rpl_ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);
skb_orphan(skb);
+#ifdef HAVE_INET_FRAG_EVICTOR
+ /* Start by cleaning up the memory. */
+ ip_evictor(net);
+#endif
+
/* Lookup (or create) queue header */
qp = ip_find(net, ip_hdr(skb), user, vif);
if (qp) {
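For anyone reading along without a pre-3.17 tree handy, the evictor on
those kernels reaps from the least-recently-used end of the per-netns
LRU until memory usage falls back under low_thresh. Condensed from
memory, so treat the details as a sketch rather than gospel:

/* Condensed sketch of the pre-3.17 upstream evictor (not OVS code). */
int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f,
                      bool force)
{
        struct inet_frag_queue *q;
        int work, evicted = 0;

        if (!force && frag_mem_limit(nf) <= nf->high_thresh)
                return 0;

        work = frag_mem_limit(nf) - nf->low_thresh;
        while (work > 0 || force) {
                spin_lock(&nf->lru_lock);
                if (list_empty(&nf->lru_list)) {
                        spin_unlock(&nf->lru_lock);
                        break;
                }

                /* The oldest queue lives at the head of the LRU list. */
                q = list_first_entry(&nf->lru_list,
                                     struct inet_frag_queue, lru_list);
                atomic_inc(&q->refcnt);
                list_del_init(&q->lru_list);
                spin_unlock(&nf->lru_lock);

                spin_lock(&q->lock);
                if (!(q->last_in & INET_FRAG_COMPLETE))
                        inet_frag_kill(q, f);
                spin_unlock(&q->lock);

                if (atomic_dec_and_test(&q->refcnt))
                        inet_frag_destroy(q, f, &work);
                evicted++;
        }

        return evicted;
}

Calling it at the top of ip_defrag(), before the lookup, matches that
upstream behaviour.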
Lastly, looking over the LRU-related code upstream, I saw that our
backport doesn't actually bump the entries in the LRU each time they
are touched. I sent a patch:
http://openvswitch.org/pipermail/dev/2016-August/076961.html
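The upstream helper on those kernels is inet_frag_lru_move(); again
from memory, so a sketch, it just moves a queue to the
most-recently-used end of the per-netns list so the evictor skips it:

static inline void inet_frag_lru_move(struct inet_frag_queue *q)
{
        spin_lock(&q->net->lru_lock);
        /* Tail of the LRU list == most recently used. */
        if (!list_empty(&q->lru_list))
                list_move_tail(&q->lru_list, &q->net->lru_list);
        spin_unlock(&q->net->lru_lock);
}

The fix amounts to calling this each time a fragment is added to an
existing queue in the backport.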