On Wed, Sep 12, 2018 at 10:50:46AM +0200, Steffen Klassert wrote:
> On Tue, Sep 11, 2018 at 09:02:48PM +0200, Tobias Hommel wrote:
> > > > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when
> > > > skb_dst_force clears the dst_entry.
> > > > 
> > > > Since commit 222d7dbd258d ("net: prevent dst uses after free")
> > > > skb_dst_force() might clear the dst_entry attached to the skb.
> > > > The xfrm code doesn't expect this to happen, so we crash with
> > > > a NULL pointer dereference in this case. Fix it by checking
> > > > skb_dst(skb) for NULL after skb_dst_force() and dropping the
> > > > packet in case the dst_entry was cleared.
> > > > 
> > > > Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> > > > Reported-by: Tobias Hommel <netdev-l...@genoetigt.de>
> > > > Reported-by: Kristian Evensen <kristian.even...@gmail.com>
> > > > Reported-by: Wolfgang Walter <li...@stwm.de>
> > > > Signed-off-by: Steffen Klassert <steffen.klass...@secunet.com>
> > > > ---
> > > 
> > > This patch fixes the problem here.
> > > 
> > > XfrmFwdHdrError climbed to around 80 right at the beginning and has stayed
> > > there since. Probably this happens when some routes are changed/set.
> > > 
> > > Regards and thanks,
> > 
> > Same here, we have now been running stable for ~6 hours and XfrmFwdHdrError is
> > at 220. That is less than 1 lost packet per minute, which seems to be okay for now.
> 
> Thanks a lot for testing! This is now applied to the ipsec tree.
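
For reference, the fix described above boils down to re-checking the dst_entry
right after skb_dst_force(). A rough, untested sketch of that pattern in the xfrm
forward path (the exact hunks, error counter and return value in the applied patch
may differ; net and skb are assumed to be in scope as in __xfrm_route_forward()):

	skb_dst_force(skb);
	if (!skb_dst(skb)) {
		/* skb_dst_force() could not take a reference on the noref
		 * dst and cleared it, so count and drop the packet instead
		 * of dereferencing a NULL dst further down.
		 */
		XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
		return 0;
	}

Packets dropped this way should show up as XfrmFwdHdrError in
/proc/net/xfrm_stat, which is the counter we have been quoting.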

After running for about 24 hours, I have now encountered another panic. This time
it was caused by an out-of-memory situation. Although the trace shows activity in
the filesystem code, I'm posting it here because I cannot isolate the error and it
may well be caused by our NULL pointer bug or by the new fix.
I do not have a serial console attached, so I could only attach a screenshot of
the panic to this mail.

I am running v4.19-rc3 from git with the above-mentioned patch applied.
After 19 hours everything still looked fine: the XfrmFwdHdrError value was at ~950
and overall memory usage shown by htop was at 1.2G/15.6G.
I had htop running via ssh, so I was able to see at least some status post
mortem. Uptime at the time of the panic: 23:50:57.
Overall memory usage was then at 10.2G/15.6G while user processes were only using
the usual amount of memory, so it looks like the kernel was eating up at
least 9G of RAM.

Maybe this information is not very helpful for debugging, but it is at least a
warning that something might still be wrong.

I'll try to gather some more information and keep you updated.
