On Thu, May 06, 2021 at 06:10:39PM +0200, Alexandr Nedvedicky wrote:
> Hello,
>
> </snip>
> > > to be honest I have no idea what could be causing problems on those two
> > > fairly
> > > distinct machines. The strange thing is that pf_test() currently does not
> > > run in
> > > parallel. I don't quite understand why reverting my earlier change helps
> > > here.
> >
> > it could be two differents ways to trigger a bug somewhere else that
> > your commit expose.
> >
> > the panic doesn't trigger in the same way on both machines:
> > - Olivier's machine seems to trigger it quickly (after some minutes)
> > - mine relatively slowly (~ once a day)
>
> Olivier's machine acts as AP, so it forwards packets between interfaces.
>
> If I remember correctly your machine is laptop/workstation, which
> does not forward traffic.
it doesn't forward, but it acts as a bridge: 2 physical networks cards
grouped in a bridge(4) with only few traffic (a network printer is on
other side, the bridge(4) is here because I had a sparse network card
and no physical-switch to put here).
> the function, which we change back and forth here is
> pf_state_key_link_reverse(), which is being called from pf_find_state()
> here:
>
> 1085
> 1086 if (sk == NULL) {
> 1087 if ((sk = RB_FIND(pf_state_tree, &pf_statetbl,
> 1088 (struct pf_state_key *)key)) == NULL)
> 1089 return (PF_DROP);
> 1090 if (pd->dir == PF_OUT && pkt_sk &&
> 1091 pf_compare_state_keys(pkt_sk, sk, pd->kif, pd->dir)
> == 0)
> 1092 pf_state_key_link_reverse(sk, pkt_sk);
> 1093 else if (pd->dir == PF_OUT && pd->m->m_pkthdr.pf.inp &&
> 1094 !pd->m->m_pkthdr.pf.inp->inp_pf_sk && !sk->inp)
> 1095 pf_state_key_link_inpcb(sk,
> pd->m->m_pkthdr.pf.inp);
> 1096 }
> 1097
>
> the story in human words goes as follows:
>
> sk == NULL -> no matching state key was attached to packet, Thus we
> have to search state key in state tree using RB_FIND()
>
> if we could find state key for packet in table, then we will try
> to set up a 'shortcut', which can save us RB_FIND() later.
>
> 1090 - 1092
> the shortcut can be set up for outbound packet only (pd->dir PF_OUT),
> which is also being forwarded (pkt_sk != NULL, indicates we are seeing
> the packet for the second time pkt_sk holds state key for inbound
> direction). pf_compare_state_keys() is sanity check, it leaves a
> debug message on system console on failure.
>
> So if it is outbound forwarded packet, we've seen earlier, we
> set up a reverse link to save one RB_FIND() operation on next
> forwarded packet, which matches the same state.
>
> 1093 - 1095
> creates similar shortcut for local bound packets. We put reference
> to state key into PCB linked to socket. This will save us RB_FIND()
> operation for next local outbound, which matches the same state.
>
>
> given the bug seems to be triggered/uncovered by pf_state_key_link_reverse()
> is there any chance your laptop/workstation occasionally forwards packets?
> like doing NAT for vmd/qemu virtual machine?
I don't know bridge(4) internals, but it could make sense that it is
using such functions.
> if it is not the case then the question is how does it come we run
> pf_state_key_link_reverse()? which same as why pkt_sk is not NULL at line
> 1090.
>
> >
> > I could try to run with your commit and see if I could trigger it more
> > easily or found some elements influencing it. I could try with GENERIC
> > for example to see if I still trigger the same assert() or if it is
> > more like Olivier.
>
> I need to think of how to further debug the thing.
>
> >
> > my LAN was several hosts with the same kernel and only this machine
> > trigger the panic, so it shouldn't be strictly linked to the
> > environment.
> >
Thanks
--
Sebastien Marie