On Thu, Jun 03, 2021 at 01:09:48AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> looks like my earlier mail got eaten by SPAM filter, let's try again...
> 
> </snip>
> > 
> > moving pf locks to mutexes makes sense to me, but like you say,
> > this will need testing and experimentation. one of the issues
> 
>     I'm not entirely convinced that trading the rwlock for mutexes
>     is a good thing. IMO the majority of packets are processed as
>     readers on the state lock. If a packet does not match a state,
>     then it needs to be pushed through the rules, where we run into
>     the pf_lock and the state lock. Both of those locks are grabbed
>     exclusively on this path. However I think we can relax the
>     pf_lock to a reader in the future.

there are obviously pluses and minuses to trading rwlocks for mutexes.

here's the states info from one of my firewalls at work:

State Table                          Total             Rate
  current entries                   789418               
  half-open tcp                   19101626               
  searches                   2835612773525       221420.0/s
  inserts                      37030235256         2891.5/s
  removals                     37029445838         2891.5/s

the ratio of searches to state table updates is like 40 to 1, so yes,
letting readers through concurrently would be useful.

>     I'd like to better understand why mutexes (spin locks) are better
>     in the network subsystem. I'm concerned that removing all sleeping
>     points from the network stack will make life harder for the
>     scheduler. If such a box gets hammered by network traffic, the
>     scheduler won't allow other components to run, because the whole
>     network subsystem has turned into one giant non-interruptible
>     block. I admit my concern may come from a lack of understanding
>     of the overall concept.

as i said, there are pluses and minuses to mutexes as pf locks. the
minus is that you lose the ability to run pf concurrently. however, i'm
pretty sure there's a lot of mutable state inside pf that gets updated
on every packet and almost certainly benefits from being run
exclusively. i'm mostly thinking about the packet and byte counters on
rules and states.

my best argument for mutexes in pf is that currently we use smr critical
sections around things like the bridge and aggr port selection, which
are very obviously read-only workloads. pf may be a largely read-only
workload, but where it is at the moment means it's not going to get run
concurrently anyway.

i think the current scope of the locks in pf needs to be looked at
anyway. we might be able to reduce the contention and allow concurrency
with mutexes by doing something more complicated than just swapping the
lock type around. e.g., could the ruleset be wired up with SMR pointers
for traversal? could we split the state trees into buckets and hash
packets into them for lookup? stoeplitz is probably cheap enough to make
that possible...
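
to make the bucket idea concrete, something like this (userland sketch;
the bucket count, struct names, and toy hash are all invented, it's not
the stoeplitz api):

```c
#include <pthread.h>
#include <stdint.h>

#define PF_STATE_BUCKETS 64	/* bucket count is made up for illustration */

struct state_bucket {
	pthread_mutex_t	 sb_mtx;	/* per-bucket lock */
	void		*sb_tree;	/* stands in for an RB tree of states */
};

static struct state_bucket state_buckets[PF_STATE_BUCKETS];

/*
 * toy symmetric hash standing in for stoeplitz: xoring the two
 * addresses (and the two ports) together means both directions of a
 * flow land in the same bucket, like a symmetric toeplitz key would.
 */
static uint32_t
flow_hash(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport)
{
	uint32_t h = (saddr ^ daddr) ^ (uint32_t)(sport ^ dport);

	h ^= h >> 16;
	h *= 0x7feb352dU;
	h ^= h >> 15;
	return (h);
}

static struct state_bucket *
state_bucket_for(uint32_t saddr, uint32_t daddr, uint16_t sport,
    uint16_t dport)
{
	return (&state_buckets[flow_hash(saddr, daddr, sport, dport) %
	    PF_STATE_BUCKETS]);
}
```

lookups in different buckets can then proceed concurrently; only flows
hashing to the same bucket contend on a lock.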

the scope of the pf locks likely needs reduction anyway. one of my
production firewalls panicked with the pf lock trying to lock against
itself a couple of nights ago:

db_enter() at db_enter+0x5
panic(ffffffff81d47212) at panic+0x12a
rw_enter(ffffffff82060e70,1) at rw_enter+0x261
pf_test(18,2,ffff80000158f000,ffff800024d03db8) at pf_test+0x118c
ip6_output(fffffd80049a3a00,0,0,0,ffff800024d03eb0,0) at ip6_output+0xd33
nd6_ns_output(ffff80000158f000,0,ffff800024d04228,fffffd8279b3b420,0) at nd6_ns_output+0x3e2
nd6_resolve(ffff80000158f000,fffffd816c6b2ae0,fffffd80659d8300,ffff800024d04220,ffff800024d040d8) at nd6_resolve+0x29d
ether_resolve(ffff80000158f000,fffffd80659d8300,ffff800024d04220,fffffd816c6b2ae0,ffff800024d040d8) at ether_resolve+0x127
ether_output(ffff80000158f000,fffffd80659d8300,ffff800024d04220,fffffd816c6b2ae0) at ether_output+0x2a
ip6_output(fffffd80659d8300,0,0,0,0,0) at ip6_output+0x1180
pfsync_undefer_notify(fffffd841dbec7b8) at pfsync_undefer_notify+0xac
pfsync_undefer(fffffd841dbec7b8,0) at pfsync_undefer+0x8d
pfsync_defer(fffffd82303cb310,fffffd8065a20000) at pfsync_defer+0xfe
pf_test_rule(ffff800024d04600,ffff800024d045e8,ffff800024d045e0,ffff800024d045f0,ffff800024d045d8,ffff800024d045fe) at pf_test_rule+0x693
pf_test(2,3,ffff8000015a9000,ffff800024d04798) at pf_test+0x10f1
ip_output(fffffd8065a20000,0,ffff800024d04850,1,0,0) at ip_output+0x829
ip_forward(fffffd8065a20000,ffff800001541800,fffffd817a7722d0,0) at ip_forward+0x27a
ip_input_if(ffff800024d04a80,ffff800024d04a7c,4,0,ffff800001541800) at ip_input_if+0x5fd
ipv4_input(ffff800001541800,fffffd8065a20000) at ipv4_input+0x37
carp_input(ffff80000158f000,fffffd8065a20000,5e000158) at carp_input+0x1ac
ether_input(ffff80000158f000,fffffd8065a20000) at ether_input+0x1c0
vlan_input(ffff80000152f000,fffffd8065a20000) at vlan_input+0x19a
ether_input(ffff80000152f000,fffffd8065a20000) at ether_input+0x76
if_input_process(ffff8000001df048,ffff800024d04ca8) at if_input_process+0x5a
ifiq_process(ffff8000001dbe00) at ifiq_process+0x6f
taskq_thread(ffff80000002b080) at taskq_thread+0x7b
end trace frame: 0x0, count: -26

pfsync is a nightmare. whoever wrote that code needs a talking to.

> > identified in another part of this thread (and on icb?) is that the
> > ioctl path does some stuff which can sleep, but you can't sleep
> > while holding a mutex which would now be protecting the data
> > structures you're trying to look at.
> 
>     I'm working on a diff to sort it out. the plan is straightforward:
>     waiting on memory while holding a lock shared with the network
>     path is not desirable. To avoid that we need to allocate memory
>     first, then grab the lock and update the data in pf(4).
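
that pattern, sketched in userland with a pthread mutex standing in for
the pf lock (the struct and function names are invented for
illustration, not the real pf(4) code):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* toy config entry; not the real pf(4) structures */
struct ioctl_entry {
	struct ioctl_entry	*next;
	char			 name[32];
};

/* pthread mutex standing in for the lock shared with the packet path */
static pthread_mutex_t		 conf_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct ioctl_entry	*conf_list;

int
ioctl_add_entry(const char *name)
{
	struct ioctl_entry *e;

	/*
	 * do the allocation (which may sleep) before taking the
	 * lock that the packet path also uses...
	 */
	e = malloc(sizeof(*e));
	if (e == NULL)
		return (-1);
	snprintf(e->name, sizeof(e->name), "%s", name);

	/* ...then hold the lock only long enough to link it in */
	pthread_mutex_lock(&conf_mtx);
	e->next = conf_list;
	conf_list = e;
	pthread_mutex_unlock(&conf_mtx);

	return (0);
}
```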

that sounds great.

>     the easier part of this was committed last year. The harder
>     part (tables and queues) still needs to be done. I'm poking at
>     it as time allows.

excellent.

> > a more specific example is if you're trying to dump the state table.
> > like you say, the state table is currently protected by the
> > pf_state_lock. iterating over the state table and giving up this lock to
> > copyout the entries is... not fun.
> 
>     I think I've sent a diff here [1]; patrick@ suggests an alternative
>     way to lock the buffer in memory and then grab the lock [2]. I have
>     not tried that yet. I'm also not able to judge which one is better.

i'm biased towards the pf_state_list diff...

> > the diff below is a work in progress attempt to address this. it works
> > by locking the pf state list separately so it can support traversal
> > without (long) holds of locks needed in the forwarding path. the very
> > short explanation is that the TAILQ holding the states is locked
> > separately to the links between the states in the list. a much longer
> > explanation is in the diff in the pfvar_priv.h chunk.
> 
>     I like your explanation in pfvar_priv.h; more details inline.
>     I still need a fresher brain to figure out the whole idea/concept.

of course.

> 
> </snip>
> 
> thanks and
> regards
> sashan
> 
> [1] https://marc.info/?l=openbsd-tech&m=162147389818220&w=2
> 
> [2] https://marc.info/?l=openbsd-tech&m=162150683300508&w=2
> 
> </snip>
> >  pf_purge_expired_states(u_int32_t maxcheck)
> >  {
> </snip>
> > -           cur = pf_state_ref(next);
> > +   do {
> > +           if ((cur->timeout == PFTM_UNLINKED) ||
> > +               (pf_state_expires(cur) <= getuptime())) {
> > +                   SLIST_INSERT_HEAD(&gcl, pf_state_ref(cur), gc_list);
>                                               ^^^^^^^^^^^^
>     I wonder: is the extra reference you are chasing coming from here?
>     I suspect pf_state_ref(cur) is being called two times as the macro
>     expands.
> 
> > +                   gcn++;
> > +           }
> >  
> > -           if (cur == NULL)
> > +           /* don't iterate past the end of our view of the list */
> > +           if (cur == tail) {
> > +                   cur = NULL;
> >                     break;
> > -   }
> > -   PF_STATE_EXIT_READ();
> > +           }
> > +
> > +           cur = TAILQ_NEXT(cur, entry_list);
> > +   } while (maxcheck--);
>                 ^^^^^^^^^^
>     I also wonder: maybe it's time to kill maxcheck. I think it does
>     not bring any visible benefits. It just complicates the code for us.
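
on the pf_state_ref(cur) question further up: the sys/queue.h insert
macros do evaluate their elm argument more than once, so passing a
reference-taking call straight in takes two references. a toy version
of the hazard (invented names, not the real pf code):

```c
#include <stddef.h>

/*
 * toy macro shaped like sys/queue.h's SLIST_INSERT_HEAD(): note that
 * elm appears twice in the body, so its argument is evaluated twice.
 */
#define TOY_INSERT_HEAD(head, elm) do {				\
	(elm)->next = (head)->first;				\
	(head)->first = (elm);					\
} while (0)

struct node {
	struct node	*next;
	int		 refs;
};

struct list {
	struct node	*first;
};

/* stand-in for pf_state_ref(): takes a reference, returns its argument */
static struct node *
node_ref(struct node *n)
{
	n->refs++;
	return (n);
}
```

pulling the node_ref() call out into a local variable before the macro
avoids the double evaluation.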

we run the main firewalls at work with a state limit of 16 million,
and we usually sit around 1 million states during business hours.

the way the pf purge code is wired up now, it tries to gc states
every second or so. if we remove the limit it's going to iterate
over a million states every second, and up to 16 million at my
limit. the maths at the moment mean only 10% of the list is processed
every second, and that already seems pretty high to me.

having looked at it more closely in the last couple of days, i just
realised that it does this in the nettq, and then spins for the kernel
lock and then takes the net lock.

>     for example I'm still not sure if we should or should not keep a
>     reference for cur here. It seems to me it should be OK: if the
>     states were explicitly flushed by ioctl(), we can't reach this
>     code, because head == NULL. as I said, I need a fresh brain for
>     this.

the flush ioctl doesn't remove states from the state list, it just marks
them as expired. the only place states get removed from the list is in
the gc code.

> </snip>
> 
> > Index: pfvar_priv.h
> > ===================================================================
> > RCS file: /cvs/src/sys/net/pfvar_priv.h,v
> > retrieving revision 1.6
> > diff -u -p -r1.6 pfvar_priv.h
> > --- pfvar_priv.h    9 Feb 2021 14:06:19 -0000       1.6
> > +++ pfvar_priv.h    2 Jun 2021 07:52:22 -0000
> > @@ -38,6 +38,115 @@
> >  #ifdef _KERNEL
> </snip>
> > +a snapshot of the head and tail of the list and prevent modifications
> > +to the links between states, we can safely traverse between the
> > +head and tail entries. subsequent insertions can add entries after
> > +our view of the tail, but we don't look past our view.
> > +
> > +if both locks must be taken, the rwlock protecting the links between
> > +states is taken before the mutex protecting the head and tail
> > +pointer.
> > +
> > +insertion into the list follows this pattern:
> > +
> > +   // serialise list head/tail modifications
> > +   mtx_enter(&pf_state_list.pfs_mtx);
> > +   TAILQ_INSERT_TAIL(&pf_state_list.pfs_list, state, entry_list);
> > +   mtx_leave(&pf_state_list.pfs_mtx);
> > +
> > +traversal of the list:
> > +
> > +   // lock against the gc removing an item from the list
> > +   rw_enter_read(&pf_state_list.pfs_mtx);
>                                      ^^^^^^^^
>       NIT: this should be pfs_rwl

nice.

> > +   // get a snapshot view of the ends of the list
> > +   mtx_enter(&pf_state_list.pfs_mtx);
> > +   head = TAILQ_FIRST(&pf_state_list.pfs_list);
> > +   tail = TAILQ_LAST(&pf_state_list.pfs_list, pf_state_queue);
> > +   mtx_leave(&pf_state_list.pfs_mtx);
> > +
> > +   state = NULL;
> > +   next = head;
> > +
> > +   while (state != tail) {
> > +           state = next;
> > +           next = TAILQ_NEXT(state, entry_list);
> > +
> > +           // look at the state
> > +   }
> > +
> > +   rw_exit_read(&pf_state_list.pfs_mtx);
>                                      ^^^^^^^^
>       NIT: this should be pfs_rwl

yep.

> > +removing an item from the list:
> > +
> > +   // wait for iterators (readers) to get out
> > +   rw_enter_write(&pf_state_list.pfs_mtx);
> > +
> > +   // serialise list head/tail modifications
> > +   mtx_enter(&pf_state_list.pfs_mtx);
> > +   head = TAILQ_REMOVE(&pf_state_list.pfs_list, entry_list);
>       ^^^^^^^^^^^^^^
>       NIT: not sure TAILQ_REMOVE() yields a value.  

yes :)

it also takes the state you want to remove as an argument.
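
i.e. with the real macro signature the removal looks more like this
(userland sketch using sys/queue.h; the struct names are invented):

```c
#include <stddef.h>
#include <sys/queue.h>

/* invented stand-in for the pf state; only the list linkage matters here */
struct state {
	TAILQ_ENTRY(state)	 entry_list;
	int			 id;
};

TAILQ_HEAD(state_queue, state);

/*
 * TAILQ_REMOVE() does not yield a value; the caller already holds the
 * element and passes it in as the second argument.
 */
static void
state_list_remove(struct state_queue *q, struct state *st)
{
	TAILQ_REMOVE(q, st, entry_list);
}
```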

> > +   mtx_leave(&pf_state_list.pfs_mtx);
> > +
> > +   rw_exit_write(&pf_state_list.pfs_mtx);
> > +
> > +the lock ordering for pf_state_list locks and the rest of the pf
> > +locks are:
> > +
> > +1. KERNEL_LOCK
> > +2. NET_LOCK
> > +3. pf_state_list.pfs_rwl
> > +4. PF_LOCK
> > +5. PF_STATE_LOCK
> > +6. pf_state_list.pfs_mtx
> > +
> > +*/
> > +
> > +struct pf_state_list {
> > +   /* the actual list of states in the system */
> > +   struct pf_state_queue           pfs_list;
> > +
> > +   /* serialises insertion of states in the list */
> > +   struct mutex                    pfs_mtx;
> > +
> > +   /* coordinate access to the list for deletion */
> > +   struct rwlock                   pfs_rwl;
> > +};
> > +
> > +#define PF_STATE_LIST_INITIALIZER(_pfs) {                          \
> > +   .pfs_list       = TAILQ_HEAD_INITIALIZER(_pfs.pfs_list),        \
> > +   .pfs_mtx        = MUTEX_INITIALIZER(IPL_SOFTNET),               \
> > +   .pfs_rwl        = RWLOCK_INITIALIZER("pfstates"),               \
> > +}
> >  
> >  extern struct rwlock pf_lock;
> </snip>
