Re: assert "sc->sc_dev == NUM" failed in if_tun.c (2)

2022-02-24 Thread David Gwynne
On Thu, Feb 24, 2022 at 11:13:48AM +0100, Claudio Jeker wrote:
> On Thu, Feb 24, 2022 at 07:39:54PM +1000, David Gwynne wrote:
> > 
> > here's the diff.
> > 
> > Index: if_tun.c
> > ===
> > RCS file: /cvs/src/sys/net/if_tun.c,v
> > retrieving revision 1.234
> > diff -u -p -r1.234 if_tun.c
> > --- if_tun.c16 Feb 2022 02:22:39 -  1.234
> > +++ if_tun.c24 Feb 2022 08:08:38 -
> > @@ -374,10 +374,19 @@ tun_dev_open(dev_t dev, const struct if_
> > struct ifnet *ifp;
> > int error;
> > u_short stayup = 0;
> > +   struct vnode *vp;
> >  
> 
> Why is there this empty line? It was there before but still wondering.

feng shui? laziness? i'll fix it later.

> > char name[IFNAMSIZ];
> > unsigned int rdomain;
> >  
> > +   /*
> > +* Find the vnode associated with this open before we sleep
> > +* and let something else revoke it. Our caller has a reference
> > +* to it so we don't need to account for it.
> > +*/
> > +   if (!vfinddev(dev, VCHR, &vp))
> > +   panic("%s vfinddev failed", __func__);
> > +
> > snprintf(name, sizeof(name), "%s%u", ifc->ifc_name, minor(dev));
> > rdomain = rtable_l2(p->p_p->ps_rtableid);
> >  
> > @@ -412,6 +421,12 @@ tun_dev_open(dev_t dev, const struct if_
> > /* XXX if_clone_destroy if stayup? */
> > goto done;
> > }
> > +   }
> > +
> > +   /* Has tun_clone_destroy torn the rug out under us? */
> > +   if (vp->v_type == VBAD) {
> > +   error = ENXIO;
> > +   goto done;
> > }
> >  
> > if (sc->sc_dev != 0) {
> > 
> 
> OK claudio@
> 
> After sleeping over this I think this is the cleanest and simplest way
> around this problem. A bit ugly that tun needs to peek into the vnode.
> 
> Another option is to split the clone destroy from the softc / device node.
> Remove the VOP_REVOKE and actually allow tun to be destroyed and recreated
> while open and in that case the device remains open and only the network
> bits are destroyed and later recreated. So in your example from above:
> 
> > > - ifconfig tun0 create
> > > - open /dev/tun0 -> fd 3
> > > - ifconfig tun0 destroy
> > > - ifconfig tun0 create
> > > - write to fd 3
> 
> The write would be perfectly fine since the destroy did not destroy this
> connection (only close(2) would do that). Actually a call to open
> /dev/tun0 after the 2nd create would fail because the device is still
> open. Doing this seems a lot more complex.

we talked about this a lot around the time of src/sys/net/if_tun.c
r1.210. there seemed to be more weight on the side of the argument for
VOP_REVOKE than against, and i still think that's the case now. tun
going away and coming back in between open and write could go either
way, but what about these:

- open() /dev/tun0, tun0 is destroyed, write() to tun0

should the write error? if we run with "only close can destroy the
connection" does this mean the write will create tun0 again? in which
rdomain should it be?

- open() /dev/tun0, tun0 is destroyed, read() from tun0

same as above?

- begin blocking read of tun0, tun0 is destroyed, let's go shopping!

should the read wake up and return an error, or does it just block?

- poll on tun0, tun0 is destroyed

same as above?

if we leave the /dev side of things operational if the interface goes
away, then this would be inconsistent with something working with bpf on
the same interfaces. wouldnt this be inconsistent with hotplug devices
and their /dev things?

maybe destroy should be blocked if the /dev entry is open?

i dunno. it feels more natural that detaching or destroying an interface
should push userland off it.



Re: assert "sc->sc_dev == NUM" failed in if_tun.c (2)

2022-02-24 Thread David Gwynne
On Mon, Feb 21, 2022 at 03:00:01PM +1000, David Gwynne wrote:
> On Sun, Feb 20, 2022 at 10:30:22AM +1000, David Gwynne wrote:
> > 
> > 
> > > On 20 Feb 2022, at 09:46, David Gwynne  wrote:
> > > 
> > > On Sat, Feb 19, 2022 at 02:58:08PM -0800, Greg Steuck wrote:
> > >> There's no reproducer, but maybe this race is approachable without one?
> > >> 
> > >> dev = sc->sc_dev;
> > >> if (dev) {
> > >> struct vnode *vp;
> > >> 
> > >> if (vfinddev(dev, VCHR, &vp))
> > >> VOP_REVOKE(vp, REVOKEALL);
> > >> 
> > >> KASSERT(sc->sc_dev == 0);
> > >> }
> > > 
> > > this was my last run at it:
> > > https://marc.info/?l=openbsd-tech=164489981621957=2
> > > 
> > > maybe i need another dvthing thread to push it a bit harder...
> > 
> > adding another dvthing thread or two made it blow up pretty quickly :(
> 
> so it is this section:
> 
> /* kick userland off the device */
> dev = sc->sc_dev;
> if (dev) {
> struct vnode *vp;
> 
> if (vfinddev(dev, VCHR, &vp))
> VOP_REVOKE(vp, REVOKEALL);
> 
> KASSERT(sc->sc_dev == 0);
> }
> 
> my assumption was/is that VOP_REVOKE would call tunclose (or tapclose)
> on the currently open tun (or tap) device, and swap the vfs backend out
> behind the scenes with deadfs.
> 
> the context behind this is that there isnt really a strong binding between an
> open /dev entry (eg, /dev/tun0) and an instance of an interface (eg,
> tun0). all the device entrypoints (eg, tunopen, tunread, tunwrite,
> etc) pass a dev_t, and that's used to look up an interface instance to
> work with.
> 
> the problem with this is an interface could be destroyed and recreated
> in between calls to the device entrypoints. ie, you could do the
> following high level steps:
> 
> - ifconfig tun0 create
> - open /dev/tun0 -> fd 3
> - ifconfig tun0 destroy
> - ifconfig tun0 create
> - write to fd 3
> 
> and that would send a packet on the newly created tun0 because it had
> the same minor number as the previous one.
> 
> there was a lot of consensus that this was Not The Best(tm), and that if
> a tun interface was destroyed while the /dev entry was open, it should
> act like the interface was detached and the open /dev entry should stop
> working. this is what VOP_REVOKE helps with. or it is supposed to help
> with.
> 
> when we create a tun interface, it's added to a global list of tun
> interfaces. when a tun device is opened, we look for the interface on
> that list (and create it if it doesnt exist), and then check to see if
> it is already open by looking at sc_dev. if sc_dev is 0, it's not open
> and tunopen can set sc_dev to claim ownership of it. if sc_dev is
> non-zero, then the device is busy and open fails.
> 
> tunclose clears sc_dev to say ownership is given up.
> 
> tun destroy checks sc_dev, and if it is != 0 then it knows something has
> it open and will call VOP_REVOKE. VOP_REVOKE is supposed to do what i
> described above, which is call tunclose on the programs behalf and swap
> the vfs ops out.
> 
> what i'm seeing is that sometimes VOP_REVOKE gets called, it happily
> returns 0, but tunclose is not called. this means sc_dev is not cleared,
> and then this KASSERT fires.
> 
> ive tried changing it something like this in the destroy path:
> 
> - KASSERT(sc->sc_dev == 0);
> + while (sc->sc_dev != 0)
> + tsleep_nsec(&sc->sc_dev, PWAIT, "tunclose", INFSLP);
> 
> with tunclose calling wakeup(&sc->sc_dev) when it's finished, but this ends
> up getting stuck in the tsleep.
> 
> however, if i cut the KASSERT out and let destroy keep going, i do see
> tunclose get called against the now non-existent interface. this would
> be fine, but now we're back where we started. if someone recreates tun0
> after it's been destroyed, tunclose will find the new interface and try
> to close it.
> 
> ive tried to follow what VOP_REVOKE actually does and how it finds
> tunclose to call it, but it's pretty twisty and i got tired.
> 
> i guess my question at this point is are my assumptions about how
> VOP_REVOKE works valid? specifically, should it reliably be calling
> tunclose? if not, what causes tunclose to be called in the future and
> why can't i wait for it in tun_clone_destroy?
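
as a rough sketch, the sc_dev ownership handshake described in the
quoted mail above works like this (simplified names, locking elided;
this is illustrative, not the actual if_tun.c code):

	int
	tun_dev_open_sketch(struct tun_softc *sc, dev_t dev)
	{
		if (sc->sc_dev != 0)
			return (EBUSY);		/* already open */
		sc->sc_dev = dev;		/* claim ownership */
		return (0);
	}

	void
	tun_dev_close_sketch(struct tun_softc *sc)
	{
		sc->sc_dev = 0;			/* give ownership up */
		wakeup(&sc->sc_dev);		/* for a destroy waiting on us */
	}

tun_clone_destroy then treats sc_dev != 0 as "somebody has it open".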

claudio figured it out. his clue was that multiple concurrent calls
to tunopen (or tapopen) will share a vnode. because tunopen can sleep,
multiple programs can be inside tunopen for the same tun interface at
the same time.

ifconfig(8): always print the mtu, don't hide it on "bridges"

2022-02-21 Thread David Gwynne
this lets ifconfig show the MTU on interfaces like nvgre, vxlan, etc.
they currently don't show it because they also implement a bridge ioctl,
so ifconfig thinks they're a bridge.

why ifconfig hides the mtu on bridges looks to be a hold over from when
brconfig was merged into ifconfig. if we dont want bridge(4) to report
an mtu, then i can make bridge(4) itself hide the mtu or stop setting
the mtu.

found by jason tubnor.

ok?

Index: ifconfig.c
===
RCS file: /cvs/src/sbin/ifconfig/ifconfig.c,v
retrieving revision 1.451
diff -u -p -r1.451 ifconfig.c
--- ifconfig.c  23 Nov 2021 19:13:45 -  1.451
+++ ifconfig.c  22 Feb 2022 05:38:48 -
@@ -1027,11 +1027,7 @@ getinfo(struct ifreq *ifr, int create)
metric = 0;
else
metric = ifr->ifr_metric;
-#ifdef SMALL
if (ioctl(sock, SIOCGIFMTU, (caddr_t)ifr) == -1)
-#else
-   if (is_bridge() || ioctl(sock, SIOCGIFMTU, (caddr_t)ifr) == -1)
-#endif
mtu = 0;
else
mtu = ifr->ifr_mtu;



Re: rewritten vxlan(4)

2022-02-14 Thread David Gwynne
On Fri, Feb 11, 2022 at 03:13:25PM +1000, David Gwynne wrote:
> On Fri, Mar 05, 2021 at 05:09:29PM +1000, David Gwynne wrote:
> > On Thu, Mar 04, 2021 at 03:36:19PM +1000, David Gwynne wrote:
> > > as the subject says, this is a rewrite of vxlan(4).
> > > 
> > > vxlan(4) relies on bridge(4) to implement learning, but i want to be
> > > able to remove bridge(4) one day. while working on veb(4), i wrote
> > > the guts of a learning bridge implementation that is now used by veb(4),
> > > bpe(4), and nvgre(4). that learning bridge code is now also used by
> > > vxlan(4).
> > > 
> > > this means that a few of the modes that the manpage talks about are
> > > different now. because vxlan doesnt need a bridge for learning, there's
> > > no "multicast mode" anymore, it just does "dynamic mode" out of the box
> > > when configured with a multicast destination address. there's no
> > > multipoint mode now too.
> > > 
> > > another thing that's always bothered me about vxlan(4) is how it occupies
> > > the "udp namespace" and steals packets from the udp stack.
> > > the new code actually creates and binds udp sockets to handle the
> > > vxlan packets. this means userland can't collide with a vxlan interface,
> > > and you get to see that the port is in use in things like netstat. e.g.:
> > > 
> > > dlg@ikkaku ~$ ifconfig vxlan0
> > > vxlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> > >   lladdr fe:e1:ba:d1:17:2a
> > >   index 11 llprio 3
> > >   encap: vnetid none parent aggr0 txprio 0 rxprio outer
> > >   groups: vxlan
> > >   tunnel: inet 192.0.2.36 port 4789 --> 239.0.0.1 ttl 1 nodf
> > >   Addresses (max cache: 100, timeout: 240):
> > >   inet 100.64.1.36 netmask 0xffffff00 broadcast 100.64.1.255
> > > dlg@ikkaku ~$ netstat -na -f inet -p udp
> > > Active Internet connections (including servers)
> > > Proto   Recv-Q Send-Q  Local Address  Foreign Address   
> > > udp  0  0  130.102.96.36.29742129.250.35.250.123
> > > udp  0  0  130.102.96.36.8965 162.159.200.123.123   
> > > udp  0  0  130.102.96.36.13189162.159.200.1.123 
> > > udp  0  0  130.102.96.36.46580220.158.215.20.123
> > > udp  0  0  130.102.96.36.23109103.38.121.36.123 
> > > udp  0  0  239.0.0.1.4789 *.*   
> > > udp  0  0  192.0.2.36.4789*.*   
> > > 
> > > ive also added loop prevention, ie, sending an interface's vxlan
> > > packets over itself should fail rather than panic now.
> > 
> > here's an updated diff with a few fixes.
> >
> 
> this diff better supports vxlan p2p and multicast vxlan configs that
> share a UDP listener.

it doesn't look like anyone (else) cares about vxlan(4), so i'm
going to commit this tomorrow unless anyone really objects.

> Index: net/if_vxlan.c
> ===
> RCS file: /cvs/src/sys/net/if_vxlan.c,v
> retrieving revision 1.83
> diff -u -p -r1.83 if_vxlan.c
> --- net/if_vxlan.c10 Jan 2022 14:07:59 -  1.83
> +++ net/if_vxlan.c11 Feb 2022 05:11:13 -
> @@ -1,7 +1,7 @@
> -/*   $OpenBSD: if_vxlan.c,v 1.83 2022/01/10 14:07:59 jan Exp $   */
> +/*   $OpenBSD$ */
>  
>  /*
> - * Copyright (c) 2013 Reyk Floeter 
> + * Copyright (c) 2021 David Gwynne 
>   *
>   * Permission to use, copy, modify, and distribute this software for any
>   * purpose with or without fee is hereby granted, provided that the above
> @@ -17,475 +17,781 @@
>   */
>  
>  #include "bpfilter.h"
> -#include "vxlan.h"
> -#include "vlan.h"
>  #include "pf.h"
> -#include "bridge.h"
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> -#include 
>  #include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
>  
>  #include 
>  #include 
> +#include 
>  #include 
> +#include 
>  #include 
> -
> -#if NBPFILTER > 0
> -#include 
> -#endif
> +#include 
>  
>  #include 
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
> -#include 
>  #include 
> +#include 
>  
> -#if NPF > 0
> -#include 
> +#ifdef INET6
> +#include 
> +#include 
> +#include 
>  #endif
>  
> -#if NBRIDGE > 0
> +/* for bridge stuff */
>  #include 
> 

Re: Adjust bpf event filter

2022-02-14 Thread David Gwynne
On Mon, Feb 14, 2022 at 08:34:42AM +, Visa Hankala wrote:
> This makes bpf(4) use knote_modify_fn() and knote_process_fn().
> 
> Since the code has only one event function, it makes sense to use it
> directly instead of going through struct filterops f_event pointer.
> 
> OK?

ok.

> 
> Index: net/bpf.c
> ===
> RCS file: src/sys/net/bpf.c,v
> retrieving revision 1.214
> diff -u -p -r1.214 bpf.c
> --- net/bpf.c 13 Feb 2022 23:11:10 -  1.214
> +++ net/bpf.c 14 Feb 2022 08:08:07 -
> @@ -1229,8 +1229,13 @@ filt_bpfrdetach(struct knote *kn)
>  }
>  
>  int
> -filt_bpfread_common(struct knote *kn, struct bpf_d *d)
> +filt_bpfread(struct knote *kn, long hint)
>  {
> + struct bpf_d *d = kn->kn_hook;
> +
> + if (hint == NOTE_SUBMIT) /* ignore activation from selwakeup */
> + return (0);
> +
> >   MUTEX_ASSERT_LOCKED(&d->bd_mtx);
>  
>   kn->kn_data = d->bd_hlen;
> @@ -1241,25 +1246,13 @@ filt_bpfread_common(struct knote *kn, st
>  }
>  
>  int
> -filt_bpfread(struct knote *kn, long hint)
> -{
> - struct bpf_d *d = kn->kn_hook;
> -
> - if (hint == NOTE_SUBMIT) /* ignore activation from selwakeup */
> - return (0);
> -
> - return (filt_bpfread_common(kn, d));
> -}
> -
> -int
>  filt_bpfreadmodify(struct kevent *kev, struct knote *kn)
>  {
>   struct bpf_d *d = kn->kn_hook;
>   int active;
>  
> >   mtx_enter(&d->bd_mtx);
> - knote_assign(kev, kn);
> - active = filt_bpfread_common(kn, d);
> + active = knote_modify_fn(kev, kn, filt_bpfread);
> >   mtx_leave(&d->bd_mtx);
>  
>   return (active);
> @@ -1272,12 +1265,7 @@ filt_bpfreadprocess(struct knote *kn, st
>   int active;
>  
> >   mtx_enter(&d->bd_mtx);
> - if (kev != NULL && (kn->kn_flags & EV_ONESHOT))
> - active = 1;
> - else
> - active = filt_bpfread_common(kn, d);
> - if (active)
> - knote_submit(kn, kev);
> + active = knote_process_fn(kn, kev, filt_bpfread);
> >   mtx_leave(&d->bd_mtx);
>  
>   return (active);
> 



check pf rule set prio values consistently in the pf ioctl code

2022-02-14 Thread David Gwynne
consistently means we do the check in pf_rule_copyin() so both
DIOCADDRULE and DIOCCHANGERULE have the prio values checked. this in
turn prevents invalid prio values getting set on a rule via
DIOCCHANGERULE, which in turn stops a kassert in the ifq priq code
firing.
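for example, a rule copied in with set_prio[0] set to 8 now bounces out
of pf_rule_copyin() with EINVAL before it can reach the priq code,
since IFQ_MAXPRIO is 7.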

i think this fixes
https://syzkaller.appspot.com/bug?id=c5cf86b2a0fc06f60463e60c02086756747970d4.

ok?

Index: pf_ioctl.c
===
RCS file: /cvs/src/sys/net/pf_ioctl.c,v
retrieving revision 1.372
diff -u -p -r1.372 pf_ioctl.c
--- pf_ioctl.c  9 Feb 2022 11:42:58 -   1.372
+++ pf_ioctl.c  14 Feb 2022 03:22:44 -
@@ -1370,15 +1370,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
break;
}
 
-   if (rule->scrub_flags & PFSTATE_SETPRIO &&
-   (rule->set_prio[0] > IFQ_MAXPRIO ||
-   rule->set_prio[1] > IFQ_MAXPRIO)) {
-   error = EINVAL;
-   pf_rule_free(rule);
-   rule = NULL;
-   break;
-   }
-
NET_LOCK();
PF_LOCK();
pr->anchor[sizeof(pr->anchor) - 1] = '\0';
@@ -3070,6 +3061,11 @@ int
 pf_rule_copyin(struct pf_rule *from, struct pf_rule *to)
 {
int i;
+
+   if (from->scrub_flags & PFSTATE_SETPRIO &&
+   (from->set_prio[0] > IFQ_MAXPRIO ||
+   from->set_prio[1] > IFQ_MAXPRIO))
+   return (EINVAL);
 
to->src = from->src;
to->src.addr.p.tbl = NULL;



prevent opening of dead tun/tap devices

2022-02-14 Thread David Gwynne
if an open tun (or tap) device is destroyed via ifconfig destroy,
there is a window while the open device is being revoked on the vfs
side that a third thread can come and open it again. this in turn
triggers a kassert in the ifconfig destroy path where it expects the
device to be closed.

i think this diff fixes it by having the open code refuse to work with a
device that's in the process of being destroyed.

this was found by syzkaller. the detail is at
https://syzkaller.appspot.com/bug?id=1d1ea7860c36e5066edea1145cf2d56715d5042b.

i was able to reproduce the problem with this code. it was pretty hard
to hit, but ive had no luck since adding the TUN_DEAD handling to
tun_dev_open.
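
the check is along these lines (a sketch going off the description
above; the committed diff may differ in detail):

	/* has tun_clone_destroy started tearing this instance down? */
	if (ISSET(sc->sc_flags, TUN_DEAD)) {
		error = ENXIO;
		goto done;
	}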

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

#include <net/if.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <err.h>
#include <errno.h>

static void *
ifioctl(void *arg)
{
const char *name = arg;
struct ifreq ifr;
size_t len;
int s;

len = strlen(name);
if (len >= sizeof(ifr.ifr_name))
errx(1, "interface name is too long");

s = socket(AF_INET, SOCK_DGRAM, 0);
if (s == -1)
err(1, "%s socket", __func__);

for (;;) {
memset(&ifr, 0, sizeof(ifr));
memcpy(ifr.ifr_name, name, len);
if (ioctl(s, SIOCIFCREATE, &ifr) == -1)
warn("SIOCIFCREATE %s", name);

memset(&ifr, 0, sizeof(ifr));
memcpy(ifr.ifr_name, name, len);
if (ioctl(s, SIOCIFDESTROY, &ifr) == -1)
warn("SIOCIFDESTROY %s", name);
}

return (NULL);
}

static void *
dvthing(void *arg)
{
const char *name = arg;
int fd;
unsigned int yay = 0, nay = 0;
ssize_t rv;
uint8_t buf[65536];

for (;;) {
fd = open(name, O_RDWR);
if (fd == -1) {
nay++;
if (errno != EBUSY || (nay % 1000) == 0) {
warn("open %s (yay %u vs nay %u)", name,
yay, nay);
}
continue;
}
yay++;

pain:
rv = read(fd, buf, sizeof(buf));
if (rv == -1) {
warn("read");
if (errno == EIO) {
int nfd = open(name, O_RDWR);
if (nfd == -1)
warn("open new %s", name);
else {
close(fd);
fd = nfd;
goto pain;
}
}
}

close(fd);
}

return (NULL);
}

__dead static void
usage(void)
{
extern char *__progname;

fprintf(stderr, "usage: %s ifname\n", __progname);

exit(1);
}

int
main(int argc, char *argv[])
{
int ecode;
pthread_t if_tid, dv_tid;
void *ret;

if (argc != 2)
usage();

if (chdir("/dev") == -1)
err(1, "/dev");

ecode = pthread_create(&if_tid, NULL, ifioctl, argv[1]);
if (ecode != 0)
errc(1, ecode, "create ifioctl thread");

ecode = pthread_create(&dv_tid, NULL, dvthing, argv[1]);
if (ecode != 0)
errc(1, ecode, "create dvthing thread");

ecode = pthread_join(if_tid, &ret);
if (ecode != 0)
errc(1, ecode, "join ifioctl thread");

ecode = pthread_join(dv_tid, &ret);
if (ecode != 0)
errc(1, ecode, "join dvthing thread");

exit(0);
}
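
to build and run it against, say, tun0 (calling the listing above
tunstress.c just for the sake of the example):

	$ cc -o tunstress tunstress.c -lpthread
	$ ./tunstress tun0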

Index: if_tun.c
===
RCS file: /cvs/src/sys/net/if_tun.c,v
retrieving revision 1.233
diff -u -p -r1.233 if_tun.c
--- if_tun.c15 Feb 2022 04:19:52 -  1.233
+++ if_tun.c15 Feb 2022 04:21:53 -
@@ -218,6 +218,8 @@ tun_create(struct if_clone *ifc, int uni
KERNEL_ASSERT_LOCKED();
 
sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK|M_ZERO);
+   refcnt_init(&sc->sc_refs);
+
ifp = &sc->sc_if;
snprintf(ifp->if_xname, sizeof(ifp->if_xname),
"%s%d", ifc->ifc_name, unit);
@@ -267,7 +269,6 @@ tun_create(struct if_clone *ifc, int uni
}
 
sigio_init(&sc->sc_sigio);
-   refcnt_init(&sc->sc_refs);
 
/* tell tun_dev_open we're initialised */
 
@@ -381,7 +382,7 @@ tun_dev_open(dev_t dev, const struct if_
rdomain = rtable_l2(p->p_p->ps_rtableid);
 
/* let's find or make an interface to work with */
-   while ((ifp = if_unit(name)) == NULL) {
+   while ((sc = tun_name_lookup(name)) == NULL) {
error = if_clone_create(name, rdomain);
switch (error) {
case 0: /* it's probably ours 

Re: rewritten vxlan(4)

2022-02-10 Thread David Gwynne
On Fri, Mar 05, 2021 at 05:09:29PM +1000, David Gwynne wrote:
> On Thu, Mar 04, 2021 at 03:36:19PM +1000, David Gwynne wrote:
> > as the subject says, this is a rewrite of vxlan(4).
> > 
> > vxlan(4) relies on bridge(4) to implement learning, but i want to be
> > able to remove bridge(4) one day. while working on veb(4), i wrote
> > the guts of a learning bridge implementation that is now used by veb(4),
> > bpe(4), and nvgre(4). that learning bridge code is now also used by
> > vxlan(4).
> > 
> > this means that a few of the modes that the manpage talks about are
> > different now. because vxlan doesnt need a bridge for learning, there's
> > no "multicast mode" anymore, it just does "dynamic mode" out of the box
> > when configured with a multicast destination address. there's no
> > multipoint mode now too.
> > 
> > another thing that's always bothered me about vxlan(4) is how it occupies
> > the "udp namespace" and steals packets from the udp stack.
> > the new code actually creates and binds udp sockets to handle the
> > vxlan packets. this means userland can't collide with a vxlan interface,
> > and you get to see that the port is in use in things like netstat. e.g.:
> > 
> > dlg@ikkaku ~$ ifconfig vxlan0
> > vxlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> > lladdr fe:e1:ba:d1:17:2a
> > index 11 llprio 3
> > encap: vnetid none parent aggr0 txprio 0 rxprio outer
> > groups: vxlan
> > tunnel: inet 192.0.2.36 port 4789 --> 239.0.0.1 ttl 1 nodf
> > Addresses (max cache: 100, timeout: 240):
> > inet 100.64.1.36 netmask 0xffffff00 broadcast 100.64.1.255
> > dlg@ikkaku ~$ netstat -na -f inet -p udp
> > Active Internet connections (including servers)
> > Proto   Recv-Q Send-Q  Local Address  Foreign Address   
> > udp  0  0  130.102.96.36.29742129.250.35.250.123
> > udp  0  0  130.102.96.36.8965 162.159.200.123.123   
> > udp  0  0  130.102.96.36.13189162.159.200.1.123 
> > udp  0  0  130.102.96.36.46580220.158.215.20.123
> > udp  0  0  130.102.96.36.23109103.38.121.36.123 
> > udp  0  0  239.0.0.1.4789 *.*   
> > udp  0  0  192.0.2.36.4789*.*   
> > 
> > ive also added loop prevention, ie, sending an interface's vxlan
> > packets over itself should fail rather than panic now.
> 
> here's an updated diff with a few fixes.
>

this diff better supports vxlan p2p and multicast vxlan configs that
share a UDP listener.

Index: net/if_vxlan.c
===
RCS file: /cvs/src/sys/net/if_vxlan.c,v
retrieving revision 1.83
diff -u -p -r1.83 if_vxlan.c
--- net/if_vxlan.c  10 Jan 2022 14:07:59 -  1.83
+++ net/if_vxlan.c  11 Feb 2022 05:11:13 -
@@ -1,7 +1,7 @@
-/* $OpenBSD: if_vxlan.c,v 1.83 2022/01/10 14:07:59 jan Exp $   */
+/* $OpenBSD$ */
 
 /*
- * Copyright (c) 2013 Reyk Floeter 
+ * Copyright (c) 2021 David Gwynne 
  *
  * Permission to use, copy, modify, and distribute this software for any
  * purpose with or without fee is hereby granted, provided that the above
@@ -17,475 +17,781 @@
  */
 
 #include "bpfilter.h"
-#include "vxlan.h"
-#include "vlan.h"
 #include "pf.h"
-#include "bridge.h"
 
 #include 
 #include 
+#include 
 #include 
 #include 
-#include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
-
-#if NBPFILTER > 0
-#include 
-#endif
+#include 
 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
+#include 
 
-#if NPF > 0
-#include 
+#ifdef INET6
+#include 
+#include 
+#include 
 #endif
 
-#if NBRIDGE > 0
+/* for bridge stuff */
 #include 
+#include 
+
+#if NBPFILTER > 0
+#include 
 #endif
 
-#include 
+/*
+ * The protocol.
+ */
+
+#define VXLANMTU   1492
+#define VXLAN_PORT 4789
+
+struct vxlan_header {
+   uint32_t        vxlan_flags;
+#define VXLAN_F_I  (1U << 27)
+   uint32_t        vxlan_id;
+#define VXLAN_VNI_SHIFT    8
+#define VXLAN_VNI_MASK     (0xffffffU << VXLAN_VNI_SHIFT)
+};
+
+#define VXLAN_VNI_MAX  0x00ffffffU
+#define VXLAN_VNI_MIN  0x00000000U
+
+/*
+ * The driver.
+ */
+
+union vxlan_addr {
+   struct in_addr  in4;
+   struct in6_addr in6;
+};
+
+struct vxlan_softc;
+
+struct vxlan_peer {
+   RBT_ENTRY(vxlan_peer)    p_entry;

Re: hardware checksum ix and ixl

2022-02-09 Thread David Gwynne
On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> Hi,
> 
> There were some problems with ix(4) and ixl(4) hardware checksumming
> for the output path on strict alignment architectures.
> 
> I have merged jan@'s diffs and added some sanity checks and
> workarounds.
> 
> > - If the first mbuf is not aligned or not contiguous, use m_copydata()
>   to extract the IP, IPv6, TCP header.
> - If the header is in the first mbuf, use m_data for the fast path.
> - Add netstat counter for invalid header chains.  This makes
>   us aware when hardware checksumming fails.
> - Add netstat counter for header copies.  This indicates that
>   better storage allocation in the network stack is possible.
>   It also allows to recognize alignment problems on non-strict
>   architectures.
> > - There is no risk of crashes on sparc64.
> 
> > Does this approach make sense?
> 
> ix(4) works quite well, but finds some UDP packets that need copy.
> ixl(4) has not been tested yet.  I would like to have some feedback
> for the idea first.
> 
> bluhm
> 
> Index: sys/dev/pci/if_ixl.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/pci/if_ixl.c,v
> retrieving revision 1.78
> diff -u -p -r1.78 if_ixl.c
> --- sys/dev/pci/if_ixl.c  9 Jan 2022 05:42:54 -   1.78
> +++ sys/dev/pci/if_ixl.c  25 Jan 2022 23:50:01 -
> @@ -1388,6 +1398,7 @@ static int  ixl_rxeof(struct ixl_softc *,
>  static void  ixl_rxfill(struct ixl_softc *, struct ixl_rx_ring *);
>  static void  ixl_rxrefill(void *);
>  static int   ixl_rxrinfo(struct ixl_softc *, struct if_rxrinfo *);
> +static void  ixl_rx_checksum(struct mbuf *, uint64_t);
>  
>  #if NKSTAT > 0
>  static void  ixl_kstat_attach(struct ixl_softc *);
> @@ -3190,6 +3318,7 @@ ixl_rxeof(struct ixl_softc *sc, struct i
>   m->m_pkthdr.csum_flags |= M_FLOWID;
>   }
>  
> + ixl_rx_checksum(m, word);
> >   ml_enqueue(&ml, m);
>   } else {
>   ifp->if_ierrors++; /* XXX */
> @@ -3320,6 +3449,23 @@ ixl_rxrinfo(struct ixl_softc *sc, struct
>   free(ifr, M_TEMP, ixl_nqueues(sc) * sizeof(*ifr));
>  
>   return (rv);
> +}
> +
> +static void
> +ixl_rx_checksum(struct mbuf *m, uint64_t word)
> +{
> + if (!ISSET(word, IXL_RX_DESC_L3L4P))
> + return;
> +
> + if (ISSET(word, IXL_RX_DESC_IPE))
> + return;
> +
> + m->m_pkthdr.csum_flags |= M_IPV4_CSUM_IN_OK;
> +
> + if (ISSET(word, IXL_RX_DESC_L4E))
> + return;
> +
> + m->m_pkthdr.csum_flags |= M_TCP_CSUM_IN_OK | M_UDP_CSUM_IN_OK;
>  }
>  
>  static int

this is ok by me. tested on both amd64 and sparc64.

dlg



Re: hardware checksum ix and ixl

2022-02-07 Thread David Gwynne
On Mon, Feb 07, 2022 at 02:39:38PM +1000, David Gwynne wrote:
> On Fri, Feb 04, 2022 at 09:29:56PM +1000, David Gwynne wrote:
> > On Fri, Jan 28, 2022 at 05:26:01PM +0100, Alexander Bluhm wrote:
> > > On Wed, Jan 26, 2022 at 11:05:51AM +0100, Claudio Jeker wrote:
> > > > On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> > > > > Hi,
> > > > > 
> > > > > There were some problems with ix(4) and ixl(4) hardware checksumming
> > > > > for the output path on strict alignment architectures.
> > > > > 
> > > > > I have merged jan@'s diffs and added some sanity checks and
> > > > > workarounds.
> > > > > 
> > > > > - If the first mbuf is not aligned or not contiguous, use m_copydata()
> > > > >   to extract the IP, IPv6, TCP header.
> > > > > - If the header is in the first mbuf, use m_data for the fast path.
> > > > > - Add netstat counter for invalid header chains.  This makes
> > > > >   us aware when hardware checksumming fails.
> > > > > - Add netstat counter for header copies.  This indicates that
> > > > >   better storage allocation in the network stack is possible.
> > > > >   It also allows to recognize alignment problems on non-strict
> > > > >   architectures.
> > > > > - There is no risk of crashes on sparc64.
> > > > > 
> > > > > Does this approach make sense?
> > > > 
> > > > I think it is overly complicated and too much data is copied around.
> > > > First of all there is no need to extract ipproto.
> > > > The code can just use the M_TCP_CSUM_OUT and M_UDP_CSUM_OUT flags (they
> > > > are not set for other protos).
> > > > Because of this only they ip_hlen needs to be accessed and this can be
> > > > done with m_getptr().
> > > 
> > > A solution with m_getptr() is where we started.  It has already
> > > been rejected.  The problem is that there are 6 ways to implement
> > > this feature.  Every option has its drawbacks and was rejected.
> > > 
> > > Options are:
> > > 1. m_getptr() and access the struct.  Alignment cannot be guaranteed.
> > > 2. m_getptr() and access the byte or word.  Header fields should be
> > >accessed by structs.
> > > 3. Always m_copydata.  Too much overhead.
> > > 4. Always use m_data.  Kernel may crash or use invalid data.
> > > 5. Combination of m_data and m_copydata.  Too complex.
> > > 6. Track the fields in mbuf header.  Too fragile and memory overhead.
> > > 
> > > In my measurements checksum offloading gave us 10% performance boost
> > > so I want this feature.  Other drivers also have it.
> > > 
> > > Could we get another idea or find a consensus which option to use?
> > 
> > after staring at this for a few hours my conclusion is option 1 actually
> > is the right approach, but the diff for ixl has a bug.
> > 
> > > > In the IP6 case even more can be skipped since ip_hlen is static for 
> > > > IPv6.
> > > > 
> > > > In ixl(4) also the tcp header length needs to be extracted. Again the 
> > > > code
> > > > can be simplified because HW checksumming is only enabled if ip_hlen == 
> > > > 5
> > > > and so the offset of the th_off field is static (for both IPv4 and 
> > > > IPv6).
> > > > Again m_getptr can be used to just access the byte with th_off.
> > > > 
> > > > Longterm in_proto_cksum_out() should probably help provide the th_off
> > > > field. I think enforcing ip_hlen == 5 for UDP and TCP is fine, who needs
> > > > IP options on UDP and TCP?
> > > 
> > > Other diffs have been rejected as they make too many assumptions how
> > > the stack works.
> > 
> > my opinion is we can make these assumptions. the CSUM_OUT flags are
> > only set in very specific places where the stack itself is constructing
> > or checking properly aligned headers, and then the stack maintains
> > this alignment until it reaches the driver. this is where the first
> > of the bugs in the ixl diff comes in.
> > 
> > in the diff ixl_tx_setup_offload() is called after the dma mapping
> > occurs. this is implemented in this code:
> > 
> > static inline int
> > ixl_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
> > {
> > int error;
> > 
> > error = bus_dmamap_load_mbuf(dmat

Re: hardware checksum ix and ixl

2022-02-06 Thread David Gwynne
On Fri, Feb 04, 2022 at 09:29:56PM +1000, David Gwynne wrote:
> On Fri, Jan 28, 2022 at 05:26:01PM +0100, Alexander Bluhm wrote:
> > On Wed, Jan 26, 2022 at 11:05:51AM +0100, Claudio Jeker wrote:
> > > On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> > > > Hi,
> > > > 
> > > > There were some problems with ix(4) and ixl(4) hardware checksumming
> > > > for the output path on strict alignment architectures.
> > > > 
> > > > I have merged jan@'s diffs and added some sanity checks and
> > > > workarounds.
> > > > 
> > > > - If the first mbuf is not aligned or not contiguous, use m_copydata()
> > > >   to extract the IP, IPv6, TCP header.
> > > > - If the header is in the first mbuf, use m_data for the fast path.
> > > > - Add netstat counter for invalid header chains.  This makes
> > > >   us aware when hardware checksumming fails.
> > > > - Add netstat counter for header copies.  This indicates that
> > > >   better storage allocation in the network stack is possible.
> > > >   It also allows to recognize alignment problems on non-strict
> > > >   architectures.
> > > > - There is no risk of crashes on sparc64.
> > > > 
> > > > Does this approach make sense?
> > > 
> > > I think it is overly complicated and too much data is copied around.
> > > First of all there is no need to extract ipproto.
> > > The code can just use the M_TCP_CSUM_OUT and M_UDP_CSUM_OUT flags (they
> > > are not set for other protos).
> > > Because of this only they ip_hlen needs to be accessed and this can be
> > > done with m_getptr().
> > 
> > A solution with m_getptr() is where we started.  It has already
> > been rejected.  The problem is that there are 6 ways to implement
> > this feature.  Every option has its drawbacks and was rejected.
> > 
> > Options are:
> > 1. m_getptr() and access the struct.  Alignment cannot be guaranteed.
> > 2. m_getptr() and access the byte or word.  Header fields should be
> >accessed by structs.
> > 3. Always m_copydata.  Too much overhead.
> > 4. Always use m_data.  Kernel may crash or use invalid data.
> > 5. Combination of m_data and m_copydata.  Too complex.
> > 6. Track the fields in mbuf header.  Too fragile and memory overhead.
> > 
> > In my measurements checksum offloading gave us 10% performance boost
> > so I want this feature.  Other drivers also have it.
> > 
> > Could we get another idea or find a consensus which option to use?
> 
> after staring at this for a few hours my conclusion is option 1 actually
> is the right approach, but the diff for ixl has a bug.
> 
> > > In the IP6 case even more can be skipped since ip_hlen is static for IPv6.
> > > 
> > > In ixl(4) also the tcp header length needs to be extracted. Again the code
> > > can be simplified because HW checksumming is only enabled if ip_hlen == 5
> > > and so the offset of the th_off field is static (for both IPv4 and IPv6).
> > > Again m_getptr can be used to just access the byte with th_off.
> > > 
> > > Longterm in_proto_cksum_out() should probably help provide the th_off
> > > field. I think enforcing ip_hlen == 5 for UDP and TCP is fine, who needs
> > > IP options on UDP and TCP?
> > 
> > Other diffs have been rejected as they make too many assumptions how
> > the stack works.
> 
> my opinion is we can make these assumptions. the CSUM_OUT flags are
> only set in very specific places where the stack itself is constructing
> or checking properly aligned headers, and then the stack maintains
> this alignment until it reaches the driver. this is where the first
> of the bugs in the ixl diff comes in.
> 
> in the diff ixl_tx_setup_offload() is called after the dma mapping
> occurs. this is implemented in this code:
> 
> static inline int
> ixl_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
> {
>   int error;
> 
>   error = bus_dmamap_load_mbuf(dmat, map, m,
>   BUS_DMA_STREAMING | BUS_DMA_NOWAIT);
>   if (error != EFBIG)
>   return (error);
> 
>   error = m_defrag(m, M_DONTWAIT);
>   if (error != 0)
>   return (error);
> 
>   return (bus_dmamap_load_mbuf(dmat, map, m,
>   BUS_DMA_STREAMING | BUS_DMA_NOWAIT));
> }
> 
> the problem is that when we get a heavily fragmented mbuf we call
> m_defrag, which basically allocates a cluster and copies all the
> data from the fragments into it. howeve

Re: convert bgpd to stdint.h types

2022-02-04 Thread David Gwynne
ok

On Sat, 5 Feb 2022, 01:08 Claudio Jeker,  wrote:

> This is something I wanted to do for a while. Switch from u_intXY_t to
> uintXY_t from stdint.h. The diff is mostly mechanical and was done with
> sed -i 's/u_intX_t/uintX_t/g' but uint8_t changes the tab spacing and so
> I had a look over the code and reindented where it made sense.
> Using stdint.h types will mostly help portability.
>
> Sorry for the size of this diff.
> --
> :wq Claudio
>
> Index: bgpctl/bgpctl.c
> ===
> RCS file: /cvs/src/usr.sbin/bgpctl/bgpctl.c,v
> retrieving revision 1.274
> diff -u -p -r1.274 bgpctl.c
> --- bgpctl/bgpctl.c 4 Feb 2022 12:01:33 -   1.274
> +++ bgpctl/bgpctl.c 4 Feb 2022 14:31:26 -
> @@ -54,9 +54,9 @@ void   show_mrt_dump(struct mrt_rib *, s
>  void        network_mrt_dump(struct mrt_rib *, struct mrt_peer *,
>                 void *);
>  void        show_mrt_state(struct mrt_bgp_state *, void *);
>  void        show_mrt_msg(struct mrt_bgp_msg *, void *);
> -const char *msg_type(u_int8_t);
> +const char *msg_type(uint8_t);
>  void        network_bulk(struct parse_result *);
> -int         match_aspath(void *, u_int16_t, struct filter_as *);
> +int         match_aspath(void *, uint16_t, struct filter_as *);
>
>  struct imsgbuf *ibuf;
>  struct mrt_parser show_mrt = { show_mrt_dump, show_mrt_state,
> show_mrt_msg };
> @@ -624,7 +624,7 @@ fmt_monotime(time_t t)
>  }
>
>  const char *
> -fmt_fib_flags(u_int16_t flags)
> +fmt_fib_flags(uint16_t flags)
>  {
> static char buf[8];
>
> @@ -665,7 +665,7 @@ fmt_fib_flags(u_int16_t flags)
>  }
>
>  const char *
> -fmt_origin(u_int8_t origin, int sum)
> +fmt_origin(uint8_t origin, int sum)
>  {
> switch (origin) {
> case ORIGIN_IGP:
> @@ -680,7 +680,7 @@ fmt_origin(u_int8_t origin, int sum)
>  }
>
>  const char *
> -fmt_flags(u_int8_t flags, int sum)
> +fmt_flags(uint8_t flags, int sum)
>  {
> static char buf[80];
> char flagstr[5];
> @@ -723,7 +723,7 @@ fmt_flags(u_int8_t flags, int sum)
>  }
>
>  const char *
> -fmt_ovs(u_int8_t validation_state, int sum)
> +fmt_ovs(uint8_t validation_state, int sum)
>  {
> switch (validation_state) {
> case ROA_INVALID:
> @@ -747,7 +747,7 @@ fmt_mem(long long num)
>  }
>
>  const char *
> -fmt_errstr(u_int8_t errcode, u_int8_t subcode)
> +fmt_errstr(uint8_t errcode, uint8_t subcode)
>  {
> static char  errbuf[256];
> const char  *errstr = NULL;
> @@ -814,7 +814,7 @@ fmt_errstr(u_int8_t errcode, u_int8_t su
>  }
>
>  const char *
> -fmt_attr(u_int8_t type, int flags)
> +fmt_attr(uint8_t type, int flags)
>  {
>  #define CHECK_FLAGS(s, t, m)   \
> if (((s) & ~(ATTR_DEFMASK | (m))) != (t)) pflags = 1
> @@ -909,7 +909,7 @@ fmt_attr(u_int8_t type, int flags)
>  }
>
>  const char *
> -fmt_community(u_int16_t a, u_int16_t v)
> +fmt_community(uint16_t a, uint16_t v)
>  {
> static char buf[12];
>
> @@ -936,7 +936,7 @@ fmt_community(u_int16_t a, u_int16_t v)
>  }
>
>  const char *
> -fmt_large_community(u_int32_t d1, u_int32_t d2, u_int32_t d3)
> +fmt_large_community(uint32_t d1, uint32_t d2, uint32_t d3)
>  {
> static char buf[33];
>
> @@ -945,14 +945,14 @@ fmt_large_community(u_int32_t d1, u_int3
>  }
>
>  const char *
> -fmt_ext_community(u_int8_t *data)
> +fmt_ext_community(uint8_t *data)
>  {
> static char buf[32];
> -   u_int64_t   ext;
> +   uint64_t    ext;
> struct in_addr  ip;
> -   u_int32_t   as4, u32;
> -   u_int16_t   as2, u16;
> -   u_int8_t    type, subtype;
> +   uint32_t    as4, u32;
> +   uint16_t    as2, u16;
> +   uint8_t type, subtype;
>
> type = data[0];
> subtype = data[1];
> @@ -1057,7 +1057,7 @@ network_bulk(struct parse_result *res)
> char *line = NULL;
> size_t linesize = 0;
> ssize_t linelen;
> -   u_int8_t len;
> +   uint8_t len;
> FILE *f;
>
> if ((f = fdopen(STDIN_FILENO, "r")) == NULL)
> @@ -1108,7 +1108,7 @@ show_mrt_dump_neighbors(struct mrt_rib *
>  {
> struct mrt_peer_entry *p;
> struct in_addr ina;
> -   u_int16_t i;
> +   uint16_t i;
>
> ina.s_addr = htonl(mp->bgp_id);
> printf("view: %s BGP ID: %s Number of peers: %u\n\n",
> @@ -1132,7 +1132,7 @@ show_mrt_dump(struct mrt_rib *mr, struct
> struct ctl_show_rib_request *req = arg;
> struct mrt_rib_entry    *mre;
> time_t   now;
> -   u_int16_t    i, j;
> +   uint16_t i, j;
>
> memset(&res, 0, sizeof(res));
> res.flags = req->flags;
> @@ -1214,7 +1214,7 @@ network_mrt_dump(struct mrt_rib *mr, str
> struct mrt_rib_entry    *mre;
> struct ibuf *msg;
> time_t   now;
> - 

Re: always align data in m_pullup on all archs

2022-02-04 Thread David Gwynne
On Fri, Feb 04, 2022 at 11:39:56AM +0100, Alexander Bluhm wrote:
> On Fri, Feb 04, 2022 at 07:32:52PM +1000, David Gwynne wrote:
> > as discussed in "m_pullup alingment crash armv7 sparc64", at worst it
> > doesnt hurt to have m_pullup maintain the data alignment of payloads,
> > and at best it will encourage aligned loads even if the arch allows
> > unaligned accesses. aligned loads are faster than unaligned.
> >
> > ok?
> 
> adj is unsigned int, so assigning a mtod(m0, unsigned long) looks
> strange.  Of course the higher bits are cut off anyway, but an
> explicit & is clearer than a assingment with different types.
> 
> Please use
> 
>   adj = mtod(m, unsigned long) & (sizeof(long) - 1);
>   adj = mtod(m0, unsigned long) & (sizeof(long) - 1);
> 
> otherwise OK bluhm@

it's been pointed out to me that ALIGNBYTES is actually very
conservative about alignment rather than optimistic. ie, it's at least
sizeof(long) - 1, but can be bigger. i think i've been confusing
it with ALIGNED_POINTER.
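
the difference, roughly (illustrative stand-ins only; the real
<machine/param.h> macros are per-arch):

	#include <stdio.h>

	/* stand-ins, not the actual kernel definitions */
	#define ALIGNBYTES		(sizeof(long) - 1)
	#define ALIGNED_POINTER(p, t)	\
		((((unsigned long)(p)) & (sizeof(t) - 1)) == 0)

	int
	main(void)
	{
		long buf[2];
		char *p = (char *)buf + 2;

		/* fine for 2 byte loads, not for 4 byte ones */
		printf("u16 %d u32 %d\n",
		    ALIGNED_POINTER(p, unsigned short),
		    ALIGNED_POINTER(p, unsigned int));
		return (0);
	}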

> 
> > Index: uipc_mbuf.c
> > ===
> > RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v
> > retrieving revision 1.280
> > diff -u -p -r1.280 uipc_mbuf.c
> > --- uipc_mbuf.c 18 Jan 2022 12:38:21 -  1.280
> > +++ uipc_mbuf.c 4 Feb 2022 09:30:02 -
> > @@ -945,9 +945,11 @@ m_pullup(struct mbuf *m0, int len)
> > goto freem0;
> > }
> >
> > -   adj = mtod(m, unsigned long) & ALIGNBYTES;
> > +   adj = mtod(m, unsigned long);
> > } else
> > -   adj = mtod(m0, unsigned long) & ALIGNBYTES;
> > +   adj = mtod(m0, unsigned long);
> > +
> > +   adj &= sizeof(long) - 1;
> >
> > tail = head + M_SIZE(m0);
> > head += adj;



Re: openbgpd vs illumos

2022-02-04 Thread David Gwynne
Ok

On Fri, 4 Feb 2022, 21:19 Claudio Jeker,  wrote:

> On illumos sun is defined by some header so better not use sun as a
> variable name. Rename variable to sa_un to reduce hacks in -portable.
>
> --
> :wq Claudio
>
> Index: bgpctl/bgpctl.c
> ===
> RCS file: /cvs/src/usr.sbin/bgpctl/bgpctl.c,v
> retrieving revision 1.273
> diff -u -p -r1.273 bgpctl.c
> --- bgpctl/bgpctl.c 9 Aug 2021 08:24:36 -   1.273
> +++ bgpctl/bgpctl.c 4 Feb 2022 11:10:31 -
> @@ -78,7 +78,7 @@ usage(void)
>  int
>  main(int argc, char *argv[])
>  {
> -   struct sockaddr_un   sun;
> +   struct sockaddr_un   sa_un;
> int  fd, n, done, ch, verbose = 0;
> struct imsg  imsg;
> struct network_config    net;
> @@ -160,12 +160,12 @@ main(int argc, char *argv[])
> if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
> err(1, "control_init: socket");
>
> -   bzero(&sun, sizeof(sun));
> -   sun.sun_family = AF_UNIX;
> -   if (strlcpy(sun.sun_path, sockname, sizeof(sun.sun_path)) >=
> -   sizeof(sun.sun_path))
> +   bzero(&sa_un, sizeof(sa_un));
> +   sa_un.sun_family = AF_UNIX;
> +   if (strlcpy(sa_un.sun_path, sockname, sizeof(sa_un.sun_path)) >=
> +   sizeof(sa_un.sun_path))
> errx(1, "socket name too long");
> -   if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == -1)
> +   if (connect(fd, (struct sockaddr *)&sa_un, sizeof(sa_un)) == -1)
> err(1, "connect: %s", sockname);
>
> if (pledge("stdio", NULL) == -1)
> Index: bgpd/control.c
> ===
> RCS file: /cvs/src/usr.sbin/bgpd/control.c,v
> retrieving revision 1.105
> diff -u -p -r1.105 control.c
> --- bgpd/control.c  27 Apr 2021 15:34:18 -  1.105
> +++ bgpd/control.c  4 Feb 2022 11:07:25 -
> @@ -42,19 +42,19 @@ ssize_t  imsg_read_nofd(struct imsgbuf
>  int
>  control_check(char *path)
>  {
> -   struct sockaddr_un   sun;
> +   struct sockaddr_un   sa_un;
> int  fd;
>
> -   bzero(&sun, sizeof(sun));
> -   sun.sun_family = AF_UNIX;
> -   strlcpy(sun.sun_path, path, sizeof(sun.sun_path));
> +   bzero(&sa_un, sizeof(sa_un));
> +   sa_un.sun_family = AF_UNIX;
> +   strlcpy(sa_un.sun_path, path, sizeof(sa_un.sun_path));
>
> if ((fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0)) == -1) {
> log_warn("%s: socket", __func__);
> return (-1);
> }
>
> -   if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == 0) {
> +   if (connect(fd, (struct sockaddr *)&sa_un, sizeof(sa_un)) == 0) {
> log_warnx("control socket %s already in use", path);
> close(fd);
> return (-1);
> @@ -68,7 +68,7 @@ control_check(char *path)
>  int
>  control_init(int restricted, char *path)
>  {
> -   struct sockaddr_un   sun;
> +   struct sockaddr_un   sa_un;
> int  fd;
> mode_t   old_umask, mode;
>
> @@ -78,10 +78,10 @@ control_init(int restricted, char *path)
> return (-1);
> }
>
> -   bzero(&sun, sizeof(sun));
> -   sun.sun_family = AF_UNIX;
> -   if (strlcpy(sun.sun_path, path, sizeof(sun.sun_path)) >=
> -   sizeof(sun.sun_path)) {
> +   bzero(&sa_un, sizeof(sa_un));
> +   sa_un.sun_family = AF_UNIX;
> +   if (strlcpy(sa_un.sun_path, path, sizeof(sa_un.sun_path)) >=
> +   sizeof(sa_un.sun_path)) {
> log_warn("control_init: socket name too long");
> close(fd);
> return (-1);
> @@ -102,7 +102,7 @@ control_init(int restricted, char *path)
> mode = S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP;
> }
>
> -   if (bind(fd, (struct sockaddr *)&sun, sizeof(sun)) == -1) {
> +   if (bind(fd, (struct sockaddr *)&sa_un, sizeof(sa_un)) == -1) {
> log_warn("control_init: bind: %s", path);
> close(fd);
> umask(old_umask);
> @@ -159,12 +159,12 @@ control_accept(int listenfd, int restric
>  {
> int  connfd;
> socklen_t    len;
> -   struct sockaddr_un   sun;
> +   struct sockaddr_un   sa_un;
> struct ctl_conn *ctl_conn;
>
> -   len = sizeof(sun);
> +   len = sizeof(sa_un);
> if ((connfd = accept4(listenfd,
> -   (struct sockaddr *)&sun, &len,
> +   (struct sockaddr *)&sa_un, &len,
> SOCK_NONBLOCK | SOCK_CLOEXEC)) == -1) {
> if (errno == ENFILE || errno == EMFILE) {
> pauseaccept = getmonotime();
>
>


Re: hardware checksum ix and ixl

2022-02-04 Thread David Gwynne
On Fri, Jan 28, 2022 at 05:26:01PM +0100, Alexander Bluhm wrote:
> On Wed, Jan 26, 2022 at 11:05:51AM +0100, Claudio Jeker wrote:
> > On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> > > Hi,
> > > 
> > > There were some problems with ix(4) and ixl(4) hardware checksumming
> > > for the output path on strict alignment architectures.
> > > 
> > > I have merged jan@'s diffs and added some sanity checks and
> > > workarounds.
> > > 
> > > - If the first mbuf is not aligned or not contiguous, use m_copydata()
> > >   to extract the IP, IPv6, TCP header.
> > > - If the header is in the first mbuf, use m_data for the fast path.
> > > - Add netstat counter for invalid header chains.  This makes
> > >   us aware when hardware checksumming fails.
> > > - Add netstat counter for header copies.  This indicates that
> > >   better storage allocation in the network stack is possible.
> > >   It also allows to recognize alignment problems on non-strict
> > >   architectures.
> > > - There is no risk of crashes on sparc64.
> > > 
> > > Does this approach make sense?
> > 
> > I think it is overly complicated and too much data is copied around.
> > First of all there is no need to extract ipproto.
> > The code can just use the M_TCP_CSUM_OUT and M_UDP_CSUM_OUT flags (they
> > are not set for other protos).
> > Because of this only they ip_hlen needs to be accessed and this can be
> > done with m_getptr().
> 
> A solution with m_getptr() is where we started.  It has already
> been rejected.  The problem is that there are 6 ways to implement
> this feature.  Every option has its drawbacks and was rejected.
> 
> Options are:
> 1. m_getptr() and access the struct.  Alignment cannot be guaranteed.
> 2. m_getptr() and access the byte or word.  Header fields should be
>accessed by structs.
> 3. Always m_copydata.  Too much overhead.
> 4. Always use m_data.  Kernel may crash or use invalid data.
> 5. Combination of m_data and m_copydata.  Too complex.
> 6. Track the fields in mbuf header.  Too fragile and memory overhead.
> 
> In my measurements checksum offloading gave us 10% performance boost
> so I want this feature.  Other drivers also have it.
> 
> Could we get another idea or find a consensus which option to use?

after staring at this for a few hours my conclusion is option 1 actually
is the right approach, but the diff for ixl has a bug.

> > In the IP6 case even more can be skipped since ip_hlen is static for IPv6.
> > 
> > In ixl(4) also the tcp header length needs to be extracted. Again the code
> > can be simplified because HW checksumming is only enabled if ip_hlen == 5
> > and so the offset of the th_off field is static (for both IPv4 and IPv6).
> > Again m_getptr can be used to just access the byte with th_off.
> > 
> > Longterm in_proto_cksum_out() should probably help provide the th_off
> > field. I think enforcing ip_hlen == 5 for UDP and TCP is fine, who needs
> > IP options on UDP and TCP?
> 
> Other diffs have been rejected as they make too many assumptions how
> the stack works.

my opinion is we can make these assumptions. the CSUM_OUT flags are
only set in very specific places where the stack itself is constructing
or checking properly aligned headers, and then the stack maintains
this alignment until it reaches the driver. this is where the first
of the bugs in the ixl diff comes in.

in the diff ixl_tx_setup_offload() is called after the dma mapping
occurs. this is implemented in this code:

static inline int
ixl_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
{
int error;

error = bus_dmamap_load_mbuf(dmat, map, m,
BUS_DMA_STREAMING | BUS_DMA_NOWAIT);
if (error != EFBIG)
return (error);

error = m_defrag(m, M_DONTWAIT);
if (error != 0)
return (error);

return (bus_dmamap_load_mbuf(dmat, map, m,
BUS_DMA_STREAMING | BUS_DMA_NOWAIT));
}

the problem is that when we get a heavily fragmented mbuf we call
m_defrag, which basically allocates a cluster and copies all the
data from the fragments into it. however, m_defrag does not maintain
the alignment of payload, meaning the ethernet header gets to start
on a 4 byte boundary cos that's what we get from the cluster pool,
and the IP and TCP payloads end up with a 2 byte misalignment.
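
in userland terms the arithmetic looks like this (0x10000 just stands
in for the nicely aligned buffer the cluster pool hands back):

	#include <stdio.h>

	#define ETHER_HDR_LEN	14

	int
	main(void)
	{
		unsigned long cluster = 0x10000;
		unsigned long iphdr = cluster + ETHER_HDR_LEN;

		/* 14 & 3 == 2, ie, the IP header sits on a 2 byte offset */
		printf("ip header misalignment: %lu\n", iphdr & 3UL);
		return (0);
	}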

my proposed solution to this is to set up the offload bits before
calling ixl_load_mbuf. i'm confident this is the bug that deraadt@ hit.

because of when the CSUM_OUT flags are set, i think m_getptr is fine for
pulling the mbuf apart. if it's not i want the code to blow up so it
gets fixed. that's why the code in my diff below lacks defenses.

the other bug is that the uint64_t holding the offload flags isn't
reset between being called for different mbufs. my solution to that is
to have ixl_tx_setup_offload() return the flags so the value in
ixl_start is overwritten every time.

lastly, ixl shouldn't (doesn't need to?) spelunk in the packet if

always align data in m_pullup on all archs

2022-02-04 Thread David Gwynne
as discussed in "m_pullup alingment crash armv7 sparc64", at worst it
doesnt hurt to have m_pullup maintain the data alignment of payloads,
and at best it will encourage aligned loads even if the arch allows
unaligned accesses. aligned loads are faster than unaligned.

ok?

Index: uipc_mbuf.c
===
RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v
retrieving revision 1.280
diff -u -p -r1.280 uipc_mbuf.c
--- uipc_mbuf.c 18 Jan 2022 12:38:21 -  1.280
+++ uipc_mbuf.c 4 Feb 2022 09:30:02 -
@@ -945,9 +945,11 @@ m_pullup(struct mbuf *m0, int len)
goto freem0;
}
 
-   adj = mtod(m, unsigned long) & ALIGNBYTES;
+   adj = mtod(m, unsigned long);
} else
-   adj = mtod(m0, unsigned long) & ALIGNBYTES;
+   adj = mtod(m0, unsigned long);
+
+   adj &= sizeof(long) - 1;
 
tail = head + M_SIZE(m0);
head += adj;



Re: m_pullup alingment crash armv7 sparc64

2022-02-04 Thread David Gwynne
On Thu, Feb 03, 2022 at 11:39:50PM +0100, Alexander Bluhm wrote:
> On Thu, Feb 03, 2022 at 10:22:46PM +1000, David Gwynne wrote:
> > bpf should know better than this. it has all the information it needs to
> > align the payload properly, it just doesnt make enough of an effort. can
> > you try the diff below?
> 
> Diff seems to work.  regress/sbin/slaacd passes.
> I have started a full regress run on armv7 and sparc64.

cool.

> 
> > +   if (len < hlen)
> > +   return (EPERM);
> 
> This is not a permission problem.  Maybe EINVAL.

this is preserving the existing semantics of this chunk:

@@ -232,10 +247,6 @@ bpf_movein(struct uio *uio, struct bpf_d
goto bad;
}
 
-   if (m->m_len < hlen) {
-   error = EPERM;
-   goto bad;
-   }
/*
 * Make room for link header, and copy it to sockaddr
 */

i was going to follow this diff up with another one to tweak it.
tun(4) uses EMSGSIZE for this, but errno(2) says EMSGSIZE means
"Message too long". EINVAL sounds good.

want me to do it now?

> 
> > +   /*
> > + * Get the length of the payload so we can align it properly.
> > +*/
> 
> Whitespace misaligned.

yep.

> > +   if (mlen > MAXMCLBYTES)
> > +   return (EIO);
> 
> Should we use EMSGSIZE like in bpfwrite() if the data does not fit.

yes. want me to do that now too?

> 
> > +   mlen = max(max_linkhdr, hlen) + roundup(alen, sizeof(long));
> ...
> > +   m_align(m, alen); /* Align the payload. */
> 
> sizeof(long) seems the correct alignment for mbuf.  In rev 1.237
> of m_pullup() you introduced ALIGNBYTES.  This is architecture
> dependent and the only place in the network stack where it is used.
> Should it also be sizeof(long) there?

i probably did that when i was adding ALIGNED_POINTER checks to ethernet
tunnel drivers.

my initial thought is it doesnt save m_pullup any work, so it's
probably better to use sizeof(long) so there's a better chance that
accesses to words in a pulled up payload are aligned. even if an
arch doesnt fault on unaligned access it still goes faster with
aligned accesses. if m_pullup is going to move data around it may as
well move it to the best place possible.

i'll spit a diff out for that next.

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.210
diff -u -p -r1.210 bpf.c
--- bpf.c   16 Jan 2022 06:27:14 -  1.210
+++ bpf.c   4 Feb 2022 09:25:40 -
@@ -144,7 +144,7 @@ bpf_movein(struct uio *uio, struct bpf_d
struct mbuf *m;
struct m_tag *mtag;
int error;
-   u_int hlen;
+   u_int hlen, alen, mlen;
u_int len;
u_int linktype;
u_int slen;
@@ -198,23 +198,38 @@ bpf_movein(struct uio *uio, struct bpf_d
return (EIO);
}
 
-   if (uio->uio_resid > MAXMCLBYTES)
-   return (EIO);
len = uio->uio_resid;
+   if (len < hlen)
+   return (EPERM);
 
-   MGETHDR(m, M_WAIT, MT_DATA);
-   m->m_pkthdr.ph_ifidx = 0;
-   m->m_pkthdr.len = len - hlen;
+   /*
+* Get the length of the payload so we can align it properly.
+*/
+   alen = len - hlen;
+
+   /*
+* Allocate enough space for headers and the aligned payload.
+*/
+   mlen = max(max_linkhdr, hlen) + roundup(alen, sizeof(long));
+
+   if (mlen > MAXMCLBYTES)
+   return (EIO);
 
-   if (len > MHLEN) {
-   MCLGETL(m, M_WAIT, len);
+   MGETHDR(m, M_WAIT, MT_DATA);
+   if (mlen > MHLEN) {
+   MCLGETL(m, M_WAIT, mlen);
if ((m->m_flags & M_EXT) == 0) {
error = ENOBUFS;
goto bad;
}
}
+
+   m_align(m, alen); /* Align the payload. */
+   m->m_data -= hlen;
+
+   m->m_pkthdr.ph_ifidx = 0;
+   m->m_pkthdr.len = len;
m->m_len = len;
-   *mp = m;
 
error = uiomove(mtod(m, caddr_t), len, uio);
if (error)
@@ -232,10 +247,6 @@ bpf_movein(struct uio *uio, struct bpf_d
goto bad;
}
 
-   if (m->m_len < hlen) {
-   error = EPERM;
-   goto bad;
-   }
/*
 * Make room for link header, and copy it to sockaddr
 */
@@ -249,8 +260,10 @@ bpf_movein(struct uio *uio, struct bpf_d
sockp->sa_family = ntohl(af);
} else
memcpy(sockp->sa_data, m->m_data, hlen);
+
+   m->m_pkthdr.len -= hlen;
m->m_len -= hlen;
-   m->m_data += hlen; /* XXX */
+   m->m_data += hlen;
}
 
/*
@@ -260,6 +273,7 @@ bpf_movein(struct uio *uio, struct bpf_d
*(u_int *)(mtag + 1) = linktype;
m_tag_prepend(m, mtag);
 
+   *mp = m;
return (0);
  bad:
m_freem(m);
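
as a worked example of the sizing above: a 60 byte DLT_EN10MB write has
hlen = 14, so alen = 46 and mlen = max(max_linkhdr, 14) + roundup(46,
sizeof(long)). with a hypothetical max_linkhdr of 64 and 8 byte longs
that is 64 + 48 = 112 bytes, and m_align() then places the payload so
it starts on a long boundary.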



Re: m_pullup alingment crash armv7 sparc64

2022-02-03 Thread David Gwynne
On Wed, Feb 02, 2022 at 10:30:24PM +0100, Alexander Bluhm wrote:
> Hi,
> 
> With the new m_pullup() code, I see crashes on armv7 and sparc64.
> regress/sbin/slaacd sends some IPv6 ND6 packets over pair(4) from
> scapy.  It crashes after m_pullup() in ipv6_check().
> 
==== run-parse-ra ====
> ifconfig pair1 destroy 2>/dev/null || true
> ifconfig pair2 destroy 2>/dev/null || true
> ifconfig pair1 rdomain 1 10.11.12.1/24 up
> ifconfig pair2 rdomain 1 10.11.12.2/24 up
> ifconfig pair1 rdomain 1 patch pair2
> ifconfig pair1 inet6 rdomain 1 eui64
> ifconfig pair2 inet6 rdomain 1 eui64
> ifconfig pair2 inet6 rdomain 1 autoconf
> route -nq -T 1 add -host default 10.11.12.1 -reject
> route -T1 exec python3 -B -u /usr/src/regress/sbin/slaacd/process_ra.py  
> pair1 pair2 /dev/slaacd.sock
> Timeout, server ot21 not responding.
> 
> panic: trap type 0x34 (mem address not aligned): pc=14a1a7c npc=14a1a80 
> pstate=99820006
> Stopped at  db_enter+0x8:   nop
TID    PID    UID    PRFLAGS    PFLAGS    CPU    COMMAND
> *317234  41858  0 0x14000  0x2000  softnet
>  416946  63125  0 0x14000  0x2001  systq
> trap(40025e91650, 34, 14a1a7c, 99820006, 180, 3b9ac800) at trap+0x328
> Lslowtrap_reenter(40005090d00, 0, fffe, 1832866, 6000, 6) at 
> Lslowtrap_reenter+0xf8
> ipv6_check(40005364000, 40005090d00, 0, 2018000, e0018000, 1a) at 
> ipv6_check+0x21c
> ip6_input_if(40025e91a68, 40025e91a74, 29, 0, 40005364000, 3b9ac800) at 
> ip6_input_if+0x7c
> ipv6_input(40005364000, 198a2f0, 40005090d00, 1, e, 0) at ipv6_input+0x3c
> ether_input(40005364000, 40005090d00, 20, 1780210, 0, 1780210) at 
> ether_input+0x274
> if_input_process(40005364000, 40025e91ce0, 20, 1780210, 0, 6) at 
> if_input_process+0x44
> ifiq_process(40005364410, 40025e91dc8, 1cd6288, 1cc1840, 1c00, 6) at 
> ifiq_process+0x68
> taskq_thread(4000432e080, 40004300810, 1775908, 480, 180, 3b9ac800) at 
> taskq_thread+0x7c
> proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x14
> 
> bpfwrite() calls bpf_movein() which fills the mbuf with these
> wonderful comments:
> 
> case DLT_EN10MB:
> sockp->sa_family = AF_UNSPEC;
> /* XXX Would MAXLINKHDR be better? */
> hlen = ETHER_HDR_LEN;
> ...
> m->m_data += hlen; /* XXX */
> 
> Note that ETHER_HDR_LEN is 14, so the mbuf is not 4 byte aligned.
> Next ether_encap() makes room for an aligned ethernet header.  This
> is 16 bytes and does not fit.  So we prepend a new empty mbuf.
> 
> m = m_prepend(m, ETHER_ALIGN + sizeof(eh), M_DONTWAIT);
> 
> When we reach ip6_input(), the ethernet header was added and removed.
> So we still have an empty mbuf in front of an mbuf with an unaligned
> IPv6 header.
> 
> New m_pullup() code preserves the alignment of the second mbuf data.
> During review we agreed that this might be a good idea.  Mbuf
> alignment is very fragile, but it works unless you change anything.
> So back to the old behavior, align the data at the beginning of the
> empty mbuf.
> 
> ok?

bpf should know better than this. it has all the information it needs to
align the payload properly, it just doesnt make enough of an effort. can
you try the diff below?

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.210
diff -u -p -r1.210 bpf.c
--- bpf.c   16 Jan 2022 06:27:14 -  1.210
+++ bpf.c   3 Feb 2022 12:21:33 -
@@ -144,7 +144,7 @@ bpf_movein(struct uio *uio, struct bpf_d
struct mbuf *m;
struct m_tag *mtag;
int error;
-   u_int hlen;
+   u_int hlen, alen, mlen;
u_int len;
u_int linktype;
u_int slen;
@@ -198,23 +198,38 @@ bpf_movein(struct uio *uio, struct bpf_d
return (EIO);
}
 
-   if (uio->uio_resid > MAXMCLBYTES)
-   return (EIO);
len = uio->uio_resid;
+   if (len < hlen)
+   return (EPERM);
 
-   MGETHDR(m, M_WAIT, MT_DATA);
-   m->m_pkthdr.ph_ifidx = 0;
-   m->m_pkthdr.len = len - hlen;
+   /*
+ * Get the length of the payload so we can align it properly.
+*/
+   alen = len - hlen;
+
+   /*
+* Allocate enough space for headers and the aligned payload.
+*/
+   mlen = max(max_linkhdr, hlen) + roundup(alen, sizeof(long));
+
+   if (mlen > MAXMCLBYTES)
+   return (EIO);
 
-   if (len > MHLEN) {
-   MCLGETL(m, M_WAIT, len);
+   MGETHDR(m, M_WAIT, MT_DATA);
+   if (mlen > MHLEN) {
+   MCLGETL(m, M_WAIT, mlen);
if ((m->m_flags & M_EXT) == 0) {
error = ENOBUFS;
goto bad;
}
}
+
+   m_align(m, alen); /* Align the payload. */
+   m->m_data -= hlen;
+
+   m->m_pkthdr.ph_ifidx = 0;
+   m->m_pkthdr.len = len;
m->m_len = len;
-   *mp = m;
 
error = 

Re: don't try and wakeup select/poll/kq in tun/bpf close

2022-01-15 Thread David Gwynne
On Sun, Jan 16, 2022 at 06:17:16AM +, Visa Hankala wrote:
> On Sun, Jan 16, 2022 at 02:41:46PM +1000, David Gwynne wrote:
> > if you're in bpfclose or tun/tap close, you're the last one out. this
> > means that there shouldn't be anything else in poll/select/kevent/etc
> > because you're the last one out.
> > 
> > from what i can tell, tun and bpf are the only drivers that do this, and
> > i dont think they need to.
> 
> The last one out holds when the device closing is triggered by the
> releasing of the last file reference. However, bpf/tun/tap close can
> also be called when the device is detached through VOP_REVOKE(). In
> that case any associated file descriptors remain open.

iirc, revoke can call fooclose() before it swaps the vnode ops out for
the dead ones. so yes, the fd is open but its state becomes divorced
from the special device it was originally associated with.

> I think poll/select/kevent should wake up if the device is detached.
> kevent(2) and kqueue-based select(2) get notified as a result of
> klist_invalidate().
> 
> There is also the SIGIO case, but only bpf close raises the signal.
> 
> To be on the safer side, I would wait until poll(2) uses the kqueue backend.

grumble grumble. yes.

or the detach code that calls revoke and klist_invalidate should also do
selwakeup...



don't try and wakeup select/poll/kq in tun/bpf close

2022-01-15 Thread David Gwynne
if you're in bpfclose or tun/tap close, you're the last one out. this
means that there shouldn't be anything else in poll/select/kevent/etc
because you're the last one out.

from what i can tell, tun and bpf are the only drivers that do this, and
i dont think they need to.

ok?

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.209
diff -u -p -r1.209 bpf.c
--- bpf.c   13 Jan 2022 14:15:27 -  1.209
+++ bpf.c   16 Jan 2022 04:33:02 -
@@ -401,7 +401,6 @@ bpfclose(dev_t dev, int flag, int mode, 
d = bpfilter_lookup(minor(dev));
mtx_enter(&d->bd_mtx);
bpf_detachd(d);
-   bpf_wakeup(d);
LIST_REMOVE(d, bd_list);
mtx_leave(&d->bd_mtx);
bpf_put(d);
Index: if_tun.c
===
RCS file: /cvs/src/sys/net/if_tun.c,v
retrieving revision 1.231
diff -u -p -r1.231 if_tun.c
--- if_tun.c    9 Mar 2021 20:05:14 -   1.231
+++ if_tun.c    16 Jan 2022 04:33:02 -
@@ -460,7 +460,6 @@ tun_dev_close(dev_t dev, struct proc *p)
ifq_purge(&ifp->if_snd);
 
CLR(sc->sc_flags, TUN_ASYNC);
-   selwakeup(&sc->sc_rsel);
sigio_free(&sc->sc_sigio);
 
if (!ISSET(sc->sc_flags, TUN_DEAD)) {



msk(4): handle status ring entries as a single 64bit word

2022-01-05 Thread David Gwynne
and then shift and mask the interesting bits out.

this works on an overdrive 1000, where i discovered that arm64 appears
to have a single instruction for shift/mask.

maybe too much churn to be worth it?

Index: if_msk.c
===
RCS file: /cvs/src/sys/dev/pci/if_msk.c,v
retrieving revision 1.137
diff -u -p -r1.137 if_msk.c
--- if_msk.c    5 Jan 2022 03:53:26 -   1.137
+++ if_msk.c    6 Jan 2022 00:38:18 -
@@ -120,6 +120,53 @@
 #include 
 #include 
 
+#define MSK_STATUS_OWN_SHIFT   63
+#define MSK_STATUS_OWN_MASK0x1
+#define MSK_STATUS_OPCODE_SHIFT    56
+#define MSK_STATUS_OPCODE_MASK 0x7f
+
+#define MSK_STATUS_OWN(_d) \
+(((_d) >> MSK_STATUS_OWN_SHIFT) & MSK_STATUS_OWN_MASK)
+#define MSK_STATUS_OPCODE(_d) \
+(((_d) >> MSK_STATUS_OPCODE_SHIFT) & MSK_STATUS_OPCODE_MASK)
+
+#define MSK_STATUS_OPCODE_RXSTAT   0x60
+#define MSK_STATUS_OPCODE_RXTIMESTAMP  0x61
+#define MSK_STATUS_OPCODE_RXVLAN   0x62
+#define MSK_STATUS_OPCODE_RXCKSUM  0x64
+#define MSK_STATUS_OPCODE_RXCKSUMVLAN  \
+(MSK_STATUS_OPCODE_RXVLAN | MSK_STATUS_OPCODE_RXCKSUM)
+#define MSK_STATUS_OPCODE_RXTIMEVLAN   \
+(MSK_STATUS_OPCODE_RXVLAN | MSK_STATUS_OPCODE_RXTIMESTAMP)
+#define MSK_STATUS_OPCODE_RSS_HASH 0x65
+#define MSK_STATUS_OPCODE_TXIDX0x68
+#define MSK_STATUS_OPCODE_MACSEC   0x6c
+#define MSK_STATUS_OPCODE_PUTIDX   0x70
+
+#define MSK_STATUS_RXSTAT_PORT_SHIFT   48
+#define MSK_STATUS_RXSTAT_PORT_MASK0x1
+#define MSK_STATUS_RXSTAT_LEN_SHIFT    32
+#define MSK_STATUS_RXSTAT_LEN_MASK 0xffff
+#define MSK_STATUS_RXSTAT_STATUS_SHIFT 0
+#define MSK_STATUS_RXSTAT_STATUS_MASK  0xffffffff
+
+#define MSK_STATUS_RXSTAT_PORT(_d) \
+(((_d) >> MSK_STATUS_RXSTAT_PORT_SHIFT) & MSK_STATUS_RXSTAT_PORT_MASK)
+#define MSK_STATUS_RXSTAT_LEN(_d) \
+(((_d) >> MSK_STATUS_RXSTAT_LEN_SHIFT) & MSK_STATUS_RXSTAT_LEN_MASK)
+#define MSK_STATUS_RXSTAT_STATUS(_d) \
+(((_d) >> MSK_STATUS_RXSTAT_STATUS_SHIFT) & MSK_STATUS_RXSTAT_STATUS_MASK)
+
+#define MSK_STATUS_TXIDX_PORTA_SHIFT   0
+#define MSK_STATUS_TXIDX_PORTA_MASK    0xfff
+#define MSK_STATUS_TXIDX_PORTB_SHIFT   24
+#define MSK_STATUS_TXIDX_PORTB_MASK    0xfff
+
+#define MSK_STATUS_TXIDX_PORTA(_d) \
+(((_d) >> MSK_STATUS_TXIDX_PORTA_SHIFT) & MSK_STATUS_TXIDX_PORTA_MASK)
+#define MSK_STATUS_TXIDX_PORTB(_d) \
+(((_d) >> MSK_STATUS_TXIDX_PORTB_SHIFT) & MSK_STATUS_TXIDX_PORTB_MASK)
+
 int mskc_probe(struct device *, void *, void *);
 void mskc_attach(struct device *, struct device *self, void *aux);
 int mskc_detach(struct device *, int);
@@ -624,6 +671,7 @@ mskc_reset(struct sk_softc *sc)
 {
u_int32_t imtimer_ticks, reg1;
int reg;
+   unsigned int i;
 
DPRINTFN(2, ("mskc_reset\n"));
 
@@ -758,8 +806,8 @@ mskc_reset(struct sk_softc *sc)
}
 
/* Reset status ring. */
-   bzero(sc->sk_status_ring,
-   MSK_STATUS_RING_CNT * sizeof(struct msk_status_desc));
+   for (i = 0; i < MSK_STATUS_RING_CNT; i++)
+   sc->sk_status_ring[i] = htole64(0);
sc->sk_status_idx = 0;
 
sk_win_write_4(sc, SK_STAT_BMU_CSR, SK_STAT_BMU_RESET);
@@ -1138,8 +1186,8 @@ mskc_attach(struct device *parent, struc
sc->sk_pc = pc;
 
if (bus_dmamem_alloc(sc->sc_dmatag,
-   MSK_STATUS_RING_CNT * sizeof(struct msk_status_desc),
-   MSK_STATUS_RING_CNT * sizeof(struct msk_status_desc),
+   MSK_STATUS_RING_CNT * sizeof(uint64_t),
+   MSK_STATUS_RING_CNT * sizeof(uint64_t),
0, &sc->sk_status_seg, 1, &sc->sk_status_nseg,
BUS_DMA_NOWAIT | BUS_DMA_ZERO)) {
printf(": can't alloc status buffers\n");
@@ -1148,27 +1196,27 @@ mskc_attach(struct device *parent, struc
 
if (bus_dmamem_map(sc->sc_dmatag,
&sc->sk_status_seg, sc->sk_status_nseg,
-   MSK_STATUS_RING_CNT * sizeof(struct msk_status_desc),
+   MSK_STATUS_RING_CNT * sizeof(uint64_t),
&kva, BUS_DMA_NOWAIT)) {
-   printf(": can't map dma buffers (%lu bytes)\n",
-   (ulong)(MSK_STATUS_RING_CNT * sizeof(struct msk_status_desc)));
+   printf(": can't map dma buffers (%zu bytes)\n",
+   MSK_STATUS_RING_CNT * sizeof(uint64_t));
goto fail_3;
}
if (bus_dmamap_create(sc->sc_dmatag,
-   MSK_STATUS_RING_CNT * sizeof(struct msk_status_desc), 1,
-   MSK_STATUS_RING_CNT * sizeof(struct msk_status_desc), 0,
+   MSK_STATUS_RING_CNT * sizeof(uint64_t), 1,
+   MSK_STATUS_RING_CNT * sizeof(uint64_t), 0,
BUS_DMA_NOWAIT | BUS_DMA_ALLOCNOW | BUS_DMA_64BIT,
&sc->sk_status_map)) {
printf(": can't create dma map\n");
goto fail_4;
}
if (bus_dmamap_load(sc->sc_dmatag, sc->sk_status_map, kva,
-   MSK_STATUS_RING_CNT * sizeof(struct msk_status_desc),

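The idea in the diff is to treat each status descriptor as a single uint64_t
and derive every field by shift and mask. A standalone sketch (not driver
code) of the extraction, using the diff's macros on a made-up descriptor
value:

#include <stdio.h>
#include <stdint.h>

#define MSK_STATUS_OWN_SHIFT	63
#define MSK_STATUS_OWN_MASK	0x1
#define MSK_STATUS_OPCODE_SHIFT	56
#define MSK_STATUS_OPCODE_MASK	0x7f

#define MSK_STATUS_OWN(_d) \
    (((_d) >> MSK_STATUS_OWN_SHIFT) & MSK_STATUS_OWN_MASK)
#define MSK_STATUS_OPCODE(_d) \
    (((_d) >> MSK_STATUS_OPCODE_SHIFT) & MSK_STATUS_OPCODE_MASK)

int
main(void)
{
	/* made-up descriptor: owned by the host, RXSTAT opcode */
	uint64_t d = (1ULL << 63) | (0x60ULL << 56) | 0x5dc;

	printf("own %u opcode 0x%02x\n",
	    (unsigned int)MSK_STATUS_OWN(d),
	    (unsigned int)MSK_STATUS_OPCODE(d));
	return (0);
}

On arm64 a compiler can turn each of these extractions into a single ubfx
instruction, which is what the overdrive observation above refers to.
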
msk(4) txeof should use completion indexes from status descriptors

2022-01-04 Thread David Gwynne
this is what i want to commit to fix "msk(4) not working with trunk(4)
(Marvell Yukon 88E8042)" as reported on bugs@.

the register that msk_txeof currently uses to figure out where it
should read up to appears to be more a counter than an actual index.
the linux driver doesnt use it though, it gets values out of the
txstat completion descriptor in the status ring. this means we don't
know where the chip is up to unless we get a txstat event for it,
which in turn means the watchdog can't know what is safe to reclaim.

the chip also appears to need to be told to start at index 0 in the tx
ring when it's coming up.

ok?

Index: if_msk.c
===
RCS file: /cvs/src/sys/dev/pci/if_msk.c,v
retrieving revision 1.136
diff -u -p -r1.136 if_msk.c
--- if_msk.c    12 Dec 2020 11:48:53 -  1.136
+++ if_msk.c    4 Jan 2022 23:43:32 -
@@ -135,7 +135,7 @@ int msk_intr(void *);
 void msk_intr_yukon(struct sk_if_softc *);
 static inline int msk_rxvalid(struct sk_softc *, u_int32_t, u_int32_t);
 void msk_rxeof(struct sk_if_softc *, struct mbuf_list *, uint16_t, uint32_t);
-void msk_txeof(struct sk_if_softc *);
+void msk_txeof(struct sk_if_softc *, unsigned int);
 static unsigned int msk_encap(struct sk_if_softc *, struct mbuf *, uint32_t);
 void msk_start(struct ifnet *);
 int msk_ioctl(struct ifnet *, u_long, caddr_t);
@@ -1557,11 +1557,6 @@ msk_watchdog(struct ifnet *ifp)
 {
struct sk_if_softc *sc_if = ifp->if_softc;
 
-   /*
-* Reclaim first as there is a possibility of losing Tx completion
-* interrupts.
-*/
-   msk_txeof(sc_if);
if (sc_if->sk_cdata.sk_tx_prod != sc_if->sk_cdata.sk_tx_cons) {
printf("%s: watchdog timeout\n", sc_if->sk_dev.dv_xname);
 
@@ -1639,26 +1634,19 @@ msk_rxeof(struct sk_if_softc *sc_if, str
 }
 
 void
-msk_txeof(struct sk_if_softc *sc_if)
+msk_txeof(struct sk_if_softc *sc_if, unsigned int prod)
 {
struct ifnet *ifp = &sc_if->arpcom.ac_if;
struct sk_softc *sc = sc_if->sk_softc;
-   uint32_t prod, cons;
+   uint32_t cons;
struct mbuf *m;
bus_dmamap_t map;
-   bus_size_t  reg;
-
-   if (sc_if->sk_port == SK_PORT_A)
-   reg = SK_STAT_BMU_TXA1_RIDX;
-   else
-   reg = SK_STAT_BMU_TXA2_RIDX;
 
/*
 * Go through our tx ring and free mbufs for those
 * frames that have been sent.
 */
cons = sc_if->sk_cdata.sk_tx_cons;
-   prod = sk_win_read_2(sc, reg);
 
if (cons == prod)
return;
@@ -1770,7 +1758,7 @@ msk_intr(void *xsc)
};
struct ifnet *ifp0 = NULL, *ifp1 = NULL;
int claimed = 0;
-   u_int32_t   status;
+   u_int32_t   status, sk_status;
struct msk_status_desc  *cur_st;
 
status = CSR_READ_4(sc, SK_Y2_ISSR2);
@@ -1812,10 +1800,19 @@ msk_intr(void *xsc)
lemtoh32(&cur_st->sk_status));
break;
case SK_Y2_STOPC_TXSTAT:
+   sk_status = lemtoh32(&cur_st->sk_status);
if (sc_if0)
-   msk_txeof(sc_if0);
-   if (sc_if1)
-   msk_txeof(sc_if1);
+   msk_txeof(sc_if0, sk_status & 0xfff);
+   if (sc_if1) {
+   /*
+* this would be easier as a 64bit
+* load of the whole status descriptor,
+* a shift, and a mask.
+*/
+   unsigned int cons = (sk_status >> 24) & 0xff;
+   cons |= (lemtoh16(&cur_st->sk_len) & 0xf) << 8;
+   msk_txeof(sc_if1, cons);
+   }
break;
default:
printf("opcode=0x%x\n", cur_st->sk_opcode);
@@ -2065,8 +2062,13 @@ msk_init(void *xsc_if)
 
SK_IF_WRITE_2(sc_if, 0, SK_RXQ1_Y2_PREF_PUTIDX,
sc_if->sk_cdata.sk_rx_prod);
-   SK_IF_WRITE_2(sc_if, 1, SK_TXQA1_Y2_PREF_PUTIDX,
-   sc_if->sk_cdata.sk_tx_prod);
+
+   /*
+* tell the chip the tx ring is empty for now. the first
+* msk_start will end up posting the ADDR64 tx descriptor
+* that resets the high address.
+*/
+   SK_IF_WRITE_2(sc_if, 1, SK_TXQA1_Y2_PREF_PUTIDX, 0);
 
/* Configure interrupt handling */
if (sc_if->sk_port == SK_PORT_A)

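The subtle part of the TXSTAT handler above is the split port B index: eight
bits come from the status word and the top four bits from the length field.
A standalone sketch with made-up field values (mirroring the diff, not driver
code):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint32_t sk_status = 0x87000123;	/* made-up status word */
	uint16_t sk_len = 0x000a;		/* made-up length field */
	unsigned int cons_a, cons_b;

	/* port A: low 12 bits of the status word */
	cons_a = sk_status & 0xfff;

	/* port B: 8 bits from the status word, top 4 from the length */
	cons_b = (sk_status >> 24) & 0xff;
	cons_b |= (sk_len & 0xf) << 8;

	printf("port A cons 0x%x, port B cons 0x%x\n", cons_a, cons_b);
	return (0);
}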


tcpdump: basic parsing of EAPOL PDUs

2022-01-03 Thread David Gwynne
i was trying to understand some packets that tcpdump didnt know about,
and discovered they were EAPOL. turns out EAPOL is a little container
around a bunch of different types of messages including EAP, the
MACsec Key Agreement protocol, and it's own type of capabilities
advertisements.

this implements basic parsing of the EAPOL container so it can print the
type. any further parsing will need code for each type of message to be
added.

i didnt think it was worth adding a new file just for eapol, so i put it
in print-ether.c. we do have something in src/sys/net/ethertypes.h for
0x888e, but i want to rename it to ETHERTYPE_EAPOL anyway.

ok?

Index: print-ether.c
===
RCS file: /cvs/src/usr.sbin/tcpdump/print-ether.c,v
retrieving revision 1.39
diff -u -p -r1.39 print-ether.c
--- print-ether.c   1 Dec 2021 18:28:46 -   1.39
+++ print-ether.c   4 Jan 2022 02:56:21 -
@@ -36,6 +36,7 @@
 #include 
 
 #include 
+#include 
 #include 
 
 
@@ -49,6 +50,7 @@ const u_char *snapend;
 
 void ether_macctl(const u_char *, u_int);
 void ether_pbb_print(const u_char *, u_int, u_int);
+void ether_eapol_print(const u_char *, u_int, u_int);
 
 void
 ether_print(const u_char *bp, u_int length)
@@ -294,6 +296,13 @@ recurse:
nsh_print(p, length);
return (1);
 
+#ifndef ETHERTYPE_EAPOL
+#define ETHERTYPE_EAPOL 0x888e
+#endif
+   case ETHERTYPE_EAPOL:
+   ether_eapol_print(p, length, caplen);
+   return (1);
+
 #ifndef ETHERTYPE_PBB
 #define ETHERTYPE_PBB 0x88e7
 #endif
@@ -367,4 +376,87 @@ ether_macctl(const u_char *p, u_int leng
 
 trunc:
printf("[|MACCTL]");
+}
+
+/*
+ * 802.1X EAPOL PDU
+ */
+
+struct eapol_header {
+   uint8_t version;
+   uint8_t type;
+#define EAPOL_T_EAP    0x00
+#define EAPOL_T_START  0x01
+#define EAPOL_T_LOGOFF 0x02
+#define EAPOL_T_KEY    0x03
+#define EAPOL_T_ENCAP_ASF_ALERT    0x04
+#define EAPOL_T_MKA    0x05
+#define EAPOL_T_ANNOUNCEMENT_GENERIC   0x06
+#define EAPOL_T_ANNOUNCEMENT_SPECIFIC  0x07
+#define EAPOL_T_ANNOUNCEMENT_REQ   0x08
+   uint16_t length;
+};
+
+void
+ether_eapol_print(const u_char *bp, u_int length, u_int caplen)
+{
+   struct eapol_header h;
+
+   printf("EAPOL");
+
+   if (caplen < sizeof(h))
+   goto trunc;
+
+   h.version = *(bp + offsetof(struct eapol_header, version));
+   h.type = *(bp + offsetof(struct eapol_header, type));
+   h.length = EXTRACT_16BITS(bp + offsetof(struct eapol_header, length));
+
+   bp += sizeof(h);
+   length -= sizeof(h);
+   caplen -= sizeof(h);
+
+   if (vflag)
+   printf(" (v%u, len %u)", h.version, h.length);
+
+   if (length > h.length)
+   length = h.length;
+   else if (length < h.length) {
+   printf(" truncated-eapol - %u bytes missing!",
+   h.length - length);
+   }
+
+   switch (h.type) {
+   case EAPOL_T_EAP:
+   printf(" EAP");
+   break;
+   case EAPOL_T_START:
+   printf(" Start");
+   break;
+   case EAPOL_T_LOGOFF:
+   printf(" Logoff");
+   break;
+   case EAPOL_T_KEY:
+   printf(" Key");
+   break;
+   case EAPOL_T_MKA:
+   printf(" MKA");
+   break;
+   case EAPOL_T_ANNOUNCEMENT_GENERIC:
+   printf(" Announcement (Generic)");
+   break;
+   case EAPOL_T_ANNOUNCEMENT_SPECIFIC:
+   printf(" Announcement (Specific)");
+   break;
+   case EAPOL_T_ANNOUNCEMENT_REQ:
+   printf(" Announcement Req");
+   break;
+   default:
+   printf(" unknown (%u)", h.type);
+   break;
+   }
+
+   return;
+
+trunc:
+   printf(" [|eapol] ");
 }

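ether_eapol_print() reads the header fields byte by byte via offsetof()
rather than casting the struct over the packet, which sidesteps alignment
problems and lets EXTRACT_16BITS handle the wire byte order. A minimal
standalone sketch of the same idea, with a made-up EAPOL-Start PDU:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* made-up EAPOL-Start PDU: version 2, type 1, no body */
	uint8_t pkt[] = { 0x02, 0x01, 0x00, 0x00 };

	uint8_t version = pkt[0];
	uint8_t type = pkt[1];
	/* the length field is big-endian on the wire */
	uint16_t length = (uint16_t)(pkt[2] << 8) | pkt[3];

	printf("EAPOL v%u type %u len %u\n", version, type, length);
	return (0);
}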


Re: explain priority codepoints and their mapping in vlan.4

2021-12-20 Thread David Gwynne
are you able to find a reference for this in any of the current specs? or in 
documentation from other vendors?

i've faithfully maintained this behaviour, but the only place i've seen it with 
my own eyes is in our code. if we're doing it different to literally everyone 
else, then maybe we shouldnt?

> On 21 Dec 2021, at 00:54, Christopher Zimmermann  wrote:
> 
> Hi,
> 
> it might be helpful for others to understand the meaning of codepoints 0 and 
> 1 of the vlan priority field. OK?
> 
> Christopher
> 
> Here is the formatted content:
> 
> The 802.1Q and 802.1ad protocols include a 3-bit priority code point
> (PCP):
>PCP 1 is defined as the lowest priority (“background”)
>PCP 0 is the default (“best effort”) - second lowest priority.
>PCPs 2-7 are increasing priority above best effort.
> The priority may be altered via pf.conf(5); see the prio option for more
> information.
>“prio 0” is mapped to PCP 1.
>“prio 1” is mapped to PCP 0.
>“prio 2-7” are mapped to PCP 2-7.
> Alternatively, the txprio property of an interface can set a specific
> priority for transmitted packets.
> 
> 
> And here's the diff:
> 
> Index: vlan.4
> ===
> RCS file: /cvs/src/share/man/man4/vlan.4,v
> retrieving revision 1.51
> diff -u -p -r1.51 vlan.4
> --- vlan.4  4 Oct 2020 12:44:49 -   1.51
> +++ vlan.4  20 Dec 2021 14:51:31 -
> @@ -83,16 +83,38 @@ interfaces by their respective protocol  identifiers, and 
> decapsulated for reception on the associated virtual
> interfaces.
> .Pp
> -The 802.1Q and 802.1ad protocols include a priority field.
> -By default, the 802.1p priority in a transmitted packet is based on the
> -priority of packets sent over the interface, which may
> -be altered via
> +The 802.1Q and 802.1ad protocols include a 3-bit priority code point
> +(PCP):
> +.Bl -item -compact -offset MMM
> +.It
> +PCP 1 is defined as the lowest priority
> +.Pq Dq background
> +.It
> +PCP 0 is the default
> +.Pq Dq best effort
> +- second lowest priority.
> +.It
> +PCPs 2-7 are increasing priority above best effort.
> +.El
> +The priority may be altered via
> .Xr pf.conf 5 ;
> see the
> .Cm prio
> option for more information.
> -Alternatively,
> +.Bl -item -compact -offset MMM
> +.It
> +.Dq prio 0
> +is mapped to PCP 1.
> +.It
> +.Dq prio 1
> +is mapped to PCP 0.
> +.It
> +.Dq prio 2-7
> +are mapped to PCP 2-7.
> +.El
> +Alternatively, the
> .Cm txprio
> +property of an interface
> can set a specific priority for transmitted packets.
> .Pp
> .Nm vlan
> 


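For reference while checking the specs: the mapping the proposed text
describes amounts to swapping 0 and 1 on the way to the wire. A standalone
sketch (prio_to_pcp is a made-up name, not kernel code):

#include <stdio.h>
#include <stdint.h>

/* hypothetical helper: swap prio 0 and 1, pass 2-7 through */
static uint8_t
prio_to_pcp(uint8_t prio)
{
	switch (prio) {
	case 0:
		return (1);	/* lowest prio -> PCP 1, "background" */
	case 1:
		return (0);	/* -> PCP 0, the default "best effort" */
	default:
		return (prio);	/* 2-7 map straight through */
	}
}

int
main(void)
{
	unsigned int prio;

	for (prio = 0; prio < 8; prio++)
		printf("prio %u -> PCP %u\n", prio,
		    prio_to_pcp((uint8_t)prio));
	return (0);
}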

Re: vport: set UP on ip assign

2021-11-15 Thread David Gwynne
On Mon, Nov 15, 2021 at 02:31:42PM +, Klemens Nanni wrote:
> On Mon, Nov 15, 2021 at 01:37:49PM +, Stuart Henderson wrote:
> > On 2021/11/15 12:27, Klemens Nanni wrote:
> > > On Sun, Nov 14, 2021 at 07:04:42PM -0700, Theo de Raadt wrote:
> > > > I think physical interfaces should come up when something is configured
> > > > on them, but virtual interfaces shouldn't -- mostly because the order of
> > > > configuration is often muddled.
> > > 
> > > So "inet6 2001:db8::1" in hostname.em0 will do the trick but
> > > hostname.vport0 would need another "up" for the same behaviour:  that's
> > > rather confusing me as a user.
> > 
> > hostname.* files are orthogonal to this; netstart can process all the lines,
> > then if it has seen a line doing address configuration and has not seen an
> > explicit "down", it can bring the interface up automatically at the end.
> > (if this changed, it would be a nightmare for users to do anything else).
> 
> Yes, netstart can and should deal with this correctly, just like you
> describe.
> 
> > Users would need to make sure they have a netstart which does that if
> > updating a kernel, but that's just a case of matching kernel+userland and is
> > nothing new for OpenBSD.
> > 
> > The different behaviour would be apparent with separate runs of ifconfig.
> > some scripts may need adapting and users might need to run "ifconfig XX up"
> > themselves but I don't think that would be a problem.
> 
> Agreed.
> 
> Having the implicit-up logic entirely contained in netstart would make
> life much easier, both for network stack hackers and users, imho.

this was my attempt at just that.

the installer has its own netstart though, right?

Index: etc/netstart
===
RCS file: /cvs/src/etc/netstart,v
retrieving revision 1.216
diff -u -p -r1.216 netstart
--- etc/netstart    2 Sep 2021 19:38:20 -   1.216
+++ etc/netstart    15 Nov 2021 23:20:00 -
@@ -71,6 +71,9 @@ parse_hn_line() {
dhcp)   _cmds[${#_cmds[*]}]="ifconfig $_if inet autoconf"
V4_AUTOCONF=true
;;
+   down)   _c[0]=
+   _ifup=down
+   ;;
'!'*)   _cmd=$(print -- "${_c[@]}" | sed 's/\$if/'$_if'/g')
_cmds[${#_cmds[*]}]="${_cmd#!}"
;;
@@ -118,6 +121,7 @@ vifscreate() {
 ifstart() {
local _if=$1 _hn=/etc/hostname.$1 _cmds _i=0 _line _stat
set -A _cmds
+   _ifup=up
 
# Interface names must be alphanumeric only.  We check to avoid
# configuring backup or temp files, and to catch the "*" case.
@@ -145,6 +149,8 @@ ifstart() {
while IFS= read -- _line; do
parse_hn_line $_line
done <$_hn
+
+   _cmds[${#_cmds[*]}]="ifconfig $_if $_ifup"
 
# Apply the interface configuration commands stored in _cmds array.
while ((_i < ${#_cmds[*]})); do
Index: share/man/man5/hostname.if.5
===
RCS file: /cvs/src/share/man/man5/hostname.if.5,v
retrieving revision 1.77
diff -u -p -r1.77 hostname.if.5
--- share/man/man5/hostname.if.5    17 Jul 2021 15:28:31 -  1.77
+++ share/man/man5/hostname.if.5    15 Nov 2021 23:20:01 -
@@ -57,6 +57,9 @@ the administrator should not expect magi
 and the
 per-driver manual pages to see what arguments are permitted.
 .Pp
+Interfaces are implicitly configured to be brought up and running.
+This behaviour can be disabled by adding a line containing down to the file.
+.Pp
 Arguments containing either whitespace or single quote
 characters must be double quoted.
 For example:



Re: let dhcpd start on down interfaces

2021-11-15 Thread David Gwynne
On Mon, Nov 15, 2021 at 12:19:48PM +, Klemens Nanni wrote:
> On Mon, Nov 15, 2021 at 02:08:33PM +1000, David Gwynne wrote:
> > The subject line only tells half the story. The other half is that
> > instead of hoping it only identifies Ethernet interfaces it now actually
> > checks and restricts itself to Ethernet interfaces. I couldn't help
> > myself.
> > 
> > If it helps my case for this code, I've been using this semantic for a
> > few years in a home grown dhcp-relay we run on our firewalls at
> > work. We used to run the dhcp-relay on a bunch of vlan(4) interfaces,
> > but we currently run them on carp(4). There's no significant difference
> > between using an Ethernet or carp interface from a dhcpd or relay point
> > of view, so the code accepts either.
> 
> This sounds reasonable, but I have no means to test these setups.
> 
> > The kernel lets you bind to addresses on down interfaces, so I don't
> > know why dhcpd has to be precious about it. We often start the
> > firewalls up with whole groups of these interfaces forced down.
> > We want DHCP packets to start moving when the interface comes up
> > though, therefore the daemon should start and be running even if
> > the interface is down.
> 
> I wondered about this as well but was hesitant to touch dhcpd(8) because
> I rarely use it.
> 
> FWIW, ISC DHCP has the same code and comment wrt. interface check,
> although I didn't do any runtime tests.

our dhcpd came from isc dhcpd, and the guts havent been touched
much.

> > Should I warn if the interface is down on startup?
> 
> Wouldn't hurt, although you don't see it at boot time anyway, so not
> much help, imho.
> 
> > There's some other questionable code in here, but I'm going to take
> > some deep breaths and try to ignore them for now.
> 
> Here is the minimal diff I first used to fix my vport/DOWN issue,
> before going the implicit UP route.

this is a good start. im ok with you putting your change in, just get
someone else to ok it too.

> Regardless of vport's behaviour, this seemed sensible;  I mean, I can
> `nc -l' on down interfaces, other daemons don't care about UP and I'd
> rather have dhcpd running and pull interfaces UP to fix DHCP instead of
> restarting failed dhcpd (which could still serve on other UP interfaces).

exactly.

> 
> Index: dispatch.c
> ===
> RCS file: /cvs/src/usr.sbin/dhcpd/dispatch.c,v
> retrieving revision 1.43
> diff -u -p -r1.43 dispatch.c
> --- dispatch.c  12 Apr 2017 19:17:30 -  1.43
> +++ dispatch.c  13 Nov 2021 23:04:36 -
> @@ -112,13 +112,12 @@ discover_interfaces(int *rdomain)
>   for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
>   /*
>* See if this is the sort of interface we want to
> -  * deal with.  Skip loopback, point-to-point and down
> +  * deal with.  Skip loopback and point-to-point
>* interfaces, except don't skip down interfaces if we're
>* trying to get a list of configurable interfaces.
>*/
>   if ((ifa->ifa_flags & IFF_LOOPBACK) ||
>   (ifa->ifa_flags & IFF_POINTOPOINT) ||
> - (!(ifa->ifa_flags & IFF_UP)) ||
>   (!(ifa->ifa_flags & IFF_BROADCAST)))
>   continue;
>  
> 
> > OK?
> > 
> > Index: dispatch.c
> > ===
> > RCS file: /cvs/src/usr.sbin/dhcpd/dispatch.c,v
> > retrieving revision 1.43
> > diff -u -p -r1.43 dispatch.c
> > --- dispatch.c  12 Apr 2017 19:17:30 -  1.43
> > +++ dispatch.c  15 Nov 2021 03:50:46 -
> > @@ -47,6 +47,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  
> > @@ -112,14 +113,10 @@ discover_interfaces(int *rdomain)
> > for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
> > /*
> >  * See if this is the sort of interface we want to
> > -* deal with.  Skip loopback, point-to-point and down
> > -* interfaces, except don't skip down interfaces if we're
> > -* trying to get a list of configurable interfaces.
> > +* deal with, which is basically Ethernet.
> >  */
> > if ((ifa->ifa_flags & IFF_LOOPBACK) ||
> > -   (ifa->ifa_flags & IFF_POINTOPOINT) ||
> > -   (!(ifa->ifa_flags & IFF_UP)) ||
> > -   (!(ifa->

let dhcpd start on down interfaces

2021-11-14 Thread David Gwynne
The subject line only tells half the story. The other half is that
instead of hoping it only identifies Ethernet interfaces it now actually
checks and restricts itself to Ethernet interfaces. I couldn't help
myself.

If it helps my case for this code, I've been using this semantic for a
few years in a home grown dhcp-relay we run on our firewalls at
work. We used to run the dhcp-relay on a bunch of vlan(4) interfaces,
but we currently run them on carp(4). There's no significant difference
between using an Ethernet or carp interface from a dhcpd or relay point
of view, so the code accepts either.

The kernel lets you bind to addresses on down interfaces, so I don't
know why dhcpd has to be precious about it. We often start the
firewalls up with whole groups of these interfaces forced down.
We want DHCP packets to start moving when the interface comes up
though, therefore the daemon should start and be running even if
the interface is down.

Should I warn if the interface is down on startup?

There's some other questionable code in here, but I'm going to take
some deep breaths and try to ignore them for now.

OK?

Index: dispatch.c
===
RCS file: /cvs/src/usr.sbin/dhcpd/dispatch.c,v
retrieving revision 1.43
diff -u -p -r1.43 dispatch.c
--- dispatch.c  12 Apr 2017 19:17:30 -  1.43
+++ dispatch.c  15 Nov 2021 03:50:46 -
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -112,14 +113,10 @@ discover_interfaces(int *rdomain)
for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
/*
 * See if this is the sort of interface we want to
-* deal with.  Skip loopback, point-to-point and down
-* interfaces, except don't skip down interfaces if we're
-* trying to get a list of configurable interfaces.
+* deal with, which is basically Ethernet.
 */
if ((ifa->ifa_flags & IFF_LOOPBACK) ||
-   (ifa->ifa_flags & IFF_POINTOPOINT) ||
-   (!(ifa->ifa_flags & IFF_UP)) ||
-   (!(ifa->ifa_flags & IFF_BROADCAST)))
+   (ifa->ifa_flags & IFF_POINTOPOINT))
continue;
 
/* See if we've seen an interface that matches this one. */
@@ -156,11 +153,27 @@ discover_interfaces(int *rdomain)
/* If we have the capability, extract link information
   and record it in a linked list. */
if (ifa->ifa_addr->sa_family == AF_LINK) {
-   struct sockaddr_dl *sdl =
-   ((struct sockaddr_dl *)(ifa->ifa_addr));
+   struct if_data *ifi;
+   struct sockaddr_dl *sdl;
+
+   ifi = (struct if_data *)ifa->ifa_data;
+   if (ifi->ifi_type != IFT_ETHER &&
+   ifi->ifi_type != IFT_CARP) {
+   /* this interface will be cleaned up later */
+   continue;
+   }
+
+   sdl = ((struct sockaddr_dl *)(ifa->ifa_addr));
+   if (sdl->sdl_alen > sizeof(tmp->hw_address.haddr)) {
+   log_warnx("interface %s long hardware address",
+   tmp->name);
+   /* this interface will be cleaned up later */
+   continue;
+   }
+
tmp->index = sdl->sdl_index;
tmp->hw_address.hlen = sdl->sdl_alen;
-   tmp->hw_address.htype = HTYPE_ETHER; /* XXX */
+   tmp->hw_address.htype = HTYPE_ETHER;
memcpy(tmp->hw_address.haddr,
LLADDR(sdl), sdl->sdl_alen);
} else if (ifa->ifa_addr->sa_family == AF_INET) {
@@ -236,6 +249,17 @@ discover_interfaces(int *rdomain)
last = NULL;
for (tmp = interfaces; tmp; tmp = next) {
next = tmp->next;
+
+   if (tmp->index == 0) {
+   log_warnx("Can't listen on %s - wrong interface type",
+   tmp->name);
+   /* Remove tmp from the list of interfaces. */
+   if (!last)
+   interfaces = interfaces->next;
+   else
+   last->next = tmp->next;
+   continue;
+   }
 
if (!tmp->ifp) {
log_warnx("Can't listen on %s - it has no IP address.",


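A standalone userland sketch of the filtering the diff implements (assumed
equivalent, rendered with getifaddrs(3)): keep Ethernet and carp interfaces,
and don't care about IFF_UP:

#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_types.h>
#include <ifaddrs.h>
#include <stdio.h>

int
main(void)
{
	struct ifaddrs *ifap, *ifa;

	if (getifaddrs(&ifap) == -1)
		return (1);

	for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
		struct if_data *ifi;

		/* every interface has one AF_LINK entry */
		if (ifa->ifa_addr == NULL ||
		    ifa->ifa_addr->sa_family != AF_LINK)
			continue;

		ifi = (struct if_data *)ifa->ifa_data;
		if (ifi->ifi_type != IFT_ETHER &&
		    ifi->ifi_type != IFT_CARP)
			continue;

		/* note: no IFF_UP check */
		printf("%s\n", ifa->ifa_name);
	}

	freeifaddrs(ifap);
	return (0);
}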

Re: vport: set UP on ip assign

2021-11-14 Thread David Gwynne
On Sat, Nov 13, 2021 at 11:59:59PM +, Klemens Nanni wrote:
> Practically all interfaces pull itself up when IPs get assigned, but
> vport(4) does not.

Yes, I do (or don't do) this very deliberately when I get the chance.

> This broke IPv4 networking for me on a router I switched from bridge(4)
> to veb(4) because hostname.vport0 only contained the equivalent of
> 
>   descr LAN
>   inet 192.0.2.1
>   inet6 2001:db8::1
> 
> e.g. the explicit "up" was missing.
> 
> dhcpd(8) only considers UP interfaces to listen on during start;
> being the only interface it could potentially listen on, dhcpd thus
> ignored vport0 and failed to start.
> 
> `ifconfig vport0 up && rcctl restart dhcpd' fixed this.
> Adding "up" to hostname.vport0 also fixed it.
> 
> Nonetheless, vport should UP itself as the rest does.

My counter argument is that no interface should implicitly bring itself
up when an address is configured because it has been a road block to
figuring out how to lock the ioctl paths better, and it confuses error
handling. If address config works, but the interface fails to come up,
the address remains configured but we report an error. But it looks like
configuring an address failed? Wat.

I've suggested previously that netstart should handle bringing an
interface up. look for "netstart: implicit up and explicit down for
hostname.if conf files" on tech@. I didn't hanve the energy to push
it forward though.

dhcpd should cope with an interface being down too. It should be about
whether the addresses are right more than if the interface is up or not.

> 
> OK?
> 
> Index: net/if_veb.c
> ===
> RCS file: /cvs/src/sys/net/if_veb.c,v
> retrieving revision 1.21
> diff -u -p -r1.21 if_veb.c
> --- net/if_veb.c  8 Nov 2021 04:15:46 -   1.21
> +++ net/if_veb.c  13 Nov 2021 23:47:58 -
> @@ -2122,6 +2122,10 @@ vport_ioctl(struct ifnet *ifp, u_long cm
>   return (ENXIO);
>  
>   switch (cmd) {
> + case SIOCSIFADDR:
> + ifp->if_flags |= IFF_UP;
> + /* FALLTHROUGH */
> +
>   case SIOCSIFFLAGS:
>   if (ISSET(ifp->if_flags, IFF_UP)) {
>   if (!ISSET(ifp->if_flags, IFF_RUNNING))
> 



Re: demystify vport(4) in vport(4) and ifconfig(8)

2021-10-28 Thread David Gwynne
On Thu, Oct 28, 2021 at 03:43:11PM +0100, Jason McIntyre wrote:
> On Thu, Oct 28, 2021 at 04:53:39PM +1000, David Gwynne wrote:
> > 
> > 
> > > On 28 Oct 2021, at 15:35, Jason McIntyre  wrote:
> > > 
> > > On Thu, Oct 28, 2021 at 01:27:17PM +1000, David Gwynne wrote:
> > >> 
> > >>> that strategy does rely on individual driver docs saying upfront that
> > >>> they can be created though, even if using create is not common. i 
> > >>> wonder if
> > >>> ifconfig already knows what it can create, and could maybe be more
> > >>> helpful if "create" without an ifname gave a hint.
> > >> 
> > >> dlg@rpi3b trees$ ifconfig -C
> > >> aggr bpe bridge carp egre enc eoip etherip gif gre lo mgre mpe mpip mpw 
> > >> nvgre pair pflog pflow pfsync ppp pppoe svlan switch tap tpmr trunk tun 
> > >> veb vether vlan vport vxlan wg
> > >> 
> > >> the "interface can be created paragraph" is in most of the manpages for
> > >> these drivers, except for pair, pfsync, pppoe, and vether (and
> > >> veb/vport). some of them could be improved, eg, bpe and switch.
> > >> 
> > > 
> > > oops, missed that flag!
> > 
> > maybe the doco for "create" in ifconfig.8 should refer back to it too.
> > 
> 
> good idea - this fits in nicely with stuart's proposal to not list every
> device.
> 
> > 
> > > i had thought maybe if there was such an option, we wouldn;t need the
> > > "if can be created" blurb in every page. but i suppose we do need to say
> > > it somewhere.
> > 
> > the driver manpages are in a weird place because they're supposed to be for 
> > programmers, and the ifconfig manpage is for "operators". however, the 
> > driver pages have the "theory of operation" for their interfaces. so you 
> > have the high level theory in the driver manpage, the way 99% of us 
> > actually interact with interfaces in ifconfig.8, and then you go back to 
> > the driver doco for the low level programming detail. it's not the best 
> > sandwich.
> > 
> 
> yep.
> 
> > > another issue is that the text is inconsistent across pages.
> > 
> > yeah, but that can be fixed easily.
> > 
> 
> hopefully!
> 
> anyway, here' my proposal following your and sthen's advice.
> ok?
> 
> jmc
> 
> Index: ifconfig.8
> ===
> RCS file: /cvs/src/sbin/ifconfig/ifconfig.8,v
> retrieving revision 1.377
> diff -u -p -r1.377 ifconfig.8
> --- ifconfig.8  27 Oct 2021 06:36:51 -  1.377
> +++ ifconfig.8  28 Oct 2021 14:41:06 -
> @@ -177,42 +177,9 @@ network.
>  The default broadcast address is the address with a host part of all 1's.
>  .It Cm create
>  Create the specified network pseudo-device.
> -At least the following devices can be created on demand:
> -.Pp
> -.Xr aggr 4 ,
> -.Xr bpe 4 ,
> -.Xr bridge 4 ,
> -.Xr carp 4 ,
> -.Xr egre 4 ,
> -.Xr enc 4 ,
> -.Xr eoip 4 ,
> -.Xr etherip 4 ,
> -.Xr gif 4 ,
> -.Xr gre 4 ,
> -.Xr lo 4 ,
> -.Xr mgre 4 ,
> -.Xr mpe 4 ,
> -.Xr mpip 4 ,
> -.Xr mpw 4 ,
> -.Xr nvgre 4 ,
> -.Xr pair 4 ,
> -.Xr pflog 4 ,
> -.Xr pflow 4 ,
> -.Xr pfsync 4 ,
> -.Xr ppp 4 ,
> -.Xr pppoe 4 ,
> -.Xr svlan 4 ,
> -.Xr switch 4 ,
> -.Xr tap 4 ,
> -.Xr tpmr 4 ,
> -.Xr trunk 4 ,
> -.Xr tun 4 ,
> -.Xr veb 4 ,
> -.Xr vether 4 ,
> -.Xr vlan 4 ,
> -.Xr vport 4 ,
> -.Xr vxlan 4 ,
> -.Xr wg 4
> +A list of devices which can be dynamically created may be shown with the
> +.Fl C
> +option.
>  .It Cm debug
>  Enable driver-dependent debugging code; usually, this turns on
>  extra console error logging.

if solene@ doesnt think it hurts then i'm ok with it.

dlg



Re: demystify vport(4) in vport(4) and ifconfig(8)

2021-10-28 Thread David Gwynne
On Thu, Oct 28, 2021 at 04:06:39PM +0100, Stuart Henderson wrote:
> On 2021/10/28 13:11, David Gwynne wrote:
> > On Wed, Oct 27, 2021 at 10:12:35AM +0100, Stuart Henderson wrote:
> > > On 2021/10/27 17:44, David Gwynne wrote:
> > > > 
> > > > > benno@ suggested I look at vether(4) to adapt the text related to
> > > > > bridge(4) but I'm not sure how to rewrite it properly for veb(4).
> > > > 
> > > > i get that, but for a different reason. im too close to veb/vport, so i
> > > > think it's all very obvious.
> > > > 
> > > > maybe we could split the first paragraph into separate ones for veb
> > > > and vport, and flesh them out a bit. what is it about vport that
> > > > needs to be said?
> > > 
> > > I'm not so close to veb/vport (and haven't run into a situation to use
> > > it yet, though maybe I'll give it a try converting an etherip/ipsec
> > > bridge that I have). I think it's pretty obvious too, though I think
> > > people aren't grasping what "the network stack can be connected" means,
> > > would the diff below help? it feels a bit more like spelling things out
> > > than is usual for a manual page but maybe that's needed here.
> > 
> > I think it is needed here. My only issue is that the layer 3 stack is
> > more than just IP, it also includes mpls and pppoe and bpe(4). Listing
> > all that is heading into "list all the things" territory again though.
> 
> I'll commit it with this tweaked a bit
> 
> +They are then independent of the host network stack: the individual
> +Ethernet ports no longer function as independent devices and cannot
> +be configured with
> +.Xr inet 4
> +or
> +.Xr inet6 4
> +addresses or other layer-3 features themselves.
> 
> happy to tweak further, but I think it's an improvement already

that's better than all the attempts i came up with myself. ok by me.

> 
> > > for ifconfig I would be in favour of _not_ listing all the possible
> > > cloneable interface types that can be used with create, it's just another
> > > thing to get out of sync - maybe just a few of the common ones and tell
> > > the reader about ifconfig -C at that point.. "ifconfig create" barely
> > > seems necessary except possibly for tun/tap - for most interface types
> > > you are going to run another ifconfig command (like "ifconfig veb0 add
> > > em0") which creates the interface automatically anyway.
> > 
> > Having clonable interfaces be implicitly created is something that
> > annoys me. If I ifconfig bridge0 add gre0, it should actually fail
> > without side effects such as bringing an unwanted child^Winterface
> > into the world. This and other implicit behaviours like bringing
> > an interface up when an address on it is configured are also painful
> > from a locking point of view, even after trying to figure out what's
> > reasonable to clean up when a later step fails.
> > 
> > I seem to lose this argument every time though.
> 
> The "auto up when configuring an address" is very convenient but
> it can result in actual problems too (pppoe needs to know
> which protocols to negotiate so it's racy) as well as making
> locking painful

mmm.



Re: demystify vport(4) in vport(4) and ifconfig(8)

2021-10-28 Thread David Gwynne



> On 28 Oct 2021, at 15:35, Jason McIntyre  wrote:
> 
> On Thu, Oct 28, 2021 at 01:27:17PM +1000, David Gwynne wrote:
>> 
>>> that strategy does rely on individual driver docs saying upfront that
>>> they can be created though, even if using create is not common. i wonder if
>>> ifconfig already knows what it can create, and could maybe be more
>>> helpful if "create" without an ifname gave a hint.
>> 
>> dlg@rpi3b trees$ ifconfig -C
>> aggr bpe bridge carp egre enc eoip etherip gif gre lo mgre mpe mpip mpw 
>> nvgre pair pflog pflow pfsync ppp pppoe svlan switch tap tpmr trunk tun veb 
>> vether vlan vport vxlan wg
>> 
>> the "interface can be created paragraph" is in most of the manpages for
>> these drivers, except for pair, pfsync, pppoe, and vether (and
>> veb/vport). some of them could be improved, eg, bpe and switch.
>> 
> 
> oops, missed that flag!

maybe the doco for "create" in ifconfig.8 should refer back to it too.


> i had thought maybe if there was such an option, we wouldn't need the
> "if can be created" blurb in every page. but i suppose we do need to say
> it somewhere.

the driver manpages are in a weird place because they're supposed to be for 
programmers, and the ifconfig manpage is for "operators". however, the driver 
pages have the "theory of operation" for their interfaces. so you have the high 
level theory in the driver manpage, the way 99% of us actually interact with 
interfaces in ifconfig.8, and then you go back to the driver doco for the low 
level programming detail. it's not the best sandwich.

> another issue is that the text is inconsistent across pages.

yeah, but that can be fixed easily.

dlg


Re: demystify vport(4) in vport(4) and ifconfig(8)

2021-10-27 Thread David Gwynne
On Wed, Oct 27, 2021 at 02:32:31PM +0100, Jason McIntyre wrote:
> On Wed, Oct 27, 2021 at 10:12:35AM +0100, Stuart Henderson wrote:
> > On 2021/10/27 17:44, David Gwynne wrote:
> > > 
> > > > benno@ suggested I look at vether(4) to adapt the text related to
> > > > bridge(4) but I'm not sure how to rewrite it properly for veb(4).
> > > 
> > > i get that, but for a different reason. im too close to veb/vport, so i
> > > think it's all very obvious.
> > > 
> > > maybe we could split the first paragraph into separate ones for veb
> > > and vport, and flesh them out a bit. what is it about vport that
> > > needs to be said?
> > 
> > I'm not so close to veb/vport (and haven't run into a situation to use
> > it yet, though maybe I'll give it a try converting an etherip/ipsec
> > bridge that I have). I think it's pretty obvious too, though I think
> > people aren't grasping what "the network stack can be connected" means,
> > would the diff below help? it feels a bit more like spelling things out
> > than is usual for a manual page but maybe that's needed here.
> > 
> > for ifconfig I would be in favour of _not_ listing all the possible
> > cloneable interface types that can be used with create, it's just another
> > thing to get out of sync - maybe just a few of the common ones and tell
> > the reader about ifconfig -C at that point.. "ifconfig create" barely
> > seems necessary except possibly for tun/tap - for most interface types
> > you are going to run another ifconfig command (like "ifconfig veb0 add
> > em0") which creates the interface automatically anyway.
> > 
> 
> hi.
> 
> i agree with stuart about "create": this list was once short and made
> sense. now it just keeps going out of date, without providing any help
> to the reader. i don't want to stack diff on diff, but maybe once the
> veb stuff is sorted i will zap the create list.

makes sense to me.

> that strategy does rely on individual driver docs saying upfront that
> they can be created though, even if using create is not common. i wonder if
> ifconfig already knows what it can create, and could maybe be more
> helpful if "create" without an ifname gave a hint.

dlg@rpi3b trees$ ifconfig -C
aggr bpe bridge carp egre enc eoip etherip gif gre lo mgre mpe mpip mpw nvgre 
pair pflog pflow pfsync ppp pppoe svlan switch tap tpmr trunk tun veb vether 
vlan vport vxlan wg

the "interface can be created paragraph" is in most of the manpages for
these drivers, except for pair, pfsync, pppoe, and vether (and
veb/vport). some of them could be improved, eg, bpe and switch.

> anyway, to that end i'm ok with solene's diff.
> 
> i'm also ok with your diff, with one tweak:
> 
> > Index: veb.4
> > ===
> > RCS file: /cvs/src/share/man/man4/veb.4,v
> > retrieving revision 1.2
> > diff -u -p -r1.2 veb.4
> > --- veb.4   23 Feb 2021 11:43:41 -  1.2
> > +++ veb.4   27 Oct 2021 09:03:49 -
> > @@ -28,20 +28,30 @@ The
> >  .Nm veb
> >  pseudo-device supports the creation of a single layer 2 Ethernet
> >  network between multiple ports.
> > -Ethernet interfaces are added to the bridge to be used as ports.
> > +Ethernet interfaces are added to the
> >  .Nm veb
> > -takes over the operation of the interfaces that are added as ports
> > -and uses them independently of the host network stack.
> > -The network stack can be connected to the Ethernet network managed
> > -by
> > +bridge to be used as ports.
> > +Unlike
> > +.Xr bridge 4 ,
> >  .Nm veb
> > -by creating a
> > +takes over the operation of the interfaces that are added as ports.
> > +They are then independent of the host network stack; the individual
> 
> i think a colon would work better than a semi-colon here, since i think
> the info after it is essentially an explainer of what independent means.
> 
> jmc
> 
> > +Ethernet ports no longer function as independent layer 3 devices
> > +and cannot be configured with
> > +.Xr inet 4
> > +or
> > +.Xr inet6 4
> > +addresses themselves.
> > +.Pp
> > +The Ethernet network managed by
> > +.Nm veb
> > +can be connected to the network stack as a whole by creating a
> >  .Nm vport
> >  interface and attaching it as a port to the bridge.
> >  From the perspective of the host network stack, a
> >  .Nm vport
> >  interface acts as a normal interface connected to an Ethernet
> > -network.
> > +network and can be configured with addresses.
> >  .Pp
> >  .Nm veb
> >  is a learning bridge that maintains a table of Ethernet addresses
> > 



Re: demystify vport(4) in vport(4) and ifconfig(8)

2021-10-27 Thread David Gwynne
On Wed, Oct 27, 2021 at 10:12:35AM +0100, Stuart Henderson wrote:
> On 2021/10/27 17:44, David Gwynne wrote:
> > 
> > > benno@ suggested I look at vether(4) to adapt the text related to
> > > bridge(4) but I'm not sure how to rewrite it properly for veb(4).
> > 
> > i get that, but for a different reason. im too close to veb/vport, so i
> > think it's all very obvious.
> > 
> > maybe we could split the first paragraph into separate ones for veb
> > and vport, and flesh them out a bit. what is it about vport that
> > needs to be said?
> 
> I'm not so close to veb/vport (and haven't run into a situation to use
> it yet, though maybe I'll give it a try converting an etherip/ipsec
> bridge that I have). I think it's pretty obvious too, though I think
> people aren't grasping what "the network stack can be connected" means,
> would the diff below help? it feels a bit more like spelling things out
> than is usual for a manual page but maybe that's needed here.

I think it is needed here. My only issue is that the layer 3 stack is
more than just IP, it also includes mpls and pppoe and bpe(4). Listing
all that is heading into "list all the things" territory again though.

> for ifconfig I would be in favour of _not_ listing all the possible
> cloneable interface types that can be used with create, it's just another
> thing to get out of sync - maybe just a few of the common ones and tell
> the reader about ifconfig -C at that point.. "ifconfig create" barely
> seems necessary except possibly for tun/tap - for most interface types
> you are going to run another ifconfig command (like "ifconfig veb0 add
> em0") which creates the interface automatically anyway.

Having clonable interfaces be implicitly created is something that
annoys me. If I ifconfig bridge0 add gre0, it should actually fail
without side effects such as bringing an unwanted child^Winterface
into the world. This and other implicit behaviours like bringing
an interface up when an address on it is configured are also painful
from a locking point of view, even after trying to figure out what's
reasonable to clean up when a later step fails.

I seem to lose this argument every time though.

> 
> Index: veb.4
> ===
> RCS file: /cvs/src/share/man/man4/veb.4,v
> retrieving revision 1.2
> diff -u -p -r1.2 veb.4
> --- veb.4 23 Feb 2021 11:43:41 -  1.2
> +++ veb.4 27 Oct 2021 09:03:49 -
> @@ -28,20 +28,30 @@ The
>  .Nm veb
>  pseudo-device supports the creation of a single layer 2 Ethernet
>  network between multiple ports.
> -Ethernet interfaces are added to the bridge to be used as ports.
> +Ethernet interfaces are added to the
>  .Nm veb
> -takes over the operation of the interfaces that are added as ports
> -and uses them independently of the host network stack.
> -The network stack can be connected to the Ethernet network managed
> -by
> +bridge to be used as ports.
> +Unlike
> +.Xr bridge 4 ,
>  .Nm veb
> -by creating a
> +takes over the operation of the interfaces that are added as ports.
> +They are then independent of the host network stack; the individual
> +Ethernet ports no longer function as independent layer 3 devices
> +and cannot be configured with
> +.Xr inet 4
> +or
> +.Xr inet6 4
> +addresses themselves.
> +.Pp
> +The Ethernet network managed by
> +.Nm veb
> +can be connected to the network stack as a whole by creating a
>  .Nm vport
>  interface and attaching it as a port to the bridge.
>  From the perspective of the host network stack, a
>  .Nm vport
>  interface acts as a normal interface connected to an Ethernet
> -network.
> +network and can be configured with addresses.
>  .Pp
>  .Nm veb
>  is a learning bridge that maintains a table of Ethernet addresses
> 



Re: demystify vport(4) in vport(4) and ifconfig(8)

2021-10-27 Thread David Gwynne
On Wed, Oct 27, 2021 at 08:34:48AM +0200, Solene Rapenne wrote:
> On Wed, 27 Oct 2021 07:28:32 +1000
> David Gwynne :
> 
> > On Tue, Oct 26, 2021 at 09:18:30PM +0200, Solene Rapenne wrote:
> > > I tried to figure out how to use veb interfaces but the man page
> > > wasn't obvious in regards to the "vport" thing. It turns out it's
> > > a kind of interface that can be created with ifconfig.
> > > 
> > > I think we should make this clearer.  
> > 
> > agreed. the man page for veb/vport is definitely not... rigorous.
> > 
> > > Because ifconfig(8) mentions many type of interfaces I've searched
> > > for "vport" without success while "most" types are referenced in
> > > the man page. Like I added veb(4) recently, the diff adds vport(4)
> > > and missing mpip(4) so a search would give a clue it's related to
> > > ifconfig.  
> > 
> > I'm ok with the ifconfig chunk.
> > 
> > in veb(4), I think we should add vport in the synopsis because the
> > > man page is shared for veb and vport interfaces but at first look
> > > it seems only veb is a type of interface.  
> > 
> > The synopsis shows what you put into a kernel config file (eg 
> > src/sys/conf/GENERIC) to enable the driver, but "pseudo-device
> > vport" is not valid kernel config. You enable the veb driver and that
> > one driver provides both veb and vport interfaces. Another example of
> > this is the gre driver which provides gre, egre, mgre, nvgre, and eoip
> > interfaces.
> > 
> > > And finally, I added a mention that vport can be created with
> > > ifconfig(8) so it's really obvious. Maybe it's too much and can be
> > > removed.  
> > 
> > It should definitely be said. The other man pages for clonable
> > interfaces generally have a paragraph like this:
> > 
> > .Nm gre ,
> > .Nm mgre ,
> > .Nm egre ,
> > and
> > .Nm nvgre
> > interfaces can be created at runtime using the
> > .Ic ifconfig iface Ns Ar N Ic create
> > command or by setting up a
> > .Xr hostname.if 5
> > configuration file for
> > .Xr netstart 8 .
> > 
> > I just noticed vether.4 is also missing a paragraph like that too :(
> > 
> > > comments? ok?  
> > 
> > Apart from it not being obvious where vport interfaces come from, is
> > there anything else not obvious about veb?
> > 
> 
> veb is fine to me, here is a diff that adds the ifconfig paragraph
> to veb(4) and vether(4), I removed my first change from veb.

ok by me.

> benno@ suggested I look at vether(4) to adapt the text related to
> bridge(4) but I'm not sure how to rewrite it properly for veb(4).

i get that, but for a different reason. im too close to veb/vport, so i
think it's all very obvious.

maybe we could split the first paragraph into separate ones for veb
and vport, and flesh them out a bit. what is it about vport that
needs to be said?

> Index: share/man/man4//veb.4
> ===
> RCS file: /home/reposync/src/share/man/man4/veb.4,v
> retrieving revision 1.2
> diff -u -p -r1.2 veb.4
> --- share/man/man4//veb.4 23 Feb 2021 11:43:41 -  1.2
> +++ share/man/man4//veb.4 27 Oct 2021 06:28:45 -
> @@ -43,6 +43,17 @@ From the perspective of the host network
>  interface acts as a normal interface connected to an Ethernet
>  network.
>  .Pp
> +A
> +.Nm veb
> +or
> +.Nm vport
> +interface can be created at runtime using the
> +.Ic ifconfig iface Ns Ar N Ic create
> +command or by setting up a
> +.Xr hostname.if 5
> +configuration file for
> +.Xr netstart 8 .
> +.Pp
>  .Nm veb
>  is a learning bridge that maintains a table of Ethernet addresses
>  and the port that each address is reachable with.
> Index: share/man/man4//vether.4
> ===
> RCS file: /home/reposync/src/share/man/man4/vether.4,v
> retrieving revision 1.5
> diff -u -p -r1.5 vether.4
> --- share/man/man4//vether.4  17 Oct 2017 22:47:58 -  1.5
> +++ share/man/man4//vether.4  27 Oct 2021 06:29:54 -
> @@ -30,6 +30,15 @@ standard network frames with an Ethernet
>  for use as a member in a
>  .Xr bridge 4 .
>  .Pp
> +A
> +.Nm
> +interface can be created at runtime using the
> +.Ic ifconfig vether Ns Ar N Ic create
> +command or by setting up a
> +.Xr hostname.if 5
> +configuration file for
> +.Xr netstart 8 .
> +.Pp
>  To use
>  .Nm
>  the administrator needs to configure an address onto the interface



Re: demystify vport(4) in vport(4) and ifconfig(8)

2021-10-26 Thread David Gwynne
On Tue, Oct 26, 2021 at 09:18:30PM +0200, Solene Rapenne wrote:
> I tried to figure out how to use veb interfaces but the man page
> wasn't obvious in regards to the "vport" thing. It turns out it's
> a kind of interface that can be created with ifconfig.
> 
> I think we should make this clearer.

agreed. the man page for veb/vport is definitely not... rigorous.

> Because ifconfig(8) mentions many type of interfaces I've searched
> for "vport" without success while "most" types are referenced in
> the man page. Like I added veb(4) recently, the diff adds vport(4)
> and missing mpip(4) so a search would give a clue it's related to
> ifconfig.

I'm ok with the ifconfig chunk.

> in veb(4), I think we should add vport in the synopsis because the
> man page is shared for veb and vport interfaces but at first look
> it seems only veb is a type of interface.

The synopsis shows what you put into a kernel config file (eg 
src/sys/conf/GENERIC) to enable the driver, but "pseudo-device
vport" is not valid kernel config. You enable the veb driver and that
one driver provides both veb and vport interfaces. Another example of
this is the gre driver which provides gre, egre, mgre, nvgre, and eoip
interfaces.

> And finally, I added a mention that vport can be created with
> ifconfig(8) so it's really obvious. Maybe it's too much and can be
> removed.

It should definitely be said. The other man pages for clonable
interfaces generally have a paragraph like this:

.Nm gre ,
.Nm mgre ,
.Nm egre ,
and
.Nm nvgre
interfaces can be created at runtime using the
.Ic ifconfig iface Ns Ar N Ic create
command or by setting up a
.Xr hostname.if 5
configuration file for
.Xr netstart 8 .

I just noticed vether.4 is also missing a paragraph like that too :(

> comments? ok?

Apart from it not being obvious where vport interfaces come from, is
there anything else not obvious about veb?

> Index: share/man/man4//veb.4
> ===
> RCS file: /home/reposync/src/share/man/man4/veb.4,v
> retrieving revision 1.2
> diff -u -p -r1.2 veb.4
> --- share/man/man4//veb.4 23 Feb 2021 11:43:41 -  1.2
> +++ share/man/man4//veb.4 26 Oct 2021 19:10:17 -
> @@ -23,6 +23,7 @@
>  .Nd Virtual Ethernet Bridge network device
>  .Sh SYNOPSIS
>  .Cd "pseudo-device veb"
> +.Cd "pseudo-device vport"
>  .Sh DESCRIPTION
>  The
>  .Nm veb
> @@ -37,7 +38,9 @@ by
>  .Nm veb
>  by creating a
>  .Nm vport
> -interface and attaching it as a port to the bridge.
> +interface using
> +.Xr ifconfig 8
> +and attaching it as a port to the bridge.
>  From the perspective of the host network stack, a
>  .Nm vport
>  interface acts as a normal interface connected to an Ethernet
> Index: sbin/ifconfig//ifconfig.8
> ===
> RCS file: /home/reposync/src/sbin/ifconfig/ifconfig.8,v
> retrieving revision 1.375
> diff -u -p -r1.375 ifconfig.8
> --- sbin/ifconfig//ifconfig.8 18 Aug 2021 18:10:33 -  1.375
> +++ sbin/ifconfig//ifconfig.8 26 Oct 2021 19:13:27 -
> @@ -192,6 +192,7 @@ At least the following devices can be cr
>  .Xr lo 4 ,
>  .Xr mgre 4 ,
>  .Xr mpe 4 ,
> +.Xr mpip 4 ,
>  .Xr mpw 4 ,
>  .Xr nvgre 4 ,
>  .Xr pair 4 ,
> @@ -209,6 +210,7 @@ At least the following devices can be cr
>  .Xr veb 4 ,
>  .Xr vether 4 ,
>  .Xr vlan 4 ,
> +.Xr vport 4 ,
>  .Xr vxlan 4 ,
>  .Xr wg 4
>  .It Cm debug
> 



fix uchcom(4) handling of parity and character size config

2021-10-26 Thread David Gwynne
i bought some random usb to rs485 serial adapters so i can talk
modbus to a thing, but then discovered i can't talk to the modbus
thing because uchcom doesn't support configuring parity.

this ports the functionality to support configuring parity and char size
masks from netbsd src/sys/dev/usb/uchcom.c r1.26. part of that change
included tweaks to uchcom_reset_chip, which was then changed in r1.28
back to what we already have, so i left that chunk out.

ive tested this talking to a device at 19200 with cs8 and even
parity. more tests would be appreciated to make sure i haven't
broken existing functionality.
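
if anyone wants to poke at it, something like this should exercise
the new code paths (a rough sketch; assumes the adapter attaches and
shows up as ttyU0/cuaU0):

# stty -f /dev/ttyU0 19200 cs8 parenb -parodd
# cu -l cuaU0 -s 19200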

Index: uchcom.c
===
RCS file: /cvs/src/sys/dev/usb/uchcom.c,v
retrieving revision 1.28
diff -u -p -r1.28 uchcom.c
--- uchcom.c31 Jul 2020 10:49:33 -  1.28
+++ uchcom.c26 Oct 2021 20:33:36 -
@@ -72,9 +72,8 @@ int   uchcomdebug = 0;
 #define UCHCOM_REG_BPS_DIV 0x13
 #define UCHCOM_REG_BPS_MOD 0x14
 #define UCHCOM_REG_BPS_PAD 0x0F
-#define UCHCOM_REG_BREAK1  0x05
-#define UCHCOM_REG_BREAK2  0x18
-#define UCHCOM_REG_LCR1	0x18
+#define UCHCOM_REG_BREAK   0x05
+#define UCHCOM_REG_LCR 0x18
 #define UCHCOM_REG_LCR2	0x25
 
 #define UCHCOM_VER_20  0x20
@@ -83,11 +82,25 @@ int uchcomdebug = 0;
 #define UCHCOM_BPS_MOD_BASE	20000000
 #define UCHCOM_BPS_MOD_BASE_OFS	1100
 
+#define UCHCOM_BPS_PRE_IMM 0x80	/* CH341: immediate RX forwarding */
+
 #define UCHCOM_DTR_MASK0x20
 #define UCHCOM_RTS_MASK0x40
 
-#define UCHCOM_BRK1_MASK   0x01
-#define UCHCOM_BRK2_MASK   0x40
+#define UCHCOM_BREAK_MASK  0x01
+
+#define UCHCOM_LCR_CS5 0x00
+#define UCHCOM_LCR_CS6 0x01
+#define UCHCOM_LCR_CS7 0x02
+#define UCHCOM_LCR_CS8 0x03
+#define UCHCOM_LCR_STOPB   0x04
+#define UCHCOM_LCR_PARENB  0x08
+#define UCHCOM_LCR_PARODD  0x00
+#define UCHCOM_LCR_PAREVEN 0x10
+#define UCHCOM_LCR_PARMARK 0x20
+#define UCHCOM_LCR_PARSPACE	0x30
+#define UCHCOM_LCR_TXE 0x40
+#define UCHCOM_LCR_RXE 0x80
 
 #define UCHCOM_INTR_STAT1  0x02
 #define UCHCOM_INTR_STAT2  0x03
@@ -577,23 +590,21 @@ int
 uchcom_set_break(struct uchcom_softc *sc, int onoff)
 {
usbd_status err;
-   uint8_t brk1, brk2;
+   uint8_t brk, lcr;
 
-   err = uchcom_read_reg(sc, UCHCOM_REG_BREAK1, &brk1, UCHCOM_REG_BREAK2,
-   &brk2);
+   err = uchcom_read_reg(sc, UCHCOM_REG_BREAK, &brk, UCHCOM_REG_LCR, &lcr);
if (err)
return EIO;
if (onoff) {
/* on - clear bits */
-   brk1 &= ~UCHCOM_BRK1_MASK;
-   brk2 &= ~UCHCOM_BRK2_MASK;
+   brk &= ~UCHCOM_BREAK_MASK;
+   lcr &= ~UCHCOM_LCR_TXE;
} else {
/* off - set bits */
-   brk1 |= UCHCOM_BRK1_MASK;
-   brk2 |= UCHCOM_BRK2_MASK;
+   brk |= UCHCOM_BREAK_MASK;
+   lcr |= UCHCOM_LCR_TXE;
}
-   err = uchcom_write_reg(sc, UCHCOM_REG_BREAK1, brk1, UCHCOM_REG_BREAK2,
-   brk2);
+   err = uchcom_write_reg(sc, UCHCOM_REG_BREAK, brk, UCHCOM_REG_LCR, lcr);
if (err)
return EIO;
 
@@ -665,23 +676,50 @@ uchcom_set_dte_rate(struct uchcom_softc 
 int
 uchcom_set_line_control(struct uchcom_softc *sc, tcflag_t cflag)
 {
-   /*
-* XXX: it is difficult to handle the line control appropriately:
-*   work as chip default - CS8, no parity, !CSTOPB
-*   other modes are not supported.
-*/
+   usbd_status err;
+   uint8_t lcr = 0, lcr2 = 0;
+
+   err = uchcom_read_reg(sc, UCHCOM_REG_LCR, &lcr,
+   UCHCOM_REG_LCR2, &lcr2);
+   if (err) {
+   printf("%s: cannot get LCR: %s\n",
+   sc->sc_dev.dv_xname, usbd_errstr(err));
+   return EIO;
+   }
+
+   lcr = UCHCOM_LCR_RXE | UCHCOM_LCR_TXE;
 
switch (ISSET(cflag, CSIZE)) {
case CS5:
+   lcr |= UCHCOM_LCR_CS5;
+   break;
case CS6:
+   lcr |= UCHCOM_LCR_CS6;
+   break;
case CS7:
-   return EINVAL;
+   lcr |= UCHCOM_LCR_CS7;
+   break;
case CS8:
+   lcr |= UCHCOM_LCR_CS8;
break;
}
 
-   if (ISSET(cflag, PARENB) || ISSET(cflag, CSTOPB))
-   return EINVAL;
+   if (ISSET(cflag, PARENB)) {
+   lcr |= UCHCOM_LCR_PARENB;
+   if (!ISSET(cflag, PARODD))
+   lcr |= UCHCOM_LCR_PAREVEN;
+   }
+
+   if (ISSET(cflag, CSTOPB)) {
+   lcr |= UCHCOM_LCR_STOPB;
+   }
+
+   err = uchcom_write_reg(sc, UCHCOM_REG_LCR, lcr, UCHCOM_REG_LCR2, lcr2);
+   if (err) {
+   printf("%s: cannot set LCR: %s\n",
+   sc->sc_dev.dv_xname, usbd_errstr(err));
+   

Re: teach pf to refragment ipv4 packets

2021-08-31 Thread David Gwynne
On Tue, Aug 31, 2021 at 09:34:19PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> This looks like a hack for a problem that should not exist.

I should unconditionally refragment reassembled packets?

> What is the MTU of the outgoing interface on your pf router?  If
> the layer 2 switches do not support 9k jumbo frames, it must be
> 1500.

There are two outgoing interfaces on each of the pf routers, and
they all use 9000 as their MTU. Because we're using OSPF for peering at
the moment, the MTUs have to agree for the protocol to come up.

I could modify or add a route with a low MTU to the endpoint in question,
but there's no way to integrate that with the dynamic routing and
failover handling that OSPF provides.

The network they're connected to is made up of over 2000 switches
though, which is best described as "an ongoing effort to maintain".
In my current situation with the tunnelled packets, the l2 hop with a
1500 byte MTU is not directly connected to any of my hosts.

An example topology for this situation is:

- Tunnel endpoint
  - etherip0 has 1500 byte MTU and -tunneldf
  - em0 has 1500 byte MTU
- PF+OSPF box
 - vlan881 facing tunnel endpoint has 1500 byte MTU
 - vlan362/vlan363 facing campus has 9000 byte MTU
- Campus core
  - here be dragons, but also 9000 byte MTU
- Building router
  - 9000 byte MTU
- distribution switch
  - 1500 byte MTU
- access switch
  - 9000 byte MTU
- Tunnel endpoint
  - bge0 has 1500 byte MTU
  - etherip0 has 1500 byte MTU and -tunneldf

Annoyingly, this would be fine except for that distribution switch with
the 1500 byte mtu. I'm forever grateful that OpenBSD accepts large
packets regardless of what the MTU is set to; it's only limited by the
hardware.

However, I have had a similar problem where the network supports 9k
the whole way through, but the hosts on each end of the network
only handle 1500.

> Why are the outgoing packets not fragmented to the MTU?

Say I have a 4k UDP packet that enters the firewall as 1500 byte
fragments. PF reassembles it, and then sends it as a single 4k frame to
one of my outgoing links.

The second topology is like this:

- A server sending 4k UDP packets
  - ix0 has 1500 byte MTU
- PF+OSPF box
 - vlan82 facing the server has 1500 byte MTU
 - vlan362/vlan363 facing campus has 9000 byte MTU
- Campus core
  - here be dragons, but also 9000 byte MTU
- Building router
  - 9000 byte MTU
- distribution switch
  - 9000 byte MTU
- access switch
  - 9000 byte MTU
- Windows box
  - MTU is 1500, so MRU is also 1500

I guess the overall question is whether an IPv4 packet's size is a hop
by hop thing, or whether the endpoints should have final say about what
they're prepared to deal with.

> Is the dont-fragment flag set?

No. It wasn't set entering the firewall, and it's not set leaving the
firewall.

> Does pf preserve the DF flag when reassembling and forwarding?

Yes, but that's not what's being discussed here. It's vanilla
fragmented packets I'm having trouble with.

> I think we must clear the DF of reassembled forwarded packets.  Do
> you see fragments with DF?

I don't have fragmented packets with DF set.

> If we have fragments with DF and reassemble them, we should have
> the logic like you suggest.  Then we are basically in the same
> sitution as IPv6 and have to preserve the fragment length for path
> MTU discovery.

Agreed.

> I have not seen this in the last 10 years.  And to reduce complexity
> I would prefer to clear the DF instead.  Have you tried the no-df
> option?

There's no DF to clear.

> If your solution fixes a DF problem, and makes the no-df option
> superfluous, and works out of the box, we can consider implementing
> it for DF packets.

I'll have to think about that for a bit.

> 
> bluhm
> 
> On Tue, Aug 31, 2021 at 10:56:34PM +1000, David Gwynne wrote:
> > i am in an annoying situation where i peer with a campus network on an
> > ospf link with a 9k mtu, but some corners of that network have layer 2
> > hops that don't support 9k packets. i sometimes want to tunnel large
> > (1500 byte) packets to hosts in those corners of the network by
> > letting the encapsulation protocol fragment. the tunnel endpoint
> > will then reassemble the packet and forward the full sized frame
> > as if nothing untoward happened.
> >
> > the problem is that pf on the ospf hop "helps" by reassembling these
> > fragmented tunnel packets before sending them out the 9k ospf link.
> > the layer 2 hops then drop the packet because it's too big. i do
> > want pf to reassemble the packet so it can check it, but i also
> > want it to refragment it again afterward.
> >
> > it turns out this is something that happens for ipv6 already, because
> > fragmentation in v6 is only supposed to be done by the endpoints. this
> > diff allows this same semantic for v4 packets if requested.

teach pf to refragment ipv4 packets

2021-08-31 Thread David Gwynne
i am in an annoying situation where i peer with a campus network on an
ospf link with a 9k mtu, but some corners of that network have layer 2
hops that don't support 9k packets. i sometimes want to tunnel large
(1500 byte) packets to hosts in those corners of the network by
letting the encapsulation protocol fragment. the tunnel endpoint
will then reassemble the packet and forward the full sized frame
as if nothing untoward happened.

the problem is that pf on the ospf hop "helps" by reassembling these
fragmented tunnel packets before sending them out the 9k ospf link.
the layer 2 hops then drop the packet because it's too big. i do
want pf to reassemble the packet so it can check it, but i also
want it to refragment it again afterward.

it turns out this is something that happens for ipv6 already, because
fragmentation in v6 is only supposed to be done by the endpoints. this
diff allows this same semantic for v4 packets if requested. to enable
it, configure "set reassemble yes refragment" in your pf.conf and it
will do the same for v4 that it does for v6.
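
ie, in pf.conf:

set reassemble yes refragment

plain "set reassemble yes" keeps the current behaviour where the
reassembled packet is forwarded whole.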

i've only tested this lightly and now i need sleep. anyone have any
thoughts on this?

note that m_tag_find is really cheap if the tag doesn't exist, thanks to
henning@.

Index: sys/net/pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1122
diff -u -p -r1.1122 pf.c
--- sys/net/pf.c7 Jul 2021 18:38:25 -   1.1122
+++ sys/net/pf.c31 Aug 2021 12:42:51 -
@@ -6049,6 +6049,7 @@ void
 pf_route(struct pf_pdesc *pd, struct pf_state *s)
 {
struct mbuf *m0;
+   struct m_tag		*mtag;
struct mbuf_list fml;
struct sockaddr_in  *dst, sin;
struct rtentry  *rt = NULL;
@@ -6132,6 +6133,15 @@ pf_route(struct pf_pdesc *pd, struct pf_
ip = mtod(m0, struct ip *);
}
 
+   /*
+* If packet has been reassembled by PF earlier, we might have to
+* use pf_refragment4() here to turn it back to fragments.
+*/
+   if ((mtag = m_tag_find(m0, PACKET_TAG_PF_REASSEMBLED, NULL))) {
+   (void) pf_refragment4(&m0, mtag, dst, ifp, rt);
+   goto done;
+   }
+
in_proto_cksum_out(m0, ifp);
 
if (ntohs(ip->ip_len) <= ifp->if_mtu) {
@@ -7357,16 +7367,20 @@ done:
break;
}
 
-#ifdef INET6
/* if reassembled packet passed, create new fragments */
-   if (pf_status.reass && action == PF_PASS && pd.m && fwdir == PF_FWD &&
-   pd.af == AF_INET6) {
+   if (pf_status.reass && action == PF_PASS && pd.m && fwdir == PF_FWD) {
struct m_tag*mtag;
 
-   if ((mtag = m_tag_find(pd.m, PACKET_TAG_PF_REASSEMBLED, NULL)))
+   mtag = m_tag_find(pd.m, PACKET_TAG_PF_REASSEMBLED, NULL);
+   if (mtag == NULL)
+   ; /* no reassembly required */
+#ifdef INET6
+   else if (pd.af == AF_INET6)
action = pf_refragment6(&pd.m, mtag, NULL, NULL, NULL);
-   }
 #endif /* INET6 */
+   else
+   action = pf_refragment4(&pd.m, mtag, NULL, NULL, NULL);
+   }
if (s && action != PF_DROP) {
if (!s->if_index_in && dir == PF_IN)
s->if_index_in = ifp->if_index;
Index: sys/net/pf_norm.c
===
RCS file: /cvs/src/sys/net/pf_norm.c,v
retrieving revision 1.223
diff -u -p -r1.223 pf_norm.c
--- sys/net/pf_norm.c   10 Mar 2021 10:21:48 -  1.223
+++ sys/net/pf_norm.c   31 Aug 2021 12:42:51 -
@@ -782,7 +782,7 @@ pf_reassemble(struct mbuf **m0, int dir,
struct pf_frent *frent;
struct pf_fragment  *frag;
struct pf_frnode key;
-   u_int16_t		 total, hdrlen;
+   u_int16_t		 total, maxlen, hdrlen;
 
/* Get an entry for the fragment queue */
if ((frent = pf_create_fragment(reason)) == NULL)
@@ -821,6 +821,7 @@ pf_reassemble(struct mbuf **m0, int dir,
/* We have all the data */
frent = TAILQ_FIRST(&frag->fr_queue);
KASSERT(frent != NULL);
+   maxlen = frag->fr_maxlen;
total = TAILQ_LAST(&frag->fr_queue, pf_fragq)->fe_off +
TAILQ_LAST(&frag->fr_queue, pf_fragq)->fe_len;
hdrlen = frent->fe_hdrlen;
@@ -843,6 +844,63 @@ pf_reassemble(struct mbuf **m0, int dir,
 
PF_FRAG_UNLOCK();
DPFPRINTF(LOG_INFO, "complete: %p(%d)", m, ntohs(ip->ip_len));
+
+   if (ISSET(pf_status.reass, PF_REASS_REFRAG)) {
+   struct m_tag *mtag;
+   struct pf_fragment_tag *ftag;
+
+   mtag = m_tag_get(PACKET_TAG_PF_REASSEMBLED, sizeof(*ftag),
+   M_NOWAIT);
+   if (mtag == NULL) {
+   REASON_SET(reason, PFRES_MEMORY);
+   return (PF_DROP);
+   }
+
+  

Re: [External] : better use the tokeniser in the pfctl parser

2021-08-31 Thread David Gwynne
On Tue, Aug 31, 2021 at 07:33:40AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> On Tue, Aug 31, 2021 at 02:40:57PM +1000, David Gwynne wrote:
> > handling the "no" option with a token, and "yes" via a string made my
> > eye twitch.
> > 
> > ok? or is the helpful yyerror a nice feature?
> > 
> 
> I actually think it's a nice feature. below is the output
> for the current pfctl we have in tree:

it is nice, but the implementation isn't... rigorous.

> 
>   lumpy$ pfctl -n -f /tmp/pf.conf
>   /tmp/pf.conf:6: invalid value 'nope', expected 'yes' or 'no'
> 
> and output with diff applied:
> 
>   lumpy$ ./pfctl -n -f /tmp/pf.conf
>   /tmp/pf.conf:6: syntax error

but if you try to use a keyword instead of a string, you get this:

dlg@kbuild ~$ echo "set reassemble yes" | pfctl -vnf -
set reassemble yes 
dlg@kbuild ~$ echo "set reassemble no" | pfctl -vnf -  
set reassemble no 
dlg@kbuild ~$ echo "set reassemble nope" | pfctl -vnf -
stdin:1: invalid value 'nope', expected 'yes' or 'no'
dlg@kbuild ~$ echo "set reassemble block" | pfctl -vnf -
stdin:1: syntax error
dlg@kbuild ~$ 

if the tokeniser exposed the buffer it was working on, we could make it
consistent for all arguments:

dlg@kbuild pfctl$ echo "set reassemble yes" | ./obj/pfctl -vnf -   
set reassemble yes 
dlg@kbuild pfctl$ echo "set reassemble no" | ./obj/pfctl -vnf -  
set reassemble no 
dlg@kbuild pfctl$ echo "set reassemble nope" | ./obj/pfctl -vnf -
stdin:1: syntax error
stdin:1: invalid value 'nope', expected 'yes' or 'no'
dlg@kbuild pfctl$ echo "set reassemble block" | ./obj/pfctl -vnf -
stdin:1: syntax error
stdin:1: invalid value 'block', expected 'yes' or 'no'

the extremely rough PoC diff for pfctl that implements this is
below. because the tokeniser handles some operators without using the
buffer, if you give "set reassemble" an operator then you get confusing
output:

dlg@kbuild pfctl$ echo "set reassemble <" | ./obj/pfctl -vnf - 
stdin:1: syntax error
stdin:1: invalid value 'reassemble', expected 'yes' or 'no'

anyway, it might be easier to drop the diff for now.

Index: parse.y
===
RCS file: /cvs/src/sbin/pfctl/parse.y,v
retrieving revision 1.709
diff -u -p -r1.709 parse.y
--- parse.y 1 Feb 2021 00:31:04 -   1.709
+++ parse.y 31 Aug 2021 09:20:38 -
@@ -458,6 +458,8 @@ typedef struct {
int lineno;
 } YYSTYPE;
 
+static u_char   *yytext;
+
 #define PPORT_RANGE1
 #define PPORT_STAR 2
 intparseport(char *, struct range *r, int);
@@ -471,7 +473,7 @@ int parseport(char *, struct range *r, i
 %token PASS BLOCK MATCH SCRUB RETURN IN OS OUT LOG QUICK ON FROM TO FLAGS
 %token RETURNRST RETURNICMP RETURNICMP6 PROTO INET INET6 ALL ANY ICMPTYPE
 %token ICMP6TYPE CODE KEEP MODULATE STATE PORT BINATTO NODF
-%token MINTTL ERROR ALLOWOPTS FILENAME ROUTETO DUPTO REPLYTO NO LABEL
+%token MINTTL ERROR ALLOWOPTS FILENAME ROUTETO DUPTO REPLYTO YES NO LABEL
 %token NOROUTE URPFFAILED FRAGMENT USER GROUP MAXMSS MAXIMUM TTL TOS DROP TABLE
 %token REASSEMBLE ANCHOR SYNCOOKIES
 %token SET OPTIMIZATION TIMEOUT LIMIT LOGINTERFACE BLOCKPOLICY RANDOMID
@@ -3754,16 +3756,11 @@ comma   : ','
;
 
 yesno  : NO{ $$ = 0; }
-   | STRING{
-   if (!strcmp($1, "yes"))
-   $$ = 1;
-   else {
-   yyerror("invalid value '%s', expected 'yes' "
-   "or 'no'", $1);
-   free($1);
-   YYERROR;
-   }
-   free($1);
+   | YES   { $$ = 1; }
+   | error {
+   yyerror("invalid value '%s', expected 'yes' or 'no'",
+   yytext);
+   YYABORT;
}
;
 
@@ -5048,6 +5045,7 @@ lookup(char *s)
{ "urpf-failed",URPFFAILED},
{ "user",   USER},
{ "weight", WEIGHT},
+   { "yes",YES},
};
const struct keywords   *p;
 
@@ -5170,10 +5168,12 @@ findeol(void)
 int
 yylex(void)
 {
-   u_char   buf[8096];
+   static u_char buf[8192];
u_char  *p, *val;
int  quotec, next, c;
int  token;
+
+   yytext = buf;
 
 top:
p = buf;





better use the tokeniser in the pfctl parser

2021-08-30 Thread David Gwynne
handling the "no" option with a token, and "yes" via a string made my
eye twitch.

ok? or is the helpful yyerror a nice feature?

Index: parse.y
===
RCS file: /cvs/src/sbin/pfctl/parse.y,v
retrieving revision 1.709
diff -u -p -r1.709 parse.y
--- parse.y 1 Feb 2021 00:31:04 -   1.709
+++ parse.y 31 Aug 2021 04:36:30 -
@@ -471,7 +471,7 @@ int parseport(char *, struct range *r, i
 %token PASS BLOCK MATCH SCRUB RETURN IN OS OUT LOG QUICK ON FROM TO FLAGS
 %token RETURNRST RETURNICMP RETURNICMP6 PROTO INET INET6 ALL ANY ICMPTYPE
 %token ICMP6TYPE CODE KEEP MODULATE STATE PORT BINATTO NODF
-%token MINTTL ERROR ALLOWOPTS FILENAME ROUTETO DUPTO REPLYTO NO LABEL
+%token MINTTL ERROR ALLOWOPTS FILENAME ROUTETO DUPTO REPLYTO YES NO LABEL
 %token NOROUTE URPFFAILED FRAGMENT USER GROUP MAXMSS MAXIMUM TTL TOS DROP TABLE
 %token REASSEMBLE ANCHOR SYNCOOKIES
 %token SET OPTIMIZATION TIMEOUT LIMIT LOGINTERFACE BLOCKPOLICY RANDOMID
@@ -3754,17 +3754,7 @@ comma: ','
;
 
 yesno  : NO{ $$ = 0; }
-   | STRING{
-   if (!strcmp($1, "yes"))
-   $$ = 1;
-   else {
-   yyerror("invalid value '%s', expected 'yes' "
-   "or 'no'", $1);
-   free($1);
-   YYERROR;
-   }
-   free($1);
-   }
+   | YES   { $$ = 1; }
;
 
 unaryop: '='   { $$ = PF_OP_EQ; }
@@ -5048,6 +5038,7 @@ lookup(char *s)
{ "urpf-failed",URPFFAILED},
{ "user",   USER},
{ "weight", WEIGHT},
+   { "yes",YES},
};
const struct keywords   *p;
 



Re: [External] : Re: if_etherbridge.c vs. parallel forwarding

2021-06-24 Thread David Gwynne
On Thu, Jun 24, 2021 at 09:31:20AM +0200, Alexandr Nedvedicky wrote:
> Hello David,
> 
> 
> > 
> > i think we can get away with not refcounting eb_entry structures at all.
> > either they're in the etherbridge map/table or they're not, and the
> > thing that takes them out of the map while holding the eb_lock mutex
> > becomes responsible for their cleanup.
> > 
> > i feel like most of the original logic can still hold up if we fix my
> > stupid refcnting mistake(s) and do a better job of avoiding a double
> > free.
> 
> I'm not sure. It seems to me the code in your diff deals with
> insert vs. insert race properly. how about delete vs. insert?

hrm. i can half convince myself that the outcome of losing a delete vs
insert race would have a semantically correct outcome, but while i'm
trying to do that it occurs to me that i'm trying to make the code too
clever and i should dumb it down.

> 350 mtx_enter(&eb->eb_lock);
> 351 num = eb->eb_num + (oebe == NULL);
>
> 352 if (num <= eb->eb_max && ebt_insert(eb, nebe) == oebe) {  
>
> 353 /* we won, do the update */   
>
> 354 ebl_insert(ebl, nebe);
> 355 
> 356 if (oebe != NULL) {
> 357 ebl_remove(ebl, oebe);
> 358 ebt_replace(eb, oebe, nebe);
> 359 }
> 360 
> 361 nebe = NULL; /* give nebe reference to the table */
> 362 eb->eb_num = num; 
>
> 363 } else {  
>
> 364 /* we lost, we didn't end up replacing oebe */
> 365 oebe = NULL;
> 366 }
> 367 mtx_leave(&eb->eb_lock);
> 368 
> 
> assume cpu0 got oebe and assumes it is going to perform update (oebe != 
> NULL).
> the cpu1 runs ahead and won mutex (->eb_lock) in etherbridge_del_addr() 
> and
> removed the entry successfully. as soon as cpu1 leaves ->eb_lock, it's
> cpu0's turn. In this case ebt_insert() returns NULL, because there is
> no conflict any more. However 'NULL != oebe'.

in this situation it would look like etherbridge_del_addr ran after nebe
was inserted, which i think is a plausible history.

> I'm not sure we can fix insert vs. delete race properly without atomic
> reference counter.

during the drive to work it occurred to me that we should basically have
the same logic around whether we should insert or replace or do nothing
in both the smr and mutex critical sections.

it at least makes the code easier to understand. i think?

Index: if_etherbridge.c
===
RCS file: /cvs/src/sys/net/if_etherbridge.c,v
retrieving revision 1.6
diff -u -p -r1.6 if_etherbridge.c
--- if_etherbridge.c10 Mar 2021 10:21:47 -  1.6
+++ if_etherbridge.c25 Jun 2021 03:56:37 -
@@ -44,7 +44,6 @@
 
 #include 
 
-static inline void ebe_take(struct eb_entry *);
 static inline void ebe_rele(struct eb_entry *);
 static voidebe_free(void *);
 
@@ -233,16 +232,9 @@ ebt_remove(struct etherbridge *eb, struc
 }
 
 static inline void
-ebe_take(struct eb_entry *ebe)
-{
-   refcnt_take(&ebe->ebe_refs);
-}
-
-static void
 ebe_rele(struct eb_entry *ebe)
 {
-   if (refcnt_rele(&ebe->ebe_refs))
-   smr_call(&ebe->ebe_smr_entry, ebe_free, ebe);
+   smr_call(&ebe->ebe_smr_entry, ebe_free, ebe);
 }
 
 static void
@@ -309,19 +301,21 @@ etherbridge_map(struct etherbridge *eb, 
 
smr_read_enter();
oebe = ebl_find(ebl, eba);
-   if (oebe == NULL)
-   new = 1;
-   else {
+   if (oebe == NULL) {
+   /*
+* peek at the space to see if it's worth trying
+* to make a new entry.
+*/
+   if (eb->eb_num < eb->eb_max)
+   new = 1;
+   } else {
if (oebe->ebe_age != now)
oebe->ebe_age = now;
 
/* does this entry need to be replaced? */
if (oebe->ebe_type == EBE_DYNAMIC &&
-   !eb_port_eq(eb, oebe->ebe_port, port)) {
+   !eb_port_eq(eb, oebe->ebe_port, port))
new = 1;
-   ebe_take(oebe);
-   } else
-   oebe = NULL;
}
smr_read_leave();
 
@@ -342,7 +336,6 @@ etherbridge_map(struct etherbridge *eb, 
}
 
smr_init(&nebe->ebe_smr_entry);
-   refcnt_init(&nebe->ebe_refs);
nebe->ebe_etherbridge = eb;
 
nebe->ebe_addr = eba;
@@ -351,40 +344,49 @@ etherbridge_map(struct etherbridge *eb, 

Re: if_etherbridge.c vs. parallel forwarding

2021-06-23 Thread David Gwynne
On Sat, Jun 19, 2021 at 12:32:04AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> skip reading if you are not interested in L2 switching combined
> with bluhm's diff [1], which enables parallel forwarding.
> 
> Hrvoje gave it a try and soon discovered some panics. Diff below
> fixes a panic indicated by stack as follows:

nice.

> login: panic: kernel diagnostic assertion "smr->smr_func == NULL"\
> failed: file "/home/sasha/src.sashan/sys/kern/kern_smr.c", line 247
> Stopped at  db_enter+0x10:  popq%rbp
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
>  168970  15734  0 0x14000  0x2003  softnet
>   92200  58362  0 0x14000  0x2005  softnet
>  195539  36092  0 0x14000  0x2002  softnet
> *162819  18587  0 0x14000  0x2001  softnet
> db_enter() at db_enter+0x10
> panic(81e0483c) at panic+0xbf
> __assert(81e6d854,81e6b008,f7,81e6b033) at 
> __assert+0x2b
> smr_call_impl(fd83936c7160,810ef0f0,fd83936c7100,0) \
> at smr_call_impl+0xd4
> veb_port_input(80082048,fd80cccaef00,90e2ba33b4a1,8015f900)\
> at veb_port_input+0x2fa
> ether_input(80082048,fd80cccaef00) at ether_input+0xf5
> if_input_process(80082048,800022c62388) at if_input_process+0x6f
> ifiq_process(80082458) at ifiq_process+0x69
> taskq_thread(8002f080) at taskq_thread+0x81
> end trace frame: 0x0, count: 6
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports.  Insufficient info makes it difficult to find and fix bugs.
> 
> Hrvoje knows all details [2] how to wire things up to trigger the
> crash. I'm just using all HW and scripts he kindly provided me
> to reproduce those panics reliably.
> 
> I think the crash comes from combination of SMR and reference
> counting done by atomic ops. Let's assume two cpus
> are trying to update the same entry.
> 
> Let's assume one CPU (cpu0) just found oebe using ebl_find() at line 311:
> 
> 310 smr_read_enter();
> 311 oebe = ebl_find(ebl, eba);
> 312 if (oebe == NULL)
> 313 new = 1;
> 314 else {
> 315 if (oebe->ebe_age != now)
> 316 oebe->ebe_age = now;
> 317 
> 318 /* does this entry need to be replaced? */
> 319 if (oebe->ebe_type == EBE_DYNAMIC &&
> 320 !eb_port_eq(eb, oebe->ebe_port, port)) {
> 321 new = 1;
> 322 ebe_take(oebe);
> 323 } else
> 
> few ticks later the other CPU (cpu1) runs ahead. It just removed
> the same entry found by cpu0 at line 360:
> 353 mtx_enter(&eb->eb_lock);
> 354 num = eb->eb_num + (oebe == NULL);
> 355 if (num <= eb->eb_max && ebt_insert(eb, nebe) == oebe) {
> 356 /* we won, do the update */
> 357 ebl_insert(ebl, nebe);
> 358 
> 359 if (oebe != NULL) {
> 360 ebl_remove(ebl, oebe);
> 361 ebt_replace(eb, oebe, nebe);
> 362 
> 
> let's further assume cpu1 reaches line 389:
> 383 if (oebe != NULL) {
> 384 /*
> 385  * the old entry could be referenced in
> 386  * multiple places, including an smr read
> 387  * section, so release it properly.
> 388  */
> 389 ebe_rele(oebe);
> 390 }
> before cpu0 reaches line 322 (ebe_take()). If that happens cpu1 drops
> the last reference to oebe, which is in fact shared between cpu1 and cpu0.
> if cpu1 sees the reference count is zero, it does smr_call(), which schedules
> ebe_free() on oebe, so oebe can be freed when cpu0 is done with it.
> 
> few ticks later cpu0 reaches line 322 and takes its reference to oebe.
> the cpu0 enters critical section and sees it lost race (because
> ebt_insert() != oebe). cpu0 continues at line 389. it drops last
> reference and calls smr_call(). It trips the assert, because ebe_free()
> is scheduled already by cpu1.
> 
> diff below fixes the flaw by introducing `cebe` (conflicting ebe) local
> variable:

i think your real fix is where you stop taking an oebe reference
from the smr critical section.

> 345 mtx_enter(>eb_lock);
> 346 num = eb->eb_num + (oebe == NULL);
> 347 cebe = NULL;
> 348 if (num <= eb->eb_max) {
> 349 cebe = ebt_insert(eb, nebe);
> 350 
> 351 if (cebe == NULL) {
> 352 /* nebe got inserted without conflict */
> 353 eb->eb_num++;
> 354 ebl_insert(ebl, nebe);
> 355 nebe = NULL;
> 356 } else if ((oebe != NULL) && (oebe == cebe)) {
> 357 /* we won, do the update */
> 358 ebl_insert(ebl, nebe);
> 359 

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-21 Thread David Gwynne
On Wed, Jun 16, 2021 at 02:59:19AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> > 
> > as above, copyout with a sleeping lock is fine.
> > 
> > the whole point of my change is to give us the ability to lock in the
> > forwarding path separately to locking in the ioctl path. half of that is
> > so we can copyout safely. the other half is to avoid letting the ioctl
> > path block packet processing if we can avoid it as an alternative to
> > having the network stack having to yield the cpu.
> 
> I see. my confusion came from the fact I'd forgotten pflock got turned into
> a mutex when we saw the crash.
> 
> > 
> > > let's take a look at this part of pf_purge_expired_states()
> > > from your diff:
> > > 
> > > 1543 NET_LOCK();
> 1544 rw_enter_write(&pf_state_list.pfs_rwl);
> > > 1545 PF_LOCK();
> > > 1546 PF_STATE_ENTER_WRITE();
> 1547 SLIST_FOREACH(st, &gcl, gc_list) {
> > > 1548 if (st->timeout != PFTM_UNLINKED)
> > > 1549 pf_remove_state(st);
> > > 1550 
> > > 1551 pf_free_state(st);
> > > 1552 }
> > > 1553 PF_STATE_EXIT_WRITE();
> > > 1554 PF_UNLOCK();
> 1555 rw_exit_write(&pf_state_list.pfs_rwl);
> > > 
> > > at line 1543 we grab NET_LOCK(), at line 1544 we are trying
> > > to grab new lock (pf_state_list.pfs_rwl) exclusively. 
> > > 
> > > with your change we might be running into situation, where we do 
> > > copyout() as a
> > > reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to 
> > > acquire
> > > pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who 
> > > might be
> > > doing uvm_fault() in copyout(9f).
> > > 
> > > I'm just worried we may be trading one bug for another bug.  may be my 
> > > concern
> > > is just a false alarm here. I don't know.
> > 
> > no, these things are all worth discussing.
> > 
> > it's definitely possible there's bugs in here, but im pretty confident
> > it's not the copyout one.
> > 
> 
> it seems to work. I'm running your diff with bluhm's parallel diff
> and do occasional pfctl -Fs/pfctl -ss under a load. so far so good.
> 
> 
> 
> > > I guess 'pfgpurge_expired_fragment(s);' is unintentional change, 
> > > right?
> > 
> > yeah, i don't know how i did that. vi is hard?
> 
> sure it is...
> thanks for fixing the nits.
> 
> 
> 
> > Index: if_pfsync.c
> > ===
> > RCS file: /cvs/src/sys/net/if_pfsync.c,v
> > retrieving revision 1.292
> > diff -u -p -r1.292 if_pfsync.c
> > --- if_pfsync.c 15 Jun 2021 10:10:22 -  1.292
> > +++ if_pfsync.c 15 Jun 2021 11:21:20 -
> > @@ -2545,22 +2545,34 @@ pfsync_bulk_start(void)
> >  {
> > struct pfsync_softc *sc = pfsyncif;
> 
> have not spot anything suspicious in if_pfsync.c
> 
> the new diff reads fine to me.
> 
> OK sashan

I've been running versions of this diff in production at work, and have
hit a few panics and asserts. All the issues we've hit should be
addressed in this diff.

The first issue was that pfsync could be in the process of sending a
deferred packet while it's being removed from the state tree. Part of
that removal process is stripping the state keys from the state, which
pfsync uses to determine the address family so it knows which ip output
routine to use. The quick and dirty fix to this is to have pfsync check
the state's timeout to see if the state is unlinked or not. This currently
relies on pfsync undefer and pf being serialised by the NET_LOCK.

The second is that the timeout member on a state can change while the
purge task is looking at it. We hit this assert in pf_state_expires:

KASSERT(state->timeout != PFTM_UNLINKED);

pf_state_expires was called from the purge code like this:

if ((cur->timeout == PFTM_UNLINKED) ||
(pf_state_expires(cur) <= getuptime()))
SLIST_INSERT_HEAD(&gcl, cur, gc_list);


With my new locking scheme here, the state purge code is called without
any of the locks that would serialise access to the state->timeout
variable. I think I found a solution to this without having to
reintroduce extra locking, which should allow us to keep the purge scan
running concurrently with pf actually handling packets.
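
the shape of it (a sketch based on the pf_state_expires() prototype in
the diff below; the local variable name is mine) is to read
state->timeout once and pass the snapshot in, so the check and the
expiry calculation agree even if the state changes underneath us:

	uint8_t stimeout = cur->timeout;	/* read once */

	if (stimeout == PFTM_UNLINKED ||
	    pf_state_expires(cur, stimeout) <= getuptime())
		SLIST_INSERT_HEAD(&gcl, cur, gc_list);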

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1118
diff -u -p -r1.1118 pf.c
--- pf.c1 Jun 2021 09:57:11 -   1.1118
+++ pf.c22 Jun 2021 01:34:36 -
@@ -259,6 +259,7 @@ void pf_state_key_link_inpcb(struct 
p
 voidpf_state_key_unlink_inpcb(struct pf_state_key *);
 voidpf_inpcb_unlink_state_key(struct inpcb *);
 voidpf_pktenqueue_delayed(void *);
+int32_t pf_state_expires(const struct pf_state *, uint8_t);
 
 #if NPFLOG > 0
 void   

Re: ifnewlladdr spl

2021-06-15 Thread David Gwynne



> On 16 Jun 2021, at 00:39, Martin Pieuchot  wrote:
> 
> On 15/06/21(Tue) 22:52, David Gwynne wrote:
>> On Mon, Jun 14, 2021 at 10:07:58AM +0200, Martin Pieuchot wrote:
>>> On 10/06/21(Thu) 19:17, Alexander Bluhm wrote:
>> [...] 
>>>> The in6_ functions need netlock.  And driver SIOCSIFFLAGS ioctl
>>>> must not have splnet().
>>> 
>>> Why not?  This is new since the introduction of intr_barrier() or this
>>> is an old issue?
>>> 
>>>> Is reducing splnet() the correct aproach?
>> 
>> yes.
>> 
>>> I doubt it is possible to answer this question without defining who owns
>>> `if_flags' and how can it be read/written to.
>> 
>> NET_LOCK is what "owns" updates to if_flags.
> 
> Why does reducing splnet() is the correct approach?  It isn't clear to
> me.  What's splnet() protecting then?

splnet() and all the other splraise() variants only raise the IPL on the 
current CPU. Unless you have some other lock to coordinate with other CPUs (eg 
KERNEL_LOCK) it doesn't really prevent other code running. ixl in particular 
has mpsafe interrupts, so unless your ioctl code is running on the same CPU 
that ixl is interrupting, it's not helping.

splnet() with KERNEL_LOCK provides backward compat for legacy drivers. The 
reason it doesn't really help with the network stack is that the stack runs 
from nettq under NET_LOCK without KERNEL_LOCK; it's no longer a softint at an 
IPL lower than net.
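
ie, something like this can happen (a sketch, not real code):

	/* cpu0, in an ioctl path */
	s = splnet();	/* raises the IPL on cpu0 only */
	/* fiddle with driver state */
	splx(s);

	/* meanwhile on cpu1, an mpsafe interrupt */
	ixl_intr(sc);	/* runs anyway; cpu0's IPL doesn't stop it */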

dlg

> 
>>>> Index: net/if.c
>>>> ===
>>>> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if.c,v
>>>> retrieving revision 1.641
>>>> diff -u -p -r1.641 if.c
>>>> --- net/if.c   25 May 2021 22:45:09 -  1.641
>>>> +++ net/if.c   10 Jun 2021 14:32:12 -
>>>> @@ -3109,6 +3109,8 @@ ifnewlladdr(struct ifnet *ifp)
>>>>short up;
>>>>int s;
>>>> 
>>>> +  NET_ASSERT_LOCKED();
>>>> +
>>>>s = splnet();
>>>>up = ifp->if_flags & IFF_UP;
>>>> 
>>>> @@ -3116,11 +3118,14 @@ ifnewlladdr(struct ifnet *ifp)
>>>>/* go down for a moment... */
>>>>ifp->if_flags &= ~IFF_UP;
>>>>ifrq.ifr_flags = ifp->if_flags;
>>>> +  splx(s);
>>>>(*ifp->if_ioctl)(ifp, SIOCSIFFLAGS, (caddr_t)&ifrq);
>>>> +  s = splnet();
>>>>}
>>>> 
>>>>ifp->if_flags |= IFF_UP;
>>>>ifrq.ifr_flags = ifp->if_flags;
>>>> +  splx(s);
>>>>(*ifp->if_ioctl)(ifp, SIOCSIFFLAGS, (caddr_t)&ifrq);
>>>> 
>>>> #ifdef INET6
>>>> @@ -3139,11 +3144,12 @@ ifnewlladdr(struct ifnet *ifp)
>>>> #endif
>>>>if (!up) {
>>>>/* go back down */
>>>> +  s = splnet();
>>>>ifp->if_flags &= ~IFF_UP;
>>>>ifrq.ifr_flags = ifp->if_flags;
>>>> +  splx(s);
>>>>(*ifp->if_ioctl)(ifp, SIOCSIFFLAGS, (caddr_t)&ifrq);
>>>>}
>>>> -  splx(s);
>>>> }
>>>> 
>>>> void
>>>> 
>>> 
>> 



Re: timecounting: refactor fraction-to-nanosecond, bintime-to-nanosecond conversions

2021-06-15 Thread David Gwynne



> On 16 Jun 2021, at 01:10, Scott Cheloha  wrote:
> 
> Hi,
> 
> dlg@ just moved nsecuptime() and getnsecuptime() into kern_tc.c.
> 
> To tidy it up I'd like to refactor the fraction-to-nanosecond and
> bintime-to-nanosecond conversions into new functions so we only need
> to write them once.

yes please. every time i copied that code i thought i could do it better and 
wasted time trying. having the macro wrap it up will take that temptation away 
i think.

> 
> ok?

ok.

> 
> Index: sys/time.h
> ===
> RCS file: /cvs/src/sys/sys/time.h,v
> retrieving revision 1.60
> diff -u -p -r1.60 time.h
> --- sys/time.h15 Jun 2021 05:24:47 -  1.60
> +++ sys/time.h15 Jun 2021 13:10:01 -
> @@ -222,11 +222,17 @@ bintimesub(const struct bintime *bt, con
>  *   time_second ticks after N.9 not after N.49
>  */
> 
> +static inline uint32_t
> +FRAC_TO_NSEC(uint64_t frac)
> +{
> + return ((frac >> 32) * 1000000000ULL) >> 32;
> +}
> +
> static inline void
> BINTIME_TO_TIMESPEC(const struct bintime *bt, struct timespec *ts)
> {
>   ts->tv_sec = bt->sec;
> - ts->tv_nsec = (long)(((uint64_t)1000000000 * (uint32_t)(bt->frac >> 
> 32)) >> 32);
> + ts->tv_nsec = FRAC_TO_NSEC(bt->frac);
> }
> 
> static inline void
> @@ -250,6 +256,12 @@ TIMEVAL_TO_BINTIME(const struct timeval 
>   bt->sec = (time_t)tv->tv_sec;
>   /* 18446744073709 = int(2^64 / 1000000) */
>   bt->frac = (uint64_t)tv->tv_usec * (uint64_t)18446744073709ULL;
> +}
> +
> +static inline uint64_t
> +BINTIME_TO_NSEC(const struct bintime *bt)
> +{
> + return bt->sec * 1000000000ULL + FRAC_TO_NSEC(bt->frac);
> }
> #endif
> 
> Index: kern/kern_tc.c
> ===
> RCS file: /cvs/src/sys/kern/kern_tc.c,v
> retrieving revision 1.73
> diff -u -p -r1.73 kern_tc.c
> --- kern/kern_tc.c15 Jun 2021 05:24:46 -  1.73
> +++ kern/kern_tc.c15 Jun 2021 13:10:02 -
> @@ -254,28 +254,18 @@ uint64_t
> nsecuptime(void)
> {
>   struct bintime bt;
> - uint64_t nsec;
> 
>   binuptime(&bt);
> -
> - nsec = (1000000000ULL * (bt.frac >> 32)) >> 32;
> - nsec += bt.sec * 1000000000ULL;
> -
> - return (nsec);
> + return BINTIME_TO_NSEC(&bt);
> }
> 
> uint64_t
> getnsecuptime(void)
> {
>   struct bintime bt;
> - uint64_t nsec;
> 
>   getbinuptime(&bt);
> -
> - nsec = (1000000000ULL * (bt.frac >> 32)) >> 32;
> - nsec += bt.sec * 1000000000ULL;
> -
> - return (nsec);
> + return BINTIME_TO_NSEC(&bt);
> }
> 
> void



Re: [External] : rework pfsync deferral timeout handling

2021-06-15 Thread David Gwynne



> On 14 Jun 2021, at 19:12, Alexandr Nedvedicky 
>  wrote:
> 
> Hello,
> 
> looks good to me. I think this should be committed
> as-is. I have just one question,
> 
> On Mon, Jun 14, 2021 at 01:58:06PM +1000, David Gwynne wrote:
> 
>> @@ -1931,6 +1933,9 @@ pfsync_defer(struct pf_state *st, struct
>> {
>>  struct pfsync_softc *sc = pfsyncif;
>>  struct pfsync_deferral *pd;
>> +struct timeval now;
>> +unsigned int sched;
>> +static const struct timeval defer = { 0, 20000 };
>^^^
>I'm just curious, why there is a static?

so it can exist in the ro data section rather than get set up on the stack 
every call.
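
ie, roughly the difference between (a sketch):

	/* emitted once into .rodata at compile time */
	static const struct timeval defer = { 0, 20000 };

and:

	/* rebuilt with stores on the stack on every call */
	struct timeval defer = { 0, 20000 };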

>> 
>>  NET_ASSERT_LOCKED();
>> 
>> @@ -1942,10 +1947,12 @@ pfsync_defer(struct pf_state *st, struct
>>  if (sc->sc_deferred >= 128) {
>>  mtx_enter(&sc->sc_deferrals_mtx);
>>  pd = TAILQ_FIRST(&sc->sc_deferrals);
>> -TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
>> -sc->sc_deferred--;
>> +if (pd != NULL) {
>> +TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
>> +sc->sc_deferred--;
>> +}
>>  mtx_leave(&sc->sc_deferrals_mtx);
>> -if (timeout_del(&pd->pd_tmo))
>> +if (pd != NULL)
>>  pfsync_undefer(pd, 0);
> 
>yes, this still needs fix, we discuss in other thread [1].
> 
>please go ahead and commit your change here, so I can refresh
>my patches.
> 
> 
> OK sashan@



Re: ifnewlladdr spl

2021-06-15 Thread David Gwynne
On Mon, Jun 14, 2021 at 10:07:58AM +0200, Martin Pieuchot wrote:
> On 10/06/21(Thu) 19:17, Alexander Bluhm wrote:
> > Hi,
> > 
> > I have seen this crash trace on a 6.6 based system, but I think the
> > bug exists still in -current.  It happened when an ixl(4) interface
> > was removed from trunk(4).

i think it is fixed in later ixl(4). ixl_down coordinates with the
interrupt handler with the IFF_RUNNING flag and an interrupt barrier.
6.6 ixl_intr() didn't check IFF_RUNNING, but it does now. eg:

@@ -2852,7 +3323,7 @@ ixl_rxrinfo(struct ixl_softc *sc, struct
 }
 
 static int
-ixl_intr(void *xsc)
+ixl_intr0(void *xsc)
 {
struct ixl_softc *sc = xsc;
struct ifnet *ifp = >sc_ac.ac_if;
@@ -2873,18 +3344,63 @@ ixl_intr(void *xsc)
rv = 1;
}
 
-   if (ISSET(icr, I40E_INTR_NOTX_RX_MASK))
-   rv |= ixl_rxeof(sc, ifp->if_iqs[0]);
-   if (ISSET(icr, I40E_INTR_NOTX_TX_MASK))
-   rv |= ixl_txeof(sc, ifp->if_ifqs[0]);
+   if (ISSET(ifp->if_flags, IFF_RUNNING)) {
+   struct ixl_vector *iv = sc->sc_vectors;
+   if (ISSET(icr, I40E_INTR_NOTX_RX_MASK))
+   rv |= ixl_rxeof(sc, iv->iv_rxr);
+   if (ISSET(icr, I40E_INTR_NOTX_TX_MASK))
+   rv |= ixl_txeof(sc, iv->iv_txr);
+   }


> > ifnewlladdr() is interrupted by ixl transmit interrupt.  There it
> > crashes in ixl_txeof as txr is NULL.  The code in -current if_ixl.c
> > has changed, so it might not happen anymore.  But I think the bug
> > is in ifnewlladdr().
> 
> Hard to say.

i hate how ifnewlladdr works, but i'm leaning toward 6.6 ixl as the thing
to blame for this one.

> > ifnewlladdr() sets splnet() and configures the interface up and
> > down.  The ixl_down() code has some interrupt barriers which cannot
> > work while interrupts are blocked by splnet().  So interrupts fire
> > at splx() when the driver does not expect them.
> 
> If intr_barrier() or ixl_down() need a certain IPL level to properly
> work then something has been overlooked.  Should we add an assert?

intr_barrier is fine. ixl_down assumed something that wasn't present.

> > Combining interrupt barriers with spl protection looks like a bad
> > idea.
> > 
> > Is there anything that lowers spl in all cases during intr_barrier(),
> > ifq_barrier() or timeout_del_barrier()?
> > 
> > How should spls work together with barriers?
> > 
> > The integrity of ifnewlladdr() state should be guaranteed by netlock.
> >
> > Changing if_flags needs splnet() as they are used by all drivers.
> 
> This isn't clear to me.  splnet() used to be needed but nowadays this
> seems to questionable depending on the driver.

updates to if_flags need to be serialised, but in a lot of cases
you can read if_flags without a lock because there's no inconsistent
intermediate state that the reader can see. a reader will either
see the flag set or cleared, there's nothing in between.
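
ie, something like this (a sketch):

	/* writers are serialised by NET_LOCK */
	NET_LOCK();
	SET(ifp->if_flags, IFF_RUNNING);
	NET_UNLOCK();

	/* readers can peek locklessly; they see the flag or they don't */
	if (!ISSET(ifp->if_flags, IFF_RUNNING))
		return;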

the trick ixl is trying to do is to ensure that the driver data the
interrupt side uses is allocated and available when IFF_RUNNING is set.
the intr side in 6.6 wasn't checking that though.

> > The in6_ functions need netlock.  And driver SIOCSIFFLAGS ioctl
> > must not have splnet().
> 
> Why not?  This is new since the introduction of intr_barrier() or this
> is an old issue?
> 
> > Is reducing splnet() the correct aproach?

yes.

> I doubt it is possible to answer this question without defining who owns
> `if_flags' and how can it be read/written to.

NET_LOCK is what "owns" updates to if_flags.

> I'd question if splnet() is needed at all here.  Why is it here in the
> first place?  I'd guess to prevent the interrupt handler to run while 
> SIOCSIFFLAGS ioctl is being executed...  Your diff suggest something
> else...
> 
> > Index: net/if.c
> > ===
> > RCS file: /data/mirror/openbsd/cvs/src/sys/net/if.c,v
> > retrieving revision 1.641
> > diff -u -p -r1.641 if.c
> > --- net/if.c25 May 2021 22:45:09 -  1.641
> > +++ net/if.c10 Jun 2021 14:32:12 -
> > @@ -3109,6 +3109,8 @@ ifnewlladdr(struct ifnet *ifp)
> > short up;
> > int s;
> >  
> > +   NET_ASSERT_LOCKED();
> > +
> > s = splnet();
> > up = ifp->if_flags & IFF_UP;
> >  
> > @@ -3116,11 +3118,14 @@ ifnewlladdr(struct ifnet *ifp)
> > /* go down for a moment... */
> > ifp->if_flags &= ~IFF_UP;
> > ifrq.ifr_flags = ifp->if_flags;
> > +   splx(s);
> > (*ifp->if_ioctl)(ifp, SIOCSIFFLAGS, (caddr_t)&ifrq);
> > +   s = splnet();
> > }
> >  
> > ifp->if_flags |= IFF_UP;
> > ifrq.ifr_flags = ifp->if_flags;
> > +   splx(s);
> > (*ifp->if_ioctl)(ifp, SIOCSIFFLAGS, (caddr_t)&ifrq);
> >  
> >  #ifdef INET6
> > @@ -3139,11 +3144,12 @@ ifnewlladdr(struct ifnet *ifp)
> >  #endif
> > if (!up) {
> > /* go back down */
> > +   s = splnet();
> > ifp->if_flags 

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-15 Thread David Gwynne
On Tue, Jun 08, 2021 at 06:54:25PM +0200, Alexandr Nedvedicky wrote:
> Hello David,
> 
> I'm still not sure if your change is the ultimate fix, or just significantly
> minimizes the risk of the bug. If I understand things right, the problem we
> are trying to solve:

> DIOCGETSTATES we have in current, grabs NET_LOCK() and pf_state_lock
> as a reader.
> 
> it then walks through the whole state list and copies out (copyout(9f))
> data for each state into ioctl(2) buffer provided by calling process.

yep. it's not an ultimate fix. its main goal is to allow the ioctl
side of things to read the states safely without blocking the packet
processing side of things.

> we may trip assert down in copyout(9f):
> > panic: acquiring blockable sleep lock with spinlock or critical section
> > held (rwlock) vmmaplk

this is about taking an rwlock (vmmaplk) while holding a mutex
(pflock). you can copyout while holding sleeping locks (eg, NET_LOCK
or the rwlock version of the pflocks).

the pfs_rwl is a sleeping lock, so it's safe to copyout while holding
it. that's why this diff works.
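
ie, the difference is roughly this (a sketch, names simplified):

	/* fine: pfs_rwl can sleep across a fault in copyout */
	rw_enter_read(&pf_state_list.pfs_rwl);
	error = copyout(&store, buf, sizeof(store));
	rw_exit_read(&pf_state_list.pfs_rwl);

	/* not fine: the fault can sleep, mutexes must not */
	mtx_enter(&some_mtx);	/* some_mtx is a stand-in */
	error = copyout(&store, buf, sizeof(store));
	mtx_leave(&some_mtx);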

> > Stopped at  db_enter+0x10:  popq%rbp
> > TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> > *512895  28841  0 0x3  03K pfctl
> > db_enter() at db_enter+0x10
> > panic(81e19411) at panic+0x12a
> > witness_checkorder(fd83b09b4d18,1,0) at witness_checkorder+0xbce
> > rw_enter_read(fd83b09b4d08) at rw_enter_read+0x38
> > uvmfault_lookup(8000238e3418,0) at uvmfault_lookup+0x8a
> > uvm_fault_check(8000238e3418,8000238e3450,8000238e3478) at
> > uvm_fault_check+0x32
> > uvm_fault(fd83b09b4d00,e36553c000,0,2) at uvm_fault+0xfc
> > kpageflttrap(8000238e3590,e36553c000) at kpageflttrap+0x131
> > kerntrap(8000238e3590) at kerntrap+0x91
> > alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> > copyout() at copyout+0x53
> 
> 
> I'm just afraid that although your change significantly reduces the risk that
> we will die with a similar call stack to the one above, the new code is not
> bulletproof. We still do copyout() while holding pf_state_list.pfs_rwl as a
> reader (in pf_states_get() from your diff). I agree packets do not grab
> pf_state_list.pfs_rwl at all. Your fix solves this problem in this respect.

as above, copyout with a sleeping lock is fine.

the whole point of my change is to give us the ability to lock in the
forwarding path separately to locking in the ioctl path. half of that is
so we can copyout safely. the other half is to avoid letting the ioctl
path block packet processing if we can avoid it as an alternative to
having the network stack having to yield the cpu.

> let's take a look at this part of pf_purge_expired_states()
> from your diff:
> 
> 1543 NET_LOCK();
> 1544 rw_enter_write(&pf_state_list.pfs_rwl);
> 1545 PF_LOCK();
> 1546 PF_STATE_ENTER_WRITE();
> 1547 SLIST_FOREACH(st, &gcl, gc_list) {
> 1548 if (st->timeout != PFTM_UNLINKED)
> 1549 pf_remove_state(st);
> 1550 
> 1551 pf_free_state(st);
> 1552 }
> 1553 PF_STATE_EXIT_WRITE();
> 1554 PF_UNLOCK();
> 1555 rw_exit_write(&pf_state_list.pfs_rwl);
> 
> at line 1543 we grab NET_LOCK(), at line 1544 we are trying
> to grab new lock (pf_state_list.pfs_rwl) exclusively. 
> 
> with your change we might be running into situation, where we do copyout() as 
> a
> reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to 
> acquire
> pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who might 
> be
> doing uvm_fault() in copyout(9f).
> 
> I'm just worried we may be trading one bug for another bug.  may be my concern
> is just a false alarm here. I don't know.

no, these things are all worth discussing.

it's definitely possible there's bugs in here, but im pretty confident
it's not the copyout one.

> anyway there are a few more nits in your diff.

yep.

> 
> 
> > Index: pf.c
> > ===
> > RCS file: /cvs/src/sys/net/pf.c,v
> > retrieving revision 1.1118
> > diff -u -p -r1.1118 pf.c
> > --- pf.c1 Jun 2021 09:57:11 -   1.1118
> > +++ pf.c3 Jun 2021 06:24:48 -
> > @@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void)
> >  void
> >  pf_purge_timeout(void *unused)
> >  {
> > -   task_add(net_tq(0), _purge_task);
> > +   /* XXX move to systqmp to avoid KERNEL_LOCK */
> > +   task_add(systq, _purge_task);
> >  }
> 
> I would just clean up the comment. looks like we should be
> able to get pf's ioctl operations out of KERNEL_LOCK completely.
> I'll take a further look at it while working on pf_ioctl.c

this isn't the ioctl side though, this is the periodic gc task. if the pf
locking is right then we could move the task now.

> 
> > @@ -1280,11 +1311,10 @@ pf_purge(void *xnloops)
> >  * Fragments don't require PF_LOCK(), they use their own lock.
> >   

rework pfsync deferral timeout handling

2021-06-13 Thread David Gwynne
pfsync deferrals are used so that if you have firewalls that could both
process packets, you defer sending the initial packet of a state so the
peer can learn about the state before potentially handling packets for
it.

there are three ways that a deferral can end. the preferred one is if a
peer firewall acks the insert of the state, and then this firewall can
send the packet. the second is if the number of deferrals gets too high,
it tries to pop the earliest deferral and push the original packet on
without waiting for the firewall. the third is a timeout expires cos the
peer firewall didn't ack in time, so we push the packet on.

every deferral has its own timeout at the moment. unfortunately there
isn't good coordination between the three different paths that could
clean up the deferral, so you can get a fault like this one:

kernel page fault
uvm_fault(0x82131f60, 0x7, 0, 2) -> e
pfsync_defer_tmo(fd816d66d0d0) at pfsync_defer_tmo+0x41
end trace frame: 0x800024cf2130, count: 0
ddb{0}> tr
pfsync_defer_tmo(fd816d66d0d0) at pfsync_defer_tmo+0x41
softclock_thread(8000efc0) at softclock_thread+0x169
end trace frame: 0x0, count: -2

basically two paths are cleaning up a deferral and one of them ends up
doing a use after free. the timeout handler in this one.

this diff moves things around so there's a single timeout that pfsync
processes a list of deferrals from. this allows us to say that a
deferral is pending if it is on the list, and whoever takes it off the
list is responsible for handling it to completion. the reasoning about
memory lifetimes is a lot easier.
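
roughly, the single timeout ends up as a drain loop shaped like this
(a simplified sketch, not the diff itself; the real code batches the
removals and reschedules the timeout for the next deadline):

	void
	pfsync_deferrals_tmo(void *arg)
	{
		struct pfsync_softc *sc = arg;
		struct pfsync_deferral *pd;
		struct timeval now;

		getmicrouptime(&now);

		for (;;) {
			mtx_enter(&sc->sc_deferrals_mtx);
			pd = TAILQ_FIRST(&sc->sc_deferrals);
			if (pd == NULL ||
			    timercmp(&now, &pd->pd_deadline, <)) {
				mtx_leave(&sc->sc_deferrals_mtx);
				break;
			}
			TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
			sc->sc_deferred--;
			mtx_leave(&sc->sc_deferrals_mtx);

			/* off the list means we own pd now */
			pfsync_undefer(pd, 0);
		}
	}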

we've been running this for a week in production, and it's been good so
far.

i hate the timersub/timercmp macros, so i was going to implement this
with a uint64_t and getnsecuptime, but i can do that later.
ok?

Index: if_pfsync.c
===
RCS file: /cvs/src/sys/net/if_pfsync.c,v
retrieving revision 1.288
diff -u -p -r1.288 if_pfsync.c
--- if_pfsync.c 10 Mar 2021 10:21:48 -  1.288
+++ if_pfsync.c 4 Jun 2021 00:31:56 -
@@ -187,7 +187,7 @@ struct pfsync_deferral {
TAILQ_ENTRY(pfsync_deferral) pd_entry;
struct pf_state *pd_st;
struct mbuf *pd_m;
-   struct timeout   pd_tmo;
+   struct timeval   pd_deadline;
 };
 TAILQ_HEAD(pfsync_deferrals, pfsync_deferral);
 
@@ -223,6 +223,7 @@ struct pfsync_softc {
struct pfsync_deferrals  sc_deferrals;
u_intsc_deferred;
struct mutex sc_deferrals_mtx;
+   struct timeout   sc_deferrals_tmo;
 
void*sc_plus;
size_t   sc_pluslen;
@@ -273,7 +274,7 @@ voidpfsync_ifdetach(void *);
 
 void   pfsync_deferred(struct pf_state *, int);
 void   pfsync_undefer(struct pfsync_deferral *, int);
-void   pfsync_defer_tmo(void *);
+void   pfsync_deferrals_tmo(void *);
 
 void   pfsync_cancel_full_update(struct pfsync_softc *);
 void   pfsync_request_full_update(struct pfsync_softc *);
@@ -346,6 +347,7 @@ pfsync_clone_create(struct if_clone *ifc
mtx_init(&sc->sc_upd_req_mtx, IPL_SOFTNET);
TAILQ_INIT(&sc->sc_deferrals);
mtx_init(&sc->sc_deferrals_mtx, IPL_SOFTNET);
+   timeout_set_proc(&sc->sc_deferrals_tmo, pfsync_deferrals_tmo, sc);
task_set(&sc->sc_ltask, pfsync_syncdev_state, sc);
task_set(&sc->sc_dtask, pfsync_ifdetach, sc);
sc->sc_deferred = 0;
@@ -1931,6 +1933,9 @@ pfsync_defer(struct pf_state *st, struct
 {
struct pfsync_softc *sc = pfsyncif;
struct pfsync_deferral *pd;
+   struct timeval now;
+   unsigned int sched;
+   static const struct timeval defer = { 0, 20000 };
 
NET_ASSERT_LOCKED();
 
@@ -1942,10 +1947,12 @@ pfsync_defer(struct pf_state *st, struct
if (sc->sc_deferred >= 128) {
mtx_enter(&sc->sc_deferrals_mtx);
pd = TAILQ_FIRST(&sc->sc_deferrals);
-   TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
-   sc->sc_deferred--;
+   if (pd != NULL) {
+   TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
+   sc->sc_deferred--;
+   }
mtx_leave(&sc->sc_deferrals_mtx);
-   if (timeout_del(&pd->pd_tmo))
+   if (pd != NULL)
pfsync_undefer(pd, 0);
}
 
@@ -1959,13 +1966,18 @@ pfsync_defer(struct pf_state *st, struct
pd->pd_st = pf_state_ref(st);
pd->pd_m = m;
 
+   getmicrouptime(&now);
+   timeradd(&now, &defer, &pd->pd_deadline);
+
mtx_enter(&sc->sc_deferrals_mtx);
-   sc->sc_deferred++;
+   sched = TAILQ_EMPTY(&sc->sc_deferrals);
+
TAILQ_INSERT_TAIL(&sc->sc_deferrals, pd, pd_entry);
+   sc->sc_deferred++;
mtx_leave(&sc->sc_deferrals_mtx);

-   timeout_set_proc(&pd->pd_tmo, pfsync_defer_tmo, pd);
-   

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-13 Thread David Gwynne
On Sat, Jun 05, 2021 at 01:09:01PM +0200, Alexandr Nedvedicky wrote:
> Hello David,
> 
> 
> 
> > the scope of the pf locks likely needs reduction anyway. one of my
> 
> I agree. pf_lock covers too much in PF currently. it protects,
> all rules, tables and fragment caches.
> 
> > production firewalls panicked with the pf lock trying to lock against
> > itself a couple of nights ago:
> > 
> > db_enter() at db_enter+0x5
> > panic(81d47212) at panic+0x12a
> > rw_enter(82060e70,1) at rw_enter+0x261
> > pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c
> > ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at
> > ip6_output+0xd33
> > nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at
> > nd6_ns
> > _output+0x3e2
> > nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220
> > ,800024d040d8) at nd6_resolve+0x29d
> > ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a
> > e0,800024d040d8) at ether_resolve+0x127
> > ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae
> > 0) at ether_output+0x2a
> > ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180
> > pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac
> > pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d
> > pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe
> > pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f
> > 0,800024d045d8,800024d045fe) at pf_test_rule+0x693
> > pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1
> > ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829
> > ip_forward(fd8065a2,81541800,fd817a7722d0,0) at
> > ip_forward+
> > 0x27a  
> > ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at
> > ip_input
> > _if+0x5fd
> > ipv4_input(81541800,fd8065a2) at ipv4_input+0x37
> > carp_input(8158f000,fd8065a2,5e000158) at
> > carp_input+0x1ac
> > ether_input(8158f000,fd8065a2) at ether_input+0x1c0
> > vlan_input(8152f000,fd8065a2) at vlan_input+0x19a
> > ether_input(8152f000,fd8065a2) at ether_input+0x76
> > if_input_process(801df048,800024d04ca8) at
> > if_input_process+0x5a
> > ifiq_process(801dbe00) at ifiq_process+0x6f
> > taskq_thread(8002b080) at taskq_thread+0x7b
> > end trace frame: 0x0, count: -26
> > 
> 
> diff below postpones the call to pfsync_undefer() to the moment when PF_LOCK()
> is released. I believe this should fix the panic you see. the change does not
> look nice and can be reverted later, when the pf_lock scope is reduced.
> 
> I took a closer look at pfsync_defer() here:
> 
> 1941 
> 1942 if (sc->sc_deferred >= 128) {
> 1943 mtx_enter(&sc->sc_deferrals_mtx);
> 1944 pd = TAILQ_FIRST(&sc->sc_deferrals);
> 1945 TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
> 1946 sc->sc_deferred--;
> 1947 mtx_leave(&sc->sc_deferrals_mtx);
> 1948 if (timeout_del(&pd->pd_tmo))
> 1949 pfsync_undefer(pd, 0);
> 1950 }
> 1951 
> 1952 pd = pool_get(&sc->sc_pool, M_NOWAIT);
> 1953 if (pd == NULL)
> 1954 return (0);
> 
> it seems to me we may leak `pd` we took at line 1944. The leak
> happens in case timeout_del() returns zero.
> 
> the diff below ignores the return value from timeout_del() and makes
> sure we always call pfsync_undefer()
> 
> I have not seen such panic in my test environment. Would you be able
> to give my diff a try?

i hit a different fault in this code recently, so i've been looking
at the code again. my fix for the second fault will conflict with this
one. i'll send it out to tech@ separately though.

your diff looks like it moves the handling of undefer until after pf
drops its locks. i was just going to make pfsync schedule a task or
softint immediately if the queue was getting long, and then stop doing
deferrals if it was way too long.

> 8<---8<---8<--8<
> diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
> index b90aa934de4..a5e5e1ae0f3 100644
> --- a/sys/net/if_pfsync.c
> +++ b/sys/net/if_pfsync.c
> @@ -272,7 +272,6 @@ void  pfsync_syncdev_state(void *);
>  void pfsync_ifdetach(void *);
>  
>  void pfsync_deferred(struct pf_state *, int);
> -void pfsync_undefer(struct pfsync_deferral *, int);
>  void pfsync_defer_tmo(void *);
>  
>  void pfsync_cancel_full_update(struct pfsync_softc *);
> @@ -1927,7 +1926,7 @@ pfsync_insert_state(struct pf_state *st)
>  }
>  
>  int
> -pfsync_defer(struct pf_state *st, struct mbuf *m)
> +pfsync_defer(struct pf_state *st, struct mbuf *m, struct pfsync_deferral 
> **ppd)
>  {
>   struct pfsync_softc *sc = pfsyncif;
>   struct 

(get)nsecuptime

2021-06-13 Thread David Gwynne
we have a few places that use a uint64_t with the number of nanoseconds
of uptime the machine has. this factors it out to make them a bit more
generally available.
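
eg, interval maths becomes plain integer arithmetic (a sketch;
SEC_TO_NSEC is already in sys/time.h):

	uint64_t t = getnsecuptime();
	/* ... */
	if (getnsecuptime() - t >= SEC_TO_NSEC(1)) {
		/* at least a second of uptime has passed */
	}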

i was going to add yet another one of these to pfsync, but thought it
might be a good idea to factor them out first.

ok?

Index: kern/kern_tc.c
===
RCS file: /cvs/src/sys/kern/kern_tc.c,v
retrieving revision 1.72
diff -u -p -r1.72 kern_tc.c
--- kern/kern_tc.c  30 Apr 2021 13:52:48 -  1.72
+++ kern/kern_tc.c  4 Jun 2021 02:28:31 -
@@ -196,6 +196,21 @@ binuptime(struct bintime *bt)
 }
 
 void
+getbinuptime(struct bintime *bt)
+{
+   struct timehands *th;
+   u_int gen;
+
+   do {
+   th = timehands;
+   gen = th->th_generation;
+   membar_consumer();
+   *bt = th->th_offset;
+   membar_consumer();
+   } while (gen == 0 || gen != th->th_generation);
+}
+
+void
 nanouptime(struct timespec *tsp)
 {
struct bintime bt;
@@ -233,6 +248,34 @@ getuptime(void)
 
return now;
 #endif
+}
+
+uint64_t
+nsecuptime(void)
+{
+   struct bintime bt;
+   uint64_t nsec;
+
+   binuptime(&bt);
+
+   nsec = (1000000000ULL * (bt.frac >> 32)) >> 32;
+   nsec += bt.sec * 1000000000ULL;
+
+   return (nsec);
+}
+
+uint64_t
+getnsecuptime(void)
+{
+   struct bintime bt;
+   uint64_t nsec;
+
+   getbinuptime(&bt);
+
+   nsec = (1000000000ULL * (bt.frac >> 32)) >> 32;
+   nsec += bt.sec * 1000000000ULL;
+
+   return (nsec);
 }
 
 void
Index: kern/subr_pool.c
===
RCS file: /cvs/src/sys/kern/subr_pool.c,v
retrieving revision 1.233
diff -u -p -r1.233 subr_pool.c
--- kern/subr_pool.c10 Mar 2021 10:21:47 -  1.233
+++ kern/subr_pool.c4 Jun 2021 02:28:31 -
@@ -272,19 +272,6 @@ struct task pool_gc_task = TASK_INITIALI
 #define POOL_WAIT_FREE SEC_TO_NSEC(1)
 #define POOL_WAIT_GC   SEC_TO_NSEC(8)
 
-/*
- * TODO Move getnsecuptime() to kern_tc.c and document it when we
- * have callers in other modules.
- */
-static uint64_t
-getnsecuptime(void)
-{
-   struct timespec now;
-
-   getnanouptime(&now);
-   return TIMESPEC_TO_NSEC(&now);
-}
-
 RBT_PROTOTYPE(phtree, pool_page_header, ph_node, phtree_compare);
 
 static inline int
Index: kern/vfs_sync.c
===
RCS file: /cvs/src/sys/kern/vfs_sync.c,v
retrieving revision 1.65
diff -u -p -r1.65 vfs_sync.c
--- kern/vfs_sync.c 14 Jan 2021 03:32:01 -  1.65
+++ kern/vfs_sync.c 4 Jun 2021 02:28:31 -
@@ -132,19 +132,6 @@ vn_syncer_add_to_worklist(struct vnode *
 }
 
 /*
- * TODO Move getnsecuptime() to kern_tc.c and document it when we have
- * more users in the kernel.
- */
-static uint64_t
-getnsecuptime(void)
-{
-   struct timespec now;
-
-   getnanouptime(&now);
-   return TIMESPEC_TO_NSEC(&now);
-}
-
-/*
  * System filesystem synchronizer daemon.
  */
 void
Index: net/bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.204
diff -u -p -r1.204 bpf.c
--- net/bpf.c   23 Apr 2021 03:43:19 -  1.204
+++ net/bpf.c   4 Jun 2021 02:28:31 -
@@ -421,19 +421,6 @@ bpfclose(dev_t dev, int flag, int mode, 
(d)->bd_fbuf = NULL;
 
 /*
- * TODO Move nsecuptime() into kern_tc.c and document it when we have
- * more users elsewhere in the kernel.
- */
-static uint64_t
-nsecuptime(void)
-{
-   struct timespec now;
-
-   nanouptime(&now);
-   return TIMESPEC_TO_NSEC(&now);
-}
-
-/*
  *  bpfread - read next chunk of packets from buffers
  */
 int
Index: sys/time.h
===
RCS file: /cvs/src/sys/sys/time.h,v
retrieving revision 1.58
diff -u -p -r1.58 time.h
--- sys/time.h  13 Jan 2021 16:28:50 -  1.58
+++ sys/time.h  4 Jun 2021 02:28:31 -
@@ -290,6 +290,7 @@ void   binuptime(struct bintime *);
 void   nanouptime(struct timespec *);
 void   microuptime(struct timeval *);
 
+void   getbinuptime(struct bintime *);
 void   getnanouptime(struct timespec *);
 void   getmicrouptime(struct timeval *);
 
@@ -302,6 +303,9 @@ void   nanoruntime(struct timespec *);
 
 time_t gettime(void);
 time_t getuptime(void);
+
+uint64_t   nsecuptime(void);
+uint64_t   getnsecuptime(void);
 
 struct proc;
int    clock_gettime(struct proc *, clockid_t, struct timespec *);



Re: [External] : Re: parallel forwarding vs. bridges

2021-06-03 Thread David Gwynne
On Thu, Jun 03, 2021 at 01:09:48AM +0200, Alexandr Nedvedicky wrote:
> >  pf_purge_expired_states(u_int32_t maxcheck)
> >  {
> 
> > -   cur = pf_state_ref(next);
> > +   do {
> > +   if ((cur->timeout == PFTM_UNLINKED) ||
> > +   (pf_state_expires(cur) <= getuptime())) {
> > +   SLIST_INSERT_HEAD(&gcl, pf_state_ref(cur), gc_list);
>   
> I wonder: is the extra reference you are chasing for coming from here?
> I suspect pf_state_ref(cur) is being called two times, as macro expands.

I think you're right :D :D

This is an updated diff with a fix for this and the explanation. It also
reduces the scope of the NET_LOCK so scanning the state list shouldn't
block the network stack.

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1118
diff -u -p -r1.1118 pf.c
--- pf.c1 Jun 2021 09:57:11 -   1.1118
+++ pf.c3 Jun 2021 06:24:48 -
@@ -308,7 +308,7 @@ static __inline void pf_set_protostate(s
 struct pf_src_tree tree_src_tracking;
 
 struct pf_state_tree_id tree_id;
-struct pf_state_queue state_list;
+struct pf_state_list pf_state_list = PF_STATE_LIST_INITIALIZER(pf_state_list);
 
 RB_GENERATE(pf_src_tree, pf_src_node, entry, pf_src_compare);
 RB_GENERATE(pf_state_tree, pf_state_key, entry, pf_state_compare_key);
@@ -440,6 +440,37 @@ pf_check_threshold(struct pf_threshold *
return (threshold->count > threshold->limit);
 }
 
+void
+pf_state_list_insert(struct pf_state_list *pfs, struct pf_state *st)
+{
+   /*
+* we can always put states on the end of the list.
+*
+* things reading the list should take a read lock, then
+* the mutex, get the head and tail pointers, release the
+* mutex, and then they can iterate between the head and tail.
+*/
+
+   pf_state_ref(st); /* get a ref for the list */
+
+   mtx_enter(&pfs->pfs_mtx);
+   TAILQ_INSERT_TAIL(&pfs->pfs_list, st, entry_list);
+   mtx_leave(&pfs->pfs_mtx);
+}
+
+void
+pf_state_list_remove(struct pf_state_list *pfs, struct pf_state *st)
+{
+   /* states can only be removed when the write lock is held */
+   rw_assert_wrlock(&pfs->pfs_rwl);
+
+   mtx_enter(&pfs->pfs_mtx);
+   TAILQ_REMOVE(&pfs->pfs_list, st, entry_list);
+   mtx_leave(&pfs->pfs_mtx);
+
+   pf_state_unref(st); /* list no longer references the state */
+}
+
 int
 pf_src_connlimit(struct pf_state **state)
 {
@@ -986,7 +1017,7 @@ pf_state_insert(struct pfi_kif *kif, str
PF_STATE_EXIT_WRITE();
return (-1);
}
-   TAILQ_INSERT_TAIL(&state_list, s, entry_list);
+   pf_state_list_insert(&pf_state_list, s);
pf_status.fcounters[FCNT_STATE_INSERT]++;
pf_status.states++;
pfi_kif_ref(kif, PFI_KIF_REF_STATE);
@@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void)
 void
 pf_purge_timeout(void *unused)
 {
-   task_add(net_tq(0), &pf_purge_task);
+   /* XXX move to systqmp to avoid KERNEL_LOCK */
+   task_add(systq, &pf_purge_task);
 }
 
 void
@@ -1255,9 +1287,6 @@ pf_purge(void *xnloops)
 {
int *nloops = xnloops;
 
-   KERNEL_LOCK();
-   NET_LOCK();
-
/*
 * process a fraction of the state table every second
 * Note:
@@ -1268,6 +1297,8 @@ pf_purge(void *xnloops)
pf_purge_expired_states(1 + (pf_status.states
/ pf_default_rule.timeout[PFTM_INTERVAL]));
 
+   NET_LOCK();
+
PF_LOCK();
/* purge other expired types every PFTM_INTERVAL seconds */
if (++(*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) {
@@ -1280,11 +1311,10 @@ pf_purge(void *xnloops)
 * Fragments don't require PF_LOCK(), they use their own lock.
 */
if ((*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) {
-   pf_purge_expired_fragments();
+   pf_purge_expired_fragments();
*nloops = 0;
}
NET_UNLOCK();
-   KERNEL_UNLOCK();
 
timeout_add_sec(&pf_purge_to, 1);
 }
@@ -1447,7 +1477,7 @@ pf_free_state(struct pf_state *cur)
}
pf_normalize_tcp_cleanup(cur);
pfi_kif_unref(cur->kif, PFI_KIF_REF_STATE);
-   TAILQ_REMOVE(&state_list, cur, entry_list);
+   pf_state_list_remove(&pf_state_list, cur);
if (cur->tag)
pf_tag_unref(cur->tag);
pf_state_unref(cur);
@@ -1458,53 +1488,77 @@ pf_free_state(struct pf_state *cur)
 void
 pf_purge_expired_states(u_int32_t maxcheck)
 {
+   /*
+* this task/thread/context/whatever is the only thing that
+* removes states from the pf_state_list, so the cur reference
+* it holds between calls is guaranteed to still be in the
+* list.
+*/
static struct pf_state  *cur = NULL;
-   struct pf_state *next;
-   SLIST_HEAD(pf_state_gcl, pf_state) gcl;
+
+   struct pf_state *head, *tail;
+   

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-03 Thread David Gwynne
On Thu, Jun 03, 2021 at 01:09:48AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> looks like my earlier mail got eaten by SPAM filter, let's try again...
> 
> 
> > 
> > moving pf locks to mutexes makes sense to me, but like you say,
> > this will need testing and experimentation. one of the issues
> 
> I'm not entirely convinced that trading the rwlocks for mutexes
> is a good thing. IMO the majority of packets are being processed
> as readers on the state lock. If a packet does not match a state,
> then it needs to be pushed through the rules, where we run into pf_lock
> and the state lock. Both those locks are grabbed exclusively on
> this path. However I think we can relax pf_lock to reader
> in the future.

there's obviously plusses and minuses for trading rwlocks for mutexes.

here's the states info from one of my firewalls at work:

State Table  Total Rate
  current entries   789418   
  half-open tcp   19101626   
  searches   2835612773525   221420.0/s
  inserts  37030235256 2891.5/s
  removals 37029445838 2891.5/s

the ratio of searches to state table updates is like 40 to 1, so yes,
letting readers through concurrently would be useful.

> I'd like to better understand why mutexes (spin locks) are better in
> the network subsystem. I'm concerned that removing all sleeping points from
> the network stack will make life harder for the scheduler. If such a box
> gets hammered by the network, the scheduler won't allow other components to
> run, because the whole network subsystem gets turned into one giant
> non-interruptible block. I admit my concern may come from a lack
> of understanding of the overall concept.

as i said, there's plusses and minuses for mutexes as pflocks. the minus
is that you lose the ability to run pf concurrently. however, i'm pretty
sure that there's a lot of mutable state inside pf that gets updated on
every packet that almost certainly benefits from being run exclusively.
i'm mostly thinking about the packet and byte counters on rules and
states.

my best argument for mutexes in pf is that currently we use smr critical
sections around things like the bridge and aggr port selection, which
are very obviously read-only workloads. pf may be a largely read-only
workload, but where it is at the moment means it's not going to get run
concurrently anyway.

i think the current scope of the locks in pf needs to be looked at
anyway. we might be able to reduce the contention and allow concurrency
with mutexes by doing something more complicated than just swapping the
lock type around. eg, could the ruleset be wired up with SMR pointers
for traversal? could we split the state trees into buckets and hash
packets into them for lookup? stoeplitz is probably cheap enough to make
that possible...
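
to make that concrete, the shape could be something like this. none of
these names exist, it's purely to illustrate the idea:

/* hypothetical bucketed state trees; all names are made up */
#define PF_STATE_NBUCKETS	64	/* power of two */

struct pf_state_bucket {
	struct mutex		 sb_mtx;
	struct pf_state_tree_id	 sb_tree;
} pf_state_buckets[PF_STATE_NBUCKETS];

static inline struct pf_state_bucket *
pf_state_bucket(uint16_t hash)
{
	/* pick a bucket with the flow hash, eg, from stoeplitz */
	return (&pf_state_buckets[hash & (PF_STATE_NBUCKETS - 1)]);
}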

the scope of the pf locks likely needs reduction anyway. one of my
production firewalls panicked with the pf lock trying to lock against
itself a couple of nights ago:

db_enter() at db_enter+0x5
panic(81d47212) at panic+0x12a
rw_enter(82060e70,1) at rw_enter+0x261
pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c
ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at
ip6_output+0xd33
nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at
nd6_ns
_output+0x3e2
nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220
,800024d040d8) at nd6_resolve+0x29d
ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a
e0,800024d040d8) at ether_resolve+0x127
ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae
0) at ether_output+0x2a
ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180
pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac
pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d
pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe
pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f
0,800024d045d8,800024d045fe) at pf_test_rule+0x693
pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1
ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829
ip_forward(fd8065a2,81541800,fd817a7722d0,0) at
ip_forward+
0x27a  
ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at
ip_input
_if+0x5fd
ipv4_input(81541800,fd8065a2) at ipv4_input+0x37
carp_input(8158f000,fd8065a2,5e000158) at
carp_input+0x1ac
ether_input(8158f000,fd8065a2) at ether_input+0x1c0
vlan_input(8152f000,fd8065a2) at vlan_input+0x19a
ether_input(8152f000,fd8065a2) at ether_input+0x76
if_input_process(801df048,800024d04ca8) at
if_input_process+0x5a
ifiq_process(801dbe00) at ifiq_process+0x6f

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-02 Thread David Gwynne
On Wed, May 19, 2021 at 01:48:26AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> just for the record...
> 
> 
> 
> > > in current tree the ether_input() is protected by NET_LOCK(), which is 
> > > grabbed
> > > by caller as a writer. bluhm's diff changes NET_LOCK() to a readlock, so
> > > ether_input() can run concurrently. Switching NET_LOCK() to an r-lock has
> > > implications on the smr read section above. The thing is the call to
> > > eb->eb_input()
> > > can sleep now. This is something that needs to be avoided within smr
> > > section.
> > 
> > Is the new sleeping point introduced by the fact the PF_LOCK() is a
> > rwlock?  Did you consider using a mutex, at least for the time being,
> > in order to not run in such issues?
> 
> below is a diff, which trades both pf(4) rw-locks for mutexes.
> 
> diff compiles, it still needs testing/experimenting.

hi. i'm trying to get my head back into this space, so i'm trying to
have a look at these diffs properly.

moving pf locks to mutexes makes sense to me, but like you say,
this will need testing and experimentation. one of the issues
identified in another part of this thread (and on icb?) is that the
ioctl path does some stuff which can sleep, but you can't sleep
while holding a mutex which would now be protecting the data
structures you're trying to look at.

a more specific example is if you're trying to dump the state table.
like you say, the state table is currently protected by the
pf_state_lock. iterating over the state table and giving up this lock to
copyout the entries is... not fun.

the diff below is a work in progress attempt to address this. it works
by locking the pf state list separately so it can support traversal
without (long) holds of locks needed in the forwarding path. the very
short explanation is that the TAILQ holding the states is locked
separately to the links between the states in the list. a much longer
explanation is in the diff in the pfvar_priv.h chunk.

inserting states into the TAILQ while processing packets uses a mutex.
iterating over the list in the pf purge processing, ioctl paths, and
pfsync can largely rely on an rwlock. the rwlock would also allow
concurrent access to the list of states, ie, you can dump the list of
states while the gc thread is looking for states to collect, also while
pfsync is sending out a bulk update. it also converts the ioctl handling
for flushing states to using the list instead of one of the RB trees
holding states. i'm sure everyone wants an omgoptimised state flushing
mechanism.
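
for reference, the reader side described above ends up looking roughly
like this. it's a sketch, not code from the diff, and it assumes the
TAILQ head type is still called pf_state_queue:

	struct pf_state *head, *tail, *st;

	rw_enter_read(&pfs->pfs_rwl);
	mtx_enter(&pfs->pfs_mtx);
	head = TAILQ_FIRST(&pfs->pfs_list);
	tail = TAILQ_LAST(&pfs->pfs_list, pf_state_queue);
	mtx_leave(&pfs->pfs_mtx);

	st = head;
	while (st != NULL) {
		/* look at st; the links are stable while pfs_rwl is held */
		if (st == tail)
			break;
		st = TAILQ_NEXT(st, entry_list);
	}
	rw_exit_read(&pfs->pfs_rwl);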

this is a work in progress because i've screwed up the pf_state
reference counting somehow. states end up with an extra reference, and i
can't see where that's coming from.

anyway. apart from the refcounting thing it seems to be working well,
and would take us a step closer to using mutexes for pf locks.

even if we throw this away, i'd still like to move the pf purge
processing from out of nettq0 into systq, just to avoid having a place
where a nettq thread has to spin waiting for the kernel lock.

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1118
diff -u -p -r1.1118 pf.c
--- pf.c1 Jun 2021 09:57:11 -   1.1118
+++ pf.c2 Jun 2021 07:52:21 -
@@ -308,7 +308,7 @@ static __inline void pf_set_protostate(s
 struct pf_src_tree tree_src_tracking;
 
 struct pf_state_tree_id tree_id;
-struct pf_state_queue state_list;
+struct pf_state_list pf_state_list = PF_STATE_LIST_INITIALIZER(pf_state_list);
 
 RB_GENERATE(pf_src_tree, pf_src_node, entry, pf_src_compare);
 RB_GENERATE(pf_state_tree, pf_state_key, entry, pf_state_compare_key);
@@ -440,6 +440,37 @@ pf_check_threshold(struct pf_threshold *
return (threshold->count > threshold->limit);
 }
 
+void
+pf_state_list_insert(struct pf_state_list *pfs, struct pf_state *st)
+{
+   /*
+* we can always put states on the end of the list.
+*
+* things reading the list should take a read lock, then
+* the mutex, get the head and tail pointers, release the
+* mutex, and then they can iterate between the head and tail.
+*/
+
+   pf_state_ref(st); /* get a ref for the list */
+
+   mtx_enter(&pfs->pfs_mtx);
+   TAILQ_INSERT_TAIL(&pfs->pfs_list, st, entry_list);
+   mtx_leave(&pfs->pfs_mtx);
+}
+
+void
+pf_state_list_remove(struct pf_state_list *pfs, struct pf_state *st)
+{
+   /* states can only be removed when the write lock is held */
+   rw_assert_wrlock(&pfs->pfs_rwl);
+
+   mtx_enter(&pfs->pfs_mtx);
+   TAILQ_REMOVE(&pfs->pfs_list, st, entry_list);
+   mtx_leave(&pfs->pfs_mtx);
+
+   pf_state_unref(st); /* list no longer references the state */
+}
+
 int
 pf_src_connlimit(struct pf_state **state)
 {
@@ -986,7 +1017,7 @@ pf_state_insert(struct pfi_kif *kif, str
PF_STATE_EXIT_WRITE();
return (-1);
}
-   TAILQ_INSERT_TAIL(&state_list, s, entry_list);
+   

let ipv4_check remember if the ip checksum was good

2021-06-01 Thread David Gwynne
if a bridge checked the ipv4 checksum and it was good, we can avoid
checking it again in ip_input.

ok?

Index: ip_input.c
===
RCS file: /cvs/src/sys/netinet/ip_input.c,v
retrieving revision 1.361
diff -u -p -r1.361 ip_input.c
--- ip_input.c  2 Jun 2021 00:09:57 -   1.361
+++ ip_input.c  2 Jun 2021 01:07:35 -
@@ -287,8 +287,8 @@ ipv4_check(struct ifnet *ifp, struct mbu
}
}
 
-   if ((m->m_pkthdr.csum_flags & M_IPV4_CSUM_IN_OK) == 0) {
-   if (m->m_pkthdr.csum_flags & M_IPV4_CSUM_IN_BAD) {
+   if (!ISSET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK)) {
+   if (ISSET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_BAD)) {
ipstat_inc(ips_badsum);
goto bad;
}
@@ -298,6 +298,8 @@ ipv4_check(struct ifnet *ifp, struct mbu
ipstat_inc(ips_badsum);
goto bad;
}
+
+   SET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK);
}
 
/* Retrieve the packet length. */



Re: nvme(4): fix prpl sync length

2021-05-31 Thread David Gwynne



> On 1 Jun 2021, at 04:17, Patrick Wildt  wrote:
> 
> Hi,
> 
> this call to sync the DMA mem wants to sync N - 1 number of prpl
> entries, as the first segment is configured regularly, while the
> addresses for the following segments (if more than 2), are in a
> special DMA memory.
> 
> The code currently removes a single byte, instead of an entry.
> This just means that it is syncing more than it should.

nice.

> ok?

ok.

> 
> Patrick
> 
> diff --git a/sys/dev/ic/nvme.c b/sys/dev/ic/nvme.c
> index 62b8e40c626..6db25260ef0 100644
> --- a/sys/dev/ic/nvme.c
> +++ b/sys/dev/ic/nvme.c
> @@ -629,7 +629,7 @@ nvme_scsi_io(struct scsi_xfer *xs, int dir)
>   bus_dmamap_sync(sc->sc_dmat,
>   NVME_DMA_MAP(sc->sc_ccb_prpls),
>   ccb->ccb_prpl_off,
> - sizeof(*ccb->ccb_prpl) * dmap->dm_nsegs - 1,
> + sizeof(*ccb->ccb_prpl) * (dmap->dm_nsegs - 1),
>   BUS_DMASYNC_PREWRITE);
>   }
> 
> @@ -691,7 +691,7 @@ nvme_scsi_io_done(struct nvme_softc *sc, struct nvme_ccb 
> *ccb,
>   bus_dmamap_sync(sc->sc_dmat,
>   NVME_DMA_MAP(sc->sc_ccb_prpls),
>   ccb->ccb_prpl_off,
> - sizeof(*ccb->ccb_prpl) * dmap->dm_nsegs - 1,
> + sizeof(*ccb->ccb_prpl) * (dmap->dm_nsegs - 1),
>   BUS_DMASYNC_POSTWRITE);
>   }
> 
> 



Re: Enable pool cache on knote_pool

2021-05-31 Thread David Gwynne



> On 1 Jun 2021, at 02:58, Visa Hankala  wrote:
> 
> This patch enables the pool cache feature on the knote pool to reduce
> the overhead of knote management.
> 
> Profiling done by mpi@ and bluhm@ indicate that the potentially needless
> allocation of knotes in kqueue_register() causes slowdown with
> kqueue-based poll(2) and select(2).
> 
> One approach to fix this is to reverse the function's initial guess
> about knote: Try without allocation first. Then allocate and retry if
> the knote is missing from the kqueue and EV_ADD is given.
> 
> Another option is to cache free knotes so that the shared knote pool
> would be accessed less frequently.
> 
> The following diff takes the second approach. The caching is implemented
> simply by enabling the pool cache feature. This makes use of existing
> code and does not complicate kqueue_register(). The feature also helps
> if there is heavy knote churn.
> 
> I think the most substantial part of the diff is that it extends pool
> cache usage beyond mbufs. Is this change acceptable?

absolutely.

> Note the cache is not particularly useful without kqueue-based poll(2)
> and select(2). The pool view of systat(1) shows that there are pools
> that would benefit more than knote_pool from caching, at least in terms
> of request frequencies. The relative frequencies are dependent on system
> workload, though. Kqpoll would definitely make knote pool more heavily
> used.

ok.

separate to this diff, at some point maybe we should have a task list/dohook 
thing for "per cpu init" like mountroot or startup?

> Index: kern/init_main.c
> ===
> RCS file: src/sys/kern/init_main.c,v
> retrieving revision 1.306
> diff -u -p -r1.306 init_main.c
> --- kern/init_main.c  8 Feb 2021 10:51:01 -   1.306
> +++ kern/init_main.c  31 May 2021 16:50:17 -
> @@ -71,6 +71,7 @@
> #include 
> #endif
> #include 
> +#include 
> #include 
> #include 
> #include 
> @@ -148,7 +149,6 @@ void  crypto_init(void);
> void  db_ctf_init(void);
> void  prof_init(void);
> void  init_exec(void);
> -void kqueue_init(void);
> void  futex_init(void);
> void  taskq_init(void);
> void  timeout_proc_init(void);
> @@ -432,7 +432,9 @@ main(void *framep)
>   prof_init();
> #endif
> 
> - mbcpuinit();/* enable per cpu mbuf data */
> + /* Enable per-CPU data. */
> + mbcpuinit();
> + kqueue_init_percpu();
>   uvm_init_percpu();
> 
>   /* init exec and emul */
> Index: kern/kern_event.c
> ===
> RCS file: src/sys/kern/kern_event.c,v
> retrieving revision 1.163
> diff -u -p -r1.163 kern_event.c
> --- kern/kern_event.c 22 Apr 2021 15:30:12 -  1.163
> +++ kern/kern_event.c 31 May 2021 16:50:17 -
> @@ -231,6 +231,12 @@ kqueue_init(void)
>   PR_WAITOK, "knotepl", NULL);
> }
> 
> +void
> +kqueue_init_percpu(void)
> +{
> + pool_cache_init(&knote_pool);
> +}
> +
> int
> filt_fileattach(struct knote *kn)
> {
> Index: sys/event.h
> ===
> RCS file: src/sys/sys/event.h,v
> retrieving revision 1.54
> diff -u -p -r1.54 event.h
> --- sys/event.h   24 Feb 2021 14:59:52 -  1.54
> +++ sys/event.h   31 May 2021 16:50:18 -
> @@ -292,6 +292,8 @@ extern void   knote_fdclose(struct proc *p
> extern void   knote_processexit(struct proc *);
> extern void   knote_modify(const struct kevent *, struct knote *);
> extern void   knote_submit(struct knote *, struct kevent *);
> +extern void  kqueue_init(void);
> +extern void  kqueue_init_percpu(void);
> extern intkqueue_register(struct kqueue *kq,
>   struct kevent *kev, struct proc *p);
> extern intkqueue_scan(struct kqueue_scan_state *, int, struct kevent *,
> 



Re: mcx(4): sync only received length on RX

2021-05-31 Thread David Gwynne



> On 1 Jun 2021, at 04:15, Patrick Wildt  wrote:
> 
> Hi,
> 
> mcx(4) seems to sync the whole mapsize on processing a received packet.
> As far as I know, we usually only sync the actual size that we have
> received.  Noticed this when doing bounce buffer tests, seeing that
> it copied a lot more data than is necessary.
> 
> That's because the RX buffer size is maximum supported MTU, which is
> about 9500 bytes or so.  For small packets, or regular 1500 bytes,
> this adds overhead.
> 
> This change should not change anything for ARM machines that have a
> cache coherent PCIe bus or x86.
> 
> ok?

ok.

> 
> Patrick
> 
> diff --git a/sys/dev/pci/if_mcx.c b/sys/dev/pci/if_mcx.c
> index 38437e54897..065855d46d3 100644
> --- a/sys/dev/pci/if_mcx.c
> +++ b/sys/dev/pci/if_mcx.c
> @@ -6800,20 +6800,20 @@ mcx_process_rx(struct mcx_softc *sc, struct mcx_rx 
> *rx,
> {
>   struct mcx_slot *ms;
>   struct mbuf *m;
> - uint32_t flags;
> + uint32_t flags, len;
>   int slot;
> 
> + len = bemtoh32(&cqe->cq_byte_cnt);
>   slot = betoh16(cqe->cq_wqe_count) % (1 << MCX_LOG_RQ_SIZE);
> 
>   ms = >rx_slots[slot];
> - bus_dmamap_sync(sc->sc_dmat, ms->ms_map, 0, ms->ms_map->dm_mapsize,
> - BUS_DMASYNC_POSTREAD);
> + bus_dmamap_sync(sc->sc_dmat, ms->ms_map, 0, len, BUS_DMASYNC_POSTREAD);
>   bus_dmamap_unload(sc->sc_dmat, ms->ms_map);
> 
>   m = ms->ms_m;
>   ms->ms_m = NULL;
> 
> - m->m_pkthdr.len = m->m_len = bemtoh32(&cqe->cq_byte_cnt);
> + m->m_pkthdr.len = m->m_len = len;
> 
>   if (cqe->cq_rx_hash_type) {
>   m->m_pkthdr.ph_flowid = betoh32(cqe->cq_rx_hash);
> 



factor out ipv4 and ipv6 initial packet sanity checks for bridges

2021-05-30 Thread David Gwynne
if you're looking at an ip header, it makes sense to do some checks to
make sure that the values and addresses make some sense. the canonical
versions of these checks are in the ipv4 and ipv6 input paths, which
makes sense. when bridge(4) is about to run packets through pf it makes
sure the ip headers are sane before first, which i think also makes
sense. veb and tpmr don't do these checks before they run pf, but i
think they should. however, duplicating the code again doesn't appeal to
me.

this factors the ip checks out in the ip_input path, and uses that code
from bridge, veb, and tpmr.

this is mostly shuffling the deck chairs, but ipv6 is moved around a bit
more than ipv4, so some eyes and tests would be appreciated.

in the future i think the ipv6 code should do length checks like the
ipv4 code does too. this diff is big enough as it is though.
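
for reference, those checks would look roughly like what ip6_input
already does today, modulo jumbogram handling. a sketch, not part of
this diff:

	if (m->m_pkthdr.len - sizeof(struct ip6_hdr) < ntohs(ip6->ip6_plen)) {
		ip6stat_inc(ip6s_tooshort);
		goto bad;
	}
	if (m->m_pkthdr.len - sizeof(struct ip6_hdr) > ntohs(ip6->ip6_plen)) {
		if (m->m_len == m->m_pkthdr.len) {
			/* trim the extra bytes off a single mbuf */
			m->m_len = sizeof(struct ip6_hdr) +
			    ntohs(ip6->ip6_plen);
			m->m_pkthdr.len = m->m_len;
		} else {
			m_adj(m, sizeof(struct ip6_hdr) +
			    ntohs(ip6->ip6_plen) - m->m_pkthdr.len);
		}
	}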

ok?

Index: net/if_bridge.c
===
RCS file: /cvs/src/sys/net/if_bridge.c,v
retrieving revision 1.354
diff -u -p -r1.354 if_bridge.c
--- net/if_bridge.c 5 Mar 2021 06:44:09 -   1.354
+++ net/if_bridge.c 31 May 2021 04:21:51 -
@@ -1674,61 +1674,12 @@ bridge_ip(struct ifnet *brifp, int dir, 
switch (etype) {
 
case ETHERTYPE_IP:
-   if (m->m_pkthdr.len < sizeof(struct ip))
-   goto dropit;
-
-   /* Copy minimal header, and drop invalids */
-   if (m->m_len < sizeof(struct ip) &&
-   (m = m_pullup(m, sizeof(struct ip))) == NULL) {
-   ipstat_inc(ips_toosmall);
+   m = ipv4_check(ifp, m);
+   if (m == NULL)
return (NULL);
-   }
-   ip = mtod(m, struct ip *);
-
-   if (ip->ip_v != IPVERSION) {
-   ipstat_inc(ips_badvers);
-   goto dropit;
-   }
-
-   hlen = ip->ip_hl << 2;  /* get whole header length */
-   if (hlen < sizeof(struct ip)) {
-   ipstat_inc(ips_badhlen);
-   goto dropit;
-   }
-
-   if (hlen > m->m_len) {
-   if ((m = m_pullup(m, hlen)) == NULL) {
-   ipstat_inc(ips_badhlen);
-   return (NULL);
-   }
-   ip = mtod(m, struct ip *);
-   }
-
-   if ((m->m_pkthdr.csum_flags & M_IPV4_CSUM_IN_OK) == 0) {
-   if (m->m_pkthdr.csum_flags & M_IPV4_CSUM_IN_BAD) {
-   ipstat_inc(ips_badsum);
-   goto dropit;
-   }
-
-   ipstat_inc(ips_inswcsum);
-   if (in_cksum(m, hlen) != 0) {
-   ipstat_inc(ips_badsum);
-   goto dropit;
-   }
-   }
-
-   if (ntohs(ip->ip_len) < hlen)
-   goto dropit;
 
-   if (m->m_pkthdr.len < ntohs(ip->ip_len))
-   goto dropit;
-   if (m->m_pkthdr.len > ntohs(ip->ip_len)) {
-   if (m->m_len == m->m_pkthdr.len) {
-   m->m_len = ntohs(ip->ip_len);
-   m->m_pkthdr.len = ntohs(ip->ip_len);
-   } else
-   m_adj(m, ntohs(ip->ip_len) - m->m_pkthdr.len);
-   }
+   ip = mtod(m, struct ip *);
+   hlen = ip->ip_hl << 2;
 
 #ifdef IPSEC
if ((brifp->if_flags & IFF_LINK2) == IFF_LINK2 &&
@@ -1772,23 +1723,10 @@ bridge_ip(struct ifnet *brifp, int dir, 
break;
 
 #ifdef INET6
-   case ETHERTYPE_IPV6: {
-   struct ip6_hdr *ip6;
-
-   if (m->m_len < sizeof(struct ip6_hdr)) {
-   if ((m = m_pullup(m, sizeof(struct ip6_hdr)))
-   == NULL) {
-   ip6stat_inc(ip6s_toosmall);
-   return (NULL);
-   }
-   }
-
-   ip6 = mtod(m, struct ip6_hdr *);
-
-   if ((ip6->ip6_vfc & IPV6_VERSION_MASK) != IPV6_VERSION) {
-   ip6stat_inc(ip6s_badvers);
-   goto dropit;
-   }
+   case ETHERTYPE_IPV6:
+   m = ipv6_check(ifp, m);
+   if (m == NULL)
+   return (NULL);
 
 #ifdef IPSEC
hlen = sizeof(struct ip6_hdr);
@@ -1819,7 +1757,6 @@ bridge_ip(struct ifnet *brifp, int dir, 
 #endif /* NPF > 0 */
 
break;
-   }
 #endif /* INET6 */
 
default:
Index: net/if_tpmr.c
===
RCS file: /cvs/src/sys/net/if_tpmr.c,v
retrieving revision 1.26
diff -u -p -r1.26 if_tpmr.c
--- 

Re: timeout_del_barrier(9): remove kernel lock

2021-05-10 Thread David Gwynne
On Thu, May 06, 2021 at 11:43:55AM -0500, Scott Cheloha wrote:
> On Wed, May 05, 2021 at 11:05:06AM +1000, David Gwynne wrote:
> > On Tue, May 04, 2021 at 11:54:31AM -0500, Scott Cheloha wrote:
> > > 
> > > [...]
> > > 
> > > Here is where I get confused.
> > > 
> > > Why do I want to wait for *all* timeouts in the queue to finish
> > > running?
> > 
> > You don't, you just want to know that whatever timeout was running
> > has finished cos you don't know if that currently running timeout is
> > yours or not.
> > 
> > > My understanding of the timeout_del_barrier(9) use case was as
> > > follows:
> > > 
> > > 1. I have a dynamically allocated timeout struct.  The timeout is
> > >scheduled to execute at some point in the future.
> > > 
> > > 2. I want to free the memory allocated to for the timeout.
> > > 
> > > 3. To safely free the memory I need to ensure the timeout
> > >is not executing.  Until the timeout function, i.e.
> > >to->to_func(), has returned it isn't necessarily safe to
> > >free that memory.
> > > 
> > > 4. If I call timeout_del_barrier(9), it will only return if the
> > >timeout in question is not running.  Assuming the timeout itself
> > >is not a periodic timeout that reschedules itself we can then
> > >safely free the memory.
> > 
> > Barriers aren't about references to timeouts, they're about references
> > to the work associated with a timeout.
> > 
> > If you only cared about the timeout struct itself, then you can get
> > all the information about whether it's referenced by the timeout
> > queues from the return values from timeout_add and timeout_del.
> > timeout_add() returns 1 if the timeout subsystem takes the reference
> > to the timeout. If the timeout is already scheduled it returns 0
> > because it's already got a reference. timeout_del returns 1 if the
> > timeout itself was scheduled and it removed it, therefore giving
> > up it's reference. If you timeout_del and it returns 0, then it
> > wasn't on a queue inside timeouts. Easy.
> > 
> > What timeout_add and timeout_del don't tell you is whether the work
> > referenced by the timeout is currently running. The timeout runners
> > copy the function and argument onto the stack when they dequeue a
> > timeout to run, and give up the reference to the timeout when the
> > mutex around the timeout wheels/cirqs is released. However, the argument
> > is still referenced on the stack and the function to process it may be
> > about to run or is running. You can't tell that from the timeout_add/del
> > return values though.
> > 
> > We provide two ways to deal with that. One is you have reference
> > counters (or similar) on the thing you're deferring to a timeout.
> > The other is you use a barrier so you know the work you deferred
> > isn't on the timeout runners stack anymore because it's moved on
> > to run the barrier work.
> > 
> > This is consistent with tasks and interrupts.
> > 
> > > Given this understanding, my approach was to focus on the timeout in
> > > question.  So my code just spins until it the timeout is no longer
> > > running.
> > > 
> > > But now I think I am mistaken.
> > > 
> > > IIUC you're telling me (and showing me, in code) that the goal of
> > > timeout_barrier() is to wait for the *whole* softclock() to return,
> > > not just the one timeout.
> > 
> > No, just the currently running timeout.
> > 
> > Putting the barrier timeout on the end of the proc and todo cirqs
> > is because there's no CIRCQ_INSERT_HEAD I can use right now.
> > 
> > > Why do I want to wait for the whole softclock() or softclock_thread()?
> > > Why not just wait for the one timeout to finish?
> > 
> > Cos I was too lazy to write CIRCQ_INSERT_HEAD last night :D
> 
> Okay.  After thinking this over I'm pretty sure we are skinning the
> same cat in two different ways.
> 
> Your cond_wait(9) approach is fine for both timeout types because
> timeout_del_barrier(9) is only safe to call from process context.
> When I wrote my first diff I was under the impression it was safe to
> call timeout_del_barrier(9) from interrupt context.  But I reread the
> manpage and now I see that this is not the case, my bad.
> 
> > [...]
> > 
> > This discussion has some relevance to taskqs too. I was also lazy when I
> > implemented taskq_barrier and used task_add to put the barrier onto the
> &

Re: timeout_del_barrier(9): remove kernel lock

2021-05-04 Thread David Gwynne
On Tue, May 04, 2021 at 11:54:31AM -0500, Scott Cheloha wrote:
> On Tue, May 04, 2021 at 11:21:27PM +1000, David Gwynne wrote:
> > On Tue, May 04, 2021 at 11:24:05AM +0200, Martin Pieuchot wrote:
> > > On 04/05/21(Tue) 01:10, Scott Cheloha wrote:
> > > > [...] 
> > > > I want to run softclock() without the kernel lock.  The way to go, I
> > > > think, is to first push the kernel lock down into timeout_run(), and
> > > > then to remove the kernel lock from each timeout, one by one.
> > > 
> > > Grabbing and releasing the KERNEL_LOCk() on a per-timeout basis creates
> > > more latency than running all timeouts in a batch after having grabbed
> > > the KERNEL_LOCK().  I doubt this is the best way forward.
> > > 
> > > > Before we can push the kernel lock down into timeout_run() we need to
> > > > remove the kernel lock from timeout_del_barrier(9).
> > > 
> > > Seems worth it on its own.
> > >
> > > > The kernel lock is used in timeout_del_barrier(9) to determine whether
> > > > the given timeout has stopped running.  Because the softclock() runs
> > > > with the kernel lock we currently assume that once the calling thread
> > > > has taken the kernel lock any onging softclock() must have returned
> > > > and relinquished the lock, so the timeout in question has returned.
> > 
> > as i'll try to explain below, it's not about waiting for a specific
> > timeout, it's about knowing that a currently running timeout has
> > finished.
> 
> I'm very confused about the distinction between the two.

It is subtle :(

> > > So you want to stop using the KERNEL_LOCK() to do the serialization?  
> > 
> > i used the KERNEL_LOCK in timeout_barrier cos it was an elegant^Wclever^W hack.
> > because timeouts are run under the kernel lock, i knew i could take the
> > lock and know that timeouts arent running anymore. that's all.
> > 
> > in hindsight this means that the thread calling timeout_barrier spins
> > when it could sleep, and worse, it spins waiting for all pending
> > timeouts to run. so yeah, i agree that undoing the hack is worth it on
> > its own.
> 
> Okay so we're all in agreement that this could be improved, cool.

Yep.

> > > > The simplest replacement I can think of is a volatile pointer to the
> > > > running timeout that we set before leaving the timeout_mutex and clear
> > > > after reentering the same during timeout_run().
> > > 
> > > Sounds like a condition variable protected by this mutex.  Interesting
> > > that cond_wait(9) doesn't work with a mutex. 
> > 
> > cond_wait borrows the sched lock to coordinate between the waiting
> > thread and the signalling context. the cond api is basically a wrapper
> > around sleep_setup/sleep_finish.
> > 
> > > > So in the non-TIMEOUT_PROC case the timeout_del_barrier(9) caller just
> > > > spins until the timeout function returns and the timeout_running
> > > > pointer is changed.  Not every caller can sleep during
> > > > timeout_del_barrier(9).  I think spinning is the simplest thing that
> > > > will definitely work here.
> > > 
> > > This keeps the current semantic indeed.
> > 
> > i don't want to throw timeout_barrier out just yet.
> 
> I'm fine with that... where might it be useful?

A single timeout_barrier call can be used as a barrier for multiple
timeouts. This would feel more natural if timeout_barrier took a flag
instead of a timeout pointer, or if we didnt have the timeout_proc
context.

> > > > -void
> > > > -timeout_barrier(struct timeout *to)
> > > > +int
> > > > +timeout_del_barrier(struct timeout *to)
> > > >  {
> > > > +   struct timeout barrier;
> > > > +   struct cond c = COND_INITIALIZER();
> > > > int needsproc = ISSET(to->to_flags, TIMEOUT_PROC);
> > > >  
> > > > timeout_sync_order(needsproc);
> > > >  
> > > > -   if (!needsproc) {
> > > > -   KERNEL_LOCK();
> > > > -   splx(splsoftclock());
> > > > -   KERNEL_UNLOCK();
> > > > -   } else {
> > > > -   struct cond c = COND_INITIALIZER();
> > > > -   struct timeout barrier;
> > > > +   mtx_enter(&timeout_mutex);
> > > > +
> > > > +   if (timeout_del_locked(to)) {
> > > > +   mtx_leave(&timeout_m

Re: timeout_del_barrier(9): remove kernel lock

2021-05-04 Thread David Gwynne
On Tue, May 04, 2021 at 11:24:05AM +0200, Martin Pieuchot wrote:
> On 04/05/21(Tue) 01:10, Scott Cheloha wrote:
> > [...] 
> > I want to run softclock() without the kernel lock.  The way to go, I
> > think, is to first push the kernel lock down into timeout_run(), and
> > then to remove the kernel lock from each timeout, one by one.
> 
> Grabbing and releasing the KERNEL_LOCk() on a per-timeout basis creates
> more latency than running all timeouts in a batch after having grabbed
> the KERNEL_LOCK().  I doubt this is the best way forward.
> 
> > Before we can push the kernel lock down into timeout_run() we need to
> > remove the kernel lock from timeout_del_barrier(9).
> 
> Seems worth it on its own.
>
> > The kernel lock is used in timeout_del_barrier(9) to determine whether
> > the given timeout has stopped running.  Because the softclock() runs
> > with the kernel lock we currently assume that once the calling thread
> > has taken the kernel lock any onging softclock() must have returned
> > and relinquished the lock, so the timeout in question has returned.

as i'll try to explain below, it's not about waiting for a specific
timeout, it's about knowing that a currently running timeout has
finished.

> So you want to stop using the KERNEL_LOCK() to do the serialization?  

i used the KERNEL_LOCK in timeout_barrier cos it was an elegant^Wclever^W hack.
because timeouts are run under the kernel lock, i knew i could take the
lock and know that timeouts arent running anymore. that's all.

in hindsight this means that the thread calling timeout_barrier spins
when it could sleep, and worse, it spins waiting for all pending
timeouts to run. so yeah, i agree that undoing the hack is worth it on
its own.

> > The simplest replacement I can think of is a volatile pointer to the
> > running timeout that we set before leaving the timeout_mutex and clear
> > after reentering the same during timeout_run().
> 
> Sounds like a condition variable protected by this mutex.  Interesting
> that cond_wait(9) doesn't work with a mutex. 

cond_wait borrows the sched lock to coordinate between the waiting
thread and the signalling context. the cond api is basically a wrapper
around sleep_setup/sleep_finish.
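
the usage pattern is tiny, along the lines of the EXAMPLE in the
cond_wait(9) manpage:

	/* waiting side */
	struct cond c = COND_INITIALIZER();

	/* hand &c to work that will run in some other context, then... */
	cond_wait(&c, "example");

	/* signalling side, once that work has been done */
	cond_signal(&c);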

> > So in the non-TIMEOUT_PROC case the timeout_del_barrier(9) caller just
> > spins until the timeout function returns and the timeout_running
> > pointer is changed.  Not every caller can sleep during
> > timeout_del_barrier(9).  I think spinning is the simplest thing that
> > will definitely work here.
> 
> This keeps the current semantic indeed.

i don't want to throw timeout_barrier out just yet.

> > -void
> > -timeout_barrier(struct timeout *to)
> > +int
> > +timeout_del_barrier(struct timeout *to)
> >  {
> > +   struct timeout barrier;
> > +   struct cond c = COND_INITIALIZER();
> > int needsproc = ISSET(to->to_flags, TIMEOUT_PROC);
> >  
> > timeout_sync_order(needsproc);
> >  
> > -   if (!needsproc) {
> > -   KERNEL_LOCK();
> > -   splx(splsoftclock());
> > -   KERNEL_UNLOCK();
> > -   } else {
> > -   struct cond c = COND_INITIALIZER();
> > -   struct timeout barrier;
> > +   mtx_enter(&timeout_mutex);
> > +
> > +   if (timeout_del_locked(to)) {
> > +   mtx_leave(&timeout_mutex);
> > +   return 1;
> > +   }
> >  
> > +   if (needsproc) {
> > timeout_set_proc(&barrier, timeout_proc_barrier, &c);
> > barrier.to_process = curproc->p_p;
> > -
> > -   mtx_enter(&timeout_mutex);
> > SET(barrier.to_flags, TIMEOUT_ONQUEUE);
> > CIRCQ_INSERT_TAIL(&timeout_proc, &barrier.to_list);
> > mtx_leave(&timeout_mutex);
> > -
> > wakeup_one(&timeout_proc);
> > -
> > cond_wait(&c, "tmobar");
> > +   } else {
> > +   mtx_leave(&timeout_mutex);
> > +   /* XXX Is this in the right spot? */
> > +   splx(splsoftclock());
> > +   while (timeout_running == to)
> > +   CPU_BUSY_CYCLE();
> 
> Won't splx() execute the soft-interrupt if there's any pending?
> Shouldn't the barrier be before?  Could you add `spc->spc_spinning++'
> around the spinning loop?  What happen if two threads call
> timeout_del_barrier(9) with the same argument at the same time?  Is
> it possible and/or supported?

the timeout passed to timeout_barrier is only used to figure out which
context the barrier should apply to, ie, it's used to pick between the
softint or proc queue runners. like the other barriers in other parts of
the kernel, it's just supposed to wait for the current work to finish
running.

it is unfortunate that the timeout_barrier man page isn't very
clear. it's worth reading the manpage for intr_barrier and
taskq_barrier. sched_barrier is a thing that exists too, but isn't
documented. also have a look at the EXAMPLE in the cond_wait(9) manpage
too. however, don't read the taskq_barrier code. timeout_barrier
is like intr_barrier in that it uses the argument to find out where
work runs, but again, it doesn't care about the specific 

Re: have bpf kq events fire when the interface goes away

2021-04-21 Thread David Gwynne
On Wed, Apr 21, 2021 at 01:15:53PM +, Visa Hankala wrote:
> On Wed, Apr 21, 2021 at 11:04:20AM +1000, David Gwynne wrote:
> > On Wed, Apr 21, 2021 at 10:21:32AM +1000, David Gwynne wrote:
> > > if you have a program that uses kq (or libevent) to wait for bytes to
> > > read off an idle network interface via /dev/bpf and that interface
> > > goes away, the program doesnt get woken up. this is because the kq
> > > read filter in bpf only checks if there ares bytes available. because a
> > > detached interface never gets packets (how very zen), this condition
> > > never changes and the program will never know something happened.
> > > 
> > > this has the bpf filter check if the interface is detached too. with
> > > this change my test program wakes up, tries to read, and gets EIO. which
> > > is great.
> > > 
> > > note that in the middle of this is the vdevgone machinery. when an
> > > interface is detached, bpfdetach gets called, which ends up calling
> > > vdevgone. vdevgone sort of swaps out bpf on the currently open vdev with
> > > some dead operations, part of which involves calling bpfclose() to try
> > > and clean up the existing state associated with the vdev. bpfclose tries
> > > to wake up any waiting listeners, which includes kq handlers. that's how
> > > the kernel goes from an interface being detached to the bpf kq filter
> > > being run. the bpf kq filter just has to check that the interface is
> > > still attached.
> > 
> > I thought tun(4) had this same problem, but I wrote a test and couldn't
> > reproduce it. tun works because it addresses the problem in a different
> > way. Instead of having its own kq filter check if the device is dead or
> > not, it calls klist_invalidate, which switches things around like the
> > vdevgone/vop_revoke stuff does with the vdev.
> > 
> > So an alternative way to solve this problem in bpf(4) would be the
> > following:
> > 
> > Index: bpf.c
> > ===
> > RCS file: /cvs/src/sys/net/bpf.c,v
> > retrieving revision 1.203
> > diff -u -p -r1.203 bpf.c
> > --- bpf.c   21 Jan 2021 12:33:14 -  1.203
> > +++ bpf.c   21 Apr 2021 00:54:30 -
> > @@ -401,6 +401,7 @@ bpfclose(dev_t dev, int flag, int mode, 
> > bpf_wakeup(d);
> > LIST_REMOVE(d, bd_list);
> > mtx_leave(&d->bd_mtx);
> > +   klist_invalidate(&d->bd_sel.si_note);
> > bpf_put(d);
> >  
> > return (0);
> 
> I think bpf should call klist_invalidate() from the detach path.
> bpfsdetach() might be the right place. This would make the code pattern
> similar to the existing uses of klist_invalidate().
> 
> Calling klist_invalidate() from the close function twists the logic,
> at least in my opinion. When a file descriptor is closed, the file
> descriptor layer will remove the knotes automatically. This is why close
> functions usually do not have to manage with knotes. However, the
> automatic removal does not happen when a device is revoked, which is
> mended by klist_invalidate().

yep, makes sense to me. how's this look? it works as well as my previous
diff did in my tests.

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.203
diff -u -p -r1.203 bpf.c
--- bpf.c   21 Jan 2021 12:33:14 -  1.203
+++ bpf.c   22 Apr 2021 03:09:27 -
@@ -1690,8 +1690,10 @@ bpfsdetach(void *p)
if (cdevsw[maj].d_open == bpfopen)
break;
 
-   while ((bd = SMR_SLIST_FIRST_LOCKED(&bp->bif_dlist)))
+   while ((bd = SMR_SLIST_FIRST_LOCKED(&bp->bif_dlist))) {
vdevgone(maj, bd->bd_unit, bd->bd_unit, VCHR);
+   klist_invalidate(&bd->bd_sel.si_note);
+   }
 
for (tbp = bpf_iflist; tbp; tbp = tbp->bif_next) {
if (tbp->bif_next == bp) {



Re: have bpf kq events fire when the interface goes away

2021-04-20 Thread David Gwynne
On Wed, Apr 21, 2021 at 10:21:32AM +1000, David Gwynne wrote:
> if you have a program that uses kq (or libevent) to wait for bytes to
> read off an idle network interface via /dev/bpf and that interface
> goes away, the program doesnt get woken up. this is because the kq
> read filter in bpf only checks if there are bytes available. because a
> detached interface never gets packets (how very zen), this condition
> never changes and the program will never know something happened.
> 
> this has the bpf filter check if the interface is detached too. with
> this change my test program wakes up, tries to read, and gets EIO. which
> is great.
> 
> note that in the middle of this is the vdevgone machinery. when an
> interface is detached, bpfdetach gets called, which ends up calling
> vdevgone. vdevgone sort of swaps out bpf on the currently open vdev with
> some dead operations, part of which involves calling bpfclose() to try
> and clean up the existing state associated with the vdev. bpfclose tries
> to wake up any waiting listeners, which includes kq handlers. that's how
> the kernel goes from an interface being detached to the bpf kq filter
> being run. the bpf kq filter just has to check that the interface is
> still attached.

I thought tun(4) had this same problem, but I wrote a test and couldn't
reproduce it. tun works because it addresses the problem in a different
way. Instead of having its own kq filter check if the device is dead or
not, it calls klist_invalidate, which switches things around like the
vdevgone/vop_revoke stuff does with the vdev.

So an alternative way to solve this problem in bpf(4) would be the
following:

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.203
diff -u -p -r1.203 bpf.c
--- bpf.c   21 Jan 2021 12:33:14 -  1.203
+++ bpf.c   21 Apr 2021 00:54:30 -
@@ -401,6 +401,7 @@ bpfclose(dev_t dev, int flag, int mode, 
bpf_wakeup(d);
LIST_REMOVE(d, bd_list);
mtx_leave(&d->bd_mtx);
+   klist_invalidate(&d->bd_sel.si_note);
bpf_put(d);
 
return (0);



have bpf kq events fire when the interface goes away

2021-04-20 Thread David Gwynne
if you have a program that uses kq (or libevent) to wait for bytes to
read off an idle network interface via /dev/bpf and that interface
goes away, the program doesnt get woken up. this is because the kq
read filter in bpf only checks if there are bytes available. because a
detached interface never gets packets (how very zen), this condition
never changes and the program will never know something happened.

this has the bpf filter check if the interface is detached too. with
this change my test program wakes up, tries to read, and gets EIO. which
is great.

note that in the middle of this is the vdevgone machinery. when an
interface is detached, bpfdetach gets called, which ends up calling
vdevgone. vdevgone sort of swaps out bpf on the currently open vdev with
some dead operations, part of which involves calling bpfclose() to try
and clean up the existing state associated with the vdev. bpfclose tries
to wake up any waiting listeners, which includes kq handlers. that's how
the kernel goes from an interface being detached to the bpf kq filter
being run. the bpf kq filter just has to check that the interface is
still attached.

ok?
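
fwiw, the test program is nothing special, something along these lines.
the interface name is arbitrary and the error handling is minimal:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <sys/event.h>
#include <net/if.h>
#include <net/bpf.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

int
main(void)
{
	struct ifreq ifr;
	struct kevent kev;
	char buf[65536];
	int bpf, kq;

	bpf = open("/dev/bpf", O_RDONLY);
	if (bpf == -1)
		err(1, "open /dev/bpf");

	memset(&ifr, 0, sizeof(ifr));
	strlcpy(ifr.ifr_name, "vether0", sizeof(ifr.ifr_name));
	if (ioctl(bpf, BIOCSETIF, &ifr) == -1)
		err(1, "BIOCSETIF");

	kq = kqueue();
	if (kq == -1)
		err(1, "kqueue");

	EV_SET(&kev, bpf, EVFILT_READ, EV_ADD, 0, 0, NULL);
	if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
		err(1, "kevent register");

	/* destroy the interface now. without the fix this blocks forever. */
	if (kevent(kq, NULL, 0, &kev, 1, NULL) == -1)
		err(1, "kevent wait");

	if (read(bpf, buf, sizeof(buf)) == -1)
		err(1, "read");	/* EIO expected once the interface is gone */

	return (0);
}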

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.203
diff -u -p -r1.203 bpf.c
--- bpf.c   21 Jan 2021 12:33:14 -  1.203
+++ bpf.c   21 Apr 2021 00:03:15 -
@@ -1222,6 +1222,7 @@ int
 filt_bpfread(struct knote *kn, long hint)
 {
struct bpf_d *d = kn->kn_hook;
+   struct bpf_if *bp;
 
KERNEL_ASSERT_LOCKED();
 
@@ -1229,9 +1230,11 @@ filt_bpfread(struct knote *kn, long hint
kn->kn_data = d->bd_hlen;
if (d->bd_immediate)
kn->kn_data += d->bd_slen;
+
+   bp = d->bd_bif; /* check that the interface is still attached */
mtx_leave(&d->bd_mtx);
 
-   return (kn->kn_data > 0);
+   return (kn->kn_data > 0 || bp == NULL);
 }
 
 /*



pcn(4): use ifq_dequeue instead of ifq_deq_begin/commit/rollback

2021-03-23 Thread David Gwynne
this follows the more standard order for fitting a packet onto a tx
ring. it also uses the more modern m_defrag pattern for heavily
fragmented packets.

Works For Me(tm).

ok?

Index: if_pcn.c
===
RCS file: /cvs/src/sys/dev/pci/if_pcn.c,v
retrieving revision 1.44
diff -u -p -r1.44 if_pcn.c
--- if_pcn.c10 Jul 2020 13:26:38 -  1.44
+++ if_pcn.c18 Mar 2021 01:34:01 -
@@ -811,10 +811,10 @@ void
 pcn_start(struct ifnet *ifp)
 {
struct pcn_softc *sc = ifp->if_softc;
-   struct mbuf *m0, *m;
+   struct mbuf *m0;
struct pcn_txsoft *txs;
bus_dmamap_t dmamap;
-   int error, nexttx, lasttx = -1, ofree, seg;
+   int nexttx, lasttx = -1, ofree, seg;
 
if (!(ifp->if_flags & IFF_RUNNING) || ifq_is_oactive(&ifp->if_snd))
return;
@@ -831,80 +831,34 @@ pcn_start(struct ifnet *ifp)
 * descriptors.
 */
for (;;) {
-   /* Grab a packet off the queue. */
-   m0 = ifq_deq_begin(&ifp->if_snd);
-   if (m0 == NULL)
+   if (sc->sc_txsfree == 0 ||
+   sc->sc_txfree < (PCN_NTXSEGS + 1)) {
+   ifq_set_oactive(&ifp->if_snd);
break;
-   m = NULL;
+   }
 
-   /* Get a work queue entry. */
-   if (sc->sc_txsfree == 0) {
-   ifq_deq_rollback(&ifp->if_snd, m0);
+   /* Grab a packet off the queue. */
+   m0 = ifq_dequeue(&ifp->if_snd);
+   if (m0 == NULL)
break;
-   }
 
txs = &sc->sc_txsoft[sc->sc_txsnext];
dmamap = txs->txs_dmamap;
 
-   /*
-* Load the DMA map.  If this fails, the packet either
-* didn't fit in the alloted number of segments, or we
-* were short on resources.  In this case, we'll copy
-* and try again.
-*/
-   if (bus_dmamap_load_mbuf(sc->sc_dmat, dmamap, m0,
-   BUS_DMA_WRITE|BUS_DMA_NOWAIT) != 0) {
-   MGETHDR(m, M_DONTWAIT, MT_DATA);
-   if (m == NULL) {
-   ifq_deq_rollback(&ifp->if_snd, m0);
-   break;
-   }
-   if (m0->m_pkthdr.len > MHLEN) {
-   MCLGET(m, M_DONTWAIT);
-   if ((m->m_flags & M_EXT) == 0) {
-   ifq_deq_rollback(&ifp->if_snd, m0);
-   m_freem(m);
-   break;
-   }
-   }
-   m_copydata(m0, 0, m0->m_pkthdr.len, mtod(m, caddr_t));
-   m->m_pkthdr.len = m->m_len = m0->m_pkthdr.len;
-   error = bus_dmamap_load_mbuf(sc->sc_dmat, dmamap,
-   m, BUS_DMA_WRITE|BUS_DMA_NOWAIT);
-   if (error) {
-   ifq_deq_rollback(&ifp->if_snd, m0);
-   break;
-   }
-   }
-
-   /*
-* Ensure we have enough descriptors free to describe
-* the packet.  Note, we always reserve one descriptor
-* at the end of the ring as a termination point, to
-* prevent wrap-around.
-*/
-   if (dmamap->dm_nsegs > (sc->sc_txfree - 1)) {
-   /*
-* Not enough free descriptors to transmit this
-* packet.  We haven't committed anything yet,
-* so just unload the DMA map, put the packet
-* back on the queue, and punt.  Notify the upper
-* layer that there are not more slots left.
-*
-* XXX We could allocate an mbuf and copy, but
-* XXX is it worth it?
-*/
-   ifq_set_oactive(&ifp->if_snd);
-   bus_dmamap_unload(sc->sc_dmat, dmamap);
-   m_freem(m);
-   ifq_deq_rollback(&ifp->if_snd, m0);
+   switch (bus_dmamap_load_mbuf(sc->sc_dmat, dmamap, m0,
+   BUS_DMA_NOWAIT)) {
+   case 0:
break;
-   }
+   case EFBIG:
+   if (m_defrag(m0, M_DONTWAIT) == 0 &&
+   bus_dmamap_load_mbuf(sc->sc_dmat, dmamap, m0,
+   BUS_DMA_NOWAIT) == 0)
+   break;
 
-   ifq_deq_commit(&ifp->if_snd, m0);
-   if (m != NULL) {
+   /* FALLTHROUGH */
+   default:
m_freem(m0);
-  

Re: veb(4) exceeds 1514 byte frame size while bridge(4) doesn't?

2021-03-23 Thread David Gwynne
On Sun, Mar 21, 2021 at 04:24:24PM +0100, Jurjen Oskam wrote:
> Hi,
> 
> When trying out veb(4), I ran into a situation where TCP sessions across a
> veb(4) bridge stalled while the exact same config using bridge(4) worked fine.
> 
> After some investigation, it seems that veb(4) adds an FCS to the outgoing
> frame, while bridge(4) doesn't. When this causes the outgoing frame to exceed
> 1514 bytes, the destination doesn't receive it.
> 
> I must note that I was using USB NICs, one of them being quite old.
> 
> Am I doing something wrong, or is the problem in (one of) the NIC(s)?

it looks like the ure(4) hardware doesn't strip the fcs before pushing
it to the host over usb, and the ure(4) driver doesn't strip it either.

this usually isn't a huge deal because layers like ip just ignore
the extra bytes. bridge(4) was ok with this because it actually
parses ip packets and removes the extra bytes. veb(4) does a lot less
(by design) so it just lets the fcs on the end of ure packets go out to
other nics.

from what i can tell, ure should remove the fcs. that's what this diff
does. can you try it?

cheers,
dlg

Index: if_ure.c
===
RCS file: /cvs/src/sys/dev/usb/if_ure.c,v
retrieving revision 1.21
diff -u -p -r1.21 if_ure.c
--- if_ure.c14 Oct 2020 23:47:55 -  1.21
+++ if_ure.c23 Mar 2021 10:18:54 -
@@ -1896,10 +1896,17 @@ ure_rxeof(struct usbd_xfer *xfer, void *
ifp->if_ierrors++;
goto done;
}
+   if (pktlen < ETHER_MIN_LEN) {
+   DPRINTF(("ethernet frame is too short\n"));
+   ifp->if_ierrors++;
+   goto done;
+   }
 
total_len -= roundup(pktlen, URE_RX_BUF_ALIGN);
buf += sizeof(rxhdr);
 
+   /* trim fcs */
+   pktlen -= ETHER_CRC_LEN;
m = m_devget(buf, pktlen, ETHER_ALIGN);
if (m == NULL) {
DPRINTF(("unable to allocate mbuf for next packet\n"));



Re: ifconfig.8: document veb(4)

2021-03-11 Thread David Gwynne



> On 11 Mar 2021, at 8:06 pm, Klemens Nanni  wrote:
> 
> On Thu, Mar 11, 2021 at 01:46:34PM +1000, David Gwynne wrote:
> 
>>> +.It Cm link0
>>> +Disable the filtering of 802.1Q VLAN and QinQ SVLAN packets.
>>> +.It Cm -link0
>>> +Enable the filtering of 802.1Q VLAN and QinQ SVLAN packets.
>>> +Packets will appear to enter or leave the member port interfaces.
>> 
>> I dont think this line above should be here.
> I removed this line, but it is also present in TPMR from which I copied
> it?  Should it stay there?

I don't think so. Free commit!

> 
>>> +This is the default.
>>> +.It Cm link1
>>> +Enable the filtering of IPv4 and IPv6 packets with
>>> +.Xr pf 4 .
>>> +.It Cm -link1
>>> +Disable the filtering of IPv4 and IPv6 packets with
>>> +.Xr pf 4 .
>>> +This is the default.
>> 
>> Is it worth noting that vport(4) members of a veb(4) are different to other 
>> ports? The answer to this question doesnt affect my ok, the diff should go 
>> in and we can tweak this later.
> I'll leave that to a separate diff.

Cool.


Re: ifconfig.8: document veb(4)

2021-03-10 Thread David Gwynne



> On 10 Mar 2021, at 23:07, Klemens Nanni  wrote:
> 
> On Tue, Mar 09, 2021 at 08:48:14PM +0100, Klemens Nanni wrote:
>> Simple addition of VEB right before BRIDGE.
> New diff sorting the section alphabetically between UMB and VLAN,
> thanks jmc.
> 
>> All text is copied from other already existing sections, i.e. link flag
> handling from TPMR and the rest from BRIDGE.
>> 
>> Contrary to BRIDGE, I deliberately added a synopsis for VEB such that
> there's a simple overview, especially since veb(4) currently does not
>> explain *how* to use the described features.
>> 
>> While TPMR and VEB use the same wording for link flags, their semantics
>> are different, i.e. both different flags and swapped polarity for those
>> flags.
>> 
>> Feedback? OK?

ok after you fix one little thing below.

> 
> Index: ifconfig.8
> ===
> RCS file: /cvs/src/sbin/ifconfig/ifconfig.8,v
> retrieving revision 1.365
> diff -u -p -r1.365 ifconfig.8
> --- ifconfig.89 Mar 2021 19:39:20 -   1.365
> +++ ifconfig.810 Mar 2021 13:05:38 -
> @@ -2044,6 +2044,104 @@ As soon as the interface is marked as "u
> .Xr umb 4
> device will try to establish a data connection with the service provider.
> .El
> +.Sh VEB
> +.nr nS 1
> +.Bk -words
> +.Nm ifconfig
> +.Ar veb-interface
> +.Op Cm add Ar child-iface
> +.Op Cm addspan Ar child-iface
> +.Op Cm del Ar child-iface
> +.Op Cm delspan Ar child-iface
> +.Op Oo Fl Oc Ns Cm discover Ar child-iface
> +.Op Oo Fl Oc Ns Cm learn Ar child-iface
> +.Op Oo Fl Oc Ns Cm link0
> +.Op Oo Fl Oc Ns Cm link1
> +.Op Oo Fl Oc Ns Cm protected Ar child-iface ids
> +.Ek
> +.nr nS 0
> +.Pp
> +The following options are available for a
> +.Xr veb 4
> +interface:
> +.Bl -tag -width Ds
> +.It Cm add Ar child-iface
> +Add
> +.Ar child-iface
> +as a member.
> +.It Cm addspan Ar child-iface
> +Add
> +.Ar child-iface
> +as a span port on the bridge.
> +.It Cm del Ar child-iface
> +Remove the member
> +.Ar child-iface .
> +.It Cm delspan Ar child-iface
> +Delete
> +.Ar child-iface
> +from the list of span ports of the bridge.
> +.It Cm discover Ar child-iface
> +Mark
> +.Ar child-iface
> +so that packets are sent out of the interface
> +if the destination port of the packet is unknown.
> +If the bridge has no address cache entry for the destination of
> +a packet, meaning that there is no static entry and no dynamically learned
> +entry for the destination, the bridge will forward the packet to all member
> +interfaces that have this flag set.
> +This is the default for interfaces added to the bridge.
> +.It Cm -discover Ar child-iface
> +Mark
> +.Ar child-iface
> +so that packets are not sent out of the interface
> +if the destination port of the packet is unknown.
> +Turning this flag
> +off means that the bridge will not send packets out of this interface
> +unless the packet is a broadcast packet, multicast packet, or a
> +packet with a destination address found on the interface's segment.
> +This, in combination with static address cache entries,
> +prevents potentially sensitive packets from being sent on
> +segments that have no need to see the packet.
> +.It Cm learn Ar child-iface
> +Mark
> +.Ar child-iface
> +so that the source address of packets received from
> +the interface
> +are entered into the address cache.
> +This is the default for interfaces added to the bridge.
> +.It Cm -learn Ar child-iface
> +Mark
> +.Ar child-iface
> +so that the source address of packets received from interface
> +are not entered into the address cache.
> +.It Cm link0
> +Disable the filtering of 802.1Q VLAN and QinQ SVLAN packets.
> +.It Cm -link0
> +Enable the filtering of 802.1Q VLAN and QinQ SVLAN packets.
> +Packets will appear to enter or leave the member port interfaces.

I dont think this line above should be here.

> +This is the default.
> +.It Cm link1
> +Enable the filtering of IPv4 and IPv6 packets with
> +.Xr pf 4 .
> +.It Cm -link1
> +Disable the filtering of IPv4 and IPv6 packets with
> +.Xr pf 4 .
> +This is the default.

Is it worth noting that vport(4) members of a veb(4) are different to other 
ports? The answer to this question doesnt affect my ok, the diff should go in 
and we can tweak this later.

> +.It Cm protected Ar child-iface ids
> +Put
> +.Ar child-iface
> +in protected domains.
> +.Ar ids
> +is a comma delimited list of domain IDs, between 1 and 31, to put the
> +interface in.
> +Interfaces that are part of a protected domain cannot forward traffic to any
> +other interface in that domain.
> +Interfaces do not belong to any protected domain by default.
> +.It Cm -protected Ar child-iface
> +Remove
> +.Ar child-iface
> +from all protected domains.
> +.El
> .Sh VLAN
> .nr nS 1
> .Bk -words

veb also supports bridge rules btw.



Re: pcidump(8): add missing PCI classes

2021-03-05 Thread David Gwynne
ok.

> On 5 Mar 2021, at 9:05 pm, Jan Klemkow  wrote:
> 
> Hi,
> 
> this diff adds the missing PCI classes Accelerator and Instrumentation.
> Thus, we can replace a few unknowns in its output:
> 
> -   0x0008: Class: 13 (unknown), Subclass: 00 (unknown),
> +   0x0008: Class: 13 Instrumentation, Subclass: 00 (null),
> 
> Both classes have vendor-specific APIs, so there are no predefined
> subclasses.
> 
> OK?
> 
> bye,
> Jan
> 
> Index: pcidump.c
> ===
> RCS file: /cvs/src/usr.sbin/pcidump/pcidump.c,v
> retrieving revision 1.61
> diff -u -p -r1.61 pcidump.c
> --- pcidump.c 17 Jan 2021 11:54:15 -  1.61
> +++ pcidump.c 5 Mar 2021 10:57:27 -
> @@ -1296,6 +1296,9 @@ static const struct pci_subclass pci_sub
>   { PCI_SUBCLASS_DASP_MISC,   "Miscellaneous" },
> };
> 
> +static const struct pci_subclass pci_subclass_accelerator[] = {0};
> +static const struct pci_subclass pci_subclass_instrumentation[] = {0};
> +
> #define CLASS(_c, _n, _s) { \
>   .class = _c, \
>   .name = _n, \
> @@ -1338,6 +1341,10 @@ static const struct pci_class pci_classe
>   pci_subclass_crypto),
>   CLASS(PCI_CLASS_DASP,   "DASP",
>   pci_subclass_dasp),
> + CLASS(PCI_CLASS_ACCELERATOR,"Accelerator",
> + pci_subclass_accelerator),
> + CLASS(PCI_CLASS_INSTRUMENTATION, "Instrumentation",
> + pci_subclass_instrumentation),
> };
> 
> static const struct pci_class *
> 



Re: rewritten vxlan(4)

2021-03-04 Thread David Gwynne
On Thu, Mar 04, 2021 at 03:36:19PM +1000, David Gwynne wrote:
> as the subject says, this is a rewrite of vxlan(4).
> 
> vxlan(4) relies on bridge(4) to implement learning, but i want to be
> able to remove bridge(4) one day. while working on veb(4), i wrote
> the guts of a learning bridge implementation that is now used by veb(4),
> bpe(4), and nvgre(4). that learning bridge code is now also used by
> vxlan(4).
> 
> this means that a few of the modes that the manpage talks about are
> different now. because vxlan doesnt need a bridge for learning, there's
> no "multicast mode" anymore, it just does "dynamic mode" out of the box
> when configured with a multicast destination address. there's no
> multipoint mode anymore either.
> 
> another thing that's always bothered me about vxlan(4) is how it occupies
> the "udp namespace" and how it steals packets from the udp stack.
> the new code actually creates and binds udp sockets to handle the
> vxlan packets. this means userland can't collide with a vxlan interface,
> and you get to see that the port is in use in things like netstat. e.g.:
> 
> dlg@ikkaku ~$ ifconfig vxlan0
> vxlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>   lladdr fe:e1:ba:d1:17:2a
>   index 11 llprio 3
>   encap: vnetid none parent aggr0 txprio 0 rxprio outer
>   groups: vxlan
>   tunnel: inet 192.0.2.36 port 4789 --> 239.0.0.1 ttl 1 nodf
>   Addresses (max cache: 100, timeout: 240):
>   inet 100.64.1.36 netmask 0xffffff00 broadcast 100.64.1.255
> dlg@ikkaku ~$ netstat -na -f inet -p udp
> Active Internet connections (including servers)
> Proto   Recv-Q Send-Q  Local Address  Foreign Address   
> udp  0  0  130.102.96.36.29742    129.250.35.250.123
> udp  0  0  130.102.96.36.8965     162.159.200.123.123
> udp  0  0  130.102.96.36.13189    162.159.200.1.123
> udp  0  0  130.102.96.36.46580    220.158.215.20.123
> udp  0  0  130.102.96.36.23109    103.38.121.36.123
> udp  0  0  239.0.0.1.4789         *.*
> udp  0  0  192.0.2.36.4789        *.*
> 
> ive also added loop prevention, ie, sending an interface's vxlan
> packets over itself should fail rather than panic now.

here's an updated diff with a few fixes.

Index: netinet/udp_usrreq.c
===
RCS file: /cvs/src/sys/netinet/udp_usrreq.c,v
retrieving revision 1.262
diff -u -p -r1.262 udp_usrreq.c
--- netinet/udp_usrreq.c22 Aug 2020 17:54:57 -  1.262
+++ netinet/udp_usrreq.c5 Mar 2021 06:22:43 -
@@ -112,11 +112,6 @@
 #include 
 #endif
 
-#include "vxlan.h"
-#if NVXLAN > 0
-#include 
-#endif
-
 /*
  * UDP protocol implementation.
  * Per RFC 768, August, 1980.
@@ -350,15 +345,6 @@ udp_input(struct mbuf **mp, int *offp, i
break;
 #endif /* INET6 */
}
-
-#if NVXLAN > 0
-   if (vxlan_enable > 0 &&
-#if NPF > 0
-   !(m->m_pkthdr.pf.flags & PF_TAG_DIVERTED) &&
-#endif
-   vxlan_lookup(m, uh, iphlen, , ) != 0)
-   return IPPROTO_DONE;
-#endif
 
if (m->m_flags & (M_BCAST|M_MCAST)) {
struct inpcb *last;
Index: net/if_vxlan.c
===
RCS file: /cvs/src/sys/net/if_vxlan.c,v
retrieving revision 1.82
diff -u -p -r1.82 if_vxlan.c
--- net/if_vxlan.c  25 Feb 2021 02:48:21 -  1.82
+++ net/if_vxlan.c  5 Mar 2021 06:22:43 -
@@ -1,7 +1,7 @@
-/* $OpenBSD: if_vxlan.c,v 1.82 2021/02/25 02:48:21 dlg Exp $   */
+/* $OpenBSD$ */
 
 /*
- * Copyright (c) 2013 Reyk Floeter 
+ * Copyright (c) 2021 David Gwynne 
  *
  * Permission to use, copy, modify, and distribute this software for any
  * purpose with or without fee is hereby granted, provided that the above
@@ -17,493 +17,759 @@
  */
 
 #include "bpfilter.h"
-#include "vxlan.h"
-#include "vlan.h"
 #include "pf.h"
-#include "bridge.h"
 
 #include 
 #include 
+#include 
 #include 
 #include 
-#include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
-
-#if NBPFILTER > 0
-#include 
-#endif
+#include 
 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
+#include 
 
-#if NPF > 0
-#include 
+#ifdef INET6
+#include 
+#include 
+#include 
 #endif
 
-#if NBRIDGE > 0
+/* for bridge stuff */
 #include 
+#include 
+
+#if NBPFILTER > 0
+#include 
 #endif
 
-#include 
+/*
+ * The protocol.
+ */
+
+#define VXLANMTU   1492
+#define VXLAN_PORT 4789
+
+struct vxlan_header {
+  

use 64bit ethernet addresses in carp(4)

2021-03-04 Thread David Gwynne
this passes the destination ethernet address from the network packet
as a uint64_t from ether_input into carp_input, so it can use it
to see if a carp interface should take the packet.

it's been working on amd64 and sparc64. anyone else want to try?
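
the trick is packing the 48 bit address into the low bits of a
uint64_t so the comparison becomes a single integer op instead of a
memcmp. a sketch of the helpers this relies on (from memory, not
necessarily the exact code in the headers):

	static inline uint64_t
	ether_addr_to_e64(const struct ether_addr *ea)
	{
		uint64_t e64 = 0;
		size_t i;

		for (i = 0; i < nitems(ea->ether_addr_octet); i++) {
			e64 <<= 8;
			e64 |= ea->ether_addr_octet[i];
		}

		return (e64);
	}

	/* the group bit of the first octet lands at bit 40 */
	#define ETH64_IS_MULTICAST(_e64) ((_e64) & 0x010000000000ULL)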

Index: netinet/ip_carp.c
===
RCS file: /cvs/src/sys/netinet/ip_carp.c,v
retrieving revision 1.352
diff -u -p -r1.352 ip_carp.c
--- netinet/ip_carp.c   8 Feb 2021 12:30:10 -   1.352
+++ netinet/ip_carp.c   5 Mar 2021 04:42:27 -
@@ -260,7 +260,7 @@ voidcarp_update_lsmask(struct carp_soft
 intcarp_new_vhost(struct carp_softc *, int, int);
 void   carp_destroy_vhosts(struct carp_softc *);
 void   carp_del_all_timeouts(struct carp_softc *);
-intcarp_vhe_match(struct carp_softc *, uint8_t *);
+intcarp_vhe_match(struct carp_softc *, uint64_t);
 
 struct if_clone carp_cloner =
 IF_CLONE_INITIALIZER("carp", carp_clone_create, carp_clone_destroy);
@@ -1345,6 +1345,7 @@ carp_ourether(struct ifnet *ifp, uint8_t
struct carp_softc *sc;
struct srp_ref sr;
int match = 0;
+   uint64_t dst = ether_addr_to_e64((struct ether_addr *)ena);
 
KASSERT(ifp->if_type == IFT_ETHER);
 
@@ -1352,7 +1353,7 @@ carp_ourether(struct ifnet *ifp, uint8_t
if ((sc->sc_if.if_flags & (IFF_UP|IFF_RUNNING)) !=
(IFF_UP|IFF_RUNNING))
continue;
-   if (carp_vhe_match(sc, ena)) {
+   if (carp_vhe_match(sc, dst)) {
match = 1;
break;
}
@@ -1363,29 +1364,27 @@ carp_ourether(struct ifnet *ifp, uint8_t
 }
 
 int
-carp_vhe_match(struct carp_softc *sc, uint8_t *ena)
+carp_vhe_match(struct carp_softc *sc, uint64_t dst)
 {
struct carp_vhost_entry *vhe;
struct srp_ref sr;
-   int match = 0;
+   int active = 0;
 
	vhe = SRPL_FIRST(&sr, &sc->carp_vhosts);
-   match = (vhe->state == MASTER || sc->sc_balancing >= CARP_BAL_IP) &&
-   !memcmp(ena, sc->sc_ac.ac_enaddr, ETHER_ADDR_LEN);
+   active = (vhe->state == MASTER || sc->sc_balancing >= CARP_BAL_IP);
	SRPL_LEAVE(&sr);
 
-   return (match);
+   return (active && (dst ==
+   ether_addr_to_e64((struct ether_addr *)sc->sc_ac.ac_enaddr)));
 }
 
 struct mbuf *
-carp_input(struct ifnet *ifp0, struct mbuf *m)
+carp_input(struct ifnet *ifp0, struct mbuf *m, uint64_t dst)
 {
-   struct ether_header *eh;
struct srpl *cif;
struct carp_softc *sc;
struct srp_ref sr;
 
-   eh = mtod(m, struct ether_header *);
	cif = &ifp0->if_carp;
 
	SRPL_FOREACH(sc, &sr, cif, sc_list) {
@@ -1393,7 +1392,7 @@ carp_input(struct ifnet *ifp0, struct mb
(IFF_UP|IFF_RUNNING))
continue;
 
-   if (carp_vhe_match(sc, eh->ether_dhost)) {
+   if (carp_vhe_match(sc, dst)) {
/*
 * These packets look like layer 2 multicast but they
 * are unicast at layer 3. With help of the tag the
@@ -1417,7 +1416,7 @@ carp_input(struct ifnet *ifp0, struct mb
if (sc == NULL) {
	SRPL_LEAVE(&sr);
 
-   if (!ETHER_IS_MULTICAST(eh->ether_dhost))
+   if (!ETH64_IS_MULTICAST(dst))
return (m);
 
/*
Index: netinet/ip_carp.h
===
RCS file: /cvs/src/sys/netinet/ip_carp.h,v
retrieving revision 1.50
diff -u -p -r1.50 ip_carp.h
--- netinet/ip_carp.h   24 Jul 2020 18:17:15 -  1.50
+++ netinet/ip_carp.h   5 Mar 2021 04:42:27 -
@@ -209,7 +209,7 @@ carp_strict_addr_chk(struct ifnet *ifp_a
ifp_a->if_carpdevidx == ifp_b->if_carpdevidx));
 }
 
-struct mbuf*carp_input(struct ifnet *, struct mbuf *);
+struct mbuf*carp_input(struct ifnet *, struct mbuf *, uint64_t);
 
 int carp_proto_input(struct mbuf **, int *, int, int);
 voidcarp_carpdev_state(void *);
Index: net/if_ethersubr.c
===
RCS file: /cvs/src/sys/net/if_ethersubr.c,v
retrieving revision 1.272
diff -u -p -r1.272 if_ethersubr.c
--- net/if_ethersubr.c  5 Mar 2021 03:51:41 -   1.272
+++ net/if_ethersubr.c  5 Mar 2021 04:42:27 -
@@ -460,7 +460,7 @@ ether_input(struct ifnet *ifp, struct mb
 */
if (ifp->if_type != IFT_CARP &&
	    !SRPL_EMPTY_LOCKED(&ifp0->if_carp)) {
-   m = carp_input(ifp, m);
+   m = carp_input(ifp, m, dst);
if (m == NULL)
return;
 



rewritten vxlan(4)

2021-03-03 Thread David Gwynne
as the subject says, this is a rewrite of vxlan(4).

vxlan(4) relies on bridge(4) to implement learning, but i want to be
able to remove bridge(4) one day. while working on veb(4), i wrote
the guts of a learning bridge implementation that is now used by veb(4),
bpe(4), and nvgre(4). that learning bridge code is now also used by
vxlan(4).

this means that a few of the modes that the manpage talks about are
different now. because vxlan doesnt need a bridge for learning, there's
no "multicast mode" anymore, it just does "dynamic mode" out of the box
when configured with a multicast destination address. there's no
multipoint mode anymore either.

another thing that's always bothered me about vxlan(4) is how it occupies
the "udp namespace" and how it steals packets from the udp stack.
the new code actually creates and binds udp sockets to handle the
vxlan packets. this means userland can't collide with a vxlan interface,
and you get to see that the port is in use in things like netstat. e.g.:

dlg@ikkaku ~$ ifconfig vxlan0
vxlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
lladdr fe:e1:ba:d1:17:2a
index 11 llprio 3
encap: vnetid none parent aggr0 txprio 0 rxprio outer
groups: vxlan
tunnel: inet 192.0.2.36 port 4789 --> 239.0.0.1 ttl 1 nodf
Addresses (max cache: 100, timeout: 240):
inet 100.64.1.36 netmask 0xffffff00 broadcast 100.64.1.255
dlg@ikkaku ~$ netstat -na -f inet -p udp
Active Internet connections (including servers)
Proto   Recv-Q Send-Q  Local Address  Foreign Address   
udp  0  0  130.102.96.36.29742    129.250.35.250.123
udp  0  0  130.102.96.36.8965     162.159.200.123.123
udp  0  0  130.102.96.36.13189    162.159.200.1.123
udp  0  0  130.102.96.36.46580    220.158.215.20.123
udp  0  0  130.102.96.36.23109    103.38.121.36.123
udp  0  0  239.0.0.1.4789         *.*
udp  0  0  192.0.2.36.4789        *.*

ive also added loop prevention, ie, sending an interface's vxlan
packets over itself should fail rather than panic now.
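
the socket setup looks roughly like the userland equivalent, just via
the in-kernel api. a sketch (simplified, not the exact code in the
diff below; "src_addr" stands in for the configured tunnel source):

	struct socket *so;
	struct mbuf *nam;
	struct sockaddr_in *sin;
	int error;

	error = socreate(AF_INET, &so, SOCK_DGRAM, IPPROTO_UDP);
	if (error != 0)
		return (error);

	nam = m_get(M_WAIT, MT_SONAME);
	sin = mtod(nam, struct sockaddr_in *);
	memset(sin, 0, sizeof(*sin));
	sin->sin_len = nam->m_len = sizeof(*sin);
	sin->sin_family = AF_INET;
	sin->sin_addr = src_addr;		/* tunnel source address */
	sin->sin_port = htons(4789);		/* the vxlan port */

	/* the real code holds the socket lock around the bind */
	error = sobind(so, nam, curproc);
	m_freem(nam);

received packets are then handed to the interface via the socket
upcall instead of a special case in udp_input().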

Index: netinet/udp_usrreq.c
===
RCS file: /cvs/src/sys/netinet/udp_usrreq.c,v
retrieving revision 1.262
diff -u -p -r1.262 udp_usrreq.c
--- netinet/udp_usrreq.c22 Aug 2020 17:54:57 -  1.262
+++ netinet/udp_usrreq.c4 Mar 2021 04:32:03 -
@@ -112,11 +112,6 @@
 #include 
 #endif
 
-#include "vxlan.h"
-#if NVXLAN > 0
-#include 
-#endif
-
 /*
  * UDP protocol implementation.
  * Per RFC 768, August, 1980.
@@ -350,15 +345,6 @@ udp_input(struct mbuf **mp, int *offp, i
break;
 #endif /* INET6 */
}
-
-#if NVXLAN > 0
-   if (vxlan_enable > 0 &&
-#if NPF > 0
-   !(m->m_pkthdr.pf.flags & PF_TAG_DIVERTED) &&
-#endif
-   vxlan_lookup(m, uh, iphlen, , ) != 0)
-   return IPPROTO_DONE;
-#endif
 
if (m->m_flags & (M_BCAST|M_MCAST)) {
struct inpcb *last;
Index: net/if_vxlan.c
===
RCS file: /cvs/src/sys/net/if_vxlan.c,v
retrieving revision 1.82
diff -u -p -r1.82 if_vxlan.c
--- net/if_vxlan.c  25 Feb 2021 02:48:21 -  1.82
+++ net/if_vxlan.c  4 Mar 2021 04:32:03 -
@@ -1,7 +1,7 @@
-/* $OpenBSD: if_vxlan.c,v 1.82 2021/02/25 02:48:21 dlg Exp $   */
+/* $OpenBSD$ */
 
 /*
- * Copyright (c) 2013 Reyk Floeter 
+ * Copyright (c) 2021 David Gwynne 
  *
  * Permission to use, copy, modify, and distribute this software for any
  * purpose with or without fee is hereby granted, provided that the above
@@ -17,475 +17,770 @@
  */
 
 #include "bpfilter.h"
-#include "vxlan.h"
-#include "vlan.h"
 #include "pf.h"
-#include "bridge.h"
 
 #include 
 #include 
+#include 
 #include 
 #include 
-#include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
-
-#if NBPFILTER > 0
-#include 
-#endif
+#include 
 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
+#include 
 
-#if NPF > 0
-#include 
+#ifdef INET6
+#include 
+#include 
+#include 
 #endif
 
-#if NBRIDGE > 0
+/* for bridge stuff */
 #include 
+#include 
+
+#if NBPFILTER > 0
+#include 
 #endif
 
-#include 
+/*
+ * The protocol.
+ */
+
+#define VXLANMTU   1492
+#define VXLAN_PORT 4789
+
+struct vxlan_header {
+   uint32_tvxlan_flags;
+#define VXLAN_F_I  (1U << 27)
+   uint32_tvxlan_id;
+#define VXLAN_VNI_SHIFT8
+#defineVXLAN_VNI_MASK  (0xff << VXLAN_VNI_SHIFT)
+};
+
+#define VXLAN_VNI_MAX 

Re: veb(4) support for vmd(8)?

2021-02-28 Thread David Gwynne



> On 27 Feb 2021, at 10:11, Klemens Nanni  wrote:
> 
> On Sat, Feb 27, 2021 at 09:44:03AM +1000, David Gwynne wrote:
>> 
>> 
>>> On 27 Feb 2021, at 7:50 am, Klemens Nanni  wrote:
>>> 
>>> On Sat, Feb 27, 2021 at 07:30:56AM +1000, David Gwynne wrote:
>>>> i think this is enough to let vmd wire guests up to veb interfaces.
>>> But please update vm.conf(5) to mention veb(4) and vport(4) in
>>> SWITCH CONFIGURATION as well.
>> 
>> How would you fit wording about vport(4) in?
> I was too vague, it'd be just veb(4), I guess.
> I wouldn't go into any of the bridge/switch driver's specific
> configuration in vmd(8), i.e. explicitly omit any mention of vether(4)
> or vport(4).
> 
> How about this (quietly moving vmctl(8) from bridge(4) to veb(4) while
> at it...)

reads good to me, so ok. i'll put the .c bits in now.

> 
> 
> Index: vmd/vm.conf.5
> ===
> RCS file: /cvs/src/usr.sbin/vmd/vm.conf.5,v
> retrieving revision 1.55
> diff -u -p -r1.55 vm.conf.5
> --- vmd/vm.conf.5 23 Sep 2020 19:18:18 -  1.55
> +++ vmd/vm.conf.5 27 Feb 2021 00:07:20 -
> @@ -376,9 +376,10 @@ Set the owner to the specified group.
> .Sh SWITCH CONFIGURATION
> A virtual switch allows VMs to communicate with other network interfaces on 
> the
> host system via either
> -.Xr bridge 4
> +.Xr bridge 4 ,
> +.Xr switch 4
> or
> -.Xr switch 4 .
> +.Xr veb 4 .
> The network interface for each virtual switch defined in
> .Nm
> is pre-configured using
> @@ -435,9 +436,10 @@ as described in
> .Xr ifconfig 8 .
> .It Cm interface Ar name
> Set the
> +.Xr bridge 4 ,
> .Xr switch 4
> or
> -.Xr bridge 4
> +.Xr veb 4
> network interface of this switch.
> If the type is changed to
> .Ar switch0 ,
> Index: vmctl/vmctl.8
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/vmctl.8,v
> retrieving revision 1.72
> diff -u -p -r1.72 vmctl.8
> --- vmctl/vmctl.8 16 Feb 2020 11:03:25 -  1.72
> +++ vmctl/vmctl.8 27 Feb 2021 00:07:41 -
> @@ -280,7 +280,7 @@ This tap/vio interface mapping
> allows guest network traffic to be manipulated by the host.
> Any valid host-side interface configuration may be performed on these
> tap interfaces, such as bridging (via
> -.Xr bridge 4 ) ,
> +.Xr veb 4 ) ,
> or using
> .Xr pf 4
> nat-to rules to create private or host-side NATed networks, as desired.
> @@ -423,7 +423,7 @@ Terminate VM number 1:
> # vmctl stop 1
> .Ed
> .Sh SEE ALSO
> -.Xr bridge 4 ,
> +.Xr veb 4 ,
> .Xr pf 4 ,
> .Xr tap 4 ,
> .Xr vio 4 ,



Re: veb(4) support for vmd(8)?

2021-02-26 Thread David Gwynne



> On 27 Feb 2021, at 9:57 am, Mike Larkin  wrote:
> 
> On Sat, Feb 27, 2021 at 09:44:03AM +1000, David Gwynne wrote:
>> 
>> 
>>> On 27 Feb 2021, at 7:50 am, Klemens Nanni  wrote:
>>> 
>>> On Sat, Feb 27, 2021 at 07:30:56AM +1000, David Gwynne wrote:
>>>> i think this is enough to let vmd wire guests up to veb interfaces.
>>> But please update vm.conf(5) to mention veb(4) and vport(4) in
>>> SWITCH CONFIGURATION as well.
>> 
>> How would you fit wording about vport(4) in?
>> 
>>> 
>>> OK kn
>> 
> 
> Do we want to just talk only about veb/vport and remove all the old discussion
> around bridge/vether?

I'd vote that way, but I am biased.

Some test reports that it's working well for people would be nice too.


Re: ip_fragment ip6_fragment

2021-02-26 Thread David Gwynne
I like it. I would like it more if you named the mbuf list variable "fml".

ok by me.
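
for anyone skimming, the new calling convention looks like this (a
sketch based on the pf_route() hunk quoted below):

	struct mbuf_list ml;

	/* ip_fragment() consumes m; on error it has already cleaned up */
	error = ip_fragment(m, &ml, ifp, ifp->if_mtu);
	if (error)
		return;

	while ((m = ml_dequeue(&ml)) != NULL) {
		error = ifp->if_output(ifp, m, sintosa(dst), rt);
		if (error)
			break;
	}
	if (error)
		ml_purge(&ml);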

> On 26 Feb 2021, at 9:08 pm, Alexander Bluhm  wrote:
> 
> Hi,
> 
> It always bothered me that ip_fragment() and ip6_fragment() behave
> slightly differently.  Unify them and use an mbuf list to simplify the
> fragment list.
> 
> - The functions ip_fragment() and ip6_fragment() always consume the mbuf.
> - They free the mbuf and mbuf list in case of an error.
> - They care about the counter.
> - Adjust the code a bit to make v4 and v6 look similar.
> - Maybe there was an mbuf leak when pf_route6() called pf_refragment6()
>  and it failed.  Now the mbuf is always freed by ip6_fragment().
> 
> ok?
> 
> bluhm
> 
> Index: net/if_bridge.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_bridge.c,v
> retrieving revision 1.352
> diff -u -p -r1.352 if_bridge.c
> --- net/if_bridge.c   25 Feb 2021 02:48:21 -  1.352
> +++ net/if_bridge.c   26 Feb 2021 10:41:57 -
> @@ -1853,7 +1853,7 @@ bridge_fragment(struct ifnet *brifp, str
> struct mbuf *m)
> {
>   struct llc llc;
> - struct mbuf *m0;
> + struct mbuf_list ml;
>   int error = 0;
>   int hassnap = 0;
>   u_int16_t etype;
> @@ -1911,40 +1911,32 @@ bridge_fragment(struct ifnet *brifp, str
>   return;
>   }
> 
> - error = ip_fragment(m, ifp, ifp->if_mtu);
> - if (error) {
> - m = NULL;
> - goto dropit;
> - }
> + error = ip_fragment(m, &ml, ifp, ifp->if_mtu);
> + if (error)
> + return;
> 
> - for (; m; m = m0) {
> - m0 = m->m_nextpkt;
> - m->m_nextpkt = NULL;
> - if (error == 0) {
> - if (hassnap) {
> - M_PREPEND(m, LLC_SNAPFRAMELEN, M_DONTWAIT);
> - if (m == NULL) {
> - error = ENOBUFS;
> - continue;
> - }
> - bcopy(&llc, mtod(m, caddr_t),
> - LLC_SNAPFRAMELEN);
> - }
> - M_PREPEND(m, sizeof(*eh), M_DONTWAIT);
> + while ((m = ml_dequeue(&ml)) != NULL) {
> + if (hassnap) {
> + M_PREPEND(m, LLC_SNAPFRAMELEN, M_DONTWAIT);
>   if (m == NULL) {
>   error = ENOBUFS;
> - continue;
> + break;
>   }
> - bcopy(eh, mtod(m, caddr_t), sizeof(*eh));
> - error = bridge_ifenqueue(brifp, ifp, m);
> - if (error) {
> - continue;
> - }
> - } else
> - m_freem(m);
> + bcopy(&llc, mtod(m, caddr_t), LLC_SNAPFRAMELEN);
> + }
> + M_PREPEND(m, sizeof(*eh), M_DONTWAIT);
> + if (m == NULL) {
> + error = ENOBUFS;
> + break;
> + }
> + bcopy(eh, mtod(m, caddr_t), sizeof(*eh));
> + error = bridge_ifenqueue(brifp, ifp, m);
> + if (error)
> + break;
>   }
> -
> - if (error == 0)
> + if (error)
> + ml_purge(&ml);
> + else
>   ipstat_inc(ips_fragmented);
> 
>   return;
> Index: net/pf.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
> retrieving revision 1.1112
> diff -u -p -r1.1112 pf.c
> --- net/pf.c  23 Feb 2021 11:43:40 -  1.1112
> +++ net/pf.c  26 Feb 2021 10:41:57 -
> @@ -5969,7 +5969,8 @@ pf_rtlabel_match(struct pf_addr *addr, s
> void
> pf_route(struct pf_pdesc *pd, struct pf_state *s)
> {
> - struct mbuf *m0, *m1;
> + struct mbuf *m0;
> + struct mbuf_list ml;
>   struct sockaddr_in  *dst, sin;
>   struct rtentry  *rt = NULL;
>   struct ip   *ip;
> @@ -6078,23 +6079,18 @@ pf_route(struct pf_pdesc *pd, struct pf_
>   goto bad;
>   }
> 
> - m1 = m0;
> - error = ip_fragment(m0, ifp, ifp->if_mtu);
> - if (error) {
> - m0 = NULL;
> - goto bad;
> - }
> + error = ip_fragment(m0, &ml, ifp, ifp->if_mtu);
> + if (error)
> + goto done;
> 
> - for (m0 = m1; m0; m0 = m1) {
> - m1 = m0->m_nextpkt;
> - m0->m_nextpkt = NULL;
> - if (error == 0)
> - error = ifp->if_output(ifp, m0, sintosa(dst), rt);
> - else
> - m_freem(m0);
> + while ((m0 = ml_dequeue(&ml)) != NULL) {
> + error = ifp->if_output(ifp, m0, sintosa(dst), rt);
> + if (error)
> + break;
>   }
> -
> - if (error == 0)
> 

Re: veb(4) support for vmd(8)?

2021-02-26 Thread David Gwynne



> On 27 Feb 2021, at 7:50 am, Klemens Nanni  wrote:
> 
> On Sat, Feb 27, 2021 at 07:30:56AM +1000, David Gwynne wrote:
>> i think this is enough to let vmd wire guests up to veb interfaces.
> But please update vm.conf(5) to mention veb(4) and vport(4) in
> SWITCH CONFIGURATION as well.

How would you fit wording about vport(4) in?

> 
> OK kn



veb(4) support for vmd(8)?

2021-02-26 Thread David Gwynne
i think this is enough to let vmd wire guests up to veb interfaces.

please remember that veb is not the same as bridge, so some care
has to be taken when replacing bridge with veb. the biggest difference
to note is that if you want the host to talk layer 3 (ie, ip, dhcp,
etc) with the guests, the host must have a vport(4) interface set
up for l3 and added to the veb(4). if you used vether for that, just
replace the vether interfaces with vports.

you can also have guests isolated from the host by not having vport
interfaces on their veb. you can still add a physical interface to the
veb to let guests talk l2 to the real world without having them talk to
the host they're running on.

lastly, veb doesnt filter (non-vport) ports by default. if you're
using pf and bridge to filter between guests, you have to allow pf
to run on veb by setting the link1 flag. care must be taken if
you're also filtering with pf on a vport(4) interface. if anyone is
having trouble with this bit and wants some more pointers, let me know.
i suspect you'll learn more from bitter experience though.

Index: config.c
===
RCS file: /cvs/src/usr.sbin/vmd/config.c,v
retrieving revision 1.58
diff -u -p -r1.58 config.c
--- config.c11 May 2019 19:55:14 -  1.58
+++ config.c26 Feb 2021 21:17:19 -
@@ -40,7 +40,7 @@
 #include "vmd.h"
 
 /* Supported bridge types */
-const char *vmd_descsw[] = { "switch", "bridge", NULL };
+const char *vmd_descsw[] = { "switch", "bridge", "veb", NULL };
 
 static int  config_init_localprefix(struct vmd_config *);
 
Index: priv.c
===
RCS file: /cvs/src/usr.sbin/vmd/priv.c,v
retrieving revision 1.15
diff -u -p -r1.15 priv.c
--- priv.c  28 Jun 2019 13:32:51 -  1.15
+++ priv.c  26 Feb 2021 21:17:19 -
@@ -81,7 +81,8 @@ priv_run(struct privsep *ps, struct priv
 int
 priv_dispatch_parent(int fd, struct privsep_proc *p, struct imsg *imsg)
 {
-   const char  *desct[] = { "tap", "switch", "bridge", NULL };
+   const char  *desct[] = { "tap", "switch", "bridge",
+"veb", NULL };
struct privsep  *ps = p->p_ps;
struct vmop_ifreqvfr;
struct vmd  *env = ps->ps_env;



Re: fix nvme(4): NULL deref. and empty device attachments

2021-02-24 Thread David Gwynne
ok

> On 25 Feb 2021, at 02:34, Jan Klemkow  wrote:
> 
> Hi,
> 
> While attaching the following disks, the nvme driver runs into a NULL
> dereference in nvme_scsi_capacity16() and nvme_scsi_capacity().
> 
> nvme0 at pci1 dev 0 function 0 vendor "Intel", unknown product 0x0a54 rev 
> 0x00: msix, NVMe 1.2
> nvme0: INTEL SSDPE2KX040T8, firmware VDV10170, serial PHLJ0413002P4P0DGN
> scsibus1 at nvme0: 129 targets, initiator 0
> sd0 at scsibus1 targ 1 lun 0: 
> sd0: 3815447MB, 512 bytes/sector, 7814037168 sectors
> sd1 at scsibus1 targ 2 lun 0: 
> uvm_fault(0x821d00e8, 0x0, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at  nvme_scsi_capacity16+0x39:  movq0(%rax),%rcx
> ddb{0}>
> 
> "ns" in both functions will be NULL, if "identify" is not allocated in
> nvme_scsi_probe().  Thus, its better to just not attach this empty
> disks/LUNs.
> 
> nvme0 at pci1 dev 0 function 0 vendor "Intel", unknown product 0x0a54 rev 
> 0x00: msix, NVMe 1.2
> nvme0: INTEL SSDPE2KX040T8, firmware VDV10170, serial PHLJ0413002P4P0DGN
> scsibus1 at nvme0: 129 targets, initiator 0
> sd0 at scsibus1 targ 1 lun 0: 
> sd0: 3815447MB, 512 bytes/sector, 7814037168 sectors
> ppb1 at pci0 dev 3 function 2 "AMD 17h PCIE" rev 0x00: msi
> pci2 at ppb1 bus 98
> nvme1 at pci2 dev 0 function 0 vendor "Intel", unknown product 0x0a54 rev 
> 0x00: msix, NVMe 1.2
> nvme1: INTEL SSDPE2KX040T8, firmware VDV10170, serial PHLJ041500C34P0DGN
> scsibus2 at nvme1: 129 targets, initiator 0
> sd1 at scsibus2 targ 1 lun 0: 
> sd1: 3815447MB, 512 bytes/sector, 7814037168 sectors
> ppb2 at pci0 dev 3 function 3 "AMD 17h PCIE" rev 0x00: msi
> pci3 at ppb2 bus 99
> nvme2 at pci3 dev 0 function 0 vendor "Intel", unknown product 0x0a54 rev 
> 0x00: msix, NVMe 1.2
> nvme2: INTEL SSDPE2KX040T8, firmware VDV10170, serial PHLJ041402Z64P0DGN
> scsibus3 at nvme2: 129 targets, initiator 0
> sd2 at scsibus3 targ 1 lun 0: 
> sd2: 3815447MB, 512 bytes/sector, 7814037168 sectors
> ppb3 at pci0 dev 3 function 4 "AMD 17h PCIE" rev 0x00: msi
> pci4 at ppb3 bus 100
> nvme3 at pci4 dev 0 function 0 vendor "Intel", unknown product 0x0a54 rev 
> 0x00: msix, NVMe 1.2
> nvme3: INTEL SSDPE2KX040T8, firmware VDV10170, serial PHLJ041403134P0DGN
> scsibus4 at nvme3: 129 targets, initiator 0
> sd3 at scsibus4 targ 1 lun 0: 
> sd3: 3815447MB, 512 bytes/sector, 7814037168 sectors
> 
> The following diff signals an error to the upper probing function in
> the SCSI layer to prevent further function calls in nvme(4), which would
> just lead to the error described above and hundreds of unconfigured
> devices.
> 
> OK?
> 
> bye,
> Jan
> 
> Index: dev/ic/nvme.c
> ===
> RCS file: /cvs//src/sys/dev/ic/nvme.c,v
> retrieving revision 1.90
> diff -u -p -r1.90 nvme.c
> --- dev/ic/nvme.c 9 Feb 2021 01:50:10 -   1.90
> +++ dev/ic/nvme.c 24 Feb 2021 16:01:48 -
> @@ -463,11 +463,16 @@ nvme_scsi_probe(struct scsi_link *link)
>   scsi_io_put(>sc_iopool, ccb);
> 
>   identify = NVME_DMA_KVA(mem);
> - if (rv == 0 && lemtoh64(&identify->nsze) > 0) {
> - /* Commit namespace if it has a size greater than zero. */
> - identify = malloc(sizeof(*identify), M_DEVBUF, M_WAITOK);
> - memcpy(identify, NVME_DMA_KVA(mem), sizeof(*identify));
> - sc->sc_namespaces[link->target].ident = identify;
> + if (rv == 0) {
> + if (lemtoh64(&identify->nsze) > 0) {
> + /* Commit namespace if it has a size greater than zero. 
> */
> + identify = malloc(sizeof(*identify), M_DEVBUF, 
> M_WAITOK);
> + memcpy(identify, NVME_DMA_KVA(mem), sizeof(*identify));
> + sc->sc_namespaces[link->target].ident = identify;
> + } else {
> + /* Don't attach a namespace if its size is zero. */
> + rv = ENXIO;
> + }
>   }
> 
>   nvme_dmamem_free(sc, mem);
> 



Re: have m_copydata use a void * instead of caddr_t

2021-02-23 Thread David Gwynne
On Tue, Feb 23, 2021 at 01:09:06PM +0100, Alexander Bluhm wrote:
> On Tue, Feb 23, 2021 at 07:31:30PM +1000, David Gwynne wrote:
> > i'm not a fan of having to cast to caddr_t when we have modern
> > inventions like void *s we can take advantage of.
> 
> Shoud you remove all the (caddr_t) casts in the callers then?

i asked coccinelle to have a go, but it has terrible ideas about how to
format code.

> Without that step this diff does not provide more consistency.

it's a start though.  cocci and i came up with this to push in after.

Index: arch/armv7/sunxi/sxie.c
===
RCS file: /cvs/src/sys/arch/armv7/sunxi/sxie.c,v
retrieving revision 1.29
diff -u -p -r1.29 sxie.c
--- arch/armv7/sunxi/sxie.c 10 Jul 2020 13:26:36 -  1.29
+++ arch/armv7/sunxi/sxie.c 24 Feb 2021 06:19:13 -
@@ -524,7 +524,7 @@ sxie_start(struct ifnet *ifp)
SXIWRITE4(sc, SXIE_TXPKTLEN0 + (fifo * 4), m->m_pkthdr.len);
 
/* copy the actual packet to fifo XXX through 'align buffer' */
-   m_copydata(m, 0, m->m_pkthdr.len, (caddr_t)td);
+   m_copydata(m, 0, m->m_pkthdr.len, td);
bus_space_write_multi_4(sc->sc_iot, sc->sc_ioh,
SXIE_TXIO0,
(uint32_t *)td, SXIE_ROUNDUP(m->m_pkthdr.len, 4) >> 2);
Index: arch/octeon/dev/octcrypto.c
===
RCS file: /cvs/src/sys/arch/octeon/dev/octcrypto.c,v
retrieving revision 1.3
diff -u -p -r1.3 octcrypto.c
--- arch/octeon/dev/octcrypto.c 10 Mar 2019 14:20:44 -  1.3
+++ arch/octeon/dev/octcrypto.c 24 Feb 2021 06:19:13 -
@@ -739,7 +739,7 @@ octcrypto_authenc_gmac(struct cryptop *c
} else {
if (crp->crp_flags & CRYPTO_F_IMBUF)
m_copydata((struct mbuf *)crp->crp_buf,
-   crde->crd_inject, ivlen, (uint8_t *)iv);
+   crde->crd_inject, ivlen, iv);
else
cuio_copydata((struct uio *)crp->crp_buf,
crde->crd_inject, ivlen, (uint8_t *)iv);
@@ -1035,10 +1035,8 @@ octcrypto_authenc_hmac(struct cryptop *c
memcpy(iv, crde->crd_iv, ivlen);
} else {
if (crp->crp_flags & CRYPTO_F_IMBUF)
-   m_copydata(
-   (struct mbuf *)crp->crp_buf,
-   crde->crd_inject, ivlen,
-   (uint8_t *)iv);
+   m_copydata((struct mbuf *)crp->crp_buf,
+   crde->crd_inject, ivlen, iv);
else
cuio_copydata(
(struct uio *)crp->crp_buf,
Index: dev/ic/acx.c
===
RCS file: /cvs/src/sys/dev/ic/acx.c,v
retrieving revision 1.124
diff -u -p -r1.124 acx.c
--- dev/ic/acx.c10 Jul 2020 13:26:37 -  1.124
+++ dev/ic/acx.c24 Feb 2021 06:19:13 -
@@ -2373,7 +2373,7 @@ acx_set_probe_resp_tmplt(struct acx_soft
IEEE80211_ADDR_COPY(wh->i_addr3, ni->ni_bssid);
*(u_int16_t *)wh->i_seq = 0;
 
-   m_copydata(m, 0, m->m_pkthdr.len, (caddr_t)&resp.data);
+   m_copydata(m, 0, m->m_pkthdr.len, &resp.data);
len = m->m_pkthdr.len + sizeof(resp.size);
m_freem(m); 
 
@@ -2427,7 +2427,7 @@ acx_set_beacon_tmplt(struct acx_softc *s
return (1);
}
 
-   m_copydata(m, 0, off, (caddr_t)&beacon.data);
+   m_copydata(m, 0, off, &beacon.data);
len = off + sizeof(beacon.size);
 
	if (acx_set_tmplt(sc, ACXCMD_TMPLT_BEACON, &beacon, len) != 0) {
@@ -2442,7 +2442,7 @@ acx_set_beacon_tmplt(struct acx_softc *s
return (0);
}
 
-   m_copydata(m, off, len, (caddr_t)&beacon.data);
+   m_copydata(m, off, len, &beacon.data);
len += sizeof(beacon.size);
m_freem(m);
 
Index: dev/ic/an.c
===
RCS file: /cvs/src/sys/dev/ic/an.c,v
retrieving revision 1.77
diff -u -p -r1.77 an.c
--- dev/ic/an.c 8 Dec 2020 04:37:27 -   1.77
+++ dev/ic/an.c 24 Feb 2021 06:19:13 -
@@ -781,7 +781,7 @@ an_mwrite_bap(struct an_softc *sc, int i
len = min(m->m_len, totlen);
 
if ((mtod(m, u_long) & 0x1) || (len & 0x1)) {
-   m_copydata(m, 0, totlen, (caddr_t)&sc->sc_buf.sc_txbuf);
+   m_copydata(m, 0, totlen, &sc->sc_buf.sc_txbuf);
cnt = (totlen + 1) / 2;
an_swap16((

have m_copydata use a void * instead of caddr_t

2021-02-23 Thread David Gwynne
i'm not a fan of having to cast to caddr_t when we have modern
inventions like void *s we can take advantage of.

ok?
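
the payoff at the call sites is tiny but it's everywhere. an
illustrative caller (not from the tree):

	struct ip hdr;

	/* before: the cast is just noise */
	m_copydata(m, 0, sizeof(hdr), (caddr_t)&hdr);

	/* after: any object pointer can be passed directly */
	m_copydata(m, 0, sizeof(hdr), &hdr);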

Index: share/man/man9/mbuf.9
===
RCS file: /cvs/src/share/man/man9/mbuf.9,v
retrieving revision 1.120
diff -u -p -r1.120 mbuf.9
--- share/man/man9/mbuf.9   12 Dec 2020 11:48:52 -  1.120
+++ share/man/man9/mbuf.9   23 Feb 2021 09:29:55 -
@@ -116,7 +116,7 @@
 .Ft void
 .Fn m_reclaim "void"
 .Ft void
-.Fn m_copydata "struct mbuf *m" "int off" "int len" "caddr_t cp"
+.Fn m_copydata "struct mbuf *m" "int off" "int len" "void *cp"
 .Ft void
 .Fn m_cat "struct mbuf *m" "struct mbuf *n"
 .Ft struct mbuf *
@@ -673,7 +673,7 @@ is a
 pointer, no action occurs.
 .It Fn m_reclaim "void"
 Ask protocols to free unused memory space.
-.It Fn m_copydata "struct mbuf *m" "int off" "int len" "caddr_t cp"
+.It Fn m_copydata "struct mbuf *m" "int off" "int len" "void *cp"
 Copy data from the mbuf chain pointed to by
 .Fa m
 starting at
Index: sys/sys/mbuf.h
===
RCS file: /cvs/src/sys/sys/mbuf.h,v
retrieving revision 1.251
diff -u -p -r1.251 mbuf.h
--- sys/sys/mbuf.h  12 Dec 2020 11:49:02 -  1.251
+++ sys/sys/mbuf.h  23 Feb 2021 09:29:55 -
@@ -435,7 +435,7 @@ int m_copyback(struct mbuf *, int, int, 
 struct mbuf *m_freem(struct mbuf *);
 void   m_purge(struct mbuf *);
 void   m_reclaim(void *, int);
-void   m_copydata(struct mbuf *, int, int, caddr_t);
+void   m_copydata(struct mbuf *, int, int, void *);
 void   m_cat(struct mbuf *, struct mbuf *);
 struct mbuf *m_devget(char *, int, int);
 intm_apply(struct mbuf *, int, int,
Index: sys/kern/uipc_mbuf.c
===
RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v
retrieving revision 1.277
diff -u -p -r1.277 uipc_mbuf.c
--- sys/kern/uipc_mbuf.c13 Jan 2021 12:38:36 -  1.277
+++ sys/kern/uipc_mbuf.c23 Feb 2021 09:29:55 -
@@ -711,8 +711,9 @@ nospace:
  * continuing for "len" bytes, into the indicated buffer.
  */
 void
-m_copydata(struct mbuf *m, int off, int len, caddr_t cp)
+m_copydata(struct mbuf *m, int off, int len, void *p)
 {
+   caddr_t cp = p;
unsigned count;
 
if (off < 0)



Re: veb(4), a virtual ethernet bridge (that could replace bridge(4)?)

2021-02-22 Thread David Gwynne



> On 22 Feb 2021, at 12:46 am, Vitaliy Makkoveev  wrote:
> 
> Hello.
> 
> 
>> +ifp->if_ioctl = veb_ioctl;
>> +ifp->if_input = veb_input;
>> +//ifp->if_rtrequest = veb_rtrequest;
>> +ifp->if_output = veb_output;
>> +ifp->if_enqueue = veb_enqueue;
> 
> Could you replace c++ style comment in veb_clone_create()?

yep.

> 
>> +veb_clone_destroy(struct ifnet *ifp)
>> +{
>> +struct veb_softc *sc = ifp->if_softc;
>> +struct veb_port *p, *np;
>> +
>> +NET_LOCK();
>> +sc->sc_dead = 1;
>> +
>> +if (ISSET(ifp->if_flags, IFF_RUNNING))
>> +veb_down(sc);
>> +NET_UNLOCK();
>> +
>> +if_detach(ifp);
> 
> 
> Also veb_down() looks strange here. I guess there is no reason to
> play with `if_flags' here and smr_barrier() could be called after
> if_detach(). This makes `sc_dead' unnecessary.

i need to think about sc_dead again. i do it in a bunch of different drivers 
and you're pretty confident it's not needed anymore.

technically the flags don't need to be cleared, but i like having the flow 
right in case i make veb_down do more in the future.



Re: Posted vs. non-posted device access

2021-02-15 Thread David Gwynne



> On 16 Feb 2021, at 06:01, Mark Kettenis  wrote:
> 
>> Date: Mon, 15 Feb 2021 01:19:29 +0100
>> From: Patrick Wildt 
>> 
>> Am Mon, Feb 15, 2021 at 09:55:56AM +1000 schrieb David Gwynne:
>>> 
>>> 
>>>> On 15 Feb 2021, at 07:54, Mark Kettenis  wrote:
>>>> 
>>>> One of the aspects of device access is whether CPU writes to a device
>>>> are posted or non-posted.  For non-posted writes, the CPU will wait
>>>> for the device to acknowledge that the write has been performed.  If the
>>>> device sits on a bus far away, this can take a while and slow things
>>>> down.  The alternative is so-called posted writes.  The CPU will
>>>> "post" the write to the bus without waiting for an acknowledgement.
>>>> The CPU may receive an asynchronous notification at a later time that
>>>> the write didn't succeed or a failing write may be dropped without
>>>> further notification.  On most architectures whether writes are posted
>>>> or not is a property of the bus between the CPU and the device.  For
>>>> example, memory mapped I/O on the PCI bus is always posted and there
>>>> is nothing the CPU can do about it.
>>>> 
>>>> On the ARM architecture though we can indicate to the CPU whether
>>>> writes to a certain address range should be posted or not.  This is
>>>> done by specifying certain memory attributes in the mappings used by
>>>> the MMU.  The OpenBSD kernel always specifies device access as
>>>> non-posted.  On all ARM implementations we have seen so far this seems
>>>> to work even for writes to devices connected to a PCIe bus.  There
>>>> might be a penalty though, so I need to investigate this a bit
>>>> further.
>>>> 
>>>> However, on Apple's M1 SoC, this isn't the case.  Non-posted writes to
>>>> a bus that uses posted writes fail and vice-versa.  So in order to use
>>>> the PCIe bus on these SoCs we need to specify the right memory
>>>> attributes.  The diff below implements this by introducing a new
>>>> BUS_SPACE_MAP_POSTED flag.  At this point I don't expect generic
>>>> drivers to use this flag yet.  So there is no need to add it for other
>>>> architectures.  But I don't rule out we may have to use this flag in
>>>> sys/dev/fdt sometime in the future.  That is why I posted this to a
>>>> wider audience.
>>> 
>>> You don't want to (ab)use one of the existing flags? If I squint
>>> and read kind of quickly I could imagine this is kind of like
>>> write combining, like what BUS_SPACE_MAP_PREFETCHABLE can do on
>>> pci busses.
>> 
>> BUS_SPACE_MAP_PREFETCHABLE should be "normal uncached" memory on arm64,
>> which is different to device memory.  That said I have a device where
>> amdgpu(4) doesn't behave if it's "normal uncached", and I'm not sure if
>> it's the HW's fault or if there's some barrier missing.  Still, I would
>> not use BUS_SPACE_MAP_PREFETCHABLE for nGnRnE vs nGnRE.
>> 
>> More info on device vs normal is here:
>> 
>> https://developer.arm.com/documentation/102376/0100/Normal-memory
>> https://developer.arm.com/documentation/102376/0100/Device-memory
> 
> BUS_SPACE_MAP_PREFETCHABLE is used for parts of the address space that
> are "side-effect free".  That means that multiple writes may be
> combined into one and reads might actually fetch more data than you
> asked for.  Typical use is a framebuffer or some other device memory
> that is accessed across a PCI bus.  In most cases that is not what you
> want to access device registers where a read or a write triggers some
> action in the device.
> 
> Posted writes still happen as issued (and in principle in the same
> order as issued).  But they may complete at a later time after the CPU
> has executed many more instructions.  The traditional way to make sure
> posted writes on a PCI bus have completed is to read something back
> from the device.
> 
> So as Patrick said, it's a different thing.
> 
>>> If this does leak into fdt, would it just be a nop on other archs
>>> that use those drivers?
> 
> Most likely, yes.  All other architectures that I know of don't
> require the CPU to do something different for posted and non-posted
> writes.

fair enough. maybe whether the bus is posted or not could be part of the 
fdt/acpi info, rather than something hardcoded in drivers?

anyway, i'm ok with this as is so go for it.
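
for illustration, a driver that knows its registers sit behind a bus
with posted writes would ask for that at map time, something like:

	/* hypothetical attach code, not from the diff */
	if (bus_space_map(sc->sc_iot, addr, size,
	    BUS_SPACE_MAP_POSTED, &sc->sc_ioh) != 0) {
		printf(": can't map registers\n");
		return;
	}

everything else keeps using the normal bus_space read and write calls.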

dlg

> 
>>>> 
>>>> ok?

use rtalloc_mpath in pf_route{,6}

2021-02-15 Thread David Gwynne
if you have multiple links to the same destination, this will let you
use them via route-to/reply-to/dup-to. rtalloc_mpath hashes the source
address to pick one of the multipath routes, so packets from the same
source should stick to the same path.

ok?

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1110
diff -u -p -r1.1110 pf.c
--- pf.c12 Feb 2021 16:16:10 -  1.1110
+++ pf.c15 Feb 2021 09:59:50 -
@@ -6020,7 +6020,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
dst->sin_addr = s->rt_addr.v4;
rtableid = m0->m_pkthdr.ph_rtableid;
 
-   rt = rtalloc(sintosa(dst), RT_RESOLVE, rtableid);
+   rt = rtalloc_mpath(sintosa(dst), &ip->ip_src.s_addr, rtableid);
if (!rtisvalid(rt)) {
if (s->rt != PF_DUPTO) {
pf_send_icmp(m0, ICMP_UNREACH, ICMP_UNREACH_HOST,
@@ -6162,7 +6162,8 @@ pf_route6(struct pf_pdesc *pd, struct pf
dst->sin6_addr = s->rt_addr.v6;
rtableid = m0->m_pkthdr.ph_rtableid;
 
-   rt = rtalloc(sin6tosa(dst), RT_RESOLVE, rtableid);
+   rt = rtalloc_mpath(sin6tosa(dst), &ip6->ip6_src.s6_addr32[0],
+   rtableid);
if (!rtisvalid(rt)) {
if (s->rt != PF_DUPTO) {
pf_send_icmp(m0, ICMP6_DST_UNREACH,



veb(4), a virtual ethernet bridge (that could replace bridge(4)?)

2021-02-14 Thread David Gwynne
_flush(&sc->sc_eb, IFBF_FLUSHALL);
 
return (0);
 }
@@ -771,7 +624,7 @@ bpe_set_parent(struct bpe_softc *sc, con
 
/* commit */
sc->sc_key.k_if = ifp0->if_index;
-   bpe_flush_map(sc, IFBF_FLUSHALL);
+   etherbridge_flush(&sc->sc_eb, IFBF_FLUSHALL);
 
 put:
if_put(ifp0);
@@ -804,7 +657,7 @@ bpe_del_parent(struct bpe_softc *sc)
 
/* commit */
sc->sc_key.k_if = 0;
-   bpe_flush_map(sc, IFBF_FLUSHALL);
+   etherbridge_flush(&sc->sc_eb, IFBF_FLUSHALL);
 
return (0);
 }
@@ -822,75 +675,6 @@ bpe_find(struct ifnet *ifp0, uint32_t is
return (sc);
 }
 
-static void
-bpe_input_map(struct bpe_softc *sc, const uint8_t *ba, const uint8_t *ca)
-{
-   struct bpe_entry *be;
-   int new = 0;
-
-   if (ETHER_IS_MULTICAST(ca))
-   return;
-
-   /* remember where it came from */
-   rw_enter_read(&sc->sc_bridge_lock);
-   be = RBT_FIND(bpe_map, &sc->sc_bridge_map, (struct bpe_entry *)ca);
-   if (be == NULL)
-   new = 1;
-   else {
-   be->be_age = getuptime(); /* only a little bit racy */
-
-   if (be->be_type != BPE_ENTRY_DYNAMIC ||
-   ETHER_IS_EQ(ba, &be->be_b_da))
-   be = NULL;
-   else
-   refcnt_take(&be->be_refs);
-   }
-   rw_exit_read(&sc->sc_bridge_lock);
-
-   if (new) {
-   struct bpe_entry *obe;
-   unsigned int num;
-
-   be = pool_get(&bpe_entry_pool, PR_NOWAIT);
-   if (be == NULL) {
-   /* oh well */
-   return;
-   }
-
-   memcpy(&be->be_c_da, ca, sizeof(be->be_c_da));
-   memcpy(&be->be_b_da, ba, sizeof(be->be_b_da));
-   be->be_type = BPE_ENTRY_DYNAMIC;
-   refcnt_init(&be->be_refs);
-   be->be_age = getuptime();
-
-   rw_enter_write(&sc->sc_bridge_lock);
-   num = sc->sc_bridge_num;
-   if (++num > sc->sc_bridge_max)
-   obe = be;
-   else {
-   /* try and give the ref to the map */
-   obe = RBT_INSERT(bpe_map, &sc->sc_bridge_map, be);
-   if (obe == NULL) {
-   /* count the insert */
-   sc->sc_bridge_num = num;
-   }
-   }
-   rw_exit_write(&sc->sc_bridge_lock);
-
-   if (obe != NULL)
-   pool_put(&bpe_entry_pool, obe);
-   } else if (be != NULL) {
-   rw_enter_write(&sc->sc_bridge_lock);
-   memcpy(&be->be_b_da, ba, sizeof(be->be_b_da));
-   rw_exit_write(&sc->sc_bridge_lock);
-
-   if (refcnt_rele(&be->be_refs)) {
-   /* ioctl may have deleted the entry */
-   pool_put(&bpe_entry_pool, be);
-   }
-   }
-}
-
 void
 bpe_input(struct ifnet *ifp0, struct mbuf *m)
 {
@@ -928,7 +712,8 @@ bpe_input(struct ifnet *ifp0, struct mbu
 
ceh = (struct ether_header *)(itagp + 1);
 
-   bpe_input_map(sc, beh->ether_shost, ceh->ether_shost);
+   etherbridge_map(&sc->sc_eb, ceh->ether_shost,
+   (struct ether_addr *)beh->ether_shost);
 
m_adj(m, sizeof(*beh) + sizeof(*itagp));
 
@@ -1035,12 +820,62 @@ bpe_cmp(const struct bpe_key *a, const s
return (1);
if (a->k_isid < b->k_isid)
return (-1);
-
+ 
return (0);
 }
 
-static inline int
-bpe_entry_cmp(const struct bpe_entry *a, const struct bpe_entry *b)
+static int
+bpe_eb_port_eq(void *arg, void *a, void *b)
+{
+   struct ether_addr *ea = a, *eb = b;
+
+   return (memcmp(ea, eb, sizeof(*ea)) == 0);
+}
+
+static void *
+bpe_eb_port_take(void *arg, void *port)
+{
+   struct ether_addr *ea = port;
+   struct ether_addr *endpoint;
+
+   endpoint = pool_get(&bpe_endpoint_pool, PR_NOWAIT);
+   if (endpoint == NULL)
+   return (NULL);
+
+   memcpy(endpoint, ea, sizeof(*endpoint));
+
+   return (endpoint);
+}
+
+static void
+bpe_eb_port_rele(void *arg, void *port)
+{
+   struct ether_addr *endpoint = port;
+
+   pool_put(&bpe_endpoint_pool, endpoint);
+}
+
+static size_t
+bpe_eb_port_ifname(void *arg, char *dst, size_t len, void *port)
 {
-   return memcmp(&a->be_c_da, &b->be_c_da, sizeof(a->be_c_da));
+   struct bpe_softc *sc = arg;
+
+   return (strlcpy(dst, sc->sc_ac.ac_if.if_xname, len));
+}
+
+static void
+bpe_eb_port_sa(void *arg, struct sockaddr_storage *ss, void *port)
+{
+   struct ether_addr *endpoint = port;
+   struct sockaddr_dl *sdl;
+
+   sdl = (struct sockaddr_dl *)ss;
+   sdl->sdl_len = sizeof(sdl);
+   sdl->sdl_family = AF_LINK;
+   sdl->sdl_index = 0;
+   sdl->sdl_type = IFT_ETHER;
+   sdl->

Re: Posted vs. non-posted device access

2021-02-14 Thread David Gwynne



> On 15 Feb 2021, at 07:54, Mark Kettenis  wrote:
> 
> One of the aspects of device access is whether CPU writes to a device
> are posted or non-posted.  For non-posted writes, the CPU will wait
> for the device to acknowledge that the write has been performed.  If the
> device sits on a bus far away, this can take a while and slow things
> down.  The alternative is so-called posted writes.  The CPU will
> "post" the write to the bus without waiting for an acknowledgement.
> The CPU may receive an asynchronous notification at a later time that
> the write didn't succeed or a failing write may be dropped without
> further notification.  On most architectures whether writes are posted
> or not is a property of the bus between the CPU and the device.  For
> example, memory mapped I/O on the PCI bus is always posted and there
> is nothing the CPU can do about it.
> 
> On the ARM architecture though we can indicate to the CPU whether
> writes to a certain address range should be posted or not.  This is
> done by specifying certain memory attributes in the mappings used by
> the MMU.  The OpenBSD kernel always specifies device access as
> non-posted.  On all ARM implementations we have seen so far this seems
> to work even for writes to devices connected to a PCIe bus.  There
> might be a penalty though, so I need to investigate this a bit
> further.
> 
> However, on Apple's M1 SoC, this isn't the case.  Non-posted writes to
> a bus that uses posted writes fail and vice-versa.  So in order to use
> the PCIe bus on these SoCs we need to specify the right memory
> attributes.  The diff below implements this by introducing a new
> BUS_SPACE_MAP_POSTED flag.  At this point I don't expect generic
> drivers to use this flag yet.  So there is no need to add it for other
> architectures.  But I don't rule out we may have to use this flag in
> sys/dev/fdt sometime in the future.  That is why I posted this to a
> wider audience.

You don't want to (ab)use one of the existing flags? If I squint and read kind 
of quickly I could imagine this is kind of like write combining, like what 
BUS_SPACE_MAP_PREFETCHABLE can do on pci busses.

If this does leak into fdt, would it just be a nop on other archs that use 
those drivers?

dlg

> 
> ok?
> 
> 
> Index: arch/arm64/arm64/locore.S
> ===
> RCS file: /cvs/src/sys/arch/arm64/arm64/locore.S,v
> retrieving revision 1.32
> diff -u -p -r1.32 locore.S
> --- arch/arm64/arm64/locore.S 19 Oct 2020 17:57:40 -  1.32
> +++ arch/arm64/arm64/locore.S 14 Feb 2021 21:28:26 -
> @@ -233,9 +233,10 @@ switch_mmu_kernel:
> mair:
>   /* Device | Normal (no cache, write-back, write-through) */
>   .quad   MAIR_ATTR(0x00, 0) |\
> - MAIR_ATTR(0x44, 1) |\
> - MAIR_ATTR(0xff, 2) |\
> - MAIR_ATTR(0x88, 3)
> + MAIR_ATTR(0x04, 1) |\
> + MAIR_ATTR(0x44, 2) |\
> + MAIR_ATTR(0xff, 3) |\
> + MAIR_ATTR(0x88, 4)
> tcr:
>   .quad (TCR_T1SZ(64 - VIRT_BITS) | TCR_T0SZ(64 - 48) | \
>   TCR_AS | TCR_TG1_4K | TCR_CACHE_ATTRS | TCR_SMP_ATTRS)
> Index: arch/arm64/arm64/locore0.S
> ===
> RCS file: /cvs/src/sys/arch/arm64/arm64/locore0.S,v
> retrieving revision 1.5
> diff -u -p -r1.5 locore0.S
> --- arch/arm64/arm64/locore0.S28 May 2019 20:32:30 -  1.5
> +++ arch/arm64/arm64/locore0.S14 Feb 2021 21:28:26 -
> @@ -34,8 +34,8 @@
> #include 
> 
> #define   DEVICE_MEM  0
> -#define  NORMAL_UNCACHED 1
> -#define  NORMAL_MEM  2
> +#define  NORMAL_UNCACHED 2
> +#define  NORMAL_MEM  3
> 
> /*
>  * We assume:
> Index: arch/arm64/arm64/machdep.c
> ===
> RCS file: /cvs/src/sys/arch/arm64/arm64/machdep.c,v
> retrieving revision 1.57
> diff -u -p -r1.57 machdep.c
> --- arch/arm64/arm64/machdep.c11 Feb 2021 23:55:48 -  1.57
> +++ arch/arm64/arm64/machdep.c14 Feb 2021 21:28:27 -
> @@ -1188,7 +1188,7 @@ pmap_bootstrap_bs_map(bus_space_tag_t t,
> 
>   for (pa = startpa; pa < endpa; pa += PAGE_SIZE, va += PAGE_SIZE)
>   pmap_kenter_cache(va, pa, PROT_READ | PROT_WRITE,
> - PMAP_CACHE_DEV);
> + PMAP_CACHE_DEV_NGNRNE);
> 
>   virtual_avail = va;
> 
> Index: arch/arm64/arm64/pmap.c
> ===
> RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
> retrieving revision 1.70
> diff -u -p -r1.70 pmap.c
> --- arch/arm64/arm64/pmap.c   25 Jan 2021 19:37:17 -  1.70
> +++ arch/arm64/arm64/pmap.c   14 Feb 2021 21:28:28 -
> @@ -472,7 +472,7 @@ pmap_enter(pmap_t pm, vaddr_t va, paddr_
>   if (pa & PMAP_NOCACHE)
>   cache = PMAP_CACHE_CI;
>   if (pa & PMAP_DEVICE)
> - cache = PMAP_CACHE_DEV;
> +   

Re: "monitoring only" interfaces

2021-02-14 Thread David Gwynne
On Sun, Feb 07, 2021 at 06:55:37PM +0100, Sebastian Benoit wrote:
> David Gwynne(da...@gwynne.id.au) on 2021.01.27 17:13:09 +1000:
> > some of the discussion around dup-to made me think that a diff we
> > have here at work might be more broadly useful.
> > 
> > we run a box here with a bunch of ethernet ports plugged into span
> > ports on switches. basically every packet going to our firewalls gets
> > duplicated to this host. we then have code that generates flow data from
> > these ports. it's also nice to have one place to ssh to and so you can
> > tcpdump things. anyway, that flow collector watches packets on those
> > interfaces via bpf, but apart from that we don't actually want to
> > do anything with the packets those interfaces receive. we especially
> > do not want them entering the stack. we ssh to this box over the
> > firewall, so if the span port copies those packets to the box and
> > the stack tries to process them, things dont work great.
> > 
> > we could enable the fildrop stuff with bpf, but there's an annoying gap
> > between when the interfaces come up and when the flow collector starts
> > running. also, if the flow collector crashes or we restart it cos we're
> > hacking on the code, this provides more gaps for packets to enter the
> > stack.
> > 
> > we prevented this by adding a "monitor" interface flag. it makes the
> > interface input code drop all the packets rather than queuing them for
> > the stack to process.
> > 
> > is there any interest in having this in the tree?
> > 
> > if so, i need to do some work to make sure all interfaces push
> > packets into the stack with if_input, ifiq_input, or if_vinput. a
> > bunch of them like gif and gre currently call protocol input routines
> > directly, so they skip this check.
> > 
> > so, thoughts?
> 
> I'd like this.
> 
> Previously when i needed something similar, i put the interface into its own
> routing domain. But of course that doesnt avoid the packets entering the
> stack, just some consequences.
> 
> I also think 'monitor' is the right keyword for ifconfig.
> 
> ok benno, but manpage is missing

this is also missing. this lets l3 interfaces use the if_vinput
machinery by providing a p2p_input handler. for if_vinput to support
p2p interfaces, they have to be able to say what kind of bpf_mtap
handling they need rather than have the machinery assume everything
is an ethernet packet. this also lets us factor out the l3 input
handling from a lot of these drivers.

in turn, this makes it possible to use monitor on gif, gre, mgre,
mpe, and mpip. looks like it would already work on tun, but im not
sure what the point of that is.

ok?
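
from a driver's point of view the wiring is small. a sketch of what
an l3 tunnel driver ends up doing with this (abridged, not a complete
attach routine):

	/* attach: let the common code handle bpf and l3 demux */
	ifp->if_input = p2p_input;
	ifp->if_bpf_mtap = p2p_bpf_mtap;

	/* rx path: tag the address family and hand the packet over */
	m->m_pkthdr.ph_family = AF_INET;	/* or AF_INET6, AF_MPLS */
	if_vinput(ifp, m);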

Index: net/if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.625
diff -u -p -r1.625 if.c
--- net/if.c18 Jan 2021 09:55:43 -  1.625
+++ net/if.c14 Feb 2021 12:12:21 -
@@ -847,13 +847,17 @@ if_vinput(struct ifnet *ifp, struct mbuf
m->m_pkthdr.ph_ifidx = ifp->if_index;
m->m_pkthdr.ph_rtableid = ifp->if_rdomain;
 
+#if NPF > 0
+   pf_pkt_addr_changed(m);
+#endif
+
counters_pkt(ifp->if_counters,
ifc_ipackets, ifc_ibytes, m->m_pkthdr.len);
 
 #if NBPFILTER > 0
if_bpf = ifp->if_bpf;
if (if_bpf) {
-   if (bpf_mtap_ether(if_bpf, m, BPF_DIRECTION_IN)) {
+   if (ifp->if_bpf_mtap(if_bpf, m, BPF_DIRECTION_IN)) {
m_freem(m);
return;
}
@@ -1497,6 +1501,42 @@ p2p_rtrequest(struct ifnet *ifp, int req
}
 }
 
+int
+p2p_bpf_mtap(caddr_t if_bpf, const struct mbuf *m, u_int dir)
+{
+#if NBPFILTER > 0
+   return (bpf_mtap_af(if_bpf, m->m_pkthdr.ph_family, m, dir));
+#else
+   return (0);
+#endif
+}
+
+void
+p2p_input(struct ifnet *ifp, struct mbuf *m)
+{
+   void (*input)(struct ifnet *, struct mbuf *);
+
+   switch (m->m_pkthdr.ph_family) {
+   case AF_INET:
+   input = ipv4_input;
+   break;
+#ifdef INET6
+   case AF_INET6:
+   input = ipv6_input;
+   break;
+#endif
+#ifdef MPLS
+   case AF_MPLS:
+   input = mpls_input;
+   break;
+#endif
+   default:
+   m_freem(m);
+   return;
+   }
+
+   (*input)(ifp, m);
+}
 
 /*
  * Bring down all interfaces
Index: net/if_ethersubr.c
===
RCS file: /cvs/src/sys/net/if_ethersubr.c,v
retrieving revision 1.268
diff -u -p -r1.268 if_ethersubr.c
--- net/if_ethersubr.c  4 Jan 2021 21:21:41 -   1.268
+++ net/if_ethersubr.c  14 Feb 2021 12:12:21 -
@@ -680,7 +680,9 @@

Re: Unbound: add support for pf tables to ipset module

2021-02-07 Thread David Gwynne
On Sun, Feb 07, 2021 at 06:27:24PM +0100, Christopher Zimmermann wrote:
> On Sun, Feb 07, 2021 at 04:20:26PM +, Stuart Henderson wrote:
> > On 2021/02/07 17:04, Christopher Zimmermann wrote:
> > > Hi,
> > > 
> > > a year ago I added support for our pf tables to the unbound ipset module.
> > > Upstream does not seem eager to merge it:
> > > https://github.com/NLnetLabs/unbound/pull/144
> > > 
> > > Implementing pf tables support was pretty straightforward. It has been 
> > > more
> > > work to adjust the module's privilege management to allow the modules to open
> > > privileged files like /dev/pf and keep them open across reloads.
> > > This is also what upstream was unsure about.
> > > 
> > > So below you find the diff against our base unbound.
> > > 
> > > Should this go in? Continue to wait for upstream?
> > > Suggestions for improvement?
> > 
> > I would not be happy about including this in base unbound. Partly
> > because it is a large diff to carry, partly unbound is a much more
> > complex process than I'd be happy with having direct access to
> > reconfigure PF.
> > 
> > The whole approach (including for linux ipset) doesn't seem ideal to
> > me. It would seem much better to have this done out-of-process with a
> > communication mechanism to allow sending the addresses across, then
> > unbound wouldn't need firewall-specific knowledge in the code, and
> > there's a clear separation of privilege.
> 
> Hi Stuart and Florian,
> 
> thanks for giving a thought. I agree to your analysis.
> It seems to me a more sane approach would be to change / create an "ipset"
> module that doesn't talk a specific ipset / pf protocol, but simply dumps
> raw ipv4/ipv6 addresses in a file / fifo / pipe, which can then be consumed
> by a thin privileged translator process that adds those ips to pf / ipset /
> whatever. This would also get rid of the privilege management issues I
> encountered.

having unbound fiddle with pf tables itself also means that you'd have
to run unbound on the firewall actually moving the packets around. this
is fine in a small environment, but as soon as you add a redundant
firewall you will need to sync tables between the firewalls and then
unbound gets even more complicated. then you'll want to run the
resolvers on separate boxes to the firewalls and then it gets even
more complicated again.

unbound has support for dnstap (https://dnstap.info/), which basically
sends a copy of the dns messages it handles along with some metadata
to a listener. a dnstap receiver that wants to do something interesting
with those messages has to be able to parse dns packets. an example
of something interesting, mapping domain names to ips and
distributing them to firewalls, is
https://github.com/blind-oracle/dnstap-bgp.

however, dnstap is not enabled in base unbound because it has some
dependencies we don't want in base. im working on a simpler message
encapsulation and transport that is dumb enough to not need extra
depends in base, but it is barely more than a prototype at this
stage.

> Now I'm just wondering whether the filtering feature of ipset should remain
> in unbound or be moved to another process. I would tend to keep it within
> unbound, since parsing queries is what it is built to do after all.

if you follow the dnstap example, the other process does the filtering
and downstream stuff. i think it would be Good Enough(tm).
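
to make that concrete, the privileged side could be as dumb as this
sketch (the table name "resolved" and the one-address-per-line input
format are made up for illustration, and the table is assumed to
already exist, eg via "table <resolved> persist" in pf.conf):

	#include <sys/types.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <net/if.h>
	#include <netinet/in.h>
	#include <net/pfvar.h>
	#include <arpa/inet.h>
	#include <err.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>

	int
	main(void)
	{
		struct pfioc_table io;
		struct pfr_addr addr;
		char line[128];
		int dev;

		if ((dev = open("/dev/pf", O_RDWR)) == -1)
			err(1, "/dev/pf");

		/* read one ipv4 address per line and push it into pf */
		while (fgets(line, sizeof(line), stdin) != NULL) {
			line[strcspn(line, "\n")] = '\0';

			memset(&addr, 0, sizeof(addr));
			if (inet_pton(AF_INET, line, &addr.pfra_ip4addr) != 1)
				continue;
			addr.pfra_af = AF_INET;
			addr.pfra_net = 32;

			memset(&io, 0, sizeof(io));
			strlcpy(io.pfrio_table.pfrt_name, "resolved",
			    sizeof(io.pfrio_table.pfrt_name));
			io.pfrio_buffer = &addr;
			io.pfrio_esize = sizeof(addr);
			io.pfrio_size = 1;
			if (ioctl(dev, DIOCRADDADDRS, &io) == -1)
				warn("DIOCRADDADDRS");
		}

		return (0);
	}

the unprivileged dns side just writes addresses to the pipe and never
needs to know pf exists.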

dlg



Re: ifg_refcnt atomic operation

2021-02-05 Thread David Gwynne
refcnt_init starts counting at 1, while the existing code starts at 0. Do
the crashes stop because we never fully release all the references and
never free it now?
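
to spell the concern out, assuming refcnt_init(9) semantics (the
snippet below is illustrative, not from the diff):

	struct refcnt r;

	refcnt_init(&r);	/* count is 1: an implicit reference */
	refcnt_take(&r);	/* count is 2: eg the first if_addgroup */
	refcnt_rele(&r);	/* count is 1: the matching if_delgroup */
	/* refcnt_rele returned 0, so the group is never freed */

the old code went 0 -> 1 -> 0 over the same pair of calls and freed
the group on the way back to 0.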

On Sat, 6 Feb 2021, 10:55 Alexander Bluhm,  wrote:

> Hi,
>
> When I replace the ++ and -- of ifg_refcnt with an atomic operation,
> it fixes this syzkaller panic.
>
>
> https://syzkaller.appspot.com/bug?id=54e16dc5bce6929e14b42e2f1379f1c18f62be43
>
> Without the fix "syz-execprog -repeat=0 -procs=8 repro-pfi.syz"
> crashes my vmm in a few seconds.  With the diff I cannot reproduce
> for several minutes.
>
> ok?
>
> bluhm
>
> Index: net/if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.626
> diff -u -p -r1.626 if.c
> --- net/if.c	1 Feb 2021 07:43:33 -0000	1.626
> +++ net/if.c	6 Feb 2021 00:37:50 -0000
> @@ -2601,7 +2601,7 @@ if_creategroup(const char *groupname)
> return (NULL);
>
> strlcpy(ifg->ifg_group, groupname, sizeof(ifg->ifg_group));
> -   ifg->ifg_refcnt = 0;
> +	refcnt_init(&ifg->ifg_refcnt);
> 	ifg->ifg_carp_demoted = 0;
> 	TAILQ_INIT(&ifg->ifg_members);
>  #if NPF > 0
> @@ -2648,7 +2648,7 @@ if_addgroup(struct ifnet *ifp, const cha
> return (ENOMEM);
> }
>
> -   ifg->ifg_refcnt++;
> +	refcnt_take(&ifg->ifg_refcnt);
> ifgl->ifgl_group = ifg;
> ifgm->ifgm_ifp = ifp;
>
> @@ -2692,7 +2692,7 @@ if_delgroup(struct ifnet *ifp, const cha
> pfi_group_change(groupname);
>  #endif
>
> -   if (--ifgl->ifgl_group->ifg_refcnt == 0) {
> +	if (refcnt_rele(&ifgl->ifgl_group->ifg_refcnt)) {
> 		TAILQ_REMOVE(&ifg_head, ifgl->ifgl_group, ifg_next);
>  #if NPF > 0
> pfi_detach_ifgroup(ifgl->ifgl_group);
> Index: net/if_var.h
> ===
> RCS file: /cvs/src/sys/net/if_var.h,v
> retrieving revision 1.112
> diff -u -p -r1.112 if_var.h
> --- net/if_var.h	29 Jul 2020 12:09:31 -0000	1.112
> +++ net/if_var.h	6 Feb 2021 00:38:23 -0000
> @@ -263,7 +263,7 @@ struct ifmaddr {
>
>  struct ifg_group {
> char ifg_group[IFNAMSIZ];
> -	u_int		 ifg_refcnt;
> +	struct refcnt	 ifg_refcnt;
> caddr_t  ifg_pf_kif;
> int  ifg_carp_demoted;
> TAILQ_HEAD(, ifg_member) ifg_members;
> Index: netinet/ip_carp.c
> ===
> RCS file: /cvs/src/sys/netinet/ip_carp.c,v
> retrieving revision 1.351
> diff -u -p -r1.351 ip_carp.c
> --- netinet/ip_carp.c   21 Jan 2021 13:18:07 -  1.351
> +++ netinet/ip_carp.c   6 Feb 2021 00:39:14 -
> @@ -789,7 +789,7 @@ carpattach(int n)
> struct ifg_group*ifg;
>
> if ((ifg = if_creategroup("carp")) != NULL)
> -   ifg->ifg_refcnt++;  /* keep around even if empty */
> +	refcnt_take(&ifg->ifg_refcnt);	/* keep around even if empty */
> 	if_clone_attach(&carp_cloner);
> carpcounters = counters_alloc(carps_ncounters);
>  }
>
>


Re: have pf_route bail out if it resolves a route with RTF_LOCAL set

2021-02-04 Thread David Gwynne
On Fri, Jan 29, 2021 at 03:23:31PM +0100, Alexander Bluhm wrote:
> On Fri, Jan 29, 2021 at 10:53:09AM +1000, David Gwynne wrote:
> > > Are you sure that it does not break any use case?  I have seen so
> > > much strange stuff.  What is the advantage?
> >
> > The current behaviour is lucky at best, and quirky at worst. Usually I
> > would agree with you that breaking stuff isn't great, even if it's
> > wrong, but while I'm changing how route-to etc works I think it's
> > a good chance to clean up some of these edge cases.
> 
> I have been developping products based on pf edge cases for 15
> years.  I don't know which dragons are in our codebase.  This should
> not prevent improvements in OpenBSD.  I am just asking not to remove
> anything just because we currently don't know, how it can be used.

i understand.

> Changing syntax like address@interface can easily be adapted.  Slight
> semantic changes may cause debugging sessions on productive customer
> systems.  And then we might need a quick new solution for a previously
> existing feature.  So please be careful.

the regress tests i just updated made it clear that using route-to with
loopback interfaces was not supposed to work. i think blacklisting
RTF_LOCAL routes is in keeping with that idea, and would go a bit
further and drop packets going out IFF_LOOPBACK interfaces too. i can't
think of a good edge case that this would break.
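
as an example, a contrived rule like this (addresses hypothetical)
currently resolves to an RTF_LOCAL route and gets lucky:

	pass in on em0 from any to 192.0.2.1 route-to 127.0.0.1

with the diff below the packet is dropped and an icmp unreachable is
sent instead of the local route being handed to if_output.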

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1107
diff -u -p -r1.1107 pf.c
--- pf.c	3 Feb 2021 07:41:12 -0000	1.1107
+++ pf.c	4 Feb 2021 22:43:11 -0000
@@ -6015,7 +6015,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
rtableid = m0->m_pkthdr.ph_rtableid;
 
rt = rtalloc(sintosa(dst), RT_RESOLVE, rtableid);
-   if (!rtisvalid(rt)) {
+   if (!rtisvalid(rt) || ISSET(rt->rt_flags, RTF_LOCAL)) {
if (s->rt != PF_DUPTO) {
pf_send_icmp(m0, ICMP_UNREACH, ICMP_UNREACH_HOST,
0, pd->af, s->rule.ptr, pd->rdomain);
@@ -6025,7 +6025,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
}
 
ifp = if_get(rt->rt_ifidx);
-   if (ifp == NULL)
+   if (ifp == NULL || ISSET(ifp->if_flags, IFF_LOOPBACK))
goto bad;
 
/* A locally generated packet may have invalid source address. */
@@ -6159,7 +6159,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
if (IN6_IS_SCOPE_EMBED(&dst->sin6_addr))
dst->sin6_addr.s6_addr16[1] = htons(ifp->if_index);
rt = rtalloc(sin6tosa(dst), RT_RESOLVE, rtableid);
-   if (!rtisvalid(rt)) {
+   if (!rtisvalid(rt) || ISSET(rt->rt_flags, RTF_LOCAL)) {
if (s->rt != PF_DUPTO) {
pf_send_icmp(m0, ICMP6_DST_UNREACH,
ICMP6_DST_UNREACH_NOROUTE, 0,
@@ -6170,7 +6170,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
}
 
ifp = if_get(rt->rt_ifidx);
-   if (ifp == NULL)
+   if (ifp == NULL || ISSET(ifp->if_flags, IFF_LOOPBACK))
goto bad;
 
/* A locally generated packet may have invalid source address. */

> 



Re: [External] : pf route-to: only run pf_test when packets enter and leave the stack

2021-02-02 Thread David Gwynne
On Tue, Feb 02, 2021 at 11:30:12AM +0100, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> On Tue, Feb 02, 2021 at 02:52:52PM +1000, David Gwynne wrote:
> > 
> > however, like most things relating to route-to/reply-to/dup-to, im
> > pretty sure at this point it's not used a lot, so the impact is minimal.
> > a lot of changes in this space have already been made, so adding another
> > simplification is justifiable. if this does remove functionality that
> > people need, i believe sashan@ has agreed to help me implement route-to
> > on match rules to give more flexibility and composability of rules.
> > 
> 
> as David says my concern is single corner case, which combines
> NAT with route-to action. I think the escape plan for people,
> who combine route-to with nat-to, is already there. If someone
> has rule as follows:
> 
>   pass in on em0 from v.x.y.z/n to a.b.c.d/m \
>   route-to o.p.q.r@em2 nat-to(em2)

You can nat-to and route-to on the same rule, so this should still
work if all you do is drop the @em2:

pass in on em0 from v.x.y.z/n to a.b.c.d/m \
route-to o.p.q.r nat-to (em2)

> then this needs to be converted to two rules:
> 
>   match in on em0 from v.x.y.z/n to a.b.c.d/m nat-to(em2)
>   pass in on em0 from v.x.y.z/n to a.b.c.d/m route-to o.p.q.r
> 
> I have not tried that yet. However I think this should work. If it does
> not work, then I'll try to fix it.

I thought the problem was for rules like this:

pass out on em1 from v.x.y.z/n to a.b.c.d/m \
route-to o.p.q.r@em2
pass out on em2 nat-to (em2)

Only one pass out rule will win if I commit this, because the packet
will only go through the ruleset when it leaves the stack, not every
time the interface changes. If we can do match route-to rules, we could
do the following:

match out on em1 from v.x.y.z/n to a.b.c.d/m \
route-to o.p.q.r # o.p.q.r is reachable via em2
pass out on em2 nat-to (em2) 

> > i've canvassed a few people, and their responses have varied from "i
> > don't care, route-to is the worst" to "i thought we did option 2
> > anyway". anyone else want to chime in?
> > 
> > this keeps the behaviour where route-to on a packet coming into the
> > stack is pushed past it and immediately forwarded to the output
> > interface. the condition for that is greatly simplified now though.
> > 
> > ok?
> 
> given there is an escape plan, I'm fine with the change.

Thank you.



pf route-to: only run pf_test when packets enter and leave the stack

2021-02-01 Thread David Gwynne
this is part of a high level discussion about when pf runs against a
packet. the options are:

1. pf runs when a packet goes over an interface
or
2. pf runs when a packet enters or leaves the network stack.

for normal packet handling there isn't a difference between these
options. in the routing case a packet comes in on an interface, pf tests
it, then the stack processes it and decides to send it out another
interface, pf tests it again on the way out, the packet goes on the
wire. for packets handled by the local system, a packet comes in on an
interface, pf tests it, the stack processes it locally, something
generates a reply, the stack decides to route that out an interface, pf
tests it on the way out, the reply packet ends up on the wire.

in both situations, you get the same sequence of events whether you
think of pf as running when a packet goes over an interface or as
running when a packet enters or leaves the stack.

however, there is a difference if route-to gets involved. if route-to is
applied on an outbound rule/state, it could change which interface the
packet should be going over.

currently the code implements option 1. this means that if route-to
changes the interface, it reruns pf test for the packet going over the
new interface. i would like to change it to option 2.

the main reason i want to change it is that option 1 creates confusion
for the state table. by default, pf states are floating, meaning that
packets are matched to states regardless of which interface they're
going over. if a packet leaving on em0 is rerouted out em1, both
traversals will end up using the same state, which at best will make the
accounting look weird, or at worst fail some checks in the state and get
dropped.

another reason i want to change this is to make it consistent with
other changes that are made to the packet. eg, when nat is applied to
a packet, we don't run pf_test again with the new addresses.

the downside to this change is that the pf_test rerun may have been used
to do things like push a packet out another interface with the first run
through pf, and pick up a broad "nat all packets leaving this interface"
rule on the second one.
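
eg, a ruleset like this (addresses hypothetical) relies on the rerun:

	pass in on em0 route-to 203.0.113.1	# next hop is out em1
	pass out on em1 nat-to (em1)		# only hit on the rerun

with this change the second rule is never evaluated for the rerouted
packet, so the nat-to has to move onto the rule doing the route-to.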

however, like most things relating to route-to/reply-to/dup-to, im
pretty sure at this point it's not used a lot, so the impact is minimal.
a lot of changes in this space have already been made, so adding another
simplification is justifiable. if this does remove functionality that
people need, i believe sashan@ has agreed to help me implement route-to
on match rules to give more flexibility and composability of rules.

i've canvassed a few people, and their responses have varied from "i
don't care, route-to is the worst" to "i thought we did option 2
anyway". anyone else want to chime in?

this keeps the behaviour where route-to on a packet coming into the
stack is pushed past it and immediately forwarded to the output
interface. the condition for that is greatly simplified now though.

ok?

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1106
diff -u -p -r1.1106 pf.c
--- pf.c	1 Feb 2021 00:31:05 -0000	1.1106
+++ pf.c	2 Feb 2021 03:44:51 -0000
@@ -6033,7 +6033,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
(ifp->if_flags & IFF_LOOPBACK) == 0)
ip->ip_src = ifatoia(rt->rt_ifa)->ia_addr.sin_addr;
 
-   if (s->rt != PF_DUPTO && pd->kif->pfik_ifp != ifp) {
+   if (s->rt != PF_DUPTO && pd->dir == PF_IN) {
if (pf_test(AF_INET, PF_OUT, ifp, &m0) != PF_PASS)
goto bad;
else if (m0 == NULL)
@@ -6178,7 +6178,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
(ifp->if_flags & IFF_LOOPBACK) == 0)
ip6->ip6_src = ifatoia6(rt->rt_ifa)->ia_addr.sin6_addr;
 
-   if (s->rt != PF_DUPTO && pd->kif->pfik_ifp != ifp) {
+   if (s->rt != PF_DUPTO && pd->dir == PF_IN) {
if (pf_test(AF_INET6, PF_OUT, ifp, &m0) != PF_PASS)
goto bad;
else if (m0 == NULL)




Re: have pf_route bail out if it resolves a route with RTF_LOCAL set

2021-01-28 Thread David Gwynne
On Thu, Jan 28, 2021 at 08:09:36PM +0100, Alexander Bluhm wrote:
> On Thu, Jan 28, 2021 at 09:57:33AM +1000, David Gwynne wrote:
> > calling if_output with a route to a local IP is confusing, and I'm not
> > sure it makes sense anyway.
> >
> > this treats an RTF_LOCAL route like an invalid route and drops the
> > packet.
> >
> > ok?
> 
> Are you sure that it does not break any use case?  I have seen so
> much strange stuff.  What is the advantage?

The current behaviour is lucky at best, and quirky at worst. Usually I
would agree with you that breaking stuff isn't great, even if it's
wrong, but while I'm changing how route-to etc works I think it's
a good chance to clean up some of these edge cases.

> 
> bluhm
> 
> > Index: pf.c
> > ===
> > RCS file: /cvs/src/sys/net/pf.c,v
> > retrieving revision 1.1104
> > diff -u -p -r1.1104 pf.c
> > --- pf.c	27 Jan 2021 23:53:35 -0000	1.1104
> > +++ pf.c	27 Jan 2021 23:55:49 -0000
> > @@ -6054,7 +6054,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
> > }
> >
> > rt = rtalloc(sintosa(dst), RT_RESOLVE, rtableid);
> > -   if (!rtisvalid(rt)) {
> > +   if (!rtisvalid(rt) || ISSET(rt->rt_flags, RTF_LOCAL)) {
> > if (r->rt != PF_DUPTO) {
> > pf_send_icmp(m0, ICMP_UNREACH, ICMP_UNREACH_HOST,
> > 0, pd->af, s->rule.ptr, pd->rdomain);
> > @@ -6213,7 +6213,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
> > if (IN6_IS_SCOPE_EMBED(&dst->sin6_addr))
> > dst->sin6_addr.s6_addr16[1] = htons(ifp->if_index);
> > rt = rtalloc(sin6tosa(dst), RT_RESOLVE, rtableid);
> > -   if (!rtisvalid(rt)) {
> > +   if (!rtisvalid(rt) || ISSET(rt->rt_flags, RTF_LOCAL)) {
> > if (r->rt != PF_DUPTO) {
> > pf_send_icmp(m0, ICMP6_DST_UNREACH,
> > ICMP6_DST_UNREACH_NOROUTE, 0,



pf: route-to IPs, not interfaces

2021-01-28 Thread David Gwynne
this is the diff from the "pf route-to issues" thread, but on its own.

the summary of why i wanted to do this is:

- route-to, reply-to, and dup-to do not work with pfsync

  this is because the information about where to route-to is stored in
  rules, and it is hard to have a ruleset synced 100% between firewalls.

- i can make my boxes panic when i try to use it in certain situations

  yeah...

- the configuration and syntax for route-to rules are confusing.

  the argument to route-to and co is an interace name with an optional
  ip address. there are several problems with this. one is that people
  tend to think about routing as sending packets to peers by their
  address, not by the interface they're reachable on. another is that
  we currently have no way to synchronise interface topology information
  between firewalls, so using an interface to say where packets go
  means we can't do failover of these states with pfsync. another
  is that a change in routing topology means a host may become
  reachable over a different interface. tying routing policy to
  interfaces gets in the way of failover and load balancing.

this change does the following:

- stores the route info in the state instead of the pf rule

  this allows route-to to keep working when the ruleset changes, and
  allows route-to info to be sent over pfsync. there's enough spare bits
  in pfsync messages that the protocol doesnt break.

  the caveat is that route-to becomes tied to pass rules that create
  state, like rdr-to and nat-to.

- the argument to route-to etc is a destination ip address

  it's not limited to a next-hop address (though a next-hop can be a
  destination address). this allows for the failover and load balancing
  referred to above.

- deprecates the address@interface host syntax in pfctl

  because routing is done entirely by IPs, the interface is derived from
  the route lookup, not pf.
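
  in pf.conf terms (addresses hypothetical):

	# deprecated: the rule pins the interface
	pass in on em0 route-to 203.0.113.1@em1
	# new: the interface falls out of the route lookup
	pass in on em0 route-to 203.0.113.1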

this change does not affect some other stuff discussed in the thread:

- it keeps the current semantic where when route-to changes which
  interface the packet is travelling over, it runs pf_test again.

  that's a separate change for broader discussion.

id like to thank sashan@, bluhm@, and sthen@ for working through this
stuff with me. i've got a lot out of it so far.

ok?

Index: sbin/pfctl/parse.y
===
RCS file: /cvs/src/sbin/pfctl/parse.y,v
retrieving revision 1.708
diff -u -p -r1.708 parse.y
--- sbin/pfctl/parse.y  12 Jan 2021 00:10:34 -  1.708
+++ sbin/pfctl/parse.y  28 Jan 2021 11:45:58 -
@@ -276,6 +276,7 @@ struct filter_opts {
struct redirspec nat;
struct redirspec rdr;
struct redirspec rroute;
+   u_int8_t rt;
 
/* scrub opts */
int  nodf;
@@ -284,15 +285,6 @@ struct filter_opts {
int  randomid;
int  max_mss;
 
-   /* route opts */
-   struct {
-   struct node_host*host;
-   u_int8_t rt;
-   u_int8_t pool_opts;
-   sa_family_t  af;
-   struct pf_poolhashkey   *key;
-   }route;
-
struct {
u_int32_t   limit;
u_int32_t   seconds;
@@ -372,7 +364,7 @@ void expand_label(char *, size_t, cons
struct node_port *, u_int8_t);
 int expand_divertspec(struct pf_rule *, struct divertspec *);
 int collapse_redirspec(struct pf_pool *, struct pf_rule *,
-   struct redirspec *rs, u_int8_t);
+   struct redirspec *rs, int);
 int apply_redirspec(struct pf_pool *, struct pf_rule *,
struct redirspec *, int, struct node_port *);
void	expand_rule(struct pf_rule *, int, struct node_if *,
@@ -518,7 +510,6 @@ int parseport(char *, struct range *r, i
 %type	<v.host>	ipspec xhost host dynaddr host_list
 %type	<v.host>	table_host_list tablespec
 %type	<v.host>	redir_host_list redirspec
-%type	<v.host>	route_host route_host_list routespec
 %type	<v.os>		os xos os_list
 %type	<v.port>	portspec port_list port_item
 %type	<v.uid>		uids uid_list uid_item
@@ -975,7 +966,7 @@ anchorrule  : ANCHOR anchorname dir quick
YYERROR;
}
 
-   if ($9.route.rt) {
+   if ($9.rt) {
yyerror("cannot specify route handling "
"on anchors");
YYERROR;
@@ -1843,37 +1834,13 @@ pfrule  : action dir logquick interface 
decide_address_family($7.src.host, );
decide_address_family($7.dst.host, );
 
-   

handle PFRULE_ONCE before pfsync may defer tx of the packet

2021-01-27 Thread David Gwynne
i think these code chunks are around the wrong way.

pfsync may want to defer the transmission of a packet. it does this so
it can try and get a state over to a peer firewall before a host may
send a reply to the peer, which would get dropped cos there's no
matching state.

i think the once rule processing should happen before that. the state
is created from the rule, whether the packet the state is for goes out
immediately or not shouldn't matter.

ok?

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1104
diff -u -p -U8 -r1.1104 pf.c
--- pf.c	27 Jan 2021 23:53:35 -0000	1.1104
+++ pf.c	28 Jan 2021 01:43:22 -0000
@@ -3932,45 +3932,45 @@ pf_test_rule(struct pf_pdesc *pd, struct
}
}
 
/* copy back packet headers if needed */
if (rewrite && pd->hdrlen) {
m_copyback(pd->m, pd->off, pd->hdrlen, &pd->hdr, M_NOWAIT);
}
 
-#if NPFSYNC > 0
-   if (*sm != NULL && !ISSET((*sm)->state_flags, PFSTATE_NOSYNC) &&
-   pd->dir == PF_OUT && pfsync_up()) {
-   /*
-* We want the state created, but we dont
-* want to send this in case a partner
-* firewall has to know about it to allow
-* replies through it.
-*/
-   if (pfsync_defer(*sm, pd->m))
-   return (PF_DEFER);
-   }
-#endif /* NPFSYNC > 0 */
-
if (r->rule_flag & PFRULE_ONCE) {
u_int32_t   rule_flag;
 
/*
 * Use atomic_cas() to determine a clear winner, which will
 * insert an expired rule to gcl.
 */
rule_flag = r->rule_flag;
if (((rule_flag & PFRULE_EXPIRED) == 0) &&
atomic_cas_uint(&r->rule_flag, rule_flag,
rule_flag | PFRULE_EXPIRED) == rule_flag) {
r->exptime = gettime();
SLIST_INSERT_HEAD(&pf_rule_gcl, r, gcle);
}
}
+
+#if NPFSYNC > 0
+   if (*sm != NULL && !ISSET((*sm)->state_flags, PFSTATE_NOSYNC) &&
+   pd->dir == PF_OUT && pfsync_up()) {
+   /*
+* We want the state created, but we dont
+* want to send this in case a partner
+* firewall has to know about it to allow
+* replies through it.
+*/
+   if (pfsync_defer(*sm, pd->m))
+   return (PF_DEFER);
+   }
+#endif /* NPFSYNC > 0 */
 
return (action);
 
 cleanup:
while ((ctx.ri = SLIST_FIRST(&rules))) {
SLIST_REMOVE_HEAD(&rules, entry);
pool_put(&pf_rule_item_pl, ctx.ri);
}



have pf_route bail out if it resolves a route with RTF_LOCAL set

2021-01-27 Thread David Gwynne
calling if_output with a route to a local IP is confusing, and I'm not
sure it makes sense anyway.

this treats an RTF_LOCAL route like an invalid route and drops the
packet.

ok?

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1104
diff -u -p -r1.1104 pf.c
--- pf.c	27 Jan 2021 23:53:35 -0000	1.1104
+++ pf.c	27 Jan 2021 23:55:49 -0000
@@ -6054,7 +6054,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
}
 
rt = rtalloc(sintosa(dst), RT_RESOLVE, rtableid);
-   if (!rtisvalid(rt)) {
+   if (!rtisvalid(rt) || ISSET(rt->rt_flags, RTF_LOCAL)) {
if (r->rt != PF_DUPTO) {
pf_send_icmp(m0, ICMP_UNREACH, ICMP_UNREACH_HOST,
0, pd->af, s->rule.ptr, pd->rdomain);
@@ -6213,7 +6213,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
if (IN6_IS_SCOPE_EMBED(&dst->sin6_addr))
dst->sin6_addr.s6_addr16[1] = htons(ifp->if_index);
rt = rtalloc(sin6tosa(dst), RT_RESOLVE, rtableid);
-   if (!rtisvalid(rt)) {
+   if (!rtisvalid(rt) || ISSET(rt->rt_flags, RTF_LOCAL)) {
if (r->rt != PF_DUPTO) {
pf_send_icmp(m0, ICMP6_DST_UNREACH,
ICMP6_DST_UNREACH_NOROUTE, 0,



"monitoring only" interfaces

2021-01-26 Thread David Gwynne
some of the discussion around dup-to made me think that a diff we
have here at work might be more broadly useful.

we run a box here with a bunch of ethernet ports plugged into span
ports on switches. basically every packet going to our firewalls gets
duplicated to this host. we then have code that generates flow data from
these ports. it's also nice to have one place to ssh to and so you can
tcpdump things. anyway, that flow collector watches packets on those
interfaces via bpf, but apart from that we don't actually want to
do anythign with the packets those interfaces receive. we especially
do not want them entering the stack. we ssh to this box over the
firewall, so if the span port copies those packets to the box and
the stack tries to process them, things dont work great.

we could enable the fildrop stuff with bpf, but there's an annoying gap
between when the interfaces come up and when the flow collector starts
running. also, if the flow collector crashes or we restart it cos we're
hacking on the code, this provides more gaps for packets to enter the
stack.

we prevented this by adding a "monitor" interface flag. it makes the
interface input code drop all the packets rather than queuing them for
the stack to process.
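
with this diff, using it would look something like:

	# ifconfig em1 up monitor
	# tcpdump -ni em1	# bpf still sees all the frames
	# ifconfig em1 -monitor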

is there any interest in having this in the tree?

if so, i need to do some work to make sure all interfaces push
packets into the stack with if_input, ifiq_input, or if_vinput. a
bunch of them like gif and gre currently call protocol input routines
directly, so they skip this check.

so, thoughts?

Index: sbin/ifconfig/ifconfig.c
===
RCS file: /cvs/src/sbin/ifconfig/ifconfig.c,v
retrieving revision 1.432
diff -u -p -r1.432 ifconfig.c
--- sbin/ifconfig/ifconfig.c	16 Jan 2021 17:44:29 -0000	1.432
+++ sbin/ifconfig/ifconfig.c	27 Jan 2021 06:57:37 -0000
@@ -469,6 +469,8 @@ const struct cmd {
{ "soii",   -IFXF_INET6_NOSOII, 0,  setifxflags },
{ "-soii",  IFXF_INET6_NOSOII,  0,  setifxflags },
 #ifndef SMALL
+   { "monitor",IFXF_MONITOR,   0,  setifxflags },
+   { "-monitor",   -IFXF_MONITOR,  0,  setifxflags },
{ "hwfeatures", NEXTARG0,   0,  printifhwfeatures },
{ "metric", NEXTARG,0,  setifmetric },
{ "powersave",  NEXTARG0,   0,  setifpowersave },
@@ -675,7 +677,7 @@ const struct cmd {
"\7RUNNING\10NOARP\11PROMISC\12ALLMULTI\13OACTIVE\14SIMPLEX"\
"\15LINK0\16LINK1\17LINK2\20MULTICAST"  \
"\23INET6_NOPRIVACY\24MPLS\25WOL\26AUTOCONF6\27INET6_NOSOII"\
-   "\30AUTOCONF4"
+   "\30AUTOCONF4" "\32MONITOR"
 
int	getinfo(struct ifreq *, int);
 void   getsock(int);
Index: sys/net/if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.625
diff -u -p -r1.625 if.c
--- sys/net/if.c	18 Jan 2021 09:55:43 -0000	1.625
+++ sys/net/if.c	27 Jan 2021 06:57:37 -0000
@@ -860,7 +860,8 @@ if_vinput(struct ifnet *ifp, struct mbuf
}
 #endif
 
-   (*ifp->if_input)(ifp, m);
+   if (__predict_true(!ISSET(ifp->if_xflags, IFXF_MONITOR)))
+   (*ifp->if_input)(ifp, m);
 }
 
 void
Index: sys/net/if.h
===
RCS file: /cvs/src/sys/net/if.h,v
retrieving revision 1.205
diff -u -p -r1.205 if.h
--- sys/net/if.h	18 Jan 2021 09:55:43 -0000	1.205
+++ sys/net/if.h	27 Jan 2021 06:57:37 -0000
@@ -230,6 +230,7 @@ struct if_status_description {
 #define	IFXF_AUTOCONF6	0x20	/* [N] v6 autoconf enabled */
 #define IFXF_INET6_NOSOII	0x40	/* [N] don't do RFC 7217 */
 #define	IFXF_AUTOCONF4	0x80	/* [N] v4 autoconf (aka dhcp) enabled */
+#define	IFXF_MONITOR	0x200	/* [N] only used for bpf */
 
 #define	IFXF_CANTCHANGE \
 	(IFXF_MPSAFE|IFXF_CLONED)
Index: sys/net/ifq.c
===
RCS file: /cvs/src/sys/net/ifq.c,v
retrieving revision 1.41
diff -u -p -r1.41 ifq.c
--- sys/net/ifq.c   7 Jul 2020 00:00:03 -   1.41
+++ sys/net/ifq.c   27 Jan 2021 06:57:37 -
@@ -715,10 +715,12 @@ ifiq_input(struct ifiqueue *ifiq, struct
ifiq->ifiq_bytes += bytes;
 
 	len = ml_len(&ifiq->ifiq_ml);
-	if (len > ifiq_maxlen_drop)
-		ifiq->ifiq_qdrops += ml_len(ml);
-	else
-		ml_enlist(&ifiq->ifiq_ml, ml);
+	if (__predict_true(!ISSET(ifp->if_xflags, IFXF_MONITOR))) {
+		if (len > ifiq_maxlen_drop)
+			ifiq->ifiq_qdrops += ml_len(ml);
+		else
+			ml_enlist(&ifiq->ifiq_ml, ml);
+	}
 	mtx_leave(&ifiq->ifiq_mtx);
 
if (ml_empty(ml))




if pf_route{,6} route isn't valid, generate an icmp error

2021-01-26 Thread David Gwynne
at the moment if the route is invalid, we just drop the packet. this
diff makes us generate an icmp error as well.

ok?

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1103
diff -u -p -r1.1103 pf.c
--- pf.c	27 Jan 2021 04:46:21 -0000	1.1103
+++ pf.c	27 Jan 2021 06:38:12 -0000
@@ -6055,6 +6055,10 @@ pf_route(struct pf_pdesc *pd, struct pf_
 
rt = rtalloc(sintosa(dst), RT_RESOLVE, rtableid);
if (!rtisvalid(rt)) {
+   if (r->rt != PF_DUPTO) {
+   pf_send_icmp(m0, ICMP_UNREACH, ICMP_UNREACH_HOST,
+   0, pd->af, s->rule.ptr, pd->rdomain);
+   }
ipstat_inc(ips_noroute);
goto bad;
}
@@ -6210,6 +6214,11 @@ pf_route6(struct pf_pdesc *pd, struct pf
dst->sin6_addr.s6_addr16[1] = htons(ifp->if_index);
rt = rtalloc(sin6tosa(dst), RT_RESOLVE, rtableid);
if (!rtisvalid(rt)) {
+   if (r->rt != PF_DUPTO) {
+   pf_send_icmp(m0, ICMP6_DST_UNREACH,
+   ICMP6_DST_UNREACH_NOROUTE, 0,
+   pd->af, s->rule.ptr, pd->rdomain);
+   }
ip6stat_inc(ip6s_noroute);
goto bad;
}



don't run dup-to generated packets through pf_test in pf_route{,6}

2021-01-26 Thread David Gwynne
this was discussed as part of the big route-to issues thread. i think
it's easy to break out and handle separately now.

the diff does what the subject line says. it seems to work as expected
for me. i don't see weird state issues anymore when i dup my ssh session
out over a tunnel interface.

sasha suggested setting PF_TAG_GENERATED on the duplicated packet, but i
didn't set it in this diff. the reason is that i can't see
PF_TAG_GENERATED get cleared anywhere. this means that if you dup-to a
host over a tunnel (eg, gif, gre, etc), the encapsulated packet still
has that tag, which means pf doesn't run against the encapsulated
packet.

ok?

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1101
diff -u -p -r1.1101 pf.c
--- pf.c	19 Jan 2021 22:22:23 -0000	1.1101
+++ pf.c	27 Jan 2021 01:21:24 -0000
@@ -6039,7 +6041,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
if (ifp == NULL)
goto bad;
 
-   if (pd->kif->pfik_ifp != ifp) {
+   if (r->rt != PF_DUPTO && pd->kif->pfik_ifp != ifp) {
if (pf_test(AF_INET, PF_OUT, ifp, &m0) != PF_PASS)
goto bad;
else if (m0 == NULL)
@@ -6194,7 +6195,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
if (ifp == NULL)
goto bad;
 
-   if (pd->kif->pfik_ifp != ifp) {
+   if (r->rt != PF_DUPTO && pd->kif->pfik_ifp != ifp) {
if (pf_test(AF_INET6, PF_OUT, ifp, &m0) != PF_PASS)
goto bad;
else if (m0 == NULL)



Re: tiny pf_route{,6} tweak

2021-01-26 Thread David Gwynne
On Wed, Jan 27, 2021 at 11:13:12AM +1000, David Gwynne wrote:
> when pf_route (and pf_route6) are supposed to handle forwarding the
> packet (ie, for route-to or reply-to rules), they take the mbuf
> away from the calling code path. this is done by clearing the mbuf
> pointer in the pf_pdesc struct. it doesn't do this for dup-to rules
> though.
> 
> at the moment pf_route clears that pointer on the way out, but it could
> take the mbuf away up front in the same place that it already checks if
> it's a dup-to rule or not.
> 
> it's a small change. i've bumped up the number of lines of context so
> it's easier to read too.
> 
> ok?

sigh. here's the diff with the extra context.

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1101
diff -u -p -U8 -r1.1101 pf.c
--- pf.c	19 Jan 2021 22:22:23 -0000	1.1101
+++ pf.c	27 Jan 2021 01:10:52 -0000
@@ -5983,16 +5983,17 @@ pf_route(struct pf_pdesc *pd, struct pf_
 
if (r->rt == PF_DUPTO) {
if ((m0 = m_dup_pkt(pd->m, max_linkhdr, M_NOWAIT)) == NULL)
return;
} else {
if ((r->rt == PF_REPLYTO) == (r->direction == pd->dir))
return;
m0 = pd->m;
+   pd->m = NULL;
}
 
if (m0->m_len < sizeof(struct ip)) {
DPFPRINTF(LOG_ERR,
"%s: m0->m_len < sizeof(struct ip)", __func__);
goto bad;
}
 
@@ -6103,18 +6104,16 @@ pf_route(struct pf_pdesc *pd, struct pf_
else
m_freem(m0);
}
 
if (error == 0)
ipstat_inc(ips_fragmented);
 
 done:
-   if (r->rt != PF_DUPTO)
-   pd->m = NULL;
rtfree(rt);
return;
 
 bad:
m_freem(m0);
goto done;
 }
 
@@ -6141,16 +6140,17 @@ pf_route6(struct pf_pdesc *pd, struct pf
 
if (r->rt == PF_DUPTO) {
if ((m0 = m_dup_pkt(pd->m, max_linkhdr, M_NOWAIT)) == NULL)
return;
} else {
if ((r->rt == PF_REPLYTO) == (r->direction == pd->dir))
return;
m0 = pd->m;
+   pd->m = NULL;
}
 
if (m0->m_len < sizeof(struct ip6_hdr)) {
DPFPRINTF(LOG_ERR,
"%s: m0->m_len < sizeof(struct ip6_hdr)", __func__);
goto bad;
}
ip6 = mtod(m0, struct ip6_hdr *);
@@ -6232,18 +6232,16 @@ pf_route6(struct pf_pdesc *pd, struct pf
ip6stat_inc(ip6s_cantfrag);
if (r->rt != PF_DUPTO)
pf_send_icmp(m0, ICMP6_PACKET_TOO_BIG, 0,
ifp->if_mtu, pd->af, r, pd->rdomain);
goto bad;
}
 
 done:
-   if (r->rt != PF_DUPTO)
-   pd->m = NULL;
rtfree(rt);
return;
 
 bad:
m_freem(m0);
goto done;
 }
 #endif /* INET6 */


