On Fri, Jun 30, 2023 at 11:57:06AM +0200, Alexandr Nedvedicky wrote:
> Hello,
>
> I'm not familiar enough with relayd, so perhaps other folks
> here might provide better way to troubleshoot the issue.
>
> On Fri, Jun 30, 2023 at 11:10:44AM +0300, Kapetanakis Giannis wrote:
> > Hello,
> >
> > This happened to me twice.
> > OpenBSD 7.3 with syspatches.
> >
> > I have a pair of carp/pfsync/pf/relayd firewall-load balancers with many
> > redirects (only) on them.
> >
> > I wanted to do maintenance of some hosts below the load balancers.
> > After a while relayd crashed on Master firewall only.
>
> when you say crash: does it mean relayd was terminated
> by the system because of a memory/stack/program violation?
> if that is the case, is there any chance to collect a core file?
>
> or was it rather a voluntary exit, where relayd called its fatal() function?
>
> the 'No such file or directory' error, which is returned by the
> DIOCRGETTSTATS ioctl(), comes from line 1746 in sys/net/pf_table.c:
>
> 1731 int
> 1732 pfr_get_tstats(struct pfr_table *filter, struct pfr_tstats *tbl, int *size,
> 1733     int flags)
> 1734 {
> 1735 	struct pfr_ktable *p;
> 1736 	struct pfr_ktableworkq workq;
> 1737 	int n, nn;
> 1738 	time_t tzero = gettime();
> 1739
> 1740 	/* XXX PFR_FLAG_CLSTATS disabled */
> 1741 	ACCEPT_FLAGS(flags, PFR_FLAG_ALLRSETS);
> 1742 	if (pfr_fix_anchor(filter->pfrt_anchor))
> 1743 		return (EINVAL);
> 1744 	n = nn = pfr_table_count(filter, flags);
> 1745 	if (n < 0)
> 1746 		return (ENOENT);
>
>
> the pfr_table_count() function fails if and only if the desired ruleset
> does not exist.
>
> 2177 int
> 2178 pfr_table_count(struct pfr_table *filter, int flags)
> 2179 {
> 2180 	struct pf_ruleset *rs;
> 2181
> 2182 	if (flags & PFR_FLAG_ALLRSETS)
> 2183 		return (pfr_ktable_cnt);
> 2184 	if (filter->pfrt_anchor[0]) {
> 2185 		rs = pf_find_ruleset(filter->pfrt_anchor);
> 2186 		return ((rs != NULL) ? rs->tables : -1);
> 2187 	}
> 2188 	return (pf_main_ruleset.tables);
> 2189 }
>
> I wonder if it would help to adjust the fatal() call in relayd
> so that it also reports the table name and anchor it is trying to find.
> a diff which adjusts the call to fatal() is below.
>
> if you don't want to build the whole tree and prefer an in-place
> build, you will need to adjust CFLAGS and LDFLAGS. Something
> like this will be needed:
>
> cd /path/to/your/src/usr.sbin/relayd
> export CFLAGS='-I/path/to/your/src/sys -I/path/to/your/src/lib/libutil'
> export LDFLAGS='-L /path/to/your/src/lib/libutil'
> make
>
>
> </snip>
>
> >
> > same logs on Backup firewall so far, but after a minute or so:
> >
> > Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table stats: No such file or directory
> this is where I'd like to see which table relayd is trying
> to look up. Process 61766 then exits via `exit(1)`,
> called from the function fatal().
>
> > Jun 30 01:47:46 ll1 relayd[94434]: ca exiting, pid 94434
> > Jun 30 01:47:46 ll1 relayd[83189]: ca exiting, pid 83189
> > Jun 30 01:47:46 ll1 relayd[9023]: ca exiting, pid 9023
> > Jun 30 01:47:46 ll1 relayd[89820]: ca exiting, pid 89820
> > Jun 30 01:47:46 ll1 relayd[94676]: ca exiting, pid 94676
> > Jun 30 01:47:46 ll1 relayd[1820]: hce exiting, pid 1820
> > Jun 30 01:47:46 ll1 relayd[52103]: lost child: pid 61766 exited abnormally
> the parent relayd process noticed that the child called exit(1)
> because it could not find the table.
>
> once you are able to run the patched relayd, can you try to reproduce
> the issue?
>
> it would also help if you collected some additional data.
>
> pfctl -vsA > anchors-before
> # reproduce the issue, wait for relayd to exit/crash
> pfctl -vsA > anchors-after
>
> this data, together with the output from the adjusted call
> to fatal(), should help us better understand
> what's going on.
>
> thanks for your help
> regards
> sashan
>
> --------8<---------------8<---------------8<------------------8<--------
> diff --git a/usr.sbin/relayd/pfe_filter.c b/usr.sbin/relayd/pfe_filter.c
> index 347048ece56..e1ae050b768 100644
> --- a/usr.sbin/relayd/pfe_filter.c
> +++ b/usr.sbin/relayd/pfe_filter.c
> @@ -632,7 +632,8 @@ check_table(struct relayd *env, struct rdr *rdr, struct table *table)
>  		goto toolong;
>  
>  	if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
> -		fatal("%s: cannot get table stats", __func__);
> +		fatal("%s: cannot get table stats for %s@%s", __func__,
> +		    io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
>  
>  	return (tstats.pfrts_match);
>
I agree printing this info is useful.
OK claudio@ to improve the error message.
--
:wq Claudio