Hello,
I'm not familiar enough with relayd, so perhaps other folks
here can provide a better way to troubleshoot the issue.
On Fri, Jun 30, 2023 at 11:10:44AM +0300, Kapetanakis Giannis wrote:
> Hello,
>
> This happened to me twice.
> OpenBSD 7.3 with syspatches.
>
> I have a pair of carp/pfsync/pf/relayd firewall-load balancers with many
> redirects (only) on them.
>
> I wanted to do maintenance of some hosts below load balancers.
> After a while relayd crashed on Master firewall only.
When you say crash, do you mean relayd was terminated by the system
because of a memory/stack/program violation? If that is the case,
is there any chance to collect a core file? Or was it rather a
voluntary exit, where relayd called its fatal() function?
The 'No such file or directory' error (ENOENT) returned by the
DIOCRGETTSTATS ioctl() comes from line 1746 in sys/net/pf_table.c:
1731 int
1732 pfr_get_tstats(struct pfr_table *filter, struct pfr_tstats *tbl, int *size,
1733     int flags)
1734 {
1735 	struct pfr_ktable *p;
1736 	struct pfr_ktableworkq workq;
1737 	int n, nn;
1738 	time_t tzero = gettime();
1739 
1740 	/* XXX PFR_FLAG_CLSTATS disabled */
1741 	ACCEPT_FLAGS(flags, PFR_FLAG_ALLRSETS);
1742 	if (pfr_fix_anchor(filter->pfrt_anchor))
1743 		return (EINVAL);
1744 	n = nn = pfr_table_count(filter, flags);
1745 	if (n < 0)
1746 		return (ENOENT);
pfr_table_count() fails (returns -1) if and only if the requested
ruleset (anchor) does not exist:
2177 int
2178 pfr_table_count(struct pfr_table *filter, int flags)
2179 {
2180 	struct pf_ruleset *rs;
2181 
2182 	if (flags & PFR_FLAG_ALLRSETS)
2183 		return (pfr_ktable_cnt);
2184 	if (filter->pfrt_anchor[0]) {
2185 		rs = pf_find_ruleset(filter->pfrt_anchor);
2186 		return ((rs != NULL) ? rs->tables : -1);
2187 	}
2188 	return (pf_main_ruleset.tables);
2189 }
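To make the failure mode concrete, here is a small userspace model of
the two functions above. It is only a sketch, not kernel code:
model_ruleset, model_find_ruleset(), model_table_count() and
model_get_tstats() are hypothetical stand-ins for pf_find_ruleset(),
pfr_table_count() and pfr_get_tstats(), the PFR_FLAG_ALLRSETS branch is
left out, and the anchors "relayd/www" and "relayd/mail" are made-up
examples. It shows that an empty anchor always resolves to the main
ruleset, while a non-empty anchor that no longer exists is the only
path to ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct pf_ruleset: anchor path + table count. */
struct model_ruleset {
	const char	*anchor;	/* e.g. "relayd/www" */
	int		 tables;	/* tables defined in this ruleset */
};

/* Tables in the main ruleset (empty anchor); value is arbitrary. */
static const int model_main_tables = 2;

/* Anchors that currently exist; any other anchor is "gone". */
static const struct model_ruleset model_rulesets[] = {
	{ "relayd/www", 1 },
	{ "relayd/mail", 1 },
};

/* Models pf_find_ruleset(): NULL when the anchor does not exist. */
static const struct model_ruleset *
model_find_ruleset(const char *anchor)
{
	size_t i;

	for (i = 0; i < sizeof(model_rulesets) / sizeof(model_rulesets[0]); i++)
		if (strcmp(model_rulesets[i].anchor, anchor) == 0)
			return (&model_rulesets[i]);
	return (NULL);
}

/* Models pfr_table_count() without the PFR_FLAG_ALLRSETS branch. */
static int
model_table_count(const char *anchor)
{
	const struct model_ruleset *rs;

	assert(anchor != NULL);
	if (anchor[0] != '\0') {
		rs = model_find_ruleset(anchor);
		return ((rs != NULL) ? rs->tables : -1);
	}
	return (model_main_tables);
}

/* Models the early error path of pfr_get_tstats(). */
static int
model_get_tstats(const char *anchor, int *n)
{
	*n = model_table_count(anchor);
	if (*n < 0)
		return (ENOENT);	/* "No such file or directory" */
	return (0);
}
```

In this model, if the anchor relayd queries disappeared between two
checks, the lookup would fail exactly like the logged error.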
I wonder if it would help to adjust the fatal() call in relayd so that
it also reports the table name and anchor it is trying to find.
A diff which adjusts the call to fatal() is below.
If you don't want to build the whole tree and would rather do an in-place
build, you will need to adjust CFLAGS and LDFLAGS. Something like this
should work:
cd /path/to/your/src/usr.sbin/relayd
export CFLAGS='-I/path/to/your/src/sys -I/path/to/your/src/lib/libutil'
export LDFLAGS='-L /path/to/your/src/lib/libutil'
make
</snip>
>
> same logs on Backup firewall so far, but after a minute or so:
>
> Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table stats: No such file or directory
This is where I'd like to see which table relayd is trying
to look up. Process 61766 then exits by calling `exit(1)`
from within fatal().
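With the patch applied, the log line would also name the table and the
anchor. The sketch below only reproduces the message format from the
diff at the end of this mail. It assumes check_table() builds the anchor
as "relayd/" followed by the redirect name, which matches the
`anchor "relayd/*"` rule relayd.conf(5) asks for in pf.conf.
format_tstats_error(), SKETCH_ANCHOR_LEN and the names "www" and
"webhosts" are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Arbitrary size for this sketch; pf bounds anchors by PF_ANCHOR_MAXPATH. */
#define SKETCH_ANCHOR_LEN	1024

/*
 * Hypothetical helper: builds the anchor as "relayd/<redirect name>"
 * (assumed to match check_table()) and formats the message the patched
 * fatal() call would log. Returns -1 if anything would be truncated.
 */
static int
format_tstats_error(char *buf, size_t len, const char *func,
    const char *rdr_name, const char *table_name)
{
	char	anchor[SKETCH_ANCHOR_LEN];
	int	n;

	assert(buf != NULL && len > 0);
	n = snprintf(anchor, sizeof(anchor), "relayd/%s", rdr_name);
	if (n < 0 || (size_t)n >= sizeof(anchor))
		return (-1);	/* check_table() appears to "goto toolong" */

	/* Same format string as in the diff: table name, '@', anchor. */
	n = snprintf(buf, len, "%s: cannot get table stats for %s@%s",
	    func, table_name, anchor);
	if (n < 0 || (size_t)n >= len)
		return (-1);
	return (0);
}
```

Since the current log line shows a "pfe: " prefix and a ": No such file
or directory" suffix around the fatal() message, the patched line would
presumably read "pfe: check_table: cannot get table stats for
webhosts@relayd/www: No such file or directory".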
> Jun 30 01:47:46 ll1 relayd[94434]: ca exiting, pid 94434
> Jun 30 01:47:46 ll1 relayd[83189]: ca exiting, pid 83189
> Jun 30 01:47:46 ll1 relayd[9023]: ca exiting, pid 9023
> Jun 30 01:47:46 ll1 relayd[89820]: ca exiting, pid 89820
> Jun 30 01:47:46 ll1 relayd[94676]: ca exiting, pid 94676
> Jun 30 01:47:46 ll1 relayd[1820]: hce exiting, pid 1820
> Jun 30 01:47:46 ll1 relayd[52103]: lost child: pid 61766 exited abnormally
The parent relayd process noticed that the child called exit(1)
because it could not find the table.
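This is also how the crash vs. fatal() question at the top can be
answered: the parent only sees the child's wait status. A child killed
by a signal (e.g. SIGSEGV, where a core file may be left behind) reports
WIFSIGNALED, while a child that went through fatal() reports WIFEXITED
with a non-zero status. The sketch below uses the POSIX wait macros and
is loosely modeled on relayd's "lost child" message; describe_exit() and
run_child() are hypothetical, and the exact wording relayd uses for the
signal case is an assumption.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Classify a wait status: signal (crash) vs. non-zero exit (fatal()). */
static const char *
describe_exit(int status, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (WIFSIGNALED(status))
		snprintf(buf, len, "terminated; signal %d", WTERMSIG(status));
	else if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
		snprintf(buf, len, "exited abnormally");
	else
		snprintf(buf, len, "exited okay");
	return (buf);
}

/*
 * Fork a child that either raises sig (simulated crash) or exits with
 * code, and return its wait status; returns -1 if fork/waitpid fail.
 */
static int
run_child(int code, int sig)
{
	pid_t	pid;
	int	status;

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		if (sig != 0)
			raise(sig);	/* simulate a crash */
		_exit(code);	/* like the exit(1) at the end of fatal() */
	}
	if (waitpid(pid, &status, 0) != pid)
		return (-1);
	return (status);
}
```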
Once you are able to run the patched relayd, can you try to reproduce
the issue?
It would also help if you could collect some additional data:
pfctl -vsA > anchors-before
# reproduce the issue, wait for relayd to exit/crash
pfctl -vsA > anchors-after
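Once both files exist, `diff anchors-before anchors-after` is probably
all that is needed. For a ruleset with many redirects, the sketch below
prints only the anchors present before but missing after. It assumes
pfctl -vsA prints one anchor path per line, possibly indented;
missing_anchors(), trim_line(), MAX_ANCHORS and ANCHOR_LEN are
hypothetical names with arbitrary limits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ANCHORS	1024	/* arbitrary limit for this sketch */
#define ANCHOR_LEN	1024	/* arbitrary line length for this sketch */

/* Strip leading blanks and trailing newline/blanks from one line. */
static char *
trim_line(char *s)
{
	size_t len;

	while (*s == ' ' || *s == '\t')
		s++;
	len = strlen(s);
	while (len > 0 && (s[len - 1] == '\n' || s[len - 1] == ' '))
		s[--len] = '\0';
	return (s);
}

/*
 * Print to out every anchor listed in before but not in after and
 * return how many were printed. Uses a static table, so not reentrant.
 */
static int
missing_anchors(FILE *before, FILE *after, FILE *out)
{
	static char	seen[MAX_ANCHORS][ANCHOR_LEN];
	char		line[ANCHOR_LEN], *a;
	int		nseen = 0, missing = 0, i, found;

	assert(before != NULL && after != NULL && out != NULL);
	/* Remember every anchor that still exists afterwards. */
	while (nseen < MAX_ANCHORS && fgets(line, sizeof(line), after) != NULL) {
		a = trim_line(line);
		if (*a != '\0')
			strcpy(seen[nseen++], a);	/* fits: same buffer size */
	}
	/* Report anchors from the "before" list that are gone. */
	while (fgets(line, sizeof(line), before) != NULL) {
		a = trim_line(line);
		if (*a == '\0')
			continue;
		found = 0;
		for (i = 0; i < nseen && !found; i++)
			found = (strcmp(seen[i], a) == 0);
		if (!found) {
			fprintf(out, "%s\n", a);
			missing++;
		}
	}
	return (missing);
}
```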
That data, together with the output from the adjusted fatal()
call, should help us better understand what's going on.
thanks for your help
regards
sashan
--------8<---------------8<---------------8<------------------8<--------
diff --git a/usr.sbin/relayd/pfe_filter.c b/usr.sbin/relayd/pfe_filter.c
index 347048ece56..e1ae050b768 100644
--- a/usr.sbin/relayd/pfe_filter.c
+++ b/usr.sbin/relayd/pfe_filter.c
@@ -632,7 +632,8 @@ check_table(struct relayd *env, struct rdr *rdr, struct table *table)
 		goto toolong;
 	if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
-		fatal("%s: cannot get table stats", __func__);
+		fatal("%s: cannot get table stats for %s@%s", __func__,
+		    io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
 	return (tstats.pfrts_match);