Re: Multiple issues with current (kldload failures, missing CTF stuff, pty issues, ...)

2024-03-31 Thread Alexander Leidinger

On 2024-03-29 18:21, Alexander Leidinger wrote:

> On 2024-03-29 18:13, Mark Johnston wrote:
>
>> On Fri, Mar 29, 2024 at 04:52:55PM +0100, Alexander Leidinger wrote:
>>
>>> Hi,
>>>
>>> sources from 2024-03-11 work. Sources from 2024-03-25 and today don't
>>> work (see below for the issue). As the monthly stabilisation pass
>>> didn't find obvious issues, it is something related to my setup:
>>>  - not a generic kernel
>>>  - very modular kernel (as much as possible as a module)
>>>  - bind_now (a build without fails too, tested with clean /usr/obj)
>>>  - ccache (a build without fails too, tested with clean /usr/obj)
>>>  - kernel retpoline (build without in progress)
>>>  - userland retpoline (build without in progress)
>>>  - kernel build with WITH_CTF / DDB_CTF (next one to test if it isn't
>>> retpoline)
>>>  - -fno-builtin
>>>  - CPUFLAGS=native (except for stuff in /usr/src/sys/boot)
>>>  - malloc production
>>>  - COPTFLAGS= -O2 -pipe
>>>
>>> The issue is that kernel modules load OK from loader, but once it
>>> starts init any module fails to load (e.g. via autodetection of
>>> hardware or rc.conf kld_list) with the message that the kernel and
>>> module versions are out of sync and the module refuses to load.
>>
>> What is the exact revision you're running?  There were some unrelated
>> changes to the kernel linker around the same time.
>
> The working src is from 2024-03-11-094351 (GMT+0100).
> The failing src was fetched after Gleb's stabilization week message (and
> today's src before the sound stuff still fails).
>
> Retpoline wasn't the cause, next test is the CTF stuff in the kernel...


A rather obscure problem was causing this. The "last" BE had canmount 
set to "on" instead of "noauto". No idea how this happened, but as a 
result "zfs mount -a" mounted the "last" BE on top of the current BE. 
This means that all modules loaded after the zfs rc script had run were 
old kernel modules, and the error message about the kernel version 
mismatch was correct. I found the issue while bisecting the tree: 
suddenly the error message went away, but a new issue of missing dev 
entries popped up (/dev was mounted correctly on the booting dataset, 
but the last BE was mounted on top of it and /dev went empty...).
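
For readers who want to check their own system for the same condition, 
here is a minimal sketch. It assumes the boot environments live under 
rpool/ROOT, as in this report; list_canmount_on is a hypothetical helper 
name, not part of any tool. It only filters the tab-separated output of 
"zfs get -H", so it changes nothing on the pool.

```shell
#!/bin/sh
# Sketch: report datasets whose canmount property is "on".
# Reads "name<TAB>value" lines on stdin, the format printed by
#   zfs get -H -r -o name,value canmount rpool/ROOT
# (rpool/ROOT is an assumption taken from the pool layout in this thread).
# list_canmount_on is a hypothetical helper name.
list_canmount_on() {
    awk -F '\t' '$2 == "on" { print $1 }'
}
```

Typical use would be "zfs get -H -r -o name,value canmount rpool/ROOT | 
list_canmount_on". If a BE other than the booted one shows up, setting 
canmount=noauto on that dataset with "zfs set" should keep "zfs mount -a" 
from mounting it over the running system.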


It looks to me like bectl was doing this (from "zpool history")...
2024-03-11.14:16:31 zpool set bootfs=rpool/ROOT/2024-03-11-094351 rpool
2024-03-11.14:16:31 zfs set canmount=noauto rpool/ROOT/2024-01-18-092730
2024-03-11.14:16:31 zfs set canmount=noauto rpool/ROOT/2024-02-10-144617
2024-03-11.14:16:32 zfs set canmount=noauto rpool/ROOT/2024-02-11-212006
2024-03-11.14:16:32 zfs set canmount=noauto rpool/ROOT/2024-02-16-082836
2024-03-11.14:16:32 zfs set canmount=noauto rpool/ROOT/2024-02-24-140211
2024-03-11.14:16:32 zfs set canmount=noauto rpool/ROOT/2024-02-24-140211_ok
2024-03-11.14:16:33 zfs set canmount=on rpool/ROOT/2024-03-11-094351
2024-03-11.14:16:33 zfs promote rpool/ROOT/2024-03-11-094351
2024-03-11.14:17:03 zfs destroy -r rpool/ROOT/2024-02-24-140211_ok

I surely didn't do the "zfs set canmount=..." for those by hand.
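
To pull such entries out of a long pool history, something like the 
following sketch may help. It assumes the default "zpool history" line 
format shown above (timestamp, then the command); canmount_changes is a 
hypothetical helper name.

```shell
#!/bin/sh
# Sketch: extract canmount changes from `zpool history` output on stdin.
# Assumes the default history format: "<timestamp> zfs set canmount=<v> <dataset>".
# Prints "timestamp dataset value" for each matching line.
# canmount_changes is a hypothetical helper name.
canmount_changes() {
    awk '$2 == "zfs" && $3 == "set" && $4 ~ /^canmount=/ {
        sub(/^canmount=/, "", $4)   # keep only the property value
        print $1, $5, $4
    }'
}
```

It would be used as "zpool history rpool | canmount_changes", which makes 
it easier to see when a dataset was last switched to "on" or "noauto".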

Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org   netch...@freebsd.org  : PGP 0x8F31830F9F2772BF




Re: Multiple issues with current (kldload failures, missing CTF stuff, pty issues, ...)

2024-03-29 Thread Bojan Novković

On 3/29/24 16:52, Alexander Leidinger wrote:

> Hi,
>
> sources from 2024-03-11 work. Sources from 2024-03-25 and today don't
> work (see below for the issue). As the monthly stabilisation pass
> didn't find obvious issues, it is something related to my setup:
>
>  - not a generic kernel
>  - very modular kernel (as much as possible as a module)
>  - bind_now (a build without fails too, tested with clean /usr/obj)
>  - ccache (a build without fails too, tested with clean /usr/obj)
>  - kernel retpoline (build without in progress)
>  - userland retpoline (build without in progress)
>  - kernel build with WITH_CTF / DDB_CTF (next one to test if it isn't
> retpoline)
>  - -fno-builtin
>  - CPUFLAGS=native (except for stuff in /usr/src/sys/boot)
>  - malloc production
>  - COPTFLAGS= -O2 -pipe
>
> The issue is that kernel modules load OK from loader, but once it
> starts init any module fails to load (e.g. via autodetection of
> hardware or rc.conf kld_list) with the message that the kernel and
> module versions are out of sync and the module refuses to load.
>
> I tried the workaround to load the modules from the loader, which
> works, but then I can't log in remotely as ssh fails to allocate a pty.
> By loading modules via the loader, I can see messages about missing
> CTF info when the nvidia modules (from ports = not yet rebuilt = in
> /boot/modules/...ko instead of /boot/kernel/...ko) try to get
> initialised... and it looks like they are failing to get initialised
> because of this missing CTF stuff (I'm back to the previous boot env
> to be able to log in remotely and send mails, I don't have a copy of
> the failure message at hand).
>
> I assume the missing CTF stuff is due to the CTF based pretty printing
> (https://cgit.freebsd.org/src/commit/?id=c21bc6f3c2425de74141bfee07b609bf65b5a6b3).
> Is this supposed to fail to load modules which are compiled without
> CTF data? Shouldn't this work gracefully (e.g. spit out a warning that
> pretty printing is not available for module X and have the module
> working)?


This is indeed how it works: those messages are emitted by the CTF 
loading routines in 'kern/kern_ctf.c' as warnings and do not affect the 
rest of the module loading process.


However, I completely agree that they are cryptic and spammy; I'll try 
to do something about that.
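
To see whether a given module carries CTF data at all, a quick check 
along these lines may be useful. This is only a sketch: it assumes the 
CTF data sits in an ELF section named .SUNW_ctf and that "readelf -S" is 
used to list the sections; has_ctf_section is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if the ELF section listing on stdin mentions .SUNW_ctf.
# Intended input: the output of `readelf -S <module>.ko`.
# has_ctf_section is a hypothetical helper name.
has_ctf_section() {
    grep -Fq '.SUNW_ctf'
}
```

For example, with $module set to the path of a .ko file: 
readelf -S "$module" | has_ctf_section || echo "$module: no CTF section"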


Bojan




Re: Multiple issues with current (kldload failures, missing CTF stuff, pty issues, ...)

2024-03-29 Thread Alexander Leidinger

On 2024-03-29 18:13, Mark Johnston wrote:

> On Fri, Mar 29, 2024 at 04:52:55PM +0100, Alexander Leidinger wrote:
>
>> Hi,
>>
>> sources from 2024-03-11 work. Sources from 2024-03-25 and today don't
>> work (see below for the issue). As the monthly stabilisation pass
>> didn't find obvious issues, it is something related to my setup:
>>  - not a generic kernel
>>  - very modular kernel (as much as possible as a module)
>>  - bind_now (a build without fails too, tested with clean /usr/obj)
>>  - ccache (a build without fails too, tested with clean /usr/obj)
>>  - kernel retpoline (build without in progress)
>>  - userland retpoline (build without in progress)
>>  - kernel build with WITH_CTF / DDB_CTF (next one to test if it isn't
>> retpoline)
>>  - -fno-builtin
>>  - CPUFLAGS=native (except for stuff in /usr/src/sys/boot)
>>  - malloc production
>>  - COPTFLAGS= -O2 -pipe
>>
>> The issue is that kernel modules load OK from loader, but once it
>> starts init any module fails to load (e.g. via autodetection of
>> hardware or rc.conf kld_list) with the message that the kernel and
>> module versions are out of sync and the module refuses to load.
>
> What is the exact revision you're running?  There were some unrelated
> changes to the kernel linker around the same time.


The working src is from 2024-03-11-094351 (GMT+0100).
The failing src was fetched after Gleb's stabilization week message (and 
today's src before the sound stuff still fails).


Retpoline wasn't the cause, next test is the CTF stuff in the kernel...

>> I tried the workaround to load the modules from the loader, which
>> works, but then I can't log in remotely as ssh fails to allocate a
>> pty. By loading modules via the loader, I can see messages about
>> missing CTF info when the nvidia modules (from ports = not yet
>> rebuilt = in /boot/modules/...ko instead of /boot/kernel/...ko) try
>> to get initialised... and it looks like they are failing to get
>> initialised because of this missing CTF stuff (I'm back to the
>> previous boot env to be able to log in remotely and send mails, I
>> don't have a copy of the failure message at hand).
>>
>> I assume the missing CTF stuff is due to the CTF based pretty printing
>> (https://cgit.freebsd.org/src/commit/?id=c21bc6f3c2425de74141bfee07b609bf65b5a6b3).
>> Is this supposed to fail to load modules which are compiled without
>> CTF data? Shouldn't this work gracefully (e.g. spit out a warning
>> that pretty printing is not available for module X and have the
>> module working)?
>
> From my reading of linker_ctf_load_file(), this is exactly how it
> already works.


Great that it works this way. I still suggest printing a message that 
explains what the warning about missing stuff means.


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org   netch...@freebsd.org  : PGP 0x8F31830F9F2772BF




Re: Multiple issues with current (kldload failures, missing CTF stuff, pty issues, ...)

2024-03-29 Thread Mark Johnston
On Fri, Mar 29, 2024 at 04:52:55PM +0100, Alexander Leidinger wrote:
> Hi,
> 
> sources from 2024-03-11 work. Sources from 2024-03-25 and today don't work
> (see below for the issue). As the monthly stabilisation pass didn't find
> obvious issues, it is something related to my setup:
>  - not a generic kernel
>  - very modular kernel (as much as possible as a module)
>  - bind_now (a build without fails too, tested with clean /usr/obj)
>  - ccache (a build without fails too, tested with clean /usr/obj)
>  - kernel retpoline (build without in progress)
>  - userland retpoline (build without in progress)
>  - kernel build with WITH_CTF / DDB_CTF (next one to test if it isn't
> retpoline)
>  - -fno-builtin
>  - CPUFLAGS=native (except for stuff in /usr/src/sys/boot)
>  - malloc production
>  - COPTFLAGS= -O2 -pipe
> 
> The issue is that kernel modules load OK from loader, but once it starts
> init any module fails to load (e.g. via autodetection of hardware or rc.conf
> kld_list) with the message that the kernel and module versions are out of
> sync and the module refuses to load.

What is the exact revision you're running?  There were some unrelated
changes to the kernel linker around the same time.

> I tried the workaround to load the modules from the loader, which works, but
> then I can't log in remotely as ssh fails to allocate a pty. By loading
> modules via the loader, I can see messages about missing CTF info when the
> nvidia modules (from ports = not yet rebuilt = in /boot/modules/...ko
> instead of /boot/kernel/...ko) try to get initialised... and it looks like
> they are failing to get initialised because of this missing CTF stuff (I'm
> back to the previous boot env to be able to log in remotely and send mails, I
> don't have a copy of the failure message at hand).
> 
> I assume the missing CTF stuff is due to the CTF based pretty printing 
> (https://cgit.freebsd.org/src/commit/?id=c21bc6f3c2425de74141bfee07b609bf65b5a6b3).
> Is this supposed to fail to load modules which are compiled without CTF
> data? Shouldn't this work gracefully (e.g. spit out a warning that pretty
> printing is not available for module X and have the module working)?

From my reading of linker_ctf_load_file(), this is exactly how it
already works.

> Next steps:
>  - try a world without retpoline (bind_now and ccache active)
>  - try a kernel without CTF (bind_now, ccache, retpoline active)
>  - try a world without bind_now, retpoline, CTF, CPUFLAGS, COPTFLAGS
> 
> If anyone has an idea how to debug this in some other way...
> 
> Bye,
> Alexander.
> 
> -- 
> http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
> http://www.FreeBSD.org   netch...@freebsd.org  : PGP 0x8F31830F9F2772BF





Multiple issues with current (kldload failures, missing CTF stuff, pty issues, ...)

2024-03-29 Thread Alexander Leidinger

Hi,

sources from 2024-03-11 work. Sources from 2024-03-25 and today don't 
work (see below for the issue). As the monthly stabilisation pass didn't 
find obvious issues, it is something related to my setup:

 - not a generic kernel
 - very modular kernel (as much as possible as a module)
 - bind_now (a build without fails too, tested with clean /usr/obj)
 - ccache (a build without fails too, tested with clean /usr/obj)
 - kernel retpoline (build without in progress)
 - userland retpoline (build without in progress)
 - kernel build with WITH_CTF / DDB_CTF (next one to test if it isn't 
   retpoline)
 - -fno-builtin
 - CPUFLAGS=native (except for stuff in /usr/src/sys/boot)
 - malloc production
 - COPTFLAGS= -O2 -pipe

The issue is that kernel modules load OK from loader, but once it 
starts init any module fails to load (e.g. via autodetection of hardware 
or rc.conf kld_list) with the message that the kernel and module 
versions are out of sync and the module refuses to load.


I tried the workaround to load the modules from the loader, which works, 
but then I can't log in remotely as ssh fails to allocate a pty. By 
loading modules via the loader, I can see messages about missing CTF 
info when the nvidia modules (from ports = not yet rebuilt = in 
/boot/modules/...ko instead of /boot/kernel/...ko) try to get 
initialised... and it looks like they are failing to get initialised 
because of this missing CTF stuff (I'm back to the previous boot env to 
be able to log in remotely and send mails, I don't have a copy of the 
failure message at hand).


I assume the missing CTF stuff is due to the CTF based pretty printing 
(https://cgit.freebsd.org/src/commit/?id=c21bc6f3c2425de74141bfee07b609bf65b5a6b3). 
Is this supposed to fail to load modules which are compiled without CTF 
data? Shouldn't this work gracefully (e.g. spit out a warning that 
pretty printing is not available for module X and have the module 
working)?


Next steps:
 - try a world without retpoline (bind_now and ccache active)
 - try a kernel without CTF (bind_now, ccache, retpoline active)
 - try a world without bind_now, retpoline, CTF, CPUFLAGS, COPTFLAGS

If anyone has an idea how to debug this in some other way...

Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org   netch...@freebsd.org  : PGP 0x8F31830F9F2772BF

