Re: panic at reboot - tsc_test_sync_ap
On Wed, Dec 14, 2022 at 11:37:14AM +, Pedro Caetano wrote:
> Hi bugs@
> 
> In the process of upgrading a pair of servers to release 7.2, the
> following panic was triggered after sysupgrade reboot. (Dell PowerEdge
> R740)
> 
> One of the reboots happened before syspatch, the other happened after
> applying the release patches.
> 
> After powercycling, both servers managed to boot successfully.
> 
> Please keep me copied as I'm not subscribed to bugs@
> 
> Screenshot of the panic attached to this email.

For reference:

cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 16-way L2 cache, 8MB 64b/line 11-way L3 cache
cpu2: smt 0, core 5, package 0
panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1
panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0: apid 26 (application process

Somehow your machine is violating one of the TSC sync test sanity
checks.  The idea behind this one is that there should only be one AP
in the sync test at a time.

At the start of each test, in tsc_test_sync_ap(), the AP sets
tsc_ap_name to its dv_xname.  It does this with an atomic CAS
expecting NULL to ensure no other AP is still running the sync test.

You're hitting this panic:

449 void
450 tsc_test_sync_ap(struct cpu_info *ci)
451 {
452         if (!tsc_is_invariant)
453                 return;
454 #ifndef TSC_DEBUG
455         if (!tsc_is_synchronized)
456                 return;
457 #endif
458         /* The BP needs our name in order to report any problems. */
459         if (atomic_cas_ptr(&tsc_ap_name, NULL, ci->ci_dev->dv_xname) != NULL) {
460                 panic("%s: %s: tsc_ap_name is not NULL: %s",
461                     __func__, ci->ci_dev->dv_xname, tsc_ap_name);
462         }

The BP is supposed to reset tsc_ap_name to NULL at the conclusion of
every sync test, from tsc_test_sync_bp():

415         /*
416          * Report what happened.  Adjust the TSC's quality
417          * if this is the first time we've failed the test.
418          */
419         tsc_report_test_results();
420         if (tsc_ap_status.lag_count || tsc_bp_status.lag_count) {
421                 if (tsc_is_synchronized) {
422                         tsc_is_synchronized = 0;
423                         tc_reset_quality(&tsc_timecounter, -1000);
424                 }
425                 tsc_test_rounds = 0;
426         } else
427                 tsc_test_rounds--;
428 
429         /*
430          * Clean up for the next round.  It is safe to reset the
431          * ingress barrier because at this point we know the AP
432          * has reached the egress barrier.
433          */
434         memset(&tsc_ap_status, 0, sizeof tsc_ap_status);
435         memset(&tsc_bp_status, 0, sizeof tsc_bp_status);
436         tsc_ingress_barrier = 0;
437         if (tsc_test_rounds == 0)
438                 tsc_ap_name = NULL;

It's possible the BP's store:

	tsc_ap_name = NULL;

is not *always* globally visible by the time the next AP reaches the
tsc_ap_name CAS, triggering the panic.  If so, we could force the
store to complete with membar_producer().  tsc_ap_name should be
volatile, too.

OTOH, it's possible this particular check is not the right thing here.
My intention is correct... we definitely don't want more than one AP
in the sync test at any given moment.  But this tsc_ap_name handshake
thing may be the wrong way to assert that.
Index: tsc.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.30
diff -u -p -r1.30 tsc.c
--- tsc.c	24 Oct 2022 00:56:33 -	1.30
+++ tsc.c	14 Dec 2022 18:12:54 -
@@ -372,7 +372,7 @@ struct tsc_test_status {
 struct tsc_test_status tsc_ap_status;	/* Test results from AP */
 struct tsc_test_status tsc_bp_status;	/* Test results from BP */
 uint64_t tsc_test_cycles;		/* [p] TSC cycles per test round */
-const char *tsc_ap_name;		/* [b] Name of AP running test */
+volatile const char *tsc_ap_name;	/* [b] Name of AP running test */
 volatile u_int tsc_egress_barrier;	/* [a] Test end barrier */
 volatile u_int tsc_ingress_barrier;	/* [a] Test start barrier */
 volatile u_int tsc_test_rounds;		/* [p] Remaining test rounds */
@@ -434,8 +434,10 @@ tsc_test_sync_bp(struct cpu_info *ci)
 	memset(&tsc_ap_status, 0, sizeof tsc_ap_status);
 	memset(&tsc_bp_status, 0, sizeof tsc_bp_status);
 	tsc_ingress_barrier = 0;
-	if (tsc_test_rounds == 0)
+	if (tsc_test_rounds == 0) {
 		tsc_ap_name =
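For readers following along, the handshake in question can be modeled
in userland with C11 atomics.  This is a minimal sketch, under the
assumption that membar_producer() corresponds to a release store; all
names below (ap_enter, bp_finish, ap_name) are hypothetical and this
is not the kernel code:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for tsc_ap_name: which AP, if any, is inside the test. */
static _Atomic(const char *) ap_name = NULL;

/* AP side: claim the test slot, as the CAS in tsc_test_sync_ap() does. */
int
ap_enter(const char *name)
{
	const char *expected = NULL;

	/* Fails (returns 0) if another AP's name is still visible. */
	return atomic_compare_exchange_strong(&ap_name, &expected, name);
}

/* BP side: release the slot at the end of the last test round. */
void
bp_finish(void)
{
	/*
	 * A release store makes the NULL visible before any later
	 * synchronizing operation; it plays the role that a plain
	 * store followed by membar_producer() would play in the kernel.
	 */
	atomic_store_explicit(&ap_name, NULL, memory_order_release);
}
```

With a plain non-atomic pointer and no barrier, nothing orders the
BP's NULL store against the next AP's CAS, which is the visibility
gap speculated about above.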
Re: acme-client canary corrupted issue
> Try this

ok tb

> Index: revokeproc.c
> ===================================================================
> RCS file: /home/cvs/src/usr.sbin/acme-client/revokeproc.c,v
> retrieving revision 1.19
> diff -u -p -r1.19 revokeproc.c
> --- revokeproc.c	22 Nov 2021 08:26:08 -	1.19
> +++ revokeproc.c	14 Dec 2022 14:16:46 -
> @@ -239,6 +239,7 @@ revokeproc(int fd, const char *certfile,
>  			goto out;
>  		}
>  		force = 2;
> +		continue;
>  	}
>  	if (found[j]++) {
>  		if (revocate) {
> 
Re: acme-client canary corrupted issue
On 12/14/22 15:56, Otto Moerbeek wrote:

On Wed, Dec 14, 2022 at 03:51:44PM +0100, Renaud Allard wrote:

On 12/14/22 14:44, Theo de Raadt wrote:

sysctl kern.nosuidcoredump=3

mkdir /var/crash/acme-client

and then try to reproduce, and see if a core file is delivered there.
This coredump mechanism was added to capture some hard-to-capture
coredumps, you can see more info in core(5) and sysctl(3)

Thanks

I have been able to reproduce it reliably with the staging API,
however, there is no core dump generated in /var/crash/acme-client.

To reproduce it, you need a certificate with alternative names using
multiple different domains. Generate a cert, then fully remove one of
the domains and ask for a forced reissue.

I tried with the following Otto patch from today, and it seems it
solves the issue.

Are you sure you attached the right patch?

	-Otto

Index: acctproc.c
===================================================================
RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
retrieving revision 1.23
diff -u -p -r1.23 acctproc.c
--- acctproc.c	14 Jan 2022 09:20:18 -	1.23
+++ acctproc.c	14 Dec 2022 11:06:45 -
@@ -439,6 +439,7 @@ op_sign(int fd, EVP_PKEY *pkey, enum acc
 
 	rc = 1;
 out:
+	ECDSA_SIG_free(ec_sig);
 	EVP_MD_CTX_free(ctx);
 	free(pay);
 	free(sign);

OK, with both patches (one from Otto and the other from Theo B, sorry
I mistook the first patch author) and 4 tries, I have not got the
crash anymore.

Index: revokeproc.c
===================================================================
RCS file: /home/cvs/src/usr.sbin/acme-client/revokeproc.c,v
retrieving revision 1.19
diff -u -p -r1.19 revokeproc.c
--- revokeproc.c	22 Nov 2021 08:26:08 -	1.19
+++ revokeproc.c	14 Dec 2022 14:16:46 -
@@ -239,6 +239,7 @@ revokeproc(int fd, const char *certfile,
 			goto out;
 		}
 		force = 2;
+		continue;
 	}
 	if (found[j]++) {
 		if (revocate) {
Re: acme-client canary corrupted issue
On 12/14/22 15:56, Otto Moerbeek wrote:

On Wed, Dec 14, 2022 at 03:51:44PM +0100, Renaud Allard wrote:

On 12/14/22 14:44, Theo de Raadt wrote:

sysctl kern.nosuidcoredump=3

mkdir /var/crash/acme-client

and then try to reproduce, and see if a core file is delivered there.
This coredump mechanism was added to capture some hard-to-capture
coredumps, you can see more info in core(5) and sysctl(3)

Thanks

I have been able to reproduce it reliably with the staging API,
however, there is no core dump generated in /var/crash/acme-client.

To reproduce it, you need a certificate with alternative names using
multiple different domains. Generate a cert, then fully remove one of
the domains and ask for a forced reissue.

I tried with the following Otto patch from today, and it seems it
solves the issue.

Are you sure you attached the right patch?

	-Otto

Ahh, that's a strange one. On the first run, the crash didn't happen
anymore with that patch, but I retried again to be sure and the crash
is still there. I will try again with your "continue" patch too, and
make 3 tries to be sure.

Would a ktrace help?

Index: acctproc.c
===================================================================
RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
retrieving revision 1.23
diff -u -p -r1.23 acctproc.c
--- acctproc.c	14 Jan 2022 09:20:18 -	1.23
+++ acctproc.c	14 Dec 2022 11:06:45 -
@@ -439,6 +439,7 @@ op_sign(int fd, EVP_PKEY *pkey, enum acc
 
 	rc = 1;
 out:
+	ECDSA_SIG_free(ec_sig);
 	EVP_MD_CTX_free(ctx);
 	free(pay);
 	free(sign);
Re: acme-client canary corrupted issue
On Wed, Dec 14, 2022 at 03:51:44PM +0100, Renaud Allard wrote:
> 
> 
> On 12/14/22 14:44, Theo de Raadt wrote:
> > sysctl kern.nosuidcoredump=3
> > 
> > mkdir /var/crash/acme-client
> > 
> > and then try to reproduce, and see if a core file is delivered there.
> > This coredump mechanism was added to capture some hard-to-capture
> > coredumps, you can see more info in core(5) and sysctl(3)
> > 
> 
> Thanks
> 
> I have been able to reproduce it reliably with the staging API, however,
> there is no core dump generated in /var/crash/acme-client.
> 
> To reproduce it, you need a certificate with alternative names using
> multiple different domains. Generate a cert, then fully remove one of the
> domains and ask for a forced reissue.
> 
> I tried with following Otto patch from today, and it seems it solves the
> issue.

Are you sure you attached the right patch?

	-Otto

> 
> Index: acctproc.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
> retrieving revision 1.23
> diff -u -p -r1.23 acctproc.c
> --- acctproc.c	14 Jan 2022 09:20:18 -	1.23
> +++ acctproc.c	14 Dec 2022 11:06:45 -
> @@ -439,6 +439,7 @@ op_sign(int fd, EVP_PKEY *pkey, enum acc
> 
>  	rc = 1;
>  out:
> +	ECDSA_SIG_free(ec_sig);
>  	EVP_MD_CTX_free(ctx);
>  	free(pay);
>  	free(sign);
Re: acme-client canary corrupted issue
On 12/14/22 14:44, Theo de Raadt wrote:

sysctl kern.nosuidcoredump=3

mkdir /var/crash/acme-client

and then try to reproduce, and see if a core file is delivered there.
This coredump mechanism was added to capture some hard-to-capture
coredumps, you can see more info in core(5) and sysctl(3)

Thanks

I have been able to reproduce it reliably with the staging API,
however, there is no core dump generated in /var/crash/acme-client.

To reproduce it, you need a certificate with alternative names using
multiple different domains. Generate a cert, then fully remove one of
the domains and ask for a forced reissue.

I tried with the following Otto patch from today, and it seems it
solves the issue.

Index: acctproc.c
===================================================================
RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
retrieving revision 1.23
diff -u -p -r1.23 acctproc.c
--- acctproc.c	14 Jan 2022 09:20:18 -	1.23
+++ acctproc.c	14 Dec 2022 11:06:45 -
@@ -439,6 +439,7 @@ op_sign(int fd, EVP_PKEY *pkey, enum acc
 
 	rc = 1;
 out:
+	ECDSA_SIG_free(ec_sig);
 	EVP_MD_CTX_free(ctx);
 	free(pay);
 	free(sign);
Re: acme-client canary corrupted issue
On Wed, Dec 14, 2022 at 12:30:25PM +0100, Renaud Allard wrote:
> Hi Otto,
> 
> 
> On 12/14/22 12:01, Otto Moerbeek wrote:
> > On Tue, Dec 13, 2022 at 10:34:53AM +0100, Renaud Allard wrote:
> > 
> > > Hello,
> > > 
> > > I was force renewing some certs because I removed some domains from
> > > the cert, and got this:
> > > acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00
> > > 0xb0@0xb0
> > > 
> > > I am using vm.malloc_conf=SUR>>
> > > 
> > > Best Regards
> > 
> > I cannot reproduce with several attempts. Please include details on
> > platform and version.
> > 
> > Can you show a run with -v on? That gives a hint where the problem
> > occurs.
> > 
> > Do you get a core dump? If so, try to get a backtrace.
> 
> It's quite hard to reproduce, I only had it once when I shrank the
> alternative names involved in one certificate. There was no core dump.
> 
> This was produced on 7.2-stable amd64
> account and domain keys are ecdsa
> 
> I ran it with -vvF and could get my run log thanks to tmux back buffer.
> I will skip all the verification/certs babble
> 
> isildur# acme-client -vvF arnor.org
> acme-client: /somewhere/arnor.org.key: loaded domain key
> acme-client: /etc/acme/letsencrypt-privkey.pem: loaded account key
> acme-client: /somewhere/arnor.org.crt: certificate valid: 74 days left
> acme-client: /somewhere/arnor.org.crt: domain list changed, forcing renewal
> acme-client: https://acme-v02.api.letsencrypt.org/directory: directories
> acme-client: acme-v02.api.letsencrypt.org: DNS: 172.65.32.248
> lots of standard certs/verif dialog *
> -----END CERTIFICATE----- ] (5800 bytes)
> acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0
> acme-client: /somewhere/arnor.org.crt: created
> acme-client: /somewhere/arnor.org.fullchain.pem: created
> acme-client: signal: revokeproc(53931): Abort trap
> 
> Best Regards

Try this

	-Otto

Index: revokeproc.c
===================================================================
RCS file: /home/cvs/src/usr.sbin/acme-client/revokeproc.c,v
retrieving revision 1.19
diff -u -p -r1.19 revokeproc.c
--- revokeproc.c	22 Nov 2021 08:26:08 -	1.19
+++ revokeproc.c	14 Dec 2022 14:16:46 -
@@ -239,6 +239,7 @@ revokeproc(int fd, const char *certfile,
 			goto out;
 		}
 		force = 2;
+		continue;
 	}
 	if (found[j]++) {
 		if (revocate) {
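A guess at why the one-line continue matters: when a SAN in the
certificate is no longer in the config, the inner search loop runs to
completion, leaving the index one past the end of the found array, so
the found[j]++ that follows would write out of bounds, exactly the
kind of stray heap write a chunk canary catches.  A minimal model of
the fixed control flow (check_san and the arrays are hypothetical
names, not the actual revokeproc.c code):

```c
#include <string.h>

#define NDOMAINS 2

/* Configured domains; "old.example" was removed from the config. */
static const char *domains[NDOMAINS] = { "a.example", "b.example" };
static int found[NDOMAINS];

/*
 * Check one certificate SAN against the config.  Returns 2 ("force
 * renewal") when the SAN is no longer configured.  The early return
 * mimics the patch's `continue`: found[j]++ is never reached when
 * j == NDOMAINS, so the one-past-the-end write cannot happen.
 */
int
check_san(const char *san, int force)
{
	size_t j;

	for (j = 0; j < NDOMAINS; j++) {
		if (strcmp(san, domains[j]) == 0)
			break;
	}
	if (j == NDOMAINS) {
		force = 2;
		return force;	/* the patch's `continue` */
	}
	found[j]++;		/* safe: j < NDOMAINS here */
	return force;
}
```

Without the early exit, the not-found case would fall through to
found[j]++ with j == NDOMAINS, corrupting whatever byte sits just past
the allocation, which fits the reported 0xb0@0xb0 canary failure.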
Re: acme-client canary corrupted issue
sysctl kern.nosuidcoredump=3

mkdir /var/crash/acme-client

and then try to reproduce, and see if a core file is delivered there.
This coredump mechanism was added to capture some hard-to-capture
coredumps, you can see more info in core(5) and sysctl(3)

Renaud Allard wrote:
> Hi Otto,
> 
> 
> On 12/14/22 12:01, Otto Moerbeek wrote:
> > On Tue, Dec 13, 2022 at 10:34:53AM +0100, Renaud Allard wrote:
> > 
> >> Hello,
> >> 
> >> I was force renewing some certs because I removed some domains from
> >> the cert, and got this:
> >> acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00
> >> 0xb0@0xb0
> >> 
> >> I am using vm.malloc_conf=SUR>>
> >> 
> >> Best Regards
> > I cannot reproduce with several attempts. Please include details on
> > platform and version.
> > Can you show a run with -v on? That gives a hint where the problem
> > occurs.
> > Do you get a core dump? If so, try to get a backtrace.
> 
> It's quite hard to reproduce, I only had it once when I shrank the
> alternative names involved in one certificate. There was no core dump.
> 
> This was produced on 7.2-stable amd64
> account and domain keys are ecdsa
> 
> I ran it with -vvF and could get my run log thanks to tmux back buffer.
> I will skip all the verification/certs babble
> 
> isildur# acme-client -vvF arnor.org
> acme-client: /somewhere/arnor.org.key: loaded domain key
> acme-client: /etc/acme/letsencrypt-privkey.pem: loaded account key
> acme-client: /somewhere/arnor.org.crt: certificate valid: 74 days left
> acme-client: /somewhere/arnor.org.crt: domain list changed, forcing renewal
> acme-client: https://acme-v02.api.letsencrypt.org/directory: directories
> acme-client: acme-v02.api.letsencrypt.org: DNS: 172.65.32.248
> lots of standard certs/verif dialog *
> -----END CERTIFICATE----- ] (5800 bytes)
> acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0
> acme-client: /somewhere/arnor.org.crt: created
> acme-client: /somewhere/arnor.org.fullchain.pem: created
> acme-client: signal: revokeproc(53931): Abort trap
> 
> Best Regards
Re: acme-client canary corrupted issue
Hi Otto,

On 12/14/22 12:01, Otto Moerbeek wrote:

On Tue, Dec 13, 2022 at 10:34:53AM +0100, Renaud Allard wrote:

Hello,

I was force renewing some certs because I removed some domains from
the cert, and got this:
acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0

I am using vm.malloc_conf=SUR>>

Best Regards

I cannot reproduce with several attempts. Please include details on
platform and version.

Can you show a run with -v on? That gives a hint where the problem
occurs.

Do you get a core dump? If so, try to get a backtrace.

It's quite hard to reproduce, I only had it once when I shrank the
alternative names involved in one certificate. There was no core dump.

This was produced on 7.2-stable amd64
account and domain keys are ecdsa

I ran it with -vvF and could get my run log thanks to tmux back buffer.
I will skip all the verification/certs babble

isildur# acme-client -vvF arnor.org
acme-client: /somewhere/arnor.org.key: loaded domain key
acme-client: /etc/acme/letsencrypt-privkey.pem: loaded account key
acme-client: /somewhere/arnor.org.crt: certificate valid: 74 days left
acme-client: /somewhere/arnor.org.crt: domain list changed, forcing renewal
acme-client: https://acme-v02.api.letsencrypt.org/directory: directories
acme-client: acme-v02.api.letsencrypt.org: DNS: 172.65.32.248
lots of standard certs/verif dialog *
-----END CERTIFICATE----- ] (5800 bytes)
acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0
acme-client: /somewhere/arnor.org.crt: created
acme-client: /somewhere/arnor.org.fullchain.pem: created
acme-client: signal: revokeproc(53931): Abort trap

Best Regards
Re: acme-client canary corrupted issue
On Tue, Dec 13, 2022 at 10:34:53AM +0100, Renaud Allard wrote:
> Hello,
> 
> I was force renewing some certs because I removed some domains from
> the cert, and got this:
> acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0
> 
> I am using vm.malloc_conf=SUR>>
> 
> Best Regards

I cannot reproduce with several attempts. Please include details on
platform and version.

Can you show a run with -v on? That gives a hint where the problem
occurs.

Do you get a core dump? If so, try to get a backtrace.

	-Otto