Re: panic at reboot - tsc_test_sync_ap

2022-12-14 Thread Scott Cheloha
On Wed, Dec 14, 2022 at 11:37:14AM +, Pedro Caetano wrote:
> Hi bugs@
> 
> In the process of upgrading a pair of servers to release 7.2, the following
> panic was triggered after sysupgrade reboot. (dell poweredge R740)
> 
> One of the reboots happened before syspatch, the other happened after
> applying the release patches.
> 
> After powercycling, both servers managed to boot successfully.
> 
> Please keep me copied as I'm not subscribed to bugs@
> 
> 
> Screenshot of the panic attached to this email.

For reference:

cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
16-way L2 cache, 8MB 64b/line 11-way L3 cache
cpu2: smt 0, core 5, package 0
panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1
panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0: 
apid 26 (application processor)

Somehow your machine is violating one of the TSC sync test sanity
checks.  The idea behind this one is that there should only be one AP
in the sync test at a time.

At the start of each test, in tsc_test_sync_ap(), the AP sets
tsc_ap_name to its dv_xname.  It does this with an atomic CAS
expecting NULL to ensure no other AP is still running the sync test.
You're hitting this panic:

   449  void
   450  tsc_test_sync_ap(struct cpu_info *ci)
   451  {
   452          if (!tsc_is_invariant)
   453                  return;
   454  #ifndef TSC_DEBUG
   455          if (!tsc_is_synchronized)
   456                  return;
   457  #endif
   458          /* The BP needs our name in order to report any problems. */
   459          if (atomic_cas_ptr(&tsc_ap_name, NULL, ci->ci_dev->dv_xname) != NULL) {
   460                  panic("%s: %s: tsc_ap_name is not NULL: %s",
   461                      __func__, ci->ci_dev->dv_xname, tsc_ap_name);
   462          }

The BP is supposed to reset tsc_ap_name to NULL at the conclusion of
every sync test, from tsc_test_sync_bp():

   415                  /*
   416                   * Report what happened.  Adjust the TSC's quality
   417                   * if this is the first time we've failed the test.
   418                   */
   419                  tsc_report_test_results();
   420                  if (tsc_ap_status.lag_count || tsc_bp_status.lag_count) {
   421                          if (tsc_is_synchronized) {
   422                                  tsc_is_synchronized = 0;
   423                                  tc_reset_quality(&tsc_timecounter, -1000);
   424                          }
   425                          tsc_test_rounds = 0;
   426                  } else
   427                          tsc_test_rounds--;
   428
   429                  /*
   430                   * Clean up for the next round.  It is safe to reset the
   431                   * ingress barrier because at this point we know the AP
   432                   * has reached the egress barrier.
   433                   */
   434                  memset(&tsc_ap_status, 0, sizeof tsc_ap_status);
   435                  memset(&tsc_bp_status, 0, sizeof tsc_bp_status);
   436                  tsc_ingress_barrier = 0;
   437                  if (tsc_test_rounds == 0)
   438                          tsc_ap_name = NULL;

It's possible the BP's store:

tsc_ap_name = NULL;

is not *always* globally visible by the time the next AP reaches the
tsc_ap_name CAS, triggering the panic.  If so, we could force the
store to complete with membar_producer().  tsc_ap_name should be
volatile, too.
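
For illustration, the ordering being described maps onto the usual
release/acquire pairing.  Below is a minimal userland sketch using C11
atomics; the names are made up and this is not the kernel code, it only
mirrors the contract: membar_producer() on the BP side plays the role of
the release, and the AP's atomic_cas_ptr() plays the role of the
acquiring CAS.

#include <stdatomic.h>
#include <stddef.h>

/* Analogue of tsc_ap_name: reset by the "BP" thread, claimed by an "AP". */
static _Atomic(const char *) ap_name = NULL;

/* BP side: clear the slot and publish the store (membar_producer() analogue). */
static void
bp_finish_round(void)
{
        atomic_store_explicit(&ap_name, NULL, memory_order_release);
}

/* AP side: claim the slot; returns 0 if another AP still owns it. */
static int
ap_enter_test(const char *myname)
{
        const char *expected = NULL;

        return atomic_compare_exchange_strong_explicit(&ap_name, &expected,
            myname, memory_order_acquire, memory_order_relaxed);
}

The only point of the pairing is that the BP's NULL store must be visible
before any later AP can attempt the CAS; whether that is actually what
goes wrong on this machine is the open question above.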

OTOH, it's possible this particular check is not the right thing here.
My intention is correct... we definitely don't want more than one AP
in the sync test at any given moment.  But this tsc_ap_name handshake
thing may be the wrong way to assert that.

Index: tsc.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.30
diff -u -p -r1.30 tsc.c
--- tsc.c   24 Oct 2022 00:56:33 -  1.30
+++ tsc.c   14 Dec 2022 18:12:54 -
@@ -372,7 +372,7 @@ struct tsc_test_status {
 struct tsc_test_status tsc_ap_status;  /* Test results from AP */
 struct tsc_test_status tsc_bp_status;  /* Test results from BP */
 uint64_t tsc_test_cycles;  /* [p] TSC cycles per test round */
-const char *tsc_ap_name;   /* [b] Name of AP running test */
+volatile const char *tsc_ap_name;  /* [b] Name of AP running test */
 volatile u_int tsc_egress_barrier; /* [a] Test end barrier */
 volatile u_int tsc_ingress_barrier;/* [a] Test start barrier */
 volatile u_int tsc_test_rounds;/* [p] Remaining test rounds */
@@ -434,8 +434,10 @@ tsc_test_sync_bp(struct cpu_info *ci)
                 memset(&tsc_ap_status, 0, sizeof tsc_ap_status);
                 memset(&tsc_bp_status, 0, sizeof tsc_bp_status);
                 tsc_ingress_barrier = 0;
-                if (tsc_test_rounds == 0)
+                if (tsc_test_rounds == 0) {
                         tsc_ap_name = NULL;
+                        membar_producer();
+                }

Re: acme-client canary corrupted issue

2022-12-14 Thread Theo Buehler
> Try this

ok tb

> 
> Index: revokeproc.c
> ===
> RCS file: /home/cvs/src/usr.sbin/acme-client/revokeproc.c,v
> retrieving revision 1.19
> diff -u -p -r1.19 revokeproc.c
> --- revokeproc.c  22 Nov 2021 08:26:08 -  1.19
> +++ revokeproc.c  14 Dec 2022 14:16:46 -
> @@ -239,6 +239,7 @@ revokeproc(int fd, const char *certfile,
>   goto out;
>   }
>   force = 2;
> + continue;
>   }
>   if (found[j]++) {
>   if (revocate) {
> 



Re: acme-client canary corrupted issue

2022-12-14 Thread Renaud Allard



On 12/14/22 15:56, Otto Moerbeek wrote:

On Wed, Dec 14, 2022 at 03:51:44PM +0100, Renaud Allard wrote:




On 12/14/22 14:44, Theo de Raadt wrote:

sysctl kern.nosuidcoredump=3

mkdir /var/crash/acme-client

and then try to reproduce, and see if a core file is delivered there.
This coredump mechanism was added to capture some hard-to-capture coredumps,
you can see more info in core(5) and sysctl(3)



Thanks

I have been able to reproduce it reliably with the staging API, however,
there is no core dump generated in /var/crash/acme-client.

To reproduce it, you need a certificate with alternative names using
multiple different domains. Generate a cert, then fully remove one of the
domains and ask for a forced reissue.

I tried with following Otto patch from today, and it seems it solves the
issue.


Are you sure you attached the right patch?

-Otto



Index: acctproc.c
===
RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
retrieving revision 1.23
diff -u -p -r1.23 acctproc.c
--- acctproc.c  14 Jan 2022 09:20:18 -  1.23
+++ acctproc.c  14 Dec 2022 11:06:45 -
@@ -439,6 +439,7 @@ op_sign(int fd, EVP_PKEY *pkey, enum acc

rc = 1;
  out:
+   ECDSA_SIG_free(ec_sig);
EVP_MD_CTX_free(ctx);
free(pay);
free(sign);





OK, with both patches (one from Otto and the other from Theo B; sorry, I 
mistook the first patch author) and 4 tries, I have not gotten the crash 
anymore.


Index: revokeproc.c
===
RCS file: /home/cvs/src/usr.sbin/acme-client/revokeproc.c,v
retrieving revision 1.19
diff -u -p -r1.19 revokeproc.c
--- revokeproc.c22 Nov 2021 08:26:08 -  1.19
+++ revokeproc.c14 Dec 2022 14:16:46 -
@@ -239,6 +239,7 @@ revokeproc(int fd, const char *certfile,
goto out;
}
force = 2;
+   continue;
}
if (found[j]++) {
if (revocate) {





Re: acme-client canary corrupted issue

2022-12-14 Thread Renaud Allard



On 12/14/22 15:56, Otto Moerbeek wrote:

On Wed, Dec 14, 2022 at 03:51:44PM +0100, Renaud Allard wrote:




On 12/14/22 14:44, Theo de Raadt wrote:

sysctl kern.nosuidcoredump=3

mkdir /var/crash/acme-client

and then try to reproduce, and see if a core file is delivered there.
This coredump mechanism was added to capture some hard-to-capture coredumps,
you can see more info in core(5) and sysctl(3)



Thanks

I have been able to reproduce it reliably with the staging API, however,
there is no core dump generated in /var/crash/acme-client.

To reproduce it, you need a certificate with alternative names using
multiple different domains. Generate a cert, then fully remove one of the
domains and ask for a forced reissue.

I tried with following Otto patch from today, and it seems it solves the
issue.


Are you sure you attached the right patch?

-Otto




Ahh, that's a strange one. On the first run, the crash didn't happen 
anymore with that patch, but I retried again to be sure and the crash is 
still there.
I will try again with your "continue" patch too, and make 3 tries to be 
sure.

Would a ktrace help?




Index: acctproc.c
===
RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
retrieving revision 1.23
diff -u -p -r1.23 acctproc.c
--- acctproc.c  14 Jan 2022 09:20:18 -  1.23
+++ acctproc.c  14 Dec 2022 11:06:45 -
@@ -439,6 +439,7 @@ op_sign(int fd, EVP_PKEY *pkey, enum acc

rc = 1;
  out:
+   ECDSA_SIG_free(ec_sig);
EVP_MD_CTX_free(ctx);
free(pay);
free(sign);







Re: acme-client canary corrupted issue

2022-12-14 Thread Otto Moerbeek
On Wed, Dec 14, 2022 at 03:51:44PM +0100, Renaud Allard wrote:

> 
> 
> On 12/14/22 14:44, Theo de Raadt wrote:
> > sysctl kern.nosuidcoredump=3
> > 
> > mkdir /var/crash/acme-client
> > 
> > and then try to reproduce, and see if a core file is delivered there.
> > This coredump mechanism was added to capture some hard-to-capture coredumps,
> > you can see more info in core(5) and sysctl(3)
> > 
> 
> Thanks
> 
> I have been able to reproduce it reliably with the staging API, however,
> there is no core dump generated in /var/crash/acme-client.
> 
> To reproduce it, you need a certificate with alternative names using
> multiple different domains. Generate a cert, then fully remove one of the
> domains and ask for a forced reissue.
> 
> I tried with following Otto patch from today, and it seems it solves the
> issue.

Are you sure you attached the right patch?

-Otto

> 
> Index: acctproc.c
> ===
> RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
> retrieving revision 1.23
> diff -u -p -r1.23 acctproc.c
> --- acctproc.c14 Jan 2022 09:20:18 -  1.23
> +++ acctproc.c14 Dec 2022 11:06:45 -
> @@ -439,6 +439,7 @@ op_sign(int fd, EVP_PKEY *pkey, enum acc
> 
>   rc = 1;
>  out:
> + ECDSA_SIG_free(ec_sig);
>   EVP_MD_CTX_free(ctx);
>   free(pay);
>   free(sign);




Re: acme-client canary corrupted issue

2022-12-14 Thread Renaud Allard



On 12/14/22 14:44, Theo de Raadt wrote:

sysctl kern.nosuidcoredump=3

mkdir /var/crash/acme-client

and then try to reproduce, and see if a core file is delivered there.
This coredump mechanism was added to capture some hard-to-capture coredumps,
you can see more info in core(5) and sysctl(3)



Thanks

I have been able to reproduce it reliably with the staging API, however, 
there is no core dump generated in /var/crash/acme-client.


To reproduce it, you need a certificate with alternative names using 
multiple different domains. Generate a cert, then fully remove one of 
the domains and ask for a forced reissue.


I tried with following Otto patch from today, and it seems it solves the 
issue.


Index: acctproc.c
===
RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
retrieving revision 1.23
diff -u -p -r1.23 acctproc.c
--- acctproc.c  14 Jan 2022 09:20:18 -  1.23
+++ acctproc.c  14 Dec 2022 11:06:45 -
@@ -439,6 +439,7 @@ op_sign(int fd, EVP_PKEY *pkey, enum acc

rc = 1;
 out:
+   ECDSA_SIG_free(ec_sig);
EVP_MD_CTX_free(ctx);
free(pay);
free(sign);
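
As a side note on what this hunk does: it releases an ECDSA_SIG (ec_sig)
at op_sign()'s cleanup label, so on its own it looks like a
per-signing-operation leak fix rather than a fix for the heap corruption
being chased in this thread.  For readers unfamiliar with why such an
object exists there, the usual pattern for an ES256 JWS is to parse the
DER signature into an ECDSA_SIG, copy out r and s at fixed width, and
then free it.  A self-contained sketch of that pattern follows; it is
illustrative only, with hypothetical names, not acme-client's actual
op_sign() code.

#include <openssl/ecdsa.h>
#include <openssl/bn.h>
#include <string.h>

/*
 * Illustration only: convert a DER-encoded ECDSA signature into the
 * fixed-width r||s form used by JWS.  "out" must hold 2 * fieldlen bytes.
 * The intermediate ECDSA_SIG is freed on every path -- the kind of object
 * the patch above starts releasing.
 */
static int
der_sig_to_raw(const unsigned char *der, long derlen,
    unsigned char *out, size_t fieldlen)
{
        ECDSA_SIG *sig = NULL;
        const BIGNUM *r, *s;
        int rc = 0;

        if ((sig = d2i_ECDSA_SIG(NULL, &der, derlen)) == NULL)
                goto out;
        ECDSA_SIG_get0(sig, &r, &s);
        if ((size_t)BN_num_bytes(r) > fieldlen ||
            (size_t)BN_num_bytes(s) > fieldlen)
                goto out;
        /* Left-pad r and s to the curve's field width (32 bytes for P-256). */
        memset(out, 0, 2 * fieldlen);
        BN_bn2bin(r, out + fieldlen - BN_num_bytes(r));
        BN_bn2bin(s, out + 2 * fieldlen - BN_num_bytes(s));
        rc = 1;
 out:
        ECDSA_SIG_free(sig);    /* safe on NULL; without it each call leaks */
        return rc;
}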




Re: acme-client canary corrupted issue

2022-12-14 Thread Otto Moerbeek
On Wed, Dec 14, 2022 at 12:30:25PM +0100, Renaud Allard wrote:

> Hi Otto,
> 
> 
> On 12/14/22 12:01, Otto Moerbeek wrote:
> > On Tue, Dec 13, 2022 at 10:34:53AM +0100, Renaud Allard wrote:
> > 
> > > Hello,
> > > 
> > > I was force renewing some certs because I removed some domains from
> > > the cert, and got this:
> > > acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 
> > > 0xb0@0xb0
> > > 
> > > I am using vm.malloc_conf=SUR>>
> > > 
> > > Best Regards
> > 
> > 
> > I cannot reproduce with several attempts. Please include details on
> > platform and version.
> > 
> > Can you show a run with -v on? That gives a hint where the problem
> > occurs.
> > 
> > Do you get a core dump? If so, try to get a backtrace.
> > 
> 
> 
> It's quite hard to reproduce, I only had it once when I shrank the
> alternative names involved in one certificate. There was no core dump.
> 
> This was produced on 7.2-stable amd64
> account and domain keys are ecdsa
> 
> I ran it with -vvF and could get my run log thanks to tmux back buffer.
> I will skip all the verification/certs babble
> 
> isildur# acme-client -vvF arnor.org
> 
> acme-client: /somewhere/arnor.org.key: loaded domain key
> 
> acme-client: /etc/acme/letsencrypt-privkey.pem: loaded account key
> 
> acme-client: /somewhere/arnor.org.crt: certificate valid: 74 days left
> 
> acme-client: /somewhere/arnor.org.crt: domain list changed, forcing renewal
> acme-client: https://acme-v02.api.letsencrypt.org/directory: directories
> 
> acme-client: acme-v02.api.letsencrypt.org: DNS: 172.65.32.248
> 
>  lots of standard certs/verif dialog *
> -----END CERTIFICATE----- ] (5800 bytes)
> 
> acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0
> acme-client: /somewhere/arnor.org.crt: created
> 
> acme-client: /somewhere/arnor.org.fullchain.pem: created
> 
> acme-client: signal: revokeproc(53931): Abort trap
> 
> Best Regards


Try this

-Otto

Index: revokeproc.c
===
RCS file: /home/cvs/src/usr.sbin/acme-client/revokeproc.c,v
retrieving revision 1.19
diff -u -p -r1.19 revokeproc.c
--- revokeproc.c22 Nov 2021 08:26:08 -  1.19
+++ revokeproc.c14 Dec 2022 14:16:46 -
@@ -239,6 +239,7 @@ revokeproc(int fd, const char *certfile,
goto out;
}
force = 2;
+   continue;
}
if (found[j]++) {
if (revocate) {
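
For context, the diff hunk shows the loop this sits in: each SAN in the
existing certificate is looked up in the configured name list, and
found[j]++ afterwards keeps a per-name count.  When a name has been
removed from the config, the lookup falls off the end, and without the
continue the fall-through found[j]++ writes one element past the array,
which is exactly the kind of out-of-bounds heap write that malloc's
chunk-canary check reports at free() time.  A rough sketch of that shape
follows; the identifiers and error handling are guesses inferred from the
context lines, not the actual revokeproc.c code.

#include <err.h>
#include <stdlib.h>
#include <string.h>

/*
 * Rough sketch of the loop implied by the diff context above; names are
 * guesses, not the actual revokeproc.c code.
 */
static int
check_sans(const char *const *san, size_t sansz,
    const char *const *alts, size_t altsz, int revocate)
{
        size_t *found, i, j;
        int force = 0;

        if ((found = calloc(altsz, sizeof(*found))) == NULL)
                err(1, "calloc");
        for (i = 0; i < sansz; i++) {
                /* Look up this certificate SAN in the configured names. */
                for (j = 0; j < altsz; j++)
                        if (strcmp(san[i], alts[j]) == 0)
                                break;
                if (j == altsz) {
                        /* Name was removed from the config: reissue. */
                        if (revocate)
                                errx(1, "%s: unexpected SAN", san[i]);
                        force = 2;
                        continue;       /* the fix: j == altsz is not a valid index */
                }
                if (found[j]++)         /* without the continue this can write found[altsz] */
                        warnx("%s: duplicated SAN", san[i]);
        }
        free(found);    /* malloc's canary check would catch the overflow here */
        return force;
}

This would also match the report earlier in the thread: the abort comes
from revokeproc, and only after domains were removed from the
certificate.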



Re: acme-client canary corrupted issue

2022-12-14 Thread Theo de Raadt
sysctl kern.nosuidcoredump=3

mkdir /var/crash/acme-client

and then try to reproduce, and see if a core file is delivered there.
This coredump mechanism was added to capture some hard-to-capture coredumps,
you can see more info in core(5) and sysctl(3)

Renaud Allard  wrote:

> Hi Otto,
> 
> 
> On 12/14/22 12:01, Otto Moerbeek wrote:
> > On Tue, Dec 13, 2022 at 10:34:53AM +0100, Renaud Allard wrote:
> > 
> >> Hello,
> >>
> >> I was force renewing some certs because I removed some domains from
> >> the cert, and got this:
> >> acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 
> >> 0xb0@0xb0
> >>
> >> I am using vm.malloc_conf=SUR>>
> >>
> >> Best Regards
> > I cannot reproduce with several attempts. Please include details on
> > platform and version.
> > Can you show a run with -v on? That gives a hint where the problem
> > occurs.
> > Do you get a core dump? If so, try to get a backtrace.
> > 
> 
> 
> It's quite hard to reproduce, I only had it once when I shrank the
> alternative names involved in one certificate. There was no core dump.
> 
> This was produced on 7.2-stable amd64
> account and domain keys are ecdsa
> 
> I ran it with -vvF and could get my run log thanks to tmux back buffer.
> I will skip all the verification/certs babble
> 
> isildur# acme-client -vvF arnor.org
> acme-client: /somewhere/arnor.org.key: loaded domain key
> acme-client: /etc/acme/letsencrypt-privkey.pem: loaded account key
> acme-client: /somewhere/arnor.org.crt: certificate valid: 74 days left
> acme-client: /somewhere/arnor.org.crt: domain list changed, forcing renewal
> acme-client: https://acme-v02.api.letsencrypt.org/directory: directories
> acme-client: acme-v02.api.letsencrypt.org: DNS: 172.65.32.248
>  lots of standard certs/verif dialog *
> -----END CERTIFICATE----- ] (5800 bytes)
> acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0
> acme-client: /somewhere/arnor.org.crt: created
> acme-client: /somewhere/arnor.org.fullchain.pem: created
> acme-client: signal: revokeproc(53931): Abort trap
> 
> Best Regards



Re: acme-client canary corrupted issue

2022-12-14 Thread Renaud Allard

Hi Otto,


On 12/14/22 12:01, Otto Moerbeek wrote:

On Tue, Dec 13, 2022 at 10:34:53AM +0100, Renaud Allard wrote:


Hello,

I was force renewing some certs because I removed some domains from
the cert, and got this:
acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0

I am using vm.malloc_conf=SUR>>

Best Regards



I cannot reproduce with several attempts. Please include details on
platform and version.

Can you show a run with -v on? That gives a hint where the problem
occurs.

Do you get a core dump? If so, try to get a backtrace.




It's quite hard to reproduce, I only had it once when I shrank the 
alternative names involved in one certificate. There was no core dump.


This was produced on 7.2-stable amd64
account and domain keys are ecdsa

I ran it with -vvF and could get my run log thanks to tmux back buffer.
I will skip all the verification/certs babble

isildur# acme-client -vvF arnor.org
acme-client: /somewhere/arnor.org.key: loaded domain key
acme-client: /etc/acme/letsencrypt-privkey.pem: loaded account key
acme-client: /somewhere/arnor.org.crt: certificate valid: 74 days left
acme-client: /somewhere/arnor.org.crt: domain list changed, forcing renewal
acme-client: https://acme-v02.api.letsencrypt.org/directory: directories
acme-client: acme-v02.api.letsencrypt.org: DNS: 172.65.32.248

 lots of standard certs/verif dialog *
-----END CERTIFICATE----- ] (5800 bytes)

acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0
acme-client: /somewhere/arnor.org.crt: created
acme-client: /somewhere/arnor.org.fullchain.pem: created

acme-client: signal: revokeproc(53931): Abort trap

Best Regards




Re: acme-client canary corrupted issue

2022-12-14 Thread Otto Moerbeek
On Tue, Dec 13, 2022 at 10:34:53AM +0100, Renaud Allard wrote:

> Hello,
> 
> I was force renewing some certs because I removed some domains from
> the cert, and got this:
> acme-client(53931) in free(): chunk canary corrupted 0xa06cb09db00 0xb0@0xb0
> 
> I am using vm.malloc_conf=SUR>>
> 
> Best Regards


I cannot reproduce with several attempts. Please include details on
platform and version.

Can you show a run with -v on? That gives a hint where the problem
occurs.

Do you get a core dump? If so, try to get a backtrace.

-Otto