Re: panic at reboot - tsc_test_sync_ap

Scott Cheloha Mon, 09 Jan 2023 08:27:17 -0800

Pedro,

On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote:
> On Wed, Dec 14, 2022 at 11:37:14AM +0000, Pedro Caetano wrote:
> > Hi bugs@
> > 
> > In the process of upgrading a pair of servers to release 7.2, the following
> > panic was triggered after sysupgrade reboot. (dell poweredge R740)
> > 
> > One of the reboots happened before syspatch, the other happened after
> > applying the release patches.
> > 
> > After powercycling, both servers managed to boot successfully.
> > 
> > Please keep me copied as I'm not subscribed to bugs@
> > 
> > 
> > Screenshot of the panic uploaded attached to this email.
> 
> For reference:
> 
> cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
> 16-way L2 cache, 8MB 64b/line 11-way L3 cache
> cpu2: smt 0, core 5, package 0
> panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1
> panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0: 
> apid 26 (application process
> 
> Somehow your machine is violating one of the TSC sync test sanity
> checks.  The idea behind this one is that there should only be one AP
> in the sync test at a time.
> 
> At the start of each test, in tsc_test_sync_ap(), the AP sets
> tsc_ap_name to its dv_xname.  It does this with an atomic CAS
> expecting NULL to ensure no other AP is still running the sync test.
> You're hitting this panic:
> 
>    449  void
>    450  tsc_test_sync_ap(struct cpu_info *ci)
>    451  {
>    452          if (!tsc_is_invariant)
>    453                  return;
>    454  #ifndef TSC_DEBUG
>    455          if (!tsc_is_synchronized)
>    456                  return;
>    457  #endif
>    458          /* The BP needs our name in order to report any problems. */
>    459          if (atomic_cas_ptr(&tsc_ap_name, NULL, ci->ci_dev->dv_xname) 
> != NULL) {
>    460                  panic("%s: %s: tsc_ap_name is not NULL: %s",
>    461                      __func__, ci->ci_dev->dv_xname, tsc_ap_name);
>    462          }
> 
> The BP is supposed to reset tsc_ap_name to NULL at the conclusion of
> every sync test, from tsc_test_sync_bp():
> 
>    415                  /*
>    416                   * Report what happened.  Adjust the TSC's quality
>    417                   * if this is the first time we've failed the test.
>    418                   */
>    419                  tsc_report_test_results();
>    420                  if (tsc_ap_status.lag_count || 
> tsc_bp_status.lag_count) {
>    421                          if (tsc_is_synchronized) {
>    422                                  tsc_is_synchronized = 0;
>    423                                  tc_reset_quality(&tsc_timecounter, 
> -1000);
>    424                          }
>    425                          tsc_test_rounds = 0;
>    426                  } else
>    427                          tsc_test_rounds--;
>    428
>    429                  /*
>    430                   * Clean up for the next round.  It is safe to reset 
> the
>    431                   * ingress barrier because at this point we know the 
> AP
>    432                   * has reached the egress barrier.
>    433                   */
>    434                  memset(&tsc_ap_status, 0, sizeof tsc_ap_status);
>    435                  memset(&tsc_bp_status, 0, sizeof tsc_bp_status);
>    436                  tsc_ingress_barrier = 0;
>    437                  if (tsc_test_rounds == 0)
>    438                          tsc_ap_name = NULL;
> 
> It's possible the BP's store:
> 
>       tsc_ap_name = NULL;
> 
> is not *always* globally visible by the time the next AP reaches the
> tsc_ap_name CAS, triggering the panic.  If so, we could force the
> store to complete with membar_producer().  tsc_ap_name should be
> volatile, too.
> 
> OTOH, it's possible this particular check is not the right thing here.
> My intention is correct... we definitely don't want more than one AP
> in the sync test at any given moment.  But this tsc_ap_name handshake
> thing may be the wrong way to assert that.


Just following up on this.  Have you seen the panic you reported again?

Re: panic at reboot - tsc_test_sync_ap

Reply via email to