Re: panic at reboot - tsc_test_sync_ap

Pedro Caetano Mon, 09 Jan 2023 09:45:19 -0800

Hi Scott, all,

Thank you for the follow up.
I haven't tried this diff yet. (This routers support an application that
doesn't allow much downtime)


I'm planning on testing this whenever a new patch forces me to reboot.
Worst case on next version upgrade.

Best,
Pedro

A segunda, 9/01/2023, 16:26, Scott Cheloha <[email protected]>
escreveu:

> Pedro,
>
> On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote:
> > On Wed, Dec 14, 2022 at 11:37:14AM +0000, Pedro Caetano wrote:
> > > Hi bugs@
> > >
> > > In the process of upgrading a pair of servers to release 7.2, the
> following
> > > panic was triggered after sysupgrade reboot. (dell poweredge R740)
> > >
> > > One of the reboots happened before syspatch, the other happened after
> > > applying the release patches.
> > >
> > > After powercycling, both servers managed to boot successfully.
> > >
> > > Please keep me copied as I'm not subscribed to bugs@
> > >
> > >
> > > Screenshot of the panic uploaded attached to this email.
> >
> > For reference:
> >
> > cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB
> 64b/line 16-way L2 cache, 8MB 64b/line 11-way L3 cache
> > cpu2: smt 0, core 5, package 0
> > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1
> > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at
> mainbus0: apid 26 (application process
> >
> > Somehow your machine is violating one of the TSC sync test sanity
> > checks.  The idea behind this one is that there should only be one AP
> > in the sync test at a time.
> >
> > At the start of each test, in tsc_test_sync_ap(), the AP sets
> > tsc_ap_name to its dv_xname.  It does this with an atomic CAS
> > expecting NULL to ensure no other AP is still running the sync test.
> > You're hitting this panic:
> >
> >    449  void
> >    450  tsc_test_sync_ap(struct cpu_info *ci)
> >    451  {
> >    452          if (!tsc_is_invariant)
> >    453                  return;
> >    454  #ifndef TSC_DEBUG
> >    455          if (!tsc_is_synchronized)
> >    456                  return;
> >    457  #endif
> >    458          /* The BP needs our name in order to report any
> problems. */
> >    459          if (atomic_cas_ptr(&tsc_ap_name, NULL,
> ci->ci_dev->dv_xname) != NULL) {
> >    460                  panic("%s: %s: tsc_ap_name is not NULL: %s",
> >    461                      __func__, ci->ci_dev->dv_xname, tsc_ap_name);
> >    462          }
> >
> > The BP is supposed to reset tsc_ap_name to NULL at the conclusion of
> > every sync test, from tsc_test_sync_bp():
> >
> >    415                  /*
> >    416                   * Report what happened.  Adjust the TSC's
> quality
> >    417                   * if this is the first time we've failed the
> test.
> >    418                   */
> >    419                  tsc_report_test_results();
> >    420                  if (tsc_ap_status.lag_count ||
> tsc_bp_status.lag_count) {
> >    421                          if (tsc_is_synchronized) {
> >    422                                  tsc_is_synchronized = 0;
> >    423
> tc_reset_quality(&tsc_timecounter, -1000);
> >    424                          }
> >    425                          tsc_test_rounds = 0;
> >    426                  } else
> >    427                          tsc_test_rounds--;
> >    428
> >    429                  /*
> >    430                   * Clean up for the next round.  It is safe to
> reset the
> >    431                   * ingress barrier because at this point we know
> the AP
> >    432                   * has reached the egress barrier.
> >    433                   */
> >    434                  memset(&tsc_ap_status, 0, sizeof tsc_ap_status);
> >    435                  memset(&tsc_bp_status, 0, sizeof tsc_bp_status);
> >    436                  tsc_ingress_barrier = 0;
> >    437                  if (tsc_test_rounds == 0)
> >    438                          tsc_ap_name = NULL;
> >
> > It's possible the BP's store:
> >
> >       tsc_ap_name = NULL;
> >
> > is not *always* globally visible by the time the next AP reaches the
> > tsc_ap_name CAS, triggering the panic.  If so, we could force the
> > store to complete with membar_producer().  tsc_ap_name should be
> > volatile, too.
> >
> > OTOH, it's possible this particular check is not the right thing here.
> > My intention is correct... we definitely don't want more than one AP
> > in the sync test at any given moment.  But this tsc_ap_name handshake
> > thing may be the wrong way to assert that.
>
> Just following up on this.  Have you seen the panic you reported again?
>

Re: panic at reboot - tsc_test_sync_ap

Reply via email to