Pedro,
On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote:
> On Wed, Dec 14, 2022 at 11:37:14AM +0000, Pedro Caetano wrote:
> > Hi bugs@
> >
> > In the process of upgrading a pair of servers to release 7.2, the following
> > panic was triggered after sysupgrade reboot. (dell poweredge R740)
> >
> > One of the reboots happened before syspatch, the other happened after
> > applying the release patches.
> >
> > After powercycling, both servers managed to boot successfully.
> >
> > Please keep me copied as I'm not subscribed to bugs@
> >
> >
> > Screenshot of the panic uploaded attached to this email.
>
> For reference:
>
> cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line
> 16-way L2 cache, 8MB 64b/line 11-way L3 cache
> cpu2: smt 0, core 5, package 0
> panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1
> panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0:
> apid 26 (application process
>
> Somehow your machine is violating one of the TSC sync test sanity
> checks. The idea behind this one is that there should only be one AP
> in the sync test at a time.
>
> At the start of each test, in tsc_test_sync_ap(), the AP sets
> tsc_ap_name to its dv_xname. It does this with an atomic CAS
> expecting NULL to ensure no other AP is still running the sync test.
> You're hitting this panic:
>
> 449 void
> 450 tsc_test_sync_ap(struct cpu_info *ci)
> 451 {
> 452 if (!tsc_is_invariant)
> 453 return;
> 454 #ifndef TSC_DEBUG
> 455 if (!tsc_is_synchronized)
> 456 return;
> 457 #endif
> 458 /* The BP needs our name in order to report any problems. */
> 459 if (atomic_cas_ptr(&tsc_ap_name, NULL, ci->ci_dev->dv_xname)
> != NULL) {
> 460 panic("%s: %s: tsc_ap_name is not NULL: %s",
> 461 __func__, ci->ci_dev->dv_xname, tsc_ap_name);
> 462 }
>
> The BP is supposed to reset tsc_ap_name to NULL at the conclusion of
> every sync test, from tsc_test_sync_bp():
>
> 415 /*
> 416 * Report what happened. Adjust the TSC's quality
> 417 * if this is the first time we've failed the test.
> 418 */
> 419 tsc_report_test_results();
> 420 if (tsc_ap_status.lag_count ||
> tsc_bp_status.lag_count) {
> 421 if (tsc_is_synchronized) {
> 422 tsc_is_synchronized = 0;
> 423 tc_reset_quality(&tsc_timecounter,
> -1000);
> 424 }
> 425 tsc_test_rounds = 0;
> 426 } else
> 427 tsc_test_rounds--;
> 428
> 429 /*
> 430 * Clean up for the next round. It is safe to reset
> the
> 431 * ingress barrier because at this point we know the
> AP
> 432 * has reached the egress barrier.
> 433 */
> 434 memset(&tsc_ap_status, 0, sizeof tsc_ap_status);
> 435 memset(&tsc_bp_status, 0, sizeof tsc_bp_status);
> 436 tsc_ingress_barrier = 0;
> 437 if (tsc_test_rounds == 0)
> 438 tsc_ap_name = NULL;
>
> It's possible the BP's store:
>
> tsc_ap_name = NULL;
>
> is not *always* globally visible by the time the next AP reaches the
> tsc_ap_name CAS, triggering the panic. If so, we could force the
> store to complete with membar_producer(). tsc_ap_name should be
> volatile, too.
>
> OTOH, it's possible this particular check is not the right thing here.
> My intention is correct... we definitely don't want more than one AP
> in the sync test at any given moment. But this tsc_ap_name handshake
> thing may be the wrong way to assert that.
Just following up on this. Have you seen the panic you reported again?