Hi Scott, all, Thank you for the follow up. I haven't tried this diff yet. (This routers support an application that doesn't allow much downtime)
I'm planning on testing this whenever a new patch forces me to reboot. Worst case on next version upgrade. Best, Pedro A segunda, 9/01/2023, 16:26, Scott Cheloha <[email protected]> escreveu: > Pedro, > > On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote: > > On Wed, Dec 14, 2022 at 11:37:14AM +0000, Pedro Caetano wrote: > > > Hi bugs@ > > > > > > In the process of upgrading a pair of servers to release 7.2, the > following > > > panic was triggered after sysupgrade reboot. (dell poweredge R740) > > > > > > One of the reboots happened before syspatch, the other happened after > > > applying the release patches. > > > > > > After powercycling, both servers managed to boot successfully. > > > > > > Please keep me copied as I'm not subscribed to bugs@ > > > > > > > > > Screenshot of the panic uploaded attached to this email. > > > > For reference: > > > > cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB > 64b/line 16-way L2 cache, 8MB 64b/line 11-way L3 cache > > cpu2: smt 0, core 5, package 0 > > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1 > > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at > mainbus0: apid 26 (application process > > > > Somehow your machine is violating one of the TSC sync test sanity > > checks. The idea behind this one is that there should only be one AP > > in the sync test at a time. > > > > At the start of each test, in tsc_test_sync_ap(), the AP sets > > tsc_ap_name to its dv_xname. It does this with an atomic CAS > > expecting NULL to ensure no other AP is still running the sync test. > > You're hitting this panic: > > > > 449 void > > 450 tsc_test_sync_ap(struct cpu_info *ci) > > 451 { > > 452 if (!tsc_is_invariant) > > 453 return; > > 454 #ifndef TSC_DEBUG > > 455 if (!tsc_is_synchronized) > > 456 return; > > 457 #endif > > 458 /* The BP needs our name in order to report any > problems. */ > > 459 if (atomic_cas_ptr(&tsc_ap_name, NULL, > ci->ci_dev->dv_xname) != NULL) { > > 460 panic("%s: %s: tsc_ap_name is not NULL: %s", > > 461 __func__, ci->ci_dev->dv_xname, tsc_ap_name); > > 462 } > > > > The BP is supposed to reset tsc_ap_name to NULL at the conclusion of > > every sync test, from tsc_test_sync_bp(): > > > > 415 /* > > 416 * Report what happened. Adjust the TSC's > quality > > 417 * if this is the first time we've failed the > test. > > 418 */ > > 419 tsc_report_test_results(); > > 420 if (tsc_ap_status.lag_count || > tsc_bp_status.lag_count) { > > 421 if (tsc_is_synchronized) { > > 422 tsc_is_synchronized = 0; > > 423 > tc_reset_quality(&tsc_timecounter, -1000); > > 424 } > > 425 tsc_test_rounds = 0; > > 426 } else > > 427 tsc_test_rounds--; > > 428 > > 429 /* > > 430 * Clean up for the next round. It is safe to > reset the > > 431 * ingress barrier because at this point we know > the AP > > 432 * has reached the egress barrier. > > 433 */ > > 434 memset(&tsc_ap_status, 0, sizeof tsc_ap_status); > > 435 memset(&tsc_bp_status, 0, sizeof tsc_bp_status); > > 436 tsc_ingress_barrier = 0; > > 437 if (tsc_test_rounds == 0) > > 438 tsc_ap_name = NULL; > > > > It's possible the BP's store: > > > > tsc_ap_name = NULL; > > > > is not *always* globally visible by the time the next AP reaches the > > tsc_ap_name CAS, triggering the panic. If so, we could force the > > store to complete with membar_producer(). tsc_ap_name should be > > volatile, too. > > > > OTOH, it's possible this particular check is not the right thing here. > > My intention is correct... we definitely don't want more than one AP > > in the sync test at any given moment. But this tsc_ap_name handshake > > thing may be the wrong way to assert that. > > Just following up on this. Have you seen the panic you reported again? >
