On Wed, Apr 3, 2019 at 7:21 PM Thomas Gleixner <t...@linutronix.de> wrote: > Btw, one of those links you provided > > https://www.manualslib.com/manual/1316475/Ecs-Ed20pa2.html?page=23 > > claims that you have to disable MWAIT as well. No idea why. Is MWAIT > disabled on your platform?
I don't have that option in the BIOS. However there's no mention of "mwait" nor "mwaitx" in /proc/cpuinfo. Checking our more general database of 202 x86_64 consumer products released over the last few years, only 19 of them have mwait/mwaitx listed there and they tend to be older platforms. > We have early-quirks.c in arch/x86/kernel/ for that. Nice, we should be able to work around the issue there, but I hope we can find something better... > For newer CPUs we might assume that: > > 1) The TSC and APIC timer are actually usable > > 2) The frequencies can be retrieved from CPUID or MSRs > > If #1 and #2 are reliable we can avoid the whole calibration and interrupt > delivery mess. Let's take a step back and re-examine the wider sequence of events here (which I've now done thanks to your pointers). 1. In very early boot, we face the TSC calibration challenge, arriving at determine_cpu_tsc_frequencies(). This function calculates CPU frequency and TSC frequency separately. For the CPU frequency, native_calibrate_cpu_early() tries to do it via cpu_khz_from_cpuid() with CPUID leaf 0x16, but this is not available on the platforms in question, which have max cpuid level 0x15. cpu_khz_from_msr() is then tried, but that doesn't support this platform either (looks like it only supports older SoC generations). So now we arrive in quick_pit_calibrate(), which directly programs the PIT and measures the TSC rate against the PIT ticks. When the 8254 is ungated in the BIOS, this function fails early because: if (pit_expect_msb(0xff, &tsc, &d1)) { /* returned at count=13, d1 is now 32118 */ for (i = 1; i <= MAX_QUICK_PIT_ITERATIONS; i++) { if (!pit_expect_msb(0xff-i, &delta, &d2)) /* returned at count=13, d2 is now 48595 */ break; delta -= tsc; /* delta is now 246741 */ /* * Extrapolate the error and fail fast if the error will * never be below 500 ppm. */ if (i == 1 && d1 + d2 >= (delta * MAX_QUICK_PIT_ITERATIONS) >> 11) return 0; /* this return statement is hit, the calculation is: 32118 + 48595 >= (246741 * 233) >> 11 */ so the error was too high (I'm not sure why) and determine_cpu_tsc_frequencies() records the CPU frequency as 0. Then, comparing to when the 8254 is gated via the BIOS option, the behaviour is surprising too. In that case, the quick_calibrate_pit() loop runs through up to i=128, at which point the error level is low enough to be accepted, calculating the CPU frequency as 4448MHz (4x higher than reality). During each loop iteration, pit_expect_msb() returns when the value changes at count=63 (compared to 13 in the PIT-ungated case). Does this suggest the PIT is not actually fully gated, it's just ticking a lot slower than otherwise? Anyway, back in determine_cpu_tsc_frequencies() with the CPU frequency calibration done, we now do TSC calibration. This one succeeds in all cases via native_calibrate_tsc() using CPUID leaf 0x15 to read the correct value. The TSC is 1094MHz. Then, in both cases (8254 gated or not), the CPU frequency calculation is discarded here, because it's wildly different from the TSC rate: else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz) cpu_khz = tsc_khz; So it seems that this code already behaves along the lines you describe: it gives more trust to the TSC value read in the modern way, and does not get upset if the CPU frequency calibration against the PIT didn't produce a meaningful result. 2. Significantly later during boot, x86_late_time_init() calls hpet_time_init() which sets up either the PIT or HPET. However, as far as I can see, there is no checking that the selected clock is actually ticking. In the case of these affected platforms with the 8254 gated, we sail right pass this point without a working clock source. 3. x86_late_time_init() then calls apic_intr_mode_init() -> apic_bsp_setup() -> setup_IO_APIC() and at this point we reach check_timer(), which attempts to verify/fixup IRQ0 delivery via the IO-APIC. At this point we check that jiffies increments, and if not, panic. 4. Some time later, naive_smp_prepare_cpus() calls setup_boot_APIC_clock() -> setup_APIC_timer() which registers the local APIC clocksource, replacing the previous PIT/HPET clocksource. There's no check to make sure that the new clocksource is ticking, as far as I can see. 5. Some time later, start_secondary() calls start_secondary_APIC_clock() -> setup_APIC_timer() registering the APIC clocksource (again? or just for another CPU core?). Hopefully that analysis helps refine/elaborate the plan a bit more... > That means we need the following decision logic: > > 1) If HPET is available in ACPI, boot normal. > > 2) If HPET is not available, verify that the PIT actually counts. If it > does, boot normal. > > If it does not either: > > 2A) Verify that this is a PCH 300/C240 and fiddle with that ISST bit. > > But that means that we need to chase PCH ids forever... (I found the ISST bit in the coreboot source code, which shows that the register is shared over multiple Intel SoC generations. I then searched for the register name online and found it documented in the 320/C240 public documentation, which I linked to. However that's not actually the platform in question. In this case we are working with Intel Apollo Lake N3350.) Anyway, I agree that doing it with PCI IDs would be painful. > 2B) Shrug and just avoid the whole PIT/HPET magic all over the place: > > - Avoid the interrupt delivery check in the IOAPIC code as it's > uninteresting in that case. Trivial to do. What do you mean by "in that case"? In the case of having an IOAPIC? >From my analysis above, this interrupt delivery check feels misplaced. Other parts of the clock setup code (e.g. where PIC, HPET and APIC timer are enabled) do not seem to check that the timers being set up actually work. If I were to try a kernel with no APIC/LAPIC support then Linux would boot with a broken PIT as the clock source without checking it. So why do we check it here specifically in the IOAPIC code? I see it does some tricks which are presumably needed on historical platforms, but maybe it could let boot continue even if it can't find a working IRQ0 setup? Or it could at least skip the check if IRQ0 was not working before the IOAPIC gets set up? If there is desire for some "check that the clocksource is actually ticking" panic logic, maybe this could be done after the local APIC timer is setup (which is ultimately the clock source selected and used), maybe it should even be done in arch-independent code? > - Prevent the TSC calibration code from touching PIT/HPET. It > should do that already when the TSC frequency can be retrieved > via CPUID or MSR. Should work, emphasis on should ... >From above, this seems to be working acceptably already. It does touch the PIT, but ultimately ignores the information that it provided. > - Prevent the APIC calibration code from touching PIT/HPET. That's > only happening right now when the TSC frequency comes from > the MSRs. No idea why the CPUID method does not provide that. Where's the APIC calibration code? > CPUID leaf 0x16 provides the bus frequency, so we can deduce the > APIC timer frequency from there and spare the whole APIC timer > calibration mess: > > ECX Bits 15 - 00: Bus (Reference) Frequency (in MHz). That's not available on this platform, plus https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf page 1-21 says that the data returned is actually marketing stuff, and shouldn't be treated as real. I think you mean CPUID leaf 0x15 instead. Thanks for your input! Daniel