Re: [coreboot] GM45 S3 resume issues
[Resending after accidentally replying off-list]

On 2015-11-12 14:32, Nico Huber wrote:
> On 12.11.2015 04:37, Patrick 'P. J.' McDermott wrote:
>> On 2015-11-11 16:50, Nico Huber wrote:
>>> Hi,
>>>
>>> On 11.11.2015 00:49, Patrick 'P. J.' McDermott wrote:
>>>> I've been looking into S3 resume on GM45 mainboards, which often fails
>>>> in rather interesting ways.
>>> Well, the S3 support wasn't really tested during GM45 development. Maybe
>>> it's just plainly broken. My development system at work (roda/rk9)
>>> doesn't resume because of another problem (but didn't fail raminit on
>>> the resume path in 3 of 3 tries). So it will need some work before I can
>>> test this.
>>
>> Ah, OK. What kind of other problem? In addition to the raminit reset,
>> I've seen resume fail by an SMM hang and in other ways.
> There was a minor flaw in the mainboard code:
> http://review.coreboot.org/#/q/topic:rk9-resume

Ah, I see. Yeah, X200 and T400 romstages don't have that flaw.

> With patches applied, it works halfway reliably: about 30 good
> suspend-resume cycles before it fails. Only eye-catching thing in dmesg
> was a warning about the backlight being already enabled. I don't yet
> have a serial log for a failed resume.

How does it fail?

> One more thing that came to mind: The reset after a failed
> receive-enable calibration is kind of wanted. IIRC, we left the watchdog
> enabled for the case of a failing raminit. But I don't remember what the
> exact failure was. It wasn't very unlikely to occur (> 1/1000). If we
> run into this on the resume path too, it might never work reliably :-/

--
Patrick "P. J." McDermott
http://www.pehjota.net/
Lead Developer, ProteanOS
http://www.proteanos.com/

--
coreboot mailing list: coreboot@coreboot.org
http://www.coreboot.org/mailman/listinfo/coreboot
Re: [coreboot] GM45 S3 resume issues
On 2015-11-12 13:55, Nico Huber wrote:
> Hi,
>
> had a look at your logs:
>
> On 11.11.2015 00:49, Patrick 'P. J.' McDermott wrote:
>> These systems fail to resume in one of the following ways:
>>
>> * S3 resume (indicated by the SLP_TYP bit) is detected, SLP_TYP is
>> cleared, DRAM receive-enable calibration fails with a timing
>> under/overflow, the system resets, and coreboot boots normally into
>> the payload (with the sleep LED still on) because SLP_TYP is now
>> unset. See x200-resume-fail-receive-enable-calibration.log and
>> t400-resume-fail-receive-enable-calibration.log.
>> * S3 resume is detected, SLP_TYP is cleared, raminit and the rest of
>> romstage completes without error, but then something between the
>> southbridge's smm_init() and cpu_initialize() hangs (maybe the
>> system is stuck in SMM). See x200-resume-fail-smm-hang.log and
>> t400-resume-fail-smm-hang.log.
> I have no idea yet about the SMM hang.
>
>> * S3 resume is detected, SLP_TYP is cleared, romstage completes, but
>> something within smm_init() hangs before dumping (possibly while
>> clearing [1]) TCO1_STS bits. See t400-resume-fail-tco-hang.log
> The logs are all a little garbled. It looks to me like this is exactly
> the same hang as in *-resume-fail-smm-hang.log.
>
>> There are a couple of other ways in which I've seen S3 resume fail, but
>> these are the most common.
>>
>> I thought of working around the first issue (clearing SLP_TYP, resetting
>> due to a raminit error, then booting into the payload) by clearing
>> SLP_TYP near the end of the romstage main() (after raminit). So I tried
>> the following patch:
>>
>> ---
>> diff --git a/src/mainboard/lenovo/x200/romstage.c
>> b/src/mainboard/lenovo/x200/romstage.c
>> index 86a973f..915baf2 100644
>> --- a/src/mainboard/lenovo/x200/romstage.c
>> +++ b/src/mainboard/lenovo/x200/romstage.c
>> @@ -103,10 +103,6 @@ void main(unsigned long bist)
>>  #if CONFIG_HAVE_ACPI_RESUME
>>  		printk(BIOS_DEBUG, "Resume from S3 detected.\n");
>>  		s3resume = 1;
>> -		/* Clear SLP_TYPE. This will break stage2 but
>> -		 * we care for that when we get there.
>> -		 */
>> -		outl(pm1_cnt & ~(7 << 10), DEFAULT_PMBASE + 0x04);
>>  #else
>>  		printk(BIOS_DEBUG, "Resume from S3 detected, but disabled.\n");
>>  #endif
>> @@ -190,6 +186,11 @@ void main(unsigned long bist)
>>
>>  		/* Magic for S3 resume */
>>  		pci_write_config32(PCI_DEV(0, 0, 0), D0F0_SKPD,
>> SKPAD_ACPI_S3_MAGIC);
>> +
>> +		/* Clear SLP_TYPE. This will break stage2 but
>> +		 * we care for that when we get there.
>> +		 */
>> +		outl(pm1_cnt & ~(7 << 10), DEFAULT_PMBASE + 0x04);
>>  	} else {
>>  		/* Magic for S3 resume */
>>  		pci_write_config32(PCI_DEV(0, 0, 0), D0F0_SKPD,
>> SKPAD_NORMAL_BOOT_MAGIC);
>> ---
>>
>> But that just made these errors even more frequent. Trying to resume
>> from S3 put the system into a reset loop with receive-enable calibration
>> errors (see x200-patched-resume-fail-receive-enable-loop.log). So
>> instead of rebooting into the payload or hanging, the system just resets
>> forever.
> This reset loop is very interesting. Did it ever end? It could mean
> the worst, i.e. the RAM lost its configuration (self refresh failed). I
> suspect that's the case as there is not much difference in the normal
> vs. the resume path until receive-enable calibration.

No, it didn't end. I once left it running for probably at least 10 or 20
minutes, so it must have gone through hundreds of raminit/reset cycles.

As shown in *-resume-fail-receive-enable-calibration.log, that kind of
raminit failure happens with an unpatched coreboot as well (more
commonly on the X200 than the SMM hang does). So it makes some sense
that patching romstage main() in the way that I did would cause that
error to happen in a loop. Basically, that patch fixed the problem of
losing SLP_TYP after reset but worsened the problem of raminit failing.
(It's a little odd that the loop /never/ ended, while sometimes an
unpatched coreboot would get past receive-enable calibration in the
resume path.)

What doesn't make sense is that receive-enable calibration fails only
when SLP_TYP is set, yet it always works when SLP_TYP is unset (as in a
normal boot, or with an unpatched coreboot after S3 resume is detected,
raminit fails, and the system resets).

--
Patrick "P. J." McDermott
http://www.pehjota.net/
Lead Developer, ProteanOS
http://www.proteanos.com/
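[Editor's sketch] The difference between the unpatched and patched behaviour described above can be modelled in a few lines of C. This is only a toy simulation, not coreboot code. It assumes that SLP_TYP survives the reset (which the reset-loop log suggests) and, as a deliberate simplification, that receive-enable calibration fails on every resume-path attempt when raminit_fails_on_resume is set. The names simulate_resume, outcome and check_simulation are hypothetical.

```c
#include <assert.h>

enum outcome { RESUMED, COLD_BOOTED, RESET_LOOP };

/*
 * clear_early != 0: SLP_TYP is cleared as soon as resume is detected
 *                   (unpatched romstage).
 * clear_early == 0: SLP_TYP is cleared only after raminit succeeded
 *                   (patched romstage).
 * raminit_fails_on_resume: simplifying assumption that calibration fails
 *                   on every attempt made on the resume path.
 */
static enum outcome simulate_resume(int clear_early,
				    int raminit_fails_on_resume,
				    int max_resets)
{
	int slp_typ_set = 1;	/* the OS requested S3 before suspending */

	for (int resets = 0; resets <= max_resets; resets++) {
		int s3resume = slp_typ_set;	/* romstage samples SLP_TYP */

		if (s3resume && clear_early)
			slp_typ_set = 0;

		if (s3resume && raminit_fails_on_resume)
			continue;	/* reset; slp_typ_set carries over */

		return s3resume ? RESUMED : COLD_BOOTED;
	}
	return RESET_LOOP;	/* gave up after max_resets resets */
}

/* Walks through the cases discussed in the message above. */
static void check_simulation(void)
{
	/* Unpatched: one failed resume attempt, then a normal boot. */
	assert(simulate_resume(1, 1, 100) == COLD_BOOTED);
	/* Patched: SLP_TYP is never cleared, so every boot retries resume. */
	assert(simulate_resume(0, 1, 100) == RESET_LOOP);
	/* If raminit succeeds on the resume path, both variants resume. */
	assert(simulate_resume(1, 0, 100) == RESUMED);
	assert(simulate_resume(0, 0, 100) == RESUMED);
}
```

The model does not explain why calibration fails on the resume path; it only shows why deferring the SLP_TYP clear turns a single failed resume into an endless loop under that assumption.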
Re: [coreboot] GM45 S3 resume issues
On 12.11.2015 04:37, Patrick 'P. J.' McDermott wrote:
> On 2015-11-11 16:50, Nico Huber wrote:
>> Hi,
>>
>> On 11.11.2015 00:49, Patrick 'P. J.' McDermott wrote:
>>> I've been looking into S3 resume on GM45 mainboards, which often fails
>>> in rather interesting ways.
>> Well, the S3 support wasn't really tested during GM45 development. Maybe
>> it's just plainly broken. My development system at work (roda/rk9)
>> doesn't resume because of another problem (but didn't fail raminit on
>> the resume path in 3 of 3 tries). So it will need some work before I can
>> test this.
>
> Ah, OK. What kind of other problem? In addition to the raminit reset,
> I've seen resume fail by an SMM hang and in other ways.

There was a minor flaw in the mainboard code:
http://review.coreboot.org/#/q/topic:rk9-resume

With patches applied, it works halfway reliably: about 30 good
suspend-resume cycles before it fails. Only eye-catching thing in dmesg
was a warning about the backlight being already enabled. I don't yet
have a serial log for a failed resume.

One more thing that came to mind: The reset after a failed
receive-enable calibration is kind of wanted. IIRC, we left the watchdog
enabled for the case of a failing raminit. But I don't remember what the
exact failure was. It wasn't very unlikely to occur (> 1/1000). If we
run into this on the resume path too, it might never work reliably :-/

Nico
Re: [coreboot] GM45 S3 resume issues
Hi,

had a look at your logs:

On 11.11.2015 00:49, Patrick 'P. J.' McDermott wrote:
> These systems fail to resume in one of the following ways:
>
> * S3 resume (indicated by the SLP_TYP bit) is detected, SLP_TYP is
> cleared, DRAM receive-enable calibration fails with a timing
> under/overflow, the system resets, and coreboot boots normally into
> the payload (with the sleep LED still on) because SLP_TYP is now
> unset. See x200-resume-fail-receive-enable-calibration.log and
> t400-resume-fail-receive-enable-calibration.log.
> * S3 resume is detected, SLP_TYP is cleared, raminit and the rest of
> romstage completes without error, but then something between the
> southbridge's smm_init() and cpu_initialize() hangs (maybe the
> system is stuck in SMM). See x200-resume-fail-smm-hang.log and
> t400-resume-fail-smm-hang.log.

I have no idea yet about the SMM hang.

> * S3 resume is detected, SLP_TYP is cleared, romstage completes, but
> something within smm_init() hangs before dumping (possibly while
> clearing [1]) TCO1_STS bits. See t400-resume-fail-tco-hang.log

The logs are all a little garbled. It looks to me like this is exactly
the same hang as in *-resume-fail-smm-hang.log.

> There are a couple of other ways in which I've seen S3 resume fail, but
> these are the most common.
>
> I thought of working around the first issue (clearing SLP_TYP, resetting
> due to a raminit error, then booting into the payload) by clearing
> SLP_TYP near the end of the romstage main() (after raminit). So I tried
> the following patch:
>
> ---
> diff --git a/src/mainboard/lenovo/x200/romstage.c
> b/src/mainboard/lenovo/x200/romstage.c
> index 86a973f..915baf2 100644
> --- a/src/mainboard/lenovo/x200/romstage.c
> +++ b/src/mainboard/lenovo/x200/romstage.c
> @@ -103,10 +103,6 @@ void main(unsigned long bist)
>  #if CONFIG_HAVE_ACPI_RESUME
>  		printk(BIOS_DEBUG, "Resume from S3 detected.\n");
>  		s3resume = 1;
> -		/* Clear SLP_TYPE. This will break stage2 but
> -		 * we care for that when we get there.
> -		 */
> -		outl(pm1_cnt & ~(7 << 10), DEFAULT_PMBASE + 0x04);
>  #else
>  		printk(BIOS_DEBUG, "Resume from S3 detected, but disabled.\n");
>  #endif
> @@ -190,6 +186,11 @@ void main(unsigned long bist)
>
>  		/* Magic for S3 resume */
>  		pci_write_config32(PCI_DEV(0, 0, 0), D0F0_SKPD,
> SKPAD_ACPI_S3_MAGIC);
> +
> +		/* Clear SLP_TYPE. This will break stage2 but
> +		 * we care for that when we get there.
> +		 */
> +		outl(pm1_cnt & ~(7 << 10), DEFAULT_PMBASE + 0x04);
>  	} else {
>  		/* Magic for S3 resume */
>  		pci_write_config32(PCI_DEV(0, 0, 0), D0F0_SKPD,
> SKPAD_NORMAL_BOOT_MAGIC);
> ---
>
> But that just made these errors even more frequent. Trying to resume
> from S3 put the system into a reset loop with receive-enable calibration
> errors (see x200-patched-resume-fail-receive-enable-loop.log). So
> instead of rebooting into the payload or hanging, the system just resets
> forever.

This reset loop is very interesting. Did it ever end? It could mean
the worst, i.e. the RAM lost its configuration (self refresh failed). I
suspect that's the case as there is not much difference in the normal
vs. the resume path until receive-enable calibration.

Nico
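[Editor's sketch] The expression pm1_cnt & ~(7 << 10) in the quoted patch clears a three-bit field at bits 12:10 of the PM1 control value, which is where SLP_TYP lives. The helpers below show that masking on plain integers, without any port I/O. The encoding SLP_TYP == 5 for S3 is an assumption taken from Intel ICH datasheets, not something stated in this thread, and all names here (slp_typ, clear_slp_typ, is_s3_resume, check_slp_typ) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SLP_TYP_SHIFT	10
#define SLP_TYP_MASK	(7u << SLP_TYP_SHIFT)	/* bits 12:10 */
#define SLP_TYP_S3	5u	/* assumption: ICH encoding for suspend-to-RAM */

/* Extract the SLP_TYP field from a PM1_CNT value. */
static unsigned int slp_typ(uint32_t pm1_cnt)
{
	return (pm1_cnt & SLP_TYP_MASK) >> SLP_TYP_SHIFT;
}

/* Same masking as the patch: zero SLP_TYP, keep every other bit. */
static uint32_t clear_slp_typ(uint32_t pm1_cnt)
{
	return pm1_cnt & ~SLP_TYP_MASK;
}

static int is_s3_resume(uint32_t pm1_cnt)
{
	return slp_typ(pm1_cnt) == SLP_TYP_S3;
}

/* Example values only; bit 0 stands in for any unrelated control bit. */
static void check_slp_typ(void)
{
	uint32_t pm1_cnt = (SLP_TYP_S3 << SLP_TYP_SHIFT) | 1u;

	assert(is_s3_resume(pm1_cnt));
	assert(clear_slp_typ(pm1_cnt) == 1u);	/* other bits untouched */
	assert(!is_s3_resume(clear_slp_typ(pm1_cnt)));
}
```

Once the field is cleared, a later boot can no longer tell that a resume was requested, which is the effect the first failure mode in the quoted list describes.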
Re: [coreboot] GM45 S3 resume issues
On 2015-11-11 16:50, Nico Huber wrote:
> Hi,
>
> On 11.11.2015 00:49, Patrick 'P. J.' McDermott wrote:
>> I've been looking into S3 resume on GM45 mainboards, which often fails
>> in rather interesting ways.
> Well, the S3 support wasn't really tested during GM45 development. Maybe
> it's just plainly broken. My development system at work (roda/rk9)
> doesn't resume because of another problem (but didn't fail raminit on
> the resume path in 3 of 3 tries). So it will need some work before I can
> test this.

Ah, OK. What kind of other problem? In addition to the raminit reset,
I've seen resume fail by an SMM hang and in other ways.

> Thanks for taking the time to test on different systems and looking into
> this. Can you try to find out which processors (model, stepping) and
> northbridge stepping were used? Also did you use the latest processor
> microcode?

I've seen S3 resume problems occur on X200 units with CPUID 10676
(microcode revision 0x6, stepping M0). I've also seen reports of
problems on X200 units with CPUID 1067A (microcode revision 0xA,
stepping R0, the latest available stepping/microcode). I have a T400
with CPUID 1067A (again, latest available microcode) that fails to
resume, but I've also seen a report of a T400 on which S3 resume works.

On all tested systems, the northbridge stepping is B3, the only
available stepping for this northbridge.

The logs in my previous message are from a couple of X200 units with
10676 and a T400 with 1067A.

--
Patrick "P. J." McDermott
http://www.pehjota.net/
Lead Developer, ProteanOS
http://www.proteanos.com/
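[Editor's sketch] The CPUID values quoted above (10676 and 1067A) are hexadecimal signatures as returned in EAX by CPUID leaf 1. The following decoder splits such a signature into family, model and stepping using the standard Intel field layout. It works on a plain integer rather than executing the CPUID instruction, and the names cpu_sig, decode_cpuid and check_cpuid are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cpu_sig {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/* Decode a CPUID leaf 1 EAX signature (standard Intel field layout). */
static struct cpu_sig decode_cpuid(uint32_t eax)
{
	unsigned int base_family = (eax >> 8) & 0xf;
	unsigned int base_model = (eax >> 4) & 0xf;
	unsigned int ext_model = (eax >> 16) & 0xf;
	unsigned int ext_family = (eax >> 20) & 0xff;
	struct cpu_sig s;

	s.stepping = eax & 0xf;

	/* The extended family is only added when the base family is 0xf. */
	s.family = base_family;
	if (base_family == 0xf)
		s.family += ext_family;

	/* The extended model applies to base families 0x6 and 0xf. */
	s.model = base_model;
	if (base_family == 0x6 || base_family == 0xf)
		s.model |= ext_model << 4;

	return s;
}

/* The two signatures mentioned in the message above. */
static void check_cpuid(void)
{
	struct cpu_sig a = decode_cpuid(0x10676);
	struct cpu_sig b = decode_cpuid(0x1067A);

	assert(a.family == 6 && a.model == 0x17 && a.stepping == 0x6);
	assert(b.family == 6 && b.model == 0x17 && b.stepping == 0xA);
}
```

Both signatures decode to family 6, model 0x17 and differ only in the stepping field (0x6 and 0xA), which the message names M0 and R0 respectively.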
Re: [coreboot] GM45 S3 resume issues
Hi,

On 11.11.2015 00:49, Patrick 'P. J.' McDermott wrote:
> I've been looking into S3 resume on GM45 mainboards, which often fails
> in rather interesting ways.

Well, the S3 support wasn't really tested during GM45 development. Maybe
it's just plainly broken. My development system at work (roda/rk9)
doesn't resume because of another problem (but didn't fail raminit on
the resume path in 3 of 3 tries). So it will need some work before I can
test this.

Thanks for taking the time to test on different systems and looking into
this. Can you try to find out which processors (model, stepping) and
northbridge stepping were used? Also did you use the latest processor
microcode?

> [1]: Tangentially, I noticed that the i82801ix reset_tco_status() says
> "Don't clear BOOT_STS before SECOND_TO_STS" when it clears
> BOOT_STS. In the next two lines it clears BOOT_STS if set. It
> never clears SECOND_TO_STS. Is this a bug? However, according to
> the ICH9 datasheet, there is no SECOND_TO_STS bit in TCO1_STS (the
> high bits of that register are reserved).

Those are R/WC (read/write-clear) bits. The bits get cleared by writing
a '1'. That's for convenience, as you can clear them all by writing back
the value you just read. If there is a SECOND_TO_STS bit and it's set,
it gets automagically cleared by the first write (by writing the value
we read before). Clearing BOOT_STS gets deferred to the second write.

Ah, just had a closer look at the datasheet: The bits are defined in the
next register. I'm not sure if it's valid to read/write both registers
at once with a 32-bit access, though. But it seems to work for other
chipsets (e.g. i82801gx, bd82x6x).

Nico
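[Editor's sketch] The write-1-to-clear behaviour described above, and the two-step clear that defers BOOT_STS, can be modelled in software as follows. This is a model of the semantics only and not the coreboot reset_tco_status() code. The bit positions STS_SECOND_TO and STS_BOOT are illustrative stand-ins rather than datasheet values, and the names rwc_reg, rwc_read, rwc_write, clear_status_deferred and check_rwc are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions, not taken from the ICH9 datasheet. */
#define STS_SECOND_TO	(1u << 1)	/* stands in for SECOND_TO_STS */
#define STS_BOOT	(1u << 2)	/* stands in for BOOT_STS */

/* Software model of a status register made of R/WC bits. */
struct rwc_reg {
	uint32_t value;
};

static uint32_t rwc_read(const struct rwc_reg *r)
{
	return r->value;
}

/* Writing 1 clears a bit; writing 0 leaves it unchanged. */
static void rwc_write(struct rwc_reg *r, uint32_t v)
{
	r->value &= ~v;
}

/*
 * Two-step clear: write back everything that was read except STS_BOOT,
 * then clear STS_BOOT in a second write if it was set.
 */
static uint32_t clear_status_deferred(struct rwc_reg *r)
{
	uint32_t sts = rwc_read(r);

	rwc_write(r, sts & ~STS_BOOT);	/* clears SECOND_TO and the rest */
	if (sts & STS_BOOT)
		rwc_write(r, STS_BOOT);	/* deferred second write */

	return sts;	/* the status as it was before clearing */
}

static void check_rwc(void)
{
	struct rwc_reg r = { STS_SECOND_TO | STS_BOOT };

	/* Writing zeros changes nothing. */
	rwc_write(&r, 0);
	assert(rwc_read(&r) == (STS_SECOND_TO | STS_BOOT));

	/* Both bits end up cleared, although neither is named twice. */
	assert(clear_status_deferred(&r) == (STS_SECOND_TO | STS_BOOT));
	assert(rwc_read(&r) == 0);
}
```

In this model the first write clears the SECOND_TO stand-in without the code ever naming it, which matches the explanation that writing back the value just read clears every set bit that is not masked out.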