Garrett (and other interested folks),

There's a brief exposé/tirade in-line below about resuscitating graphics hardware, the problem of the VBIOS, and our more particular situation with ATI graphics devices.

Prior to that, there are two high-level things that I might say about debugging S/R at present:

1. There is a *great* deal of room for improvement in this. As Randy describes, those of us on the team who developed the prototype (and the initial stuff now integrated) essentially relied on the debug version of the kernel, mdb, and the serial line to squirt debugging output to, especially whilst in the lower reaches of the Suspend/Resume code. We most certainly do need much better facilities, especially for use with the production (non-debug) kernel.

Development of some dtrace probes specific to power management is one consideration. In the nearer term I feel that we could benefit immediately even from some simple .d scripts that just trace the appropriate function boundaries in the CPR code and watch the attach and detach routines in the various drivers. This would at least let one see how far things are getting. We do have uadmin 3 22 and a couple of other quick hacks, but those are not the most robust or comprehensive of things.

One goal for expedient initial debugging (platform assessment), in my opinion, would be to be able to make a *single* trial run in which one could determine ALL the devices on the given platform that don't appear to support S/R. At present the S/R code (via uadmin 3 20) just tries to *do* a Suspend, and will report an error and then unwind at the first failing device. Even the uadmin 3 22 feature (which is intended to implement a software loopback, hence not calling the ACPI S3 method when it does make it through every device's Suspend command successfully) does actually invoke the Suspend command on each device, and will therefore also unwind if a failure is seen. We probably need an improved driver interface which allows us to determine (inquire) whether each driver thinks it implements S/R, without our actually having to invoke the Suspend command on each driver.
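
To make the contrast concrete, here is a minimal sketch in Python. It is purely illustrative: none of these names are Solaris interfaces, the driver list is made up, and the implements_sr flag stands in for whatever a real inquiry entry point would return. The function suspend_all models the behaviour described above (suspend each device in turn, unwind at the first failure), while enumerate_unsupported models the hypothetical inquiry that touches no hardware and reports every non-supporting driver in one pass.

```python
from dataclasses import dataclass


@dataclass
class Driver:
    """Toy stand-in for a device driver (hypothetical, not a Solaris type)."""
    name: str
    implements_sr: bool      # answer the driver would give if merely asked
    suspended: bool = False

    def suspend(self) -> bool:
        """Model of the Suspend command: succeeds only if S/R is implemented."""
        if self.implements_sr:
            self.suspended = True
        return self.implements_sr

    def resume(self) -> None:
        self.suspended = False


def suspend_all(drivers):
    """Suspend in order; on the first failure, resume what was suspended.

    Returns the name of the first failing driver, or None if all succeeded.
    """
    done = []
    for drv in drivers:
        if not drv.suspend():
            for prev in reversed(done):   # unwind in reverse order
                prev.resume()
            return drv.name
        done.append(drv)
    return None


def enumerate_unsupported(drivers):
    """Ask every driver whether it implements S/R; suspend nothing."""
    return [drv.name for drv in drivers if not drv.implements_sr]


# Made-up device list for illustration only.
drivers = [
    Driver("disk", True),
    Driver("audio", False),
    Driver("usb", True),
    Driver("fb", False),
]

first_failure = suspend_all(drivers)            # stops at "audio"
unsupported = enumerate_unsupported(drivers)    # finds "audio" and "fb"
print(first_failure, unsupported)
```

With the suspend-and-unwind approach one learns only about the first failing driver per trial run; the inquiry pass reports both non-supporting drivers at once. Note that a driver which wrongly claims support would still be caught only by an actual S/R test.
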
Such an inquiry interface would allow rapid enumeration of everything in the device tree that has a supporting vs. non-supporting driver. One could move on from there to try to eliminate those devices and drivers that don't, while doing an actual S/R test on the rest of the devices that think they do.

Of course there can be bugs even in drivers that think they do support the operation, when they are run in a context they haven't seen before. We saw this recently with the mpt driver (for a family of LSI Logic SCSI HBAs), for example, when SAS rather than parallel SCSI disks were plugged into it. The SAS code path in the driver was different, and it didn't implement S/R (nor, unfortunately, did it return FAILURE).

2. Another technique that can be handy (which we've used a bit) is to *remove* the drivers for problematic devices from the system while debugging the rest. Of course this is no good if it's a critical/core device such as the disk controller running the disk with the root filesystem on it, or the like. But one can bump out certain problematic drivers which are not running core hardware, such as audio, some of the USB devices, and even graphics drivers (in some cases).

There's a boot-time option (using -B unload=<module-name> I think - Randy can correct me on this), or one can simply move the driver aside temporarily so that it won't be found and hence won't be loaded during boot. You can either knock it out of /etc/driver_aliases, or just go to the directory where it happens to live (whether /kernel/drv, /kernel/drv/amd64, /platform/i86pc/kernel/drv, or /platform/i86pc/kernel/drv/amd64) and rename it temporarily.

(Other remarks are in-line below)

-db

Randy Fishel wrote:
> On Thu, 11 Dec 2008, Garrett D'Amore wrote:
>
>
>> So, I'm interested in ensuring that my driver properly suspend/resumes.
>> However, I'm having problems in that my platform doesn't resume properly
>> from a suspend, even without my driver loaded.
>>
>> Are there any hints that we can use to help us figure out how to debug
>> failures in resume? It would be helpful to have a developer's debugging
>> page for this stuff.
>>
>
> Can you hook up a serial line, or better yet, a serial console?
> Logging to the serial port is nearly all the way to power-off, and
> early in power-on. And the serial console starts pretty early as
> well.
>
> And I will see about getting a debugging page (or maybe even a wiki)
> started.
>
>
>> (And no, I cannot move to a supported platform debug, because the driver
>> is for hardware that is on the motherboard. Can I please also take a
>> second to bemoan the lack of suspend/resume support in ATI framebuffer
>> drivers? I'm starting to believe that we need to have a facility to
>> execute the BIOS on the video board for these things...)
>>

Yes, that sort of thing certainly is (and has been for several years now) our desire. It would, though, require a fundamental change in an industry area where Sun has not historically had any influence or participation.

The problem is that the historical design (I'm reluctant to use the term 'architecture') of the BIOS is such that it only expects to execute hardware initialization code (including that in option card BIOSes such as the VBIOS on graphics cards) at power-on reset (POR). Its design did not anticipate the need also to execute such code upon resume from S3. These power-related features now in the components and on the hardware platform are relatively new things, and represent a disruptor at the firmware level as well as further upstairs.

In fact, even if there were a hook to the VBIOS initialization code that the OS could get to, very often it could not be re-executed in any case, since it tends to rely on routines in the motherboard's BIOS, and we've seen that various routines in the main BIOS also become unmapped after the power-on reset sequence has been completed.
Typically the initialization entry points in the VBIOS are correspondingly unmapped after POR.

This leaves us in the difficult situation [at present] that we have to have a Solaris kernel driver that knows how to re-initialize the graphics hardware in question from cold iron, equivalent to what the VBIOS does at POR. In some cases we have grabbed a copy of the graphics card's VBIOS and then interpreted that in the Solaris device driver during the resume operation.

Having spent several months making one of these things work for the ATI RageXL chip, I can tell you that this is not a happy way to proceed. Often the documentation is poor or non-existent (sometimes because the vendors feel that it might reveal proprietary aspects of their chip architecture or something). Even when documentation can be procured, we have discovered that there are implementation bugs in some of the chips which have sometimes been band-aided with subsequent undocumented bits in the hardware, and these can be nearly impossible to learn about. One such thing in particular with the RageXL took us the better part of a month to discover.

An excellent answer would be a change in the BIOS architecture such that the same hardware initialization code could be executed whilst coming out of S3. That way we would continue to have a sensible situation in which those very low-level aspects are supported by the option vendor, and we wouldn't need to think about them in OS-land.

The other way to go, poorer in my opinion, is that we get the vendors' support to provide an OS-specific device driver that knows how to do the right thing(s). We currently have this situation with nVidia, and since they have a unified driver architecture, they don't have to provide us a different one for every graphics device they come out with. They just keep the one unified one up to date.

ATI has been a bigger problem historically.
First, because they didn't at first have a unified architecture and hence a single device driver capability. They now do have that, I understand (since R300, I believe), but we don't yet have a situation in which they are providing us a Solaris device driver to do S/R. Nor do we yet have the capacity to do that ourselves, as I believe we still do not (after a long, long effort) have the documentation and/or code examples to do the driver(s) ourselves. As Randy says, there has been some recent light in this tunnel, and there is talk that the means may now be available to us, but ...

>
> You can bemoan, but it may not help at all. I keep hearing that the
> open source driver works, but I don't think there has been success
> yet.
>
> ---- Randy
>
>
>> -- Garrett
>>
>>
> _______________________________________________
> pm-discuss mailing list
> pm-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/pm-discuss
>

--
; David J. Brown Ph.D. (cantab.)
; Solaris Engineering
; Sun Microsystems Inc.
; --
; Postal Address:                    Telephone: (650) 786-5558
; 4150 Network Circle, UMPK17-307    FAX:       (650) 786-5734
; Santa Clara, CA 95054              e-mail:    djb at sun.com