[pm-discuss] debugging suspend/resume failures...

David J. Brown Fri, 12 Dec 2008 11:05:39 -0800

Garrett (and interested other folks),

    There's a brief exposé/tirade about resuscitating graphics hardware, 
the problem of the VBIOS, and our more particular situation with ATI 
graphics devices in-line below.

Prior to that, there are two high-level things that I might say about 
debugging S/R at present:

1. There is a *great* deal of room for improvement in this.  As Randy 
describes, those of us on the team who developed the prototype (and the 
initial stuff now integrated) essentially relied on the debug version of 
the kernel, mdb and the serial line to squirt debugging output to - 
especially whilst in the lower reaches of the Suspend/Resume code.  We 
do most certainly need much better facilities - especially for use with 
the production (non-debug) kernel. 

    Development of some dtrace probes specific to power management is 
    one consideration, and in the nearer term I feel that we could 
    benefit immediately even from some simple .d scripts that just trace 
    various appropriate function boundaries in the CPR code, and by 
    watching the attach and detach routines in the various drivers.  
    This would let one at least see how far things are getting.  We do 
    have uadmin 3 22 and a couple of other quick hacks, but those are 
    not the most robust or comprehensive of things.

One goal for expedient initial debugging (platform assessment), in my 
opinion, would be to be able to make a *single* trial run in which one 
could determine ALL the devices on the given platform that don't appear 
to support S/R.  At present the S/R code (via uadmin 3 20) just tries to 
*do* a Suspend, and will report an error and then unwind at the first 
failing device.  Even the uadmin 3 22 feature (which intends to 
implement a software loopback - hence not calling the ACPI S3 method if 
it does make it through all the devices' Suspend command successfully) 
does actually invoke the Suspend command on each device, and will also 
therefore unwind if a failure is seen.  We probably need an improved 
driver interface which allows us to determine (inquire) whether each 
driver thinks it implements S/R, without our actually having to invoke 
the Suspend command on each driver.  This would allow rapid enumeration 
of everything in the dev tree that has a supporting vs. non-supporting 
driver, and one could move on from there to try to eliminate those 
devices and drivers that don't, while doing an actual S/R test on the 
rest of the devices that think they do.  Of course there can be bugs 
even in drivers that think they do support the operation, when they are 
run in a context they haven't seen before.  We saw this recently with 
the mpt driver (for a family of LSI Logic SCSI HBAs) when SAS rather 
than parallel SCSI disks were plugged into it, for example.  The SAS 
code path in the driver was different, and it didn't implement S/R (nor 
did it return FAILURE, unfortunately).
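
To make the contrast concrete, here is a minimal sketch in Python.  It is 
purely a model of the control flow, not kernel code.  The names Driver, 
supports_sr, suspend_all and survey are hypothetical, as are the device 
names in the example; the inquiry interface it models is the one being 
wished for above, not something that exists today.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Driver:
    """Toy stand-in for a driver instance (hypothetical model)."""
    name: str
    supports_sr: bool        # what the driver would answer to an inquiry
    suspended: bool = False

    def suspend(self) -> bool:
        """Model of the Suspend command: succeeds only if S/R is implemented."""
        if self.supports_sr:
            self.suspended = True
        return self.supports_sr

    def resume(self) -> None:
        self.suspended = False


def suspend_all(drivers: List[Driver]) -> Optional[str]:
    """Behaviour described above: stop and unwind at the first failure.

    Returns the name of the first failing driver, or None on success.
    """
    done: List[Driver] = []
    for drv in drivers:
        if not drv.suspend():
            for prev in reversed(done):   # unwind what was already suspended
                prev.resume()
            return drv.name
        done.append(drv)
    return None


def survey(drivers: List[Driver]) -> Tuple[List[str], List[str]]:
    """Wished-for behaviour: ask every driver, suspend nothing."""
    ok = [d.name for d in drivers if d.supports_sr]
    bad = [d.name for d in drivers if not d.supports_sr]
    return ok, bad


if __name__ == "__main__":
    tree = [Driver("disk", True), Driver("audio", False),
            Driver("net", True), Driver("fb", False)]
    print("first failure:", suspend_all(tree))
    print("survey (supporting, non-supporting):", survey(tree))
```

With suspend_all one learns about a single failing driver per run; with 
survey one gets the whole list in one pass.  Note that survey only 
reports what each driver claims, so a case like the mpt SAS path above 
would still need an actual S/R test to catch.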

2. Another technique that can be handy (which we've used a bit) is to 
    *remove* the drivers for problematic devices from the system while 
    debugging the rest.  Of course this is no good if it's a 
    critical/core device such as the disk controller running the disk 
    with the root filesystem on it, or the like.  But one can bump out 
    certain problematic drivers which are not running core hardware, 
    such as audio, some of the USB devices, and even graphics drivers 
    (in some cases).

    There's a boot-time option (using -B unload=<module-name> I think - 
    Randy can correct me on this), or one can simply move the driver 
    aside temporarily (rename it so that it won't be found and hence 
    won't be loaded during boot).  You can either knock it out of 
    /etc/driver_aliases, or just go to the directory where it happens 
    to live (whether /kernel/drv, /kernel/drv/amd64, 
    /platform/i86pc/kernel/drv, or /platform/i86pc/kernel/drv/amd64) 
    and rename it temporarily.
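
The rename approach can be sketched as a small Python helper that looks 
for a module in the directories listed above and moves it aside, with a 
matching function to put it back.  This is only an illustration: the 
function names, the ".moved-aside" suffix and the root parameter (there 
so the code can be tried against a scratch directory) are assumptions 
of this sketch, and it does not touch /etc/driver_aliases or any boot 
archive.

```python
from pathlib import Path
from typing import List

# Directories named in the text above where an x86 driver module may live,
# written relative to the filesystem root.
DRIVER_DIRS = [
    "kernel/drv",
    "kernel/drv/amd64",
    "platform/i86pc/kernel/drv",
    "platform/i86pc/kernel/drv/amd64",
]

SUFFIX = ".moved-aside"   # arbitrary suffix chosen for this sketch


def move_aside(module: str, root: str = "/") -> List[Path]:
    """Rename every copy of `module` found in DRIVER_DIRS.

    Returns the new paths, so the caller can see what was changed.
    """
    renamed = []
    for rel in DRIVER_DIRS:
        path = Path(root) / rel / module
        if path.is_file():
            target = path.with_name(path.name + SUFFIX)
            path.rename(target)
            renamed.append(target)
    return renamed


def restore(module: str, root: str = "/") -> List[Path]:
    """Undo move_aside() by renaming the files back to their original names."""
    restored = []
    for rel in DRIVER_DIRS:
        aside = Path(root) / rel / (module + SUFFIX)
        if aside.is_file():
            target = aside.with_name(module)
            aside.rename(target)
            restored.append(target)
    return restored
```

One would call move_aside("<module-name>") as root before rebooting for 
the test, and restore("<module-name>") afterwards.  Keeping the renamed 
file in the same directory makes it easy to put back.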

(Other remarks are in-line below)
-db

Randy Fishel wrote:
> On Thu, 11 Dec 2008, Garrett D'Amore wrote:
>
>   
>> So, I'm interested in ensuring that my driver properly suspend/resumes.  
>> However, I'm having problems in that my platform doesn't resume properly 
>> from a suspend, even without my driver loaded.
>>
>> Are there any hints that we can use to help us figure out how to debug 
>> failures in resume?  It would be helpful to have a developer's debugging 
>> page for this stuff.
>>     
>
>   Can you hook up a serial line, or better yet, a serial console?  
> Logging to the serial port is nearly all the way to power-off, and 
> early in power-on.  And the serial console starts pretty early as 
> well.
>
>   And I will see about getting a debugging page (or maybe even a wiki) 
> started.
>
>   
>> (And no, I cannot move to a supported platform to debug, because the driver 
>> is for hardware that is on the motherboard.  Can I please also take a 
>> second to bemoan the lack of suspend/resume support in ATI framebuffer 
>> drivers?  I'm starting to believe that we need to have a facility to 
>> execute the BIOS on the video board for these things...)
>>     
Yes, that sort of thing certainly is (and has been for several years 
now) our desire.  It would though, require a fundamental change in an 
industry area where Sun has not historically had any influence or 
participation.  Problem is that the historical design (I'm reluctant to 
use the term 'architecture') of the BIOS is such that it only expects to 
execute hardware initialization code (including that on option card 
BIOSes such as the VBIOS on graphics cards) at power-on reset.  Its 
design did not anticipate the need also to execute such code upon resume 
from S3.  These power-related features now in the components and on the 
hardware platform are relatively new things and represent a disruptor at 
the firmware level as well as further upstairs. 

In fact, even if there were a hook to the VBIOS initialization code that 
the OS could get to, very often it could not be re-executed in any case 
since it tends to rely on routines in the motherboard's BIOS, and we've 
seen that various routines in the main BIOS also become unmapped after 
the power-on reset sequence has been completed.  Typically the 
initialization entry points in the VBIOS are correspondingly unmapped 
after POR.

This leaves us in the difficult situation [at present] that we have to 
have a Solaris kernel driver that knows how to re-initialize the 
graphics hardware in question from cold iron -- equivalent to what the 
VBIOS does at POR.  In some cases we have grabbed a copy of the 
graphics card's VBIOS and then interpreted it in the Solaris device 
driver during the resume operation.

Having spent several months to make one of these things work for the ATI 
RageXL chip, I can tell you that this is not a happy way to proceed.  
Often the documentation is poor or non-existent (sometimes because some 
of the vendors feel that that might reveal proprietary aspects of their 
chip architecture or something), and even when documentation can be 
procured, we have discovered that there are implementation bugs in some 
of the chips which have sometimes been band-aided with subsequent 
undocumented bits in the hardware which can be nearly impossible to 
learn about.  Such a thing in particular with the RageXL took us the 
better part of a month to discover.

An excellent answer would be a change in the BIOS architecture such that 
the same hardware initialization code could be executed whilst coming 
out of S3.  That way we continue to have a sensible situation in which 
those very low level aspects are supported by the option vendor and we 
don't need to think about it in OS-land.

The other (poorer, in my opinion) way to go is that we get the vendors' 
support to provide an OS-specific device driver that knows how to do the 
right thing(s).  We currently have this situation with nVidia, and since 
they have a unified driver architecture, they don't have to provide us a 
different one for every graphics device they come out with.  They just 
keep the one unified one up to date.

ATI has been a bigger problem historically.  First, because they didn't 
at first have a unified architecture and hence single device driver 
capability.  They now do have that, I understand (since R300, I believe), 
but we don't yet have a situation in which they are providing us a 
Solaris device driver to do S/R, nor do we yet have the capacity to do 
that ourselves, as I believe we still do not (after a long long effort) 
have the documentation and/or code examples to do the driver(s) 
ourselves.  As Randy says, there has been some recent 
light in this tunnel, and there is talk that the means may now be 
available to us, but ...
>
>   You can bemoan, but it may not help at all.  I keep hearing that the 
> open source driver works, but I don't think there has been success 
> yet.
>
>       ---- Randy
>
>   
>>     -- Garrett
>>
>>     
> _______________________________________________
> pm-discuss mailing list
> pm-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/pm-discuss
>   

-- 
; David J. Brown Ph.D. (cantab.)
; Solaris Engineering
; Sun Microsystems Inc.
; --
; Postal Address:                       Telephone: (650) 786-5558
;  4150 Network Circle, UMPK17-307      FAX:       (650) 786-5734
;  Santa Clara, CA 95054                e-mail:    djb at sun.com
