Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Daniel Vetter
On Sun, Feb 17, 2013 at 5:31 PM, Hugh Dickins  wrote:
> On Sun, 17 Feb 2013, Daniel Vetter wrote:
>> On Sun, Feb 17, 2013 at 3:21 AM, Hugh Dickins  wrote:
>> > On Sat, 16 Feb 2013, Hugh Dickins wrote:
>> >> On Sat, 16 Feb 2013, Linus Torvalds wrote:
>> >> >
>> >> > I think it's worth it to give them a heads-up already. So I've cc'd
>> >> > the main suspects here..
>> >>
>> >> Okay, thanks.
>> >>
>> >> >
>> >> > Daniel, Dave - any comments about a NULL fb in
>> >> > intel_choose_pipe_bpp_dither() during either suspend or resume? Some
>> >> > googling shows this:
>> >> >
>> >> > https://bugzilla.redhat.com/show_bug.cgi?id=895123
>> >>
>> >> Great, yes, I'm sure that's the same (though it says "suspend"
>> >> and I say "resume").
>> >>
>> >> >
>> >> > which sounds remarkably similar, and is also during a suspend attempt
>> >> > (but apparently Satish got a full oops out).. Some timing race with a
>> >> > worker entry?
>> >
>> > Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
>> > the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 "drm/i915: force restore
>> > on lid open", whose force_restore case now passes down crtc->base.fb.  But
>> > I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
>> > your timing race with a worker entry, perhaps.
>> >
>> > And 45e2b5f640b3 contains a fine history of going back and forth, so I
>> > wouldn't want to play further with it out of ignorance - though tempted
>> > to replace the "if (force_restore) {" by an interim safe-seeming
>> > compromise of "if (force_restore && crtc->base.fb) {".
>
> My suggestion there in the line above was stupidly wrong :(
>
>>
>> Two things to try while I try to grow a clue on what exactly is going on:
>
> Thank you.
>
> By the way, I hope you've looked back through this thread, and realize
> that Dave and I both had ThinkPad T4?0s suspend/resume display problems,
> but they've gone off in different directions: so a lot of the discussion
> comes from Dave having CONFIG_PROVE_RCU_DELAY, and has nothing to do with
> what we now know to be my oops in i915/intel_display.c.

Oh, I haven't read the earlier parts of the thread, but agree that
it's a completely different bug: Different chipset (this matters
massively for gpus usually), Dave's issue happens on -rc1 (which
doesn't contain the offending lid_notifier patch yet) and seems to die
someplace completely else than your box.

>> 1. Related to new ACPI sleep states we've recently made the lid_notifier
>> locking more sound, maybe that helps:
>>
>> http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued=b8efb17b3d687695b81485f606fc4e6c35a50f9a
>
> Looks like it may be relevant, but I'll ignore it for now:
> preferring first to test the more minimal safety you suggest below.
>
>>
>> 2. The new i915 force restore code is indeed missing a safety check
>> compared to the old crtc helper based one:
>>
>> diff --git a/drivers/gpu/drm/i915/intel_display.c 
>> b/drivers/gpu/drm/i915/intel_display.c
>> index 6eb3882..095094c 100644
>> --- a/drivers/gpu/drm/i915/intel_display.c
>> +++ b/drivers/gpu/drm/i915/intel_display.c
>> @@ -9153,7 +9153,13 @@ void intel_modeset_setup_hw_state(struct drm_device 
>> *dev,
>>
>>   if (force_restore) {
>>   for_each_pipe(pipe) {
>> - 
>> intel_crtc_restore_mode(dev_priv->pipe_to_crtc_mapping[pipe]);
>> + struct drm_crtc * crtc =
>> + dev_priv->pipe_to_crtc_mapping[pipe];
>> +
>> + if (!crtc->enabled)
>> + continue;
>> +
>> + intel_crtc_restore_mode(crtc);
>>   }
>>
>>   i915_redisable_vga(dev);
>
> I see your followup, where you observe that intel_modeset_affected_pipes()
> should already have made this check; but you do say it would still be good
> to prove one way or the other, so I'll run from now with the patch below.
>
> Note that we haven't got to worrying about what will be in 3.9 yet
> (linux-next tells me to expect hangcheck timer problems): it's Linus's
> current git for 3.8 final that we're working on at present.

Right, patch was on top of -next, but there shouldn't be any
(functional) differences in this area compared to 3.8. The first part
of the big rework landed in 3.7 and contained the crtc->enabled check
from day one.

For the hangcheck issue in -next, that might be a new one. If you
catch it again, can you please grab the i915_error_state from debugfs
and throw it over to me? That should be enough for basic analysis.

> And since quick resumes have always worked for me, it's only occasionally
> that a long suspend (e.g. overnight) fails for me in this way, so I won't
> be able to report success for several days - but failure may come sooner.
>
> And, it being very tiresome to debug when it does fail, I have inserted
> WARN_ONs and more safety: here's what I'm running with now.

Yep, that should be interesting 

Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Daniel Vetter
On Sun, Feb 17, 2013 at 2:38 PM, Daniel Vetter  
wrote:
> 2. The new i915 force restore code is indeed missing a safety check
> compared to the old crtc helper based one:
>
> diff --git a/drivers/gpu/drm/i915/intel_display.c 
> b/drivers/gpu/drm/i915/intel_display.c
> index 6eb3882..095094c 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -9153,7 +9153,13 @@ void intel_modeset_setup_hw_state(struct drm_device 
> *dev,
>
> if (force_restore) {
> for_each_pipe(pipe) {
> -   
> intel_crtc_restore_mode(dev_priv->pipe_to_crtc_mapping[pipe]);
> +   struct drm_crtc * crtc =
> +   dev_priv->pipe_to_crtc_mapping[pipe];
> +
> +   if (!crtc->enabled)
> +   continue;
> +
> +   intel_crtc_restore_mode(crtc);
> }
>
> i915_redisable_vga(dev);
>
> The issue is that we save a pointer to the fb (since those are refcounted)
> but copy the mode into the crtc struct (since modes are not refcounted).
> So on restore the mode will always be non-NULL, which is wrong if the crtc
> is off (and so also has a NULL fb).
>
> The problem I have with that patch is figuring out how this ever worked. I
> think I need more coffee ;-)

Ok, coffee seems to be working now. I think the above diff shouldn't
change anything, since we already have a crtc->enabled check in
intel_modeset_affected_pipes in intel_display.c. Still would be good
if you can prove this one way or the other.

For those wondering why this check is buried this deeply: We're in the
middle of a massive rework of our modeset code, moving from
one-crtc-at-a-time to global modeset. We need that to implement some
fancy features like fastboot or better handling of global resource
constraints (shared clocks, bw limits, ...). In the new world we set
up the desired state in staging pointers/data structures. Then the
modeset code diffs that with the current state and computes the best
way to do the transition. But since we're still converting code some
pieces pass in the new state explicitly, but lower levels then ignore
some pieces when not required to reach the desired state.

The new lid restore code relies on that by updating the tracked hw
state from the real hw one and restoring the last desired state (which
is still around from the last modeset call).

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Daniel Vetter
On Sun, Feb 17, 2013 at 3:21 AM, Hugh Dickins  wrote:
> On Sat, 16 Feb 2013, Hugh Dickins wrote:
>> On Sat, 16 Feb 2013, Linus Torvalds wrote:
>> >
>> > I think it's worth it to give them a heads-up already. So I've cc'd
>> > the main suspects here..
>>
>> Okay, thanks.
>>
>> >
>> > Daniel, Dave - any comments about a NULL fb in
>> > intel_choose_pipe_bpp_dither() during either suspend or resume? Some
>> > googling shows this:
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=895123
>>
>> Great, yes, I'm sure that's the same (though it says "suspend"
>> and I say "resume").
>>
>> >
>> > which sounds remarkably similar, and is also during a suspend attempt
>> > (but apparently Satish got a full oops out).. Some timing race with a
>> > worker entry?
>
> Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
> the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 "drm/i915: force restore
> on lid open", whose force_restore case now passes down crtc->base.fb.  But
> I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
> your timing race with a worker entry, perhaps.
>
> And 45e2b5f640b3 contains a fine history of going back and forth, so I
> wouldn't want to play further with it out of ignorance - though tempted
> to replace the "if (force_restore) {" by an interim safe-seeming
> compromise of "if (force_restore && crtc->base.fb) {".

Two things to try while I try to grow a clue on what exactly is going on:

1. Related to new ACPI sleep states we've recently made the lid_notifier
locking more sound, maybe that helps:

http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued=b8efb17b3d687695b81485f606fc4e6c35a50f9a

2. The new i915 force restore code is indeed missing a safety check
compared to the old crtc helper based one:

diff --git a/drivers/gpu/drm/i915/intel_display.c 
b/drivers/gpu/drm/i915/intel_display.c
index 6eb3882..095094c 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -9153,7 +9153,13 @@ void intel_modeset_setup_hw_state(struct drm_device *dev,

if (force_restore) {
for_each_pipe(pipe) {
-   
intel_crtc_restore_mode(dev_priv->pipe_to_crtc_mapping[pipe]);
+   struct drm_crtc * crtc =
+   dev_priv->pipe_to_crtc_mapping[pipe];
+
+   if (!crtc->enabled)
+   continue;
+
+   intel_crtc_restore_mode(crtc);
}

i915_redisable_vga(dev);

The issue is that we save a pointer to the fb (since those are refcounted)
but copy the mode into the crtc struct (since modes are not refcounted).
So on restore the mode will always be non-NULL, which is wrong if the crtc
is off (and so also has a NULL fb).

The problem I have with that patch is figuring out how this ever worked. I
think I need more coffee ;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Hugh Dickins
On Sun, 17 Feb 2013, Daniel Vetter wrote:
> On Sun, Feb 17, 2013 at 3:21 AM, Hugh Dickins  wrote:
> > On Sat, 16 Feb 2013, Hugh Dickins wrote:
> >> On Sat, 16 Feb 2013, Linus Torvalds wrote:
> >> >
> >> > I think it's worth it to give them a heads-up already. So I've cc'd
> >> > the main suspects here..
> >>
> >> Okay, thanks.
> >>
> >> >
> >> > Daniel, Dave - any comments about a NULL fb in
> >> > intel_choose_pipe_bpp_dither() during either suspend or resume? Some
> >> > googling shows this:
> >> >
> >> > https://bugzilla.redhat.com/show_bug.cgi?id=895123
> >>
> >> Great, yes, I'm sure that's the same (though it says "suspend"
> >> and I say "resume").
> >>
> >> >
> >> > which sounds remarkably similar, and is also during a suspend attempt
> >> > (but apparently Satish got a full oops out).. Some timing race with a
> >> > worker entry?
> >
> > Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
> > the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 "drm/i915: force restore
> > on lid open", whose force_restore case now passes down crtc->base.fb.  But
> > I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
> > your timing race with a worker entry, perhaps.
> >
> > And 45e2b5f640b3 contains a fine history of going back and forth, so I
> > wouldn't want to play further with it out of ignorance - though tempted
> > to replace the "if (force_restore) {" by an interim safe-seeming
> > compromise of "if (force_restore && crtc->base.fb) {".

My suggestion there in the line above was stupidly wrong :(

> 
> Two things to try while I try to grow a clue on what exactly is going on:

Thank you.

By the way, I hope you've looked back through this thread, and realize
that Dave and I both had ThinkPad T4?0s suspend/resume display problems,
but they've gone off in different directions: so a lot of the discussion
comes from Dave having CONFIG_PROVE_RCU_DELAY, and has nothing to do with
what we now know to be my oops in i915/intel_display.c.

> 
> 1. Related to new ACPI sleep states we've recently made the lid_notifier
> locking more sound, maybe that helps:
> 
> http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued=b8efb17b3d687695b81485f606fc4e6c35a50f9a

Looks like it may be relevant, but I'll ignore it for now:
preferring first to test the more minimal safety you suggest below.

> 
> 2. The new i915 force restore code is indeed missing a safety check
> compared to the old crtc helper based one:
> 
> diff --git a/drivers/gpu/drm/i915/intel_display.c 
> b/drivers/gpu/drm/i915/intel_display.c
> index 6eb3882..095094c 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -9153,7 +9153,13 @@ void intel_modeset_setup_hw_state(struct drm_device 
> *dev,
>  
>   if (force_restore) {
>   for_each_pipe(pipe) {
> - 
> intel_crtc_restore_mode(dev_priv->pipe_to_crtc_mapping[pipe]);
> + struct drm_crtc * crtc =
> + dev_priv->pipe_to_crtc_mapping[pipe];
> +
> + if (!crtc->enabled)
> + continue;
> +
> + intel_crtc_restore_mode(crtc);
>   }
>  
>   i915_redisable_vga(dev);

I see your followup, where you observe that intel_modeset_affected_pipes()
should already have made this check; but you do say it would still be good
to prove one way or the other, so I'll run from now with the patch below.

Note that we haven't got to worrying about what will be in 3.9 yet
(linux-next tells me to expect hangcheck timer problems): it's Linus's
current git for 3.8 final that we're working on at present.

And since quick resumes have always worked for me, it's only occasionally
that a long suspend (e.g. overnight) fails for me in this way, so I won't
be able to report success for several days - but failure may come sooner.

And, it being very tiresome to debug when it does fail, I have inserted
WARN_ONs and more safety: here's what I'm running with now.

--- 3.8-rc7/drivers/gpu/drm/i915/intel_display.c2013-01-17 
20:06:11.384002362 -0800
+++ linux/drivers/gpu/drm/i915/intel_display.c  2013-02-17 07:50:28.012075150 
-0800
@@ -4156,7 +4156,9 @@ static bool intel_choose_pipe_bpp_dither
 * also stays within the max display bpc discovered above.
 */

-   switch (fb->depth) {
+   if (WARN_ON(!fb))
+   bpc = 8;
+   else switch (fb->depth) {
case 8:
bpc = 8; /* since we go through a colormap */
break;
@@ -9302,6 +9304,10 @@ void intel_modeset_setup_hw_state(struct
if (force_restore) {
for_each_pipe(pipe) {
crtc = 
to_intel_crtc(dev_priv->pipe_to_crtc_mapping[pipe]);
+   if (WARN_ON(!crtc->base.enabled))
+   continue;
+   if (WARN_ON(!crtc->base.fb))
+  

Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Daniel Vetter
On Sun, Feb 17, 2013 at 5:31 PM, Hugh Dickins hu...@google.com wrote:
 On Sun, 17 Feb 2013, Daniel Vetter wrote:
 On Sun, Feb 17, 2013 at 3:21 AM, Hugh Dickins hu...@google.com wrote:
  On Sat, 16 Feb 2013, Hugh Dickins wrote:
  On Sat, 16 Feb 2013, Linus Torvalds wrote:
  
   I think it's worth it to give them a heads-up already. So I've cc'd
   the main suspects here..
 
  Okay, thanks.
 
  
   Daniel, Dave - any comments about a NULL fb in
   intel_choose_pipe_bpp_dither() during either suspend or resume? Some
   googling shows this:
  
   https://bugzilla.redhat.com/show_bug.cgi?id=895123
 
  Great, yes, I'm sure that's the same (though it says suspend
  and I say resume).
 
  
   which sounds remarkably similar, and is also during a suspend attempt
   (but apparently Satish got a full oops out).. Some timing race with a
   worker entry?
 
  Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
  the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 drm/i915: force restore
  on lid open, whose force_restore case now passes down crtc-base.fb.  But
  I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
  your timing race with a worker entry, perhaps.
 
  And 45e2b5f640b3 contains a fine history of going back and forth, so I
  wouldn't want to play further with it out of ignorance - though tempted
  to replace the if (force_restore) { by an interim safe-seeming
  compromise of if (force_restore  crtc-base.fb) {.

 My suggestion there in the line above was stupidly wrong :(


 Two things to try while I try to grow a clue on what exactly is going on:

 Thank you.

 By the way, I hope you've looked back through this thread, and realize
 that Dave and I both had ThinkPad T4?0s suspend/resume display problems,
 but they've gone off in different directions: so a lot of the discussion
 comes from Dave having CONFIG_PROVE_RCU_DELAY, and has nothing to do with
 what we now know to be my oops in i915/intel_display.c.

Oh, I haven't read the earlier parts of the thread, but agree that
it's a completely different bug: Different chipset (this matters
massively for gpus usually), Dave's issue happens on -rc1 (which
doesn't contain the offending lid_notifier patch yet) and seems to die
someplace completely else than your box.

 1. Related to new ACPI sleep states we've recently made the lid_notifier
 locking more sound, maybe that helps:

 http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queuedid=b8efb17b3d687695b81485f606fc4e6c35a50f9a

 Looks like it may be relevant, but I'll ignore it for now:
 preferring first to test the more minimal safety you suggest below.


 2. The new i915 force restore code is indeed missing a safety check
 compared to the old crtc helper based one:

 diff --git a/drivers/gpu/drm/i915/intel_display.c 
 b/drivers/gpu/drm/i915/intel_display.c
 index 6eb3882..095094c 100644
 --- a/drivers/gpu/drm/i915/intel_display.c
 +++ b/drivers/gpu/drm/i915/intel_display.c
 @@ -9153,7 +9153,13 @@ void intel_modeset_setup_hw_state(struct drm_device 
 *dev,

   if (force_restore) {
   for_each_pipe(pipe) {
 - 
 intel_crtc_restore_mode(dev_priv-pipe_to_crtc_mapping[pipe]);
 + struct drm_crtc * crtc =
 + dev_priv-pipe_to_crtc_mapping[pipe];
 +
 + if (!crtc-enabled)
 + continue;
 +
 + intel_crtc_restore_mode(crtc);
   }

   i915_redisable_vga(dev);

 I see your followup, where you observe that intel_modeset_affected_pipes()
 should already have made this check; but you do say it would still be good
 to prove one way or the other, so I'll run from now with the patch below.

 Note that we haven't got to worrying about what will be in 3.9 yet
 (linux-next tells me to expect hangcheck timer problems): it's Linus's
 current git for 3.8 final that we're working on at present.

Right, patch was on top of -next, but there shouldn't be any
(functional) differences in this area compared to 3.8. The first part
of the big rework landed in 3.7 and contained the crtc-enabled check
from day one.

For the hangcheck issue in -next, that might be a new one. If you
catch it again, can you please grab the i915_error_state from debugfs
and throw it over to me? That should be enough for basic analysis.

 And since quick resumes have always worked for me, it's only occasionally
 that a long suspend (e.g. overnight) fails for me in this way, so I won't
 be able to report success for several days - but failure may come sooner.

 And, it being very tiresome to debug when it does fail, I have inserted
 WARN_ONs and more safety: here's what I'm running with now.

Yep, that should be interesting once it catches something. For more
debug noise can you set drm.debug=0xe (should work even when setting
it in /sys/modules/drm/parameters at runtime). That spills tons of
stuff into dmesg which will come handy in 

Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Hugh Dickins
On Sat, 16 Feb 2013, Linus Torvalds wrote:
 On Sat, Feb 16, 2013 at 1:45 PM, Hugh Dickins hu...@google.com wrote:
 
  I hacked around on your PM_TRACE set_magic_time() / read_magic_time()
  yesterday, to save an oopsing core kernel ip there, instead of hashed
  pm trace info (it makes sense in this case to invert your sequence,
  putting the high order into years and the low order into minutes).
 
 That sounds like a good idea in general. The PM_TRACE() thing was done
 to figure out things that locked up the PCI bus etc, but encoding the
 oopses during suspend sounds like a really good idea too.

Yes, it can and should be generalized, but needs more than I gave it.

 
 Is your patch clean enough to just be made part of the standard
 PM_TRACE infrastructure, or was it something really hacky and one-off?

Not really hacky, but three reasons why it cannot be standard without
more work (I am supposing that it should not be tied to suspend/resume,
but could be used for any oops which gets hidden by graphic screen,
and also fails to reach /var/log/messages).

1. Because I usually have a version of KDB patched into my kernels
(forked many years ago, never time to integrate with that subset
which eventually went into 2.6.??), it was easiest to implement as
a pair of KDB commands (needing keyboard input: I already knew I
could reboot from KDB in this state).  (Sidenote: using KDB can
prevent getting a trace into /var/log/messages: so I had tried
without it, but still failed to get one.)

2. I don't use initrd, have very little in modules, and a pared down
kernel: so for me it was not a serious restriction to limit to core
kernel addresses, which easily fitted within the limit; but we ought
to put thought into handling module addresses too.

3. Needs care on the interface: a debug config option presumably,
but perhaps also needs to tie in with panic_on_oops or something -
I don't like my date getting messed up by surprise, and there's no
point in messing it up unless there's reason to believe the machine
will very quickly be rebooted.  Since I had to open the lid to resume
to hit the oops, in this case we could rely on me quickly rebooting.

But I've appended what I had below, so it's on the record, and can
be taken further if (likely) someone gets there sooner than I do.

 
  Rewarded last night by reboot to Feb 21 14:45:53 2006.  Which is
  812d60ed intel_choose_pipe_bpp_dither.isra.13+0x216/0x2d6
 
  /home/hugh/3087X/drivers/gpu/drm/i915/intel_display.c:4159
   * enable dithering as needed, but that costs bandwidth.  So choose
   * the minimum value that expresses the full color range of the fb 
  but
   * also stays within the max display bpc discovered above.
   */
 
  switch (fb-depth) {
  812d60e9:   48 8b 55 c0 mov-0x40(%rbp),%rdx
  812d60ed:   8b 02   mov(%rdx),%eax
 
  (gcc chose to pass a pointer to fb-depth down to the function,
  instead of fb itself, since that is the only use of it there.)
 
  I expect that fb is NULL; but with an average of one failure to resume
  per day, and ~26 bits of info per crash, this is not a fast procedure!
 
  I notice that intel_pipe_set_base() allows for NULL fb,
  so I'm currently running with the oops-in-rtc hackery, plus
  -   switch (fb-depth) {
  +   if (WARN_ON(!fb))
  +   bpc = 8;
  +   else switch (fb-depth) {
 
  There's been a fair bit of change to intel_display.c since 3.7 (if
  my 3.7 was indeed good), mainly splitting intel_ into haswell_ versus
  ironlake_, but I've not yet spotted anything obvious; nor yet looked
  to see where fb would originate from anyway.
 
  Once I've got just a little more info out of it, I'll start another
  thread addressed principally to the drm/gpu/i915 guys.
 
 I think it's worth it to give them a heads-up already. So I've cc'd
 the main suspects here..

Okay, thanks.

 
 Daniel, Dave - any comments about a NULL fb in
 intel_choose_pipe_bpp_dither() during either suspend or resume? Some
 googling shows this:
 
 https://bugzilla.redhat.com/show_bug.cgi?id=895123

Great, yes, I'm sure that's the same (though it says suspend
and I say resume).

 
 which sounds remarkably similar, and is also during a suspend attempt
 (but apparently Satish got a full oops out).. Some timing race with a
 worker entry?
 
 Linus

#include linux/rtc.h
#include asm/rtc.h
/*
 * HughD adapted the following from drivers/base/power/trace.c:
 *
 * Copyright (C) 2006 Linus Torvalds
 *
 * Trace facility for suspend/resume problems, when none of the
 * devices may be working.
 *
 * Horrid, horrid, horrid.
 *
 * It turns out that the _only_ piece of hardware that actually
 * keeps its value across a hard boot (and, more importantly, the
 * POST init sequence) is literally the realtime clock.
 *
 * Never mind that an RTC chip has 114 bytes (and often a whole
 * other bank of an additional 128 bytes) of 

Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Hugh Dickins
On Sat, 16 Feb 2013, Hugh Dickins wrote:
 On Sat, 16 Feb 2013, Linus Torvalds wrote:
  
  I think it's worth it to give them a heads-up already. So I've cc'd
  the main suspects here..
 
 Okay, thanks.
 
  
  Daniel, Dave - any comments about a NULL fb in
  intel_choose_pipe_bpp_dither() during either suspend or resume? Some
  googling shows this:
  
  https://bugzilla.redhat.com/show_bug.cgi?id=895123
 
 Great, yes, I'm sure that's the same (though it says suspend
 and I say resume).
 
  
  which sounds remarkably similar, and is also during a suspend attempt
  (but apparently Satish got a full oops out).. Some timing race with a
  worker entry?

Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 drm/i915: force restore
on lid open, whose force_restore case now passes down crtc-base.fb.  But
I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
your timing race with a worker entry, perhaps.

And 45e2b5f640b3 contains a fine history of going back and forth, so I
wouldn't want to play further with it out of ignorance - though tempted
to replace the if (force_restore) { by an interim safe-seeming
compromise of if (force_restore  crtc-base.fb) {.

Hugh
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Hugh Dickins
On Sun, 17 Feb 2013, Daniel Vetter wrote:
 On Sun, Feb 17, 2013 at 3:21 AM, Hugh Dickins hu...@google.com wrote:
  On Sat, 16 Feb 2013, Hugh Dickins wrote:
  On Sat, 16 Feb 2013, Linus Torvalds wrote:
  
   I think it's worth it to give them a heads-up already. So I've cc'd
   the main suspects here..
 
  Okay, thanks.
 
  
   Daniel, Dave - any comments about a NULL fb in
   intel_choose_pipe_bpp_dither() during either suspend or resume? Some
   googling shows this:
  
   https://bugzilla.redhat.com/show_bug.cgi?id=895123
 
  Great, yes, I'm sure that's the same (though it says suspend
  and I say resume).
 
  
   which sounds remarkably similar, and is also during a suspend attempt
   (but apparently Satish got a full oops out).. Some timing race with a
   worker entry?
 
  Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
  the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 drm/i915: force restore
  on lid open, whose force_restore case now passes down crtc-base.fb.  But
  I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
  your timing race with a worker entry, perhaps.
 
  And 45e2b5f640b3 contains a fine history of going back and forth, so I
  wouldn't want to play further with it out of ignorance - though tempted
  to replace the if (force_restore) { by an interim safe-seeming
  compromise of if (force_restore  crtc-base.fb) {.

My suggestion there in the line above was stupidly wrong :(

 
 Two things to try while I try to grow a clue on what exactly is going on:

Thank you.

By the way, I hope you've looked back through this thread, and realize
that Dave and I both had ThinkPad T4?0s suspend/resume display problems,
but they've gone off in different directions: so a lot of the discussion
comes from Dave having CONFIG_PROVE_RCU_DELAY, and has nothing to do with
what we now know to be my oops in i915/intel_display.c.

 
 1. Related to new ACPI sleep states we've recently made the lid_notifier
 locking more sound, maybe that helps:
 
 http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queuedid=b8efb17b3d687695b81485f606fc4e6c35a50f9a

Looks like it may be relevant, but I'll ignore it for now:
preferring first to test the more minimal safety you suggest below.

 
 2. The new i915 force restore code is indeed missing a safety check
 compared to the old crtc helper based one:
 
 diff --git a/drivers/gpu/drm/i915/intel_display.c 
 b/drivers/gpu/drm/i915/intel_display.c
 index 6eb3882..095094c 100644
 --- a/drivers/gpu/drm/i915/intel_display.c
 +++ b/drivers/gpu/drm/i915/intel_display.c
 @@ -9153,7 +9153,13 @@ void intel_modeset_setup_hw_state(struct drm_device 
 *dev,
  
   if (force_restore) {
   for_each_pipe(pipe) {
 - 
 intel_crtc_restore_mode(dev_priv-pipe_to_crtc_mapping[pipe]);
 + struct drm_crtc * crtc =
 + dev_priv-pipe_to_crtc_mapping[pipe];
 +
 + if (!crtc-enabled)
 + continue;
 +
 + intel_crtc_restore_mode(crtc);
   }
  
   i915_redisable_vga(dev);

I see your followup, where you observe that intel_modeset_affected_pipes()
should already have made this check; but you do say it would still be good
to prove one way or the other, so I'll run from now with the patch below.

Note that we haven't got to worrying about what will be in 3.9 yet
(linux-next tells me to expect hangcheck timer problems): it's Linus's
current git for 3.8 final that we're working on at present.

And since quick resumes have always worked for me, it's only occasionally
that a long suspend (e.g. overnight) fails for me in this way, so I won't
be able to report success for several days - but failure may come sooner.

And, it being very tiresome to debug when it does fail, I have inserted
WARN_ONs and more safety: here's what I'm running with now.

--- 3.8-rc7/drivers/gpu/drm/i915/intel_display.c2013-01-17 
20:06:11.384002362 -0800
+++ linux/drivers/gpu/drm/i915/intel_display.c  2013-02-17 07:50:28.012075150 
-0800
@@ -4156,7 +4156,9 @@ static bool intel_choose_pipe_bpp_dither
 * also stays within the max display bpc discovered above.
 */
 
-   switch (fb-depth) {
+   if (WARN_ON(!fb))
+   bpc = 8;
+   else switch (fb-depth) {
case 8:
bpc = 8; /* since we go through a colormap */
break;
@@ -9302,6 +9304,10 @@ void intel_modeset_setup_hw_state(struct
if (force_restore) {
for_each_pipe(pipe) {
crtc = 
to_intel_crtc(dev_priv-pipe_to_crtc_mapping[pipe]);
+   if (WARN_ON(!crtc-base.enabled))
+   continue;
+   if (WARN_ON(!crtc-base.fb))
+   continue;
intel_set_mode(crtc-base, crtc-base.mode,
   

Debugging Thinkpad T430s occasional suspend failure.

2013-02-16 Thread Hugh Dickins
On Sat, 16 Feb 2013, Hugh Dickins wrote:
> On Sat, 16 Feb 2013, Linus Torvalds wrote:
> > 
> > I think it's worth it to give them a heads-up already. So I've cc'd
> > the main suspects here..
> 
> Okay, thanks.
> 
> > 
> > Daniel, Dave - any comments about a NULL fb in
> > intel_choose_pipe_bpp_dither() during either suspend or resume? Some
> > googling shows this:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=895123
> 
> Great, yes, I'm sure that's the same (though it says "suspend"
> and I say "resume").
> 
> > 
> > which sounds remarkably similar, and is also during a suspend attempt
> > (but apparently Satish got a full oops out).. Some timing race with a
> > worker entry?

Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 "drm/i915: force restore
on lid open", whose force_restore case now passes down crtc->base.fb.  But
I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
your timing race with a worker entry, perhaps.

And 45e2b5f640b3 contains a fine history of going back and forth, so I
wouldn't want to play further with it out of ignorance - though tempted
to replace the "if (force_restore) {" by an interim safe-seeming
compromise of "if (force_restore && crtc->base.fb) {".

Hugh


Debugging Thinkpad T430s occasional suspend failure.

2013-02-16 Thread Hugh Dickins
On Sat, 16 Feb 2013, Linus Torvalds wrote:
> On Sat, Feb 16, 2013 at 1:45 PM, Hugh Dickins  wrote:
> >
> > I hacked around on your PM_TRACE set_magic_time() / read_magic_time()
> > yesterday, to save an oopsing core kernel ip there, instead of hashed
> > pm trace info (it makes sense in this case to invert your sequence,
> > putting the high order into years and the low order into minutes).
> 
> That sounds like a good idea in general. The PM_TRACE() thing was done
> to figure out things that locked up the PCI bus etc, but encoding the
> oopses during suspend sounds like a really good idea too.

Yes, it can and should be generalized, but needs more than I gave it.

> 
> Is your patch clean enough to just be made part of the standard
> PM_TRACE infrastructure, or was it something really hacky and one-off?

Not really hacky, but three reasons why it cannot be standard without
more work (I am supposing that it should not be tied to suspend/resume,
but could be used for any oops which gets hidden by graphic screen,
and also fails to reach /var/log/messages).

1. Because I usually have a version of KDB patched into my kernels
("forked" many years ago, never time to integrate with that subset
which eventually went into 2.6.??), it was easiest to implement as
a pair of KDB commands (needing keyboard input: I already knew I
could "reboot" from KDB in this state).  (Sidenote: using KDB can
prevent getting a trace into /var/log/messages: so I had tried
without it, but still failed to get one.)

2. I don't use initrd, have very little in modules, and a pared down
kernel: so for me it was not a serious restriction to limit to core
kernel addresses, which easily fitted within the limit; but we ought
to put thought into handling module addresses too.

3. Needs care on the interface: a debug config option presumably,
but perhaps also needs to tie in with panic_on_oops or something -
I don't like my date getting messed up by surprise, and there's no
point in messing it up unless there's reason to believe the machine
will very quickly be rebooted.  Since I had to open the lid to resume
to hit the oops, in this case we could rely on me quickly rebooting.

But I've appended what I had below, so it's on the record, and can
be taken further if (likely) someone gets there sooner than I do.

> 
> > Rewarded last night by reboot to Feb 21 14:45:53 2006.  Which is
> > 812d60ed intel_choose_pipe_bpp_dither.isra.13+0x216/0x2d6
> >
> > /home/hugh/3087X/drivers/gpu/drm/i915/intel_display.c:4159
> >  * enable dithering as needed, but that costs bandwidth.  So choose
> >  * the minimum value that expresses the full color range of the fb 
> > but
> >  * also stays within the max display bpc discovered above.
> >  */
> >
> > switch (fb->depth) {
> > 812d60e9:   48 8b 55 c0 mov-0x40(%rbp),%rdx
> > 812d60ed:   8b 02   mov(%rdx),%eax
> >
> > (gcc chose to pass a pointer to fb->depth down to the function,
> > instead of fb itself, since that is the only use of it there.)
> >
> > I expect that fb is NULL; but with an average of one failure to resume
> > per day, and ~26 bits of info per crash, this is not a fast procedure!
> >
> > I notice that intel_pipe_set_base() allows for NULL fb,
> > so I'm currently running with the oops-in-rtc hackery, plus
> > -   switch (fb->depth) {
> > +   if (WARN_ON(!fb))
> > +   bpc = 8;
> > +   else switch (fb->depth) {
> >
> > There's been a fair bit of change to intel_display.c since 3.7 (if
> > my 3.7 was indeed good), mainly splitting intel_ into haswell_ versus
> > ironlake_, but I've not yet spotted anything obvious; nor yet looked
> > to see where fb would originate from anyway.
> >
> > Once I've got just a little more info out of it, I'll start another
> > thread addressed principally to the drm/gpu/i915 guys.
> 
> I think it's worth it to give them a heads-up already. So I've cc'd
> the main suspects here..

Okay, thanks.

> 
> Daniel, Dave - any comments about a NULL fb in
> intel_choose_pipe_bpp_dither() during either suspend or resume? Some
> googling shows this:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=895123

Great, yes, I'm sure that's the same (though it says "suspend"
and I say "resume").

> 
> which sounds remarkably similar, and is also during a suspend attempt
> (but apparently Satish got a full oops out).. Some timing race with a
> worker entry?
> 
> Linus

#include 
#include 
/*
 * HughD adapted the following from drivers/base/power/trace.c:
 *
 * Copyright (C) 2006 Linus Torvalds
 *
 * Trace facility for suspend/resume problems, when none of the
 * devices may be working.
 *
 * Horrid, horrid, horrid.
 *
 * It turns out that the _only_ piece of hardware that actually
 * keeps its value across a hard boot (and, more importantly, the
 * POST init sequence) is literally the realtime clock.
 *
 * Never mind that an RTC chip 

Debugging Thinkpad T430s occasional suspend failure.

2013-02-16 Thread Linus Torvalds
On Sat, Feb 16, 2013 at 1:45 PM, Hugh Dickins  wrote:
>
> I hacked around on your PM_TRACE set_magic_time() / read_magic_time()
> yesterday, to save an oopsing core kernel ip there, instead of hashed
> pm trace info (it makes sense in this case to invert your sequence,
> putting the high order into years and the low order into minutes).

That sounds like a good idea in general. The PM_TRACE() thing was done
to figure out things that locked up the PCI bus etc, but encoding the
oopses during suspend sounds like a really good idea too.

Is your patch clean enough to just be made part of the standard
PM_TRACE infrastructure, or was it something really hacky and one-off?

> Rewarded last night by reboot to Feb 21 14:45:53 2006.  Which is
> 812d60ed intel_choose_pipe_bpp_dither.isra.13+0x216/0x2d6
>
> /home/hugh/3087X/drivers/gpu/drm/i915/intel_display.c:4159
>  * enable dithering as needed, but that costs bandwidth.  So choose
>  * the minimum value that expresses the full color range of the fb but
>  * also stays within the max display bpc discovered above.
>  */
>
> switch (fb->depth) {
> 812d60e9:   48 8b 55 c0 mov-0x40(%rbp),%rdx
> 812d60ed:   8b 02   mov(%rdx),%eax
>
> (gcc chose to pass a pointer to fb->depth down to the function,
> instead of fb itself, since that is the only use of it there.)
>
> I expect that fb is NULL; but with an average of one failure to resume
> per day, and ~26 bits of info per crash, this is not a fast procedure!
>
> I notice that intel_pipe_set_base() allows for NULL fb,
> so I'm currently running with the oops-in-rtc hackery, plus
> -   switch (fb->depth) {
> +   if (WARN_ON(!fb))
> +   bpc = 8;
> +   else switch (fb->depth) {
>
> There's been a fair bit of change to intel_display.c since 3.7 (if
> my 3.7 was indeed good), mainly splitting intel_ into haswell_ versus
> ironlake_, but I've not yet spotted anything obvious; nor yet looked
> to see where fb would originate from anyway.
>
> Once I've got just a little more info out of it, I'll start another
> thread addressed principally to the drm/gpu/i915 guys.

I think it's worth it to give them a heads-up already. So I've cc'd
the main suspects here..

Daniel, Dave - any comments about a NULL fb in
intel_choose_pipe_bpp_dither() during either suspend or resume? Some
googling shows this:

https://bugzilla.redhat.com/show_bug.cgi?id=895123

which sounds remarkably similar, and is also during a suspend attempt
(but apparently Satish got a full oops out).. Some timing race with a
worker entry?

Linus


Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-16 Thread Linus Torvalds
On Sat, Feb 16, 2013 at 1:45 PM, Hugh Dickins hu...@google.com wrote:

 I hacked around on your PM_TRACE set_magic_time() / read_magic_time()
 yesterday, to save an oopsing core kernel ip there, instead of hashed
 pm trace info (it makes sense in this case to invert your sequence,
 putting the high order into years and the low order into minutes).

That sounds like a good idea in general. The PM_TRACE() thing was done
to figure out things that locked up the PCI bus etc, but encoding the
oopses during suspend sounds like a really good idea too.

Is your patch clean enough to just be made part of the standard
PM_TRACE infrastructure, or was it something really hacky and one-off?

 Rewarded last night by reboot to Feb 21 14:45:53 2006.  Which is
 812d60ed intel_choose_pipe_bpp_dither.isra.13+0x216/0x2d6

 /home/hugh/3087X/drivers/gpu/drm/i915/intel_display.c:4159
  * enable dithering as needed, but that costs bandwidth.  So choose
  * the minimum value that expresses the full color range of the fb but
  * also stays within the max display bpc discovered above.
  */

 switch (fb-depth) {
 812d60e9:   48 8b 55 c0 mov-0x40(%rbp),%rdx
 812d60ed:   8b 02   mov(%rdx),%eax

 (gcc chose to pass a pointer to fb-depth down to the function,
 instead of fb itself, since that is the only use of it there.)

 I expect that fb is NULL; but with an average of one failure to resume
 per day, and ~26 bits of info per crash, this is not a fast procedure!

 I notice that intel_pipe_set_base() allows for NULL fb,
 so I'm currently running with the oops-in-rtc hackery, plus
 -   switch (fb-depth) {
 +   if (WARN_ON(!fb))
 +   bpc = 8;
 +   else switch (fb-depth) {

 There's been a fair bit of change to intel_display.c since 3.7 (if
 my 3.7 was indeed good), mainly splitting intel_ into haswell_ versus
 ironlake_, but I've not yet spotted anything obvious; nor yet looked
 to see where fb would originate from anyway.

 Once I've got just a little more info out of it, I'll start another
 thread addressed principally to the drm/gpu/i915 guys.

I think it's worth it to give them a heads-up already. So I've cc'd
the main suspects here..

Daniel, Dave - any comments about a NULL fb in
intel_choose_pipe_bpp_dither() during either suspend or resume? Some
googling shows this:

https://bugzilla.redhat.com/show_bug.cgi?id=895123

which sounds remarkably similar, and is also during a suspend attempt
(but apparently Satish got a full oops out).. Some timing race with a
worker entry?

Linus
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel