[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-18 Thread Andy Lutomirski
On Thu, Jun 13, 2013 at 2:22 PM, Andy Lutomirski  wrote:
> On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse  wrote:
>> Andy can you test (without your patch) and see if it helps with your issue :
>> http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch
>
> Testing now.  I'll report back in a couple of days.
>

3.9.4 plus this patch has been completely stable for several days now.

Tested-by: Andy Lutomirski 

Can you send this to Linux and -stable?

Thanks,
Andy


Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-18 Thread Andy Lutomirski
On Thu, Jun 13, 2013 at 2:22 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse j.gli...@gmail.com wrote:
 Andy can you test (without your patch) and see if it helps with your issue :
 http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch

 Testing now.  I'll report back in a couple of days.


3.9.4 plus this patch has been completely stable for several days now.

Tested-by: Andy Lutomirski l...@amacapital.net

Can you send this to Linux and -stable?

Thanks,
Andy
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-14 Thread Andy Lutomirski
On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse j.gli...@gmail.com wrote:
 On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer mic...@daenzer.net wrote:
 On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:
 If the device is idle for over ten seconds, then the next attempt to do
 anything can race with the lockup detector and cause a bogus lockup
 to be detected.

 Oddly, the situation is well-described in the lockup detector's comments
 and a fix is even described.  This patch implements that fix (and corrects
 some typos in the description).

 My system has been stable for about a week running this code.  Without this,
 my screen would go blank every now and then and, when it came back, 
 everything
 would be remarkably slow (the latter is a separate bug).

 Signed-off-by: Andy Lutomirski l...@amacapital.net

 [...]

 diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
 b/drivers/gpu/drm/radeon/radeon_ring.c
 index 1ef5eaa..fb7b3ea 100644
 --- a/drivers/gpu/drm/radeon/radeon_ring.c
 +++ b/drivers/gpu/drm/radeon/radeon_ring.c
 @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring 
 *ring)
   * have CP rptr to a different value of jiffies wrap around which will 
 force
   * initialization of the lockup tracking informations.
   *
 - * A possible false positivie is if we get call after while and 
 last_cp_rptr ==
 - * the current CP rptr, even if it's unlikely it might happen. To avoid 
 this
 - * if the elapsed time since last call is bigger than 2 second than we 
 return
 - * false and update the tracking information. Due to this the caller must 
 call
 - * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
 reported
 - * the fencing code should be cautious about that.
 + * A possible false positive is if we get called after a while and
 + * last_cp_rptr == the current CP rptr, even if it's unlikely it might
 + * happen. To avoid this if the elapsed time since the last call is bigger
 + * than 2 second then we return false and update the tracking
 + * information. Due to this the caller must call radeon_ring_test_lockup
 + * more frequently than once every 2s when waiting.

 Is it guaranteed that radeon_ring_test_lockup will be called more often
 than every 2s when waiting? If not, this change might prevent a real
 lockup from being detected?

 Yes it will if you wait for a fence, because the fence timeout wait is
 way smaller than 2sec so radeon_ring_is_lockup get call several time,
 which call radeon_ring_force_activity and then
 radeon_ring_test_lockup.

 This also means it very very very unlikely (see below for the likely
 case) to have a wrap around that give last rptr same as current one.

 The likely case is when you have something like a long compute, then
 nothing is lockup but you keep filling ring with
 radeon_ring_force_activity but the cp is still stuck on the ib of the
 compute stuff so rptr does not progress.

 Either way, I wonder if there might not be a simpler solution to the
 problem, e.g. by updating last_activity when submitting commands to a
 previously empty ring.

 Maybe but i still don't think it should matter.

 Andy can you test (without your patch) and see if it helps with your issue :
 http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch

Testing now.  I'll report back in a couple of days.

I don't think that long computes have anything to do with it.  The
bogus lockups happen when I look away from my computer for a while and
then click something.  I thing the graphics are usually completely
idle when this happens.

AFAIK I've never run an OpenCL or similar application on this system.

--Andy
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-13 Thread Andy Lutomirski
On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse  wrote:
> On Wed, Jun 12, 2013 at 6:26 AM, Michel D?nzer  wrote:
>> On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:
>>> If the device is idle for over ten seconds, then the next attempt to do
>>> anything can race with the lockup detector and cause a bogus lockup
>>> to be detected.
>>>
>>> Oddly, the situation is well-described in the lockup detector's comments
>>> and a fix is even described.  This patch implements that fix (and corrects
>>> some typos in the description).
>>>
>>> My system has been stable for about a week running this code.  Without this,
>>> my screen would go blank every now and then and, when it came back, 
>>> everything
>>> would be remarkably slow (the latter is a separate bug).
>>>
>>> Signed-off-by: Andy Lutomirski 
>>
>> [...]
>>
>>> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
>>> b/drivers/gpu/drm/radeon/radeon_ring.c
>>> index 1ef5eaa..fb7b3ea 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_ring.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_ring.c
>>> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring 
>>> *ring)
>>>   * have CP rptr to a different value of jiffies wrap around which will 
>>> force
>>>   * initialization of the lockup tracking informations.
>>>   *
>>> - * A possible false positivie is if we get call after while and 
>>> last_cp_rptr ==
>>> - * the current CP rptr, even if it's unlikely it might happen. To avoid 
>>> this
>>> - * if the elapsed time since last call is bigger than 2 second than we 
>>> return
>>> - * false and update the tracking information. Due to this the caller must 
>>> call
>>> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
>>> reported
>>> - * the fencing code should be cautious about that.
>>> + * A possible false positive is if we get called after a while and
>>> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might
>>> + * happen. To avoid this if the elapsed time since the last call is bigger
>>> + * than 2 second then we return false and update the tracking
>>> + * information. Due to this the caller must call radeon_ring_test_lockup
>>> + * more frequently than once every 2s when waiting.
>>
>> Is it guaranteed that radeon_ring_test_lockup will be called more often
>> than every 2s when waiting? If not, this change might prevent a real
>> lockup from being detected?
>
> Yes it will if you wait for a fence, because the fence timeout wait is
> way smaller than 2sec so radeon_ring_is_lockup get call several time,
> which call radeon_ring_force_activity and then
> radeon_ring_test_lockup.
>
> This also means it very very very unlikely (see below for the likely
> case) to have a wrap around that give last rptr same as current one.
>
> The likely case is when you have something like a long compute, then
> nothing is lockup but you keep filling ring with
> radeon_ring_force_activity but the cp is still stuck on the ib of the
> compute stuff so rptr does not progress.
>
>> Either way, I wonder if there might not be a simpler solution to the
>> problem, e.g. by updating last_activity when submitting commands to a
>> previously empty ring.
>
> Maybe but i still don't think it should matter.
>
> Andy can you test (without your patch) and see if it helps with your issue :
> http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch

Testing now.  I'll report back in a couple of days.

I don't think that long computes have anything to do with it.  The
bogus lockups happen when I look away from my computer for a while and
then click something.  I thing the graphics are usually completely
idle when this happens.

AFAIK I've never run an OpenCL or similar application on this system.

--Andy


[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-12 Thread Christian König
Am 12.06.2013 12:26, schrieb Michel D?nzer:
> On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:
>> If the device is idle for over ten seconds, then the next attempt to do
>> anything can race with the lockup detector and cause a bogus lockup
>> to be detected.
>>
>> Oddly, the situation is well-described in the lockup detector's comments
>> and a fix is even described.  This patch implements that fix (and corrects
>> some typos in the description).
>>
>> My system has been stable for about a week running this code.  Without this,
>> my screen would go blank every now and then and, when it came back, 
>> everything
>> would be remarkably slow (the latter is a separate bug).
>>
>> Signed-off-by: Andy Lutomirski 
> [...]
>
>> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
>> b/drivers/gpu/drm/radeon/radeon_ring.c
>> index 1ef5eaa..fb7b3ea 100644
>> --- a/drivers/gpu/drm/radeon/radeon_ring.c
>> +++ b/drivers/gpu/drm/radeon/radeon_ring.c
>> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring 
>> *ring)
>>* have CP rptr to a different value of jiffies wrap around which will 
>> force
>>* initialization of the lockup tracking informations.
>>*
>> - * A possible false positivie is if we get call after while and 
>> last_cp_rptr ==
>> - * the current CP rptr, even if it's unlikely it might happen. To avoid this
>> - * if the elapsed time since last call is bigger than 2 second than we 
>> return
>> - * false and update the tracking information. Due to this the caller must 
>> call
>> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
>> reported
>> - * the fencing code should be cautious about that.
>> + * A possible false positive is if we get called after a while and
>> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might
>> + * happen. To avoid this if the elapsed time since the last call is bigger
>> + * than 2 second then we return false and update the tracking
>> + * information. Due to this the caller must call radeon_ring_test_lockup
>> + * more frequently than once every 2s when waiting.
> Is it guaranteed that radeon_ring_test_lockup will be called more often
> than every 2s when waiting? If not, this change might prevent a real
> lockup from being detected?
>
> Either way, I wonder if there might not be a simpler solution to the
> problem, e.g. by updating last_activity when submitting commands to a
> previously empty ring.

If I remember correctly that's exactly what I used to do when creating 
radeon_ring_test_lockup, and now I'm really wondering why that stuff 
isn't there any more.

Anyway the problem you describe should only happen very very rarely in 
case of a wraparound, so I'm pretty sure that we have a different 
problem that's just masked by that patch.

Christian.


[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-12 Thread Michel Dänzer
On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:
> If the device is idle for over ten seconds, then the next attempt to do
> anything can race with the lockup detector and cause a bogus lockup
> to be detected.
> 
> Oddly, the situation is well-described in the lockup detector's comments
> and a fix is even described.  This patch implements that fix (and corrects
> some typos in the description).
> 
> My system has been stable for about a week running this code.  Without this,
> my screen would go blank every now and then and, when it came back, everything
> would be remarkably slow (the latter is a separate bug).
> 
> Signed-off-by: Andy Lutomirski 

[...]

> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
> b/drivers/gpu/drm/radeon/radeon_ring.c
> index 1ef5eaa..fb7b3ea 100644
> --- a/drivers/gpu/drm/radeon/radeon_ring.c
> +++ b/drivers/gpu/drm/radeon/radeon_ring.c
> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
>   * have CP rptr to a different value of jiffies wrap around which will force
>   * initialization of the lockup tracking informations.
>   *
> - * A possible false positivie is if we get call after while and last_cp_rptr 
> ==
> - * the current CP rptr, even if it's unlikely it might happen. To avoid this
> - * if the elapsed time since last call is bigger than 2 second than we return
> - * false and update the tracking information. Due to this the caller must 
> call
> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
> reported
> - * the fencing code should be cautious about that.
> + * A possible false positive is if we get called after a while and
> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might
> + * happen. To avoid this if the elapsed time since the last call is bigger
> + * than 2 second then we return false and update the tracking
> + * information. Due to this the caller must call radeon_ring_test_lockup
> + * more frequently than once every 2s when waiting.

Is it guaranteed that radeon_ring_test_lockup will be called more often
than every 2s when waiting? If not, this change might prevent a real
lockup from being detected?

Either way, I wonder if there might not be a simpler solution to the
problem, e.g. by updating last_activity when submitting commands to a
previously empty ring.


-- 
Earthling Michel D?nzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer


[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-12 Thread Jerome Glisse
On Wed, Jun 12, 2013 at 6:26 AM, Michel D?nzer  wrote:
> On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:
>> If the device is idle for over ten seconds, then the next attempt to do
>> anything can race with the lockup detector and cause a bogus lockup
>> to be detected.
>>
>> Oddly, the situation is well-described in the lockup detector's comments
>> and a fix is even described.  This patch implements that fix (and corrects
>> some typos in the description).
>>
>> My system has been stable for about a week running this code.  Without this,
>> my screen would go blank every now and then and, when it came back, 
>> everything
>> would be remarkably slow (the latter is a separate bug).
>>
>> Signed-off-by: Andy Lutomirski 
>
> [...]
>
>> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
>> b/drivers/gpu/drm/radeon/radeon_ring.c
>> index 1ef5eaa..fb7b3ea 100644
>> --- a/drivers/gpu/drm/radeon/radeon_ring.c
>> +++ b/drivers/gpu/drm/radeon/radeon_ring.c
>> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring 
>> *ring)
>>   * have CP rptr to a different value of jiffies wrap around which will force
>>   * initialization of the lockup tracking informations.
>>   *
>> - * A possible false positivie is if we get call after while and 
>> last_cp_rptr ==
>> - * the current CP rptr, even if it's unlikely it might happen. To avoid this
>> - * if the elapsed time since last call is bigger than 2 second than we 
>> return
>> - * false and update the tracking information. Due to this the caller must 
>> call
>> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
>> reported
>> - * the fencing code should be cautious about that.
>> + * A possible false positive is if we get called after a while and
>> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might
>> + * happen. To avoid this if the elapsed time since the last call is bigger
>> + * than 2 second then we return false and update the tracking
>> + * information. Due to this the caller must call radeon_ring_test_lockup
>> + * more frequently than once every 2s when waiting.
>
> Is it guaranteed that radeon_ring_test_lockup will be called more often
> than every 2s when waiting? If not, this change might prevent a real
> lockup from being detected?

Yes it will if you wait for a fence, because the fence timeout wait is
way smaller than 2sec so radeon_ring_is_lockup get call several time,
which call radeon_ring_force_activity and then
radeon_ring_test_lockup.

This also means it very very very unlikely (see below for the likely
case) to have a wrap around that give last rptr same as current one.

The likely case is when you have something like a long compute, then
nothing is lockup but you keep filling ring with
radeon_ring_force_activity but the cp is still stuck on the ib of the
compute stuff so rptr does not progress.

> Either way, I wonder if there might not be a simpler solution to the
> problem, e.g. by updating last_activity when submitting commands to a
> previously empty ring.

Maybe but i still don't think it should matter.

Andy can you test (without your patch) and see if it helps with your issue :
http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch

Cheers,
Jerome


Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-12 Thread Michel Dänzer
On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:
 If the device is idle for over ten seconds, then the next attempt to do
 anything can race with the lockup detector and cause a bogus lockup
 to be detected.
 
 Oddly, the situation is well-described in the lockup detector's comments
 and a fix is even described.  This patch implements that fix (and corrects
 some typos in the description).
 
 My system has been stable for about a week running this code.  Without this,
 my screen would go blank every now and then and, when it came back, everything
 would be remarkably slow (the latter is a separate bug).
 
 Signed-off-by: Andy Lutomirski l...@amacapital.net

[...]

 diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
 b/drivers/gpu/drm/radeon/radeon_ring.c
 index 1ef5eaa..fb7b3ea 100644
 --- a/drivers/gpu/drm/radeon/radeon_ring.c
 +++ b/drivers/gpu/drm/radeon/radeon_ring.c
 @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
   * have CP rptr to a different value of jiffies wrap around which will force
   * initialization of the lockup tracking informations.
   *
 - * A possible false positivie is if we get call after while and last_cp_rptr 
 ==
 - * the current CP rptr, even if it's unlikely it might happen. To avoid this
 - * if the elapsed time since last call is bigger than 2 second than we return
 - * false and update the tracking information. Due to this the caller must 
 call
 - * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
 reported
 - * the fencing code should be cautious about that.
 + * A possible false positive is if we get called after a while and
 + * last_cp_rptr == the current CP rptr, even if it's unlikely it might
 + * happen. To avoid this if the elapsed time since the last call is bigger
 + * than 2 second then we return false and update the tracking
 + * information. Due to this the caller must call radeon_ring_test_lockup
 + * more frequently than once every 2s when waiting.

Is it guaranteed that radeon_ring_test_lockup will be called more often
than every 2s when waiting? If not, this change might prevent a real
lockup from being detected?

Either way, I wonder if there might not be a simpler solution to the
problem, e.g. by updating last_activity when submitting commands to a
previously empty ring.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-12 Thread Christian König

Am 12.06.2013 12:26, schrieb Michel Dänzer:

On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:

If the device is idle for over ten seconds, then the next attempt to do
anything can race with the lockup detector and cause a bogus lockup
to be detected.

Oddly, the situation is well-described in the lockup detector's comments
and a fix is even described.  This patch implements that fix (and corrects
some typos in the description).

My system has been stable for about a week running this code.  Without this,
my screen would go blank every now and then and, when it came back, everything
would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski l...@amacapital.net

[...]


diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
b/drivers/gpu/drm/radeon/radeon_ring.c
index 1ef5eaa..fb7b3ea 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
   * have CP rptr to a different value of jiffies wrap around which will force
   * initialization of the lockup tracking informations.
   *
- * A possible false positivie is if we get call after while and last_cp_rptr ==
- * the current CP rptr, even if it's unlikely it might happen. To avoid this
- * if the elapsed time since last call is bigger than 2 second than we return
- * false and update the tracking information. Due to this the caller must call
- * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
reported
- * the fencing code should be cautious about that.
+ * A possible false positive is if we get called after a while and
+ * last_cp_rptr == the current CP rptr, even if it's unlikely it might
+ * happen. To avoid this if the elapsed time since the last call is bigger
+ * than 2 second then we return false and update the tracking
+ * information. Due to this the caller must call radeon_ring_test_lockup
+ * more frequently than once every 2s when waiting.

Is it guaranteed that radeon_ring_test_lockup will be called more often
than every 2s when waiting? If not, this change might prevent a real
lockup from being detected?

Either way, I wonder if there might not be a simpler solution to the
problem, e.g. by updating last_activity when submitting commands to a
previously empty ring.


If I remember correctly that's exactly what I used to do when creating 
radeon_ring_test_lockup, and now I'm really wondering why that stuff 
isn't there any more.


Anyway the problem you describe should only happen very very rarely in 
case of a wraparound, so I'm pretty sure that we have a different 
problem that's just masked by that patch.


Christian.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-12 Thread Jerome Glisse
On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer mic...@daenzer.net wrote:
 On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:
 If the device is idle for over ten seconds, then the next attempt to do
 anything can race with the lockup detector and cause a bogus lockup
 to be detected.

 Oddly, the situation is well-described in the lockup detector's comments
 and a fix is even described.  This patch implements that fix (and corrects
 some typos in the description).

 My system has been stable for about a week running this code.  Without this,
 my screen would go blank every now and then and, when it came back, 
 everything
 would be remarkably slow (the latter is a separate bug).

 Signed-off-by: Andy Lutomirski l...@amacapital.net

 [...]

 diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
 b/drivers/gpu/drm/radeon/radeon_ring.c
 index 1ef5eaa..fb7b3ea 100644
 --- a/drivers/gpu/drm/radeon/radeon_ring.c
 +++ b/drivers/gpu/drm/radeon/radeon_ring.c
 @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring 
 *ring)
   * have CP rptr to a different value of jiffies wrap around which will force
   * initialization of the lockup tracking informations.
   *
 - * A possible false positivie is if we get call after while and 
 last_cp_rptr ==
 - * the current CP rptr, even if it's unlikely it might happen. To avoid this
 - * if the elapsed time since last call is bigger than 2 second than we 
 return
 - * false and update the tracking information. Due to this the caller must 
 call
 - * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
 reported
 - * the fencing code should be cautious about that.
 + * A possible false positive is if we get called after a while and
 + * last_cp_rptr == the current CP rptr, even if it's unlikely it might
 + * happen. To avoid this if the elapsed time since the last call is bigger
 + * than 2 second then we return false and update the tracking
 + * information. Due to this the caller must call radeon_ring_test_lockup
 + * more frequently than once every 2s when waiting.

 Is it guaranteed that radeon_ring_test_lockup will be called more often
 than every 2s when waiting? If not, this change might prevent a real
 lockup from being detected?

Yes it will if you wait for a fence, because the fence timeout wait is
way smaller than 2sec so radeon_ring_is_lockup get call several time,
which call radeon_ring_force_activity and then
radeon_ring_test_lockup.

This also means it very very very unlikely (see below for the likely
case) to have a wrap around that give last rptr same as current one.

The likely case is when you have something like a long compute, then
nothing is lockup but you keep filling ring with
radeon_ring_force_activity but the cp is still stuck on the ib of the
compute stuff so rptr does not progress.

 Either way, I wonder if there might not be a simpler solution to the
 problem, e.g. by updating last_activity when submitting commands to a
 previously empty ring.

Maybe but i still don't think it should matter.

Andy can you test (without your patch) and see if it helps with your issue :
http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch

Cheers,
Jerome
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-11 Thread Andy Lutomirski
If the device is idle for over ten seconds, then the next attempt to do
anything can race with the lockup detector and cause a bogus lockup
to be detected.

Oddly, the situation is well-described in the lockup detector's comments
and a fix is even described.  This patch implements that fix (and corrects
some typos in the description).

My system has been stable for about a week running this code.  Without this,
my screen would go blank every now and then and, when it came back, everything
would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski 
---

This may be -stable material.

 drivers/gpu/drm/radeon/radeon.h  |  1 +
 drivers/gpu/drm/radeon/radeon_ring.c | 23 ---
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 8263af3..9de5778 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -652,6 +652,7 @@ struct radeon_ring {
unsignedring_free_dw;
int count_dw;
unsigned long   last_activity;
+   unsigned long   last_test_lockup;
unsignedlast_rptr;
uint64_tgpu_addr;
uint32_talign_mask;
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
b/drivers/gpu/drm/radeon/radeon_ring.c
index 1ef5eaa..fb7b3ea 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
  * have CP rptr to a different value of jiffies wrap around which will force
  * initialization of the lockup tracking informations.
  *
- * A possible false positivie is if we get call after while and last_cp_rptr ==
- * the current CP rptr, even if it's unlikely it might happen. To avoid this
- * if the elapsed time since last call is bigger than 2 second than we return
- * false and update the tracking information. Due to this the caller must call
- * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
reported
- * the fencing code should be cautious about that.
+ * A possible false positive is if we get called after a while and
+ * last_cp_rptr == the current CP rptr, even if it's unlikely it might
+ * happen. To avoid this if the elapsed time since the last call is bigger
+ * than 2 second then we return false and update the tracking
+ * information. Due to this the caller must call radeon_ring_test_lockup
+ * more frequently than once every 2s when waiting.
  *
  * Caller should write to the ring to force CP to do something so we don't get
  * false positive when CP is just gived nothing to do.
@@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
  **/
 bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring 
*ring)
 {
-   unsigned long cjiffies, elapsed;
+   unsigned long cjiffies, elapsed, last_test;
uint32_t rptr;

cjiffies = jiffies;
+
+   last_test = ring->last_test_lockup;
+   ring->last_test_lockup = cjiffies;
+
if (!time_after(cjiffies, ring->last_activity)) {
/* likely a wrap around */
radeon_ring_lockup_update(ring);
@@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, 
struct radeon_ring *rin
radeon_ring_lockup_update(ring);
return false;
}
+   if (cjiffies - last_test > 2 * HZ) {
+   /* Possible race -- see comment above */
+   radeon_ring_lockup_update(ring);
+   return false;
+   }
elapsed = jiffies_to_msecs(cjiffies - ring->last_activity);
if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) {
dev_err(rdev->dev, "GPU lockup CP stall for more than 
%lumsec\n", elapsed);
-- 
1.8.1.4



[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

2013-06-11 Thread Andy Lutomirski
If the device is idle for over ten seconds, then the next attempt to do
anything can race with the lockup detector and cause a bogus lockup
to be detected.

Oddly, the situation is well-described in the lockup detector's comments
and a fix is even described.  This patch implements that fix (and corrects
some typos in the description).

My system has been stable for about a week running this code.  Without this,
my screen would go blank every now and then and, when it came back, everything
would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski l...@amacapital.net
---

This may be -stable material.

 drivers/gpu/drm/radeon/radeon.h  |  1 +
 drivers/gpu/drm/radeon/radeon_ring.c | 23 ---
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 8263af3..9de5778 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -652,6 +652,7 @@ struct radeon_ring {
unsignedring_free_dw;
int count_dw;
unsigned long   last_activity;
+   unsigned long   last_test_lockup;
unsignedlast_rptr;
uint64_tgpu_addr;
uint32_talign_mask;
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
b/drivers/gpu/drm/radeon/radeon_ring.c
index 1ef5eaa..fb7b3ea 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
  * have CP rptr to a different value of jiffies wrap around which will force
  * initialization of the lockup tracking informations.
  *
- * A possible false positivie is if we get call after while and last_cp_rptr ==
- * the current CP rptr, even if it's unlikely it might happen. To avoid this
- * if the elapsed time since last call is bigger than 2 second than we return
- * false and update the tracking information. Due to this the caller must call
- * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
reported
- * the fencing code should be cautious about that.
+ * A possible false positive is if we get called after a while and
+ * last_cp_rptr == the current CP rptr, even if it's unlikely it might
+ * happen. To avoid this if the elapsed time since the last call is bigger
+ * than 2 second then we return false and update the tracking
+ * information. Due to this the caller must call radeon_ring_test_lockup
+ * more frequently than once every 2s when waiting.
  *
  * Caller should write to the ring to force CP to do something so we don't get
  * false positive when CP is just gived nothing to do.
@@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
  **/
 bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring 
*ring)
 {
-   unsigned long cjiffies, elapsed;
+   unsigned long cjiffies, elapsed, last_test;
uint32_t rptr;
 
cjiffies = jiffies;
+
+   last_test = ring-last_test_lockup;
+   ring-last_test_lockup = cjiffies;
+
if (!time_after(cjiffies, ring-last_activity)) {
/* likely a wrap around */
radeon_ring_lockup_update(ring);
@@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, 
struct radeon_ring *rin
radeon_ring_lockup_update(ring);
return false;
}
+   if (cjiffies - last_test  2 * HZ) {
+   /* Possible race -- see comment above */
+   radeon_ring_lockup_update(ring);
+   return false;
+   }
elapsed = jiffies_to_msecs(cjiffies - ring-last_activity);
if (radeon_lockup_timeout  elapsed = radeon_lockup_timeout) {
dev_err(rdev-dev, GPU lockup CP stall for more than 
%lumsec\n, elapsed);
-- 
1.8.1.4

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel