[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Thu, Jun 13, 2013 at 2:22 PM, Andy Lutomirski wrote: > On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse wrote: >> Andy can you test (without your patch) and see if it helps with your issue : >> http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch > > Testing now. I'll report back in a couple of days. > 3.9.4 plus this patch has been completely stable for several days now. Tested-by: Andy Lutomirski Can you send this to Linux and -stable? Thanks, Andy
Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Thu, Jun 13, 2013 at 2:22 PM, Andy Lutomirski l...@amacapital.net wrote: On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse j.gli...@gmail.com wrote: Andy can you test (without your patch) and see if it helps with your issue : http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch Testing now. I'll report back in a couple of days. 3.9.4 plus this patch has been completely stable for several days now. Tested-by: Andy Lutomirski l...@amacapital.net Can you send this to Linux and -stable? Thanks, Andy ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse j.gli...@gmail.com wrote: On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer mic...@daenzer.net wrote: On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski l...@amacapital.net [...] diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected? Yes it will if you wait for a fence, because the fence timeout wait is way smaller than 2sec so radeon_ring_is_lockup get call several time, which call radeon_ring_force_activity and then radeon_ring_test_lockup. This also means it very very very unlikely (see below for the likely case) to have a wrap around that give last rptr same as current one. The likely case is when you have something like a long compute, then nothing is lockup but you keep filling ring with radeon_ring_force_activity but the cp is still stuck on the ib of the compute stuff so rptr does not progress. Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring. Maybe but i still don't think it should matter. Andy can you test (without your patch) and see if it helps with your issue : http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch Testing now. I'll report back in a couple of days. I don't think that long computes have anything to do with it. The bogus lockups happen when I look away from my computer for a while and then click something. I thing the graphics are usually completely idle when this happens. AFAIK I've never run an OpenCL or similar application on this system. --Andy ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse wrote: > On Wed, Jun 12, 2013 at 6:26 AM, Michel D?nzer wrote: >> On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: >>> If the device is idle for over ten seconds, then the next attempt to do >>> anything can race with the lockup detector and cause a bogus lockup >>> to be detected. >>> >>> Oddly, the situation is well-described in the lockup detector's comments >>> and a fix is even described. This patch implements that fix (and corrects >>> some typos in the description). >>> >>> My system has been stable for about a week running this code. Without this, >>> my screen would go blank every now and then and, when it came back, >>> everything >>> would be remarkably slow (the latter is a separate bug). >>> >>> Signed-off-by: Andy Lutomirski >> >> [...] >> >>> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c >>> b/drivers/gpu/drm/radeon/radeon_ring.c >>> index 1ef5eaa..fb7b3ea 100644 >>> --- a/drivers/gpu/drm/radeon/radeon_ring.c >>> +++ b/drivers/gpu/drm/radeon/radeon_ring.c >>> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring >>> *ring) >>> * have CP rptr to a different value of jiffies wrap around which will >>> force >>> * initialization of the lockup tracking informations. >>> * >>> - * A possible false positivie is if we get call after while and >>> last_cp_rptr == >>> - * the current CP rptr, even if it's unlikely it might happen. To avoid >>> this >>> - * if the elapsed time since last call is bigger than 2 second than we >>> return >>> - * false and update the tracking information. Due to this the caller must >>> call >>> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be >>> reported >>> - * the fencing code should be cautious about that. >>> + * A possible false positive is if we get called after a while and >>> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might >>> + * happen. To avoid this if the elapsed time since the last call is bigger >>> + * than 2 second then we return false and update the tracking >>> + * information. Due to this the caller must call radeon_ring_test_lockup >>> + * more frequently than once every 2s when waiting. >> >> Is it guaranteed that radeon_ring_test_lockup will be called more often >> than every 2s when waiting? If not, this change might prevent a real >> lockup from being detected? > > Yes it will if you wait for a fence, because the fence timeout wait is > way smaller than 2sec so radeon_ring_is_lockup get call several time, > which call radeon_ring_force_activity and then > radeon_ring_test_lockup. > > This also means it very very very unlikely (see below for the likely > case) to have a wrap around that give last rptr same as current one. > > The likely case is when you have something like a long compute, then > nothing is lockup but you keep filling ring with > radeon_ring_force_activity but the cp is still stuck on the ib of the > compute stuff so rptr does not progress. > >> Either way, I wonder if there might not be a simpler solution to the >> problem, e.g. by updating last_activity when submitting commands to a >> previously empty ring. > > Maybe but i still don't think it should matter. > > Andy can you test (without your patch) and see if it helps with your issue : > http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch Testing now. I'll report back in a couple of days. I don't think that long computes have anything to do with it. The bogus lockups happen when I look away from my computer for a while and then click something. I thing the graphics are usually completely idle when this happens. AFAIK I've never run an OpenCL or similar application on this system. --Andy
[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
Am 12.06.2013 12:26, schrieb Michel D?nzer: > On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: >> If the device is idle for over ten seconds, then the next attempt to do >> anything can race with the lockup detector and cause a bogus lockup >> to be detected. >> >> Oddly, the situation is well-described in the lockup detector's comments >> and a fix is even described. This patch implements that fix (and corrects >> some typos in the description). >> >> My system has been stable for about a week running this code. Without this, >> my screen would go blank every now and then and, when it came back, >> everything >> would be remarkably slow (the latter is a separate bug). >> >> Signed-off-by: Andy Lutomirski > [...] > >> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c >> b/drivers/gpu/drm/radeon/radeon_ring.c >> index 1ef5eaa..fb7b3ea 100644 >> --- a/drivers/gpu/drm/radeon/radeon_ring.c >> +++ b/drivers/gpu/drm/radeon/radeon_ring.c >> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring >> *ring) >>* have CP rptr to a different value of jiffies wrap around which will >> force >>* initialization of the lockup tracking informations. >>* >> - * A possible false positivie is if we get call after while and >> last_cp_rptr == >> - * the current CP rptr, even if it's unlikely it might happen. To avoid this >> - * if the elapsed time since last call is bigger than 2 second than we >> return >> - * false and update the tracking information. Due to this the caller must >> call >> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be >> reported >> - * the fencing code should be cautious about that. >> + * A possible false positive is if we get called after a while and >> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might >> + * happen. To avoid this if the elapsed time since the last call is bigger >> + * than 2 second then we return false and update the tracking >> + * information. Due to this the caller must call radeon_ring_test_lockup >> + * more frequently than once every 2s when waiting. > Is it guaranteed that radeon_ring_test_lockup will be called more often > than every 2s when waiting? If not, this change might prevent a real > lockup from being detected? > > Either way, I wonder if there might not be a simpler solution to the > problem, e.g. by updating last_activity when submitting commands to a > previously empty ring. If I remember correctly that's exactly what I used to do when creating radeon_ring_test_lockup, and now I'm really wondering why that stuff isn't there any more. Anyway the problem you describe should only happen very very rarely in case of a wraparound, so I'm pretty sure that we have a different problem that's just masked by that patch. Christian.
[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: > If the device is idle for over ten seconds, then the next attempt to do > anything can race with the lockup detector and cause a bogus lockup > to be detected. > > Oddly, the situation is well-described in the lockup detector's comments > and a fix is even described. This patch implements that fix (and corrects > some typos in the description). > > My system has been stable for about a week running this code. Without this, > my screen would go blank every now and then and, when it came back, everything > would be remarkably slow (the latter is a separate bug). > > Signed-off-by: Andy Lutomirski [...] > diff --git a/drivers/gpu/drm/radeon/radeon_ring.c > b/drivers/gpu/drm/radeon/radeon_ring.c > index 1ef5eaa..fb7b3ea 100644 > --- a/drivers/gpu/drm/radeon/radeon_ring.c > +++ b/drivers/gpu/drm/radeon/radeon_ring.c > @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) > * have CP rptr to a different value of jiffies wrap around which will force > * initialization of the lockup tracking informations. > * > - * A possible false positivie is if we get call after while and last_cp_rptr > == > - * the current CP rptr, even if it's unlikely it might happen. To avoid this > - * if the elapsed time since last call is bigger than 2 second than we return > - * false and update the tracking information. Due to this the caller must > call > - * radeon_ring_test_lockup several time in less than 2sec for lockup to be > reported > - * the fencing code should be cautious about that. > + * A possible false positive is if we get called after a while and > + * last_cp_rptr == the current CP rptr, even if it's unlikely it might > + * happen. To avoid this if the elapsed time since the last call is bigger > + * than 2 second then we return false and update the tracking > + * information. Due to this the caller must call radeon_ring_test_lockup > + * more frequently than once every 2s when waiting. Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected? Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring. -- Earthling Michel D?nzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer
[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Wed, Jun 12, 2013 at 6:26 AM, Michel D?nzer wrote: > On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: >> If the device is idle for over ten seconds, then the next attempt to do >> anything can race with the lockup detector and cause a bogus lockup >> to be detected. >> >> Oddly, the situation is well-described in the lockup detector's comments >> and a fix is even described. This patch implements that fix (and corrects >> some typos in the description). >> >> My system has been stable for about a week running this code. Without this, >> my screen would go blank every now and then and, when it came back, >> everything >> would be remarkably slow (the latter is a separate bug). >> >> Signed-off-by: Andy Lutomirski > > [...] > >> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c >> b/drivers/gpu/drm/radeon/radeon_ring.c >> index 1ef5eaa..fb7b3ea 100644 >> --- a/drivers/gpu/drm/radeon/radeon_ring.c >> +++ b/drivers/gpu/drm/radeon/radeon_ring.c >> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring >> *ring) >> * have CP rptr to a different value of jiffies wrap around which will force >> * initialization of the lockup tracking informations. >> * >> - * A possible false positivie is if we get call after while and >> last_cp_rptr == >> - * the current CP rptr, even if it's unlikely it might happen. To avoid this >> - * if the elapsed time since last call is bigger than 2 second than we >> return >> - * false and update the tracking information. Due to this the caller must >> call >> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be >> reported >> - * the fencing code should be cautious about that. >> + * A possible false positive is if we get called after a while and >> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might >> + * happen. To avoid this if the elapsed time since the last call is bigger >> + * than 2 second then we return false and update the tracking >> + * information. Due to this the caller must call radeon_ring_test_lockup >> + * more frequently than once every 2s when waiting. > > Is it guaranteed that radeon_ring_test_lockup will be called more often > than every 2s when waiting? If not, this change might prevent a real > lockup from being detected? Yes it will if you wait for a fence, because the fence timeout wait is way smaller than 2sec so radeon_ring_is_lockup get call several time, which call radeon_ring_force_activity and then radeon_ring_test_lockup. This also means it very very very unlikely (see below for the likely case) to have a wrap around that give last rptr same as current one. The likely case is when you have something like a long compute, then nothing is lockup but you keep filling ring with radeon_ring_force_activity but the cp is still stuck on the ib of the compute stuff so rptr does not progress. > Either way, I wonder if there might not be a simpler solution to the > problem, e.g. by updating last_activity when submitting commands to a > previously empty ring. Maybe but i still don't think it should matter. Andy can you test (without your patch) and see if it helps with your issue : http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch Cheers, Jerome
Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski l...@amacapital.net [...] diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected? Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring. -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity
Am 12.06.2013 12:26, schrieb Michel Dänzer: On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski l...@amacapital.net [...] diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected? Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring. If I remember correctly that's exactly what I used to do when creating radeon_ring_test_lockup, and now I'm really wondering why that stuff isn't there any more. Anyway the problem you describe should only happen very very rarely in case of a wraparound, so I'm pretty sure that we have a different problem that's just masked by that patch. Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer mic...@daenzer.net wrote: On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski l...@amacapital.net [...] diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected? Yes it will if you wait for a fence, because the fence timeout wait is way smaller than 2sec so radeon_ring_is_lockup get call several time, which call radeon_ring_force_activity and then radeon_ring_test_lockup. This also means it very very very unlikely (see below for the likely case) to have a wrap around that give last rptr same as current one. The likely case is when you have something like a long compute, then nothing is lockup but you keep filling ring with radeon_ring_force_activity but the cp is still stuck on the ib of the compute stuff so rptr does not progress. Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring. Maybe but i still don't think it should matter. Andy can you test (without your patch) and see if it helps with your issue : http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch Cheers, Jerome ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski --- This may be -stable material. drivers/gpu/drm/radeon/radeon.h | 1 + drivers/gpu/drm/radeon/radeon_ring.c | 23 --- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 8263af3..9de5778 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -652,6 +652,7 @@ struct radeon_ring { unsignedring_free_dw; int count_dw; unsigned long last_activity; + unsigned long last_test_lockup; unsignedlast_rptr; uint64_tgpu_addr; uint32_talign_mask; diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. * * Caller should write to the ring to force CP to do something so we don't get * false positive when CP is just gived nothing to do. @@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) **/ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *ring) { - unsigned long cjiffies, elapsed; + unsigned long cjiffies, elapsed, last_test; uint32_t rptr; cjiffies = jiffies; + + last_test = ring->last_test_lockup; + ring->last_test_lockup = cjiffies; + if (!time_after(cjiffies, ring->last_activity)) { /* likely a wrap around */ radeon_ring_lockup_update(ring); @@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *rin radeon_ring_lockup_update(ring); return false; } + if (cjiffies - last_test > 2 * HZ) { + /* Possible race -- see comment above */ + radeon_ring_lockup_update(ring); + return false; + } elapsed = jiffies_to_msecs(cjiffies - ring->last_activity); if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) { dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed); -- 1.8.1.4
[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski l...@amacapital.net --- This may be -stable material. drivers/gpu/drm/radeon/radeon.h | 1 + drivers/gpu/drm/radeon/radeon_ring.c | 23 --- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 8263af3..9de5778 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -652,6 +652,7 @@ struct radeon_ring { unsignedring_free_dw; int count_dw; unsigned long last_activity; + unsigned long last_test_lockup; unsignedlast_rptr; uint64_tgpu_addr; uint32_talign_mask; diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. * * Caller should write to the ring to force CP to do something so we don't get * false positive when CP is just gived nothing to do. @@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) **/ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *ring) { - unsigned long cjiffies, elapsed; + unsigned long cjiffies, elapsed, last_test; uint32_t rptr; cjiffies = jiffies; + + last_test = ring-last_test_lockup; + ring-last_test_lockup = cjiffies; + if (!time_after(cjiffies, ring-last_activity)) { /* likely a wrap around */ radeon_ring_lockup_update(ring); @@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *rin radeon_ring_lockup_update(ring); return false; } + if (cjiffies - last_test 2 * HZ) { + /* Possible race -- see comment above */ + radeon_ring_lockup_update(ring); + return false; + } elapsed = jiffies_to_msecs(cjiffies - ring-last_activity); if (radeon_lockup_timeout elapsed = radeon_lockup_timeout) { dev_err(rdev-dev, GPU lockup CP stall for more than %lumsec\n, elapsed); -- 1.8.1.4 ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel