If the device is idle for over ten seconds, then the next attempt to do
anything can race with the lockup detector and cause a bogus lockup
to be detected.

Oddly, the situation is well-described in the lockup detector's comments
and a fix is even described.  This patch implements that fix (and corrects
some typos in the description).

My system has been stable for about a week running this code.  Without this,
my screen would go blank every now and then and, when it came back, everything
would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski <luto at amacapital.net>
---

This may be -stable material.

 drivers/gpu/drm/radeon/radeon.h      |  1 +
 drivers/gpu/drm/radeon/radeon_ring.c | 23 ++++++++++++++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 8263af3..9de5778 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -652,6 +652,7 @@ struct radeon_ring {
        unsigned                ring_free_dw;
        int                     count_dw;
        unsigned long           last_activity;
+       unsigned long           last_test_lockup;
        unsigned                last_rptr;
        uint64_t                gpu_addr;
        uint32_t                align_mask;
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c 
b/drivers/gpu/drm/radeon/radeon_ring.c
index 1ef5eaa..fb7b3ea 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
  * have CP rptr to a different value of jiffies wrap around which will force
  * initialization of the lockup tracking informations.
  *
- * A possible false positivie is if we get call after while and last_cp_rptr ==
- * the current CP rptr, even if it's unlikely it might happen. To avoid this
- * if the elapsed time since last call is bigger than 2 second than we return
- * false and update the tracking information. Due to this the caller must call
- * radeon_ring_test_lockup several time in less than 2sec for lockup to be 
reported
- * the fencing code should be cautious about that.
+ * A possible false positive is if we get called after a while and
+ * last_cp_rptr == the current CP rptr, even if it's unlikely it might
+ * happen. To avoid this if the elapsed time since the last call is bigger
+ * than 2 second then we return false and update the tracking
+ * information. Due to this the caller must call radeon_ring_test_lockup
+ * more frequently than once every 2s when waiting.
  *
  * Caller should write to the ring to force CP to do something so we don't get
  * false positive when CP is just gived nothing to do.
@@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
  **/
 bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring 
*ring)
 {
-       unsigned long cjiffies, elapsed;
+       unsigned long cjiffies, elapsed, last_test;
        uint32_t rptr;

        cjiffies = jiffies;
+
+       last_test = ring->last_test_lockup;
+       ring->last_test_lockup = cjiffies;
+
        if (!time_after(cjiffies, ring->last_activity)) {
                /* likely a wrap around */
                radeon_ring_lockup_update(ring);
@@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, 
struct radeon_ring *rin
                radeon_ring_lockup_update(ring);
                return false;
        }
+       if (cjiffies - last_test > 2 * HZ) {
+               /* Possible race -- see comment above */
+               radeon_ring_lockup_update(ring);
+               return false;
+       }
        elapsed = jiffies_to_msecs(cjiffies - ring->last_activity);
        if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) {
                dev_err(rdev->dev, "GPU lockup CP stall for more than 
%lumsec\n", elapsed);
-- 
1.8.1.4

Reply via email to