[SPAM] [PATCH 1/3] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection V2

2010-03-02 Thread y
From: Jerome Glisse jgli...@redhat.com

This patch cleanup the fence code, it drops the timeout field of
fence as the time to complete each IB is unpredictable and shouldn't
be bound.

The fence cleanup lead to GPU lockup detection improvement, this
patch introduce a callback, allowing to do asic specific test for
lockup detection. In this patch the CP is use as a first indicator
of GPU lockup. If CP doesn't make progress during 1second we assume
we are facing a GPU lockup.

To avoid overhead of testing GPU lockup frequently due to fence
taking time to be signaled we query the lockup callback every
500msec. There is plenty code comment explaining the design  choise
inside the code.

This have been tested mostly on R3XX/R5XX hw, in normal running
destkop (compiz firefox, quake3 running) the lockup callback wasn't
call once (1 hour session). Also tested with forcing GPU lockup and
lockup was reported after the 1s CP activity timeout.

V2 switch to 500ms timeout so GPU lockup get call at least 2 times
   in less than 2sec.

Signed-off-by: Jerome Glisse jgli...@redhat.com
---
 drivers/gpu/drm/radeon/evergreen.c|6 ++
 drivers/gpu/drm/radeon/r100.c |   84 +++
 drivers/gpu/drm/radeon/r300.c |   27 -
 drivers/gpu/drm/radeon/r600.c |   33 +-
 drivers/gpu/drm/radeon/radeon.h   |  102 ++--
 drivers/gpu/drm/radeon/radeon_asic.h  |   20 ++-
 drivers/gpu/drm/radeon/radeon_fence.c |   75 +---
 drivers/gpu/drm/radeon/rv770.c|6 --
 8 files changed, 248 insertions(+), 105 deletions(-)

diff --git a/drivers/gpu/drm/radeon/evergreen.c 
b/drivers/gpu/drm/radeon/evergreen.c
index bd2e7aa..8988df7 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -490,6 +490,12 @@ int evergreen_mc_init(struct radeon_device *rdev)
return 0;
 }
 
+bool evergreen_gpu_is_lockup(struct radeon_device *rdev)
+{
+   /* FIXME: implement for evergreen */
+   return false;
+}
+
 int evergreen_gpu_reset(struct radeon_device *rdev)
 {
/* FIXME: implement for evergreen */
diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 91eb762..96a6370 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -1772,6 +1772,90 @@ int r100_rb2d_reset(struct radeon_device *rdev)
return -1;
 }
 
+void r100_gpu_lockup_update(struct r100_gpu_lockup *lockup, struct radeon_cp 
*cp)
+{
+   lockup-last_cp_rptr = cp-rptr;
+   lockup-last_jiffies = jiffies;
+}
+
+/**
+ * r100_gpu_cp_is_lockup() - check if CP is lockup by recording information
+ * @rdev:  radeon device structure
+ * @lockup:r100_gpu_lockup structure holding CP lockup tracking 
informations
+ * @cp:radeon_cp structure holding CP information
+ *
+ * We don't need to initialize the lockup tracking information as we will 
either
+ * have CP rptr to a different value of jiffies wrap around which will force
+ * initialization of the lockup tracking informations.
+ *
+ * A possible false positivie is if we get call after while and last_cp_rptr ==
+ * the current CP rptr, even if it's unlikely it might happen. To avoid this
+ * if the elapsed time since last call is bigger than 2 second than we return
+ * false and update the tracking information. Due to this the caller must call
+ * r100_gpu_cp_is_lockup several time in less than 2sec for lockup to be 
reported
+ * the fencing code should be cautious about that.
+ *
+ * Caller should write to the ring to force CP to do something so we don't get
+ * false positive when CP is just gived nothing to do.
+ *
+ **/
+bool r100_gpu_cp_is_lockup(struct radeon_device *rdev, struct r100_gpu_lockup 
*lockup, struct radeon_cp *cp)
+{
+   unsigned long cjiffies, elapsed;
+
+   cjiffies = jiffies;
+   if (!time_after(cjiffies, lockup-last_jiffies)) {
+   /* likely a wrap around */
+   lockup-last_jiffies = jiffies;
+   return false;
+   }
+   if (cp-rptr != lockup-last_cp_rptr) {
+   /* CP is still working no lockup */
+   lockup-last_cp_rptr = cp-rptr;
+   lockup-last_jiffies = jiffies;
+   return false;
+   }
+   elapsed = jiffies_to_msecs(cjiffies - lockup-last_jiffies);
+   if (elapsed = 3000) {
+   /* very likely the improbable case where current
+* rptr is equal to last recorded, a while ago, rptr
+* this is more likely a false positive update tracking
+* information which should force us to be recall at
+* latter point
+*/
+   lockup-last_cp_rptr = cp-rptr;
+   lockup-last_jiffies = jiffies;
+   return false;
+   }
+   if (elapsed = 1000) {
+   dev_err(rdev-dev, GPU lockup CP stall for more than 
%lumsec\n, elapsed);
+

Re: [SPAM] [PATCH 1/3] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection V2

2010-03-02 Thread Erik Andrén
2010/3/1  y:
 From: Jerome Glisse jgli...@redhat.com

 This patch cleanup the fence code, it drops the timeout field of
 fence as the time to complete each IB is unpredictable and shouldn't
 be bound.

 The fence cleanup lead to GPU lockup detection improvement, this
 patch introduce a callback, allowing to do asic specific test for
 lockup detection. In this patch the CP is use as a first indicator
 of GPU lockup. If CP doesn't make progress during 1second we assume
 we are facing a GPU lockup.

 To avoid overhead of testing GPU lockup frequently due to fence
 taking time to be signaled we query the lockup callback every
 500msec. There is plenty code comment explaining the design  choise
 inside the code.

 This have been tested mostly on R3XX/R5XX hw, in normal running
 destkop (compiz firefox, quake3 running) the lockup callback wasn't
 call once (1 hour session). Also tested with forcing GPU lockup and
 lockup was reported after the 1s CP activity timeout.

 V2 switch to 500ms timeout so GPU lockup get call at least 2 times
   in less than 2sec.

 Signed-off-by: Jerome Glisse jgli...@redhat.com
 ---
  drivers/gpu/drm/radeon/evergreen.c    |    6 ++
  drivers/gpu/drm/radeon/r100.c         |   84 +++
  drivers/gpu/drm/radeon/r300.c         |   27 -
  drivers/gpu/drm/radeon/r600.c         |   33 +-
  drivers/gpu/drm/radeon/radeon.h       |  102 ++--
  drivers/gpu/drm/radeon/radeon_asic.h  |   20 ++-
  drivers/gpu/drm/radeon/radeon_fence.c |   75 +---
  drivers/gpu/drm/radeon/rv770.c        |    6 --
  8 files changed, 248 insertions(+), 105 deletions(-)

 diff --git a/drivers/gpu/drm/radeon/evergreen.c 
 b/drivers/gpu/drm/radeon/evergreen.c
 index bd2e7aa..8988df7 100644
 --- a/drivers/gpu/drm/radeon/evergreen.c
 +++ b/drivers/gpu/drm/radeon/evergreen.c
 @@ -490,6 +490,12 @@ int evergreen_mc_init(struct radeon_device *rdev)
        return 0;
  }

 +bool evergreen_gpu_is_lockup(struct radeon_device *rdev)
 +{
 +       /* FIXME: implement for evergreen */
 +       return false;
 +}
 +
  int evergreen_gpu_reset(struct radeon_device *rdev)
  {
        /* FIXME: implement for evergreen */
 diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
 index 91eb762..96a6370 100644
 --- a/drivers/gpu/drm/radeon/r100.c
 +++ b/drivers/gpu/drm/radeon/r100.c
 @@ -1772,6 +1772,90 @@ int r100_rb2d_reset(struct radeon_device *rdev)
        return -1;
  }

 +void r100_gpu_lockup_update(struct r100_gpu_lockup *lockup, struct radeon_cp 
 *cp)
 +{
 +       lockup-last_cp_rptr = cp-rptr;
 +       lockup-last_jiffies = jiffies;
 +}
 +
 +/**
 + * r100_gpu_cp_is_lockup() - check if CP is lockup by recording information
 + * @rdev:      radeon device structure
 + * @lockup:    r100_gpu_lockup structure holding CP lockup tracking 
 informations
 + * @cp:                radeon_cp structure holding CP information
 + *
 + * We don't need to initialize the lockup tracking information as we will 
 either
 + * have CP rptr to a different value of jiffies wrap around which will force
 + * initialization of the lockup tracking informations.
 + *
 + * A possible false positivie is if we get call after while and last_cp_rptr 
 ==
 + * the current CP rptr, even if it's unlikely it might happen. To avoid this
 + * if the elapsed time since last call is bigger than 2 second than we return
 + * false and update the tracking information. Due to this the caller must 
 call
 + * r100_gpu_cp_is_lockup several time in less than 2sec for lockup to be 
 reported
 + * the fencing code should be cautious about that.
 + *
 + * Caller should write to the ring to force CP to do something so we don't 
 get
 + * false positive when CP is just gived nothing to do.
 + *
 + **/
 +bool r100_gpu_cp_is_lockup(struct radeon_device *rdev, struct 
 r100_gpu_lockup *lockup, struct radeon_cp *cp)
 +{
 +       unsigned long cjiffies, elapsed;
 +
 +       cjiffies = jiffies;
 +       if (!time_after(cjiffies, lockup-last_jiffies)) {
 +               /* likely a wrap around */
 +               lockup-last_jiffies = jiffies;
 +               return false;
 +       }
 +       if (cp-rptr != lockup-last_cp_rptr) {
 +               /* CP is still working no lockup */
 +               lockup-last_cp_rptr = cp-rptr;
 +               lockup-last_jiffies = jiffies;
 +               return false;
 +       }
 +       elapsed = jiffies_to_msecs(cjiffies - lockup-last_jiffies);
 +       if (elapsed = 3000) {
 +               /* very likely the improbable case where current
 +                * rptr is equal to last recorded, a while ago, rptr
 +                * this is more likely a false positive update tracking
 +                * information which should force us to be recall at
 +                * latter point
 +                */
 +               lockup-last_cp_rptr = cp-rptr;
 +               lockup-last_jiffies = jiffies;
 +               return false;
 +       }
 +    

Re: [SPAM] [PATCH 1/3] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection V2

2010-03-02 Thread Jerome Glisse
On Tue, Mar 02, 2010 at 10:13:19AM +0100, Erik Andrén wrote:
 2010/3/1  y:
  From: Jerome Glisse jgli...@redhat.com
 
  This patch cleanup the fence code, it drops the timeout field of
  fence as the time to complete each IB is unpredictable and shouldn't
  be bound.
 
  The fence cleanup lead to GPU lockup detection improvement, this
  patch introduce a callback, allowing to do asic specific test for
  lockup detection. In this patch the CP is use as a first indicator
  of GPU lockup. If CP doesn't make progress during 1second we assume
  we are facing a GPU lockup.
 
  To avoid overhead of testing GPU lockup frequently due to fence
  taking time to be signaled we query the lockup callback every
  500msec. There is plenty code comment explaining the design  choise
  inside the code.
 
  This have been tested mostly on R3XX/R5XX hw, in normal running
  destkop (compiz firefox, quake3 running) the lockup callback wasn't
  call once (1 hour session). Also tested with forcing GPU lockup and
  lockup was reported after the 1s CP activity timeout.
 
  V2 switch to 500ms timeout so GPU lockup get call at least 2 times
    in less than 2sec.
 
  Signed-off-by: Jerome Glisse jgli...@redhat.com
  ---
   drivers/gpu/drm/radeon/evergreen.c    |    6 ++
   drivers/gpu/drm/radeon/r100.c         |   84 +++
   drivers/gpu/drm/radeon/r300.c         |   27 -
   drivers/gpu/drm/radeon/r600.c         |   33 +-
   drivers/gpu/drm/radeon/radeon.h       |  102 
  ++--
   drivers/gpu/drm/radeon/radeon_asic.h  |   20 ++-
   drivers/gpu/drm/radeon/radeon_fence.c |   75 +---
   drivers/gpu/drm/radeon/rv770.c        |    6 --
   8 files changed, 248 insertions(+), 105 deletions(-)
 
  diff --git a/drivers/gpu/drm/radeon/evergreen.c 
  b/drivers/gpu/drm/radeon/evergreen.c
  index bd2e7aa..8988df7 100644
  --- a/drivers/gpu/drm/radeon/evergreen.c
  +++ b/drivers/gpu/drm/radeon/evergreen.c
  @@ -490,6 +490,12 @@ int evergreen_mc_init(struct radeon_device *rdev)
         return 0;
   }
 
  +bool evergreen_gpu_is_lockup(struct radeon_device *rdev)
  +{
  +       /* FIXME: implement for evergreen */
  +       return false;
  +}
  +
   int evergreen_gpu_reset(struct radeon_device *rdev)
   {
         /* FIXME: implement for evergreen */
  diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
  index 91eb762..96a6370 100644
  --- a/drivers/gpu/drm/radeon/r100.c
  +++ b/drivers/gpu/drm/radeon/r100.c
  @@ -1772,6 +1772,90 @@ int r100_rb2d_reset(struct radeon_device *rdev)
         return -1;
   }
 
  +void r100_gpu_lockup_update(struct r100_gpu_lockup *lockup, struct 
  radeon_cp *cp)
  +{
  +       lockup-last_cp_rptr = cp-rptr;
  +       lockup-last_jiffies = jiffies;
  +}
  +
  +/**
  + * r100_gpu_cp_is_lockup() - check if CP is lockup by recording information
  + * @rdev:      radeon device structure
  + * @lockup:    r100_gpu_lockup structure holding CP lockup tracking 
  informations
  + * @cp:                radeon_cp structure holding CP information
  + *
  + * We don't need to initialize the lockup tracking information as we will 
  either
  + * have CP rptr to a different value of jiffies wrap around which will 
  force
  + * initialization of the lockup tracking informations.
  + *
  + * A possible false positivie is if we get call after while and 
  last_cp_rptr ==
  + * the current CP rptr, even if it's unlikely it might happen. To avoid 
  this
  + * if the elapsed time since last call is bigger than 2 second than we 
  return
  + * false and update the tracking information. Due to this the caller must 
  call
  + * r100_gpu_cp_is_lockup several time in less than 2sec for lockup to be 
  reported
  + * the fencing code should be cautious about that.
  + *
  + * Caller should write to the ring to force CP to do something so we don't 
  get
  + * false positive when CP is just gived nothing to do.
  + *
  + **/
  +bool r100_gpu_cp_is_lockup(struct radeon_device *rdev, struct 
  r100_gpu_lockup *lockup, struct radeon_cp *cp)
  +{
  +       unsigned long cjiffies, elapsed;
  +
  +       cjiffies = jiffies;
  +       if (!time_after(cjiffies, lockup-last_jiffies)) {
  +               /* likely a wrap around */
  +               lockup-last_jiffies = jiffies;
  +               return false;
  +       }
  +       if (cp-rptr != lockup-last_cp_rptr) {
  +               /* CP is still working no lockup */
  +               lockup-last_cp_rptr = cp-rptr;
  +               lockup-last_jiffies = jiffies;
  +               return false;
  +       }
  +       elapsed = jiffies_to_msecs(cjiffies - lockup-last_jiffies);
  +       if (elapsed = 3000) {
  +               /* very likely the improbable case where current
  +                * rptr is equal to last recorded, a while ago, rptr
  +                * this is more likely a false positive update tracking
  +                * information which should force us to be recall at
  +