Re: [Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2020-03-24 Thread Dixit, Ashutosh
On Tue, 24 Mar 2020 11:54:55 -0700, Umesh Nerlige Ramappa wrote:
>
> From: Lionel Landwerlin 
>
> We're about to introduce an options to open the perf stream, giving
> the user ability to configure how often it wants the kernel to poll
> the OA registers for available data.
>
> Right now the workaround against the OA tail pointer race condition
> requires at least twice the internal kernel polling timer to make any
> data available.
>
> This changes introduce checks on the OA data written into the circular
> buffer to make as much data as possible available on the first
> iteration of the polling timer.

Reviewed-by: Ashutosh Dixit 

>
> v2: Use OA_TAKEN macro without the gtt_offset (Lionel)
> v3: (Umesh)
> - Rebase
> - Change report to report32 from below review
>   https://patchwork.freedesktop.org/patch/330704/?series=66697=1
> v4: (Ashutosh, Lionel)
> - Fix checkpatch errors
> - Fix aging_timestamp initialization
> - Check for only one valid landed report
> - Fix check for unlanded report
> v5: (Ashutosh)
> - Fix bug in accurately determining landed report.
> - Optimize the check for landed reports by going as far as the
>   previously determined aged tail.
>
> Signed-off-by: Lionel Landwerlin 
> Signed-off-by: Umesh Nerlige Ramappa 
> ---
>  drivers/gpu/drm/i915/i915_perf.c   | 204 ++---
>  drivers/gpu/drm/i915/i915_perf_types.h |  28 ++--
>  2 files changed, 97 insertions(+), 135 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
> b/drivers/gpu/drm/i915/i915_perf.c
> index 3222f6cd8255..4583ae9b77be 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -223,26 +223,17 @@
>   *
>   * Although this can be observed explicitly while copying reports to 
> userspace
>   * by checking for a zeroed report-id field in tail reports, we want to 
> account
> - * for this earlier, as part of the oa_buffer_check to avoid lots of 
> redundant
> - * read() attempts.
> - *
> - * In effect we define a tail pointer for reading that lags the real tail
> - * pointer by at least %OA_TAIL_MARGIN_NSEC nanoseconds, which gives enough
> - * time for the corresponding reports to become visible to the CPU.
> - *
> - * To manage this we actually track two tail pointers:
> - *  1) An 'aging' tail with an associated timestamp that is tracked until we
> - * can trust the corresponding data is visible to the CPU; at which point
> - * it is considered 'aged'.
> - *  2) An 'aged' tail that can be used for read()ing.
> - *
> - * The two separate pointers let us decouple read()s from tail pointer aging.
> - *
> - * The tail pointers are checked and updated at a limited rate within a 
> hrtimer
> - * callback (the same callback that is used for delivering EPOLLIN events)
> - *
> - * Initially the tails are marked invalid with %INVALID_TAIL_PTR which
> - * indicates that an updated tail pointer is needed.
> + * for this earlier, as part of the oa_buffer_check_unlocked to avoid lots of
> + * redundant read() attempts.
> + *
> + * We workaround this issue in oa_buffer_check_unlocked() by reading the 
> reports
> + * in the OA buffer, starting from the tail reported by the HW until we find 
> a
> + * report with its first 2 dwords not 0 meaning its previous report is
> + * completely in memory and ready to be read. Those dwords are also set to 0
> + * once read and the whole buffer is cleared upon OA buffer initialization. 
> The
> + * first dword is the reason for this report while the second is the 
> timestamp,
> + * making the chances of having those 2 fields at 0 fairly unlikely. A more
> + * detailed explanation is available in oa_buffer_check_unlocked().
>   *
>   * Most of the implementation details for this workaround are in
>   * oa_buffer_check_unlocked() and _append_oa_reports()
> @@ -454,8 +445,8 @@ static u32 gen7_oa_hw_tail_read(struct i915_perf_stream 
> *stream)
>   * (See description of OA_TAIL_MARGIN_NSEC above for further details.)
>   *
>   * Besides returning true when there is data available to read() this 
> function
> - * also has the side effect of updating the oa_buffer.tails[], 
> .aging_timestamp
> - * and .aged_tail_idx state used for reading.
> + * also updates the tail, aging_tail and aging_timestamp in the oa_buffer
> + * object.
>   *
>   * Note: It's safe to read OA config state here unlocked, assuming that this 
> is
>   * only called while the stream is enabled, while the global OA configuration
> @@ -465,28 +456,18 @@ static u32 gen7_oa_hw_tail_read(struct i915_perf_stream 
> *stream)
>   */
>  static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
>  {
> + u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
>   int report_size = stream->oa_buffer.format_size;
>   unsigned long flags;
> - unsigned int aged_idx;
> - u32 head, hw_tail, aged_tail, aging_tail;
> + u32 hw_tail;
>   u64 now;
>
>   /* We have to consider the (unlikely) possibility that read() errors
> - 

[Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2020-03-24 Thread Umesh Nerlige Ramappa
From: Lionel Landwerlin 

We're about to introduce an options to open the perf stream, giving
the user ability to configure how often it wants the kernel to poll
the OA registers for available data.

Right now the workaround against the OA tail pointer race condition
requires at least twice the internal kernel polling timer to make any
data available.

This changes introduce checks on the OA data written into the circular
buffer to make as much data as possible available on the first
iteration of the polling timer.

v2: Use OA_TAKEN macro without the gtt_offset (Lionel)
v3: (Umesh)
- Rebase
- Change report to report32 from below review
  https://patchwork.freedesktop.org/patch/330704/?series=66697=1
v4: (Ashutosh, Lionel)
- Fix checkpatch errors
- Fix aging_timestamp initialization
- Check for only one valid landed report
- Fix check for unlanded report
v5: (Ashutosh)
- Fix bug in accurately determining landed report.
- Optimize the check for landed reports by going as far as the
  previously determined aged tail.

Signed-off-by: Lionel Landwerlin 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/i915_perf.c   | 204 ++---
 drivers/gpu/drm/i915/i915_perf_types.h |  28 ++--
 2 files changed, 97 insertions(+), 135 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3222f6cd8255..4583ae9b77be 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -223,26 +223,17 @@
  *
  * Although this can be observed explicitly while copying reports to userspace
  * by checking for a zeroed report-id field in tail reports, we want to account
- * for this earlier, as part of the oa_buffer_check to avoid lots of redundant
- * read() attempts.
- *
- * In effect we define a tail pointer for reading that lags the real tail
- * pointer by at least %OA_TAIL_MARGIN_NSEC nanoseconds, which gives enough
- * time for the corresponding reports to become visible to the CPU.
- *
- * To manage this we actually track two tail pointers:
- *  1) An 'aging' tail with an associated timestamp that is tracked until we
- * can trust the corresponding data is visible to the CPU; at which point
- * it is considered 'aged'.
- *  2) An 'aged' tail that can be used for read()ing.
- *
- * The two separate pointers let us decouple read()s from tail pointer aging.
- *
- * The tail pointers are checked and updated at a limited rate within a hrtimer
- * callback (the same callback that is used for delivering EPOLLIN events)
- *
- * Initially the tails are marked invalid with %INVALID_TAIL_PTR which
- * indicates that an updated tail pointer is needed.
+ * for this earlier, as part of the oa_buffer_check_unlocked to avoid lots of
+ * redundant read() attempts.
+ *
+ * We workaround this issue in oa_buffer_check_unlocked() by reading the 
reports
+ * in the OA buffer, starting from the tail reported by the HW until we find a
+ * report with its first 2 dwords not 0 meaning its previous report is
+ * completely in memory and ready to be read. Those dwords are also set to 0
+ * once read and the whole buffer is cleared upon OA buffer initialization. The
+ * first dword is the reason for this report while the second is the timestamp,
+ * making the chances of having those 2 fields at 0 fairly unlikely. A more
+ * detailed explanation is available in oa_buffer_check_unlocked().
  *
  * Most of the implementation details for this workaround are in
  * oa_buffer_check_unlocked() and _append_oa_reports()
@@ -454,8 +445,8 @@ static u32 gen7_oa_hw_tail_read(struct i915_perf_stream 
*stream)
  * (See description of OA_TAIL_MARGIN_NSEC above for further details.)
  *
  * Besides returning true when there is data available to read() this function
- * also has the side effect of updating the oa_buffer.tails[], .aging_timestamp
- * and .aged_tail_idx state used for reading.
+ * also updates the tail, aging_tail and aging_timestamp in the oa_buffer
+ * object.
  *
  * Note: It's safe to read OA config state here unlocked, assuming that this is
  * only called while the stream is enabled, while the global OA configuration
@@ -465,28 +456,18 @@ static u32 gen7_oa_hw_tail_read(struct i915_perf_stream 
*stream)
  */
 static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
 {
+   u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
int report_size = stream->oa_buffer.format_size;
unsigned long flags;
-   unsigned int aged_idx;
-   u32 head, hw_tail, aged_tail, aging_tail;
+   u32 hw_tail;
u64 now;
 
/* We have to consider the (unlikely) possibility that read() errors
-* could result in an OA buffer reset which might reset the head,
-* tails[] and aged_tail state.
+* could result in an OA buffer reset which might reset the head and
+* tail state.
 */
spin_lock_irqsave(>oa_buffer.ptr_lock, flags);
 
-   /* NB: The head we observe here 

Re: [Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2020-03-23 Thread Umesh Nerlige Ramappa

On Sat, Mar 21, 2020 at 09:44:43PM -0700, Dixit, Ashutosh wrote:

On Sat, 21 Mar 2020 16:26:42 -0700, Dixit, Ashutosh wrote:


On Thu, 19 Mar 2020 15:52:01 -0700, Umesh Nerlige Ramappa wrote:
>
> From: Lionel Landwerlin 

> @@ -477,16 +468,6 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
> */
>spin_lock_irqsave(>oa_buffer.ptr_lock, flags);
>
>hw_tail = stream->perf->ops.oa_hw_tail_read(stream);
>
>hw_tail &= ~(report_size - 1);
>
> @@ -496,64 +477,64 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
>
>now = ktime_get_mono_fast_ns();
>
> +  if (hw_tail == stream->oa_buffer.aging_tail &&
> + (now - stream->oa_buffer.aging_timestamp) > OA_TAIL_MARGIN_NSEC) {
> +  /* If the HW tail hasn't move since the last check and the HW
> +   * tail has been aging for long enough, declare it the new
> +   * tail.
> +   */
> +  stream->oa_buffer.tail = stream->oa_buffer.aging_tail;
> +  } else {
> +  u32 head, tail;
>
> +  /* NB: The head we observe here might effectively be a little
> +   * out of date. If a read() is in progress, the head could be
> +   * anywhere between this head and stream->oa_buffer.tail.
> +   */
> +  head = stream->oa_buffer.head - gtt_offset;
>
> +  hw_tail -= gtt_offset;
> +  tail = hw_tail;
>
> +  /* Walk the stream backward until we find a report with dword 0
> +   * & 1 not at 0. Since the circular buffer pointers progress by
> +   * increments of 64 bytes and that reports can be up to 256
> +   * bytes long, we can't tell whether a report has fully landed
> +   * in memory before the first 2 dwords of the following report
> +   * have effectively landed.
> +   *
> +   * This is assuming that the writes of the OA unit land in
> +   * memory in the order they were written to.
> +   * If not : (╯°□°)╯︵ ┻━┻
> */
> +  while (OA_TAKEN(tail, head) >= report_size) {
> +  u32 previous_tail = (tail - report_size) & (OA_BUFFER_SIZE 
- 1);
> +  u32 *report32 = (void *)(stream->oa_buffer.vaddr + 
previous_tail);

Sorry, this is wrong. This should just be:

tail = (tail - report_size) & (OA_BUFFER_SIZE - 1);
report32 = (void *)(stream->oa_buffer.vaddr + tail);

Otherwise when we break out of the loop below tail is still set one
report_size ahead. previous_tail is not needed. (In the previous version of
the patch this used to work out correctly).

> +
> +  /* Head of the report indicated by the HW tail register has
> +   * indeed landed into memory.
> +   */
> +  if (report32[0] != 0 || report32[1] != 0)
> +  break;
> +
> +  tail = previous_tail;
>}


Actually a couple of further improvements to the loop above are
possible. First there is no reason to start at previous_tail, we can just
start at the aligned hw_tail itself. Therefore the loop becomes:

while (OA_TAKEN(tail, head) >= report_size) {
u32 *report32 = (void *)(stream->oa_buffer.vaddr + 
tail);

if (report32[0] != 0 || report32[1] != 0)
break;

tail = (tail - report_size) & (OA_BUFFER_SIZE - 1);
}

Further, there is no reason to go back to the head but only to the old
tail. Therefore:

head = stream->oa_buffer.head - gtt_offset;
old_tail = stream->oa_buffer.tail - gtt_offset;

hw_tail -= gtt_offset;
tail = hw_tail;

while (OA_TAKEN(tail, old_tail) >= report_size) {
u32 *report32 = (void *)(stream->oa_buffer.vaddr + 
tail);

if (report32[0] != 0 || report32[1] != 0)
break;

tail = (tail - report_size) & (OA_BUFFER_SIZE - 1);
}

Please review and see if these two improvements are possible. Thanks!


I think that this is possible. We could check reports between old_tail 
and tail to optimize the number of reports we read. I will give this a 
try.


Thanks,
Umesh
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2020-03-23 Thread Umesh Nerlige Ramappa

On Sat, Mar 21, 2020 at 04:26:42PM -0700, Dixit, Ashutosh wrote:

On Thu, 19 Mar 2020 15:52:01 -0700, Umesh Nerlige Ramappa wrote:


From: Lionel Landwerlin 

We're about to introduce an options to open the perf stream, giving
the user ability to configure how often it wants the kernel to poll
the OA registers for available data.

Right now the workaround against the OA tail pointer race condition
requires at least twice the internal kernel polling timer to make any
data available.

This changes introduce checks on the OA data written into the circular
buffer to make as much data as possible available on the first
iteration of the polling timer.


/snip/


diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3222f6cd8255..c1429d3acaf9 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -223,26 +223,17 @@
  *
  * Although this can be observed explicitly while copying reports to userspace
  * by checking for a zeroed report-id field in tail reports, we want to account
- * for this earlier, as part of the oa_buffer_check to avoid lots of redundant
- * read() attempts.
- *
- * In effect we define a tail pointer for reading that lags the real tail
- * pointer by at least %OA_TAIL_MARGIN_NSEC nanoseconds, which gives enough
- * time for the corresponding reports to become visible to the CPU.
- *
- * To manage this we actually track two tail pointers:
- *  1) An 'aging' tail with an associated timestamp that is tracked until we
- * can trust the corresponding data is visible to the CPU; at which point
- * it is considered 'aged'.
- *  2) An 'aged' tail that can be used for read()ing.
- *
- * The two separate pointers let us decouple read()s from tail pointer aging.
- *
- * The tail pointers are checked and updated at a limited rate within a hrtimer
- * callback (the same callback that is used for delivering EPOLLIN events)
- *
- * Initially the tails are marked invalid with %INVALID_TAIL_PTR which
- * indicates that an updated tail pointer is needed.
+ * for this earlier, as part of the oa_buffer_check_unlocked to avoid lots of
+ * redundant read() attempts.
+ *
+ * We workaround this issue in oa_buffer_check_unlocked() by reading the 
reports
+ * in the OA buffer, starting from the tail reported by the HW until we find 2
+ * consecutive reports with their first 2 dwords of not at 0. Those dwords are


until we find a report with its first 2 dwords not 0 meaning its previous
report is completely in memory and ready to be read.


+ * also set to 0 once read and the whole buffer is cleared upon OA buffer
+ * initialization. The first dword is the reason for this report while the
+ * second is the timestamp, making the chances of having those 2 fields at 0
+ * fairly unlikely. A more detailed explanation is available in
+ * oa_buffer_check_unlocked().



@@ -477,16 +468,6 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
 */
spin_lock_irqsave(>oa_buffer.ptr_lock, flags);

hw_tail = stream->perf->ops.oa_hw_tail_read(stream);

hw_tail &= ~(report_size - 1);

@@ -496,64 +477,64 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)

now = ktime_get_mono_fast_ns();

+   if (hw_tail == stream->oa_buffer.aging_tail &&
+  (now - stream->oa_buffer.aging_timestamp) > OA_TAIL_MARGIN_NSEC) {
+   /* If the HW tail hasn't move since the last check and the HW
+* tail has been aging for long enough, declare it the new
+* tail.
+*/
+   stream->oa_buffer.tail = stream->oa_buffer.aging_tail;
+   } else {
+   u32 head, tail;

+   /* NB: The head we observe here might effectively be a little
+* out of date. If a read() is in progress, the head could be
+* anywhere between this head and stream->oa_buffer.tail.
+*/
+   head = stream->oa_buffer.head - gtt_offset;

+   hw_tail -= gtt_offset;
+   tail = hw_tail;

+   /* Walk the stream backward until we find a report with dword 0
+* & 1 not at 0. Since the circular buffer pointers progress by
+* increments of 64 bytes and that reports can be up to 256
+* bytes long, we can't tell whether a report has fully landed
+* in memory before the first 2 dwords of the following report
+* have effectively landed.
+*
+* This is assuming that the writes of the OA unit land in
+* memory in the order they were written to.
+* If not : (╯°□°)╯︵ ┻━┻
 */
-   if (hw_tail >= gtt_offset &&
-   hw_tail < (gtt_offset + OA_BUFFER_SIZE)) {
-   stream->oa_buffer.tails[!aged_idx].offset =
-   aging_tail = hw_tail;
-

Re: [Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2020-03-22 Thread Dixit, Ashutosh
On Sat, 21 Mar 2020 21:44:43 -0700, Dixit, Ashutosh wrote:
>
> Actually a couple of further improvements to the loop above are
> possible. First there is no reason to start at previous_tail, we can just
> start at the aligned hw_tail itself.

Unless we deliberately want to wait and delay declaring the data as valid,
i.e. we are really not sure what is happening and it seems "safer" to wait
for an extra report.
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2020-03-21 Thread Dixit, Ashutosh
On Sat, 21 Mar 2020 16:26:42 -0700, Dixit, Ashutosh wrote:
>
> On Thu, 19 Mar 2020 15:52:01 -0700, Umesh Nerlige Ramappa wrote:
> >
> > From: Lionel Landwerlin 
>
> > @@ -477,16 +468,6 @@ static bool oa_buffer_check_unlocked(struct 
> > i915_perf_stream *stream)
> >  */
> > spin_lock_irqsave(>oa_buffer.ptr_lock, flags);
> >
> > hw_tail = stream->perf->ops.oa_hw_tail_read(stream);
> >
> > hw_tail &= ~(report_size - 1);
> >
> > @@ -496,64 +477,64 @@ static bool oa_buffer_check_unlocked(struct 
> > i915_perf_stream *stream)
> >
> > now = ktime_get_mono_fast_ns();
> >
> > +   if (hw_tail == stream->oa_buffer.aging_tail &&
> > +  (now - stream->oa_buffer.aging_timestamp) > OA_TAIL_MARGIN_NSEC) {
> > +   /* If the HW tail hasn't move since the last check and the HW
> > +* tail has been aging for long enough, declare it the new
> > +* tail.
> > +*/
> > +   stream->oa_buffer.tail = stream->oa_buffer.aging_tail;
> > +   } else {
> > +   u32 head, tail;
> >
> > +   /* NB: The head we observe here might effectively be a little
> > +* out of date. If a read() is in progress, the head could be
> > +* anywhere between this head and stream->oa_buffer.tail.
> > +*/
> > +   head = stream->oa_buffer.head - gtt_offset;
> >
> > +   hw_tail -= gtt_offset;
> > +   tail = hw_tail;
> >
> > +   /* Walk the stream backward until we find a report with dword 0
> > +* & 1 not at 0. Since the circular buffer pointers progress by
> > +* increments of 64 bytes and that reports can be up to 256
> > +* bytes long, we can't tell whether a report has fully landed
> > +* in memory before the first 2 dwords of the following report
> > +* have effectively landed.
> > +*
> > +* This is assuming that the writes of the OA unit land in
> > +* memory in the order they were written to.
> > +* If not : (╯°□°)╯︵ ┻━┻
> >  */
> > +   while (OA_TAKEN(tail, head) >= report_size) {
> > +   u32 previous_tail = (tail - report_size) & 
> > (OA_BUFFER_SIZE - 1);
> > +   u32 *report32 = (void *)(stream->oa_buffer.vaddr + 
> > previous_tail);
>
> Sorry, this is wrong. This should just be:
>
>   tail = (tail - report_size) & (OA_BUFFER_SIZE - 1);
>   report32 = (void *)(stream->oa_buffer.vaddr + tail);
>
> Otherwise when we break out of the loop below tail is still set one
> report_size ahead. previous_tail is not needed. (In the previous version of
> the patch this used to work out correctly).
>
> > +
> > +   /* Head of the report indicated by the HW tail register 
> > has
> > +* indeed landed into memory.
> > +*/
> > +   if (report32[0] != 0 || report32[1] != 0)
> > +   break;
> > +
> > +   tail = previous_tail;
> > }

Actually a couple of further improvements to the loop above are
possible. First there is no reason to start at previous_tail, we can just
start at the aligned hw_tail itself. Therefore the loop becomes:

while (OA_TAKEN(tail, head) >= report_size) {
u32 *report32 = (void *)(stream->oa_buffer.vaddr + 
tail);

if (report32[0] != 0 || report32[1] != 0)
break;

tail = (tail - report_size) & (OA_BUFFER_SIZE - 1);
}

Further, there is no reason to go back to the head but only to the old
tail. Therefore:

head = stream->oa_buffer.head - gtt_offset;
old_tail = stream->oa_buffer.tail - gtt_offset;

hw_tail -= gtt_offset;
tail = hw_tail;

while (OA_TAKEN(tail, old_tail) >= report_size) {
u32 *report32 = (void *)(stream->oa_buffer.vaddr + 
tail);

if (report32[0] != 0 || report32[1] != 0)
break;

tail = (tail - report_size) & (OA_BUFFER_SIZE - 1);
}

Please review and see if these two improvements are possible. Thanks!
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2020-03-21 Thread Dixit, Ashutosh
On Thu, 19 Mar 2020 15:52:01 -0700, Umesh Nerlige Ramappa wrote:
>
> From: Lionel Landwerlin 
>
> We're about to introduce an options to open the perf stream, giving
> the user ability to configure how often it wants the kernel to poll
> the OA registers for available data.
>
> Right now the workaround against the OA tail pointer race condition
> requires at least twice the internal kernel polling timer to make any
> data available.
>
> This changes introduce checks on the OA data written into the circular
> buffer to make as much data as possible available on the first
> iteration of the polling timer.

/snip/

> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
> b/drivers/gpu/drm/i915/i915_perf.c
> index 3222f6cd8255..c1429d3acaf9 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -223,26 +223,17 @@
>   *
>   * Although this can be observed explicitly while copying reports to 
> userspace
>   * by checking for a zeroed report-id field in tail reports, we want to 
> account
> - * for this earlier, as part of the oa_buffer_check to avoid lots of 
> redundant
> - * read() attempts.
> - *
> - * In effect we define a tail pointer for reading that lags the real tail
> - * pointer by at least %OA_TAIL_MARGIN_NSEC nanoseconds, which gives enough
> - * time for the corresponding reports to become visible to the CPU.
> - *
> - * To manage this we actually track two tail pointers:
> - *  1) An 'aging' tail with an associated timestamp that is tracked until we
> - * can trust the corresponding data is visible to the CPU; at which point
> - * it is considered 'aged'.
> - *  2) An 'aged' tail that can be used for read()ing.
> - *
> - * The two separate pointers let us decouple read()s from tail pointer aging.
> - *
> - * The tail pointers are checked and updated at a limited rate within a 
> hrtimer
> - * callback (the same callback that is used for delivering EPOLLIN events)
> - *
> - * Initially the tails are marked invalid with %INVALID_TAIL_PTR which
> - * indicates that an updated tail pointer is needed.
> + * for this earlier, as part of the oa_buffer_check_unlocked to avoid lots of
> + * redundant read() attempts.
> + *
> + * We workaround this issue in oa_buffer_check_unlocked() by reading the 
> reports
> + * in the OA buffer, starting from the tail reported by the HW until we find 
> 2
> + * consecutive reports with their first 2 dwords of not at 0. Those dwords 
> are

until we find a report with its first 2 dwords not 0 meaning its previous
report is completely in memory and ready to be read.

> + * also set to 0 once read and the whole buffer is cleared upon OA buffer
> + * initialization. The first dword is the reason for this report while the
> + * second is the timestamp, making the chances of having those 2 fields at 0
> + * fairly unlikely. A more detailed explanation is available in
> + * oa_buffer_check_unlocked().

> @@ -477,16 +468,6 @@ static bool oa_buffer_check_unlocked(struct 
> i915_perf_stream *stream)
>*/
>   spin_lock_irqsave(>oa_buffer.ptr_lock, flags);
>
>   hw_tail = stream->perf->ops.oa_hw_tail_read(stream);
>
>   hw_tail &= ~(report_size - 1);
>
> @@ -496,64 +477,64 @@ static bool oa_buffer_check_unlocked(struct 
> i915_perf_stream *stream)
>
>   now = ktime_get_mono_fast_ns();
>
> + if (hw_tail == stream->oa_buffer.aging_tail &&
> +(now - stream->oa_buffer.aging_timestamp) > OA_TAIL_MARGIN_NSEC) {
> + /* If the HW tail hasn't move since the last check and the HW
> +  * tail has been aging for long enough, declare it the new
> +  * tail.
> +  */
> + stream->oa_buffer.tail = stream->oa_buffer.aging_tail;
> + } else {
> + u32 head, tail;
>
> + /* NB: The head we observe here might effectively be a little
> +  * out of date. If a read() is in progress, the head could be
> +  * anywhere between this head and stream->oa_buffer.tail.
> +  */
> + head = stream->oa_buffer.head - gtt_offset;
>
> + hw_tail -= gtt_offset;
> + tail = hw_tail;
>
> + /* Walk the stream backward until we find a report with dword 0
> +  * & 1 not at 0. Since the circular buffer pointers progress by
> +  * increments of 64 bytes and that reports can be up to 256
> +  * bytes long, we can't tell whether a report has fully landed
> +  * in memory before the first 2 dwords of the following report
> +  * have effectively landed.
> +  *
> +  * This is assuming that the writes of the OA unit land in
> +  * memory in the order they were written to.
> +  * If not : (╯°□°)╯︵ ┻━┻
>*/
> - if (hw_tail >= gtt_offset &&
> - hw_tail < (gtt_offset + OA_BUFFER_SIZE)) {
> - 

[Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2020-03-19 Thread Umesh Nerlige Ramappa
From: Lionel Landwerlin 

We're about to introduce an options to open the perf stream, giving
the user ability to configure how often it wants the kernel to poll
the OA registers for available data.

Right now the workaround against the OA tail pointer race condition
requires at least twice the internal kernel polling timer to make any
data available.

This changes introduce checks on the OA data written into the circular
buffer to make as much data as possible available on the first
iteration of the polling timer.

v2: Use OA_TAKEN macro without the gtt_offset (Lionel)
v3: (Umesh)
- Rebase
- Change report to report32 from below review
  https://patchwork.freedesktop.org/patch/330704/?series=66697=1
v4: (Ashutosh, Lionel)
- Fix checkpatch errors
- Fix aging_timestamp initialization
- Check for only one valid landed report
- Fix check for unlanded report

Signed-off-by: Lionel Landwerlin 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/i915_perf.c   | 197 ++---
 drivers/gpu/drm/i915/i915_perf_types.h |  29 ++--
 2 files changed, 96 insertions(+), 130 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3222f6cd8255..c1429d3acaf9 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -223,26 +223,17 @@
  *
  * Although this can be observed explicitly while copying reports to userspace
  * by checking for a zeroed report-id field in tail reports, we want to account
- * for this earlier, as part of the oa_buffer_check to avoid lots of redundant
- * read() attempts.
- *
- * In effect we define a tail pointer for reading that lags the real tail
- * pointer by at least %OA_TAIL_MARGIN_NSEC nanoseconds, which gives enough
- * time for the corresponding reports to become visible to the CPU.
- *
- * To manage this we actually track two tail pointers:
- *  1) An 'aging' tail with an associated timestamp that is tracked until we
- * can trust the corresponding data is visible to the CPU; at which point
- * it is considered 'aged'.
- *  2) An 'aged' tail that can be used for read()ing.
- *
- * The two separate pointers let us decouple read()s from tail pointer aging.
- *
- * The tail pointers are checked and updated at a limited rate within a hrtimer
- * callback (the same callback that is used for delivering EPOLLIN events)
- *
- * Initially the tails are marked invalid with %INVALID_TAIL_PTR which
- * indicates that an updated tail pointer is needed.
+ * for this earlier, as part of the oa_buffer_check_unlocked to avoid lots of
+ * redundant read() attempts.
+ *
+ * We workaround this issue in oa_buffer_check_unlocked() by reading the 
reports
+ * in the OA buffer, starting from the tail reported by the HW until we find 2
+ * consecutive reports with their first 2 dwords of not at 0. Those dwords are
+ * also set to 0 once read and the whole buffer is cleared upon OA buffer
+ * initialization. The first dword is the reason for this report while the
+ * second is the timestamp, making the chances of having those 2 fields at 0
+ * fairly unlikely. A more detailed explanation is available in
+ * oa_buffer_check_unlocked().
  *
  * Most of the implementation details for this workaround are in
  * oa_buffer_check_unlocked() and _append_oa_reports()
@@ -465,10 +456,10 @@ static u32 gen7_oa_hw_tail_read(struct i915_perf_stream 
*stream)
  */
 static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
 {
+   u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
int report_size = stream->oa_buffer.format_size;
unsigned long flags;
-   unsigned int aged_idx;
-   u32 head, hw_tail, aged_tail, aging_tail;
+   u32 hw_tail;
u64 now;
 
/* We have to consider the (unlikely) possibility that read() errors
@@ -477,16 +468,6 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
 */
spin_lock_irqsave(>oa_buffer.ptr_lock, flags);
 
-   /* NB: The head we observe here might effectively be a little out of
-* date (between head and tails[aged_idx].offset if there is currently
-* a read() in progress.
-*/
-   head = stream->oa_buffer.head;
-
-   aged_idx = stream->oa_buffer.aged_tail_idx;
-   aged_tail = stream->oa_buffer.tails[aged_idx].offset;
-   aging_tail = stream->oa_buffer.tails[!aged_idx].offset;
-
hw_tail = stream->perf->ops.oa_hw_tail_read(stream);
 
/* The tail pointer increases in 64 byte increments,
@@ -496,64 +477,64 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
 
now = ktime_get_mono_fast_ns();
 
-   /* Update the aged tail
-*
-* Flip the tail pointer available for read()s once the aging tail is
-* old enough to trust that the corresponding data will be visible to
-* the CPU...
-*
-* Do this before updating the aging pointer in case we may be able to
-  

Re: [Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2019-09-15 Thread Lionel Landwerlin

On 14/09/2019 02:06, Umesh Nerlige Ramappa wrote:

From: Lionel Landwerlin 

Right now the workaround against the OA tail pointer race condition
requires at least twice the internal kernel polling timer to make any
data available.

This changes introduce checks on the OA data written into the circular
buffer to make as much data as possible available on the first
iteration of the polling timer.

v2: Use OA_TAKEN macro without the gtt_offset (Lionel)

Signed-off-by: Lionel Landwerlin 
---
  drivers/gpu/drm/i915/i915_drv.h  |  30 ++---
  drivers/gpu/drm/i915/i915_perf.c | 200 ++-
  2 files changed, 103 insertions(+), 127 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index bf600888b3f1..876aeaf3568e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1180,21 +1180,11 @@ struct i915_perf_stream {
spinlock_t ptr_lock;
  
  		/**

-* One 'aging' tail pointer and one 'aged' tail pointer ready to
-* used for reading.
-*
-* Initial values of 0x are invalid and imply that an
-* update is required (and should be ignored by an attempted
-* read)
-*/
-   struct {
-   u32 offset;
-   } tails[2];
-
-   /**
-* Index for the aged tail ready to read() data up to.
+* The last HW tail reported by HW. The data
+* might not have made it to memory yet
+* though.
 */
-   unsigned int aged_tail_idx;
+   u32 aging_tail;
  
  		/**

 * A monotonic timestamp for when the current aging tail pointer
@@ -1210,6 +1200,12 @@ struct i915_perf_stream {
 * OA buffer data to userspace.
 */
u32 head;
+
+   /**
+* The last verified tail that can be read
+* by user space
+*/
+   u32 tail;
} oa_buffer;
  };
  
@@ -1693,6 +1689,12 @@ struct drm_i915_private {

 */
struct ratelimit_state spurious_report_rs;
  
+		/**

+* For rate limiting any notifications of tail pointer
+* race.
+*/
+   struct ratelimit_state tail_pointer_race;
+
struct i915_oa_config test_config;
  
  		u32 gen7_latched_oastatus1;

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index c1b764233761..50b6d154fd46 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -237,23 +237,14 @@
   * for this earlier, as part of the oa_buffer_check to avoid lots of redundant
   * read() attempts.
   *
- * In effect we define a tail pointer for reading that lags the real tail
- * pointer by at least %OA_TAIL_MARGIN_NSEC nanoseconds, which gives enough
- * time for the corresponding reports to become visible to the CPU.
- *
- * To manage this we actually track two tail pointers:
- *  1) An 'aging' tail with an associated timestamp that is tracked until we
- * can trust the corresponding data is visible to the CPU; at which point
- * it is considered 'aged'.
- *  2) An 'aged' tail that can be used for read()ing.
- *
- * The two separate pointers let us decouple read()s from tail pointer aging.
- *
- * The tail pointers are checked and updated at a limited rate within a hrtimer
- * callback (the same callback that is used for delivering EPOLLIN events)
- *
- * Initially the tails are marked invalid with %INVALID_TAIL_PTR which
- * indicates that an updated tail pointer is needed.
+ * We workaround this issue in oa_buffer_check() by reading the reports in the
+ * OA buffer, starting from the tail reported by the HW until we find 2
+ * consecutive reports with their first 2 dwords of not at 0. Those dwords are
+ * also set to 0 once read and the whole buffer is cleared upon OA buffer
+ * initialization. The first dword is the reason for this report while the
+ * second is the timestamp, making the chances of having those 2 fields at 0
+ * fairly unlikely. A more detailed explanation is available in
+ * oa_buffer_check().
   *
   * Most of the implementation details for this workaround are in
   * oa_buffer_check_unlocked() and _append_oa_reports()
@@ -266,7 +257,6 @@
   * enabled without any periodic sampling.
   */
  #define OA_TAIL_MARGIN_NSEC   10ULL
-#define INVALID_TAIL_PTR   0x
  
  /* frequency for checking whether the OA unit has written new reports to the

   * circular OA buffer...
@@ -457,10 +447,10 @@ static u32 gen7_oa_hw_tail_read(struct i915_perf_stream 
*stream)
  static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
  {
struct drm_i915_private *dev_priv = stream->dev_priv;
+   u32 gtt_offset = 

[Intel-gfx] [PATCH 1/3] drm/i915/perf: rework aging tail workaround

2019-09-13 Thread Umesh Nerlige Ramappa
From: Lionel Landwerlin 

Right now the workaround against the OA tail pointer race condition
requires at least twice the internal kernel polling timer to make any
data available.

This changes introduce checks on the OA data written into the circular
buffer to make as much data as possible available on the first
iteration of the polling timer.

v2: Use OA_TAKEN macro without the gtt_offset (Lionel)

Signed-off-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_drv.h  |  30 ++---
 drivers/gpu/drm/i915/i915_perf.c | 200 ++-
 2 files changed, 103 insertions(+), 127 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index bf600888b3f1..876aeaf3568e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1180,21 +1180,11 @@ struct i915_perf_stream {
spinlock_t ptr_lock;
 
/**
-* One 'aging' tail pointer and one 'aged' tail pointer ready to
-* used for reading.
-*
-* Initial values of 0x are invalid and imply that an
-* update is required (and should be ignored by an attempted
-* read)
-*/
-   struct {
-   u32 offset;
-   } tails[2];
-
-   /**
-* Index for the aged tail ready to read() data up to.
+* The last HW tail reported by HW. The data
+* might not have made it to memory yet
+* though.
 */
-   unsigned int aged_tail_idx;
+   u32 aging_tail;
 
/**
 * A monotonic timestamp for when the current aging tail pointer
@@ -1210,6 +1200,12 @@ struct i915_perf_stream {
 * OA buffer data to userspace.
 */
u32 head;
+
+   /**
+* The last verified tail that can be read
+* by user space
+*/
+   u32 tail;
} oa_buffer;
 };
 
@@ -1693,6 +1689,12 @@ struct drm_i915_private {
 */
struct ratelimit_state spurious_report_rs;
 
+   /**
+* For rate limiting any notifications of tail pointer
+* race.
+*/
+   struct ratelimit_state tail_pointer_race;
+
struct i915_oa_config test_config;
 
u32 gen7_latched_oastatus1;
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index c1b764233761..50b6d154fd46 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -237,23 +237,14 @@
  * for this earlier, as part of the oa_buffer_check to avoid lots of redundant
  * read() attempts.
  *
- * In effect we define a tail pointer for reading that lags the real tail
- * pointer by at least %OA_TAIL_MARGIN_NSEC nanoseconds, which gives enough
- * time for the corresponding reports to become visible to the CPU.
- *
- * To manage this we actually track two tail pointers:
- *  1) An 'aging' tail with an associated timestamp that is tracked until we
- * can trust the corresponding data is visible to the CPU; at which point
- * it is considered 'aged'.
- *  2) An 'aged' tail that can be used for read()ing.
- *
- * The two separate pointers let us decouple read()s from tail pointer aging.
- *
- * The tail pointers are checked and updated at a limited rate within a hrtimer
- * callback (the same callback that is used for delivering EPOLLIN events)
- *
- * Initially the tails are marked invalid with %INVALID_TAIL_PTR which
- * indicates that an updated tail pointer is needed.
+ * We workaround this issue in oa_buffer_check() by reading the reports in the
+ * OA buffer, starting from the tail reported by the HW until we find 2
+ * consecutive reports with their first 2 dwords of not at 0. Those dwords are
+ * also set to 0 once read and the whole buffer is cleared upon OA buffer
+ * initialization. The first dword is the reason for this report while the
+ * second is the timestamp, making the chances of having those 2 fields at 0
+ * fairly unlikely. A more detailed explanation is available in
+ * oa_buffer_check().
  *
  * Most of the implementation details for this workaround are in
  * oa_buffer_check_unlocked() and _append_oa_reports()
@@ -266,7 +257,6 @@
  * enabled without any periodic sampling.
  */
 #define OA_TAIL_MARGIN_NSEC10ULL
-#define INVALID_TAIL_PTR   0x
 
 /* frequency for checking whether the OA unit has written new reports to the
  * circular OA buffer...
@@ -457,10 +447,10 @@ static u32 gen7_oa_hw_tail_read(struct i915_perf_stream 
*stream)
 static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
 {
struct drm_i915_private *dev_priv = stream->dev_priv;
+   u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
int