Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Chris Wilson
On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> +static void
> +run_workload(unsigned int id, struct workload *wrk, unsigned int repeat,
> +  enum intel_engine_id (*balance)(struct workload *wrk,
> +  struct w_step *w),
> +  unsigned int flags)
> +{
> + struct timespec t_start, t_end;
> + struct w_step *w;
> + double t;
> + int i, j;
> +
> + clock_gettime(CLOCK_MONOTONIC, _start);
> +
> + srand(t_start.tv_nsec);

Let's supply the seed with the workload specification. And use a
portable prng so we can be sure we can reproduce results from one system
to the next.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Chris Wilson
On Thu, Apr 20, 2017 at 03:34:56PM +0100, Tvrtko Ursulin wrote:
> 
> On 20/04/2017 15:23, Chris Wilson wrote:
> >On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
> >>+static void
> >>+alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb 
> >>*b,
> >>+enum intel_engine_id engine, unsigned int flags)
> >>+{
> >>+   unsigned int bb_i, j = 0;
> >>+
> >>+   b->obj[j].handle = gem_create(fd, 4096);
> >>+   b->obj[j].flags = EXEC_OBJECT_WRITE;
> >>+   j++;
> >>+
> >>+   if (flags & SEQNO) {
> >>+   b->obj[j].handle = wrk->status_page_handle;
> >>+   j++;
> >>+   }
> >>+
> >>+   bb_i = j++;
> >>+   b->bb_sz = get_bb_sz(w->duration.max);
> >>+   b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz);
> >>+   terminate_bb(w, b, engine, flags);
> >>+
> >>+   igt_assert(w->dependency <= 0);
> >>+   if (w->dependency) {
> >>+   int dep_idx = w->idx + w->dependency;
> >>+
> >>+   igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
> >>+   igt_assert(wrk->steps[dep_idx].type == BATCH);
> >>+
> >>+   b->obj[j].handle = b->obj[bb_i].handle;
> >>+   bb_i = j;
> >>+   b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle;
> >>+   j++;
> >>+
> >>+   if (wrk->steps[dep_idx].b[1].obj[0].handle) {
> >>+   b->obj[j].handle = b->obj[bb_i].handle;
> >>+   bb_i = j;
> >>+   b->obj[j - 1].handle =
> >>+   wrk->steps[dep_idx].b[1].obj[0].handle;
> >>+   j++;
> >>+   }
> >>+   }
> >>+
> >>+   if (flags & SEQNO) {
> >>+   b->reloc.presumed_offset = -1;
> >
> >So as I understand it, you are caching the execbuf/obj/reloc for the
> >workload and then may reissue later with different seqno on different
> >rings? In which case we have a problem as the kernel will write back the >
> >
> >updated offsets to b->reloc.presumed_offset and b->obj[].offset and in
> >future passes they will match and the seqno write will go into the wrong
> >slot (if it swaps rings).
> >
> >You either want to reset presumed_offset=-1 each time, or better for all
> >concerned write the correct address alongside the seqno (which also
> >enables NORELOC).
> >
> >Delta incoming.
> 
> Only the seqno changes, but each engine has its own eb/obj/reloc. So
> I think there is no problem. Or is there still?

bb_handle is per engine as well. Ugh. No, that seems like you are
self-consistent, you just need to remove the -1 and your code is NORELOC
correct.

I wouldn't go so far as having entirely separate batches for each engine
though :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Tvrtko Ursulin


On 20/04/2017 15:52, Chris Wilson wrote:

On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:

+   wrk->nr_bb[engine]++;
+
+   if (engine == VCS && balance) {
+   engine = balance(wrk, w);
+   wrk->nr_bb[engine]++;
+   b = >b[engine - VCS1];
+
+   if (flags & SEQNO)
+   update_bb_seqno(b, engine,
+   ++wrk->seqno[engine]);
+   }
+
+   if (w->duration.min != w->duration.max) {
+   unsigned int d = get_duration(>duration);
+   unsigned long offset;
+
+   offset = ALIGN(b->bb_sz - get_bb_sz(d),
+  2 * sizeof(uint32_t));
+   b->eb.batch_start_offset = offset;
+   }
+
+   gem_execbuf(fd, >eb);


Likely double counting wrk->nr_bb. I suggest placing it next to the
gem_execbuf().


Just convenience in balancing mode so that nr(VCS) = nr(VCS1) + 
nr(VCS2). Also from a different angle, if the sum does not hold, that 
means workload had auto-balancing and explicit VCS1/2 batches. It's only 
used to print out the stats at the end of the run.


Regards,

Tvrtko

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Chris Wilson
On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
> + wrk->nr_bb[engine]++;
> +
> + if (engine == VCS && balance) {
> + engine = balance(wrk, w);
> + wrk->nr_bb[engine]++;
> + b = >b[engine - VCS1];
> +
> + if (flags & SEQNO)
> + update_bb_seqno(b, engine,
> + ++wrk->seqno[engine]);
> + }
> +
> + if (w->duration.min != w->duration.max) {
> + unsigned int d = get_duration(>duration);
> + unsigned long offset;
> +
> + offset = ALIGN(b->bb_sz - get_bb_sz(d),
> +2 * sizeof(uint32_t));
> + b->eb.batch_start_offset = offset;
> + }
> +
> + gem_execbuf(fd, >eb);

Likely double counting wrk->nr_bb. I suggest placing it next to the
gem_execbuf().
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Tvrtko Ursulin


On 20/04/2017 15:33, Chris Wilson wrote:

On Thu, Apr 20, 2017 at 03:23:27PM +0100, Chris Wilson wrote:

You either want to reset presumed_offset=-1 each time, or better for all
concerned write the correct address alongside the seqno (which also
enables NORELOC).

Delta incoming.


See attached.

Next concern is that I have full rings which implies that we are not
waiting on each batch before resubmitting with a new seqno?

If I throw a assert(!busy(batch_bo)) before the *b->mapped_seqno am I
going to be upset?


Yes you would. :) I had a sync (as a move to cpu domain) before seqno 
update in the last version but it disappeared as I was fixing the whole 
area of seqno tracking. So the balancing results in the patch are bogus 
since the seqno can jump to latest ahead of the time...


Regards,

Tvrtko

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Tvrtko Ursulin


On 20/04/2017 15:23, Chris Wilson wrote:

On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:

+static void
+alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
+enum intel_engine_id engine, unsigned int flags)
+{
+   unsigned int bb_i, j = 0;
+
+   b->obj[j].handle = gem_create(fd, 4096);
+   b->obj[j].flags = EXEC_OBJECT_WRITE;
+   j++;
+
+   if (flags & SEQNO) {
+   b->obj[j].handle = wrk->status_page_handle;
+   j++;
+   }
+
+   bb_i = j++;
+   b->bb_sz = get_bb_sz(w->duration.max);
+   b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz);
+   terminate_bb(w, b, engine, flags);
+
+   igt_assert(w->dependency <= 0);
+   if (w->dependency) {
+   int dep_idx = w->idx + w->dependency;
+
+   igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
+   igt_assert(wrk->steps[dep_idx].type == BATCH);
+
+   b->obj[j].handle = b->obj[bb_i].handle;
+   bb_i = j;
+   b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle;
+   j++;
+
+   if (wrk->steps[dep_idx].b[1].obj[0].handle) {
+   b->obj[j].handle = b->obj[bb_i].handle;
+   bb_i = j;
+   b->obj[j - 1].handle =
+   wrk->steps[dep_idx].b[1].obj[0].handle;
+   j++;
+   }
+   }
+
+   if (flags & SEQNO) {
+   b->reloc.presumed_offset = -1;


So as I understand it, you are caching the execbuf/obj/reloc for the
workload and then may reissue later with different seqno on different
rings? In which case we have a problem as the kernel will write back the >

>

updated offsets to b->reloc.presumed_offset and b->obj[].offset and in
future passes they will match and the seqno write will go into the wrong
slot (if it swaps rings).

You either want to reset presumed_offset=-1 each time, or better for all
concerned write the correct address alongside the seqno (which also
enables NORELOC).

Delta incoming.


Only the seqno changes, but each engine has its own eb/obj/reloc. So I 
think there is no problem. Or is there still?


Regards,

Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Chris Wilson
On Thu, Apr 20, 2017 at 03:23:27PM +0100, Chris Wilson wrote:
> You either want to reset presumed_offset=-1 each time, or better for all
> concerned write the correct address alongside the seqno (which also
> enables NORELOC).
> 
> Delta incoming.

See attached.

Next concern is that I have full rings which implies that we are not
waiting on each batch before resubmitting with a new seqno?

If I throw a assert(!busy(batch_bo)) before the *b->mapped_seqno am I
going to be upset?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
>From 5bf424c2719e81f926b74f4136610cbdfd26a4d8 Mon Sep 17 00:00:00 2001
From: Chris Wilson 
Date: Thu, 20 Apr 2017 15:30:07 +0100
Subject: [PATCH] seqno-reloc

---
 benchmarks/gem_wsim.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
index adf2d6de..e616335b 100644
--- a/benchmarks/gem_wsim.c
+++ b/benchmarks/gem_wsim.c
@@ -45,6 +45,8 @@
 #include "drmtest.h"
 #include "intel_io.h"
 
+#define LOCAL_I915_GEM_DOMAIN_WC 0x80
+
 enum intel_engine_id {
 	RCS,
 	BCS,
@@ -86,7 +88,9 @@ struct w_step
 		struct drm_i915_gem_relocation_entry reloc;
 		unsigned long bb_sz;
 		uint32_t bb_handle;
-		uint32_t *mapped_batch, *mapped_seqno;
+		uint32_t *mapped_batch;
+		uint64_t *mapped_address;
+		uint32_t *mapped_seqno;
 		unsigned int mapped_len;
 	} b[2]; /* One for each VCS when load balancing */
 };
@@ -405,7 +409,8 @@ terminate_bb(struct w_step *w, struct w_step_eb *b, enum intel_engine_id engine,
 	mmap_len += cmd_offset - mmap_start;
 
 	gem_set_domain(fd, b->bb_handle,
-		   I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
+		   LOCAL_I915_GEM_DOMAIN_WC,
+		   LOCAL_I915_GEM_DOMAIN_WC);
 
 	ptr = gem_mmap__wc(fd, b->bb_handle, mmap_start, mmap_len, PROT_WRITE);
 	cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
@@ -415,6 +420,7 @@ terminate_bb(struct w_step *w, struct w_step_eb *b, enum intel_engine_id engine,
 		b->reloc.delta = (engine - VCS1) * sizeof(uint32_t);
 
 		*cs++ = MI_STORE_DWORD_IMM;
+		b->mapped_address = (uint64_t *)cs;
 		*cs++ = 0;
 		*cs++ = 0;
 		b->mapped_seqno = cs;
@@ -469,7 +475,6 @@ alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
 	}
 
 	if (flags & SEQNO) {
-		b->reloc.presumed_offset = -1;
 		b->reloc.target_handle = 1;
 		b->obj[bb_i].relocs_ptr = to_user_pointer(>reloc);
 		b->obj[bb_i].relocation_count = 1;
@@ -485,8 +490,7 @@ alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
 		engine = VCS1;
 	b->eb.flags = eb_engine_map[engine];
 	b->eb.flags |= I915_EXEC_HANDLE_LUT;
-	if (!(flags & SEQNO))
-		b->eb.flags |= I915_EXEC_NO_RELOC;
+	b->eb.flags |= I915_EXEC_NO_RELOC;
 #ifdef DEBUG
 	printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n",
 		w->idx, b->eb.buffer_count, b->obj[0].handle,
@@ -628,8 +632,9 @@ static void
 update_bb_seqno(struct w_step_eb *b, enum intel_engine_id engine,
 		uint32_t seqno)
 {
-	*b->mapped_seqno = seqno;
 	b->reloc.delta = (engine - VCS1) * sizeof(uint32_t);
+	*b->mapped_address = b->reloc.presumed_offset + b->reloc.delta;
+	*b->mapped_seqno = seqno;
 }
 
 static void
-- 
2.11.0

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Chris Wilson
On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
> +static void
> +alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
> +  enum intel_engine_id engine, unsigned int flags)
> +{
> + unsigned int bb_i, j = 0;
> +
> + b->obj[j].handle = gem_create(fd, 4096);
> + b->obj[j].flags = EXEC_OBJECT_WRITE;
> + j++;
> +
> + if (flags & SEQNO) {
> + b->obj[j].handle = wrk->status_page_handle;
> + j++;
> + }
> +
> + bb_i = j++;
> + b->bb_sz = get_bb_sz(w->duration.max);
> + b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz);
> + terminate_bb(w, b, engine, flags);
> +
> + igt_assert(w->dependency <= 0);
> + if (w->dependency) {
> + int dep_idx = w->idx + w->dependency;
> +
> + igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
> + igt_assert(wrk->steps[dep_idx].type == BATCH);
> +
> + b->obj[j].handle = b->obj[bb_i].handle;
> + bb_i = j;
> + b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle;
> + j++;
> +
> + if (wrk->steps[dep_idx].b[1].obj[0].handle) {
> + b->obj[j].handle = b->obj[bb_i].handle;
> + bb_i = j;
> + b->obj[j - 1].handle =
> + wrk->steps[dep_idx].b[1].obj[0].handle;
> + j++;
> + }
> + }
> +
> + if (flags & SEQNO) {
> + b->reloc.presumed_offset = -1;

So as I understand it, you are caching the execbuf/obj/reloc for the
workload and then may reissue later with different seqno on different
rings? In which case we have a problem as the kernel will write back the
updated offsets to b->reloc.presumed_offset and b->obj[].offset and in
future passes they will match and the seqno write will go into the wrong
slot (if it swaps rings).

You either want to reset presumed_offset=-1 each time, or better for all
concerned write the correct address alongside the seqno (which also
enables NORELOC).

Delta incoming.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator

2017-04-20 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Tool which emits batch buffers to engines with configurable
sequences, durations, contexts, dependencies and userspace waits.

Unfinished but shows promise so sending out for early feedback.

v2:
 * Load workload descriptors from files. (also -w)
 * Help text.
 * Calibration control if needed. (-t)
 * NORELOC | LUT to eb flags.
 * Added sample workload to wsim/workload1.

v3:
 * Multiple parallel different workloads (-w -w ...).
 * Multi-context workloads.
 * Variable (random) batch length.
 * Load balancing (round robin and queue depth estimation).
 * Workloads delays and explicit sync steps.
 * Workload frequency (period) control.

v4:
 * Fixed queue-depth estimation by creating separate batches
   per engine when qd load balancing is on.
 * Dropped separate -s cmd line option. It can turn itself on
   automatically when needed.
 * Keep a single status page and lie about the write hazard
   as suggested by Chris.
 * Use batch_start_offset for controlling the batch duration.
   (Chris)
 * Set status page object cache level. (Chris)
 * Moved workload description to a README.
 * Tidied example workloads.
 * Some other cleanups and refactorings.

TODO list:

 * Fence support.
 * Better error handling.
 * Less 1980's workload parsing.
 * Proper workloads.
 * Threads?
 * ... ?

Signed-off-by: Tvrtko Ursulin 
Cc: Chris Wilson 
Cc: "Rogozhkin, Dmitry V" 
---

Comparing some test workloads under load balancing it seems that it is starting
to work, but it still needs more thorough verification. For example, round-
robin balancing:

# benchmarks/gem_wsim -n 585341 \
  -w benchmarks/wsim/vcs1.wsim \
  -w benchmarks/wsim/vcs_balanced.wsim \
  -r 100 -b 0
Using 585341 nop calibration for 1000us delay.
2 clients.
1: 3.008s elapsed (33.243 workloads/s). 2500 (1250 + 1250) total VCS batches.
0: 4.455s elapsed (22.449 workloads/s). 0 (2500 + 0) total VCS batches.
4.455s elapsed (44.889 workloads/s)


Versus the queue-depth estimation:

# benchmarks/gem_wsim -n 585341 \
  -w benchmarks/wsim/vcs1.wsim \
  -w benchmarks/wsim/vcs_balanced.wsim \
  -r 100 -b 1
Using 585341 nop calibration for 1000us delay.
2 clients.
1: 2.239s elapsed (44.659 workloads/s). 2500 (837 + 1663) total VCS batches. 
Average queue depths 27.575, 19.285.
0: 4.012s elapsed (24.928 workloads/s). 0 (2500 + 0) total VCS batches. Average 
queue depths -nan, -nan.
4.012s elapsed (49.845 workloads/s)

In both cases we run two workloads, one which only submits to VCS1 and one which
can be load-balanced. The latter gets a ~33% boost with queue-depth estimation,
and the non-balancing workload ~10%.

---
 benchmarks/Makefile.sources  |1 +
 benchmarks/gem_wsim.c| 1014 ++
 benchmarks/wsim/README   |   54 ++
 benchmarks/wsim/media_17i7.wsim  |7 +
 benchmarks/wsim/media_load_balance_17i7.wsim |7 +
 benchmarks/wsim/vcs1.wsim|   25 +
 benchmarks/wsim/vcs_balanced.wsim|   25 +
 7 files changed, 1133 insertions(+)
 create mode 100644 benchmarks/gem_wsim.c
 create mode 100644 benchmarks/wsim/README
 create mode 100644 benchmarks/wsim/media_17i7.wsim
 create mode 100644 benchmarks/wsim/media_load_balance_17i7.wsim
 create mode 100644 benchmarks/wsim/vcs1.wsim
 create mode 100644 benchmarks/wsim/vcs_balanced.wsim

diff --git a/benchmarks/Makefile.sources b/benchmarks/Makefile.sources
index 3af54ebe36f2..3a941150abb3 100644
--- a/benchmarks/Makefile.sources
+++ b/benchmarks/Makefile.sources
@@ -14,6 +14,7 @@ benchmarks_prog_list =\
gem_prw \
gem_set_domain  \
gem_syslatency  \
+   gem_wsim\
kms_vblank  \
prime_lookup\
vgem_mmap   \
diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
new file mode 100644
index ..adf2d6decf12
--- /dev/null
+++ b/benchmarks/gem_wsim.c
@@ -0,0 +1,1014 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ *