Re: [Intel-gfx] [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator

2017-04-07 Thread Chris Wilson
On Fri, Apr 07, 2017 at 09:53:05AM +0100, Tvrtko Ursulin wrote:
> 
> On 06/04/2017 09:55, Chris Wilson wrote:
> >On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:
> 
> [snip]
[snip]

> + if (swap_vcs && engine == VCS1)
> + engine = VCS2;
> + else if (swap_vcs && engine == VCS2)
> + engine = VCS1;
> + w->eb.flags = eb_engine_map[engine];
> + w->eb.flags |= I915_EXEC_HANDLE_LUT;
> + if (!seqnos)
> + w->eb.flags |= I915_EXEC_NO_RELOC;
> >>>
> >>>Doesn't look too hard to get the relocation right. Forcing relocations
> >>>between batches is probably a good one to check (just to say don't do
> >>>that)
> >>
> >>I am not following here. You are saying don't do relocations at all?
> >>How do I make sure things stay fixed, and how do I even find out where
> >>they are in the first pass?
> >
> >Depending on the workload, it may be informative to also do comparisons
> >between NORELOC and always RELOC. Personally I would make sure we were
> >using NORELOC as this should be a simulator/example.
> 
How do I use NORELOC? I mean, I have to know where the objects will
be pinned, or be able to pin them first and know they will stay
put. What am I not understanding here?

It will be assigned an address on first execution. Let me quote the spiel
I wrote for i915_gem_execbuffer.c and see if that answers how to use
NORELOC:

 * Reserving resources for the execbuf is the most complicated phase. We
 * neither want to have to migrate the object in the address space, nor do
 * we want to have to update any relocations pointing to this object. Ideally,
 * we want to leave the object where it is and for all the existing relocations
 * to match. If the object is given a new address, or if userspace thinks the
 * object is elsewhere, we have to parse all the relocation entries and update
 * the addresses. Userspace can set the I915_EXEC_NO_RELOC flag to hint that
 * all the target addresses in all of its objects match the value in the
 * relocation entries and that they all match the presumed offsets given by the
 * list of execbuffer objects. Using this knowledge, we know that if we haven't
 * moved any buffers, all the relocation entries are valid and we can skip
 * the update. (If userspace is wrong, the likely outcome is an impromptu GPU
 * hang.) The requirements for using I915_EXEC_NO_RELOC are:
 *
 *  The addresses written in the objects must match the corresponding
 *  reloc.presumed_offset which in turn must match the corresponding
 *  execobject.offset.
 *
 *  Any render targets written to in the batch must be flagged with
 *  EXEC_OBJECT_WRITE.
 *
 *  To avoid stalling, execobject.offset should match the current
 *  address of that object within the active context.
 *

Does that make sense? What questions remain unanswered?

Hmm, I usually sum it up as

batch[reloc.offset] == reloc.presumed_offset + reloc.delta;

and

execobj.offset == reloc.presumed_offset

must be true at the time of execbuf. Note that upon relocation,
batch[reloc.offset], reloc.presumed_offset and execobj.offset are
updated. This is important to remember if you are prerecording the
reloc/execobj arrays, and not feeding back the results of execbuf
between phases.
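
In code form, a minimal sketch of that fixup (the helper and its
arguments are illustrative, not from the patch):

static void fixup_for_no_reloc(uint32_t *batch_cpu,
			       struct drm_i915_gem_exec_object2 *obj,
			       struct drm_i915_gem_relocation_entry *reloc)
{
	/* execobj.offset == reloc.presumed_offset */
	reloc->presumed_offset = obj[reloc->target_handle].offset;

	/* batch[reloc.offset] == reloc.presumed_offset + reloc.delta
	 * (low dword; gen8+ stores the high dword at offset + 4)
	 */
	batch_cpu[reloc->offset / sizeof(uint32_t)] =
		reloc->presumed_offset + reloc->delta;
	batch_cpu[reloc->offset / sizeof(uint32_t) + 1] =
		(reloc->presumed_offset + reloc->delta) >> 32;
}

(With HANDLE_LUT, reloc->target_handle is an index into obj[], which is
what makes the obj[] lookup above valid.)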

> But in general is this correctly implementing your idea for queue
> depth estimation?

From my rough checklist:

* writes engine->next_seqno++ after each op (in this case end of batch)
* qlen[engine] = engine->next_seqno - *engine->current_seqno;
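
Or as a sketch (the engine bookkeeping names are illustrative):

struct engine {
	uint32_t next_seqno;		/* bumped after each batch is submitted */
	const uint32_t *current_seqno;	/* CPU map of this engine's status dword */
};

static uint32_t estimated_qd(const struct engine *e)
{
	/* submitted minus retired == batches still queued on this engine */
	return e->next_seqno - *(volatile const uint32_t *)e->current_seqno;
}

with the balancer picking the engine with the smallest estimate.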

Design looks right. Implementation requires checking... I'll be back.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator

2017-04-07 Thread Tvrtko Ursulin


On 06/04/2017 09:55, Chris Wilson wrote:

On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:


[snip]


+   j++;
+   }
+
+   bb_i = j++;
+   w->duration.cur = w->duration.max;
+   w->bb_sz = get_bb_sz(&w->duration);
+   w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
+   terminate_bb(w, seqnos, 0);
+   if (seqnos) {
+   w->reloc.presumed_offset = -1;
+   w->reloc.target_handle = 1;
+   w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
+   w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;


Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
to set write_domain here anyway.


I think I copy-pasted this from another IGT. So you say cheat here
as well and set zero for both domains?


Technically the MI is outside of all the GPU cache domains we have :)
Which you pick is immaterial, aside from understanding that
(INSTRUCTION, INSTRUCTION) is special ;)

If you were to drop EXEC_OBJECT_WRITE, you would also drop
reloc.write_domain.


Okay, I will try the cheating approach then.
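
Something like this, then (just a sketch of what I understood):

	/* The MI_STORE_DWORD_IMM target sits outside the GPU cache
	 * domains, so leave both blank and avoid the magic
	 * (INSTRUCTION, INSTRUCTION) pipe-control w/a entirely.
	 */
	w->reloc.read_domains = 0;
	w->reloc.write_domain = 0;

And if I drop EXEC_OBJECT_WRITE on the target, write_domain stays zero
anyway.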


+   }
+
+   igt_assert(w->dependency <= 0);
+   if (w->dependency) {
+   int dep_idx = i + w->dependency;
+
+   igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
+   igt_assert(wrk->steps[dep_idx].type == BATCH);
+
+   w->obj[j].handle = w->obj[bb_i].handle;
+   bb_i = j;
+   w->obj[j - 1].handle =
+   wrk->steps[dep_idx].obj[0].handle;
+   j++;
+   }
+
+   if (seqnos) {
+   w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
+   w->obj[bb_i].relocation_count = 1;
+   }
+
+   w->eb.buffers_ptr = to_user_pointer(w->obj);
+   w->eb.buffer_count = j;
+   w->eb.rsvd1 = wrk->ctx_id[w->context];
+
+   if (swap_vcs && engine == VCS1)
+   engine = VCS2;
+   else if (swap_vcs && engine == VCS2)
+   engine = VCS1;
+   w->eb.flags = eb_engine_map[engine];
+   w->eb.flags |= I915_EXEC_HANDLE_LUT;
+   if (!seqnos)
+   w->eb.flags |= I915_EXEC_NO_RELOC;


Doesn't look too hard to get the relocation right. Forcing relocations
between batches is probably a good one to check (just to say don't do
that)


I am not following here. You are saying don't do relocations at all?
How do I make sure things stay fixed, and how do I even find out where
they are in the first pass?


Depending on the workload, it may be informative to also do comparisons
between NORELOC and always RELOC. Personally I would make sure we were
using NORELOC as this should be a simulator/example.


How do I use NORELOC? I mean, I have to know where the objects will be
pinned, or be able to pin them first and know they will stay put. What
am I not understanding here?



+static void update_bb_seqno(struct w_step *w, uint32_t seqno)
+{
+   unsigned long mmap_start, mmap_offset, mmap_len;
+   void *ptr;
+
+   mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
+   mmap_offset = w->seqno_offset - mmap_start;
+   mmap_len = sizeof(uint32_t) + mmap_offset;
+
+   gem_set_domain(fd, w->bb_handle,
+  I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
+
+   ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
+
+   *(uint32_t *)((char *)ptr + mmap_offset) = seqno;


Uh oh. I hope this isn't called inside any loop. Note this is
unsynchronized to the gpu so I wonder what this is for.


To update the seqno inside the store_dword_imm. It is called every
time before a batch is executed, so I was wondering whether a gem_sync
should precede it. But then it is problematic in general if we queue
up multiple instances of the same batch before they get executed. :(
Sounds like I would need a separate batch for every iteration for this
to work correctly, but that sounds too costly. So I don't know at the
moment.


mmap/munmap, especially munmap, is not free. The munmap will do a
tlb_flush across all cores -- though maybe that's batched and the
munmaps I do all tend to be large enough to trigger every time.

Since you are using a CPU write, on !llc this will be clflushing
every time. I would suggest stashing the gem_mmap__wc mapping for updating the
seqno between repeats.


Ok, I can try that approach.
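
Along these lines, I suppose (a sketch; w->seqno_map would be a new
field caching the mapping):

static void update_bb_seqno_cached(struct w_step *w, uint32_t seqno)
{
	/* Map the batch once with WC and keep the pointer, so the
	 * per-iteration seqno update is a single uncached store instead
	 * of mmap/clflush/munmap every time.
	 */
	if (!w->seqno_map)
		w->seqno_map = gem_mmap__wc(fd, w->bb_handle, 0, w->bb_sz,
					    PROT_READ | PROT_WRITE);

	*(uint32_t *)((char *)w->seqno_map + w->seqno_offset) = seqno;
}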


[snip]


I need to study this a bit more...


Yes please, especially the bit about how to get accurate seqnos
written out in each step without needing separate execbuf batches.

I've heard recursive batches mentioned in the past so maybe each
iteration could have its own small 

Re: [Intel-gfx] [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator

2017-04-06 Thread Chris Wilson
On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:
> 
> On 05/04/2017 17:48, Chris Wilson wrote:
> >On Wed, Apr 05, 2017 at 05:14:01PM +0100, Tvrtko Ursulin wrote:
> >>+static void
> >>+__emit_bb_end(struct w_step *w, bool terminate, bool seqnos, uint32_t seqno)
> >>+{
> >>+   const uint32_t bbe = 0xa << 23;
> >>+   unsigned long bb_sz = get_bb_sz(&w->duration);
> >>+   unsigned long mmap_start, cmd_offset, mmap_len;
> >>+   uint32_t *ptr, *cs;
> >>+
> >>+   mmap_len = (seqnos ? 5 : 1) * sizeof(uint32_t);
> >>+   cmd_offset = bb_sz - mmap_len;
> >>+   mmap_start = rounddown(cmd_offset, PAGE_SIZE);
> >>+   mmap_len += cmd_offset - mmap_start;
> >>+
> >>+   gem_set_domain(fd, w->bb_handle,
> >>+  I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
> >>+
> >>+   ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
> >>+   cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
> >>+
> >>+   if (seqnos) {
> >>+   const int gen = intel_gen(intel_get_drm_devid(fd));
> >>+
> >>+   igt_assert(gen >= 8);
> >>+
> >>+   w->reloc.offset = bb_sz - 4 * sizeof(uint32_t);
> >>+   w->seqno_offset = bb_sz - 2 * sizeof(uint32_t);
> >>+
> >>+   *cs++ = terminate ? MI_STORE_DWORD_IMM : 0;
> >>+   *cs++ = 0;
> >>+   *cs++ = 0;
> >>+   *cs++ = seqno;
> >>+   }
> >>+
> >>+   *cs = terminate ? bbe : 0;
> >>+
> >>+   munmap(ptr, mmap_len);
> >>+}
> >>+
> >>+static void terminate_bb(struct w_step *w, bool seqnos, uint32_t seqno)
> >>+{
> >>+   __emit_bb_end(w, true, seqnos, seqno);
> >>+}
> >>+
> >>+static void unterminate_bb(struct w_step *w, bool seqnos)
> >>+{
> >>+   __emit_bb_end(w, false, seqnos, 0);
> >>+}
> >>+
> >>+static void
> >>+prepare_workload(struct workload *wrk, bool swap_vcs, bool seqnos)
> >>+{
> >>+   int max_ctx = -1;
> >>+   struct w_step *w;
> >>+   int i;
> >>+
> >>+   if (seqnos) {
> >>+   const unsigned int status_sz = sizeof(uint32_t);
> >>+
> >>+   for (i = 0; i < NUM_ENGINES; i++) {
> >>+   wrk->status_page_handle[i] = gem_create(fd, status_sz);
> >
> >Need to set_cache_level(CACHED) for llc.
> >
> >You can use one page for all engines. Just use a different cacheline
> >for each, for safety.
> >
> >>+   wrk->status_page[i] =
> >>+   gem_mmap__cpu(fd, wrk->status_page_handle[i],
> >>+ 0, status_sz, PROT_READ);
> >>+   }
> >>+   }
> >>+
> >>+   for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> >>+   if ((int)w->context > max_ctx) {
> >>+   int delta = w->context + 1 - wrk->nr_ctxs;
> >>+
> >>+   wrk->nr_ctxs += delta;
> >>+   wrk->ctx_id = realloc(wrk->ctx_id,
> >>+ wrk->nr_ctxs * sizeof(uint32_t));
> >>+   memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
> >>+  delta * sizeof(uint32_t));
> >>+
> >>+   max_ctx = w->context;
> >>+   }
> >>+
> >>+   if (!wrk->ctx_id[w->context]) {
> >>+   struct drm_i915_gem_context_create arg = {};
> >>+
> >>+   drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
> >>+   igt_assert(arg.ctx_id);
> >>+
> >>+   wrk->ctx_id[w->context] = arg.ctx_id;
> >>+   }
> >>+   }
> >>+
> >>+   for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> >>+   enum intel_engine_id engine = w->engine;
> >>+   unsigned int bb_i, j = 0;
> >>+
> >>+   if (w->type != BATCH)
> >>+   continue;
> >>+
> >>+   w->obj[j].handle = gem_create(fd, 4096);
> >>+   w->obj[j].flags = EXEC_OBJECT_WRITE;
> >>+   j++;
> >>+
> >>+   if (seqnos) {
> >>+   w->obj[j].handle = wrk->status_page_handle[engine];
> >>+   w->obj[j].flags = EXEC_OBJECT_WRITE;
> >
> >The trick for sharing between engines is to not mark this as a WRITE.
> >Fun little lies.
> 
> Yeah, that's why I have per-engine objects. Which I don't mind, since
> it is not like they are wasting any resources compared to everything
> else. But not admitting the write still sounds interesting. What
> would the repercussions of that be - limit us to llc platforms or
> something?

It used to be that if we evicted the object (e.g. mempressure/suspend),
then we would not save the contents since it was never marked as dirty.
However, between libva being buggy and Vk deliberately eschewing write
hazards, we had to always mark GPU usage as dirtying the buffers. So
nowadays, EXEC_OBJECT_WRITE only means to track the implicit write
hazard.
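
So sharing a single status object between engines could look like this
(sketch; collapses status_page_handle down to one handle):

	w->obj[j].handle = wrk->status_page_handle; /* same object on every engine */
	w->obj[j].flags = 0; /* no EXEC_OBJECT_WRITE: no implicit
			      * cross-engine serialisation on it */
	j++;

at the cost of the kernel not tracking our writes to it.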

> 
> >>+   j++;
> >>+   }
> >>+
> >>+   bb_i = j++;
> >>+   w->duration.cur = w->duration.max;
> >>+   w->bb_sz = get_bb_sz(&w->duration);
> >>+   w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
> >>+  

Re: [Intel-gfx] [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator

2017-04-06 Thread Tvrtko Ursulin


On 05/04/2017 17:48, Chris Wilson wrote:

On Wed, Apr 05, 2017 at 05:14:01PM +0100, Tvrtko Ursulin wrote:

+static void
+__emit_bb_end(struct w_step *w, bool terminate, bool seqnos, uint32_t seqno)
+{
+   const uint32_t bbe = 0xa << 23;
+   unsigned long bb_sz = get_bb_sz(&w->duration);
+   unsigned long mmap_start, cmd_offset, mmap_len;
+   uint32_t *ptr, *cs;
+
+   mmap_len = (seqnos ? 5 : 1) * sizeof(uint32_t);
+   cmd_offset = bb_sz - mmap_len;
+   mmap_start = rounddown(cmd_offset, PAGE_SIZE);
+   mmap_len += cmd_offset - mmap_start;
+
+   gem_set_domain(fd, w->bb_handle,
+  I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
+
+   ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
+   cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
+
+   if (seqnos) {
+   const int gen = intel_gen(intel_get_drm_devid(fd));
+
+   igt_assert(gen >= 8);
+
+   w->reloc.offset = bb_sz - 4 * sizeof(uint32_t);
+   w->seqno_offset = bb_sz - 2 * sizeof(uint32_t);
+
+   *cs++ = terminate ? MI_STORE_DWORD_IMM : 0;
+   *cs++ = 0;
+   *cs++ = 0;
+   *cs++ = seqno;
+   }
+
+   *cs = terminate ? bbe : 0;
+
+   munmap(ptr, mmap_len);
+}
+
+static void terminate_bb(struct w_step *w, bool seqnos, uint32_t seqno)
+{
+   __emit_bb_end(w, true, seqnos, seqno);
+}
+
+static void unterminate_bb(struct w_step *w, bool seqnos)
+{
+   __emit_bb_end(w, false, seqnos, 0);
+}
+
+static void
+prepare_workload(struct workload *wrk, bool swap_vcs, bool seqnos)
+{
+   int max_ctx = -1;
+   struct w_step *w;
+   int i;
+
+   if (seqnos) {
+   const unsigned int status_sz = sizeof(uint32_t);
+
+   for (i = 0; i < NUM_ENGINES; i++) {
+   wrk->status_page_handle[i] = gem_create(fd, status_sz);


Need to set_cache_level(CACHED) for llc.

You can use one page for all engines. Just use a different cacheline
for each, for safety.


+   wrk->status_page[i] =
+   gem_mmap__cpu(fd, wrk->status_page_handle[i],
+ 0, status_sz, PROT_READ);
+   }
+   }
+
+   for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+   if ((int)w->context > max_ctx) {
+   int delta = w->context + 1 - wrk->nr_ctxs;
+
+   wrk->nr_ctxs += delta;
+   wrk->ctx_id = realloc(wrk->ctx_id,
+ wrk->nr_ctxs * sizeof(uint32_t));
+   memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
+  delta * sizeof(uint32_t));
+
+   max_ctx = w->context;
+   }
+
+   if (!wrk->ctx_id[w->context]) {
+   struct drm_i915_gem_context_create arg = {};
+
+   drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
+   igt_assert(arg.ctx_id);
+
+   wrk->ctx_id[w->context] = arg.ctx_id;
+   }
+   }
+
+   for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+   enum intel_engine_id engine = w->engine;
+   unsigned int bb_i, j = 0;
+
+   if (w->type != BATCH)
+   continue;
+
+   w->obj[j].handle = gem_create(fd, 4096);
+   w->obj[j].flags = EXEC_OBJECT_WRITE;
+   j++;
+
+   if (seqnos) {
+   w->obj[j].handle = wrk->status_page_handle[engine];
+   w->obj[j].flags = EXEC_OBJECT_WRITE;


The trick for sharing between engines is to not mark this as a WRITE.
Fun little lies.


Yeah, that's why I have per-engine objects. Which I don't mind, since it
is not like they are wasting any resources compared to everything else.
But not admitting the write still sounds interesting. What would the
repercussions of that be - limit us to llc platforms or something?



+   j++;
+   }
+
+   bb_i = j++;
+   w->duration.cur = w->duration.max;
+   w->bb_sz = get_bb_sz(&w->duration);
+   w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
+   terminate_bb(w, seqnos, 0);
+   if (seqnos) {
+   w->reloc.presumed_offset = -1;
+   w->reloc.target_handle = 1;
+   w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
+   w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;


Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
to set write_domain here anyway.


I think I copy-pasted this from another IGT. So you say cheat here as 
well and set zero for both domains?





+   }
+
+   igt_assert(w->dependency <= 

Re: [Intel-gfx] [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator

2017-04-05 Thread Chris Wilson
On Wed, Apr 05, 2017 at 05:14:01PM +0100, Tvrtko Ursulin wrote:
> +static void
> +__emit_bb_end(struct w_step *w, bool terminate, bool seqnos, uint32_t seqno)
> +{
> + const uint32_t bbe = 0xa << 23;
> + unsigned long bb_sz = get_bb_sz(&w->duration);
> + unsigned long mmap_start, cmd_offset, mmap_len;
> + uint32_t *ptr, *cs;
> +
> + mmap_len = (seqnos ? 5 : 1) * sizeof(uint32_t);
> + cmd_offset = bb_sz - mmap_len;
> + mmap_start = rounddown(cmd_offset, PAGE_SIZE);
> + mmap_len += cmd_offset - mmap_start;
> +
> + gem_set_domain(fd, w->bb_handle,
> +I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
> +
> + ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
> + cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
> +
> + if (seqnos) {
> + const int gen = intel_gen(intel_get_drm_devid(fd));
> +
> + igt_assert(gen >= 8);
> +
> + w->reloc.offset = bb_sz - 4 * sizeof(uint32_t);
> + w->seqno_offset = bb_sz - 2 * sizeof(uint32_t);
> +
> + *cs++ = terminate ? MI_STORE_DWORD_IMM : 0;
> + *cs++ = 0;
> + *cs++ = 0;
> + *cs++ = seqno;
> + }
> +
> + *cs = terminate ? bbe : 0;
> +
> + munmap(ptr, mmap_len);
> +}
> +
> +static void terminate_bb(struct w_step *w, bool seqnos, uint32_t seqno)
> +{
> + __emit_bb_end(w, true, seqnos, seqno);
> +}
> +
> +static void unterminate_bb(struct w_step *w, bool seqnos)
> +{
> + __emit_bb_end(w, false, seqnos, 0);
> +}
> +
> +static void
> +prepare_workload(struct workload *wrk, bool swap_vcs, bool seqnos)
> +{
> + int max_ctx = -1;
> + struct w_step *w;
> + int i;
> +
> + if (seqnos) {
> + const unsigned int status_sz = sizeof(uint32_t);
> +
> + for (i = 0; i < NUM_ENGINES; i++) {
> + wrk->status_page_handle[i] = gem_create(fd, status_sz);

Need to set_cache_level(CACHED) for llc.

You can use one page for all engines. Just use a different cacheline
for each, for safety.
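
E.g. (a sketch; gem_set_caching() is the IGT wrapper for
set_cache_level(), and this collapses the per-engine handles into one):

#define CACHELINE 64

static void setup_status_page(struct workload *wrk)
{
	/* One cached page shared by all engines, so CPU reads of the
	 * seqnos are coherent on llc; one cacheline per engine.
	 */
	wrk->status_page_handle = gem_create(fd, 4096);
	gem_set_caching(fd, wrk->status_page_handle, I915_CACHING_CACHED);
	wrk->status_page = gem_mmap__cpu(fd, wrk->status_page_handle,
					 0, 4096, PROT_READ);
}

static uint32_t *engine_seqno(struct workload *wrk,
			      enum intel_engine_id engine)
{
	/* one cacheline apart for safety */
	return (uint32_t *)((char *)wrk->status_page + engine * CACHELINE);
}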

> + wrk->status_page[i] =
> + gem_mmap__cpu(fd, wrk->status_page_handle[i],
> +   0, status_sz, PROT_READ);
> + }
> + }
> +
> + for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> + if ((int)w->context > max_ctx) {
> + int delta = w->context + 1 - wrk->nr_ctxs;
> +
> + wrk->nr_ctxs += delta;
> + wrk->ctx_id = realloc(wrk->ctx_id,
> +   wrk->nr_ctxs * sizeof(uint32_t));
> + memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
> +delta * sizeof(uint32_t));
> +
> + max_ctx = w->context;
> + }
> +
> + if (!wrk->ctx_id[w->context]) {
> + struct drm_i915_gem_context_create arg = {};
> +
> + drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
> + igt_assert(arg.ctx_id);
> +
> + wrk->ctx_id[w->context] = arg.ctx_id;
> + }
> + }
> +
> + for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> + enum intel_engine_id engine = w->engine;
> + unsigned int bb_i, j = 0;
> +
> + if (w->type != BATCH)
> + continue;
> +
> + w->obj[j].handle = gem_create(fd, 4096);
> + w->obj[j].flags = EXEC_OBJECT_WRITE;
> + j++;
> +
> + if (seqnos) {
> + w->obj[j].handle = wrk->status_page_handle[engine];
> + w->obj[j].flags = EXEC_OBJECT_WRITE;

The trick for sharing between engines is to not mark this as a WRITE.
Fun little lies.

> + j++;
> + }
> +
> + bb_i = j++;
> + w->duration.cur = w->duration.max;
> + w->bb_sz = get_bb_sz(&w->duration);
> + w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
> + terminate_bb(w, seqnos, 0);
> + if (seqnos) {
> + w->reloc.presumed_offset = -1;
> + w->reloc.target_handle = 1;
> + w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
> + w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;

Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
to set write_domain here anyway.

> + }
> +
> + igt_assert(w->dependency <= 0);
> + if (w->dependency) {
> + int dep_idx = i + w->dependency;
> +
> + igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
> + igt_assert(wrk->steps[dep_idx].type == BATCH);
> +
> + w->obj[j].handle = w->obj[bb_i].handle;
> + bb_i = j;
> +

[Intel-gfx] [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator

2017-04-05 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Tool which emits batch buffers to engines with configurable
sequences, durations, contexts, dependencies and userspace waits.

Unfinished but shows promise so sending out for early feedback.

v2:
 * Load workload descriptors from files. (also -w)
 * Help text.
 * Calibration control if needed. (-t)
 * NORELOC | LUT to eb flags.
 * Added sample workload to wsim/workload1.

v3:
 * Multiple parallel different workloads (-w -w ...).
 * Multi-context workloads.
 * Variable (random) batch length.
 * Load balancing (round robin and queue depth estimation).
 * Workloads delays and explicit sync steps.
 * Workload frequency (period) control.

TODO list:

 * Fence support.
 * Move majority of help text to README.
 * Better error handling.
 * Less 1980's workload parsing.
 * Proper workloads.
 * Explicit waits?
 * Threads?
 * ... ?

Signed-off-by: Tvrtko Ursulin 
Cc: Chris Wilson 
Cc: "Rogozhkin, Dmitry V" 
---
 benchmarks/Makefile.sources |1 +
 benchmarks/gem_wsim.c   | 1053 +++
 benchmarks/wsim/workload1   |7 +
 benchmarks/wsim/workload2   |7 +
 benchmarks/wsim/workload3   |7 +
 benchmarks/wsim/workload4   |8 +
 benchmarks/wsim/workload5   |8 +
 benchmarks/wsim/workload6   |8 +
 8 files changed, 1099 insertions(+)
 create mode 100644 benchmarks/gem_wsim.c
 create mode 100644 benchmarks/wsim/workload1
 create mode 100644 benchmarks/wsim/workload2
 create mode 100644 benchmarks/wsim/workload3
 create mode 100644 benchmarks/wsim/workload4
 create mode 100644 benchmarks/wsim/workload5
 create mode 100644 benchmarks/wsim/workload6

diff --git a/benchmarks/Makefile.sources b/benchmarks/Makefile.sources
index 3af54ebe36f2..3a941150abb3 100644
--- a/benchmarks/Makefile.sources
+++ b/benchmarks/Makefile.sources
@@ -14,6 +14,7 @@ benchmarks_prog_list =\
gem_prw \
gem_set_domain  \
gem_syslatency  \
+   gem_wsim\
kms_vblank  \
prime_lookup\
vgem_mmap   \
diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
new file mode 100644
index ..38041da1f6e3
--- /dev/null
+++ b/benchmarks/gem_wsim.c
@@ -0,0 +1,1053 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <assert.h>
+#include <limits.h>
+
+
+#include "intel_chipset.h"
+#include "drm.h"
+#include "ioctl_wrappers.h"
+#include "drmtest.h"
+#include "intel_io.h"
+
+enum intel_engine_id {
+   RCS,
+   BCS,
+   VCS,
+   VCS1,
+   VCS2,
+   VECS,
+   NUM_ENGINES
+};
+
+struct duration {
+   unsigned int min, max, cur;
+};
+
+enum w_type
+{
+   BATCH,
+   SYNC,
+   DELAY,
+   PERIOD
+};
+
+struct w_step
+{
+   /* Workload step metadata */
+   enum w_type type;
+   unsigned int context;
+   unsigned int engine;
+   struct duration duration;
+   int dependency;
+   int wait;
+
+   /* Implementation details */
+   struct drm_i915_gem_execbuffer2 eb;
+   struct drm_i915_gem_exec_object2 obj[4];
+   struct drm_i915_gem_relocation_entry reloc;
+   unsigned long bb_sz;
+   uint32_t bb_handle;
+   uint64_t seqno_offset;
+};
+
+struct workload
+{
+   unsigned int nr_steps;
+   struct w_step *steps;
+
+   struct timespec repeat_start;
+
+   unsigned int nr_ctxs;
+   uint32_t *ctx_id;
+
+   unsigned long seqno[NUM_ENGINES];
+   uint32_t status_page_handle[NUM_ENGINES];
+   uint32_t