On 06/04/2017 09:55, Chris Wilson wrote:
On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:

[snip]

+                       j++;
+               }
+
+               bb_i = j++;
+               w->duration.cur = w->duration.max;
+               w->bb_sz = get_bb_sz(&w->duration);
+               w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
+               terminate_bb(w, seqnos, 0);
+               if (seqnos) {
+                       w->reloc.presumed_offset = -1;
+                       w->reloc.target_handle = 1;
+                       w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
+                       w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;

Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
to set write_domain here anyway.

I think I copy-pasted this from another IGT. So you are saying to cheat
here as well and set zero for both domains?

Technically the MI is outside of all the GPU cache domains we have :)
Which you pick is immaterial, aside from understanding that
(INSTRUCTION, INSTRUCTION) is special ;)

If you were to drop EXEC_OBJECT_WRITE, you would also drop
reloc.write_domain.

Okay, I will try the cheating approach then.

+               }
+
+               igt_assert(w->dependency <= 0);
+               if (w->dependency) {
+                       int dep_idx = i + w->dependency;
+
+                       igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
+                       igt_assert(wrk->steps[dep_idx].type == BATCH);
+
+                       w->obj[j].handle = w->obj[bb_i].handle;
+                       bb_i = j;
+                       w->obj[j - 1].handle =
+                                       wrk->steps[dep_idx].obj[0].handle;
+                       j++;
+               }
+
+               if (seqnos) {
+                       w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
+                       w->obj[bb_i].relocation_count = 1;
+               }
+
+               w->eb.buffers_ptr = to_user_pointer(w->obj);
+               w->eb.buffer_count = j;
+               w->eb.rsvd1 = wrk->ctx_id[w->context];
+
+               if (swap_vcs && engine == VCS1)
+                       engine = VCS2;
+               else if (swap_vcs && engine == VCS2)
+                       engine = VCS1;
+               w->eb.flags = eb_engine_map[engine];
+               w->eb.flags |= I915_EXEC_HANDLE_LUT;
+               if (!seqnos)
+                       w->eb.flags |= I915_EXEC_NO_RELOC;

Doesn't look too hard to get the relocation right. Forcing relocations
between batches is probably a good one to check (just to say don't do
that)

I am not following here. Are you saying not to do relocations at all?
How do I make sure things stay fixed, and how do I find out where
they are in the first pass?

Depending on the workload, it may be informative to also do comparisons
between NORELOC and always RELOC. Personally I would make sure we were
using NORELOC as this should be a simulator/example.

How do I use NORELOC? I mean, I have to know where the objects will be
pinned, or be able to pin them first and know they will stay put. What
am I not understanding here?

+static void update_bb_seqno(struct w_step *w, uint32_t seqno)
+{
+       unsigned long mmap_start, mmap_offset, mmap_len;
+       void *ptr;
+
+       mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
+       mmap_offset = w->seqno_offset - mmap_start;
+       mmap_len = sizeof(uint32_t) + mmap_offset;
+
+       gem_set_domain(fd, w->bb_handle,
+                      I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
+
+       ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
+
+       *(uint32_t *)((char *)ptr + mmap_offset) = seqno;

Uh oh. I hope this isn't called inside any loop. Note this is
unsynchronized to the gpu so I wonder what this is for.

To update the seqno inside the store_dword_imm. It is called every
time before a batch is executed, so I was wondering whether a gem_sync
should precede it. But then I was thinking it is problematic in
general if we queue up multiple identical batches before they get
executed. :( Sounds like I would need a separate batch for every
iteration for this to work correctly. But that sounds too costly. So
I don't know at the moment.

mmap/munmap, especially munmap, is not free. The munmap will do a
tlb_flush across all cores -- though maybe that's batched and the
munmaps I do all tend to be large enough to trigger every time.

Since you are using a CPU write, on !llc this will be clflushing
every time. I would suggest stashing the gem_mmap__wc for updating the
seqno between repeats.

Ok, I can try that approach.

[snip]

I need to study this a bit more...

Yes please, especially the bit about how to get accurate seqnos
written out in each step without needing separate execbuf batches.

I've heard recursive batches mentioned in the past, so maybe each
iteration could have its own small batch which would jump to the
nop/delay one (shared between all iterations) and write the unique
seqno. No idea if that is possible/supported at the moment - I'll go
and dig a bit.

You end up with the same problem of having the reloc change and need to
update every cycle. You could use a fresh batch to rewrite the seqno
values... However, now that you explained what you want, just keep the
WC mmap.

Hm not sure without researching that approach first.

But in general, is this correctly implementing your idea for queue-depth
estimation?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
