Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote: > From: Tvrtko Ursulin> +static void > +run_workload(unsigned int id, struct workload *wrk, unsigned int repeat, > + enum intel_engine_id (*balance)(struct workload *wrk, > + struct w_step *w), > + unsigned int flags) > +{ > + struct timespec t_start, t_end; > + struct w_step *w; > + double t; > + int i, j; > + > + clock_gettime(CLOCK_MONOTONIC, _start); > + > + srand(t_start.tv_nsec); Let's supply the seed with the workload specification. And use a portable prng so we can be sure we can reproduce results from one system to the next. -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
On Thu, Apr 20, 2017 at 03:34:56PM +0100, Tvrtko Ursulin wrote: > > On 20/04/2017 15:23, Chris Wilson wrote: > >On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote: > >>+static void > >>+alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb > >>*b, > >>+enum intel_engine_id engine, unsigned int flags) > >>+{ > >>+ unsigned int bb_i, j = 0; > >>+ > >>+ b->obj[j].handle = gem_create(fd, 4096); > >>+ b->obj[j].flags = EXEC_OBJECT_WRITE; > >>+ j++; > >>+ > >>+ if (flags & SEQNO) { > >>+ b->obj[j].handle = wrk->status_page_handle; > >>+ j++; > >>+ } > >>+ > >>+ bb_i = j++; > >>+ b->bb_sz = get_bb_sz(w->duration.max); > >>+ b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz); > >>+ terminate_bb(w, b, engine, flags); > >>+ > >>+ igt_assert(w->dependency <= 0); > >>+ if (w->dependency) { > >>+ int dep_idx = w->idx + w->dependency; > >>+ > >>+ igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps); > >>+ igt_assert(wrk->steps[dep_idx].type == BATCH); > >>+ > >>+ b->obj[j].handle = b->obj[bb_i].handle; > >>+ bb_i = j; > >>+ b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle; > >>+ j++; > >>+ > >>+ if (wrk->steps[dep_idx].b[1].obj[0].handle) { > >>+ b->obj[j].handle = b->obj[bb_i].handle; > >>+ bb_i = j; > >>+ b->obj[j - 1].handle = > >>+ wrk->steps[dep_idx].b[1].obj[0].handle; > >>+ j++; > >>+ } > >>+ } > >>+ > >>+ if (flags & SEQNO) { > >>+ b->reloc.presumed_offset = -1; > > > >So as I understand it, you are caching the execbuf/obj/reloc for the > >workload and then may reissue later with different seqno on different > >rings? In which case we have a problem as the kernel will write back the > > > > >updated offsets to b->reloc.presumed_offset and b->obj[].offset and in > >future passes they will match and the seqno write will go into the wrong > >slot (if it swaps rings). > > > >You either want to reset presumed_offset=-1 each time, or better for all > >concerned write the correct address alongside the seqno (which also > >enables NORELOC). > > > >Delta incoming. > > Only the seqno changes, but each engine has its own eb/obj/reloc. So > I think there is no problem. Or is there still? bb_handle is per engine as well. Ugh. No, that seems like you are self-consistent, you just need to remove the -1 and your code is NORELOC correct. I wouldn't go so far as having entirely separate batches for each engine though :) -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
On 20/04/2017 15:52, Chris Wilson wrote: On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote: + wrk->nr_bb[engine]++; + + if (engine == VCS && balance) { + engine = balance(wrk, w); + wrk->nr_bb[engine]++; + b = >b[engine - VCS1]; + + if (flags & SEQNO) + update_bb_seqno(b, engine, + ++wrk->seqno[engine]); + } + + if (w->duration.min != w->duration.max) { + unsigned int d = get_duration(>duration); + unsigned long offset; + + offset = ALIGN(b->bb_sz - get_bb_sz(d), + 2 * sizeof(uint32_t)); + b->eb.batch_start_offset = offset; + } + + gem_execbuf(fd, >eb); Likely double counting wrk->nr_bb. I suggest placing it next to the gem_execbuf(). Just convenience in balancing mode so that nr(VCS) = nr(VCS1) + nr(VCS2). Also from a different angle, if the sum does not hold, that means workload had auto-balancing and explicit VCS1/2 batches. It's only used to print out the stats at the end of the run. Regards, Tvrtko ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote: > + wrk->nr_bb[engine]++; > + > + if (engine == VCS && balance) { > + engine = balance(wrk, w); > + wrk->nr_bb[engine]++; > + b = >b[engine - VCS1]; > + > + if (flags & SEQNO) > + update_bb_seqno(b, engine, > + ++wrk->seqno[engine]); > + } > + > + if (w->duration.min != w->duration.max) { > + unsigned int d = get_duration(>duration); > + unsigned long offset; > + > + offset = ALIGN(b->bb_sz - get_bb_sz(d), > +2 * sizeof(uint32_t)); > + b->eb.batch_start_offset = offset; > + } > + > + gem_execbuf(fd, >eb); Likely double counting wrk->nr_bb. I suggest placing it next to the gem_execbuf(). -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
On 20/04/2017 15:33, Chris Wilson wrote: On Thu, Apr 20, 2017 at 03:23:27PM +0100, Chris Wilson wrote: You either want to reset presumed_offset=-1 each time, or better for all concerned write the correct address alongside the seqno (which also enables NORELOC). Delta incoming. See attached. Next concern is that I have full rings which implies that we are not waiting on each batch before resubmitting with a new seqno? If I throw a assert(!busy(batch_bo)) before the *b->mapped_seqno am I going to be upset? Yes you would. :) I had a sync (as a move to cpu domain) before seqno update in the last version but it disappeared as I was fixing the whole area of seqno tracking. So the balancing results in the patch are bogus since the seqno can jump to latest ahead of the time... Regards, Tvrtko ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
On 20/04/2017 15:23, Chris Wilson wrote: On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote: +static void +alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b, +enum intel_engine_id engine, unsigned int flags) +{ + unsigned int bb_i, j = 0; + + b->obj[j].handle = gem_create(fd, 4096); + b->obj[j].flags = EXEC_OBJECT_WRITE; + j++; + + if (flags & SEQNO) { + b->obj[j].handle = wrk->status_page_handle; + j++; + } + + bb_i = j++; + b->bb_sz = get_bb_sz(w->duration.max); + b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz); + terminate_bb(w, b, engine, flags); + + igt_assert(w->dependency <= 0); + if (w->dependency) { + int dep_idx = w->idx + w->dependency; + + igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps); + igt_assert(wrk->steps[dep_idx].type == BATCH); + + b->obj[j].handle = b->obj[bb_i].handle; + bb_i = j; + b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle; + j++; + + if (wrk->steps[dep_idx].b[1].obj[0].handle) { + b->obj[j].handle = b->obj[bb_i].handle; + bb_i = j; + b->obj[j - 1].handle = + wrk->steps[dep_idx].b[1].obj[0].handle; + j++; + } + } + + if (flags & SEQNO) { + b->reloc.presumed_offset = -1; So as I understand it, you are caching the execbuf/obj/reloc for the workload and then may reissue later with different seqno on different rings? In which case we have a problem as the kernel will write back the > > updated offsets to b->reloc.presumed_offset and b->obj[].offset and in future passes they will match and the seqno write will go into the wrong slot (if it swaps rings). You either want to reset presumed_offset=-1 each time, or better for all concerned write the correct address alongside the seqno (which also enables NORELOC). Delta incoming. Only the seqno changes, but each engine has its own eb/obj/reloc. So I think there is no problem. Or is there still? Regards, Tvrtko ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
On Thu, Apr 20, 2017 at 03:23:27PM +0100, Chris Wilson wrote: > You either want to reset presumed_offset=-1 each time, or better for all > concerned write the correct address alongside the seqno (which also > enables NORELOC). > > Delta incoming. See attached. Next concern is that I have full rings which implies that we are not waiting on each batch before resubmitting with a new seqno? If I throw a assert(!busy(batch_bo)) before the *b->mapped_seqno am I going to be upset? -Chris -- Chris Wilson, Intel Open Source Technology Centre >From 5bf424c2719e81f926b74f4136610cbdfd26a4d8 Mon Sep 17 00:00:00 2001 From: Chris WilsonDate: Thu, 20 Apr 2017 15:30:07 +0100 Subject: [PATCH] seqno-reloc --- benchmarks/gem_wsim.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c index adf2d6de..e616335b 100644 --- a/benchmarks/gem_wsim.c +++ b/benchmarks/gem_wsim.c @@ -45,6 +45,8 @@ #include "drmtest.h" #include "intel_io.h" +#define LOCAL_I915_GEM_DOMAIN_WC 0x80 + enum intel_engine_id { RCS, BCS, @@ -86,7 +88,9 @@ struct w_step struct drm_i915_gem_relocation_entry reloc; unsigned long bb_sz; uint32_t bb_handle; - uint32_t *mapped_batch, *mapped_seqno; + uint32_t *mapped_batch; + uint64_t *mapped_address; + uint32_t *mapped_seqno; unsigned int mapped_len; } b[2]; /* One for each VCS when load balancing */ }; @@ -405,7 +409,8 @@ terminate_bb(struct w_step *w, struct w_step_eb *b, enum intel_engine_id engine, mmap_len += cmd_offset - mmap_start; gem_set_domain(fd, b->bb_handle, - I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU); + LOCAL_I915_GEM_DOMAIN_WC, + LOCAL_I915_GEM_DOMAIN_WC); ptr = gem_mmap__wc(fd, b->bb_handle, mmap_start, mmap_len, PROT_WRITE); cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start); @@ -415,6 +420,7 @@ terminate_bb(struct w_step *w, struct w_step_eb *b, enum intel_engine_id engine, b->reloc.delta = (engine - VCS1) * sizeof(uint32_t); *cs++ = MI_STORE_DWORD_IMM; + b->mapped_address = (uint64_t *)cs; *cs++ = 0; *cs++ = 0; b->mapped_seqno = cs; @@ -469,7 +475,6 @@ alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b, } if (flags & SEQNO) { - b->reloc.presumed_offset = -1; b->reloc.target_handle = 1; b->obj[bb_i].relocs_ptr = to_user_pointer(>reloc); b->obj[bb_i].relocation_count = 1; @@ -485,8 +490,7 @@ alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b, engine = VCS1; b->eb.flags = eb_engine_map[engine]; b->eb.flags |= I915_EXEC_HANDLE_LUT; - if (!(flags & SEQNO)) - b->eb.flags |= I915_EXEC_NO_RELOC; + b->eb.flags |= I915_EXEC_NO_RELOC; #ifdef DEBUG printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n", w->idx, b->eb.buffer_count, b->obj[0].handle, @@ -628,8 +632,9 @@ static void update_bb_seqno(struct w_step_eb *b, enum intel_engine_id engine, uint32_t seqno) { - *b->mapped_seqno = seqno; b->reloc.delta = (engine - VCS1) * sizeof(uint32_t); + *b->mapped_address = b->reloc.presumed_offset + b->reloc.delta; + *b->mapped_seqno = seqno; } static void -- 2.11.0 ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote: > +static void > +alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b, > + enum intel_engine_id engine, unsigned int flags) > +{ > + unsigned int bb_i, j = 0; > + > + b->obj[j].handle = gem_create(fd, 4096); > + b->obj[j].flags = EXEC_OBJECT_WRITE; > + j++; > + > + if (flags & SEQNO) { > + b->obj[j].handle = wrk->status_page_handle; > + j++; > + } > + > + bb_i = j++; > + b->bb_sz = get_bb_sz(w->duration.max); > + b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz); > + terminate_bb(w, b, engine, flags); > + > + igt_assert(w->dependency <= 0); > + if (w->dependency) { > + int dep_idx = w->idx + w->dependency; > + > + igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps); > + igt_assert(wrk->steps[dep_idx].type == BATCH); > + > + b->obj[j].handle = b->obj[bb_i].handle; > + bb_i = j; > + b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle; > + j++; > + > + if (wrk->steps[dep_idx].b[1].obj[0].handle) { > + b->obj[j].handle = b->obj[bb_i].handle; > + bb_i = j; > + b->obj[j - 1].handle = > + wrk->steps[dep_idx].b[1].obj[0].handle; > + j++; > + } > + } > + > + if (flags & SEQNO) { > + b->reloc.presumed_offset = -1; So as I understand it, you are caching the execbuf/obj/reloc for the workload and then may reissue later with different seqno on different rings? In which case we have a problem as the kernel will write back the updated offsets to b->reloc.presumed_offset and b->obj[].offset and in future passes they will match and the seqno write will go into the wrong slot (if it swaps rings). You either want to reset presumed_offset=-1 each time, or better for all concerned write the correct address alongside the seqno (which also enables NORELOC). Delta incoming. -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
[Intel-gfx] [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
From: Tvrtko UrsulinTool which emits batch buffers to engines with configurable sequences, durations, contexts, dependencies and userspace waits. Unfinished but shows promise so sending out for early feedback. v2: * Load workload descriptors from files. (also -w) * Help text. * Calibration control if needed. (-t) * NORELOC | LUT to eb flags. * Added sample workload to wsim/workload1. v3: * Multiple parallel different workloads (-w -w ...). * Multi-context workloads. * Variable (random) batch length. * Load balancing (round robin and queue depth estimation). * Workloads delays and explicit sync steps. * Workload frequency (period) control. v4: * Fixed queue-depth estimation by creating separate batches per engine when qd load balancing is on. * Dropped separate -s cmd line option. It can turn itself on automatically when needed. * Keep a single status page and lie about the write hazard as suggested by Chris. * Use batch_start_offset for controlling the batch duration. (Chris) * Set status page object cache level. (Chris) * Moved workload description to a README. * Tidied example workloads. * Some other cleanups and refactorings. TODO list: * Fence support. * Better error handling. * Less 1980's workload parsing. * Proper workloads. * Threads? * ... ? Signed-off-by: Tvrtko Ursulin Cc: Chris Wilson Cc: "Rogozhkin, Dmitry V" --- Comparing some test workloads under load balancing it seems that it is starting to work, but it still needs more thorough verification. For example, round- robin balancing: # benchmarks/gem_wsim -n 585341 \ -w benchmarks/wsim/vcs1.wsim \ -w benchmarks/wsim/vcs_balanced.wsim \ -r 100 -b 0 Using 585341 nop calibration for 1000us delay. 2 clients. 1: 3.008s elapsed (33.243 workloads/s). 2500 (1250 + 1250) total VCS batches. 0: 4.455s elapsed (22.449 workloads/s). 0 (2500 + 0) total VCS batches. 4.455s elapsed (44.889 workloads/s) Versus the queue-depth estimation: # benchmarks/gem_wsim -n 585341 \ -w benchmarks/wsim/vcs1.wsim \ -w benchmarks/wsim/vcs_balanced.wsim \ -r 100 -b 1 Using 585341 nop calibration for 1000us delay. 2 clients. 1: 2.239s elapsed (44.659 workloads/s). 2500 (837 + 1663) total VCS batches. Average queue depths 27.575, 19.285. 0: 4.012s elapsed (24.928 workloads/s). 0 (2500 + 0) total VCS batches. Average queue depths -nan, -nan. 4.012s elapsed (49.845 workloads/s) In both cases we run two workloads, one which only submits to VCS1 and one which can be load-balanced. The latter gets a ~33% boost with queue-depth estimation, and the non-balancing workload ~10%. --- benchmarks/Makefile.sources |1 + benchmarks/gem_wsim.c| 1014 ++ benchmarks/wsim/README | 54 ++ benchmarks/wsim/media_17i7.wsim |7 + benchmarks/wsim/media_load_balance_17i7.wsim |7 + benchmarks/wsim/vcs1.wsim| 25 + benchmarks/wsim/vcs_balanced.wsim| 25 + 7 files changed, 1133 insertions(+) create mode 100644 benchmarks/gem_wsim.c create mode 100644 benchmarks/wsim/README create mode 100644 benchmarks/wsim/media_17i7.wsim create mode 100644 benchmarks/wsim/media_load_balance_17i7.wsim create mode 100644 benchmarks/wsim/vcs1.wsim create mode 100644 benchmarks/wsim/vcs_balanced.wsim diff --git a/benchmarks/Makefile.sources b/benchmarks/Makefile.sources index 3af54ebe36f2..3a941150abb3 100644 --- a/benchmarks/Makefile.sources +++ b/benchmarks/Makefile.sources @@ -14,6 +14,7 @@ benchmarks_prog_list =\ gem_prw \ gem_set_domain \ gem_syslatency \ + gem_wsim\ kms_vblank \ prime_lookup\ vgem_mmap \ diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c new file mode 100644 index ..adf2d6decf12 --- /dev/null +++ b/benchmarks/gem_wsim.c @@ -0,0 +1,1014 @@ +/* + * Copyright © 2017 Intel Corporation + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice (including the next + * paragraph) shall be included in all copies or substantial portions of the + * Software. + * + *