On 07.04.2018 04:31, Marek Olšák wrote:
From: Marek Olšák <marek.ol...@amd.com>

(This patch doesn't enable the behavior. It will be enabled in a later
commit.)

Draw calls from multiple IBs can be executed in parallel.

v2: do emit partial flushes on SI
v3: invalidate all shader caches at the beginning of IBs
v4: don't call si_emit_cache_flush in si_flush_gfx_cs if not needed,
     only do this for flushes invoked internally
v5: empty IBs should wait for idle if the flush requires it
v6: split the commit

If we artificially limit the number of draw calls per IB to 5, we'll get
a lot more IBs, leading to a lot more partial flushes. Let's see how
the removal of partial flushes changes GPU utilization in that scenario:

With partial flushes (time busy):
     CP: 99%
     SPI: 86%
     CB: 73%

Without partial flushes (time busy):
     CP: 99%
     SPI: 93%
     CB: 81%
  src/gallium/drivers/radeon/radeon_winsys.h |  7 ++++
  src/gallium/drivers/radeonsi/si_gfx_cs.c   | 52 ++++++++++++++++++++++--------
  src/gallium/drivers/radeonsi/si_pipe.h     |  1 +
  3 files changed, 46 insertions(+), 14 deletions(-)
+       /* Always invalidate caches at the beginning of IBs, because external
+        * users (e.g. BO evictions and SDMA/UVD/VCE IBs) can modify our
+        * buffers.
+        *
+        * Note that the cache flush done by the kernel at the end of GFX IBs
+        * isn't useful here, because that flush can finish after the following
+        * IB starts drawing.
+        *
+        * TODO: Do we also need to invalidate CB & DB caches?

I don't think so.

Kernel buffer move: CB & DB caches use logical addressing, so they should be unaffected.

UVD: APIs should forbid writing to the currently bound framebuffer.

CPU: Shouldn't be writing directly to the framebuffer, and even if it does (linear framebuffer?), I believe OpenGL requires re-binding the framebuffer.


+        */
+       ctx->flags |= SI_CONTEXT_INV_ICACHE |
+                     SI_CONTEXT_INV_SMEM_L1 |
+                     SI_CONTEXT_INV_VMEM_L1 |
+                     SI_CONTEXT_INV_GLOBAL_L2 |
+                     SI_CONTEXT_START_PIPELINE_STATS;
        /* set all valid group as dirty so they get reemited on
         * next draw command
        /* The CS initialization should be emitted before everything else. */
        si_pm4_emit(ctx, ctx->init_config);
        if (ctx->init_config_gs_rings)
                si_pm4_emit(ctx, ctx->init_config_gs_rings);
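
For readers following along, the pattern in the hunk above (queue invalidation bits at the start of an IB, consume them at the next flush emit) can be sketched in isolation. This is only a toy model: the TOY_* bit values, struct toy_context, and both functions are hypothetical stand-ins, not the real SI_CONTEXT_* definitions or radeonsi functions.

```c
/* Hypothetical bit values; the real SI_CONTEXT_* constants live in si_pipe.h. */
enum {
    TOY_INV_ICACHE           = 1u << 0,
    TOY_INV_SMEM_L1          = 1u << 1,
    TOY_INV_VMEM_L1          = 1u << 2,
    TOY_INV_GLOBAL_L2        = 1u << 3,
    TOY_START_PIPELINE_STATS = 1u << 4,
};

struct toy_context {
    unsigned flags; /* pending flush flags, consumed at the next emit */
};

/* Queue the invalidations that every new IB starts with. */
static void toy_begin_new_ib(struct toy_context *ctx)
{
    ctx->flags |= TOY_INV_ICACHE | TOY_INV_SMEM_L1 | TOY_INV_VMEM_L1 |
                  TOY_INV_GLOBAL_L2 | TOY_START_PIPELINE_STATS;
}

/* Consume the pending flags and return what would have been emitted. */
static unsigned toy_emit_cache_flush(struct toy_context *ctx)
{
    unsigned emitted = ctx->flags;

    ctx->flags = 0;
    return emitted;
}
```

Because the bits are OR'd in rather than assigned, any flags already pending from earlier state changes survive and are emitted together with the invalidations.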
diff --git a/src/gallium/drivers/radeonsi/si_pipe.h 
index 0c90a6c6e46..f0f323ff3a7 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.h
+++ b/src/gallium/drivers/radeonsi/si_pipe.h
@@ -540,20 +540,21 @@ struct si_context {
        void                            *vs_blit_texcoord;
        struct si_screen                *screen;
        struct pipe_debug_callback      debug;
        LLVMTargetMachineRef            tm; /* only non-threaded compilation */
        struct si_shader_ctx_state      fixed_func_tcs_shader;
        struct r600_resource            *wait_mem_scratch;
        unsigned                        wait_mem_number;
        uint16_t                        prefetch_L2_mask;
        bool                            gfx_flush_in_progress:1;
+       bool                            gfx_last_ib_is_busy:1;
        bool                            compute_is_busy:1;
        unsigned                        num_gfx_cs_flushes;
        unsigned                        initial_gfx_cs_size;
        unsigned                        gpu_reset_counter;
        unsigned                        last_dirty_tex_counter;
        unsigned                        last_compressed_colortex_counter;
        unsigned                        last_num_draw_calls;
        unsigned                        flags; /* flush flags */
        /* Current unaccounted memory usage. */

