fs: SIMD32 support for fragment shaders

Eero Tamminen Tue, 29 May 2018 08:48:02 -0700

Hi,

On 25.05.2018 00:55, Jason Ekstrand wrote:

This patch series adds back-end compiler support for SIMD32 fragment
shaders.  Support is added and everything works but it's currently hidden
behind INTEL_DEBUG=do32.  We know that it improves performance in some
cases but we do not yet have a good enough heuristic to start turning it on
by default.  The objective of this series is just to get the compiler
infrastructure landed so that it stops bit-rotting in Curro's branch.


Tested v3 on BXT & SKL.  Everything seems to work otherwise fine.

Tested-by: Eero Tamminen <eero.t.tammi...@intel.com>

Figuring out a good heuristic is left as an exercise to the reader. :-)


A simple heuristic that just enables SIMD32 for everything that isn't
an MRT shader gives nice perf improvements on BXT J4205:
* +30% GfxBench ALU2
* +25% SynMark PSPom
* +10% GpuTest Julia32
* +9% GfxBench CarChase
* +7% GfxBench Manhattan 3.0
* +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
* +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark

* -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*, VSInstancing & ZBuffer

* -2-3% GLB 2.7 Fill
* -4-5% MemBW Blend

On SKL, perf differences are smaller.

SIMD32 can cause write-bound tests to thrash, which is visible as a perf
regression in the fully write-bound tests above (that's also the reason
why SIMD32 is good to disable with MRT shaders).
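
For illustration, the simple heuristic above could be sketched roughly as
follows.  This is a standalone sketch, not code from the series: struct
fs_info and want_simd32() are hypothetical names, and "MRT shader" is
assumed to mean a shader that writes more than one color render target.

```c
#include <stdbool.h>

/* Hypothetical summary of a fragment shader; not a Mesa structure. */
struct fs_info {
    unsigned num_color_outputs; /* color render targets the shader writes */
};

/*
 * Assumed heuristic: allow SIMD32 for everything except MRT shaders,
 * since those are the write-heavy cases that tend to regress.
 */
static bool want_simd32(const struct fs_info *fs)
{
    return fs->num_color_outputs <= 1;
}
```

A shader with a single color output would get SIMD32 here, while one
writing e.g. four render targets would stay at SIMD8/SIMD16.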

As to reads, SIMD32 improves cache locality until it starts thrashing.
In the above GfxBench tests, with the amount of texture sampling they do,
this shows in HW counters as increased texture cache misses (thrashing) but
fewer L3 misses (better locality).  Along with (more important) better
latency compensation, these explain why SIMD32 improves performance in
them.


More advanced heuristics that try to avoid the SIMD32 performance
regressions unfortunately also get rid of a clear part of the above
improvements.  Such heuristics would need an improved instruction scheduler
that provides feedback on which shaders have latency issues where SIMD32
would help.

(A potential run-time heuristic would be disabling SIMD32 when too
large textures are bound for a draw.)
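
A minimal sketch of such a run-time check, under stated assumptions:
struct bound_texture, simd32_ok_for_draw() and the max_bytes threshold
are all hypothetical, "too large" is assumed to be judged per texture,
and no threshold value is suggested here since none has been measured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one texture bound for a draw. */
struct bound_texture {
    uint64_t size_bytes;
};

/*
 * Returns false (fall back to narrower dispatch) if any bound texture
 * is larger than max_bytes; max_bytes is an assumed tunable.
 */
static bool simd32_ok_for_draw(const struct bound_texture *tex,
                               size_t count, uint64_t max_bytes)
{
    for (size_t i = 0; i < count; i++) {
        if (tex[i].size_bytes > max_bytes)
            return false;
    }
    return true;
}
```

A draw with no textures bound passes the check trivially; whether a
per-texture limit or a total working-set limit fits better would need
measuring.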


        - Eero

Francisco Jerez (34):
   intel/eu: Remove brw_codegen::compressed_stack.
   intel/fs: Rename a local variable so it doesn't shadow component()
   intel/fs: Use the ATTR file for FS inputs
   intel/fs: Replace the CINTERP opcode with a simple MOV
   intel/fs: Add explicit last_rt flag to fb writes orthogonal to eot.
   intel/fs: Fix Gen4-5 FB write AA data payload munging for non-EOT
     writes.
   intel/eu: Return new instruction to caller from brw_fb_WRITE().
   intel/fs: Fix fs_inst::flags_written() for Gen4-5 FB writes.
   intel/fs: Fix implied_mrf_writes() for headerless FB writes.
   intel/fs: Remove program key argument from generator.
   intel/fs: Disable SIMD32 dispatch on Gen4-6 with control flow
   intel/fs: Disable SIMD32 dispatch for fragment shaders with discard.
   intel/eu: Fix pixel interpolator queries for SIMD32.
   intel/fs: Fix codegen of FS_OPCODE_SET_SAMPLE_ID for SIMD32.
   intel/fs: Don't enable dual source blend if no outputs are written
   intel/fs: Fix FB write message control codegen for SIMD32.
   intel/fs: Fix logical FB write lowering for SIMD32
   intel/fs: Fix FB read header setup for SIMD32.
   intel/fs: Rework INTERPOLATE_AT_PER_SLOT_OFFSET
   intel/fs: Mark LINTERP opcode as writing accumulator implicitly on
     pre-Gen7.
   intel/fs: Disable opt_sampler_eot() in 32-wide dispatch.
   i965: Add plumbing for shader time in 32-wide FS dispatch mode.
   intel/fs: Simplify fs_visitor::emit_samplepos_setup
   intel/fs: Use fs_regs instead of brw_regs in the unlit centroid
     workaround
   intel/fs: Wrap FS payload register look-up in a helper function.
   intel/fs: Extend thread payload layout to SIMD32
   intel/fs: Implement 32-wide FS payload setup on Gen6+
   intel/fs: Fix Gen7 compressed source region alignment restriction for
     SIMD32
   intel/fs: Fix sample id setup for SIMD32.
   intel/fs: Generalize the unlit centroid workaround
   intel/fs: Fix Gen6+ interpolation setup for SIMD32
   intel/fs: Fix fs_builder::sample_mask_reg() for 32-wide FS dispatch.
   intel/fs: Fix nir_intrinsic_load_helper_invocation for SIMD32.
   intel/fs: Build 32-wide FS shaders.

Jason Ekstrand (19):
   intel/fs: Assert that the gen4-6 plane restrictions are followed
   intel/fs: Use groups for SIMD16 LINTERP on gen11+
   intel/fs: FS_OPCODE_REP_FB_WRITE has side effects
   intel/fs: Properly track implied header regs read by FB writes
   intel/fs: Pull FB write implied headers from src[0]
   intel/fs: Set up FB write message headers in the visitor
   i965: Re-arrange shader kernel setup in WM state
   intel/compiler: Add and use helpers for working with KSP indices
   intel/fs: Rework KSP data to be SIMD width-based
   intel/fs: Split instructions low to high in lower_simd_width
   intel/fs: Properly copy default flag reg for 3src instrucitons
   intel/fs: Add the group to the flag subreg number on SNB and older
   intel/fs: Emit LINE+MAC for LINTERP with unaligned coordinates
   intel/fs: Emit MOV_DISPATCH_TO_FLAGS once for the centroid workaround
   intel/fs: Get rid of MOV_DISPATCH_TO_FLAGS
   intel/fs: Add fields to wm_prog_data for SIMD32 dispatch
   intel/anv,blorp,i965: Implement the SKL 16x MSAA SIMD32 workaround
   intel/fs: Remove support push constants in repclear shaders
   intel/fs: Support SIMD32 repclear shaders

  src/intel/blorp/blorp.c                       |   2 +-
  src/intel/blorp/blorp_genX_exec.h             |  82 +++-
  src/intel/compiler/brw_compiler.h             |  98 +++-
  src/intel/compiler/brw_eu.h                   |  21 +-
  src/intel/compiler/brw_eu_defines.h           |   2 -
  src/intel/compiler/brw_eu_emit.c              |  39 +-
  src/intel/compiler/brw_fs.cpp                 | 666 ++++++++++++++++----------
  src/intel/compiler/brw_fs.h                   |  53 +-
  src/intel/compiler/brw_fs_builder.h           |   6 +-
  src/intel/compiler/brw_fs_cse.cpp             |   1 -
  src/intel/compiler/brw_fs_generator.cpp       | 318 ++++++------
  src/intel/compiler/brw_fs_nir.cpp             |  57 ++-
  src/intel/compiler/brw_fs_visitor.cpp         | 193 ++++----
  src/intel/compiler/brw_ir_fs.h                |   1 +
  src/intel/compiler/brw_shader.cpp             |  12 +-
  src/intel/compiler/brw_vec4.cpp               |   2 +-
  src/intel/compiler/brw_vec4_gs_visitor.cpp    |   2 +-
  src/intel/compiler/brw_vec4_tcs.cpp           |   2 +-
  src/intel/compiler/brw_wm_iz.cpp              |  11 +-
  src/intel/vulkan/anv_pipeline.c               |   2 +-
  src/intel/vulkan/genX_pipeline.c              |  40 +-
  src/mesa/drivers/dri/i965/brw_context.h       |   1 +
  src/mesa/drivers/dri/i965/brw_program.c       |   6 +
  src/mesa/drivers/dri/i965/brw_wm.c            |   6 +-
  src/mesa/drivers/dri/i965/gen4_blorp_exec.h   |  17 +-
  src/mesa/drivers/dri/i965/genX_state_upload.c | 144 ++++--
  26 files changed, 1101 insertions(+), 683 deletions(-)


