On Thu, 2017-09-28 at 21:55 -0700, Kenneth Graunke wrote: > On Thursday, September 28, 2017 3:33:12 AM PDT Iago Toral Quiroga > wrote: > > we can skip these slots when they are not read in the fragment > > shader > > and they are positioned right after a VUE header that we are > > already > > skipping. We also need to ensure that we are passing at least one > > other > > varying, since that is a hardware requirement. > > > > This helps alleviate a problem introduced with 99df02ca26f61 for > > separate shader objects: without separate shader objects we assign > > locations sequentially, however, since that commit we have changed > > the > > method for SSO so that the VUE slot assigned depends on the number > > of > > builtin slots plus the location assigned to the varying. This fixed > > layout is intended to help SSO programs by avoiding on-the-fly > > recompiles > > when swapping out shaders, however, it also means that if a varying > > uses > > a large location number close to the maximum allowed by the SF/FS > > units > > (31), then the offset introduced by the number of builtin slots can > > push > > the location outside the range and trigger an assertion. > > > > This problem is affecting at least the following CTS tests for > > enhanced layouts: > > > > KHR-GL45.enhanced_layouts.varying_array_components > > KHR-GL45.enhanced_layouts.varying_array_locations > > KHR-GL45.enhanced_layouts.varying_components > > KHR-GL45.enhanced_layouts.varying_locations > > > > which use SSO and the the location layout qualifier to select such > > location numbers explicitly. > > > > This change helps these tests because for SSO we always have to > > include > > VARYING_SLOT_CLIP_DIST{0,1} even if the fragment shader is very > > unlikely > > to read them, so by doing this we free builtin slots from the fixed > > VUE > > layout and we avoid the tests to crash in this scenario. > > > > Of course, this is not a proper fix, we'd still run into problems > > if someone > > tries to use an explicit max location and read gl_ViewportIndex, > > gl_LayerID or > > gl_CullDistancein in the FS, but that would be a much less common > > bug and we > > can probably wait to see if anyone actually runs into that > > situation in a real > > world scenario before making the decision that more aggresive > > changes are > > required to support this without reverting 99df02ca26f61. > > > > v2: > > - Add a debug message when we skip clip distances (Ilia) > > - we also need to account for this when we compute the urb setup > > for the fragment shader stage, so add a compiler util to compute > > the first slot that we need to read from the URB instead of > > replicating the logic in both places. > > > > Suggested-by: Kenneth Graunke <kenn...@whitecape.org> > > --- > > src/intel/compiler/brw_compiler.h | 50 > > +++++++++++++++++++++++++++ > > src/intel/compiler/brw_fs.cpp | 7 ++-- > > src/mesa/drivers/dri/i965/genX_state_upload.c | 16 ++++----- > > 3 files changed, 59 insertions(+), 14 deletions(-) > > > > diff --git a/src/intel/compiler/brw_compiler.h > > b/src/intel/compiler/brw_compiler.h > > index 6753a8daf0..411919d829 100644 > > --- a/src/intel/compiler/brw_compiler.h > > +++ b/src/intel/compiler/brw_compiler.h > > @@ -25,6 +25,7 @@ > > #define BRW_COMPILER_H > > > > #include <stdio.h> > > +#include "common/gen_debug.h" > > #include "common/gen_device_info.h" > > #include "main/mtypes.h" > > #include "main/macros.h" > > @@ -1222,6 +1223,55 @@ brw_stage_has_packed_dispatch(const struct > > gen_device_info *devinfo, > > } > > } > > > > +static inline int > > +brw_compute_sf_first_urb_slot(uint64_t inputs_read, > > + const struct brw_vue_map > > *prev_stage_vue_map) > > +{ > > + /* If the fragment shader reads VARYING_SLOT_LAYER, then we > > need to pass in > > + * the full vertex header. Otherwise, we can program the SF to > > start > > + * reading at an offset of 1 (2 varying slots) to skip > > unnecessary data: > > + * - VARYING_SLOT_PSIZ and BRW_VARYING_SLOT_NDC on gen4-5 > > + * - VARYING_SLOT_{PSIZ,LAYER} and VARYING_SLOT_POS on gen6+ > > + */ > > + bool include_vue_header = > > + inputs_read & (VARYING_BIT_LAYER | VARYING_BIT_VIEWPORT); > > + > > + int first_slot = > > + include_vue_header ? 0 : 2 * BRW_SF_URB_ENTRY_READ_OFFSET; > > + > > + /* If we are already skipping the first 2 slots for the VUE > > header then we > > + * can also skip clip distances if they are located right after > > the header > > + * and the FS doesn't read them. Only do this is there are any > > user varyings > > + * though, since the hardware requires that we pass at least > > one varying > > + * between stages. > > + */ > > + uint64_t var_bits = inputs_read & (~(VARYING_BIT_VAR(0) - 1)); > > + uint64_t clip_dist_bits = VARYING_BIT_CLIP_DIST0 | > > VARYING_BIT_CLIP_DIST1; > > + > > + if (!include_vue_header && var_bits && > > + (inputs_read & clip_dist_bits) == 0 && > > + (prev_stage_vue_map->slots_valid & clip_dist_bits)) { > > + > > + uint32_t clip_dist0_slot = > > + prev_stage_vue_map- > > >varying_to_slot[VARYING_SLOT_CLIP_DIST0]; > > + > > + uint32_t clip_dist1_slot = > > + prev_stage_vue_map- > > >varying_to_slot[VARYING_SLOT_CLIP_DIST1]; > > + > > + if (clip_dist0_slot >= 2 && clip_dist0_slot <= 3 && > > + clip_dist1_slot >= 2 && clip_dist1_slot <= 3) { > > + first_slot += 2 * BRW_SF_URB_ENTRY_READ_OFFSET; > > Do we actually really care about clip distance, > specifically? Couldn't > we basically just do: > > for (int i = 0; i < prev_stage_vue_map->num_slots; i++) { > int varying = prev_stage_vue_map->slot_to_varying[i]; > if (varying > 0 && (inputs_read & BITFIELD64_BIT(varying)) != > 0) > return ROUND_DOWN_TO(i, 2); > } > > return 0; > > to find the first slot occupied by something, or the header slot if > there are no varyings read at all? That would take care of skipping > over the header, clip distance, gl_Color, and any other junk slots.
Yes, that is a lot more useful, I don't know why I focused so much in clip distances specifically. I'll send a v3, thanks Ken! > > + > > + if (unlikely(INTEL_DEBUG & DEBUG_URB)) { > > + fprintf(stderr, "SF unit skips unused clip distance > > slots after " > > + "the VUE header.\n"); > > + } > > + } > > + } > > + > > + return first_slot; > > +} > > + > > #ifdef __cplusplus > > } /* extern "C" */ > > #endif > > diff --git a/src/intel/compiler/brw_fs.cpp > > b/src/intel/compiler/brw_fs.cpp > > index b13b56cfba..5b444f7baa 100644 > > --- a/src/intel/compiler/brw_fs.cpp > > +++ b/src/intel/compiler/brw_fs.cpp > > @@ -1481,9 +1481,6 @@ fs_visitor::calculate_urb_setup() > > } > > } > > } else { > > - bool include_vue_header = > > - nir->info.inputs_read & (VARYING_BIT_LAYER | > > VARYING_BIT_VIEWPORT); > > - > > /* We have enough input varyings that the SF/SBE pipeline > > stage can't > > * arbitrarily rearrange them to suit our whim; we have > > to put them > > * in an order that matches the output of the previous > > pipeline stage > > @@ -1493,8 +1490,10 @@ fs_visitor::calculate_urb_setup() > > brw_compute_vue_map(devinfo, &prev_stage_vue_map, > > key->input_slots_valid, > > nir->info.separate_shader); > > + > > int first_slot = > > - include_vue_header ? 0 : 2 * > > BRW_SF_URB_ENTRY_READ_OFFSET; > > + brw_compute_sf_first_urb_slot(nir->info.inputs_read, > > + &prev_stage_vue_map); > > > > assert(prev_stage_vue_map.num_slots <= first_slot + 32); > > for (int slot = first_slot; slot < > > prev_stage_vue_map.num_slots; > > diff --git a/src/mesa/drivers/dri/i965/genX_state_upload.c > > b/src/mesa/drivers/dri/i965/genX_state_upload.c > > index 2a99376e3c..a3c1cf5166 100644 > > --- a/src/mesa/drivers/dri/i965/genX_state_upload.c > > +++ b/src/mesa/drivers/dri/i965/genX_state_upload.c > > @@ -1029,17 +1029,13 @@ genX(calculate_attr_overrides)(const struct > > brw_context *brw, > > > > *point_sprite_enables = 0; > > > > - /* If the fragment shader reads VARYING_SLOT_LAYER, then we > > need to pass in > > - * the full vertex header. Otherwise, we can program the SF to > > start > > - * reading at an offset of 1 (2 varying slots) to skip > > unnecessary data: > > - * - VARYING_SLOT_PSIZ and BRW_VARYING_SLOT_NDC on gen4-5 > > - * - VARYING_SLOT_{PSIZ,LAYER} and VARYING_SLOT_POS on gen6+ > > - */ > > - > > - bool fs_needs_vue_header = fp->info.inputs_read & > > - (VARYING_BIT_LAYER | VARYING_BIT_VIEWPORT); > > + int first_slot = > > + brw_compute_sf_first_urb_slot(fp->info.inputs_read, > > + &brw->vue_map_geom_out); > > > > - *urb_entry_read_offset = fs_needs_vue_header ? 0 : 1; > > + /* There are 2 slots in each offset */ > > + assert(first_slot % 2 == 0); > > + *urb_entry_read_offset = first_slot / 2; > > > > /* From the Ivybridge PRM, Vol 2 Part 1, 3DSTATE_SBE, > > * description of dw10 Point Sprite Texture Coordinate Enable: > > > > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev