Generally, instructions in Align16 mode only ever write to a single register and don't need any form of SIMD splitting, which is why we have never had a SIMD splitting pass in the vec4 backend. However, double-precision instructions typically write 2 registers, and in some cases they run into hardware bugs and limitations that we need to work around by splitting the instructions so that we only write to 1 register at a time. This patch implements a SIMD splitting pass similar to the one in the scalar backend.
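To make the register counts above concrete, here is a minimal standalone sketch (not part of the patch). It assumes a 32-byte GRF register, and the names REG_SIZE_BYTES, div_round_up() and regs_written() are hypothetical stand-ins for the REG_SIZE / DIV_ROUND_UP() computation the pass performs on each split instruction.

```cpp
#include <cassert>

// Assumption: one GRF register holds 32 bytes (REG_SIZE in the i965 backend).
constexpr unsigned REG_SIZE_BYTES = 32;

// Hypothetical helper mirroring the DIV_ROUND_UP() macro.
constexpr unsigned div_round_up(unsigned n, unsigned d)
{
   return (n + d - 1) / d;
}

// Hypothetical helper: number of registers written by one instruction with
// the given execution size and a destination type of type_size bytes.
constexpr unsigned regs_written(unsigned exec_size, unsigned type_size)
{
   return div_round_up(exec_size * type_size, REG_SIZE_BYTES);
}

// 8 channels of 32-bit floats fit in a single register...
static_assert(regs_written(8, 4) == 1, "8-wide float writes 1 register");
// ...while 8 channels of 64-bit doubles span two registers...
static_assert(regs_written(8, 8) == 2, "8-wide double writes 2 registers");
// ...and each 4-wide half of a split double instruction writes just one.
static_assert(regs_written(4, 8) == 1, "4-wide double writes 1 register");
```

Under these assumptions, an 8-wide double-precision instruction is the only one of the three cases that crosses a register boundary, which is the situation the splitting pass targets.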
Because we only use double-precision instructions in Align16 mode in gen7 (gen8+ is fully scalar and gens < 7 do not implement fp64), the pass should be a no-op on any other generation.

For now the pass only handles the gen7 restriction where any instruction that writes 2 registers also needs to read 2 registers. This affects double-precision instructions reading uniforms, for example. Later patches will extend the lowering pass with a few more cases.

v2:
- Compute number of registers written instead of fixing it to 1 (Iago)
- Use group from backend_instruction (Iago)
- Drop assertion that checked that we only split 8-wide instructions into 4-wide. (Curro)
- Don't assume that instructions can only be 8-wide, we might want to use 16-wide instructions in the future too (Curro)
- Wrap gen7 workarounds in a conditional to ease adding workarounds for other gens in the future (Curro)
- Handle dst/src overlap hazard (Curro)
- Use the horiz_offset() helper to simplify the implementation (Curro)
- Drop the assertion that checks that each split instruction writes exactly one register (Curro)
- Use the copy constructor to generate split instructions with all the relevant fields initialized to the values in the original instruction instead of copying only a handful of them manually (Curro)
---

Curro: I think this version addresses all the feedback you had for the v1 (of course, changes to the semantics of reg_offset pending). I don't expect an Rb yet, I am only sending this now to see if there is anything else that you think should be improved.

This version uses a horiz_offset() helper and an is_align1_partial_write() method that are new to the series and that I have not posted to the list yet, but I think what they do is clear and you can still have a look at the patch without the code for them.
They are available in our i965-fp64-gen7-scalar-vec4-rc2 branch though; the specific commits where they are introduced are these:

https://github.com/Igalia/mesa/commit/1f57bc038217a3f4eddcaf2ed53462d318100cd2
https://github.com/Igalia/mesa/commit/8f0afe449d61ac1031dc868aef3a5b879ef06290

 src/mesa/drivers/dri/i965/brw_vec4.cpp | 157 ++++++++++++++++++++++++++++++++-
 src/mesa/drivers/dri/i965/brw_vec4.h   |   2 +
 2 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
index 29919fd..c9aafdc 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
@@ -1949,6 +1949,158 @@ vec4_visitor::convert_to_hw_regs()
    }
 }
 
+/**
+ * Get the closest native SIMD width supported by the hardware for instruction
+ * \p inst.  The instruction will be left untouched by
+ * vec4_visitor::lower_simd_width() if the returned value matches the
+ * instruction's original execution size.
+ */
+static unsigned
+get_lowered_simd_width(const struct brw_device_info *devinfo,
+                       const vec4_instruction *inst)
+{
+   unsigned lowered_width = MIN2(16, inst->exec_size);
+
+   /* We need to split some cases of double-precision instructions that write
+    * 2 registers. We only need to care about this in gen7 because that is the
+    * only hardware that implements fp64 in Align16.
+    */
+   if (devinfo->gen == 7 && inst->regs_written > 1) {
+      /* HSW PRM, 3D Media GPGPU Engine, Region Alignment Rules for Direct
+       * Register Addressing:
+       *
+       *    "When destination spans two registers, the source MUST span two
+       *     registers."
+       */
+      for (unsigned i = 0; i < 3; i++) {
+         if (inst->src[i].file == BAD_FILE)
+            continue;
+         if (inst->regs_read(i) < 2)
+            lowered_width = MIN2(lowered_width, 4);
+      }
+   }
+
+   return lowered_width;
+}
+
+static bool
+dst_src_regions_overlap(vec4_instruction *inst)
+{
+   if (inst->regs_written == 0)
+      return false;
+
+   unsigned dst_start = inst->dst.reg_offset;
+   unsigned dst_end = dst_start + inst->regs_written - 1;
+   for (int i = 0; i < 3; i++) {
+      if (inst->src[i].file == BAD_FILE)
+         continue;
+
+      if (inst->dst.file != inst->src[i].file ||
+          inst->dst.nr != inst->src[i].nr)
+         continue;
+
+      unsigned src_start = inst->src[i].reg_offset;
+      unsigned src_end = src_start + inst->regs_read(i) - 1;
+
+      if ((dst_start >= src_start && dst_start <= src_end) ||
+          (dst_end >= src_start && dst_end <= src_end) ||
+          (dst_start <= src_start && dst_end >= src_end)) {
+         return true;
+      }
+   }
+
+   return false;
+}
+
+bool
+vec4_visitor::lower_simd_width()
+{
+   bool progress = false;
+
+   foreach_block_and_inst_safe(block, vec4_instruction, inst, cfg) {
+      const unsigned lowered_width = get_lowered_simd_width(devinfo, inst);
+      assert(lowered_width <= inst->exec_size);
+      if (lowered_width == inst->exec_size)
+         continue;
+
+      /* We need to deal with source / destination overlaps when splitting.
+       * The hardware supports reading from and writing to the same register
+       * in the same instruction, but we need to be careful that each split
+       * instruction we produce does not corrupt the source of the next.
+       *
+       * The easiest way to handle this is to make the split instructions write
+       * to temporaries if there is an src/dst overlap and then move from the
+       * temporaries to the original destination. We also need to consider
+       * instructions that do partial writes via align1 opcodes, in which case
+       * we need to make sure that we initialize the temporary with the
+       * value of the instruction's dst.
+       */
+      bool needs_temp = dst_src_regions_overlap(inst);
+      for (unsigned n = 0; n < inst->exec_size / lowered_width; n++) {
+         unsigned channel_offset = lowered_width * n;
+
+         unsigned regs_written =
+            DIV_ROUND_UP(lowered_width * type_sz(inst->dst.type), REG_SIZE);
+
+         /* Create the split instruction from the original so that we copy all
+          * relevant instruction fields, then set the width and calculate the
+          * new dst/src regions.
+          */
+         vec4_instruction *linst = new(mem_ctx) vec4_instruction(*inst);
+         linst->exec_size = lowered_width;
+         linst->group = channel_offset;
+         linst->regs_written = regs_written;
+
+         /* Compute split dst region */
+         dst_reg dst;
+         if (needs_temp) {
+            dst = retype(dst_reg(VGRF, alloc.allocate(1)), inst->dst.type);
+            if (inst->is_align1_partial_write()) {
+               vec4_instruction *copy = MOV(dst, src_reg(inst->dst));
+               copy->exec_size = lowered_width;
+               copy->group = channel_offset;
+               copy->regs_written = regs_written;
+               inst->insert_before(block, copy);
+            }
+         } else {
+            dst = horiz_offset(inst->dst, channel_offset);
+         }
+         linst->dst = dst;
+
+         /* Compute split source regions */
+         for (int i = 0; i < 3; i++) {
+            if (linst->src[i].file == BAD_FILE)
+               continue;
+
+            if (!is_uniform(linst->src[i]))
+               linst->src[i] = horiz_offset(linst->src[i], channel_offset);
+         }
+
+         inst->insert_before(block, linst);
+
+         /* If we used a temporary to store the result of the split
+          * instruction, copy the result to the original destination
+          */
+         if (needs_temp) {
+            vec4_instruction *mov = MOV(offset(inst->dst, n), src_reg(dst));
+            mov->exec_size = lowered_width;
+            mov->group = channel_offset;
+            mov->regs_written = regs_written;
+            mov->predicate = inst->predicate;
+            inst->insert_before(block, mov);
+         }
+      }
+
+      inst->remove(block);
+      progress = true;
+   }
+
+   if (progress)
+      invalidate_live_intervals();
+
+   return progress;
+}
+
 bool
 vec4_visitor::run()
 {
@@ -2004,9 +2156,12 @@ vec4_visitor::run()
       backend_shader::dump_instructions(filename);
    }
 
-   bool progress;
+   bool progress = false;
    int iteration = 0;
    int pass_num = 0;
+
+   OPT(lower_simd_width);
+
    do {
       progress = false;
       pass_num = 0;
diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h b/src/mesa/drivers/dri/i965/brw_vec4.h
index 6063bee..3f7045e 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4.h
+++ b/src/mesa/drivers/dri/i965/brw_vec4.h
@@ -162,6 +162,8 @@ public:
    void opt_schedule_instructions();
    void convert_to_hw_regs();
 
+   bool lower_simd_width();
+
    vec4_instruction *emit(vec4_instruction *inst);
    vec4_instruction *emit(enum opcode opcode);
-- 
2.7.4

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
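
A side note on the range test in dst_src_regions_overlap(): for non-empty closed ranges, the three-clause condition appears to be equivalent to the usual two-comparison interval intersection test. The standalone sketch below is only an illustration and not a proposed change. The reg_range struct and all function names are hypothetical, and it assumes start <= end for both ranges (that is, regs_written and regs_read(i) are at least 1).

```cpp
#include <cassert>

// Hypothetical closed range of register offsets, [start, end].
struct reg_range {
   unsigned start;
   unsigned end;
};

// The three-clause test in the form used by dst_src_regions_overlap().
bool
overlap_three_clause(reg_range dst, reg_range src)
{
   return (dst.start >= src.start && dst.start <= src.end) ||
          (dst.end >= src.start && dst.end <= src.end) ||
          (dst.start <= src.start && dst.end >= src.end);
}

// The usual closed-interval intersection test.
bool
overlap_two_clause(reg_range dst, reg_range src)
{
   return dst.start <= src.end && src.start <= dst.end;
}

// Exhaustively compare both forms for every pair of non-empty ranges whose
// offsets are below limit. Returns true if they agree on every pair.
bool
forms_agree(unsigned limit)
{
   for (unsigned ds = 0; ds < limit; ds++) {
      for (unsigned de = ds; de < limit; de++) {
         for (unsigned ss = 0; ss < limit; ss++) {
            for (unsigned se = ss; se < limit; se++) {
               const reg_range dst = { ds, de };
               const reg_range src = { ss, se };
               if (overlap_three_clause(dst, src) !=
                   overlap_two_clause(dst, src))
                  return false;
            }
         }
      }
   }
   return true;
}
```

Calling forms_agree() with a small limit is a cheap way to check the equivalence for the register counts that matter here. Whether the shorter form reads better in the pass itself is a matter of taste.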