This is not how the final patch would look. Rather, we'd flatten the if (post_reg_alloc) block and remove the else clause. This patch just aims to prove that we're choosing instructions in a bad order.

On Sandybridge GLB2.5 C24Z16_DXT1 1600x900 non-composited:

x before
+ after
[ministat ASCII distribution plot: x = before, + = after]
    N           Min           Max        Median           Avg        Stddev
x  23       8025.58        8203.4       8048.86     8105.5061      72.50085
+  23       8156.34       8323.38       8185.55     8236.8326     74.079214
Difference at 95.0% confidence
        131.327 +/- 43.5508
        1.62021% +/- 0.537299%
        (Student's t, pooled s = 73.2943)

The original goal of pre-register allocation scheduling was to reduce live ranges so we'd use fewer registers and hopefully fit into 16-wide. In shader-db, this change causes us to lose 30 16-wide programs, but we gain 29... so it's a toss-up. At least by choosing instructions in a better order, all programs should be slightly faster.

Consider the trivial case of

   uniform float a, b;
   void main() { gl_FragColor = vec4(cross(a, b)); }

Before the patch we compile this to

   mov.sat(8) m4<1>F 0F
   mul(8)     g3<1>F g2.4<0,1,0>F g2<0,1,0>F
   mad.sat(8) m3<1>F -g3<4,1,1>F g2.3<4,1,1>F.x g2.1<4,1,1>F.x
   mul(8)     g3<1>F g2.3<0,1,0>F g2.2<0,1,0>F
   mad.sat(8) m2<1>F -g3<4,1,1>F g2.5<4,1,1>F.x g2<4,1,1>F.x
   mul(8)     g3<1>F g2.5<0,1,0>F g2.1<0,1,0>F
   mad.sat(8) m1<1>F -g3<4,1,1>F g2.4<4,1,1>F.x g2.2<4,1,1>F.x
   sendc(8)   null m1<8,8,1>F

where we stall on each mad.sat waiting for the mul to finish. The sendc is issued at cycle 66.

After the patch it compiles to

   mul(8)     g3<1>F g2.5<0,1,0>F g2.1<0,1,0>F
   mul(8)     g4<1>F g2.3<0,1,0>F g2.2<0,1,0>F
   mul(8)     g5<1>F g2.4<0,1,0>F g2<0,1,0>F
   mov.sat(8) m4<1>F 0F
   mad.sat(8) m1<1>F -g3<4,1,1>F g2.4<4,1,1>F.x g2.2<4,1,1>F.x
   mad.sat(8) m2<1>F -g4<4,1,1>F g2.5<4,1,1>F.x g2<4,1,1>F.x
   mad.sat(8) m3<1>F -g5<4,1,1>F g2.3<4,1,1>F.x g2.1<4,1,1>F.x
   sendc(8)   null m1<8,8,1>F

By hiding much of the latency, the sendc instruction is issued by cycle 32.
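
To illustrate why the second ordering stalls less, here is a minimal sketch of a toy in-order issue model. It is not the scheduler's real latency model: the single 10-cycle latency, the one-instruction-per-cycle issue rate, the register numbering and all function names are assumptions made up for this example, so it does not reproduce the 66 and 32 cycle figures above. It only shows the same effect, namely that issuing the independent muls back to back lets their latencies overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy in-order issue model. The values below are illustrative
// assumptions, not numbers taken from the i965 scheduler or hardware.
constexpr int kLatency = 10;   // hypothetical: cycles from issue to result
constexpr int kNumRegs = 16;   // hypothetical register count for the toy

struct Inst {
    int dst;                 // register written, or -1 for none
    std::vector<int> srcs;   // registers whose results may still be in flight
};

// Returns the cycle at which the last instruction of `prog` issues.
// At most one instruction issues per cycle, and an instruction stalls
// until every one of its sources is ready.
int last_issue_cycle(const std::vector<Inst> &prog) {
    std::vector<int> ready(kNumRegs, 0);  // cycle at which each register is readable
    int next_free = 0;                    // earliest cycle the next instruction may issue
    int issued = 0;
    for (const Inst &in : prog) {
        int start = next_free;
        for (int r : in.srcs) {
            assert(r >= 0 && r < kNumRegs);
            start = std::max(start, ready[r]);
        }
        if (in.dst >= 0) {
            assert(in.dst < kNumRegs);
            ready[in.dst] = start + kLatency;
        }
        issued = start;
        next_free = start + 1;
    }
    return issued;
}

// Toy register numbering: g3..g5 -> 3..5, m1..m4 -> 11..14.
std::vector<Inst> before_patch_order() {
    return {
        {14, {}},                 // mov.sat m4
        {3, {}}, {13, {3}},       // mul g3; mad.sat m3 (reads g3)
        {3, {}}, {12, {3}},       // mul g3; mad.sat m2 (reads g3)
        {3, {}}, {11, {3}},       // mul g3; mad.sat m1 (reads g3)
        {-1, {11, 12, 13, 14}},   // sendc reads m1..m4
    };
}

std::vector<Inst> after_patch_order() {
    return {
        {3, {}}, {4, {}}, {5, {}},       // three independent muls
        {14, {}},                        // mov.sat m4
        {11, {3}}, {12, {4}}, {13, {5}}, // mad.sat m1, m2, m3
        {-1, {11, 12, 13, 14}},          // sendc reads m1..m4
    };
}
```

Under these assumed numbers, last_issue_cycle(before_patch_order()) is 43 and last_issue_cycle(after_patch_order()) is 22: in the first ordering each mad.sat waits out a full mul latency, while in the second only the first mad.sat stalls.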
---
 .../dri/i965/brw_fs_schedule_instructions.cpp      |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_fs_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_fs_schedule_instructions.cpp
index 90f1a16..4d2dbe8 100644
--- a/src/mesa/drivers/dri/i965/brw_fs_schedule_instructions.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs_schedule_instructions.cpp
@@ -753,9 +753,9 @@ instruction_scheduler::schedule_instructions(fs_inst *next_block_header)
           * but also the MRF setup for the next sampler message, which in turn
           * unblocks the next sampler message).
           */
-         for (schedule_node *node = (schedule_node *)instructions.get_tail();
-              node != instructions.get_head()->prev;
-              node = (schedule_node *)node->prev) {
+         for (schedule_node *node = (schedule_node *)instructions.get_head();
+              node != instructions.get_tail()->next;
+              node = (schedule_node *)node->next) {
             schedule_node *n = (schedule_node *)node;
 
             chosen = n;
-- 
1.7.8.6