Kenneth Graunke <kenn...@whitecape.org> writes:

> On 12/07/2012 02:58 PM, Eric Anholt wrote:
>> +   case SHADER_OPCODE_TEX:
>> +   case SHADER_OPCODE_TXD:
>> +   case SHADER_OPCODE_TXF:
>> +   case SHADER_OPCODE_TXL:
>> +   case SHADER_OPCODE_TXS:
>> +      /* 18 cycles:
>> +       * mov(8)  g115<1>F   0F                      { align1 WE_normal 1Q };
>> +       * mov(8)  g114<1>F   0F                      { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4      { align1 WE_normal 1Q };
>> +       *
>> +       * 697 +/-49 cycles (min 610, n=26):
>> +       * mov(8)  g115<1>F   0F                      { align1 WE_normal 1Q };
>> +       * mov(8)  g114<1>F   0F                      { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4      { align1 WE_normal 1Q };
>> +       * mov(8)  null       g4<8,8,1>F              { align1 WE_normal 1Q };
>> +       *
>> +       * So the latency on our first texture load of the batchbuffer takes
>> +       * ~700 cycles, since the caches are cold at that point.
>> +       *
>> +       * 840 +/- 92 cycles (min 720, n=25):
>> +       * mov(8)  g115<1>F   0F                      { align1 WE_normal 1Q };
>> +       * mov(8)  g114<1>F   0F                      { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4      { align1 WE_normal 1Q };
>> +       * mov(8)  null       g4<8,8,1>F              { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4      { align1 WE_normal 1Q };
>> +       * mov(8)  null       g4<8,8,1>F              { align1 WE_normal 1Q };
>> +       *
>> +       * On the second load, it takes just an extra ~140 cycles, and after
>> +       * accounting for the 14 cycles of the MOV's latency, that makes ~130.
>> +       *
>> +       * 683 +/- 49 cycles (min = 602, n=47):
>> +       * mov(8)  g115<1>F   0F                      { align1 WE_normal 1Q };
>> +       * mov(8)  g114<1>F   0F                      { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4      { align1 WE_normal 1Q };
>> +       * send(8) g50<1>UW   g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4      { align1 WE_normal 1Q };
>> +       * mov(8)  null       g4<8,8,1>F              { align1 WE_normal 1Q };
>> +       *
>> +       * The unit appears to be pipelined, since this matches up with the
>> +       * cache-cold case, despite there being two loads here.  If you
>> +       * replace the g4 in the MOV to null with g50, it's still 693 +/- 52
>> +       * (n=39).
>> +       *
>> +       * So, take some number between the cache-hot 140 cycles and the
>> +       * cache-cold 700 cycles.  No particular tuning was done on this.
>> +       *
>> +       * I haven't done significant testing of the non-TEX opcodes.  TXL at
>> +       * least looked about the same as TEX.
>> +       */
>> +      latency = 200;
>> +      break;
>> +
>> +   case FS_OPCODE_VARYING_PULL_CONSTANT_LOAD:
>> +   case FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD:
>> +      /* testing using varying-index pull constants:
>> +       *
>> +       * 16 cycles:
>> +       * mov(8)  g4<1>D  g2.1<0,1,0>F               { align1 WE_normal 1Q };
>> +       * send(8) g4<1>F  g4<8,8,1>D
>> +       *   data (9, 2, 3) mlen 1 rlen 1             { align1 WE_normal 1Q };
>> +       *
>> +       * ~480 cycles:
>> +       * mov(8)  g4<1>D  g2.1<0,1,0>F               { align1 WE_normal 1Q };
>> +       * send(8) g4<1>F  g4<8,8,1>D
>> +       *   data (9, 2, 3) mlen 1 rlen 1             { align1 WE_normal 1Q };
>> +       * mov(8)  null    g4<8,8,1>F                 { align1 WE_normal 1Q };
>> +       *
>> +       * ~620 cycles:
>> +       * mov(8)  g4<1>D  g2.1<0,1,0>F               { align1 WE_normal 1Q };
>> +       * send(8) g4<1>F  g4<8,8,1>D
>> +       *   data (9, 2, 3) mlen 1 rlen 1             { align1 WE_normal 1Q };
>> +       * mov(8)  null    g4<8,8,1>F                 { align1 WE_normal 1Q };
>> +       * send(8) g4<1>F  g4<8,8,1>D
>> +       *   data (9, 2, 3) mlen 1 rlen 1             { align1 WE_normal 1Q };
>> +       * mov(8)  null    g4<8,8,1>F                 { align1 WE_normal 1Q };
>> +       *
>> +       * So, if it's cache-hot, it's about 140.  If it's cache cold, it's
>> +       * about 460.  We expect to mostly be cache hot, so pick something
>> +       * more in that direction.
>> +       */
>> +      latency = 200;
>> +      break;

> Painful.  Your "we expect to mostly be cache hot" comment makes sense,
> except that Ivybridge's caches are awful when the same cacheline is
> accessed within 16 cycles or so.
>
> I'd really love to see some timing data on using LD messages (to get
> the L1 and L2 caches).  See my old patch that we couldn't justify:

I think we'll probably only justify this one through whole app testing.
The uniform load we're using now *is* faster (480/620 for 1 or 2 loads
vs 697/840 using texturing), as long as you don't hit the bug
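[Editor's note: a back-of-the-envelope reading of the pull-constant
estimate above, which is purely an illustration and not anything from the
patch.  If the 200-cycle figure is treated as a weighted blend of the
measured ~140-cycle cache-hot and ~460-cycle cache-cold costs, you can
solve for the cache-hit fraction it implicitly assumes:

  #include <cstdio>

  int main()
  {
     /* Measured pull-constant latencies quoted above; 200 is the
      * estimate the patch settles on.
      */
     const double hot = 140.0, cold = 460.0, estimate = 200.0;

     /* estimate = p * hot + (1 - p) * cold  =>  solve for hit rate p */
     const double p = (cold - estimate) / (cold - hot);
     printf("implied cache-hit fraction: %.2f\n", p);  /* prints 0.81 */
     return 0;
  }

So the chosen estimate corresponds to roughly four out of five pull
constant loads hitting in cache, consistent with "pick something more in
that direction".]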
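[Editor's note: for context on how a per-opcode estimate like
"latency = 200" gets consumed.  List schedulers of this kind typically
compute a critical-path "delay" for each instruction, the longest latency
chain from it to the end of the block, and issue the ready instruction
with the largest delay first so that long-latency sends start as early as
possible.  A minimal C++ sketch of that idea follows; the names
(sched_node, compute_delay) and structure are hypothetical and
simplified, not mesa's actual scheduler API:

  #include <algorithm>
  #include <vector>

  struct sched_node {
     int latency;                        /* e.g. 200 for a TEX, as above */
     std::vector<sched_node *> children; /* consumers of our result */
     int delay = -1;                     /* memoized critical-path length */
  };

  static int
  compute_delay(sched_node *n)
  {
     if (n->delay >= 0)
        return n->delay;

     int longest = 0;
     for (sched_node *child : n->children)
        longest = std::max(longest, compute_delay(child));

     /* An instruction's delay is its own latency plus the longest
      * chain hanging off any consumer of its result.
      */
     n->delay = n->latency + longest;
     return n->delay;
  }

With estimates this coarse (200 versus a measured 140..700 range), the
point is only to rank instructions sensibly, not to predict cycle counts.]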