On Monday 24 April 2006 02:01, Hugh Fisher wrote:
> The argument for vectors is that, with todays GPUs, the
> majority of the instructions are four-way vector ops to
> begin with.
Here are extracts of the code compiled by Carl Witty. The number after each
instruction is its estimated cost in scalar operations.
1st shader:
MOV result.color, c[8].x; 1
MOV result.texcoord[0].xy, vertex.texcoord[0]; 2
MUL result.texcoord[1].xy, vertex.texcoord[0], c[9].x; 2
MOV result.texcoord[4].w, c[10].x; 1
DP3 result.texcoord[4].z, vertex.normal, c[4]; 3
DP3 result.texcoord[4].y, vertex.attrib[15], c[4]; 3
DP3 result.texcoord[4].x, vertex.attrib[14], c[4]; 3
MOV result.texcoord[5].w, c[11].x; 1
DP3 result.texcoord[5].z, vertex.normal, c[5]; 3
DP3 result.texcoord[5].y, vertex.attrib[15], c[5]; 3
DP3 result.texcoord[5].x, vertex.attrib[14], c[5]; 3
DP3 result.texcoord[6].z, vertex.normal, c[6]; 3
DP3 result.texcoord[6].y, vertex.attrib[15], c[6]; 3
DP3 result.texcoord[6].x, vertex.attrib[14], c[6]; 3
MUL R1, vertex.position.y, c[5]; 1
MUL R0, vertex.position.y, c[1]; 1
MAD R0, vertex.position.x, c[0], R0; 1
MAD R1, vertex.position.x, c[4], R1; 1
MAD R1, vertex.position.z, c[6], R1; 1
MAD R0, vertex.position.z, c[2], R0; 1
MAD result.position, vertex.position.w, c[3], R0; 1
MAD result.texcoord[7], vertex.position.w, c[7], R1; 1
22 instructions:
- no vect4 instructions, so 1 FPU is always idle
- 9 instructions taking vect3 inputs to a scalar result
- 2 vect2 instructions
- 11 scalar instructions
- estimated cycles for a vector shader: 22
- estimated cycles for a LIW scalar shader: 42
- speedup, 4 scalar cores vs. 1 vect4 core: x2.1
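The estimates above appear to follow a simple cost model, sketched here in Python as an assumption rather than a description of any real hardware: the vect4 unit issues one instruction per cycle whatever its width, while four scalar cores each retire one scalar operation per cycle with ideal packing. The `estimate` function and the `WIDTHS_SHADER_1` list are made up for illustration; the values in the list are the per-instruction counts from the listing.

```python
# Per-instruction scalar-operation counts for the 1st shader, read off the listing.
WIDTHS_SHADER_1 = [1, 2, 2, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3] + [1] * 8

def estimate(widths, scalar_cores=4):
    """Return (vector_cycles, scalar_ops, speedup) under the assumed model."""
    vector_cycles = len(widths)                 # one instruction per cycle
    scalar_ops = sum(widths)                    # total scalar operations
    scalar_cycles = scalar_ops / scalar_cores   # assumes ideal packing, no stalls
    return vector_cycles, scalar_ops, vector_cycles / scalar_cycles

vec, ops, speedup = estimate(WIDTHS_SHADER_1)
print(vec, ops, round(speedup, 2))  # 22 42 2.1
```

The model ignores dependencies between instructions and texture latency, so the speedup is an upper bound for this counting scheme, not a measurement.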
2nd shader:
ADD R1.xyz, -fragment.texcoord[7], c[1]; 3
DP3 R0.y, R1, R1; 3
RSQ R0.y, R0.y; 1
MUL R2.xyz, R0.y, R1; 3
DP3 R0.x, -fragment.texcoord[7], -fragment.texcoord[7]; 3
RSQ R0.x, R0.x; 1
MUL R0.xyz, R0.x, -fragment.texcoord[7]; 3
ADD R3.xyz, R0, R2; 3
TEX R0.xyz, fragment.texcoord[0], texture[0], 2D; 3
DP3 R0.w, R3, R3; 3
MOV R1.xy, fragment.texcoord[4].w; 2
MOV R1.z, c[0].x; 1
MUL R1.xyz, R1, R0; 3
DP3 R0.z, fragment.texcoord[6], R1; 3
DP3 R0.x, fragment.texcoord[4], R1; 3
DP3 R0.y, fragment.texcoord[5], R1; 3
DP3 R1.x, R0, R0; 3
RSQ R1.y, R0.w; 1
RSQ R0.w, R1.x; 1
MUL R1.xyz, R1.y, R3; 3
MUL R0.xyz, R0.w, R0; 3
DP3 R1.y, R0, R1; 3
DP3 R1.x, R0, R2; 3
MOV R1.z, c[0].y; 1
LIT R2.yz, R1.xyzz; 4
TEX R0, fragment.texcoord[0], texture[2], 2D; 2
MUL R1, R2.z, R0; 1
TEX R0, fragment.texcoord[0], texture[1], 2D; 2
MAD result.color, R2.y, R0, R1; 1
29 instructions:
- vect4 instructions: 1
- vect3 instructions: 17
- vect2 instructions: 3
- scalar instructions: 8
- estimated cycles for a vector shader: 29
- estimated cycles for a LIW scalar shader: 69
- speedup, 4 scalar cores vs. 1 vect4 core: x1.68
3rd shader:
DP3 R0.w, -fragment.texcoord[7], -fragment.texcoord[7]; 3
TEX R0.xyz, fragment.texcoord[1], texture[3], 2D; 2
MOV R2.xy, fragment.texcoord[4].w; 2
MOV R2.z, c[0].x; 1
TEX R1.xyz, fragment.texcoord[0], texture[0], 2D; 2
MUL R1.xyz, R2, R1; 3
MAD R2.xyz, R0, c[0].y, -c[0].x; 3
MOV R0.xy, fragment.texcoord[5].w; 2
MOV R0.z, c[0].x; 1
MAD R2.xyz, R0, R2, R1; 3
ADD R1.xyz, -fragment.texcoord[7], c[1]; 3
DP3 R1.w, R1, R1; 3
RSQ R1.w, R1.w; 1
MUL R1.xyz, R1.w, R1; 3
DP3 R0.z, fragment.texcoord[6], R2; 3
DP3 R0.y, fragment.texcoord[5], R2; 3
DP3 R0.x, fragment.texcoord[4], R2; 3
RSQ R0.w, R0.w; 1
MUL R2.xyz, R0.w, -fragment.texcoord[7]; 3
ADD R2.xyz, R2, R1; 3
DP3 R1.w, R0, R0; 3
DP3 R0.w, R2, R2; 3
RSQ R1.w, R1.w; 1
MUL R0.xyz, R1.w, R0; 3
DP3 R1.x, R0, R1; 3
RSQ R0.w, R0.w; 1
MUL R2.xyz, R0.w, R2; 3
DP3 R1.y, R0, R2; 3
MOV R1.z, c[0]; 1
LIT R2.yz, R1.xyzz; 4
TEX R0, fragment.texcoord[0], texture[2], 2D; 2
MUL R1, R2.z, R0; 4
TEX R0, fragment.texcoord[0], texture[1], 2D; 2
MAD result.color, R2.y, R0, R1; 1
34 instructions:
- vect4 instructions: 2
- vect3 instructions: 18
- vect2 instructions: 6
- scalar instructions: 8
- estimated cycles for a vector shader: 34
- estimated cycles for a LIW scalar shader: 82
- speedup, 4 scalar cores vs. 1 vect4 core: x1.66
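One way to read the three summaries together is as lane utilisation of the vect4 unit: the scalar operations actually performed, divided by the four slots available on every instruction. The Python sketch below uses only the totals quoted in the summaries; the `lane_utilisation` helper is hypothetical, and it assumes one instruction per cycle with no stalls.

```python
# (instruction count, estimated scalar operations) per shader, from the summaries.
SHADERS = {"1st": (22, 42), "2nd": (29, 69), "3rd": (34, 82)}

def lane_utilisation(instructions, scalar_ops, lanes=4):
    """Fraction of vect4 lanes doing useful work, assuming 1 instruction/cycle."""
    return scalar_ops / (instructions * lanes)

for name, (count, ops) in SHADERS.items():
    print(f"{name} shader: {lane_utilisation(count, ops):.0%} of lanes busy")
```

Under these assumptions the utilisation is simply the inverse of the speedup figure, so a shader with more vect4 instructions would narrow the gap.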
I think the results will be much the same for the other compiled shaders. There
are very few vect4 instructions here, although that could be specific to this
kind of shader. What should be optimised is the number of instructions per
second per mm² of silicon die.
On the same silicon technology, I am fairly sure that a vector shader will run
at a lower clock speed than a LIW scalar shader, because of the switch needed
to access each vector member quickly.
Nicolas Boulay
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)