On Mon, Sep 21, 2009 at 9:41 PM, Hugh Fisher <[email protected]> wrote:
> Timothy Normand Miller wrote:
>>
>> Besides the obvious scalar ALU instructions, there are other
>> instructions that take bandwidth that are also not vector:  flow
>> control, loads, and stores.  There are lots of those.  No?
>
> No. A vertex shader has to multiply the vertex by the current matrix
> which is four 4-way multiplies and four 4-way adds. No branching.

Sure.  Indeed, lots of kernels are a straight shot with no flow
control at all.  You make a case for including vector instructions,
but I still don't see a case for vector ALUs, given all the stalls
we'll hit waiting on texture reads.
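To put rough numbers on the vertex-transform example quoted above, here's a sketch counting issue slots for a mat4 × vec4 multiply, assuming no fused multiply-add on either side.  The counts are my own illustrative arithmetic, not measurements from any particular design:

```python
# Issue-slot count for a mat4 x vec4 vertex transform.
# A scalar core issues one multiply or add per instruction;
# a 4-way vector core issues four lanes per instruction.

MULS = 16   # one multiply per matrix element
ADDS = 12   # three adds per row to reduce each dot product

scalar_instructions = MULS + ADDS   # 28 scalar issue slots

# Per the quoted count: four 4-way multiplies plus four 4-way adds
# (structuring the transform as an accumulation over matrix columns).
vector_instructions = 4 + 4         # 8 vector issue slots

print(scalar_instructions)                        # 28
print(vector_instructions)                        # 8
print(scalar_instructions / vector_instructions)  # 3.5
```

So even on the most vector-friendly kernel, the instruction-count win is about 3.5x, not 4x, before any stall is accounted for.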

You could have really simple flat-shading kernels that basically do
nothing but manipulate the primary color.  But in that case, the
kernels would be so short-running that you'd saturate your _write_
bandwidth to memory, causing stalls on that end.

The channel to graphics memory will be all too easy to saturate,
especially when we have multiple textures to walk at odd angles,
causing DRAM row misses.  We want fast cores for the case when someone
writes a shader kernel so complex that the tasks have long run time.
But that's not a common case.

> Standard OpenGL/Direct3D fixed pipeline lighting has one loop, over
> the available light sources, say two scalar instructions. Each time
> through the loop there's the surface normal multiply by (3x3) matrix
> and normalise, thirteen 3-way multiplies/adds and one scalar division.
> Lighting equation is eight (maybe more, depending on LIT) 4-way
> multiplies/adds and one scalar test & branch.

And like I say, I fear that all the hardware thrown at that will be
wasted the moment we hit a texture read and have to stall for an
extended period.
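Here's a toy model of that waste.  Assume each thread does W cycles of ALU work, then stalls T cycles on a texture read; with N threads interleaved, one thread's stall can be covered by the others.  The parameter values are made up for illustration:

```python
# Toy ALU-utilization model for texture-read stalls.

def utilization(work_cycles, stall_cycles, threads):
    """Fraction of cycles the ALU is busy, with perfect interleaving."""
    busy = threads * work_cycles
    period = work_cycles + stall_cycles   # one thread's full round-trip
    return min(1.0, busy / period)

# A fat vector ALU that finishes its work 4x faster just stalls sooner:
print(utilization(work_cycles=20, stall_cycles=200, threads=1))  # ~0.091
print(utilization(work_cycles=5,  stall_cycles=200, threads=1))  # ~0.024
# Many cheap scalar threads can cover the same latency completely:
print(utilization(work_cycles=20, stall_cycles=200, threads=11)) # 1.0
```

The point being: speeding up the compute phase doesn't help if the ALU then sits idle; having more threads in flight does.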

I have no vested interest in a scalar architecture.  If we design a
prototype scalar engine and then find that the aggregate cannot
saturate the memory bandwidth, then I'll be more than happy to add
vector ALUs.  However, adding vector ALUs will nearly quadruple the
area requirement per core, meaning we can only fit 1/4 as many in the
same area (if we're lucky).  So really, I don't see the advantage.
We'll have no effect on memory read latency, and we won't have any
improvement in instruction file storage.

Let's do some math.  What is the proportion of vector to scalar
instructions in an "average" kernel?  What's the average number of
texture or surface reads?  We'll take a guess on the read latency and
work it out.  Back-of-the-envelope, you'll get less than 4x speedup by
adding the vector ALU, but you'll get 1/4 as many threads.

If memory stalls are zero, and we manage to get enough vector
parallelism that we get a 3x speedup via the vector ALU, then with 1/4
as many cores, we'll get 3/4 the performance of using scalar engines.
Where am I going wrong here?

Even if we could get a 3.9x performance improvement, we'd still get
only 3.9/4 as much aggregate throughput.
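The arithmetic above is just a ratio, but spelling it out keeps us honest.  This assumes the numbers I guessed: zero memory stalls and a 4x area cost per vector core, so 1/4 as many cores in the same silicon:

```python
# Aggregate throughput of a vector design relative to an all-scalar
# design of equal area, assuming zero memory stalls and a 4x area
# cost per core (both assumptions from the discussion above).

def aggregate_throughput(per_core_speedup, area_factor=4):
    """Throughput relative to an all-scalar design of the same area."""
    return per_core_speedup / area_factor

print(aggregate_throughput(3.0))  # 0.75  -- 3/4 of scalar performance
print(aggregate_throughput(3.9))  # 0.975 -- still below break-even
print(aggregate_throughput(4.0))  # 1.0   -- break-even needs a full 4x
```

Under these assumptions the vector design only wins if the per-core speedup exceeds the area cost, which a 4-wide ALU can't do.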

>
> Loads and stores are mostly of matrices (eg skinning), or materials
> and colors which are one or more 3/4-way RGB/RGBA vectors.

Good argument for vector load instructions.  I can totally buy that.

I'm on the fence about vector ALU instructions.  I LIKE the idea of
vector load/store instructions because I think it could improve memory
throughput.  But I'm still against vector ALUs.

> Loads from texture maps are also vector ops, either RGB/RGBA vectors
> or surface normals or other 3/4-way floating point vectors.
>
>> Also, if memory load instruction latency dominates, then none of this
>> matters.  Many shader programs will spend most of their time waiting
>> on memory, making vector optimizations moot.
>
> If memory load is important, isn't SIMD faster than fetching and
> executing four scalar instructions in succession?

Yes.  I think you're right about that.

>> I can see vertex shader programs being DP heavy.  But there will be
>> far fewer vertexes than fragments.  How DP-heavy are fragment shader
>> programs, generally?
>
> Vertex processing is more important for CAD type workloads (lots of
> wireframes). For all types of 3D, as geometry is tessellated into
> smaller polys for more detail, the number of vertices increases relative
> to fragments.
>
> In classic 3D, fragment shaders do texture loads and color multiplies,
> all 3/4-way vector ops. Modern fragment shaders implement full lighting
> calculations (see above), bump or displacement mapping (vector math),
> fogging effects (vector math). Yes they do test and branch as well,
> but like most aspects of 3D they are heavy on the vector/matrix maths.

I'm not as much of a 3D graphics expert as you are.  I know more about
CPU architecture.  But I know enough to know that you're right.
Nevertheless, every scalar instruction and every flow-control
instruction is a missed opportunity to exploit the available
parallelism.


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
