Timothy Normand Miller wrote:
> Your analogy with CPU pipelines isn't quite on point here.
Actually, I didn't say anything about CPU pipelines. I think that we
are equivocating about the meaning of 'pipeline'. The Pixel Pipeline
and a pipelined FPU do not mean the same thing by "pipeline".
> CPUs have dependencies from later stages to earlier ones. A graphics
> pipeline doesn't.
That is the FPU sense of "pipeline", and yes, it appears that a pipelined
shader can stall. If it is a systolic processor pipeline, then it can't
stall. If the stages aren't being run on the same processor, then there
is no stall. However, each stage must still wait for the previous one.
> The main thing to recognize about a long pipeline without
> dependencies is the number of pixels in flight at one time. If there
> are 100 stages, then 100 pixels can be in flight at one time, and
> each one takes effectively one cycle to process (streaming). If
> those same 100 "stages" had to be processed sequentially by a more
> traditional processor, then each pixel would take, effectively, 100
> clock cycles. So to get the same throughput, you need a clock rate
> 100 times faster.
I think that this only applies in some cases. Yes, it would apply to a
shader -- a shader is an accumulator type processor and they don't
pipeline very well, but to avoid it you would need to basically unroll
the loop and that would take a lot of hardware. The other solution is
to write the software so that the shader doesn't have a pipeline stall.
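To put rough numbers on the 100-stage comparison quoted above, here is a
minimal sketch. It assumes every stage takes exactly one clock, the
pipeline never stalls, and the sequential processor also does one stage
per clock. The function names and the 640x480 frame size are only
illustrative, not anything from the design under discussion.

```python
def pipelined_cycles(pixels: int, stages: int) -> int:
    """Clocks for a stall-free pipeline: fill latency, then one pixel per clock."""
    return stages + pixels - 1


def sequential_cycles(pixels: int, stages: int) -> int:
    """Clocks when one processor runs every stage of every pixel in turn."""
    return stages * pixels


if __name__ == "__main__":
    pixels, stages = 640 * 480, 100  # hypothetical frame size, 100 stages
    p = pipelined_cycles(pixels, stages)
    s = sequential_cycles(pixels, stages)
    # The ratio approaches the stage count as the pixel count grows.
    print(f"pipelined: {p}  sequential: {s}  ratio: {s / p:.2f}")
```

The ratio only approaches 100 when the pipeline stays full, which is
exactly the condition that a stall would break.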
> I'm sure you realize that there is some exaggeration there because
> ATI and nVidia chips have some degree of pipelining. But it's not an
> exaggeration if you were to try to do this with a general-purpose
> CPU.
The exaggeration is that you presume that you now have 100 CPUs (which
ATI & nVidia do have). AFAIK, ATI and nVidia chips have an array of
"general purpose" processors that can be configured on the fly into
systolic arrays.
But, I am not talking about using a general-purpose CPU. What is being
discussed is a 4 word SIMD vector processor -- actually more than one of
them.
A lot of this is going to be MAC (or is it?). There is a dependency
there, but it _doesn't_ matter since the processor is optimized for it
and a MAC has to wait on the M anyhow. Since the addition is always
faster than the multiply, matrix multiplication is going to run just as
fast on a long word processor as long as it has as many words as the
larger side of the matrix.
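As a rough illustration of the MAC dependency I mean, here is a sketch of
a 4x4 matrix times a 4-vector done as four 4-lane multiply-accumulates.
The names (LANES, simd_mac, mat_vec) and the column-major layout are my
own assumptions for the example, not a description of any actual
hardware. Each MAC needs the accumulator from the one before it, which is
the dependency in question.

```python
LANES = 4  # assumed vector width: the "4 word" SIMD processor


def simd_mac(acc, a, b):
    """Model one vector MAC: acc[i] + a[i] * b in every lane at once."""
    return [acc[i] + a[i] * b for i in range(LANES)]


def mat_vec(columns, v):
    """Multiply a 4x4 matrix (given as a list of columns) by a 4-vector.

    One MAC per column: each step depends on the previous accumulator.
    """
    acc = [0.0] * LANES
    for col, x in zip(columns, v):
        acc = simd_mac(acc, col, x)
    return acc


if __name__ == "__main__":
    # Hypothetical example: translate the point (5, 6, 7) by (1, 2, 3).
    translate = [
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [1, 2, 3, 1],
    ]
    print(mat_vec(translate, [5, 6, 7, 1]))
```

With the matrix stored by columns, the whole product is four dependent
MACs, and no lane sits idle as long as the vector is as wide as the
matrix.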
>> As far as area is concerned, it appears to me that most of the
>> hardware area is going to be consumed by the hardware multiplier
>> arrays. So, other minor differences would be minor. This also
>> means that you should avoid making fixed use of a multiplier array
>> if that meant that it would stand idle part of the time.
What I meant was that the multipliers take up the space and other minor
differences in architecture are minor as far as space is concerned.
However, I do see the idle multipliers as a major issue.
> Do some reading about data-flow processors.
I understand what a systolic processor is -- one with a processor for
each stage of the computation. Do we have room for that much hardware?
I thought that space requirements might mean that several long word
processors would be better. That is the architecture which most modern
DSPs use.
> A fixed-function GPU is like a data-flow processor, where the flow
> has been fixed in advance. And it's fixed that way because it's a
> good way to handle 3D rendering.
Do you have an algorithm that you intend to implement that doesn't use
shaders -- doesn't multiply matrices? As I asked, where is it?
Everything I have read about 3D is based on matrix multiplication.
--
JRT
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)