Timothy Miller wrote:
On 8/20/06, Jon Smirl <[EMAIL PROTECTED]> wrote:
But I don't know if you read my earlier post where I described
the theoretical 1000x throughput difference between
fixed-function and programmable designs.
Something isn't matching here. I know NV/ATI are removing the fixed
function pipeline from future designs. Maybe what is missing is
that these GPUs are SIMD and can also dispatch multiple instructions
in parallel for each data stream. I have never seen the actual
designs but I suspect they look like VLIW. The GPGPU people should
know.
I believe the NV7800 can process 24 data streams in parallel while
executing 2 simultaneous instructions on each stream. For example
each stream can do two FP mul/add instructions in a single clock.
They make everything SIMD compatible (adjusting for loops and
branches) in the compile phase. AFAIK the shaders are running at
500-600 MHz clocks.
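
As a rough sanity check, the peak issue rate implied by those figures can be worked out directly. This is a back-of-the-envelope sketch in Python, not a model of the real chip: the 24 streams, 2 instructions per stream per clock, and 500-600 MHz are simply the numbers quoted above (which were themselves hedged with "I believe" and "AFAIK"), and the function name is made up for illustration.

```python
def peak_madds_per_second(streams, instrs_per_stream_per_clock, clock_hz):
    """Peak FP mul/add instructions issued per second.

    Idealized assumption: every stream issues its full quota of
    instructions on every clock, with no stalls.
    """
    return streams * instrs_per_stream_per_clock * clock_hz


# Figures quoted in the post: 24 streams, 2 instructions each per clock.
for mhz in (500, 600):
    rate = peak_madds_per_second(24, 2, mhz * 1_000_000)
    print(f"{mhz} MHz: {rate / 1e9:.1f} G mul/add per second")
```

Under those assumptions the peak works out to 24.0 to 28.8 billion mul/add instructions per second.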
I didn't really do a good job with my reasoning before, so here's a
re-explanation:
Fixed-function design, one pixel wide: All operations are separate
macro pipeline stages, which are further subdivided. With 100
pipeline stages, 100 fragments are in flight in parallel. Due to
pipelining, we could run at, say, 100MHz in an FPGA.
I think that you are missing something here. With your example, you are
throwing more hardware at the problem, therefore it will run faster.
Now it is true that with fixed function hardware that you don't have to
throw as much hardware at the problem as you would with fully
programmable hardware, but you could achieve the same results with fully
programmable hardware -- it would just require more hardware.
Programmable shader design, one pixel wide: All operations are
shader instructions, executed sequentially. For any instruction,
only a small portion of the hardware is utilized.
Well, not exactly true. If the hardware is SIMD over four 32-bit words, you
are going to use all of it only when one of the operands in the operation
is a four-word vector.
Would it be possible to dynamically allocate hardware resources
depending on the op and the data?
If the average number of instructions to compute a fragment is, say,
10, and they can be arranged to take only 10 cycles, then the GPU can
only push out one pixel every 10 cycles.
You can pipeline fully programmable shaders too. But, it would be
better to use them in parallel to increase throughput. If you had 10 in
parallel, you could start a fragment every clock cycle and output a
pixel every clock.
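
The arithmetic behind that claim can be sketched as a tiny cycle-by-cycle simulation. This is a toy model in Python under stated assumptions: every fragment takes exactly 10 cycles on one shader unit, units never stall, and at most one fragment is issued per clock. The function and parameter names are invented for illustration.

```python
def simulate(num_units, cycles_per_fragment, total_cycles):
    """Count fragments completed by `num_units` identical shader units.

    Toy model: at most one fragment is issued per clock, to any idle
    unit; a unit is busy for `cycles_per_fragment` clocks per fragment.
    """
    busy_until = [0] * num_units   # clock at which each unit becomes free
    finish_times = []
    for clock in range(total_cycles):
        for u in range(num_units):
            if busy_until[u] <= clock:          # unit u is idle
                busy_until[u] = clock + cycles_per_fragment
                finish_times.append(busy_until[u])
                break                           # only one issue per clock
    # Only fragments that finish within the simulated window count.
    return sum(1 for t in finish_times if t <= total_cycles)


print(simulate(1, 10, 1000))    # a single shader unit
print(simulate(10, 10, 1000))   # ten shader units in parallel
```

In this model a single unit completes 100 fragments in 1000 clocks (one every 10 clocks), while ten units complete 991, which is one per clock after the initial 10-clock latency.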
Due to more feedback in the architecture of the shader, it runs at
only 50MHz in the FPGA.
This is for relative comparison, so don't worry about the fact that
they'd all run much faster in an ASIC.
Anyhow, given this hypothetical example, the fixed-function design
can theoretically process pixels 20 times as fast as the shader in a
given architecture.
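
The factor of 20 follows from the figures in this hypothetical example. A minimal Python sketch of the arithmetic, assuming the fixed-function pipeline emits one pixel per clock once it is full and the shader needs 10 cycles per pixel (the helper name is made up for illustration):

```python
def pixels_per_second(clock_hz, cycles_per_pixel):
    """Steady-state pixel rate for a one-pixel-wide design."""
    return clock_hz / cycles_per_pixel


# Fixed-function: 100 MHz, fully pipelined, so one pixel per clock.
fixed = pixels_per_second(100_000_000, 1)
# Programmable shader: 50 MHz, about 10 instructions (cycles) per fragment.
shader = pixels_per_second(50_000_000, 10)

print(fixed / shader)  # ratio of fixed-function to shader throughput
```

This gives 100 Mpixels/s against 5 Mpixels/s, a ratio of 20. It compares throughput only and says nothing about how much hardware each design uses.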
But how much more hardware does it have?
This is why the shaders have to run at 500 MHz to fully utilize their
memory bandwidth, while the fixed-function design can run at a much
lower clock rate to get the same performance. And get much better
power utilization.
And remember that for 99% of what desktop users need, even OGA is
overkill. A programmable shader is out of the ballpark.
Except that it is what the future needs. That is, only with
programmable hardware can we be sure of having a viable product for the
future.
OTOH, this is not an either/or question. If there are very common
fragment shader operations, I see nothing wrong with having a hardware
configured shader to do them. The issue is whether they will be
utilized enough so that there will be a net hardware savings vs. fully
programmable.
--
JRT