On Tue, Mar 19, 2013 at 5:34 AM,  <[email protected]> wrote:
> On 2013-03-18 15:38, Nicolas Boulay wrote:
>> 2013/3/18 Timothy Normand Miller <[email protected]>
>>> For CPUs, 32 was found to be optimal by some paper published back in
>>> the early 90s, I think.  16 was second best, while 64 had
>>> diminishing returns.  I'm not sure how this applies to GPUs,
>>> however.  One problem with doubling the RF size is that you slow it
>>> down.
>>
>> That number was arrived at without superpipelining or superscalar
>> execution in mind.
>
> not to mention renamed, out-of-order architectures...
>
>> Unrolling loops is a good way to avoid control-flow instructions and
>> to remove dependencies between instructions, but it needs at least
>> twice the number of registers.
>
> From memory of the F-CPU design, if we consider a constant stream
> of computation instructions with 2 reads and 1 write that cannot overlap
> (the dependencies are loose and the code is unrolled to fit the pipeline),
>
> 32 registers => up to 11 instructions "in flight" in the pipeline at a time
> without dependencies. That means a 5-deep, 2-wide superpipeline.
>
> 64 was chosen for F-CPU because the pipeline could be pushed to 3-wide by
> 7-deep, or 5-deep by 4-wide. That was the end of the '90s :-)
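Sanity-checking that arithmetic with a minimal C sketch (my own, not
F-CPU code): if every instruction needs 2 source registers and 1
destination, all distinct while it is in flight, then the register
file caps the number of independent in-flight instructions at roughly
registers / 3:

    #include <stdio.h>

    int main(void)
    {
        /* 2 reads + 1 write, all distinct while the instruction is
         * in flight */
        const int regs_per_insn = 3;
        const int sizes[] = { 16, 32, 64 };

        for (int i = 0; i < 3; i++)
            printf("%2d registers -> %2d independent instructions in flight\n",
                   sizes[i], sizes[i] / regs_per_insn);
        return 0;
    }

That prints 32 -> 10 (close to the 11 quoted above, i.e. a 5-deep *
2-wide pipeline) and 64 -> 21 (7-deep * 3-wide, or 5-deep * 4-wide =
20). It also shows why unrolling, which wants those instructions
independent, roughly doubles the register demand.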

This thread reminds me of that too! Just a thought: memory bandwidth
and latency are clearly critical today. These days I am working on the
Enlightenment Foundation Libraries, mostly a 2D graphics library with
some pseudo-3D effects. We have a software backend and an OpenGL
backend. The software backend is faster in a lot of workloads for two
reasons: very few OpenGL drivers implement partial updates, and modern
CPUs are able to saturate memory bandwidth with just one core for
almost all operations. We are at the point of considering doing light
compression in software to use less memory bandwidth (that, and a
proper walking pattern within a rendered tile), as those are the only
things we can think of to improve performance.
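As a rough illustration of the kind of light compression meant here (a
hypothetical sketch, not EFL code): even plain run-length encoding of
32-bit pixels does well on computer-generated content, where tiles are
full of flat runs:

    #include <stddef.h>
    #include <stdint.h>

    /* Encode src[0..n) as (count, pixel) pairs; returns the number of
     * pairs written.  counts/pixels must each hold n entries (worst
     * case: no runs at all, where this "compression" expands). */
    static size_t rle_encode(const uint32_t *src, size_t n,
                             uint32_t *counts, uint32_t *pixels)
    {
        size_t out = 0, i = 0;

        while (i < n) {
            size_t run = 1;
            while (i + run < n && src[i + run] == src[i])
                run++;
            counts[out] = (uint32_t)run;
            pixels[out] = src[i];
            out++;
            i += run;
        }
        return out;
    }

A flat 64-pixel run then moves across the bus as 8 bytes instead of
256, which is where the bandwidth win would come from.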
   I wish we had on all hardware (CPU and GPU) an L1 "memory" that was
directly accessible (only by shaders in the case of a GPU) so we could
use it as a scratchpad. It should really behave like normal memory
and be completely driven by software, not like a cache (a cache should
still be there, but clearly separated). That would make it practical
to do light compression/decompression on each tile (source and
destination) and artificially improve effective memory bandwidth.
This should work well for all computer-generated content (the common
case for compositing and UI in general).
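To make that concrete, here is roughly the per-tile pattern such a
scratchpad would enable (purely hypothetical: the helpers are made-up
names, and the stack buffers stand in for the software-managed L1):

    #include <stdint.h>
    #include <string.h>

    #define TILE_W 32
    #define TILE_H 32
    #define TILE_PIXELS (TILE_W * TILE_H)

    /* Made-up container for a compressed tile; a real one would use
     * the codec's own layout (e.g. the RLE pairs above). */
    typedef struct {
        uint32_t data[TILE_PIXELS];  /* worst case: incompressible */
        size_t   len;
    } compressed_tile;

    /* Stub codec so the sketch stands alone; imagine real light
     * (de)compression here. */
    static void tile_decompress(const compressed_tile *c, uint32_t *px)
    {
        memcpy(px, c->data, c->len * sizeof *px);
    }

    static void tile_compress(const uint32_t *px, compressed_tile *c)
    {
        memcpy(c->data, px, TILE_PIXELS * sizeof *px);
        c->len = TILE_PIXELS;
    }

    /* Decompress source and destination into the scratchpad, blend
     * there, recompress on the way out: main memory only ever sees
     * compressed tiles, so effective bandwidth goes up. */
    static void render_tile(const compressed_tile *src, compressed_tile *dst)
    {
        /* These would live in the software-managed L1, not the stack. */
        uint32_t s[TILE_PIXELS], d[TILE_PIXELS];
        size_t i;

        tile_decompress(src, s);
        tile_decompress(dst, d);
        for (i = 0; i < TILE_PIXELS; i++)  /* placeholder blend op */
            d[i] = s[i] ? s[i] : d[i];
        tile_compress(d, dst);
    }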
-- 
Cedric BAIL