On Tue, Mar 19, 2013 at 5:34 AM, <[email protected]> wrote:
> On 2013-03-18 15:38, Nicolas Boulay wrote:
>> 2013/3/18 Timothy Normand Miller <[email protected]>
>>> For CPUs, 32 was found to be optimal by some paper published back in
>>> the early 90s, I think. 16 was a second best, while 64 had
>>> diminishing returns. I'm not sure how this applies to GPUs, however.
>>> One problem with doubling the RF size is that you slow it down.
>>
>> That number was arrived at without superpipelining or superscalar
>> execution in mind.
>
> Not to mention renamed, out-of-order architectures...
>
>> Unrolling loops is a good way to avoid instructions for control flow
>> and to remove dependencies between instructions, but it needs at
>> least twice the number of registers.
>
> From memory of the F-CPU design: if we consider a constant stream of
> computation instructions with 2 reads and 1 write whose operands
> cannot overlap (the dependencies are loose and the code is unrolled
> to fit the pipeline), then:
>
> 32 registers => up to 11 instructions "in flight" in the pipeline at
> a time without dependencies. That allows a 5-deep, 2-wide
> superpipeline.
>
> 64 was chosen for F-CPU because the pipeline could be pushed to
> 3-wide by 7-deep, or 5-deep by 4-wide. That was the end of the
> '90s :-)
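A quick back-of-the-envelope check of those figures, as a sketch of my
own (not from the F-CPU docs), assuming each independent 2-read/1-write
instruction pins 3 registers for its whole trip through the pipeline:

    #include <stdio.h>

    int main(void)
    {
        /* Each dependency-free 2-read/1-write instruction ties up
         * 3 registers while in flight, so the register file size
         * caps how many such instructions the pipeline can hold. */
        const int regs_per_insn = 3;       /* 2 sources + 1 destination */
        const int rf_sizes[] = { 16, 32, 64 };

        for (int i = 0; i < 3; i++) {
            int in_flight = rf_sizes[i] / regs_per_insn;
            printf("%2d registers -> ~%2d independent instructions in flight\n",
                   rf_sizes[i], in_flight);
        }
        return 0;
    }

That yields ~10 in flight for 32 registers (the "up to 11" above
presumably counts the cycle where the oldest destination is being
written back) and ~21 for 64, which matches the 3-wide by 7-deep
figure.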
This thread reminds me of that too! Just a thought: memory bandwidth
and latency are clearly critical today.

These days I am working on the Enlightenment Foundation Libraries,
mostly a 2D graphics library with some pseudo-3D effects. We have a
software backend and an OpenGL backend. The software backend is faster
on a lot of workloads for two reasons: very few OpenGL drivers
implement partial updates, and modern CPUs can saturate memory
bandwidth with just one core for almost all operations. We are at the
point of considering light compression in software to use less memory
bandwidth (that, and a proper walking pattern within a rendered tile),
as those are the only improvements we can think of.

I wish all hardware (CPU and GPU) had an L1 "memory" that was directly
accessible (only by shaders in the case of a GPU) so we could use it
as a scratchpad. It should behave like normal memory and be completely
driven by software, not like a cache (which should also exist, but
clearly separated). That way it would be feasible to do light
compression/decompression on each tile (source and destination) and
artificially improve memory bandwidth. This should work well for all
computer-generated content (the common case for compositing and UI in
general).

--
Cedric BAIL
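For what it's worth, a minimal sketch (plain C, not actual EFL code;
the tile format and function name are made up) of the kind of light
per-tile compression described above: a run-length encoder over 32-bit
pixels, which pays off on the flat regions typical of computer-generated
UI content and falls back to the raw tile when it would not save
bandwidth:

    #include <stdint.h>
    #include <stddef.h>

    /* Run-length-encode one tile of 32-bit pixels into (count, pixel)
     * pairs so it crosses the memory bus compressed. Returns the number
     * of 32-bit words written to out, or 0 when the encoded form would
     * not be smaller than the raw tile (the caller then ships it raw). */
    size_t tile_rle_encode(const uint32_t *src, size_t npix,
                           uint32_t *out, size_t out_cap)
    {
        size_t o = 0;

        for (size_t i = 0; i < npix; ) {
            uint32_t pix = src[i];
            size_t run = 1;

            while (i + run < npix && src[i + run] == pix)
                run++;
            if (o + 2 > out_cap)
                return 0;              /* output full: not worth encoding */
            out[o++] = (uint32_t)run;
            out[o++] = pix;
            i += run;
        }
        /* Only worth the decode cost if bandwidth was actually saved. */
        return o < npix ? o : 0;
    }

The decoder is the symmetric loop; the interesting design choice is the
bail-out, since shipping an expanded tile would cost more bandwidth
than a plain copy.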
