2013/3/20 Cedric BAIL <[email protected]>:
> On Tue, Mar 19, 2013 at 5:34 AM, <[email protected]> wrote:
> > On 2013-03-18 15:38, Nicolas Boulay wrote:
> >> 2013/3/18 Timothy Normand Miller <[email protected]>
> >>> For CPUs, 32 was found to be optimal by some paper published back
> >>> in the early 90s, I think. 16 was a close second, while 64 had
> >>> diminishing returns. I'm not sure how this applies to GPUs,
> >>> however. One problem with doubling the RF size is that you slow
> >>> it down.
> >>
> >> That number came without superpipelining and superscalar execution
> >> in mind.
> >
> > Not to mention renamed, out-of-order architectures...
> >
> >> Unrolling loops is a good way to avoid control-flow instructions
> >> and to remove dependencies between instructions, but it needs at
> >> least twice the number of registers.
> >
> > From memory of the F-CPU design, if we consider a constant stream
> > of computation instructions with 2 reads and 1 write that cannot
> > overlap (the dependencies are loose and the code is unrolled to fit
> > the pipeline):
> >
> > 32 registers => up to 11 instructions "in flight" in the pipeline
> > at a time without dependencies. That means a 5-deep, 2-wide
> > superpipeline.
> >
> > 64 was chosen for F-CPU because the pipeline could be pushed to
> > 3-wide by 7-deep, or 5-deep * 4-wide. That was the end of the
> > 90s :-)
>
> This thread reminds me of that too! Just a thought: memory bandwidth
> and latency are clearly critical today. I am working these days on
> the Enlightenment Foundation Libraries, mostly a 2D graphics library
> with some pseudo-3D effects. We have a software backend and an
> OpenGL backend. The software backend is faster in a lot of workloads
> for two reasons: very few OpenGL drivers implement partial updates,
> and a modern CPU is able to saturate memory bandwidth with just one
> core for almost all operations.
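The unrolling point quoted above can be sketched in C (an illustrative example, not code from the thread): breaking a reduction into independent accumulators removes the serial dependency chain, but each extra accumulator is one more live register.

```c
#include <stddef.h>

/* Naive sum: every add depends on the previous one, so only one
 * add can be in flight at a time. Needs one accumulator register. */
long sum_naive(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with independent partial sums: a superscalar or
 * superpipelined core can keep four adds in flight, at the cost of
 * roughly 4x the register pressure for the accumulators. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* remainder */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions compute the same sum; the unrolled one simply trades registers for instruction-level parallelism, which is why deeper/wider pipelines push the comfortable register count past 32.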
> We are at the point of considering doing light compression in
> software to use less memory bandwidth (that, and a proper walking
> pattern within a rendered tile, being the only things we can think
> of to improve performance).
>
> I wish we had on all hardware (CPU and GPU) an L1 "memory" that was
> directly accessible (only by shaders, in the case of a GPU) so we
> could use it as a scratchpad. It should really be like normal
> memory and completely driven by software, not like a cache (a cache
> should be there too, but clearly separated). This way it would be
> doable to do light compression/decompression on each tile (source
> and destination) and artificially improve memory bandwidth. This
> should work well for all computer-generated content (the common
> case for compositing and UI in general).
>
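The "light compression" idea can be sketched with a toy run-length encoder for a row of 32-bit pixels (the pair-of-words format here is invented for illustration, not EFL's actual scheme): computer-generated UI content has long runs of identical pixels, so a run costs one count plus one value of memory traffic instead of N values.

```c
#include <stdint.h>
#include <stddef.h>

/* Compress one row of pixels as (run length, pixel value) pairs.
 * Returns the number of 32-bit words written, or 0 if dst is too
 * small. Hypothetical format, for illustration only. */
size_t rle_compress_row(const uint32_t *src, size_t n,
                        uint32_t *dst, size_t cap)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        uint32_t px = src[i];
        uint32_t run = 1;
        while (i + run < n && src[i + run] == px)
            run++;
        if (out + 2 > cap)
            return 0;          /* destination too small */
        dst[out++] = run;
        dst[out++] = px;
        i += run;
    }
    return out;
}

/* Expand (run length, pixel value) pairs back into raw pixels.
 * Returns the number of pixels written. */
size_t rle_decompress_row(const uint32_t *src, size_t n, uint32_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t run = src[i], px = src[i + 1];
        for (uint32_t j = 0; j < run; j++)
            dst[out++] = px;
    }
    return out;
}
```

With a software-managed scratchpad, the compressed tile rows would live in the scratchpad and only the compressed words would cross the memory bus, which is exactly the artificial bandwidth gain described above.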
NVIDIA GPUs have many different ways of encoding textures to compact
their representation. The load instruction expands the value into the
expected RGB or YUV representation. It looks like compression.

> --
> Cedric BAIL
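The kind of expansion such a load instruction performs can be shown in C with one common packed format, RGB565 (just an example format; the NVIDIA encodings mentioned above include many others): a 16-bit texel is widened to 8 bits per channel on the fly.

```c
#include <stdint.h>

/* Unpack a 16-bit RGB565 texel into 8-bit-per-channel RGB, the way a
 * hardware texture load can expand a compact format on the fly. */
void rgb565_unpack(uint16_t texel, uint8_t *r, uint8_t *g, uint8_t *b)
{
    uint8_t r5 = (texel >> 11) & 0x1F;   /* 5 red bits   */
    uint8_t g6 = (texel >> 5)  & 0x3F;   /* 6 green bits */
    uint8_t b5 =  texel        & 0x1F;   /* 5 blue bits  */

    /* Replicate the high bits into the low bits so that the maximum
     * packed value (0x1F / 0x3F) maps exactly to 0xFF, and 0 to 0. */
    *r = (uint8_t)((r5 << 3) | (r5 >> 2));
    *g = (uint8_t)((g6 << 2) | (g6 >> 4));
    *b = (uint8_t)((b5 << 3) | (b5 >> 2));
}
```

The memory traffic is the 16-bit packed value; the expansion to 24 bits happens in the load path, which is why it behaves like compression from the bandwidth point of view.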
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
