2013/3/20 Cedric BAIL <[email protected]>:
> On Tue, Mar 19, 2013 at 5:34 AM, <[email protected]> wrote:
> > On 2013-03-18 15:38, Nicolas Boulay wrote:
> >> 2013/3/18 Timothy Normand Miller <[email protected]>
> >>> For CPUs, 32 was found to be optimal by some paper published back
> >>> in the early 90s, I think. 16 was a close second, while 64 had
> >>> diminishing returns. I'm not sure how this applies to GPUs,
> >>> however. One problem with doubling the RF size is that you slow
> >>> it down.
> >>
> >> That number came without superpipelining and superscalar execution
> >> in mind.
> >
> > Not to mention renamed, out-of-order architectures...
> >
> >> Unrolling loops is a good way to avoid control-flow instructions
> >> and to remove dependencies between instructions, but it needs at
> >> least twice the number of registers.
> >
> > From memory of the F-CPU design, if we consider a constant stream
> > of computation instructions with 2 reads and 1 write that cannot
> > overlap (the dependencies are loose and the code is unrolled to fit
> > the pipeline):
> >
> > 32 registers => up to 11 instructions "in flight" in the pipeline
> > at a time without dependencies. That means a 5-deep, 2-wide
> > superpipeline.
> >
> > 64 was chosen for F-CPU because the pipeline could be pushed to
> > 3-wide by 7-deep, or 5-deep * 4-wide. That was the end of the
> > 90s :-)
>
> This thread reminds me of that too! Just a thought: memory bandwidth
> and latency are clearly critical today. I am working these days on
> the Enlightenment Foundation Libraries, mostly a 2D graphics library
> with some pseudo-3D effects. We have a software backend and an
> OpenGL backend. The software backend is faster in a lot of workloads
> for two reasons: very few OpenGL drivers implement partial updates,
> and a modern CPU is able to saturate memory bandwidth with just one
> core for almost all operations.
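The unrolling point quoted above can be sketched in C (an illustrative example, not code from the thread): breaking a reduction into independent accumulators removes the serial dependency chain, but each extra accumulator is one more live register.

```c
#include <stddef.h>

/* Naive sum: every add depends on the previous one, so only one
 * add can be in flight at a time. Needs one accumulator register. */
long sum_naive(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with independent partial sums: a superscalar or
 * superpipelined core can keep four adds in flight, at the cost of
 * roughly 4x the register pressure for the accumulators. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* remainder */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions compute the same sum; the unrolled one simply trades registers for instruction-level parallelism, which is why deeper/wider pipelines push the comfortable register count past 32.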
> We are at the point of considering doing light compression in
> software to use less memory bandwidth (that, and a proper walking
> pattern within a rendered tile, being the only things we can think
> of to improve performance).
>
> I wish we had on all hardware (CPU and GPU) an L1 "memory" that was
> directly accessible (only by shaders, in the case of a GPU) so we
> could use it as a scratchpad. It should really be like normal
> memory and completely driven by software, not like a cache (a cache
> should be there too, but clearly separated). This way it would be
> doable to do light compression/decompression on each tile (source
> and destination) and artificially improve memory bandwidth. This
> should work well for all computer-generated content (the common
> case for compositing and UI in general).
>
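The "light compression" idea can be sketched with a toy run-length encoder for a row of 32-bit pixels (the pair-of-words format here is invented for illustration, not EFL's actual scheme): computer-generated UI content has long runs of identical pixels, so a run costs one count plus one value of memory traffic instead of N values.

```c
#include <stdint.h>
#include <stddef.h>

/* Compress one row of pixels as (run length, pixel value) pairs.
 * Returns the number of 32-bit words written, or 0 if dst is too
 * small. Hypothetical format, for illustration only. */
size_t rle_compress_row(const uint32_t *src, size_t n,
                        uint32_t *dst, size_t cap)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        uint32_t px = src[i];
        uint32_t run = 1;
        while (i + run < n && src[i + run] == px)
            run++;
        if (out + 2 > cap)
            return 0;          /* destination too small */
        dst[out++] = run;
        dst[out++] = px;
        i += run;
    }
    return out;
}

/* Expand (run length, pixel value) pairs back into raw pixels.
 * Returns the number of pixels written. */
size_t rle_decompress_row(const uint32_t *src, size_t n, uint32_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t run = src[i], px = src[i + 1];
        for (uint32_t j = 0; j < run; j++)
            dst[out++] = px;
    }
    return out;
}
```

With a software-managed scratchpad, the compressed tile rows would live in the scratchpad and only the compressed words would cross the memory bus, which is exactly the artificial bandwidth gain described above.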
NVIDIA GPUs have many different ways of encoding textures to compact
their representation. The load instruction expands the value into the
expected RGB or YUV representation. It looks like compression.

> --
> Cedric BAIL
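The kind of expansion such a load instruction performs can be shown in C with one common packed format, RGB565 (just an example format; the NVIDIA encodings mentioned above include many others): a 16-bit texel is widened to 8 bits per channel on the fly.

```c
#include <stdint.h>

/* Unpack a 16-bit RGB565 texel into 8-bit-per-channel RGB, the way a
 * hardware texture load can expand a compact format on the fly. */
void rgb565_unpack(uint16_t texel, uint8_t *r, uint8_t *g, uint8_t *b)
{
    uint8_t r5 = (texel >> 11) & 0x1F;   /* 5 red bits   */
    uint8_t g6 = (texel >> 5)  & 0x3F;   /* 6 green bits */
    uint8_t b5 =  texel        & 0x1F;   /* 5 blue bits  */

    /* Replicate the high bits into the low bits so that the maximum
     * packed value (0x1F / 0x3F) maps exactly to 0xFF, and 0 to 0. */
    *r = (uint8_t)((r5 << 3) | (r5 >> 2));
    *g = (uint8_t)((g6 << 2) | (g6 >> 4));
    *b = (uint8_t)((b5 << 3) | (b5 >> 2));
}
```

The memory traffic is the 16-bit packed value; the expansion to 24 bits happens in the load path, which is why it behaves like compression from the bandwidth point of view.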
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
