On Friday 25 February 2005 14:14, Timothy Miller wrote:
> On Fri, 25 Feb 2005 14:03:12 -0500, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > Hi Timothy,
> >
> > On Friday 25 February 2005 07:55, Timothy Miller wrote:
> > > On Thu, 24 Feb 2005 18:22:28 -0500, Daniel Phillips wrote:
> > > > Timothy has suggested that each memory
> > > > controller will have its own cache, and I can see how that
> > > > simplifies things, but on the other hand it will require more cache
> > > > entries in total to get the same benefit versus a shared cache.
> > >
> > > I was thinking four entries, more precisely two pixel pairs. :)
> >
> > Ah, that makes sense.
> >
> > There is a total of 8 pixel pairs, or 16 pixels. This should work ok
> > for bilinear blending, but for trilinear blending there are two
> > separate textures, which will break the cache completely. What's worse
> > is, the two textures will be on different DRAM rows, so this is where
> > the cache is really needed. Doubling the cache to four pixel pairs per
> > port should fix this.
>
> Yeah, you're right. The reason for having two pairs is to deal with
> fetch requests on odd boundaries. But we need another two for the
> other row.
>
> > Again, sharing the cache between all four ports is something to think
> > about, it might reduce the total footprint at the expense of more
> > design complexity. Something for a later rev?
>
> The reason for more smaller caches is so that one stream does not hog
> the cache, killing needed efficiency in another.
>
> > OK, with a tiny cache a round robin replacement policy is way too
> > crude, so here is a better suggestion: for each cache miss, search for
> > a cache entry that is in the same texture but not adjacent either
> > vertically or horizontally to the new pixel pair and replace that one.
> > For the common case where the filter footprint is moving in any
> > direction one texel at a time, this replacement policy is optimal. The
> > "in the same texture" test is aimed at the trilinear filtering case,
> > which is the one that needs optimizing.
>
> Anything more sophisticated than simple tables with LRU replacement is
> going to require too much logic.

LRU implements the policy I described, so long as the eight texels of a
trilinear blend are accessed in the proper order. That order isn't the best
one for the DRAM, though, because it alternates between two textures. If the
memory request queue is long enough, maybe the accesses can be reordered
there. Or, more simply, have two independent texture caches per port, or
split the cache that way only when trilinear blending is active.

OK, it looks like you have the memory controller under control (so to
speak). It's obviously something you have a lot of experience with in the
2D case. The reason I'm worrying about it so much is that I'm wondering
whether there is any way to squeeze 2x2 FSAA plus trilinear blending out of
this card, even if it's only usable at 640x480. For me, this is where we
enter the realm of "doesn't suck" for a 3D card.

A pixel is one word, so I'll do the estimate in words. We have 400
megawords per second of bandwidth, ignoring pre-charge delays. We have to
draw 640x480 at double resolution in each direction, that is, 4 times the
pixels. Each pixel accesses 8 words of texture (a 2x2 footprint in each of
two mipmap levels) and writes one word of frame buffer, for 9 words/pixel.
Quake 3 has roughly a 2x overdraw factor, so assuming that and 30
frames/second, the grand total is:

640 * 480 * 4 * 9 * 2 * 30 = 663,552,000 words/sec

which is about 66% more than we have. The scanout adds another
640 * 480 * 4 * 30 = 36,864,000 words/sec, making it 75% over, not counting
DRAM pre-charge latencies. The texel access pattern will sometimes
bottleneck on one or two chips, slowing things down further. The point is,
we're not that far away, and even if we can't quite get a playable frame
rate, it would still be wonderful to at least have the option of seeing a
scene in full glorious non-jagginess from time to time.
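
The arithmetic above can be reproduced in a few lines of Python. This only restates the estimate in the text: 400 megawords/sec of peak bandwidth, 2x2 supersampling at 640x480, 8 texel words plus 1 frame buffer word per sample, 2x overdraw and 30 frames/sec, with scanout reading the supersampled buffer once per frame as in the estimate. The function name `estimate` and its parameters are hypothetical.

```python
PEAK_WORDS_PER_SEC = 400_000_000  # ignores DRAM pre-charge delays


def estimate(width=640, height=480, samples=4, texel_words=8,
             fb_words=1, overdraw=2, fps=30, scanout_words=1):
    """Return (drawing, scanout) bandwidth in words/sec."""
    samples_per_frame = width * height * samples  # 2x2 supersampling
    drawing = samples_per_frame * (texel_words + fb_words) * overdraw * fps
    scanout = samples_per_frame * scanout_words * fps
    return drawing, scanout


drawing, scanout = estimate()
over_drawing = drawing / PEAK_WORDS_PER_SEC - 1
over_total = (drawing + scanout) / PEAK_WORDS_PER_SEC - 1
print(f"drawing: {drawing:,} words/sec, {over_drawing:.0%} over budget")
print(f"with scanout: {drawing + scanout:,} words/sec, {over_total:.0%} over")
```

It reports about 66% over budget for drawing alone and 75% once scanout is added, before any pre-charge or per-chip hot-spot losses.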

There are several ways to claw back some fill rate:
1) Texture cache, saving maybe 25% of texel accesses
2) 16-bit textures, saving 50% of texel bandwidth and improving cache hit
   rate slightly
3) 16-bit frame buffer, saving 50% of pixel write and scanout bandwidth.

If we do all of these, we just might hit 30 frames/sec for Quake 3, with
decent filtering and antialiasing.
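
Plugging the three savings from the list into the same estimate gives a feel for the headroom. This sketch assumes, as a simplification, that the savings combine multiplicatively (the 25% texture cache saving applied first, then halving the remaining texel words), ignores the slight hit-rate improvement from 16-bit textures, and keeps scanout at 30 frames/sec as in the estimate above. The function name `budget` is hypothetical.

```python
PEAK_WORDS_PER_SEC = 400_000_000  # ignores DRAM pre-charge delays
SAMPLES_PER_FRAME = 640 * 480 * 4  # 2x2 supersampling at 640x480
OVERDRAW, FPS = 2, 30


def budget(cache_saving=0.0, texture_16bit=False, fb_16bit=False):
    """Return total words/sec for drawing plus scanout."""
    # Assumption: cache and 16-bit savings multiply on the 8 texel words.
    texel_words = 8 * (1 - cache_saving) * (0.5 if texture_16bit else 1)
    fb_words = 0.5 if fb_16bit else 1  # pixel write and scanout per sample
    drawing = SAMPLES_PER_FRAME * (texel_words + fb_words) * OVERDRAW * FPS
    scanout = SAMPLES_PER_FRAME * fb_words * FPS
    return drawing + scanout


baseline = budget()
tuned = budget(cache_saving=0.25, texture_16bit=True, fb_16bit=True)
print(f"baseline: {baseline / PEAK_WORDS_PER_SEC:.0%} of peak")
print(f"all three savings: {tuned / PEAK_WORDS_PER_SEC:.0%} of peak")
```

Under these assumptions the total drops from about 175% of peak to about 69%, leaving roughly 30% slack for pre-charge latencies and the per-chip hot spots mentioned above.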

Regards,
Daniel
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)