André and Kenneth spent some time working on the OGA2 spec in advance
of posting to the list so they could get their thoughts together and
make sure to present something reasonably coherent and detailed.  One
of the things we now want help with is further clarifying our basic
intent, after which we can move on to more and more specific
implementation details.

One of the design details that seems to be hard to present is the MIMD
architecture.  At first glance, it looks like a SIMD architecture.
But all of you are right to point out that shader workloads are
primarily scalar.  We might include vector instructions if we find
them helpful, but they'll just translate into optimized use of the
scalar ALUs.  We'll have three basic datatypes:  float32, int32, and
uint32.  (And int32 and uint32 will mostly be treated identically
except in cases of mult, div, and conversions.)  All other data types
will simply be converted on the way in/out of some other resource.
Since we have a successful MIPS-like architecture in HQ, we'll just
extend this (conceptually).  The MIPS architecture, lacking things
like carry/overflow/status flags, is simply easier to implement.
Those instances where we end up requiring a couple extra instructions
are a worthwhile tradeoff to allow us to have only the register file
and program counter in the active thread context.

The programmer's view of the processor is that each fragment
associated with a polygon is processed by a scalar program running on
a scalar processor (possibly with some vector optimizations).  The
compiler and scheduler will then see a sea of processors that can
execute this program on each fragment in parallel.  Each fragment will
get its own task (short-running thread).  Since each fragment will
refer to a different set of colors and texels, each task will
require its own register file.  But there will only be a handful of
unique tasks running at the same time.  A lot of the time, you'll find
thousands of fragments being processed by the same shader program.
Moreover, they'll mostly be processed by exactly the same instruction
sequence.
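To make the programming model above concrete, here's a toy sketch: one scalar program run independently for every fragment, each invocation with its own "register file" (here just local variables).  The shader body itself is made up purely for illustration; nothing about it comes from the spec.

```python
# Toy model of the programmer's view: a scalar fragment program,
# invoked once per fragment, each invocation an independent task
# with its own local state.  The shader math here is illustrative.

def fragment_shader(frag_color, texel):
    # Scalar float32-style operations only, one component at a time.
    r = frag_color[0] * texel[0]
    g = frag_color[1] * texel[1]
    b = frag_color[2] * texel[2]
    return (r, g, b)

# Thousands of fragments would run this same instruction sequence in
# parallel on the sea of processors; here we just map it sequentially.
fragments = [((1.0, 0.5, 0.25), (0.5, 0.5, 0.5)) for _ in range(4)]
results = [fragment_shader(c, t) for c, t in fragments]
print(results[0])  # (0.5, 0.25, 0.125)
```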

Thus, it makes sense to economize on program file space.  One idea is
to run dozens of threads off of the same instruction cache, in groups.
As an example, we might assign up to 64 threads to an 8-wide MIMD
processor.  Initially, the constraint for the scheduler is that sets
of 8 should start with the same program counter.  The shared program
file (or instruction cache), which allows us to economize on RAM, has
an instruction fetched once and issues to 8 different pipelines.  The
remaining threads will have instructions issued in round-robin fashion
also in groups of 8.  Should one or more of those threads diverge (a
conditional branch isn't taken uniformly across the group), the
diverging threads will be split off and scheduled separately as one
additional cycle in the round-robin.
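The grouping and splitting behavior can be sketched in a few lines.  This is a toy software model of the idea, not the hardware design: the class and function names, and the fake branch-target offset, are all assumptions for illustration.

```python
# Toy model of the proposed scheduling: up to 64 threads share one
# instruction cache on an 8-wide MIMD core.  Threads issue in
# round-robin groups of 8 that share a program counter; when a
# conditional branch diverges within a group, the group is split and
# the new group takes one extra slot in the round-robin.

class ThreadGroup:
    def __init__(self, thread_ids, pc):
        self.threads = list(thread_ids)
        self.pc = pc

def split_on_branch(group, taken_ids):
    """Split a group whose threads disagree on a conditional branch.

    taken_ids: subset of group.threads that take the branch.
    Returns the one or two groups that replace the original.
    """
    taken = [t for t in group.threads if t in taken_ids]
    not_taken = [t for t in group.threads if t not in taken_ids]
    if not taken or not not_taken:
        return [group]  # no divergence: the group stays intact
    # Divergent: the taken side jumps, the other falls through.
    return [ThreadGroup(not_taken, group.pc + 1),
            ThreadGroup(taken, group.pc + 100)]  # 100 = fake target

# 64 threads start as 8 groups of 8, all at pc=0.
groups = [ThreadGroup(range(i * 8, (i + 1) * 8), pc=0) for i in range(8)]

# Suppose a branch is taken only by even-numbered threads in group 0.
evens = {t for t in groups[0].threads if t % 2 == 0}
groups = split_on_branch(groups[0], evens) + groups[1:]

print(len(groups))                # 9 round-robin slots now
print(sorted(groups[0].threads))  # odd threads fell through
print(sorted(groups[1].threads))  # even threads took the branch
```

The cost model falls out naturally: each divergence adds one slot to the round-robin, so fully uniform workloads pay nothing for the mechanism.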

Memory reads are the worst thing to handle.  This is especially true
if the read incurs a row miss.  Typical GPUs have thousands of active
threads.  If one thread is stalled waiting on a read, another can take
its place.  We need to investigate how we can do the same.  My
personal opinion is that we should optimize for sort-first.
Essentially, what we do is section the target surface into rectangular
areas that correspond to a memory row.  This keeps all writes and
read-modify-writes of the target surface confined to a single memory
row.  The nondeterministic nature of the scheduling to this sea of
processors means that we cannot expect fragments to be processed in
some convenient order, as we could with a fixed-function design.
Fortunately, confinement to a memory row helps tremendously with
that.  This confinement means that we have a higher probability of a
cache hit for a read, and a cache miss is relatively
cheap compared to when we have a row miss.  With sort-first, we would
first identify all triangles that intersect the current rectangle and
render everything that would be seen there before moving on to the
next one.  This makes the target surface very cheap to access.  Texel
reads could be much more random-access, however.  Sort-first will
help confine them a bit, and we'll have to take care of the rest with
caching and texture compression.
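The sort-first step itself is just a binning pass.  Here's a minimal sketch using bounding-box overlap, assuming a 32x32 patch; the patch size and helper names are illustrative, and a real implementation would do an exact triangle/rectangle test rather than bbox overlap.

```python
# Rough sketch of sort-first: bin each triangle into every screen
# patch (memory-row-sized rectangle) its bounding box touches, so a
# patch can be rendered to completion before moving to the next one.

PATCH = 32  # e.g. a 32x32 patch corresponding to one memory row

def patches_touched(tri, patch=PATCH):
    """Yield (px, py) patch coordinates overlapped by the bbox."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    for py in range(int(min(ys)) // patch, int(max(ys)) // patch + 1):
        for px in range(int(min(xs)) // patch, int(max(xs)) // patch + 1):
            yield (px, py)

def bin_triangles(triangles):
    bins = {}
    for i, tri in enumerate(triangles):
        for key in patches_touched(tri):
            bins.setdefault(key, []).append(i)
    return bins

tris = [
    [(2, 2), (10, 4), (5, 12)],      # fits entirely in patch (0, 0)
    [(30, 30), (40, 30), (35, 45)],  # straddles four patches
]
bins = bin_triangles(tris)
print(sorted(bins))  # every patch that needs rendering work
```

Rendering then walks the bins in order, so every target-surface access within a bin stays inside one memory row.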

I might be remembering this wrong, but with OGD1 and its four memory
controllers, one memory row corresponds to 1024 pixels, which could be
configured as a 32x32 patch.  We need to decide if that's enough.
Currently, the bank selects are configured to be the top part of the
address.  This allows, for instance, alternate reads and writes to
different areas of graphics memory (not within the same bank) to be
relatively cheap because they avoid row misses.  This is why I like to
allocate the screen starting at the beginning of memory but off-screen
surfaces starting at the END and working their way down.  However, if
we were to move those two bits into the lower part of the address, we
could extend the row length to 4096 pixels, giving us a 64x64 patch to
work with.
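A quick back-of-envelope check of those patch sizes, using the numbers as recalled above (they should be verified against the OGD1 docs): 1024 pixels per row tiles as 32x32, and folding the two bank-select bits into the low part of the address quadruples the row to 4096 pixels, or 64x64.

```python
# Verify the patch geometry implied by the row lengths discussed
# above.  Row lengths are the recalled OGD1 numbers, not measured.

import math

def square_patch(pixels_per_row):
    side = math.isqrt(pixels_per_row)
    assert side * side == pixels_per_row, "row doesn't tile as a square"
    return (side, side)

print(square_patch(1024))       # (32, 32)
print(square_patch(4 * 1024))   # (64, 64): 2 bank bits folded into the row
```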

As with Larrabee, I suggest that we provide dedicated texture
hardware.  We could cache the compressed data and use one decompressor
or a small number to service all texel requests.  I expect that random
access to the graphics memory will have much higher latency than
pushing all requests through a small number of decompressors.

I'm also leaning toward one or a couple of dedicated rasterizers.
Intel's engineers were able to come up with a clever software
rasterizer, but our engines are not as powerful and would not be
optimized for that task.  Software rasterization is known to be
inefficient, while hardware implementations are fast but can require
a lot of area.  We need to figure out just how
flexible we need the rasterizer to be.  Essentially, it would have to
linearly walk a triangle, interpolating several arbitrary parameters
(that would later be interpreted as things like coordinates, texture
coordinates, colors, etc.).  But for however many parameters we can
interpolate, we'll usually need fewer, while someone is eventually
going to want more.  So we need to think about a rasterizer that is
optimized specifically for that task but highly flexible, where we can
trade off performance for the number of parameters it walks.  For
instance, it might interpolate groups of four but be able to walk up
to 64 of them.  In that case, it would take 16 cycles to generate the
parameters for one fragment, but that's okay since the shaders are
scalar and can't really manipulate all of them at once.
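The throughput tradeoff in that example is simple enough to write down.  The group width (4) and maximum (64) below are the example numbers from the paragraph above, not fixed spec values.

```python
# Cycle count for the proposed interpolator: it steps parameters in
# groups of four but can walk up to 64 per fragment, trading latency
# for flexibility in the number of parameters.

def cycles_per_fragment(num_params, group=4, max_params=64):
    if num_params > max_params:
        raise ValueError("too many interpolated parameters")
    return -(-num_params // group)  # ceiling division

print(cycles_per_fragment(64))  # 16 cycles for the full 64 parameters
print(cycles_per_fragment(6))   # 2 cycles when a shader needs only 6
```

Since the shaders consume parameters one scalar at a time anyway, those 16 cycles overlap with useful work rather than stalling the pipeline.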

The data flow goes roughly like this:

geometry (shaders) -> vertexes (shaders) -> triangulation ->
rasterizer -> fragments (shaders)

Processing things in patches means that we often won't have that many
vertexes corresponding to the area.  Something has to sort things into
patches, and that could be the host CPU (which would therefore also do
the geometry), or it could be the shaders.  (Are there special
instructions we need to make geometry processing efficient?)  Once
it's determined which polygons (broken into triangles) overlap the
patch, those vertexes can be rasterized into fragments.

We have a scheduling problem.  One thing we might want to do is
perform geometry processing for the whole scene in one pass, storing
the intermediate results (vertexes) in graphics memory.  This isn't my
area, but I'm guessing the next step would be a vertex shading pass,
also storing the vertexes in graphics memory.  This is worthwhile as
long as we can keep everything busy.  If not, we have to think about
getting a head start on fragment processing.  If we organize the
vertexes properly, then when it's time to do fragment processing, we
can have one or two threads of a shader fetching processed vertexes
and feeding them to the rasterizer(s).  All remaining shaders would
have fragments dynamically load-balanced across them.

We don't have to totally solve the scheduling issue up front.  (No
one's yet done a perfect job anyhow.)  But we do need to keep it in
mind so that we don't end up hacking changes to the shader engine as
an afterthought.

Ok!  Now it's time for everyone to rip apart my arguments.  :)
Seriously, we need to do a good job at this.  With all of this open
source brainpower on the problem, that should be a piece of cake,
right?

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
