Step one, of course, is to develop a preliminary GPU architecture for
the simulator to model.  We have something to start with, but it
should remain in flux, because we need the simulator precisely to
understand the impact of architectural design decisions.

I have a friend who is currently doing research on GPUs, and he could
really benefit RIGHT NOW from a detailed and accurate GPU simulator.
I could use it eventually for power modeling, whereas his need is more
reliability-oriented.  If we have a fabbable GPU design as a long-term
goal, then we're going to need a simulator both before and after we
have a complete design.  For the OGP, this would be the basis for our
architectural decisions.  But if we make it somewhat general and
parameterized, it could be used around the world by people doing
research on GPU architecture.

FOSS means different things to different people.  But those things
generally include freedom, usefulness, impact on the world, and
mindshare.  By setting up a realistic SOFTWARE goal for this project,
we can contribute to all of those, without the sort of monetary
expense that posed too much of a challenge the first time around.
Even if a useful GPU simulator is as far as we get, we'll have done
some good for the world, and we'll be a success (arguably for a second
time; then we can make it a third).



So, let's talk engineering now.


I'll assert (in order to generate discussion) that a GPU needs to be
divided into the following components:

- A geometry processor *
- A vertex processor *
- A rasterizer (or set thereof)
- A fragment processor *
- A process scheduler (to distribute work to compute engines)
- A texture processor (mostly for decompressing pre-computed textures)
- A memory system for storing drawing surfaces and textures (possibly
with some amount of caching)

* Each composed of an array of compute engines.


Whether or not the compute engines for each task are all identical
(and/or dynamically partitioned) is a design parameter (i.e. an
architectural characteristic to be examined).

Whether or not the compute engines are vector or scalar is a design
parameter.  (I argue that scalar designs will get better internal
utilization and allow more engines to be fit into the same die area.
A counterargument to consider is that with too many engines, we may be
unable to utilize them all, i.e. poor external utilization.)
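
To make "design parameter" concrete, here's a rough sketch, in C++
(one language we might write the simulator in; see below), of the kind
of configuration record a parameterized simulator might take.  Every
name here is made up by me just now, purely for illustration:

#include <cstdint>

// Hypothetical configuration record for a parameterized GPU simulator.
enum class EngineKind   { Scalar, Vector };     // scalar vs. vector engines
enum class Partitioning { Unified, PerStage };  // shared vs. per-stage arrays

struct GpuConfig {
    // Compute-engine arrays (geometry, vertex, fragment processors).
    Partitioning partitioning = Partitioning::Unified;
    EngineKind   engine_kind  = EngineKind::Scalar;
    uint32_t     num_engines  = 16;  // total engines if unified
    uint32_t     vector_width = 1;   // >1 only for EngineKind::Vector

    // Warp geometry (see the warp discussion below).
    uint32_t     warp_width = 4;     // spatial lanes per engine
    uint32_t     warp_depth = 10;    // threads multiplexed down one pipeline

    // Fixed-function and memory-system knobs.
    uint32_t     num_rasterizers    = 1;
    bool         texture_decompress = false;  // off in the first model
    uint32_t     tile_width  = 32;   // sort-first rendering block (pixels)
    uint32_t     tile_height = 32;
    uint32_t     cache_bytes = 32 * 32 * 4;   // e.g. one RGBA8 tile

};

None of those defaults are decisions; they're just OGA2-ish numbers
from later in this message so the struct isn't empty.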

Larrabee demonstrated that rasterization is challenging for a
general-purpose processor, suggesting that we may want dedicated
hardware for it.  Notably, the only specialized hardware in Larrabee
was a texture engine, because they knew up front that texturing would
be too inefficient on a general-purpose processor.

In general, GPUs perform identical sets of operations on different
data (neighboring pixels), and only under limited circumstances do you
get divergence (of flow control).  So thread-level parallelism is
lined up into WARPS, where neighboring processing pipelines, THREAD
PROCESSORS, execute the same instructions, which in practical terms
means they all share the same instruction cache.  There are actually
two dimensions to warps.  The horizontal dimension is spatial,
corresponding to parallel execution pipelines (like SMP).  The
vertical dimension is temporal, where instructions from different
threads are multiplexed down the same pipeline (like SMT).
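
To illustrate the two dimensions, here's a rough C++ sketch of how a
warp might be represented in the simulator (types and names are
hypothetical, not a spec): one fetched instruction feeds warp_width
spatial lanes, and warp_depth thread groups take turns in the same
pipeline.

#include <cstdint>
#include <vector>

// One thread context: its own registers and program counter.
struct ThreadContext {
    uint32_t pc = 0;
    std::vector<uint32_t> regs;
};

// A warp is a 2-D array of thread contexts:
//   width  (spatial):  lanes that execute the SAME fetched instruction
//                      on neighboring data, like SMP.
//   depth  (temporal): thread groups multiplexed down the same pipeline
//                      on successive cycles, like SMT.
// All of them follow one instruction stream (hence one I-cache).
struct Warp {
    uint32_t width;
    uint32_t depth;
    std::vector<ThreadContext> threads;  // width * depth contexts

    Warp(uint32_t w, uint32_t d) : width(w), depth(d), threads(w * d) {}

    // All lanes of one temporal slot; they execute the same instruction.
    ThreadContext* row(uint32_t slot) { return &threads[slot * width]; }
};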

Some GPUs, as I understand it, switch threads vertically only on
memory stalls.  For OGA2, the design spec was round-robin issue.
Moreover, the spec was to make the vertical warp size exceed the
pipeline depth, eliminating all pipeline hazards (ALU, branch, etc.).
We had come up with a 9-stage pipeline but needed to make the warp
depth 10 for register file multiplexing reasons (IIRC).  With a width
of 4 horizontally, this would make the warp size 40 for OGA2.  (32 is
common for mainstream GPUs, so we're not very far off.)
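
As a sanity check on why that kills the hazards, here's a tiny sketch
of round-robin vertical issue (names made up).  If the warp depth is
at least the pipeline depth, an instruction from thread slot t has
drained out of the pipeline before slot t comes around again, so no
forwarding or interlock logic is needed:

#include <cassert>
#include <cstdint>

// Round-robin vertical issue: each cycle, the next temporal slot of
// the warp issues one instruction.  With warp_depth >= pipeline_depth,
// the previous instruction from the same slot has already written
// back, so ALU and branch hazards disappear without stall logic.
struct IssueScheduler {
    uint32_t warp_depth;      // temporal slots per warp (OGA2: 10)
    uint32_t pipeline_depth;  // pipeline stages        (OGA2: 9)
    uint32_t next_slot = 0;

    IssueScheduler(uint32_t wd, uint32_t pd)
        : warp_depth(wd), pipeline_depth(pd) {
        assert(warp_depth >= pipeline_depth);
    }

    // Which temporal slot issues this cycle.
    uint32_t issue() {
        uint32_t slot = next_slot;
        next_slot = (next_slot + 1) % warp_depth;
        return slot;
    }
};

// OGA2: width 4 (spatial) x depth 10 (temporal) = 40 threads per warp.
static_assert(4 * 10 == 40, "OGA2 warp size");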

OGA2 was spec'd to do sort-first rendering, because it provides data
locality in the target surface, and we therefore spec'd a data cache
that was the exact size of the rendering block (32x32 pixels).
However, most GPUs are sort-last for scalability reasons.  This is,
therefore, a design parameter.  (One that may even be a function at
instantiation time of other parameters.)
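
For reference, here's a back-of-the-envelope sketch of what sort-first
means in simulator terms: primitives get binned to 32x32-pixel screen
tiles up front, so each engine works on one tile with good locality in
the target surface.  (The bounding-box binning below is only an
illustration, not a spec.)

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

constexpr uint32_t kTileW = 32, kTileH = 32;   // the rendering block

struct Triangle { float x[3], y[3]; };  // screen-space vertices (toy)

using TileId = std::pair<uint32_t, uint32_t>;

// Conservatively bin a triangle into every tile its bounding box
// touches; each bin is later rendered by one engine against a
// tile-sized cache.
void bin_triangle(const Triangle& t,
                  std::map<TileId, std::vector<Triangle>>& bins) {
    float minx = t.x[0], maxx = t.x[0], miny = t.y[0], maxy = t.y[0];
    for (int i = 1; i < 3; ++i) {
        minx = std::min(minx, t.x[i]);  maxx = std::max(maxx, t.x[i]);
        miny = std::min(miny, t.y[i]);  maxy = std::max(maxy, t.y[i]);
    }
    if (maxx < 0 || maxy < 0) return;  // entirely off-screen (toy clipping)
    if (minx < 0) minx = 0;
    if (miny < 0) miny = 0;
    uint32_t tx0 = uint32_t(minx) / kTileW, tx1 = uint32_t(maxx) / kTileW;
    uint32_t ty0 = uint32_t(miny) / kTileH, ty1 = uint32_t(maxy) / kTileH;
    for (uint32_t ty = ty0; ty <= ty1; ++ty)
        for (uint32_t tx = tx0; tx <= tx1; ++tx)
            bins[{tx, ty}].push_back(t);
}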


This is some high-level stuff.  My suggestion is to consider pipeline
structure only indirectly and start with a purely functional
simulator.  That is, functional-level shader engines with
parameterized warp sizes, a parameterized number of engines, geometry
processing, vertex processing, fragment processing, a rasterizer, no
texture compression, and a simple memory system.  Although the compute
engine is abstract, we're still simulating a parallel system.  So
rather than reinventing the wheel by developing our own event-based
system or some such, let's take advantage of existing software.  I
suggest that we write the functional model in BEHAVIORAL VERILOG.
(See the suggestion I posted here:
http://stackoverflow.com/questions/3999717/discrete-event-simulators-for-c/9350333#9350333)
 And then we can use the open source Icarus Verilog simulator to run
it.  I'm willing to entertain other options for performance reasons,
because faster simulators are more useful.  (But even better would be
to enhance Icarus to be faster.  I doubt it does JIT already, and it
would be great to add that.  More contributions that are useful to
many people.)
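
Just to pin down what I mean by "purely functional", here's a toy C++
sketch of that level of abstraction.  (The real thing would be the
parameterized behavioral Verilog described above; this is only to show
that at this level there's no pipeline or timing, just math applied
stage by stage, and everything here is made up for illustration.)

#include <cstdint>
#include <vector>

// Screen-space vertex with a flat color; geometry/vertex work is
// collapsed to trivial triangle assembly for the sketch.
struct Vertex   { float x, y; uint32_t color; };
struct Triangle { Vertex v[3]; };
struct Surface  { uint32_t w, h; std::vector<uint32_t> pix; };

static std::vector<Triangle> assemble(const std::vector<Vertex>& in) {
    std::vector<Triangle> out;
    for (size_t i = 0; i + 2 < in.size(); i += 3)
        out.push_back(Triangle{{in[i], in[i + 1], in[i + 2]}});
    return out;
}

// Edge-function test: is the sample point inside the triangle?
static bool inside(const Triangle& t, float px, float py) {
    auto edge = [](const Vertex& a, const Vertex& b, float x, float y) {
        return (b.x - a.x) * (y - a.y) - (b.y - a.y) * (x - a.x);
    };
    float e0 = edge(t.v[0], t.v[1], px, py);
    float e1 = edge(t.v[1], t.v[2], px, py);
    float e2 = edge(t.v[2], t.v[0], px, py);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
           (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

// "Rasterizer" plus flat-shaded "fragment processor", functionally.
void draw(const std::vector<Vertex>& verts, Surface& s) {
    for (const Triangle& t : assemble(verts))
        for (uint32_t y = 0; y < s.h; ++y)
            for (uint32_t x = 0; x < s.w; ++x)
                if (inside(t, x + 0.5f, y + 0.5f))
                    s.pix[y * s.w + x] = t.v[0].color;
}

The timing-accurate, parameterized warp/engine structure gets layered
on top of something like that later.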

With regard to instruction set, think of it as a data structure, and
make it as wide as you want.  I don't mean VLIW.  I mean that if we
need 37 bits initially, let's use 37 bits.  Or 49.  Or whatever.  If
our simulator is in C++, then the instruction will actually BE a data
structure, even if it corresponds 1:1 with a 32-bit word in a later
iteration of the design.  We need an appropriate set of OPERATIONS,
and we'll solidify that into a more concrete instruction set as we
explore the implications.  (One advantage of doing this in C++ is that
if we change the instruction set, much of the simulator doesn't
change, because although the representation changes, the access methods
for the instruction object mostly don't.)
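
For instance (field names purely illustrative), the "wide" instruction
could start out as something like this, with accessors that survive a
later re-encoding:

#include <cstdint>

// The instruction as a data structure: fields are whatever width we
// need right now (37 bits?  49?  doesn't matter), and the rest of the
// simulator only goes through the accessors, so squeezing it into a
// 32-bit word later doesn't ripple through the code.
enum class Op : uint16_t { Add, Mul, Mad, Load, Store, Branch /* ... */ };

class Instruction {
public:
    Instruction(Op op, uint32_t dst, uint32_t src0, uint32_t src1)
        : op_(op), dst_(dst), src0_(src0), src1_(src1) {}

    Op       op()   const { return op_; }
    uint32_t dst()  const { return dst_; }
    uint32_t src0() const { return src0_; }
    uint32_t src1() const { return src1_; }

private:
    // Plain fields today; later these could be decoded from a packed
    // 32-bit encoding without the accessors changing.
    Op       op_;
    uint32_t dst_, src0_, src1_;
};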


P.S., I had an IM discussion with Gary Sheppard, and I'd like to
clarify some points.  I think we have the potential to beat PowerVR,
AMD, and nVidia in terms of performance and energy efficiency.
ARCHITECTURALLY.  If we're clever enough.  Real hardware would be made
at the behest of a licensee wanting to embed our design in their SoC.
We're defining an ARCHITECTURE.  We care only peripherally about
things like the transistor technology.  A licensee of our design COULD
start from a synthesized and placed design, or they could start from
the Verilog source.  Not our problem, because we're fabless.  Also,
the patent minefield issue isn't our problem either, because the
licensee would have to deal with that.  The only time we care about
transistor technology is when we want to prove that our design is
more efficient, so we have to synthesize something and simulate it
at the gate level based
on some standard cell technology.  But we don't prescribe it.  That's
just evidence that we're better.  Also, in my imagination, we're
targeting embedded systems first and supercomputers second.  Not
necessarily with the same exact architecture, because 3D and GPGPU
have different compromises.  One or both will eventually trickle into
the desktop.

P.P.S., News Headline:  The Open Graphics Project reinvents itself to
be completely fabless, with the goal of developing the best GPU
architecture, to beat all others in terms of efficiency, performance,
and freedom.  (I know.  I'm being over the top.  But I feel like
something really impactful is doable now.)


There.  Now that we don't have finances as a barrier to entry, we can
really succeed.

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
