On 2009-10-10, Timothy Normand Miller wrote:

> For integer vector ops like this, you can select the word size by
> breaking carry chains.  But with float, you would probably need
> different circuitry for each word size.

I missed the part of the draft spec which says that we're only going
to support 32 bit FP.  I don't suggest we support 64 bit unless we
have to.  But even if we needed to support both, I think we could find
ways to share adders and multipliers between them, even if the glue
logic becomes very different.
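As an aside, the integer trick Timothy mentions looks like this in
software terms -- the usual SWAR analogy of breaking carry chains,
assuming four 8-bit lanes in a 32-bit word (in hardware the masks
would become AND gates on the carry wires rather than actual masking):

    LO = 0x7f7f7f7f   # low 7 bits of every lane
    HI = 0x80808080   # top bit of every lane

    def add_4x8(a, b):
        # Add the low 7 bits per lane; no carry can cross a lane boundary.
        partial = (a & LO) + (b & LO)
        # Fold the top bits back in with XOR, which produces no carry out.
        return (partial ^ ((a ^ b) & HI)) & 0xffffffff

    def add_1x32(a, b):
        # The same adder with the carry chain left intact.
        return (a + b) & 0xffffffff

    assert add_4x8(0xff010203, 0x01010101) == 0x00020304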
> One of the framebuffer formats we'll support will be 8888 ARGB.  By
> default, I have been assuming that it would get converted, on the way
> into the shader, into four 32-bit uints or four 32-bit floats
> (depending on what the converter is set to).  In the shader, we would
> operate on the channels separately at high precision and then truncate
> on the way back out.  In theory, we could operate on them in their
> original 8-bit form.  But I feel that this complicates matters.  Most
> of the time, the shader will be treating the color channels as floats.
> That's how OpenGL models it.

The approach I had in mind for the ALU is to look at FP add and FP mul
and add enough basic adders and multipliers at the right stages.  I
don't know how much sharing between add and mul is possible without
extending the pipeline, which we should be careful about.  But, given
the complexity of the FP ops, I think we have the opportunity to
perform several integer ops per instruction at the cost of replacing
hardwiring with LUTs.  Admittedly, I don't think we'll know if it's
feasible before having a closer look at the FP ops and trying to work
it out.

> > The first stage receives (insn_kind, reg_a, reg_b, insn_no); it
> > fetches reg_a and reg_b from the thread context and insn_no from the
> > microinstruction store.  The microinstruction contains maybe one byte
> > per stage, which will determine what function to perform on the data
> > from the previous stage.  The available functions are carefully
> > selected to allow FP add/sub and FP mult as the hardest constraints.
>
> Are you suggesting a microcoded architecture?  Is this microcode table
> programmable?

Yes & yes, if we can deal with the instruction bandwidth.

> From a theoretical computer science perspective, this is interesting.
> From a "simplify the hell out of this design" perspective, I'm
> dubious.
>
> Let's say we divide up the instruction into four 8-bit slices, because
> we have decided (in the hypothetical) to provide 256 local 32-bit
> registers to each thread.  That leaves 8 bits for the opcode.  That
> opcode is likely to require some amount of translation.  But I would
> expect to hard-code that translation into the pipeline logic, and we
> would select bit patterns (late in the design) to simplify that
> translation as much as possible.  In theory, we could have a 256xN
> look-up table to do this, but hard-coded logic that relied on
> carefully chosen codes would require less gate area.

You're probably right.  It depends how important it is to economise on
the adders and multipliers as compared to LUTs and the BRAMs for the
microinstructions.  Whether to add a microcode table is the basic
decision; if we do, then how wide it is and how many LUTs we throw in
can be adjusted according to resources.

An argument for adding complexity to the ALU, whether through
microcode or through more complex instructions in general, is that the
program is something the compiler can know in advance.  It cannot know
how memory accesses or branches will go, but locally it has the
opportunity to optimise a linear sequence of instructions.  The basic
functions which are combined in the ALU come with the promise that
there are no unpredictable interruptions.
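To make the microcode idea concrete, here is a purely behavioural
sketch in Python of one thread's trip through a hypothetical 4-stage
ALU.  The stage functions and table contents are invented for
illustration, and pipelining and FP are ignored:

    # One function per control byte; a real design would pick these to
    # cover the sub-steps of FP add/sub and FP mul.
    STAGE_FN = {
        0x00: lambda acc, b: acc,       # pass-through
        0x01: lambda acc, b: acc + b,   # integer add
        0x02: lambda acc, b: acc * b,   # integer mul
        0x03: lambda acc, b: acc ^ b,   # xor
    }

    # The microinstruction store: insn_no indexes a row holding one
    # control byte per stage (4 stages here).
    ucode_store = [
        bytes([0x01, 0x00, 0x00, 0x00]),   # insn 0: plain add
        bytes([0x02, 0x01, 0x00, 0x00]),   # insn 1: multiply-add, a*b + b
    ]

    def execute(insn_no, reg_a, reg_b):
        acc = reg_a
        for ctrl in ucode_store[insn_no]:  # one byte consumed per stage
            acc = STAGE_FN[ctrl](acc, reg_b)
        return acc

    assert execute(0, 2, 3) == 5          # 2 + 3
    assert execute(1, 2, 3) == 9          # 2*3 + 3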
> > We may have a standard set of microinstructions, but if we get
> > ambitious with the compiler, it could create its own, adapted for the
> > specific kernel.  The furthest we could go here is probably 32 bit
> > instructions and 32 bit microinstruction sequences.
>
> The main benefit we would get from this would be the potential to
> combine certain sequences of opcodes into single opcodes, thereby
> saving time and code space.  We should not do this as an up-front
> optimization, but I can see, given some empirical data, us refactoring
> the instruction set to do this for certain very common cases.
>
> I don't know if you're familiar with "NISC" architectures.  No
> Instruction Set Computer.  In the extreme, the instruction word is
> looked up in a huge table where it gets translated into arbitrary
> signals in the execution pipeline.  Of course, the practical problem I
> have with this is the huge LUT.

Thanks for the explanation.  Yes, it's halfway to NISC, but we don't
add more configurability at each stage than what we think will be
useful, both to save LUTs and instruction bandwidth.

> Oh, and there's the fact that going so non-traditional, the compiler
> would probably be impossible to write.  Think about how hard it is to
> support something as "straightforward" as predication.  It seemed like
> a fantastic idea for Itanium, except that no one has ever been able to
> take proper advantage of it.

The microinstruction images could be hand-written, partly by us and
partly by developers.  The latter would probably require developers to
recompile the compiler, unless we have a pool of potential
instructions which can be turned on by compiler flags.

> >> What did you have in mind?
> >
> > Given that we only save 17/20 of the space, I don't have a feasible
> > solution, but for the curious:
> >
> > Let a continuation point be a) the target address of a conditional
> > jump or b) the address of the instruction after it, as chosen by the
> > compiler for that instruction, or c) the address directly following
> > a load instruction.  We declare continuation points to be a limited
> > resource, in the sense that each kernel can have only a fixed number
> > of them.  Thus, we can use a queue for each continuation point to
> > collect threads which have reached that point.
>
> If I understand you correctly, you're suggesting that any stalled
> thread can be put to sleep and then rescheduled arbitrarily on any
> shader.  Meanwhile, non-stalled threads can be migrated to shaders
> that have spare execution slots.

I can see another problem: the context would no longer be localised to
a unit unless it can also be migrated.  Anyway, we're not doing this.

> With a task-based design, kernels are short-lived, and multitasking
> becomes cooperative.  We can dispense with any hardware support for
> context switching.
>
> I like the compromise that Andre proposed.  In a later iteration, we
> could extend the number of tasks assigned at one time to a shader from
> 8 to 16.  There are still 8 execution slots, but with 16 tasks to run,
> no slot will go unused unless more than 8 of the threads are stalled
> on memory.

If we were still talking about re-combining groups, then I wouldn't
see the benefit: by grouping threads we only reduce the number of
BRAMs by a factor of 17/20, whereas going from 8 to 16 threads costs
almost a factor of 2.  But you mention memory, in which case it makes
sense.  Except, I'd suggest adding only 1 or 2 threads due to their
cost.  As you can see below, I was even careful not to add a single
extra thread to the ALUs.
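For the record, the back-of-envelope model behind these numbers, with
S execution slots and N resident tasks of which some are stalled on
memory (nothing more than arithmetic, but it shows why 1 or 2 extra
threads already cover the common case):

    def busy_slots(slots, tasks, stalled):
        # Slots go unused only when fewer than `slots` tasks are runnable.
        return min(slots, tasks - stalled)

    assert busy_slots(8, 16, 8) == 8    # 16 tasks tolerate 8 stalls
    assert busy_slots(8, 16, 9) == 7    # ...but not 9
    assert busy_slots(8, 9, 1) == 8     # +1 thread hides one load
    assert busy_slots(8, 10, 2) == 8    # +2 hide two concurrent loads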
We should also not completely fix the number of threads per pipeline
before we design it.  If we can save one stage, so that we have 7
instead of 8 stages, do we have an argument for rounding up to 8
threads?  If not, we can go with 7 threads and thus fit more
pipelines.

> > On the other hand, I think we could use some queueing and sorting to
> > optimise memory access now that we go with independent threads.  We
> > have 60 pipelines, so it seems reasonable to spend a bit of logic to
> > keep them busy.  Instead of letting the ALU do loads, we send these
> > instructions to a shared memory unit.
>
> There would be a dedicated "load unit".  When FETCH detects a load
> instruction, it issues the instruction but immediately puts that
> thread to sleep if there is no read data available.  Essentially,
> NOOPs get issued until the read return queue is no longer empty.
>
> > It may be tempting to add one or two extra threads per ALU to keep
> > the ALU busy, but due to the cost and the low frequency of loads, it
> > may be better to send a "phantom" down the ALU for the thread doing
> > the load.  The result of the load can be fetched back via a short
> > return queue on each ALU.  This could be just one or two slots if we
> > allow stalling in rare cases.  As soon as a "phantom" comes out of
> > the ALU, a real thread is dequeued and passed down in its place.
>
> I'm not sure, but you may be saying the same thing.  :)

No.  The load instruction does not prevent the execution of the next
thread.  Say thread I does a load and, for simplicity, the pipeline
only has 4 stages:

    (  I0,   H0,   G0,   F0)
    (  F1, noop,   H0,   G0)
    (  G1,   F1, noop,   H0)
    (  H1,   G1,   F1, noop)
    (noop,   H1,   G1,   F1)  <-- return queue from load is still empty
    (  F2, noop,   H1,   G1)  <-- result for thread I ready
    (  G2,   F2, noop,   H1)
    (  H2,   G2,   F2, noop)
    (  I1,   H2,   G2,   F2)  <-- thread I re-scheduled

It should be noted that if there are several loads in flight for a
pipeline, we can plug the first noop which is about to re-enter the
pipeline with whichever thread finishes its load first.  If we can
afford to add one extra thread per ALU (as a moderate interpretation
of Andre's proposal), we can keep the ALU 100% busy for the first
load.  For the second, we still create a hole, but now this hole has a
higher chance of being plugged faster since there are two candidate
threads.
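If it helps, here is the same scheme as a toy round-robin model in
Python.  It is not cycle-exact, and the 4 stages and 5-cycle load
latency are arbitrary; it only shows a phantom circulating in thread
I's slot until the return queue delivers:

    from collections import deque

    def simulate(cycles=12, stages=4, load_latency=5):
        ready = deque(["F", "G", "H"])   # thread I has issued a load
        pipe = deque(["-"] * stages)     # "-" is a noop/"phantom"
        for cycle in range(cycles):
            if cycle == load_latency:
                ready.append("I")        # load data arrived on the return queue
            done = pipe.pop()            # thread leaving the last stage
            if done != "-":
                ready.append(done)       # it may re-enter immediately
            # Dequeue a real thread if one is ready, else send a phantom.
            pipe.appendleft(ready.popleft() if ready else "-")
            print(cycle, list(pipe))

    simulate()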
> > Once memory requests go to a shared unit, maybe we can spend some
> > transistors on it?  We have four memory rows, as far as I understand.
> > Compare each incoming request and queue it for the correct row if it
> > matches one.  Otherwise pass it into a holding area where we do some
> > heuristics, which I haven't quite worked out, to elect the next row
> > to open once one of the former queues dries out.
>
> I'm not sure what you're saying here.  Are you talking about a shader
> or the memory system?

I'm talking about any load from the main memory.  I'm not sure where
the shader comes in here.  In more detail, a simple design could go
something like this: stage 1 compares the row of the load with the
currently open row.  If it matches and the queue for that row is not
full, or if the queue is empty, then stage 2 will enqueue the request.
Stages 2 through 5 do the same thing for the next 3 available open
rows.  Requests which fall through are put into a queue which feeds
back to stage 1 on each cycle where the queue of inputs from the ALUs
is empty.

    +-->[cmp]
    |   [cmp] [enq] --> load from row i
    |   [cmp] [enq] --> load from row j
    |   [cmp] [enq] --> load from row k
    [enq]     [enq] --> load from row l

With some extra logic we could reduce the pipeline to 3 stages to
increase the throughput and thus the chance of catching all loads for
a particular row before its queue dries out.  Another extension would
be to make speculative guesses about which row to open next.
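A rough behavioural model of that front end, to pin the idea down; the
row decoding, queue depth, and row-election policy are placeholders,
not a worked-out design:

    from collections import deque

    OPEN_ROWS = 4          # four memory rows, as above
    QUEUE_DEPTH = 8        # placeholder

    def row_of(addr):
        return addr >> 10  # placeholder: row = high address bits

    row_queues = {}        # open row -> queue of pending loads
    holding = deque()      # fall-through requests, fed back to stage 1

    def accept(addr):
        row = row_of(addr)
        if row in row_queues and len(row_queues[row]) < QUEUE_DEPTH:
            row_queues[row].append(addr)     # one of the compare stages hit
        elif len(row_queues) < OPEN_ROWS:
            row_queues[row] = deque([addr])  # a slot is free: open this row
        else:
            holding.append(addr)             # miss: recirculate via feedback

    def retire_row(row):
        # A row's queue dried out; elect a new row by recycling held
        # requests through stage 1 (each considered at most once).
        del row_queues[row]
        for _ in range(len(holding)):
            if len(row_queues) == OPEN_ROWS:
                break
            accept(holding.popleft())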
