On Sun, Oct 11, 2009 at 5:24 AM, Petter Urkedal <[email protected]> wrote:
> On 2009-10-10, Timothy Normand Miller wrote:
>> For integer vector ops like this, you can select the word size by
>> breaking carry chains.  But with float, you would probably need
>> different circuitry for each word size.
>
> I missed the part of the draft spec which says that we're only going to
> support 32 bit FP.  I don't suggest we support 64 bit unless we have to.
> But even if we needed to support both, I think we could find ways to
> share adders and multipliers between them, even if the glue logic
> becomes very different.

We have no plans for double-precision floats.

>> One of the framebuffer formats we'll support will be 8888 ARGB.  By
>> default, I have been assuming that it would get converted, on the way
>> into the shader, into four 32-bit uints or four 32-bit floats
>> (depending on what the converter is set to).  In the shader, we would
>> operate on the channels separately at high precision and then truncate
>> on the way back out.  In theory, we could operate on them in their
>> original 8-bit form.  But I feel that this complicates matters.  Most
>> of the time, the shader will be treating the color channels as floats.
>>  That's how OpenGL models it.
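
(For concreteness, a minimal sketch in Python of the conversion I mean; the
channel layout and truncation behaviour are illustrative assumptions, not a
spec:)

    def unpack_argb8888(pixel):
        # Split a packed 32-bit ARGB pixel into four floats in [0.0, 1.0].
        a = (pixel >> 24) & 0xFF
        r = (pixel >> 16) & 0xFF
        g = (pixel >> 8) & 0xFF
        b = pixel & 0xFF
        return tuple(c / 255.0 for c in (a, r, g, b))

    def pack_argb8888(a, r, g, b):
        # Truncate four floats back into a packed 32-bit ARGB pixel; real
        # hardware might round or clamp differently on the way out.
        to_byte = lambda f: min(255, max(0, int(f * 255.0)))
        return (to_byte(a) << 24) | (to_byte(r) << 16) | \
               (to_byte(g) << 8) | to_byte(b)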
>
> The approach I had in mind for the ALU is to look at FP add and
> FP mul and add enough basic adders and multipliers at the right stages.
> I don't know how much sharing between add and mul is possible without
> extending the pipeline, which we should be careful about.  But, given the
> complexity of the FP ops, I think we have the opportunity to perform
> several integer ops per instruction at the cost of replacing hardwiring
> with LUTs.  Admittedly, I don't think we'll know if it's feasible before
> having a closer look at the FP ops and trying to work it out.

If you can think of a way to do this, propose it.  But I think we
really should stick with a more traditional, well-understood
architecture for the first pass.  Something too exotic could very well
be dismissed as too difficult by those whose help we need to write
compilers for this.

>> > The first stage receives (insn_kind, reg_a, reg_b, insn_no); it fetches
>> > reg_a and reg_b from the thread context and insn_no from the
>> > microinstruction store.  The microinstruction contains maybe one byte
>> > per stage, which will determine what function to perform on the data
>> > from the previous stage.  The available functions are carefully selected
>> > to allow FP add/sub, and FP mult as the hardest constraints.
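
(To check my reading of this, a toy model in Python; the stage count,
function menus, and selector widths are invented for illustration, not
proposed:)

    # Each microinstruction holds one selector byte per stage, indexed by
    # insn_no; each stage applies whichever function its byte selects.
    MICRO_STORE = {
        7: [0, 1, 0],    # insn_no 7 -> one selector byte per stage
    }
    STAGE_MENU = [
        [lambda a, b: a + b, lambda a, b: a - b],  # functions wired at stage 1
        [lambda a, b: a * b, lambda a, b: a],      # stage 2
        [lambda a, b: a, lambda a, b: -a],         # stage 3
    ]

    def execute(insn_no, reg_a, reg_b):
        acc = reg_a
        for stage, sel in enumerate(MICRO_STORE[insn_no]):
            acc = STAGE_MENU[stage][sel](acc, reg_b)
        return acc    # execute(7, a, b) computes a + b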
>>
>> Are you suggesting a microcoded architecture?  Is this microcode table
>> programmable?
>
> Yes & yes, if we can deal with the instruction bandwidth.
>
>> From a theoretical computer science perspective, this is interesting.
>> From a "simplify the hell out of this design" perspective, I'm
>> dubious.
>>
>> Let's say we divide up the instruction into four 8-bit slices, because
>> we have decided (in the hypothetical) to provide 256 local 32-bit
>> registers to each thread.  That leaves 8 bits for the opcode.  That
>> opcode is likely to require some amount of translation.  But I would
>> expect to hard-code that translation into the pipeline logic, and we
>> would select bit patterns (late in the design) to simplify that
>> translation as much as possible.  In theory, we could have a 256xN
>> look-up table to do this, but hard-coded logic that relied on
>> carefully chosen codes would require less gate area.
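
(The hypothetical 8/8/8/8 split, spelled out in Python; the field order is
my assumption:)

    def decode(insn):
        # Split a hypothetical 32-bit instruction into four 8-bit slices.
        opcode = (insn >> 24) & 0xFF   # 256 possible opcodes
        reg_d  = (insn >> 16) & 0xFF   # destination, one of 256 registers
        reg_a  = (insn >> 8) & 0xFF    # source A
        reg_b  = insn & 0xFF           # source B
        return opcode, reg_d, reg_a, reg_b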
>
> You're probably right.  It depends on how important it is to economise on
> the adders and multipliers as compared to LUTs and the BRAMs for the
> microinstructions.  Whether to add a microcode table is the basic
> decision; if we do, then how wide it is and how many LUTs we throw in
> can be adjusted according to resources.
>
> An argument for adding complexity to the ALU, whether this is through
> microcode or more complex instructions in general, is that the program is
> something the compiler can know in advance.  It cannot know how memory
> accesses or branches go, but locally it has the opportunity to optimise
> a linear sequence of instructions.  The basic functions which are
> combined in the ALU come with the promise that there are no
> unpredictable interruptions.

And for this optimization, I'm expecting we'll have to rely on the
existing GCC infrastructure.  We have to fight our urges to be overly
clever with this architecture when it complicates things for others,
because we have a very practical goal here:  Design a GPU that works.
Remember, we can always revise and revise.  Nothing is set in stone.
Designing something mundane in and of itself will be educational for
us.  So don't underestimate the value of reinventing the wheel.

>> > We may have a standard set of microinstructions, but if we get ambitious
>> > with the compiler, it could create its own, adapted for the specific
>> > kernel.  The furthest we could go here is probably 32 bit instructions
>> > and 32 bit microinstruction sequences.
>>
>> The main benefit we would get from this would be the potential to
>> combine certain sequences of opcodes into single opcodes thereby
>> saving time and code space.  We should not do this as an up-front
>> optimization, but I can see, given some empirical data, us refactoring
>> the instruction set to do this for certain very common cases.
>>
>> I don't know if you're familiar with "NISC" architectures.  No
>> Instruction Set Computer.  In the extreme, the instruction word is
>> looked up in a huge table where it gets translated into arbitrary
>> signals in the execution pipeline.  Of course, the practical problem I
>> have with this is the huge LUT.
>
> Thanks for the explanation.  Yes, it's halfway to NISC, but we don't add
> more configurability at each stage than what we think will be useful,
> both to save LUTs and instruction bandwidth.
>
>> Oh, and there's the fact that going
>> so non-traditional, the compiler would probably be impossible to
>> write.  Think about how hard it is to support something as
>> "straightforward" as predication.  It seemed like a fantastic idea for
>> Itanium, except that no one has ever been able to take proper
>> advantage of it.
>
> The microinstruction images could be hand-written, partly by us and
> partly by developers.  The latter would probably require developers to
> recompile the compiler, unless we have a pool of potential instructions
> which can be turned on by compiler flags.

This shader design will be FPGA-only for a long time.  Don't make the
microcode programmable for our sake.  Only make something programmable
if it'll be very likely to (a) make it easier to write the compiler or
less importantly (b) make the tasks run faster.

>
>> >> What did you have in mind?
>> >
>> > Given that we only save 17/20 of the space, I don't have a feasible
>> > solution, but for the curious:
>> >
>> >    Let a continuation point be (a) either the target address or the
>> >    address of the instruction after a conditional jump, as chosen by the
>> >    compiler for that instruction, or (b) the address directly following
>> >    a load instruction.  We declare continuation points to be a limited
>> >    resource, in the sense that each kernel can have a finite number.
>> >    Thus, we can use a queue for each continuation point to collect
>> >    threads which have reached that point.
>>
>> If I understand you correctly, you're suggesting that any stalled
>> thread can be put to sleep and then rescheduled arbitrarily on any
>> shader.  Meanwhile, non-stalled threads can be migrated to shaders
>> that have spare execution slots.
>
> I can see another problem:  The context would no longer be localised to
> a unit unless it also can be migrated.  Anyway, we're not doing this.
>
>> With a task-based design, kernels are short-lived, and multitasking
>> becomes cooperative.  We can dispense with any hardware support for
>> context switching.
>>
>> I like the compromise that Andre proposed.  In a later iteration, we
>> could extend the number of tasks assigned at one time to a shader from
>> 8 to 16.  There are still 8 execution slots, but with 16 tasks to run,
>> no slot will go unused unless more than 8 of the threads are stalled
>> on memory.
>
> If we were still talking about re-combining groups, then I wouldn't see
> the benefit:  By grouping threads we only reduce the number of BRAMs by
> a factor of 17/20, whereas going from 8 to 16 threads costs almost a
> factor of 2.

No.  Not recombining groups.  Symmetric multithreading.  Round-robin
scheduling of multiple threads sequentially to the same pipeline.  One
icache, one set of functional units, N register files, N contexts.
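
Roughly this, as a Python sketch (the structure is illustrative, not a
spec):

    class Pipeline:
        # One shared icache and one set of functional units; per-thread
        # state is limited to a PC and a register file.
        def __init__(self, icache, n_contexts):
            self.icache = icache
            self.pc = [0] * n_contexts
            self.regs = [[0] * 256 for _ in range(n_contexts)]  # N reg files
            self.next = 0

        def issue(self):
            # Each cycle, issue from the next context in strict rotation.
            t = self.next
            self.next = (self.next + 1) % len(self.pc)
            insn = self.icache[self.pc[t]]
            self.pc[t] += 1
            return t, insn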

>
> But you mention memory, in which case it makes sense.  Except, I'd
> suggest adding only 1 or 2 threads due to their cost.  As you can see
> below, I was even careful not to add a single extra thread to the ALUs.
>
> We should also not completely fix the number of threads per pipeline
> before we design it.  If we can save one stage, so we have 7 instead of
> 8 stages, do we have an argument for rounding up to 8?  If not, we can
> go with 7 threads and thus fit more pipelines.

If we assign N threads to 1 pipeline, we need one icache, one of each
functional unit, but N register files.  If we allocate 256 registers
for a thread, then we'll want to assign threads in even numbers since
the BRAMs hold 512 words.
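
Spelling out the arithmetic (512 and 256 are the figures above):

    BRAM_WORDS = 512
    REGS_PER_THREAD = 256
    threads_per_bram = BRAM_WORDS // REGS_PER_THREAD             # exactly 2
    brams_needed = lambda n: -(-n * REGS_PER_THREAD // BRAM_WORDS)  # ceiling
    assert brams_needed(7) == brams_needed(8) == 4  # 7 threads waste half a BRAM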

>
>> > On the other hand, I think we could use some queueing and sorting to
>> > optimise memory access now that we're going with independent threads.
>> > We have 60 pipelines, so it seems reasonable to spend a bit of logic
>> > to keep them busy.  Instead of letting the ALU do loads, we send these
>> > instructions to a shared memory unit.
>>
>> There would be a dedicated "load unit".  When FETCH detects a load
>> instruction, it issues the instruction but immediately puts that
>> thread to sleep if there is no read data available.  Essentially,
>> NOOPs get issued until the read return queue is no longer empty.
>>
>> > It may be tempting to add one or
>> > two extra threads per ALU to keep the ALU busy, but due to the cost and
>> > the low frequency of loads, it may be better to send a "phantom" down
>> > the ALU for the thread doing the load.  The result of the load can be
>> > fetched back via a short return queue on each ALU.  This could be just
>> > one or two slots if we allow stalling in rare cases.  As soon as a
>> > "phantom" comes out of the ALU, a real thread is dequeued and passed
>> > down in its place.
>>
>> I'm not sure, but you may be saying the same thing.  :)
>
> No.  The load instruction does not prevent the execution of the next
> thread.  Say thread I does a load, and for simplicity assume only 4 stages:
>
>  (  I0,   H0,   G0,   F0)
>  (  F1, noop,   H0,   G0)
>  (  G1,   F1, noop,   H0)
>  (  H1,   G1,   F1, noop)
>  (noop,   H1,   G1,   F1)  <-- return queue from load is still empty
>  (  F2, noop,   H1,   G1)  <-- result for thread I ready
>  (  G2,   F2, noop,   H1)
>  (  H2,   G2,   F2, noop)
>  (  I1,   H2,   G2,   F2)  <-- thread I re-scheduled
>
> It should be noted that if there are several loads in action for a
> pipeline, we can plug the first noop which is about to re-enter the
> pipeline with the first thread from the pipeline which finishes the
> load.

Those noops could go on for a long time.  The memory latency is
unpredictable.  Besides, this is moot if we dispense with the MIMD.
I'm assuming your columns are different threads.  Getting rid of MIMD
means that we just round-robin N threads on one pipeline.  We're not
gluing together more than one pipeline.
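
To put rough numbers on how long those noops can go on, here's a toy Python
model of the rotation with a parked thread; the latency and load frequency
are made-up parameters:

    import random
    from collections import deque

    # Threads rotate through one pipeline; a thread that issues a load is
    # parked until its result returns, and noops fill its slot until then.
    def simulate(n_threads=4, cycles=1000, load_prob=0.25, latency=6):
        ready = deque(range(n_threads))   # threads eligible to issue
        pending = {}                      # thread -> cycle its load returns
        busy = 0
        for cycle in range(cycles):
            for t, done in list(pending.items()):
                if done <= cycle:         # load returned: wake the thread
                    del pending[t]
                    ready.append(t)
            if ready:
                t = ready.popleft()
                busy += 1
                if random.random() < load_prob:
                    pending[t] = cycle + latency  # park it; noops fill the slot
                else:
                    ready.append(t)               # back into the rotation
            # else: nothing ready, so a noop goes down the pipeline
        return busy / cycles

    print("utilisation: %.0f%%" % (100 * simulate()))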

>
> If we can afford to add one extra thread per ALU (as a moderate
> interpretation of Andre's proposal), we can keep the ALU 100% busy for
> the first load.  For the second, we still create a hole, but now this
> hole has a higher chance of being plugged faster since there are two
> candidate threads.
>
>> > Once memory requests go to a shared unit, maybe we can spend some
>> > transistors on it?  We have four memory rows, as far as I understand.
>> > Compare each incoming request and queue it for the correct row if it
>> > matches one.  Otherwise pass it into a holding area where we apply some
>> > heuristics I haven't quite worked out to elect which row will be the
>> > next to open once one of the former queues dries out.
>>
>> I'm not sure what you're saying here.  Are you talking about a shader
>> or the memory system?
>
> I'm talking about any load from the main memory.  I'm not sure where the
> shader comes in here.  In more detail, a simple design could go
> something like:
>
> Stage 1 compares the row of the load with the currently open row.  If it
> matches and the queue for that row is not full, or if that queue is
> empty, then stage 2 will enqueue the request.  Stages 2 through 5 do the
> same thing for the next 3 available open rows.  Requests which fall
> through are put into a queue which feeds back to stage 1 on each cycle
> where the queue of inputs from the ALUs is empty.
>
>   +-->[cmp]
>   |   [cmp]  [enq] --> load from row i
>   |   [cmp]  [enq] --> load from row j
>   |   [cmp]  [enq] --> load from row k
>  [enq]        [enq] --> load from row l
>
> With some extra logic we could reduce the pipeline to 3 stages to
> increase the throughput and thus the chance of catching all loads for a
> particular row before its queue dries out.
>
> Another extension would be to make speculative guesses of which row to
> open next.

So, basically, you're reordering memory accesses, prioritizing row
hits and deferring row misses.  Certainly.  We'll also have a dcache
which needs to be checked.
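
Something like this, as a rough Python sketch of the policy (the four-row
figure comes from above; the queue sizes and promotion rule are assumptions,
and the dcache check is ignored entirely):

    from collections import deque

    class RowScheduler:
        # Queue row hits behind their open row, hold row misses back, and
        # promote a held miss when a row's queue dries out.
        def __init__(self, row_of, n_open_rows=4):
            self.row_of = row_of         # function: address -> DRAM row
            self.open_rows = {}          # row -> deque of pending loads
            self.holding = deque()       # misses waiting for a row to open
            self.n_open_rows = n_open_rows

        def enqueue(self, addr):
            row = self.row_of(addr)
            if row in self.open_rows:
                self.open_rows[row].append(addr)      # row hit: keep it busy
            elif len(self.open_rows) < self.n_open_rows:
                self.open_rows[row] = deque([addr])   # spare slot: open row
            else:
                self.holding.append(addr)             # defer the miss

        def close_idle_rows(self):
            # A row whose queue dried out is closed; a held miss takes over.
            for row in [r for r, q in self.open_rows.items() if not q]:
                del self.open_rows[row]
                if self.holding:
                    self.enqueue(self.holding.popleft())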

We should try to make the interface to memory somewhat generic.  That
is, let's design the shaders and memory system somewhat independently.
This makes it easier for us to think about it and also makes it
possible for us to repurpose the shader engine.  One friend of mine
had the idea to put a single shader on a chip and sell that as a
microcontroller to be used in things like robotics.  In this case, the
icache would just be the program file and the dcache would be all of
main memory.


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project