On Sat, Oct 10, 2009 at 12:30 PM, Petter Urkedal <[email protected]> wrote:
> On 2009-10-10, Timothy Normand Miller wrote:
>> In light of your comment, I say we ditch the icache space optimization
>> for now.  Or at most, we might consider feeding TWO pipelines from the
>> same icache, since the BRAMs are dual-ported, but we can even make
>> that an afterthought.
>
> How many bits can we read off a BRAM in one cycle?  In HQ we configured
> the BRAM as 512 words of 32 bits, so with two ports, is 64 bits the
> limit?  I'm thinking about how much control we have to give the program
> over the ALU by translating one instruction into microinstructions to
> configure each stage of the ALU.  That could allow a single full-width
> FP, two half-width, or several simple integer operations.  It would go
> something like this:

For integer vector ops like this, you can select the word size by
breaking carry chains.  But with float, you would probably need
different circuitry for each word size.
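To illustrate what I mean by breaking carry chains, here's a quick Python sketch (names and lane layout are mine, not anything we've designed) of how one wide adder behaves as several narrow ones when no carry crosses a lane boundary:

```python
# Sketch only: SWAR-style addition on a 32-bit word, emulating a hardware
# adder whose carry chain is broken at selectable lane boundaries.

MASK32 = 0xFFFFFFFF

def swar_add(a, b, lane_bits):
    """Add packed unsigned lanes of `lane_bits` each inside a 32-bit word,
    with no carry propagating across lane boundaries."""
    # High bit of every lane, e.g. 0x80808080 for 8-bit lanes.
    high = 0
    for shift in range(lane_bits - 1, 32, lane_bits):
        high |= 1 << shift
    low = MASK32 & ~high
    # Add the low bits of each lane, then restore each lane's high bit
    # with XOR so no carry crosses into the neighboring lane.
    return ((a & low) + (b & low)) ^ ((a ^ b) & high)

# Four independent 8-bit adds in one 32-bit operation:
# 0xFF + 0x01 wraps to 0x00 within its lane instead of carrying out.
print(hex(swar_add(0xFF01FF01, 0x01010101, 8)))  # 0x20002
```

The same trick doesn't extend to floats, which is the point: FP needs distinct alignment/normalization circuitry per word size.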

One of the framebuffer formats we'll support will be 8888 ARGB.  By
default, I have been assuming that it would get converted, on the way
into the shader, into four 32-bit uints or four 32-bit floats
(depending on what the converter is set to).  In the shader, we would
operate on the channels separately at high precision and then truncate
on the way back out.  In theory, we could operate on them in their
original 8-bit form.  But I feel that this complicates matters.  Most
of the time, the shader will be treating the color channels as floats.
That's how OpenGL models it.
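To make the conversion concrete, here's a sketch of what I'm assuming the converter does (the rounding choice on the way out is my assumption; the hardware might simply truncate):

```python
# Sketch of the assumed framebuffer converter, not the actual datapath:
# unpack a packed 8888 ARGB pixel into four floats on the way into the
# shader, and quantize back to 8 bits per channel on the way out.

def unpack_argb8888(pixel):
    """Split a packed 32-bit ARGB pixel into four floats in [0.0, 1.0]."""
    a = ((pixel >> 24) & 0xFF) / 255.0
    r = ((pixel >> 16) & 0xFF) / 255.0
    g = ((pixel >> 8) & 0xFF) / 255.0
    b = (pixel & 0xFF) / 255.0
    return a, r, g, b

def pack_argb8888(a, r, g, b):
    """Clamp each channel and quantize back to 8 bits (round-to-nearest
    here; the hardware might truncate instead)."""
    def chan(x):
        return min(255, max(0, int(x * 255.0 + 0.5)))
    return (chan(a) << 24) | (chan(r) << 16) | (chan(g) << 8) | chan(b)

# Round trip: full-precision math in the middle, 8-bit storage at the ends.
a, r, g, b = unpack_argb8888(0xFF8040C0)
print(hex(pack_argb8888(a, r, g, b)))  # 0xff8040c0
```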

>
> The first stage receives (insn_kind, reg_a, reg_b, insn_no) it fetches
> reg_a and reg_b from the thread context and insn_no from the
> microinstruction store.  The microinstruction contains maybe one byte
> per stage, which will determine what function to perform on the data
> from the previous stage.  The available functions are carefully selected
> to allow FP add/sub and FP mult as the hardest constraints.

Are you suggesting a microcoded architecture?  Is this microcode table
programmable?

From a theoretical computer science perspective, this is interesting.
From a "simplify the hell out of this design" perspective, I'm
dubious.

Let's say we divide up the instruction into four 8-bit slices, because
we have decided (in the hypothetical) to provide 256 local 32-bit
registers to each thread.  That leaves 8 bits for the opcode.  That
opcode is likely to require some amount of translation.  But I would
expect to hard-code that translation into the pipeline logic, and we
would select bit patterns (late in the design) to simplify that
translation as much as possible.  In theory, we could have a 256xN
look-up table to do this, but hard-coded logic that relied on
carefully chosen codes would require less gate area.
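The hypothetical encoding is simple enough to sketch (the field layout here is purely illustrative, not a proposal):

```python
# Sketch of the hypothetical 32-bit encoding above: four 8-bit slices,
# one opcode byte plus three register fields, enough to address 256
# per-thread registers.  Field order is illustrative only.

def decode(insn):
    """Split a 32-bit instruction word into (opcode, dest, src_a, src_b)."""
    opcode = (insn >> 24) & 0xFF
    dest = (insn >> 16) & 0xFF
    src_a = (insn >> 8) & 0xFF
    src_b = insn & 0xFF
    return opcode, dest, src_a, src_b

# e.g. opcode 0x12, dest r3, sources r4 and r5:
print(decode(0x12030405))  # (18, 3, 4, 5)
```

The opcode-to-control translation step after this is what I'd hard-code into the pipeline rather than put in a 256xN LUT.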

> We may have a standard set of microinstructions, but if we get ambitious
> with the compiler, it could create its own, adapted for the specific
> kernel.  The furthest we could go here is probably 32 bit instructions
> and 32 bit microinstruction sequences.

The main benefit we would get from this would be the potential to
combine certain sequences of opcodes into single opcodes, thereby
saving time and code space.  We should not do this as an up-front
optimization, but I can see, given some empirical data, us refactoring
the instruction set to do this for certain very common cases.

I don't know if you're familiar with "NISC" architectures.  No
Instruction Set Computer.  In the extreme, the instruction word is
looked up in a huge table where it gets translated into arbitrary
signals in the execution pipeline.  Of course, the practical problem I
have with this is the huge LUT.  Oh, and there's the fact that going
so non-traditional, the compiler would probably be impossible to
write.  Think about how hard it is to support something as
"straightforward" as predication.  It seemed like a fantastic idea for
Itanium, except that no one has ever been able to take proper
advantage of it.

> Then, maybe it's better to encode it in the pipeline and save the BRAMs,
> esp since I'm arguing not to group threads.
>
>> And moreover, we need to think about how many independent paths there
>> will be to the global dcache.  Lots of shaders trying to hit memory at
>> once will bog down, serialized really.  The main reason we have so many
>> shaders, actually, is because the proportion of math and flow control
>> instructions in a kernel should be high compared to the number of
>> memory accesses.
>
> See below.
>
>> > The proposed architecture is nice given a mostly linear flow of
>> > instructions which only use local memory, but can deal with the more
>> > general case effectively?  If threads were much more lightweight, it
>> > would seem easier to come up with a solution.
>>
>> What did you have in mind?
>
> Given that we only save 17/20 of the space, I don't have a feasible
> solution but for the curious:
>
>    Let a continuation point be a) either the target address or the
>    address of the instruction after a conditional jump, as chosen by the
>    compiler for that instruction, or b) the address directly following a
>    load instruction.  We declare continuation points to be a limited resource,
>    in the sense that each kernel can have a finite number.  Thus, we
>    can use a queue for each continuation point to collect threads which
>    have reached that point.

If I understand you correctly, you're suggesting that any stalled
thread can be put to sleep and then rescheduled arbitrarily on any
shader.  Meanwhile, non-stalled threads can be migrated to shaders that have
spare execution slots.
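My reading of the continuation-point scheme, as a toy model (all names and structure are mine, not a design): one queue per continuation point, and any shader with a free slot can dequeue a parked thread.

```python
# Toy model of my reading of the continuation-point idea: each kernel
# has a fixed (limited) set of continuation points, each with a queue.
# A thread that reaches one parks its context there until any shader
# with a free execution slot picks it up.

from collections import deque

class ContinuationScheduler:
    def __init__(self, num_points):
        # One queue per continuation point; a bounded resource per kernel.
        self.queues = [deque() for _ in range(num_points)]

    def park(self, point, thread_ctx):
        """Thread reached continuation point `point`; park its context."""
        self.queues[point].append(thread_ctx)

    def dispatch(self):
        """A shader has a free slot: hand it any parked thread."""
        for q in self.queues:
            if q:
                return q.popleft()
        return None  # nothing runnable right now

sched = ContinuationScheduler(num_points=4)
sched.park(2, "thread-A")
sched.park(0, "thread-B")
print(sched.dispatch())  # thread-B (queue 0 checked first)
```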

With a task-based design, kernels are short-lived, and multitasking
becomes cooperative.  We can dispense with any hardware support for
context switching.

I like the compromise that Andre proposed.  In a later iteration, we
could extend the number of tasks assigned at one time to a shader from
8 to 16.  There are still 8 execution slots, but with 16 tasks to run,
no slot will go unused unless more than 8 of the threads are stalled
on memory.
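The arithmetic of that compromise is easy to model (purely illustrative, nothing here is committed hardware):

```python
# Rough model of the 16-tasks-over-8-slots compromise: each cycle the
# shader fills its 8 execution slots from whichever of its 16 resident
# tasks are not stalled on memory.

def fill_slots(tasks, stalled, num_slots=8):
    """Pick up to `num_slots` runnable tasks out of the resident set."""
    runnable = [t for t in tasks if t not in stalled]
    return runnable[:num_slots]

tasks = list(range(16))            # 16 resident tasks
stalled = {0, 1, 2, 3, 4}          # 5 waiting on memory
slots = fill_slots(tasks, stalled)
print(len(slots))  # 8 -- no slot idles while <= 8 tasks are stalled
```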

> On the other hand, I think we could use some queueing and sorting to
> optimise memory access if we go with independent threads.  We
> have 60 pipelines, so it seems reasonable to spend a bit of logic to keep
> them busy.  Instead of letting the ALU do load, we send these
> instructions to a shared memory unit.

There would be a dedicated "load unit".  When FETCH detects a load
instruction, it issues the instruction but immediately puts that
thread to sleep if there is no read data available.  Essentially,
NOOPs get issued until the read return queue is no longer empty.
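In pseudocode, the FETCH-side behavior I'm describing looks something like this (all names are mine; this is a sketch of the idea, not the design):

```python
# Sketch of the load-unit interaction described above: when FETCH sees
# a load and no read data has returned yet, the thread sleeps and its
# slot issues NOOPs until the read-return queue is no longer empty.

from collections import deque

def step(insn, read_returns, asleep):
    """One FETCH decision for a single thread slot.
    Returns (action, asleep_after)."""
    if asleep:
        if read_returns:             # load data has come back
            read_returns.popleft()
            return "wake+resume", False
        return "noop", True          # still waiting on memory
    if insn == "load" and not read_returns:
        return "issue-load", True    # issue the load, then sleep
    return "issue", False            # ordinary instruction

returns = deque()
action, asleep = step("load", returns, False)
print(action)        # issue-load
action, asleep = step(None, returns, asleep)
print(action)        # noop
returns.append("data")
action, asleep = step(None, returns, asleep)
print(action)        # wake+resume
```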

> It may be tempting to add one or
> two extra threads per ALU to keep the ALU busy, but due to the cost and
> the low frequency of loads, it may be better to send a "phantom" down
> the ALU for the thread doing the load.  The result of the load can be
> fetched back via a short return queue on each ALU.  This could be just
> one or two slots if we allow stalling on rare cases.  As soon as a
> "phantom" comes out of the ALU, a real thread is dequeued and passed
> down in place of it.

I'm not sure, but you may be saying the same thing.  :)

> Once memory requests go to a shared unit, maybe we can spend some
> transistors on it?  We have four memory rows, as far as I understand.
> Compare each incoming request and queue them for the correct row if they
> match one.  Otherwise pass them into a holding area where we do some
> heuristics I haven't quite worked out, to elect which will be the next
> row to open once one of the former queues dries out.

I'm not sure what you're saying here.  Are you talking about a shader
or the memory system?


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
