On 2009-10-10, Timothy Normand Miller wrote:

> For integer vector ops like this, you can select the word size by
> breaking carry chains.  But with float, you would probably need
> different circuitry for each word size.

I missed the part of the draft spec which says that we're only going
to support 32 bit FP.  I don't suggest we support 64 bit unless we
have to.  But even if we needed to support both, I think we could find
ways to share adders and multipliers between them, even if the glue
logic becomes very different.
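As an aside, the integer trick Timothy mentions looks like this in
software terms -- the usual SWAR analogy of breaking carry chains,
assuming four 8-bit lanes in a 32-bit word (in hardware the masks
would become AND gates on the carry wires rather than actual masking):

    LO = 0x7f7f7f7f   # low 7 bits of every lane
    HI = 0x80808080   # top bit of every lane

    def add_4x8(a, b):
        # Add the low 7 bits per lane; no carry can cross a lane boundary.
        partial = (a & LO) + (b & LO)
        # Fold the top bits back in with XOR, which produces no carry out.
        return (partial ^ ((a ^ b) & HI)) & 0xffffffff

    def add_1x32(a, b):
        # The same adder with the carry chain left intact.
        return (a + b) & 0xffffffff

    assert add_4x8(0xff010203, 0x01010101) == 0x00020304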
> One of the framebuffer formats we'll support will be 8888 ARGB.  By
> default, I have been assuming that it would get converted, on the way
> into the shader, into four 32-bit uints or four 32-bit floats
> (depending on what the converter is set to).  In the shader, we would
> operate on the channels separately at high precision and then truncate
> on the way back out.  In theory, we could operate on them in their
> original 8-bit form.  But I feel that this complicates matters.  Most
> of the time, the shader will be treating the color channels as floats.
> That's how OpenGL models it.

The approach I had in mind for the ALU is to look at FP add and FP mul
and add enough basic adders and multipliers at the right stages.  I
don't know how much sharing between add and mul is possible without
extending the pipeline, which we should be careful about.  But, given
the complexity of the FP ops, I think we have the opportunity to
perform several integer ops per instruction at the cost of replacing
hardwiring with LUTs.  Admittedly, I don't think we'll know if it's
feasible before having a closer look at the FP ops and trying to work
it out.

> > The first stage receives (insn_kind, reg_a, reg_b, insn_no); it
> > fetches reg_a and reg_b from the thread context and insn_no from the
> > microinstruction store.  The microinstruction contains maybe one byte
> > per stage, which will determine what function to perform on the data
> > from the previous stage.  The available functions are carefully
> > selected to allow FP add/sub and FP mult as the hardest constraints.
>
> Are you suggesting a microcoded architecture?  Is this microcode table
> programmable?

Yes & yes, if we can deal with the instruction bandwidth.

> From a theoretical computer science perspective, this is interesting.
> From a "simplify the hell out of this design" perspective, I'm
> dubious.
>
> Let's say we divide up the instruction into four 8-bit slices, because
> we have decided (in the hypothetical) to provide 256 local 32-bit
> registers to each thread.  That leaves 8 bits for the opcode.  That
> opcode is likely to require some amount of translation.  But I would
> expect to hard-code that translation into the pipeline logic, and we
> would select bit patterns (late in the design) to simplify that
> translation as much as possible.  In theory, we could have a 256xN
> look-up table to do this, but hard-coded logic that relied on
> carefully chosen codes would require less gate area.

You're probably right.  It depends how important it is to economise on
the adders and multipliers as compared to LUTs and the BRAMs for the
microinstructions.  Whether to add a microcode table is the basic
decision; if we do, then how wide it is and how many LUTs we throw in
can be adjusted according to resources.

An argument for adding complexity to the ALU, whether through
microcode or through more complex instructions in general, is that the
program is something the compiler can know in advance.  It cannot know
how memory accesses or branches will go, but locally it has the
opportunity to optimise a linear sequence of instructions.  The basic
functions which are combined in the ALU come with the promise that
there are no unpredictable interruptions.
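To make the microcode idea concrete, here is a purely behavioural
sketch in Python of one thread's trip through a hypothetical 4-stage
ALU.  The stage functions and table contents are invented for
illustration, and pipelining and FP are ignored:

    # One function per control byte; a real design would pick these to
    # cover the sub-steps of FP add/sub and FP mul.
    STAGE_FN = {
        0x00: lambda acc, b: acc,       # pass-through
        0x01: lambda acc, b: acc + b,   # integer add
        0x02: lambda acc, b: acc * b,   # integer mul
        0x03: lambda acc, b: acc ^ b,   # xor
    }

    # The microinstruction store: insn_no indexes a row holding one
    # control byte per stage (4 stages here).
    ucode_store = [
        bytes([0x01, 0x00, 0x00, 0x00]),   # insn 0: plain add
        bytes([0x02, 0x01, 0x00, 0x00]),   # insn 1: multiply-add, a*b + b
    ]

    def execute(insn_no, reg_a, reg_b):
        acc = reg_a
        for ctrl in ucode_store[insn_no]:  # one byte consumed per stage
            acc = STAGE_FN[ctrl](acc, reg_b)
        return acc

    assert execute(0, 2, 3) == 5          # 2 + 3
    assert execute(1, 2, 3) == 9          # 2*3 + 3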
> > We may have a standard set of microinstructions, but if we get
> > ambitious with the compiler, it could create its own, adapted for the
> > specific kernel.  The furthest we could go here is probably 32 bit
> > instructions and 32 bit microinstruction sequences.
>
> The main benefit we would get from this would be the potential to
> combine certain sequences of opcodes into single opcodes, thereby
> saving time and code space.  We should not do this as an up-front
> optimization, but I can see, given some empirical data, us refactoring
> the instruction set to do this for certain very common cases.
>
> I don't know if you're familiar with "NISC" architectures.  No
> Instruction Set Computer.  In the extreme, the instruction word is
> looked up in a huge table where it gets translated into arbitrary
> signals in the execution pipeline.  Of course, the practical problem I
> have with this is the huge LUT.

Thanks for the explanation.  Yes, it's halfway to NISC, but we don't
add more configurability at each stage than what we think will be
useful, both to save LUTs and instruction bandwidth.

> Oh, and there's the fact that going so non-traditional, the compiler
> would probably be impossible to write.  Think about how hard it is to
> support something as "straightforward" as predication.  It seemed like
> a fantastic idea for Itanium, except that no one has ever been able to
> take proper advantage of it.

The microinstruction images could be hand-written, partly by us and
partly by developers.  The latter would probably require developers to
recompile the compiler, unless we have a pool of potential
instructions which can be turned on by compiler flags.

> >> What did you have in mind?
> >
> > Given that we only save 17/20 of the space, I don't have a feasible
> > solution, but for the curious:
> >
> > Let a continuation point be a) the target address of a conditional
> > jump or b) the address of the instruction after it, as chosen by the
> > compiler for that instruction, or c) the address directly following
> > a load instruction.  We declare continuation points to be a limited
> > resource, in the sense that each kernel can have only a fixed number
> > of them.  Thus, we can use a queue for each continuation point to
> > collect threads which have reached that point.
>
> If I understand you correctly, you're suggesting that any stalled
> thread can be put to sleep and then rescheduled arbitrarily on any
> shader.  Meanwhile, non-stalled threads can be migrated to shaders
> that have spare execution slots.

I can see another problem: the context would no longer be localised to
a unit unless it can also be migrated.  Anyway, we're not doing this.

> With a task-based design, kernels are short-lived, and multitasking
> becomes cooperative.  We can dispense with any hardware support for
> context switching.
>
> I like the compromise that Andre proposed.  In a later iteration, we
> could extend the number of tasks assigned at one time to a shader from
> 8 to 16.  There are still 8 execution slots, but with 16 tasks to run,
> no slot will go unused unless more than 8 of the threads are stalled
> on memory.

If we were still talking about re-combining groups, then I wouldn't
see the benefit: by grouping threads we only reduce the number of
BRAMs by a factor of 17/20, whereas going from 8 to 16 threads costs
almost a factor of 2.  But you mention memory, in which case it makes
sense.  Except, I'd suggest adding only 1 or 2 threads due to their
cost.  As you can see below, I was even careful not to add a single
extra thread to the ALUs.
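For the record, the back-of-envelope model behind these numbers, with
S execution slots and N resident tasks of which some are stalled on
memory (nothing more than arithmetic, but it shows why 1 or 2 extra
threads already cover the common case):

    def busy_slots(slots, tasks, stalled):
        # Slots go unused only when fewer than `slots` tasks are runnable.
        return min(slots, tasks - stalled)

    assert busy_slots(8, 16, 8) == 8    # 16 tasks tolerate 8 stalls
    assert busy_slots(8, 16, 9) == 7    # ...but not 9
    assert busy_slots(8, 9, 1) == 8     # +1 thread hides one load
    assert busy_slots(8, 10, 2) == 8    # +2 hide two concurrent loads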
We should also not completely fix the number of threads per pipeline
before we design it.  If we can save one stage, so that we have 7
instead of 8 stages, do we have an argument for rounding up to 8
threads?  If not, we can go with 7 threads and thus fit more
pipelines.

> > On the other hand, I think we could use some queueing and sorting to
> > optimise memory access now that we go with independent threads.  We
> > have 60 pipelines, so it seems reasonable to spend a bit of logic to
> > keep them busy.  Instead of letting the ALU do loads, we send these
> > instructions to a shared memory unit.
>
> There would be a dedicated "load unit".  When FETCH detects a load
> instruction, it issues the instruction but immediately puts that
> thread to sleep if there is no read data available.  Essentially,
> NOOPs get issued until the read return queue is no longer empty.
>
> > It may be tempting to add one or two extra threads per ALU to keep
> > the ALU busy, but due to the cost and the low frequency of loads, it
> > may be better to send a "phantom" down the ALU for the thread doing
> > the load.  The result of the load can be fetched back via a short
> > return queue on each ALU.  This could be just one or two slots if we
> > allow stalling in rare cases.  As soon as a "phantom" comes out of
> > the ALU, a real thread is dequeued and passed down in its place.
>
> I'm not sure, but you may be saying the same thing.  :)

No.  The load instruction does not prevent the execution of the next
thread.  Say thread I does a load and, for simplicity, the pipeline
only has 4 stages:

    (  I0,   H0,   G0,   F0)
    (  F1, noop,   H0,   G0)
    (  G1,   F1, noop,   H0)
    (  H1,   G1,   F1, noop)
    (noop,   H1,   G1,   F1)  <-- return queue from load is still empty
    (  F2, noop,   H1,   G1)  <-- result for thread I ready
    (  G2,   F2, noop,   H1)
    (  H2,   G2,   F2, noop)
    (  I1,   H2,   G2,   F2)  <-- thread I re-scheduled

It should be noted that if there are several loads in flight for a
pipeline, we can plug the first noop which is about to re-enter the
pipeline with whichever thread finishes its load first.  If we can
afford to add one extra thread per ALU (as a moderate interpretation
of Andre's proposal), we can keep the ALU 100% busy for the first
load.  For the second, we still create a hole, but now this hole has a
higher chance of being plugged faster since there are two candidate
threads.
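If it helps, here is the same scheme as a toy round-robin model in
Python.  It is not cycle-exact, and the 4 stages and 5-cycle load
latency are arbitrary; it only shows a phantom circulating in thread
I's slot until the return queue delivers:

    from collections import deque

    def simulate(cycles=12, stages=4, load_latency=5):
        ready = deque(["F", "G", "H"])   # thread I has issued a load
        pipe = deque(["-"] * stages)     # "-" is a noop/"phantom"
        for cycle in range(cycles):
            if cycle == load_latency:
                ready.append("I")        # load data arrived on the return queue
            done = pipe.pop()            # thread leaving the last stage
            if done != "-":
                ready.append(done)       # it may re-enter immediately
            # Dequeue a real thread if one is ready, else send a phantom.
            pipe.appendleft(ready.popleft() if ready else "-")
            print(cycle, list(pipe))

    simulate()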
> > Once memory requests go to a shared unit, maybe we can spend some
> > transistors on it?  We have four memory rows, as far as I understand.
> > Compare each incoming request and queue it for the correct row if it
> > matches one.  Otherwise pass it into a holding area where we do some
> > heuristics, which I haven't quite worked out, to elect the next row
> > to open once one of the former queues dries out.
>
> I'm not sure what you're saying here.  Are you talking about a shader
> or the memory system?

I'm talking about any load from the main memory.  I'm not sure where
the shader comes in here.  In more detail, a simple design could go
something like this: stage 1 compares the row of the load with the
currently open row.  If it matches and the queue for that row is not
full, or if the queue is empty, then stage 2 will enqueue the request.
Stages 2 through 5 do the same thing for the next 3 available open
rows.  Requests which fall through are put into a queue which feeds
back to stage 1 on each cycle where the queue of inputs from the ALUs
is empty.

    +-->[cmp]
    |   [cmp] [enq] --> load from row i
    |   [cmp] [enq] --> load from row j
    |   [cmp] [enq] --> load from row k
    [enq]     [enq] --> load from row l

With some extra logic we could reduce the pipeline to 3 stages to
increase the throughput and thus the chance of catching all loads for
a particular row before its queue dries out.  Another extension would
be to make speculative guesses about which row to open next.
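A rough behavioural model of that front end, to pin the idea down; the
row decoding, queue depth, and row-election policy are placeholders,
not a worked-out design:

    from collections import deque

    OPEN_ROWS = 4          # four memory rows, as above
    QUEUE_DEPTH = 8        # placeholder

    def row_of(addr):
        return addr >> 10  # placeholder: row = high address bits

    row_queues = {}        # open row -> queue of pending loads
    holding = deque()      # fall-through requests, fed back to stage 1

    def accept(addr):
        row = row_of(addr)
        if row in row_queues and len(row_queues[row]) < QUEUE_DEPTH:
            row_queues[row].append(addr)     # one of the compare stages hit
        elif len(row_queues) < OPEN_ROWS:
            row_queues[row] = deque([addr])  # a slot is free: open this row
        else:
            holding.append(addr)             # miss: recirculate via feedback

    def retire_row(row):
        # A row's queue dried out; elect a new row by recycling held
        # requests through stage 1 (each considered at most once).
        del row_queues[row]
        for _ in range(len(holding)):
            if len(row_queues) == OPEN_ROWS:
                break
            accept(holding.popleft())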
