On Thu, Jul 5, 2012 at 10:28 AM, Mark Marshall <[email protected]> wrote:
> Hi.
>
> I'm glad to hear that there's some activity on the OGP again. It seems like
> an interesting idea that Timothy's had, and I'll definitely try to help as
> much as I can.

Thank you. I, along with the REAL GPU researchers of the world, will thank
you. :)

> I like the processor design. It's what Wikipedia calls a barrel processor
> (which isn't a term I've seen used on this list yet:
> http://en.wikipedia.org/wiki/Barrel_processor). It's nice: from a hardware
> point of view you have a very long pipeline; from a software point of view
> there is no pipeline (instruction N has fully completed before instruction
> N+1 has started).

Cool. Thanks for the ref!

> I've had a play with your code, and I have some questions and ideas for
> changes. I have actually made some of these changes, and I attach my
> modified version.
>
> Two minor changes:
> - As this is C++ I'd use bools for booleans, not bit fields.
> - I'd rather not use a hacky macro to convert a uint32 to a float; this
>   should be a union (TO_INT, TO_FLOAT).

I have no attachment to any particular way of doing this. Also, using bools
would likely produce faster code anyway, and performance is an important
feature of a simulator. I'm hoping that the nasty hacks I needed to get some
old thing working will soon evolve out.

One reason to use bitfields, BTW, is if the flags correspond to a special
register that can be copied to/from a regular register. That's useful for
context switches, if we were ever to need them. But all it does is simplify
code in one place, so it's probably not worth slowing everything else down.
This is a LOGICAL simulator, not a physical one, and all physicalities can
be emulated.

> The other coding change that I made was to have a slightly more powerful way
> of defining all of the opcodes. The macro INS_LIST (in oga2-opcodes.h) is a
> list of all opcodes. We use this one list to generate all per-opcode data.

Sounds good.
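
For anyone following along, here is a rough sketch of the "one list generates
all per-opcode data" idea, using the usual X-macro pattern. I have not
reproduced the real oga2-opcodes.h here: EXAMPLE_INS_LIST, the four opcodes,
the OpcodeInfo fields, and lookup_opcode() are all made up for illustration,
and the real INS_LIST may well carry different fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical opcode list. Each entry is X(enum name, mnemonic, writes target).
// These entries are illustrative only, not the contents of oga2-opcodes.h.
#define EXAMPLE_INS_LIST(X) \
    X(ADD, "add", true)     \
    X(MUL, "mul", true)     \
    X(JMP, "jmp", false)    \
    X(NOP, "nop", false)

// Expansion 1: the opcode enum, one enumerator per list entry.
enum Opcode {
#define X(name, mnemonic, writes) OP_##name,
    EXAMPLE_INS_LIST(X)
#undef X
    OP_COUNT  // number of opcodes, always in sync with the list
};

// Expansion 2: a per-opcode table built from the same list.
struct OpcodeInfo {
    const char* mnemonic;
    bool writes_target;
};

static const OpcodeInfo opcode_info[OP_COUNT] = {
#define X(name, mnemonic, writes) { mnemonic, writes },
    EXAMPLE_INS_LIST(X)
#undef X
};

// Look up an opcode by mnemonic; returns OP_COUNT if it is unknown.
Opcode lookup_opcode(const char* text) {
    assert(text != nullptr);
    for (std::size_t i = 0; i < OP_COUNT; ++i) {
        if (std::strcmp(opcode_info[i].mnemonic, text) == 0) {
            return static_cast<Opcode>(i);
        }
    }
    return OP_COUNT;
}
```

Adding an instruction then means adding one line to the list; the enum and
the table cannot drift apart. A parser could call lookup_opcode("jmp") and
consult opcode_info[OP_JMP].writes_target without any hand-written switch.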
I'm also thinking about some meta-coding, where we have one file that fully
defines all instructions, including the code that executes them, and some
preprocessor (ours or someone else's) emits the code that actually
implements all of this at compile time. However, I'd hate to make the code
hard to penetrate for those reading it, and I'd also hate to do something
that hurts portability.

> I added predicate support (a simple syntax for now: [4] before an
> instruction gives it predicate n).

We need a complete parser. At first, we'll parse directly to the decoded
structure for instructions. Later, we can add binary formats. All of this
needs to be rather fluid so that we're minimally tied down to anything
specific. At compile time, we select options.

I need to give you write access to this project. Please privately email me
to coordinate this. Thanks!

> How is the predicate stuff supposed to work? Do you expect nothing to
> happen for an instruction where the predicate is false? (This needed
> adding for jmp.) Your code sets write_target to false if the predicate is
> false, but what about instructions that don't write a target (memory
> requests & jmps)?

In the real pipeline, a slightly more complex decision will be made towards
the top of the pipeline. In most cases, the predicated-out instruction will
get discarded at the top of the pipeline. It saves energy to minimize
switching, and tossing it out early will eliminate switching further down.

> I added simple memory requests. This is a can of worms. I assumed that the
> memory requests only worked on 32-bit quantities and one-in-four 8-bit
> requests get promoted to a 32-bit request. Consider:
>   read-req addr=<foo> len=5
>   read 8
>   read 32
>   read 8
>   read 32
>   read 8
>   read 32
>   read 8
>   read 32

That's not really a valid sequence. We wouldn't allow a compiler to emit
that.

> That sequence will read only 5 32-bit values. The first value read will be
> returned in four 8-bit chunks.
> The following sequence could also be valid (I think it should be):
>   write 8
>   write 8
>   write 8
>   write-req addr=<foo> len=1
>   write 8
> The four 8-bit values get packed into one 32-bit value. The write-req only
> has to happen before the fourth 8-bit write.

True! There are a lot of potential problems with how I templated this.
Also, I'm thinking of eliminating request counts. A request with count > 1
will get adjacent memory words, but adjacent memory words should get
processed by other threads, so that doesn't work. And since we can't predict
how many threads are running, we also can't provide a stride for multi-word
requests. Some GPGPU loads MAY benefit from being able to queue requests for
more than one adjacent word, but few graphics workloads will. Rather, the
rasterizer will provide sets of coordinates that are converted to a single
memory address for each surface involved. All requests in that case will be
single-word.

> (Basically, what I'm saying here is that it seems to make sense to split out
> the part of the memory pipe that does 32-bit to 8-bit splits from the rest
> of it. I think this will make the hardware easier, and we can always
> disallow the slightly bizarre cases above in the compiler?)

I want to do the formatting on the way in, not in the pipeline critical
path. This may require specialized req instructions.

> There are a few instructions that you are missing as well. We need variants
> that load constants (I assume, possibly others, shifting by a constant
> amount, etc.).

That's true. At the very least, a load immediate to register. But Andre and
Kenneth specified a "constants" register file.

> It's probably worth adding some instructions to help calculate some hard
> math functions as well. Square root and inverse square root spring to mind.
> We have a long pipeline, which means we could get a lot done. It would also
> be helpful (if we didn't want to calculate the whole square root) to
> calculate a partial answer.
> It's then much easier to calculate the complete answer in only a few
> instructions.

As an option, we can consider adding a pool of trig and transcendental
function units (along with other things like divide) that are shared among
multiple thread processors.

>   t0 = approx-square-root(X)
>   t1 = (t0 + X/t0) / 2

Newton's method?

> If t0 has n bits of accuracy, t1 has 2n bits of accuracy. (There's a really
> nice geometric proof of this, which I love. We have a rectangle of area X.
> We need to convert it to a square of area X. A more "square" rectangle will
> have one side that is the average of the sides of our initial rectangle.)
>
> Good work so far, and I might hack some more on this. I'd appreciate
> feedback on my changes and ideas.

It all sounds great! Thanks!

> MM
>
> On 1 July 2012 20:22, Timothy Normand Miller <[email protected]> wrote:
>>
>> Here you can browse the source of the first compilable version of the
>> OGA2 simulator.
>>
>> http://sourceforge.net/p/openshader/code/ci/9d84745908ebdf51569b5efaa8b718aa4d81ab4b/tree/simulator/
>>
>> My main goal was to get something to compile and run, however trivial.
>> As a result, there is some truly horrid coding, which I apologize
>> for. If you pull this tree and 'make' it, you'll get a demo that runs
>> three instructions in an infinite loop.
>>
>> --
>> Timothy Normand Miller, PhD
>> http://www.cse.ohio-state.edu/~millerti
>> Open Graphics Project
>> _______________________________________________
>> Open-graphics mailing list
>> [email protected]
>> http://lists.duskglow.com/mailman/listinfo/open-graphics
>> List service provided by Duskglow Consulting, LLC (www.duskglow.com)

--
Timothy Normand Miller, PhD
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
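
P.S. The refinement step quoted above, t1 = (t0 + X/t0) / 2, is easy to try
out in software. The sketch below is only an illustration: approx_sqrt() is a
made-up stand-in for whatever low-precision estimate the hardware might
supply (here a crude guess built from the float's exponent and mantissa), and
newton_sqrt_step() is a hypothetical name for the one-step refinement.
Neither is part of the OGA2 simulator.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical low-precision square-root estimate for x > 0.
// Splits x into mantissa * 2^exponent, halves the exponent, and uses a
// linear guess for sqrt(mantissa). Accurate to within roughly 6 percent.
float approx_sqrt(float x) {
    assert(x > 0.0f);
    int exponent = 0;
    float mantissa = std::frexp(x, &exponent);  // mantissa in [0.5, 1)
    if (exponent % 2 != 0) {
        // Make the exponent even so it can be halved exactly.
        mantissa *= 2.0f;  // mantissa now in [1, 2)
        exponent -= 1;
    }
    float guess = 0.5f * (1.0f + mantissa);  // tangent line to sqrt at 1
    return std::ldexp(guess, exponent / 2);
}

// One Newton step for sqrt(x): average the estimate t with x / t.
// Geometrically, t and x / t are the sides of a rectangle of area x.
float newton_sqrt_step(float x, float t) {
    assert(t > 0.0f);
    return 0.5f * (t + x / t);
}
```

For X = 2 the estimate starts at 1.5, one step gives about 1.41667, and a
second step gives about 1.414216, so the number of correct bits roughly
doubles per step until float precision runs out.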
