On Sun, Mar 17, 2013 at 5:51 PM, Troy Benjegerdes <[email protected]> wrote:
> On Sun, Mar 17, 2013 at 05:04:27PM -0400, Timothy Normand Miller wrote:
> > On Sun, Mar 17, 2013 at 4:31 PM, <[email protected]> wrote:
> > >
> > > > On 2013-03-17 21:20, Timothy Normand Miller wrote:
> > > > <snip>
> > >
> > > I'm looking here at the bit compactness, because fewer bits means
> > > fewer toggles and less power draw.
> > >
> > > Using a bit for each instruction to throw the register away:
> > > 1 bit/instruction.
> > >
> > > Using R0: if you have, say, 64 registers, R0 will amount to
> > > 0.09 bits/address, or if you use 3 addresses: 0.28 bits/instruction.
> > > Drawback: you have to add three 6-input OR gates to decode the R0
> > > condition, or create a custom register set where R0 is "hardwired".
> > > However, all array generators need a regular, homogeneous addressing
> > > space. FPGAs, ASICs, etc. prefer and want memories where all the
> > > cells are identical.
> > >
> >
> > The logic for this isn't a big deal. We can MUX in a zero for an R0
> > source, and there are a few good ways to cancel writeback for an R0
> > destination. (We could also initialize R0 to zero if it's a source, or
> > otherwise just not write to it, but my reliability intuition tells me
> > not to do that.)
>
> I like the R0-is-always-zero simplicity, and maybe add that R 0b1111
> (R31 or whatever) is always all ones.

The bit bucket has been done a lot. Not sure about the other one. But some
ISAs have NOT instructions to compensate. We could always just do this:

    R1 = SUBI(R0, 1)

> Now, for one additional thought: What if we allow a special case for some
> registers where we actually do the write-back, but turn off the
> hazard/stall logic?

Not very useful with multithreading. Having more threads than pipeline
stages obviates this problem.
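The encoding-cost numbers quoted above, and the SUBI all-ones trick, can be
checked with a quick sketch. This assumes a 64-entry register file (6-bit
register fields), 3 register addresses per instruction, and a 32-bit word
for the mask example; none of these parameters are fixed by the thread, so
treat them as illustrative.

```python
# Amortized encoding cost of reserving one register code point for zero,
# assuming 64 registers (6-bit fields) and 3 addresses per instruction.
import math

NUM_REGS = 64
FIELD_BITS = int(math.log2(NUM_REGS))      # 6 bits per register address

# One of the 64 encodings is "spent" on the hardwired zero; amortized
# over the 64 encodings of a field, that is 6/64 bits per address.
bits_per_address = FIELD_BITS / NUM_REGS   # 6/64 = 0.09375
bits_per_instr = 3 * bits_per_address      # ~0.28 with 3 addresses

print(f"{bits_per_address:.2f} bits/address")     # 0.09
print(f"{bits_per_instr:.2f} bits/instruction")   # 0.28

# The SUBI trick: in two's complement, R0 - 1 wraps around to all ones,
# so R1 = SUBI(R0, 1) yields a mask that turns XOR into NOT.
MASK = (1 << 32) - 1          # 32-bit machine word, for example
R0 = 0
R1 = (R0 - 1) & MASK          # SUBI(R0, 1) -> 0xFFFFFFFF
x = 0x0F0F0F0F
assert x ^ R1 == (~x) & MASK  # XOR with the all-ones register acts as NOT
```

So the 0.09/0.28 figures fall out of spreading one 6-bit code point over the
64 encodings of each register field, and a single subtract-immediate is
enough to synthesize the all-ones register an ISA without NOT would want.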
> I can imagine some parallel code algorithms that really care about the
> flags, and might have 16 threads that do some calculation, and you really
> only need the result from one of those threads, and you don't care which
> one, and if you know the result will be in R1, but you don't care who put
> it there.

The idea of this kind of cross-thread communication not going through a
scratch pad gives me the willies. Especially if we get a warp divergence,
where we don't know the relative time of execution. Also, how do you make
sure that you get only one thread to produce the result without a warp
divergence, and how is that different from just having every thread produce
the same result? It would only make sense if that result were precomputed
by an entirely different earlier kernel, in which case we'd use scratch
memory.

> Something in what I said above makes sense in my head, and I'm hoping
> someone else either gets an idea, or can describe something useful better
> than I did above.

Maybe I didn't get it. :)

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
