On Thu, Jul 5, 2012 at 10:28 AM, Mark Marshall <[email protected]> wrote:
> Hi.
>
> I'm glad to hear that there's some activity on the OGP again.  It seems like
> an interesting idea that Timothy's had, and I'll definitely try to help as
> much as I can.

Thank you.  I, along with the REAL GPU researchers of the world, will
thank you.   :)

>
> I like the processor design.  It's what Wikipedia calls a barrel processor
> (which isn't a term I've seen used on this list yet:
> http://en.wikipedia.org/wiki/Barrel_processor).  It's nice: from a hardware
> point of view you have a very long pipeline, but from a software point of view
> there is no pipeline (instruction N has fully completed before instruction
> N+1 has started).

Cool.  Thanks for the ref!
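
A minimal sketch of the barrel idea, assuming strict round-robin issue with one instruction per cycle.  The names (ThreadContext, BarrelScheduler, step) are hypothetical and not taken from the OGA2 simulator.  If there are at least as many threads as pipeline stages, a thread's previous instruction has left the pipeline before that thread issues again, which is why software sees no pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread state: only a program counter in this sketch.
struct ThreadContext {
    unsigned pc = 0;
};

// Issues one instruction per cycle, visiting threads in strict rotation.
class BarrelScheduler {
public:
    explicit BarrelScheduler(std::size_t num_threads) : threads_(num_threads) {
        assert(num_threads > 0);
    }

    // Issue one instruction from the current thread, then rotate to the next.
    // Returns the index of the thread that issued this cycle.
    std::size_t step() {
        const std::size_t issued = next_;
        threads_[issued].pc += 1;  // stand-in for fetch/decode/execute
        next_ = (next_ + 1) % threads_.size();
        return issued;
    }

    unsigned pc(std::size_t thread) const { return threads_[thread].pc; }

private:
    std::vector<ThreadContext> threads_;
    std::size_t next_ = 0;
};
```

With four threads, successive calls to step() issue from threads 0, 1, 2, 3 and then 0 again.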

>
> I've had a play with your code, and I have some questions and ideas for
> changes.
> I have actually made some of these changes, and I attach my modified version.
>
> Two minor changes:
> - As this is C++, I'd use bools for booleans, not bit fields.
> - I'd rather not use a hacky macro to convert a uint32 to a float; this
> should be a union (TO_INT, TO_FLOAT).

I have no attachment to any particular way of doing this.  Also, using
bools would likely produce faster code anyway, and performance is an
important feature of a simulator.  I'm hoping that my nasty hacks
necessary to get some old thing working will soon evolve out.

One reason to use bitfields, BTW, is if the flags correspond to a
special register that can be copied to/from a regular register.
Useful for context switches, if we were ever to need them.  But all it
does is simplify code in one place, so it's probably not worth slowing
everything else down.  This is a LOGICAL simulator, not a physical one, and
all physicalities can be emulated.
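
A rough sketch of both points, under stated assumptions.  For the uint32/float conversion it uses std::memcpy rather than the union suggested above, because reading an inactive union member is undefined behaviour in standard C++ (many compilers tolerate it, but memcpy is well defined and usually compiles to the same code).  The flag names and the helper names to_float, to_int, pack_flags and unpack_flags are hypothetical; the bit layout of the flags word is an arbitrary choice for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

// Reinterpret a 32-bit pattern as a float without type-punning UB.
inline float to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Reinterpret a float as its 32-bit pattern.
inline std::uint32_t to_int(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical condition flags kept as plain bools for fast access.
struct Flags {
    bool zero = false;
    bool negative = false;
    bool carry = false;
};

// Emulate a "flags register" only where it is needed, e.g. a context switch.
inline std::uint32_t pack_flags(const Flags& f) {
    return (f.zero ? 1u : 0u) | (f.negative ? 2u : 0u) | (f.carry ? 4u : 0u);
}

inline Flags unpack_flags(std::uint32_t word) {
    assert((word & ~7u) == 0);  // no unknown flag bits in this sketch
    Flags f;
    f.zero = (word & 1u) != 0;
    f.negative = (word & 2u) != 0;
    f.carry = (word & 4u) != 0;
    return f;
}
```

The bools stay cheap in the hot path, and the packed form is only built when a flags word has to be copied to or from a regular register.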

>
> The other coding change that I made was to have a slightly more powerful way
> of defining all of the opcodes.  The macro INS_LIST (in oga2-opcodes.h) is a
> list of all opcodes.  We use this one list to generate all per-opcode data.

Sounds good.  I'm also thinking about some meta-coding, where we have
one file that fully defines all instructions, including code that
executes the instructions, and by some pre-processor of our own or some
other origin, code is emitted at compile time that actually implements
all of this.  However, I'd hate to make it hard to penetrate for those
reading the code, and I'd also hate to do something that impacts the
portability.
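
For readers unfamiliar with the single-list approach, here is a minimal sketch of the usual "X-macro" pattern using only the standard C preprocessor, so it stays portable.  The opcode entries, the three-field layout and the helper find_opcode are hypothetical; the real INS_LIST in oga2-opcodes.h may well look different.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical master list: OP(name, mnemonic, writes_target).
#define INS_LIST(OP)       \
    OP(ADD, "add", true)   \
    OP(MUL, "mul", true)   \
    OP(JMP, "jmp", false)

// Expand the list once per table we want to generate.
#define AS_ENUM(name, text, writes) OP_##name,
#define AS_TEXT(name, text, writes) text,
#define AS_WRITES(name, text, writes) writes,

enum Opcode { INS_LIST(AS_ENUM) OP_COUNT };

static const char* const kMnemonic[] = { INS_LIST(AS_TEXT) };
static const bool kWritesTarget[] = { INS_LIST(AS_WRITES) };

// Look up an opcode by mnemonic; returns -1 if it is not in the list.
inline int find_opcode(const char* text) {
    assert(text != nullptr);
    for (int i = 0; i < OP_COUNT; ++i) {
        if (std::strcmp(kMnemonic[i], text) == 0) return i;
    }
    return -1;
}
```

Adding an instruction means adding one line to INS_LIST; the enum and both tables pick it up automatically and cannot drift out of sync.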

> I added predicate support (a simple syntax for now. [4] before an

We need a complete parser.  At first, we'll parse directly to the
decoded structure for instructions.  Later, we can add binary formats.
 All of this needs to be rather fluid so that we're minimally tied
down to anything specific.  At compile time, we select options.

I need to give you write access to this project.  Please privately
email me to coordinate this.  Thanks!

> instruction gives it predicate n).  How is the predicate stuff supposed to
> work?  Do you expect nothing to happen for an instruction where the
> predicate is false?  (This needed adding for jmp).  Your code sets
> write_target to false if the predicate is false, but what about instructions
> that don't write a target (memory requests and jmps)?

In the real pipeline, a slightly more complex decision will be made
towards the top of the pipeline.  In most cases, the predicated-out
instruction will get discarded at the top of the pipeline.  It saves
energy to minimize switching, and tossing it out early will eliminate
switching further down.
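
In the simulator, the same effect can be had by testing the predicate before doing anything else, so that jumps and memory requests are covered along with register writes.  This is only a sketch: the Instruction and Thread layouts, the register counts, and the convention that predicate -1 means "unpredicated" are all assumptions, not the current OGA2 code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Op { Add, Jmp };

// Hypothetical decoded instruction.
struct Instruction {
    Op op;
    int predicate;         // predicate register index, or -1 if unpredicated
    int dst, src_a, src_b; // register indices (used by Add)
    std::uint32_t target;  // jump target (used by Jmp)
};

// Hypothetical per-thread architectural state.
struct Thread {
    std::array<std::uint32_t, 8> regs{};
    std::array<bool, 4> preds{};
    std::uint32_t pc = 0;
};

// Execute one instruction. Returns false if it was predicated out.
inline bool step(Thread& t, const Instruction& ins) {
    assert(ins.predicate < static_cast<int>(t.preds.size()));
    if (ins.predicate >= 0 && !t.preds[static_cast<std::size_t>(ins.predicate)]) {
        t.pc += 1;  // discarded at the top: only the PC advances
        return false;
    }
    switch (ins.op) {
    case Op::Add:
        t.regs[static_cast<std::size_t>(ins.dst)] =
            t.regs[static_cast<std::size_t>(ins.src_a)] +
            t.regs[static_cast<std::size_t>(ins.src_b)];
        t.pc += 1;
        break;
    case Op::Jmp:
        t.pc = ins.target;
        break;
    }
    return true;
}
```

Because the check happens once, ahead of the opcode switch, no individual instruction needs its own write_target-style special case.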

>
> I added simple memory requests.  This is a can of worms.  I assumed that the
> memory requests only worked on 32-bit quantities and one-in-four 8-bit
> requests get promoted to a 32-bit request. Consider:
>   read-req addr=<foo> len=5
>   read 8
>   read 32
>   read 8
>   read 32
>   read 8
>   read 32
>   read 8
>   read 32

That's not really a valid sequence.  We wouldn't allow a compiler to emit that.

> That sequence will read only 5 32-bit values.  The first value read will be
> returned in four 8-bit chunks.  The following sequence could also be valid
> (I think it should be):
>   write 8
>   write 8
>   write 8
>   write-req addr=<foo> len=1
>   write 8
> The four 8-bit values get packed into one 32-bit value.  The write-req only
> has to happen before the fourth 8-bit write.

True!
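
A small sketch of the packing step on its own, assuming the first 8-bit write lands in the low byte of the word.  The thread does not specify a byte order, so that choice and the name BytePacker are assumptions.  The timing of the write-req relative to the fourth byte is not modelled here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Collects 8-bit writes and emits one 32-bit word per four bytes.
class BytePacker {
public:
    // Accept one byte. Returns true when this byte completed a 32-bit word.
    bool write8(std::uint8_t byte) {
        assert(count_ < 4);
        word_ |= static_cast<std::uint32_t>(byte) << (8 * count_);
        ++count_;
        if (count_ < 4) return false;
        words_.push_back(word_);  // word is full: hand it to the memory side
        word_ = 0;
        count_ = 0;
        return true;
    }

    // Completed 32-bit words, in the order they were filled.
    const std::vector<std::uint32_t>& words() const { return words_; }

private:
    std::uint32_t word_ = 0;
    unsigned count_ = 0;
    std::vector<std::uint32_t> words_;
};
```

Only the fourth write8 call produces a word, which matches the observation that the request merely has to be in place before that fourth write.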

There are a lot of potential problems with how I templated this.
Also, I'm thinking of eliminating request counts.  A request with
count > 1 will get adjacent memory words, but adjacent memory words
should get processed by other threads, so that doesn't work.  And
since we can't predict how many threads are running, we also can't
provide a stride for multi-word.  Some GPGPU loads MAY benefit from
being able to request to queue more than one adjacent word, but few
graphics workloads.  Rather, the rasterizer will provide sets of
coordinates that are converted to a single memory address for each
surface involved.  All requests in that case will be single-word.

>
> (Basically, what I'm saying here is that it seems to make sense to split out
> the part of the memory pipe that does 32-bit to 8-bit splits from the rest
> of it.  I think this will make the hardware easier, and we can always
> disallow the slightly bizarre cases above in the compiler?)

I want to do the formatting on the way in, not in the pipeline
critical path.  This may require specialized req instructions.

>
> There are a few instructions that you are missing as well.  We need variants
> that load constants (and, I assume, possibly others: shifting by a constant
> amount, etc.).

That's true.  At the very least, a load immediate to register.  But
Andre and Kenneth specified a "constants" register file.

>
> It's probably worth adding some instructions to help calculate some hard
> math functions as well.  Square root and inverse square root spring to mind.
> We have a long pipeline, which means we could get a lot done.  It would also
> be helpful (if we didn't want to calculate the whole square root) to
> calculate a partial answer.  It's then much easier to calculate the complete
> answer in only a few instructions.

As an option, we can consider adding a pool of units for trig and
transcendental functions (along with other things like divide) that are
shared among multiple thread processors.

>
> t0 = approx-square-root(X)
> t1 = (t0 + X/t0) / 2

Newton's method?
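
For reference, the quoted update is the Newton (Heron) step for f(t) = t*t - X.  Writing s for the true square root, one step gives t1 - s = (t0 - s)^2 / (2*t0), so the error is roughly squared each time.  A minimal sketch in double precision follows; the function names are hypothetical, and the initial estimate is simply passed in rather than coming from a hardware approximation instruction.

```cpp
#include <cassert>
#include <cmath>

// One Newton (Heron) refinement of an estimate t0 of sqrt(x).
inline double refine_sqrt(double x, double t0) {
    assert(t0 > 0.0);
    return 0.5 * (t0 + x / t0);
}

// Absolute error of an estimate t against the library square root.
inline double sqrt_error(double x, double t) {
    return std::fabs(t - std::sqrt(x));
}
```

Starting from t0 = 1.5 for X = 2, the first step gives about 1.41667 and the second about 1.414216, so two cheap instructions after a coarse estimate already get close to single-precision accuracy.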

>
> If t0 has n bits of accuracy, t1 has about 2n bits of accuracy.  (There's a
> really nice geometric proof of this, which I love.  We have a rectangle of
> area X.  We need to convert it to a square of area X.  A more "square"
> rectangle will have one side that is the average of the sides of our initial
> rectangle.)
>
> Good work so far, and I might hack some more on this.  I'd appreciate
> feedback on my changes and ideas.

It all sounds great!  Thanks!

>
> MM
>
>
>
> On 1 July 2012 20:22, Timothy Normand Miller <[email protected]> wrote:
>>
>> Here you can browse the source of the first compilable version of the
>> OGA2 simulator.
>>
>>
>> http://sourceforge.net/p/openshader/code/ci/9d84745908ebdf51569b5efaa8b718aa4d81ab4b/tree/simulator/
>>
>> My main goal was to get something to compile and run, however trivial.
>>  As a result, there is some truly horrid coding, which I apologize
>> for.  If you pull this tree and 'make' it, you'll get a demo that runs
>> three instructions in an infinite loop.
>>
>> --
>> Timothy Normand Miller, PhD
>> http://www.cse.ohio-state.edu/~millerti
>> Open Graphics Project
>> _______________________________________________
>> Open-graphics mailing list
>> [email protected]
>> http://lists.duskglow.com/mailman/listinfo/open-graphics
>> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>
>



-- 
Timothy Normand Miller, PhD
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
