On Sat, Jul 19, 2008 at 8:20 AM, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> On 2008-07-18, Timothy Normand Miller wrote:
>> On Fri, Jul 18, 2008 at 1:31 PM, Petter Urkedal <[EMAIL PROTECTED]> wrote:
>> > I believe HQ has not yet been connected to the bridge?
>>
>> No.  Not yet.  We're going to get to that once we have debugged it to
>> this point.  However, that doesn't mean we can't start on the coding
>> and then merge things later.  Would you like to work with me to
>> architect this?
>
> I'm not sure how this will be done, but I can probably help anyway.
>
>  * As I recall from previous discussion, we want to decode the PCI
>    address and dispatch to HQ in hardware, rather than equipping HQ
>    with a control bit to intercept all incoming PCI commands.  Can we
>    assume the BAR for HQ is fixed, or shall HQ be able to configure the
>    pipe from PCI to intercept different address ranges on demand?

The way the bridge works, there can only be one outstanding read
transaction (for however many words are requested).  Once the request
has been made, the bridge switches the IO buffers so that the data
lines can only be used for read data until the transaction is
finished.  We can later consider bypass and through accesses to
memory, but for the moment, it's simpler to just make it all or
nothing.  This applies to both memory and register access, since they
both go through the same bridge.  We need to consider the consequences
of having HQ intercept EVERY access to the bridge.

>  * As far as I can see, there are four clocks involved, the PCI clock,
>    two separate clocks for bridge transmission and reception, and the
>    HQ clock.  Are all these different?

There are 3 or 4.  Here's how it's arranged right now, without HQ,
where the fifos exist to cross from the PCI clock domain to the bridge
clock domain:

bridge write data and commands (PCI clock)  --->  bridge write clock domain
bridge read clock domain  --->  bridge read return data (PCI)

The bridge is constrained to run at 100 MHz, although it's running at
90 MHz right now (I didn't bother to change the clock generator numbers).  I
wouldn't expect HQ to run a whole lot faster, so it should be okay to
run HQ and the bridge at the same speed.  We can revisit that decision
later.

Adding in HQ, some things change.  The command queue (write data and
read requests) can simply be reconnected to HQ (but with a bypass too,
of course).  A read return queue from HQ to PCI is only actually
necessary because we're crossing clock domains; otherwise, it would be
wasted.  HQ writes (and read commands) to the bridge could be direct,
but the bridge can go busy, for instance when video is making a long
read and the queues in the S3 fill, so we should have a queue there
whose fullness we can read in software to decide whether to push
writes (and read commands).  Read data from the bridge to HQ has to be
queued so that HQ can request reads, go off and do something else,
then come back and read the data (or just wait for it).

So that leaves us with four queues:

PCI write/cmd  --->  HQ
HQ write/cmd ---> bridge write
bridge read data ---> HQ
HQ read data --> PCI

With hop-overs when HQ is disabled.
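To make the plumbing concrete, here's a rough Python model of those four queues and the hop-over; the class and method names are hypothetical, and the 16-entry depth is just the distributed-RAM FIFO size discussed here.

```python
from collections import deque

class BridgePath:
    """Toy model of the four clock-crossing queues, with a hop-over
    that skips HQ when it is disabled.  All names are hypothetical."""

    def __init__(self, hq_enabled=True, depth=16):
        self.hq_enabled = hq_enabled
        self.depth = depth            # e.g. 16-entry distributed-RAM FIFOs
        self.pci_to_hq = deque()      # PCI write/cmd    ---> HQ
        self.hq_to_bridge = deque()   # HQ write/cmd     ---> bridge
        self.bridge_to_hq = deque()   # bridge read data ---> HQ
        self.hq_to_pci = deque()      # HQ read data     ---> PCI

    def pci_command(self, cmd):
        # With HQ disabled, PCI commands hop over to the bridge queue.
        q = self.pci_to_hq if self.hq_enabled else self.hq_to_bridge
        if len(q) < self.depth:
            q.append(cmd)
            return True
        return False  # full: software checks fullness before pushing

    def bridge_read_return(self, data):
        # With HQ disabled, read data hops over straight to PCI.
        q = self.bridge_to_hq if self.hq_enabled else self.hq_to_pci
        if len(q) < self.depth:
            q.append(data)
            return True
        return False
```

The explicit fullness check models the "read its fullness in software and push or not push" behavior: a push that would overflow is simply refused.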

> I assume memory access goes through the bridge.  So, we must extend
> xp10_bridge_wrapper.v with an additional internal interface for HQ
> memory operations.  If we need high throughput, is there any alternative
> to two extra FIFOs using two BRAMs?  And yet another two for PCI?

Because of the combined need to be asynchronous and to cross clock
domains, I can't see an alternative.  These can just be 16-entry
distributed-RAM fifos.

> Since HQ's BRAM has an unused port with its own clock domain, it may be
> possible to let the bridge read and write data directly to HQ memory.
> That is, for memory-write, HQ prepares the data in a subrange of its
> BRAM, and tells the bridge to transmit the range to a given memory
> address.  For memory-read, HQ tells the bridge to transfer a memory
> range to a BRAM range.  That could also work for PCI, though we'd need
> to extend HQ internal memory with another BRAM due to the separate clock
> domains.

That could be very useful, for performance and more asynchrony.
However, the bypass won't work that way, so we'd have to implement
both mechanisms.  We should start with the dumber one that works with
bypass and see if we can really benefit from the optimization
afterwards.

BTW, there are some facts about the bus protocol that we might want to
change.  When accessing the bridge, the first cycle is the address,
and the flag bits indicate the target (memory or config registers).

For reads, the subsequent cycle is the word count, after which the bus
switches direction and waits.

For writes, subsequent cycles are data, flags indicate which bytes are
valid, and the address auto-increments.

The address counter in the S3 auto-increments, but it only increments
the lower 7 bits of the word address.  So every 128 32-bit words, it's
required that a new address be sent.  That happens automatically with
PCI due to the way this target is designed, but HQ will have to
enforce it in the program.
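As a worked sketch of that wrap rule, here's how a long transfer could be chunked so that a fresh address goes out whenever the low 7 bits of the word address would wrap (Python; the function name is hypothetical):

```python
WORD_ADDR_WRAP = 128  # the S3 target only increments the low 7 bits

def split_burst(start_word_addr, nwords):
    """Split a burst into (address, count) chunks so that no chunk
    crosses a 128-word boundary, i.e. a new address must be sent
    at every wrap of the low 7 address bits."""
    chunks = []
    addr, remaining = start_word_addr, nwords
    while remaining:
        # Room left before the low 7 bits of the word address wrap.
        room = WORD_ADDR_WRAP - (addr % WORD_ADDR_WRAP)
        n = min(room, remaining)
        chunks.append((addr, n))
        addr += n
        remaining -= n
    return chunks
```

For example, a 20-word write starting at word address 120 needs two address cycles: one for 8 words at 120, then one for 12 words at 128.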

One thing we may want to change is how the target flags are presented.
Right now, they're separate from the address, but the address isn't a
full 32 bits, so they could be prepended to it.  However, it may
actually be faster to keep them separate, potentially saving some HQ
code to extract them.
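For comparison, prepending would look something like this; the 24-bit address width is an assumption for illustration, not the actual bridge width.  The `unpack` step is the extra extraction work HQ avoids if the flags stay on separate wires:

```python
ADDR_BITS = 24                    # assumed width; the real bridge
ADDR_MASK = (1 << ADDR_BITS) - 1  # address is just "less than 32 bits"

def pack(addr, flags):
    # Prepend the target flags into the unused high bits of the
    # address word sent in the first cycle.
    return (flags << ADDR_BITS) | (addr & ADDR_MASK)

def unpack(word):
    # The mask-and-shift HQ would have to spend instructions on.
    return word & ADDR_MASK, word >> ADDR_BITS
```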

>> >> The next tasks, things we really need help with, include:
>> >>
>> >> - HQ microcode.
>> >
> When we start on the microcode it may be an idea to come up with a
> (manually enforced) ABI for utilising the registers as best we can
>> > without ending up with a web of dependencies to rework if we need to
>> > change something.  Is there a register-based ABI/practices we can adapt?
>> > Otherwise, I can write up some ideas.
>>
>> Are you referring to the assignment of "names" to scratch space
>> addresses?
>
> That's an issue, too.  For the moment, I was just considering
> parameters, results, and scratch registers for subroutine calls.  Since
> our programs are small, it's probably not a big issue.  A single level
> of calls to rather simple subroutines using only a few registers may
> suffice, and once written, the register usage of a subroutine is
> unlikely to change.

We're extremely constrained, so we need to pick something that's very
efficient, even if it's more challenging to program.  We could think like
Fortran 77, where parameters are passed by reference and live at fixed
addresses.  So if you call function X, then you know exactly where in
scratch space to dump its parameters.  Exactly what to do with the
return/continuation address is a question, but if we do it this way,
there's a place in scratch memory to put it (if the subroutine needs
the register for something else).  Not having a stack does pose some
challenges.  We can simulate a stack, but the instruction overhead
could hurt performance.  We could also consider implementing a
16-entry stack, although I don't want to if we can avoid it.
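A sketch of what that Fortran-77-style layout could mean for the assembler (Python; the `layout` function and its table format are hypothetical): every function gets fixed scratch-space slots for its parameters plus one for its return/continuation address, so a caller always knows exactly where to dump arguments.

```python
def layout(functions, base=0):
    """functions: {name: n_params}, in definition order.
    Assign each function fixed scratch-space addresses for its
    parameters and one slot for the return/continuation address.
    No stack exists, so recursion is impossible by construction."""
    table = {}
    addr = base
    for name, n_params in functions.items():
        params = list(range(addr, addr + n_params))
        addr += n_params
        table[name] = {'params': params, 'retaddr': addr}
        addr += 1  # reserve the return-address slot
    return table
```

E.g. with two functions taking 2 and 3 parameters, the first gets scratch words 0-1 plus return slot 2, and the second gets 3-5 plus return slot 6.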

If we were ever to take this basic architecture and scale it to a more
powerful processor design in the future, it would not be binary
compatible.  Moreover, different revisions of OGA do not have to have
binary-compatible HQs.  As long as the instruction sets are
well-documented, the system and BIOS can handle figuring out which
code file to load for any given purpose.

>> We definitely need something like that, but it may be very
>> program-dependent.  Unless things turn out to be surprisingly small,
>> we'll have one program for VGA text, one for VGA graphics, and
>> eventually, one for DMA.  BIOS or kernel can reload the program as
>> necessary.
>>
>> I'm always in favor of creating good design structures.  512 program
>> words doesn't seem like a lot, but that's part of the challenge --
>> fitting a program into that space.  To keep our sanity, we really need
>> to be organized about it.  This is especially important for us and our
>> progeny to be able to maintain it later.  Unfortunately, I'm not sure
>> what pre-existing paradigms might apply here, so let's develop
>> something new.
>
> This is what I had in mind.  We allocate from r0 upwards in the
> following order with possible overlaps (s = scratch, r = read, w =
> write)
>
>                                        caller  callee
>  1. Scratch registers.                 s       s
>  2. Result registers.                  s/r     s/w
>  3. Scratched parameter registers.     s/w     s/r
>  4. Preserved parameter registers.     s/w     r
>  5. Continuation address register.     s/w     r
>  6. Callee preserved registers.        s       -
>  7. Caller preserved registers.        -       -
>
> Relating to a stack-based ABI, regs 2 to 5 make up the current frame, and
> higher registers are higher stack frames.  Relating to a CPS-based ABI,
> regs 4 to 7 are the continuation and regs 2 are the parameters passed to
> the continuation.

Continuations are interesting for high-level programming, but I'm not
even thinking of doing a proper stack-based ABI.

A simple approach:

Every function has access to every register for any purpose, but for
any register it's going to use, it is responsible for storing it to a
fixed (computed by the assembler) location in scratch space and
restoring it before returning.  This includes the call return address.
We can devise assembler directives that indicate which registers hold
which parameters (no scratch space needed), which registers are for
local variables (requiring a backing store in scratch space), and how
many additional scratch locations are necessary for the task to
compute.

Parameters can be passed in registers, and we are free to restrict it
so that no more than some number of parameters can be passed, owing to
certain registers being reserved (particularly the return address).
Recursion is not allowed.  Indeed, disallowing recursion may help us
simplify things further.
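The assembler bookkeeping described above might look something like this (Python sketch; all names hypothetical).  Parameters stay in registers with no backing store, while every local register, including the return-address register, gets a fixed save slot in scratch space:

```python
def assign_save_slots(funcs, base=0):
    """funcs: {name: {'params': [regs], 'locals': [regs]}}, in
    definition order.  Return {name: {reg: scratch_addr}} listing
    every register the function must save on entry and restore
    before returning.  Without recursion, fixed slots are safe."""
    slots = {}
    addr = base
    for name, decl in funcs.items():
        slots[name] = {}
        for reg in decl['locals']:   # parameters need no save slot
            slots[name][reg] = addr
            addr += 1
    return slots
```

Because recursion is disallowed, each function's slots can be assigned statically and never collide with a second live activation of the same function.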

It's slightly faster to use registers than scratch space, so we should
see how we can optimize to use registers more.

> E.g. "z += x*y" would be (cf hqlib/mulu.asm though it doesn't use this
> convention),
>
>    r0 - parameter and result z
>    r1 - parameter x
>    r3 - parameter y
>    r4 - continuation address
>    r5..r31 - preserved
>
> If we had a subroutine using the above, its usage may be
>
>    r0..r4 - scratch
>    r5..r[N-1] - output, input, cont
>    r[N]..r31 - preserved
>
> This facilitates bottom-up coding.  As long as we are dealing with
> simple and well-defined subroutines, we'll be able to allocate registers
> precisely.  For higher level subroutines, we can think forward and set
> aside some extra scratch registers.

I like this.  We can put some intelligence into the assembler (or
compiler?) that allows a function to state which registers it's going
to use.  Then the CALLER is responsible for saving and restoring, and
this makes room for the caller to simply avoid certain registers,
rather than the callee having to save and restore registers even if
nothing useful is in them.  So for leaf nodes and their parents, this
can result in some significant optimizations, although that will
diminish dramatically for the next level up.
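The caller's side of that scheme reduces to a set intersection: save only the registers the callee declares it clobbers that also hold something live in the caller (Python sketch; names hypothetical).

```python
def regs_to_save(callee_clobbers, caller_live):
    """Registers the caller must save around a call: those the
    callee declares it will clobber AND that currently hold a
    live value in the caller.  If the caller simply steered
    clear of the callee's registers, this set is empty and the
    call costs no saves at all."""
    return sorted(set(callee_clobbers) & set(caller_live))
```

This is where the leaf-call optimization shows up: a leaf that declares a small clobber set lets its parent keep live values out of those registers entirely.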


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
