On 2008-08-20, Timothy Normand Miller wrote:
> I've quoted the code and inserted questions and comments.
>
>
> > let G_POLL_BASE = 0 ; FIXME
> >
> > ;; Global Parameters
> > let G_POLL_MEM_TRANS = G_POLL_BASE + 0
> > let G_POLL_VMEM_TRANS = G_POLL_BASE + 1
> > let G_POLL_IO_TRANS = G_POLL_BASE + 2
> >
> > ;; Global State
> > let G_POLL_TARGET = G_POLL_BASE + 3
> > let G_POLL_ADDR = G_POLL_BASE + 4
>
> Ok, so this must be where you're keeping your state.
Yes, and apropos of something you said below, we need some way to
initialise some of these. Could we map part of the HQ-internal memory
into its IO space? The host would, after loading the program, initialise
the globals by regular target writes to IO space. This presumes the IO
space is big enough. Otherwise, we could introduce data and address
ports and a dedicated IO function for writing to HQ's internal memory.
If the upload code itself depends on a few globals, those can be
initialised by a few instructions run after HQ starts up.
> Something occurs to me. When you get the INFO, you're then masking to
> get fields. How about this... for reading from the PCI command fifo,
> INFO could be split up so there's one port where you can read all the
> bits (command and flags/target), and there are two more ports where
> you can read them separately, so there are no instructions wasted on
> masking and shifting. (Although I like how you did the xor thing to
> test the commands.) For writing to the bridge, we don't need that,
> since they're encoded in the I/O address.
I think that's better; how about also dropping the combined
flags/target port? I don't think there is any case where we need to deal
with the combination of the two. Wasn't the main reason for combining them
to reduce the number of ports, and thus reduce the MUX in the RTL?
> > ;;; ------------------------------------------------------------------------
> > ;;; poll_pci(r4: continuation)
> >
> > frame
> > alias q0..q1 = r3..r4
> > alias p0 = r5
> > protect r6..r31
> > poll_pci:
>
> I inserted some comments...
>
> > ;; Find out how many commands we have to process
> > move [PCI_T_CMD_COUNT], r0
> >
> > ;; If there are none, return
> > jzero r0, p0
I'm not sure how best to comment the code without blowing up its size, but I
agree we need many comments. Two alternatives for the above could be to
utilise the horizontal space when possible:
move [PCI_T_CMD_COUNT], r0 ; Find out how many commands to process
jzero r0, p0 ; If there are none, return
or to group related comments:
;; Find out how many words to process. If none, return.
move [PCI_T_CMD_COUNT], r0
jzero r0, p0
> > ;; Switch between Command Types
> > ;; ----------------------------
> > move [PCI_T_CMD_INFO], r0
> > and r0, PCI_TCINFO_TYPE_MASK, r1
>
> So, like I say, we can eliminate this one AND by having extra I/O ports.
>
> > xor r1, PCI_TCINFO_TYPE_ADDR, r1
> > jzero r1, poll_pci_addr
> > xor r1, PCI_TCINFO_TYPE_RCOUNT ^ PCI_TCINFO_TYPE_ADDR, r1
> > jzero r1, poll_pci_rcount
> > xor r1, PCI_TCINFO_TYPE_WDATA ^ PCI_TCINFO_TYPE_RCOUNT, r1
> > jzero r1, poll_pci_wdata
> > noop
> >
>
> Need something to do in case command is zero (although no point until
> there really is a null command).
>
> One idea that occurs to me, but I don't know if it'll be efficient is
> something like:
>
> move [PCI_T_CMD_INFO_TYPE], r0
> lsr r0, 5, r0
> add poll_pci__base, r0, r0
> jmp r0
>
> This assumes that each case only needs 32 instructions.
A straightforward jump table would be:
move [PCI_T_CMD_TYPE], r0
shift r0, 1, r0
add r0, poll_pci_switch, r0
jump r0
noop
poll_pci_switch:
jump p0 ; If idle, exit.
noop
jump poll_pci_addr
noop
jump poll_pci_rcount
noop
jump poll_pci_wdata
noop
...
but it's not that efficient. We can get rid of the "shift" by keeping
PCI_T_CMD_TYPE shifted by one bit in hardware.
> Another
> alternative would be:
>
> move [PCI_T_CMD_INFO_TYPE], r0
> add r0, G_POLL_CASE_ADDR, r0
> move [r0], r0
> jmp r0
>
> Basically, just use a lookup table to get the branch address.
That's better. We must be careful to initialise the IO write case from
an HQ init function, as noted above. However, there needs to be a noop
before and after the jump. The one before is due to the fact that
fetches can't be forwarded.
Counting from the "move [PCI_T_CMD_TYPE], r0" instruction, the
if-then-else-if version has 4 instructions before reaching the target
code for the first case (address), then 6 for read and 8 for write (we
may want to reorder them). Your lookup table has 6 instructions,
including two noops which may or may not be utilised. My jump table has
6 instructions if we eliminate the shift, but in this case we can in
fact utilise every delay slot of the jump table, making it 5 instructions.
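To make the trade-off concrete, here is a sketch of the dispatch strategies discussed above in Python. All names (TYPE_*, handle_*) are hypothetical stand-ins for the PCI_TCINFO_TYPE_* constants and case bodies, just to illustrate the dispatch idea, not the actual HQ code:

```python
# Hypothetical stand-ins for the PCI_TCINFO_TYPE_* command types.
TYPE_IDLE, TYPE_ADDR, TYPE_RCOUNT, TYPE_WDATA = 0, 1, 2, 3

def handle_idle():   return "idle"
def handle_addr():   return "addr"
def handle_rcount(): return "rcount"
def handle_wdata():  return "wdata"

# If/else chain: cost grows with the position of the matching case
# (4, 6, 8 instructions in the assembly version).
def dispatch_chain(t):
    if t == TYPE_ADDR:   return handle_addr()
    if t == TYPE_RCOUNT: return handle_rcount()
    if t == TYPE_WDATA:  return handle_wdata()
    return handle_idle()

# Lookup table: constant cost for every case, at the price of one
# indexed load plus an indirect jump (and table initialisation).
DISPATCH = [handle_idle, handle_addr, handle_rcount, handle_wdata]

def dispatch_table(t):
    return DISPATCH[t]()
```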
> BTW, I'm not accounting at all for the fact that results of a given
> instruction may not be available to the next instruction. Did you add
> forwarding to the design?
Yes, I added register forwarding, very much thanks to your description.
But, after a memory fetch, we need one noop, because the memory-read
happens two stages down from the ALU, so with only one delay slot there
is no way we can forward it. This will in fact affect a lot of the code
cited below.
> > ;; Address Command
> > ;; ---------------
> > poll_pci_addr:
> > ;; Have r0 = [PCI_T_CMD_INFO]
> > ;; Will set r1 to the address correction.
> > and r0, PCI_TCINFO_FLAGS_MASK, r0 ; target
>
> Now, this AND can be replaced by just a fetch from PCI_T_CMD_INFO_FLAGS.
>
> > move r0, [G_POLL_TARGET]
> > sub r0, PCI_TARGET_MEM, r0
>
> Why did you use xor above and sub here?
That's because I started to replace "sub" with "xor" only after reading
your post about widening the instruction word, and thinking that
xor r1, PCI_TCINFO_TYPE_RCOUNT ^ PCI_TCINFO_TYPE_ADDR, r1
is more elegant than
sub r1, PCI_TCINFO_TYPE_RCOUNT - PCI_TCINFO_TYPE_ADDR, r1
because ^ is commutative.
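The reason the chained form works can be shown in a few lines of Python (the constants A, B, C are arbitrary stand-ins for the type codes): after "x ^= A", testing against B only needs "x ^= (B ^ A)", because (x ^ A) ^ (B ^ A) == x ^ B, and the same identity holds with subtraction only if you keep the operand order straight.

```python
# Hypothetical stand-ins for three command-type constants.
A, B, C = 0x20, 0x40, 0x60

def classify(x):
    r = x ^ A           # r == 0 iff x == A
    if r == 0:
        return "A"
    r ^= B ^ A          # r is now x ^ B, since (x^A)^(B^A) == x^B
    if r == 0:
        return "B"
    r ^= C ^ B          # r is now x ^ C
    if r == 0:
        return "C"
    return "other"
```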
> > jzero r0, poll_pci_save_addr
> > move [G_POLL_MEM_TRANS], r1
> > sub r0, PCI_TARGET_VMEM - PCI_TARGET_MEM, r0
> > jzero r0, poll_pci_save_addr
> > move [G_POLL_VMEM_TRANS], r1
> > sub r0, PCI_TARGET_IO - PCI_TARGET_VMEM, r0
> > jzero r0, poll_pci_save_addr
> > move [G_POLL_IO_TRANS], r1
>
> I like how you document the delay slot by indenting.
>
> > move 0, r1
> > poll_pci_save_addr:
> > ;; Save address and tail call.
> > move [PCI_T_CMD_DATA], r0 ; dequeue address
> > add r0, r1, r0
> > move r0, [G_POLL_ADDR]
> > jump poll_pci
> > noop
>
> I wonder if you couldn't just look up the target in a table.
> move [PCI_T_CMD_INFO_FLAGS], r0
> add TABLE_BASE, r0, r0
> move [r0], r0 ;; Now have address translation
> move r0, [G_POLL_ADDR]
>
> This means that scratch memory needs to be writable from PCI, just
> like the program memory. Either that, or we need a way to get data
> out of program memory, but that seems like it would be harder.
>
> Also, I'm again probably violating data dependencies.
I think we should solve the problem of initialising scratch memory from
the host, anyway. Initialising it with instructions will take up a lot
of program memory.
Your address correction code above looks good, but also here we need a
single noop after the fetch, since the fetch cannot be forwarded.
BTW, I just discovered some weirdness when a noop follows a fetch to r0.
It's due to the implementation of "noop" as "move r0, r0". Consider,
move [SOME_ADDRESS], r0
move r0, r0
When the fetch instruction reaches the IO unit, the address is on the
input port, and the value we want will only appear on the output on the
next cycle. But, at this point, the "move r0, r0" is ready to be
executed by the ALU, which enables register forwarding from the ALU
output/IO input to the ALU input. Thus, SOME_ADDRESS will be stored in
r0.
This can be solved by the assembler issuing a different "noop"
instruction if the previous instruction was a fetch to r0. However,
that leaves one bad case when a delay slot contains a fetch to r0 and
the target of the jump starts with a "noop". That can't be detected.
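A toy model of that hazard, in Python. This is only a sketch of the timing argument, not the RTL: the IO fetch has its *address* on the ALU output in the cycle the next instruction executes, and the fetched data only arrives a cycle later, so naive forwarding hands "move r0, r0" the address instead of the data:

```python
# Hypothetical one-entry IO space: address -> data.
MEM = {0x10: 0xBEEF}

def run(forwarding_checks_fetch):
    """Model 'move [0x10], r0' followed by 'move r0, r0' (the noop)."""
    alu_out = 0x10          # cycle 1: fetch issues; address on ALU output
    fetched = MEM[0x10]     # the data only appears one cycle later
    # cycle 2: the noop "move r0, r0" executes and forwarding kicks in.
    if forwarding_checks_fetch:
        r0 = fetched        # correct: forward the IO result
    else:
        r0 = alu_out        # buggy: forwards the address, not the data
    return r0
```

With the buggy forwarding path, r0 ends up holding SOME_ADDRESS rather than the fetched word, which is exactly the weirdness described above.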
> > ;; Read Command
> > ;; ------------
> > ;;
> > ;; For simplicity, we only fetch up to MEM_GRANULE_SIZE words.
> > poll_pci_rcount:
> > ;; Write request address
> > move [G_POLL_ADDR], r0
> > and r0, MEM_GRANULE_SIZE - 1, r1 ; words to skip
> > and r0, ~(MEM_GRANULE_SIZE - 1), r0 ; aligned address
> > move [G_POLL_TARGET], r2
> > xor r2, PCI_TARGET_ENG, r2
> > jnzero r2, poll_pci_rcount_not_eng
> > move r0, [MEM_SEND_ADDR_MEM]
> > move r0, [MEM_SEND_ADDR_ENG]
>
> This would send two addresses when the test fails. But that's fine.
> You can send extraneous address commands over the bridge with no
> consequence, as long as the bridge is not busy.
That was what I thought ;-)
> When do you need to read a block whose size is different from the
> request made by the address decoder? With engine addresses, you can
> only request the 1 anyhow, and with memory, you can just request as
> many as the address_decode asks for (which is 16). So I'm not
> entirely sure what this is for. Is this for handling uncached memory
> reads? If so, then it makes sense.
>
> I think if you were to request more than 1 word for an engine access,
> the bus logic would hang.
>
> Note that caching or not would be specific to the HQ program that's
> running. In VGA text mode, we can cache, so there's no need to do
> anything special with memory reads. In VGA graphics mode, there MAY
> be side-effects, so we may have to turn the cache off, in which case,
> we need to do what you're doing... but ONLY that. You'd read 8 words,
> possibly cache them, or possibly throw away the ones you don't want.
But as you say, this will always be a request for a single word, right?
> > poll_pci_rcount_not_eng:
> > ;; Let q0 be the numbers to transfer.
> > move [PCI_T_CMD_DATA], q0 ; the requested count
> > jzero q0, poll_pci ; or can we assume this is nonzero?
> > noop
>
> The request count is either 1 (for engine or uncached) or 16 (for
> memory, cached). No other values are possible. If we change that,
> then you'll have to change the assumptions, but I doubt that'll
> happen.
So, 1 or 16, that's much simpler. We just write a specialised version
for each case.
> > jump mem_small_skip, r2
>
> I forget.... Is the second argument where the return address is stored?
Yes, and for conditional jumps, it's the third argument.
> > ;; Write Commands
> > ;; --------------
> > poll_pci_wdata:
> > ;; Send address to bridge and adjust [G_POLL_ADDR].
> > move [G_POLL_ADDR], r2
> > move r2, [MEM_SEND_ADDR_MEM]
> > ;; Prepare for the first transfer. We know the next PCI command is a
> > ;; read, and r0 already contains PCI_T_CMD_INFO.
>
> How do we know the next PCI command is a read?
Sorry, we know it's a *write*, because we arrived from the command-type
switch.
> > and r0, PCI_TCINFO_FLAGS_MASK, r1 ; byte enables
> > poll_pci_wdata_next:
> > move [PCI_T_CMD_DATA], r0
> > move r0, [add MEM_SEND_DATA_0000, r1]
> > add r2, 1, r2
> > ;; Repeat as long as we receive write commands.
> > ;; CHECKME: Do we need to test [PCI_T_CMD_COUNT]?
>
> You either repeatedly check the count to be nonzero, or you check it
> once and count down. You may find that always checking the count is
> less instructions.
We can't count down because we don't know if all commands in the
pipe are the same type. In fact I think PCI_T_CMD_COUNT is useless
for anything but checking whether it is one or zero. Therefore, we
can save instructions by encoding this information in PCI_T_CMD_TYPE.
> > move [PCI_T_CMD_INFO], r0
> > and r0, PCI_TCINFO_TYPE_MASK, r1
> > xor r1, PCI_TCINFO_TYPE_WDATA, r1
> > jzero r1, poll_pci_wdata_next
> > and r0, PCI_TCINFO_FLAGS_MASK, r1 ; byte enables
> > ;; Save the address in case there are consecutive write commands which have
> > ;; just not entered the pipe yet, and recheck PCI queue.
> > move r2, [G_POLL_ADDR]
> > jump poll_pci
> > noop
> > endframe
> >
> >
> > ;;; ------------------------------------------------------------------------
> > ;;; mem_small_skip(r1: count, r2: cont)
> > ;;;
> > ;;; Drops count words from MEM_READQ_DATA, where 0 ≤ count ≤ 8.
> >
> > frame
> > alias p0..p1 = r1..r2
> > protect r3..r31
> > mem_small_skip_next: ; Not the entry point!
> > sub p0, r0, p0 ; decrement counter by available words
> > jnneg p0, mem_small_skip_no_trunc
> > noop
> > add p0, r0, r0
> > move 0, p0
> > mem_small_skip_no_trunc:
> > sub mem_small_skip, r0, r0
> > jump r0
> > noop
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r0
> > mem_small_skip:
> > jnzero p0, mem_small_skip_next
> > move [MEM_READQ_AVAIL], r0
> > jump p1
> > noop
> > endframe
>
> It looks like you're trying to toss words, so this must be the routine
> you are using to help pick a single word out of a larger granule.
Almost, but I'm picking out an arbitrary number of words.
> (I'm tired, so I'm having trouble with the high-level.) Those
> consecutive moves are going to work or not work, depending on whether
> or not there's data in the return queue... but you can't tell how many
> worked and how many didn't.
Note that the entry point to the subroutine is right below the unroll.
There, the available count is read, we jump to mem_small_skip_next,
which checks it against the number we want to skip. The minimum of the
two is used to do a computed jump into the unrolled move instructions.
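The skip logic can be sketched in Python, with a deque standing in for MEM_READQ_DATA and a slice of a loop emulating the computed jump into the unrolled moves. One deliberate difference: the real routine re-polls MEM_READQ_AVAIL and waits, while this sketch just stops when the queue runs dry:

```python
from collections import deque

def small_skip(queue, count):
    """Drop up to `count` words from `queue`, at most 8 per pass."""
    while count > 0:
        available = len(queue)          # read MEM_READQ_AVAIL
        if available == 0:
            break                       # sketch only: real code re-polls
        n = min(count, available, 8)    # truncate to what's available
        for _ in range(n):              # emulates the computed jump into
            queue.popleft()             #   the 8 unrolled fetch moves
        count -= n
    return queue
```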
> You know what we really need here.... some more abstract commands to
> the bridge and pci queues. Such as:
>
> - Drop N words from queue (if there are none available, it just waits)
> - Pipe N words from queue X to queue Y (as they arrive in X and as
> long as Y is not full)
>
> I can't think of more off the top of my head. But what we can do is
> this... say you're moving 16 data words from bridge to PCI, then you
> can just tell it that you're expecting to move 16. As they appear in
> one pipe, they're automatically dumped into the other. But only when
> you set it up to do so, and only as many as you specify. Now all you
> have to do is make sure this "move engine" is not busy when you want
> to program it again.
Since reads are now always 1 or 16, things are a bit simpler. Especially
when the count is 16, we just fetch everything. We can try to do it in
software first, but if it's too slow, we can add the hardware to do
automatic skips and transfer.
> > ;;; ------------------------------------------------------------------------
> > ;;; mem_to_pci_xfer(r1: count, r2: cont)
> > ;;;
> > ;;; Transfer count words from memory to PCI, where the transferred block is
> > ;;; assumed to be aligned on MEM_GRANULE_SIZE and to be confined to a single
> > ;;; block.
> >
> > frame
> > alias p0..p1 = r1..r2
> > protect r3..r31
> > move [MEM_READQ_DATA], r0
> > move r0, [PCI_TR_DATA]
> > move [MEM_READQ_DATA], r0
> > move r0, [PCI_TR_DATA]
> > move [MEM_READQ_DATA], r0
> > move r0, [PCI_TR_DATA]
> > move [MEM_READQ_DATA], r0
> > move r0, [PCI_TR_DATA]
> > move [MEM_READQ_DATA], r0
> > move r0, [PCI_TR_DATA]
> > move [MEM_READQ_DATA], r0
> > move r0, [PCI_TR_DATA]
> > move [MEM_READQ_DATA], r0
> > move r0, [PCI_TR_DATA]
> > move [MEM_READQ_DATA], r0
> > move r0, [PCI_TR_DATA]
>
> This moves 8 words. But you're not ensuring that 8 words are
> available in the data queue. (Are you?) Also, it would be cool if
> there were a move command that would move directly from I/O port to
> I/O port. I think this was an idea you had suggested a while back.
As above, I'm doing a computed jump into this unrolled loop. What I
don't take into account is that the PCI return pipe may fill up, but I
think you said that was not an issue.
I think it's difficult to adjust the current HQ to allow moves directly
from IO to IO, but we could add specialised streaming instructions, or
introduce a command port which effects transfers or skips when written
to.
BTW, this is maybe the most severe case of register dependency after a
fetch. If we don't introduce a special mechanism to deal with
transfers, we'll have to pair up the instructions like
move [MEM_READQ_DATA], r0
move [MEM_READQ_DATA], r1
move r0, [PCI_TR_DATA]
move r1, [PCI_TR_DATA]
move [MEM_READQ_DATA], r0
move [MEM_READQ_DATA], r1
move r0, [PCI_TR_DATA]
move r1, [PCI_TR_DATA]
...
or some similar trick. That complicates the computed jump, but since
this code will only be used when we have 16 to fetch, I don't think it's
that bad.
> My idea was to turn this into a sort of select, where you read the
> available count from the bridge read data queue, read the free count
> from the pci return queue, take the min of the two, then use that
> number to compute an address that you jump to where that many words
> gets moved.
>
> Pseudo code:
> get count into r0
> r0 = 16 - 2*r0 + base_address
> jump r0
> noop
> base_address:
> move [source], r1
> move r1, [dest]
> move [source], r1
> move r1, [dest]
> [...]
Yes, that's what I did :-)
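For the record, the select-style transfer can be sketched like this in Python: two deques stand in for the bridge read queue and the PCI return queue (its maxlen playing the role of the free count), and each pass moves min(available, free) words, emulating the computed jump into the unrolled move pairs:

```python
from collections import deque

def xfer(src, dst, count):
    """Move up to `count` words from `src` to a bounded `dst` queue."""
    while count > 0:
        free = dst.maxlen - len(dst)     # free slots in the return queue
        n = min(count, len(src), free)   # the "select": min of the counts
        if n == 0:
            break                        # sketch only: real code re-polls
        for _ in range(n):               # emulates the computed jump into
            dst.append(src.popleft())    #   the unrolled move pairs
        count -= n
    return count                         # words still outstanding
```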
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)