[Open-graphics] HQ assembly code, bypass

Timothy Normand Miller Wed, 20 Aug 2008 20:15:45 -0700

I've quoted the code and inserted questions and comments.


> let G_POLL_BASE               = 0  ; FIXME
>
> ;; Global Parameters
> let G_POLL_MEM_TRANS  = G_POLL_BASE + 0
> let G_POLL_VMEM_TRANS = G_POLL_BASE + 1
> let G_POLL_IO_TRANS   = G_POLL_BASE + 2
>
> ;; Global State
> let G_POLL_TARGET     = G_POLL_BASE + 3
> let G_POLL_ADDR               = G_POLL_BASE + 4

Ok, so this must be where you're keeping your state.

Something occurs to me.  When you get the INFO, you're then masking to
get fields.  How about this... for reading from the PCI command fifo,
INFO could be split up so there's one port where you can read all the
bits (command and flags/target), and there are two more ports where
you can read them separately, so there are no instructions wasted on
masking and shifting.  (Although I like how you did the xor thing to
test the commands.)  For writing to the bridge, we don't need that,
since they're encoded in the I/O address.
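
To make the split-port idea concrete, here is a rough Python model. The bit layout (4 flag bits below 4 type bits), the class name, and the choice to have the type port return an already shifted-down value are all assumptions for illustration, not the actual hardware.

```python
# Hypothetical layout: low 4 bits = flags/target, next 4 bits = command type.
FLAGS_MASK = 0x0F
TYPE_SHIFT = 4
TYPE_MASK = 0x0F << TYPE_SHIFT


class CmdFifoHead:
    """Models the head of the PCI command fifo seen through three read ports."""

    def __init__(self, info):
        self.info = info

    def read_info(self):
        # Existing port: all bits (command and flags/target).
        return self.info

    def read_type(self):
        # Proposed extra port: command type only, no masking in software.
        return (self.info & TYPE_MASK) >> TYPE_SHIFT

    def read_flags(self):
        # Proposed extra port: flags/target only.
        return self.info & FLAGS_MASK


head = CmdFifoHead(0x2B)
```

With ports like these, each `and` against a mask in the program becomes a plain read from a different address.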

>
> ;;; ------------------------------------------------------------------------
> ;;; poll_pci(r4: continuation)
>
>     frame
>       alias q0..q1 = r3..r4
>       alias p0 = r5
>       protect r6..r31
> poll_pci:

I inserted some comments...

>       ;; Find out how many commands we have to process
>       move [PCI_T_CMD_COUNT], r0
>       
>       ;; If there are none, return
>       jzero r0, p0
>
>       ;; Switch between Command Types
>       ;; ----------------------------
>       move [PCI_T_CMD_INFO], r0
>       and r0, PCI_TCINFO_TYPE_MASK, r1

So, like I say, we can eliminate this one AND by having extra I/O ports.

>       xor r1, PCI_TCINFO_TYPE_ADDR, r1
>       jzero r1, poll_pci_addr
>         xor r1, PCI_TCINFO_TYPE_RCOUNT ^ PCI_TCINFO_TYPE_ADDR, r1
>       jzero r1, poll_pci_rcount
>         xor r1, PCI_TCINFO_TYPE_WDATA ^ PCI_TCINFO_TYPE_RCOUNT, r1
>       jzero r1, poll_pci_wdata
>         noop
>

We need something to do in case the command type is zero (although
there's no point until there really is a null command).

One idea that occurs to me, though I don't know whether it'll be
efficient, is something like:

    move [PCI_T_CMD_INFO_TYPE], r0
    shift r0, 5, r0  ; scale the type by 32 slots per case
    add poll_pci__base, r0, r0
    jump r0

This assumes that each case only needs 32 instructions.  Another
alternative would be:

    move [PCI_T_CMD_INFO_TYPE], r0
    add r0, G_POLL_CASE_ADDR, r0
    move [r0], r0
    jump r0

Basically, just use a lookup table to get the branch address.
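
The two dispatch schemes can be compared in a small Python sketch. The handler addresses, the base address, and the function names are made up; the only thing taken from the text is the assumption of 32 instruction slots per case.

```python
CASE_SIZE = 32  # assumed number of instruction slots reserved per case


def computed_target(base, cmd_type):
    """Scheme 1: every case starts at a fixed stride from the base."""
    return base + cmd_type * CASE_SIZE  # same as base + (cmd_type << 5)


def table_target(table, cmd_type):
    """Scheme 2: the branch address is fetched from a lookup table."""
    return table[cmd_type]


# Hypothetical handler addresses; they need not be evenly spaced.
handlers = [0x100, 0x104, 0x130, 0x1A0]
```

The fixed stride wastes slots on short cases and caps long ones, while the table costs one extra memory read but lets the handlers sit anywhere.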

BTW, I'm not accounting at all for the fact that results of a given
instruction may not be available to the next instruction.  Did you add
forwarding to the design?

>
>       ;; Address Command
>       ;; ---------------
> poll_pci_addr:
>       ;; Have r0 = [PCI_T_CMD_INFO]
>       ;; Will set r1 to the address correction.
>       and r0, PCI_TCINFO_FLAGS_MASK, r0       ; target

Now, this AND can be replaced by just a fetch from PCI_T_CMD_INFO_FLAGS.

>       move r0, [G_POLL_TARGET]
>       sub r0, PCI_TARGET_MEM, r0

Why did you use xor above and sub here?
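
For what it's worth, both chains reach zero exactly when the value matches the current candidate, so either works as the zero test. A quick Python check of that equivalence, with made-up constants and hypothetical function names:

```python
def xor_chain(value, candidates):
    """Index of the first candidate equal to value, using a running xor."""
    acc = value
    prev = 0
    for i, c in enumerate(candidates):
        acc ^= c ^ prev  # acc is now value ^ c
        if acc == 0:
            return i
        prev = c
    return None


def sub_chain(value, candidates):
    """Same search, using a running subtraction instead."""
    acc = value
    prev = 0
    for i, c in enumerate(candidates):
        acc -= c - prev  # acc is now value - c
        if acc == 0:
            return i
        prev = c
    return None
```

In both cases the constant folded into each step (`c ^ prev` or `c - prev`) is what the assembler computes at build time, so the chain costs one ALU instruction per candidate.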

>       jzero r0, poll_pci_save_addr
>         move [G_POLL_MEM_TRANS], r1
>       sub r0, PCI_TARGET_VMEM - PCI_TARGET_MEM, r0
>       jzero r0, poll_pci_save_addr
>         move [G_POLL_VMEM_TRANS], r1
>       sub r0, PCI_TARGET_IO - PCI_TARGET_VMEM, r0
>       jzero r0, poll_pci_save_addr
>         move [G_POLL_IO_TRANS], r1

I like how you document the delay slot by indenting.

>       move 0, r1
> poll_pci_save_addr:
>       ;; Save address and tail call.
>       move [PCI_T_CMD_DATA], r0               ; dequeue address
>       add r0, r1, r0
>       move r0, [G_POLL_ADDR]
>       jump poll_pci
>       noop

I wonder if you couldn't just look up the target in a table:

    move [PCI_T_CMD_INFO_FLAGS], r0
    move r0, [G_POLL_TARGET]
    add TABLE_BASE, r0, r0
    move [r0], r1  ;; Now have address translation
    move [PCI_T_CMD_DATA], r0  ;; dequeue address
    add r0, r1, r0
    move r0, [G_POLL_ADDR]

This means that scratch memory needs to be writable from PCI, just
like the program memory.  Either that, or we need a way to get data
out of program memory, but that seems like it would be harder.

Also, I'm again probably violating data dependencies.
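
As a sanity check on the table idea, here is a minimal Python model. The target encodings and translation offsets are invented, and treating an out-of-range target as a zero correction mirrors the `move 0, r1` fall-through in the quoted code.

```python
# Hypothetical target encodings, used as indices into the table.
PCI_TARGET_MEM, PCI_TARGET_VMEM, PCI_TARGET_IO = 1, 2, 3

# Hypothetical translation offsets, as they might sit in scratch memory.
translation_table = [0, 0x1000, 0x8000, 0x200]


def translate(table, target, address):
    """Add the per-target correction to a dequeued PCI address."""
    correction = table[target] if 0 <= target < len(table) else 0
    return address + correction
```

The table replaces the whole compare-and-branch chain with one indexed read, at the price of having to fill scratch memory before the program runs.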


>
>       ;; Read Command
>       ;; ------------
>       ;;
>       ;; For simplicity, we only fetch up to MEM_GRANULE_SIZE words.
> poll_pci_rcount:
>       ;; Write request address
>       move [G_POLL_ADDR], r0
>       and r0, MEM_GRANULE_SIZE - 1, r1 ; words to skip
>       and r0, ~(MEM_GRANULE_SIZE - 1), r0 ; aligned address
>       move [G_POLL_TARGET], r2
>       xor r2, PCI_TARGET_ENG, r2
>       jnzero r2, poll_pci_rcount_not_eng
>         move r0, [MEM_SEND_ADDR_MEM]
>       move r0, [MEM_SEND_ADDR_ENG]

This would send two addresses when the jump is not taken (the engine
case), since the delay slot always executes.  But that's fine.  You can
send extraneous address commands over the bridge with no consequence,
as long as the bridge is not busy.

When do you need to read a block whose size is different from the
request made by the address decoder?  With engine addresses, you can
only request 1 word anyhow, and with memory, you can just request as
many as the address_decode asks for (which is 16).  So I'm not
entirely sure what this is for.  Is this for handling uncached memory
reads?  If so, then it makes sense.

I think if you were to request more than 1 word for an engine access,
the bus logic would hang.

Note that caching or not would be specific to the HQ program that's
running.  In VGA text mode, we can cache, so there's no need to do
anything special with memory reads.  In VGA graphics mode, there MAY
be side-effects, so we may have to turn the cache off, in which case,
we need to do what you're doing... but ONLY that.  You'd read 8 words,
possibly cache them, or possibly throw away the ones you don't want.
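
Picking the wanted words out of a fetched granule comes down to a skip, a transfer, and another skip. Here is a small Python model of that arithmetic as the quoted code appears to do it; the granule size of 8 and the function name are assumptions.

```python
MEM_GRANULE_SIZE = 8  # assumed; must be a power of two for the masks to work


def split_request(addr, count):
    """Return (aligned address, words to skip, words to send, words to drop)."""
    skip_before = addr & (MEM_GRANULE_SIZE - 1)
    aligned = addr & ~(MEM_GRANULE_SIZE - 1)
    skip_after = MEM_GRANULE_SIZE - (count + skip_before)
    if skip_after < 0:
        # The request runs past the granule: truncate it to fit.
        count += skip_after
        skip_after = 0
    return aligned, skip_before, count, skip_after
```

The three counts always add up to one granule, so the read queue is fully drained after each request.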

> poll_pci_rcount_not_eng:
>       ;; Let q0 be the numbers to transfer.
>       move [PCI_T_CMD_DATA], q0 ; the requested count
>       jzero q0, poll_pci ; or can we assume this is nonzero?
>         noop

The request count is either 1 (for engine or uncached) or 16 (for
memory, cached).  No other values are possible.  If we change that,
then you'll have to change the assumptions, but I doubt that'll
happen.

>       ;; Let q1 be the number of words to skip after the transfer ...
>       add q0, r1, q1
>       sub MEM_GRANULE_SIZE, q1, q1
>       ;; ... if it's negative, truncate request count.  The request is done
>       ;; here to utilise the delay slot.
>       move MEM_GRANULE_SIZE, r0
>       jnneg q1, poll_pci_rcount_no_trunc
>         move r0, [MEM_SEND_READ_COUNT]
>       add q0, q1, q0  ; Reduce the transfer count to fit the granule.
>       move 0, q1      ; No final words to skip.
> poll_pci_rcount_no_trunc:
>       ;; Initial skip.  The count argument register r1 is already set above.

>       jump mem_small_skip, r2

I forget....  Is the second argument where the return address is stored?

>         noop
>       ;; The transfer to PCI.
>       jump mem_to_pci_xfer, r2
>         move q0, r1
>       ;; The final skip.
>       jump mem_small_skip, r2 ; Final skip.
>         move q1, r1
>       jump poll_pci
>         noop
>
>       ;; Write Commands
>       ;; --------------
> poll_pci_wdata:
>       ;; Send address to bridge and adjust [G_POLL_ADDR].
>       move [G_POLL_ADDR], r2
>       move r2, [MEM_SEND_ADDR_MEM]
>       ;; Prepare for the first transfer.  We know the next PCI command is a
>       ;; read, and r0 already contains PCI_T_CMD_INFO.

How do we know the next PCI command is a read?

>       and r0, PCI_TCINFO_FLAGS_MASK, r1 ; byte enables
> poll_pci_wdata_next:
>       move [PCI_T_CMD_DATA], r0
>       move r0, [add MEM_SEND_DATA_0000, r1]
>       add r2, 1, r2
>       ;; Repeat as long as we receive write commands.
>       ;; CHECKME: Do we need to test [PCI_T_CMD_COUNT]?

You either repeatedly check the count to be nonzero, or you check it
once and count down.  You may find that always checking the count takes
fewer instructions.
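
The two loop shapes differ in behaviour as well as in instruction count. A rough Python sketch, with a deque standing in for the PCI command fifo and hypothetical function names:

```python
from collections import deque


def drain_recheck(fifo, handle):
    """Re-read the count before every command."""
    while len(fifo) > 0:
        handle(fifo.popleft())


def drain_countdown(fifo, handle):
    """Read the count once, then count down."""
    remaining = len(fifo)
    while remaining > 0:
        handle(fifo.popleft())
        remaining -= 1
```

Rechecking picks up commands that arrive while the loop is running; counting down leaves them for the next poll.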

>       move [PCI_T_CMD_INFO], r0
>       and r0, PCI_TCINFO_TYPE_MASK, r1
>       xor r1, PCI_TCINFO_TYPE_WDATA, r1
>       jzero r1, poll_pci_wdata_next
>         and r0, PCI_TCINFO_FLAGS_MASK, r1 ; byte enables
>       ;; Save the address in case there are consecutive write commands which
>       ;; have just not entered the pipe yet, and recheck PCI queue.
>       move r2, [G_POLL_ADDR]
>       jump poll_pci
>         noop
>     endframe
>
>
> ;;; ------------------------------------------------------------------------
> ;;; mem_small_skip(r1: count, r2: cont)
> ;;;
> ;;; Drops count words from MEM_READQ_DATA, where 0 ≤ count ≤ 8.
>
>     frame
>       alias p0..p1 = r1..r2
>       protect r3..r31
> mem_small_skip_next: ; Not the entry point!
>       sub p0, r0, p0 ; decrement counter by available words
>       jnneg p0, mem_small_skip_no_trunc
>         noop
>       add p0, r0, r0
>       move 0, p0
> mem_small_skip_no_trunc:
>       sub mem_small_skip, r0, r0
>       jump r0
>         noop
>       move [MEM_READQ_DATA], r0
>       move [MEM_READQ_DATA], r0
>       move [MEM_READQ_DATA], r0
>       move [MEM_READQ_DATA], r0
>       move [MEM_READQ_DATA], r0
>       move [MEM_READQ_DATA], r0
>       move [MEM_READQ_DATA], r0
>       move [MEM_READQ_DATA], r0
> mem_small_skip:
>       jnzero p0, mem_small_skip_next
>         move [MEM_READQ_AVAIL], r0
>       jump p1
>         noop
>     endframe

It looks like you're trying to toss words, so this must be the routine
you are using to help pick a single word out of a larger granule.
(I'm tired, so I'm having trouble with the high-level.)  Those
consecutive moves are going to work or not work, depending on whether
or not there's data in the return queue... but you can't tell how many
worked and how many didn't.

You know what we really need here?  Some more abstract commands for
the bridge and PCI queues, such as:

- Drop N words from queue (if there are none available, it just waits)
- Pipe N words from queue X to queue Y (as they arrive in X and as
long as Y is not full)

I can't think of more off the top of my head.  But what we can do is
this... say you're moving 16 data words from bridge to PCI, then you
can just tell it that you're expecting to move 16.  As they appear in
one pipe, they're automatically dumped into the other.  But only when
you set it up to do so, and only as many as you specify.  Now all you
have to do is make sure this "move engine" is not busy when you want
to program it again.
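
A behavioural sketch of such a "move engine" in Python may help pin down the semantics. The class, its method names, the one-word-per-tick pacing, and the ability to program a drop count and a pipe count together are all simplifications made for this sketch, not a proposed register interface.

```python
from collections import deque


class MoveEngine:
    """Drops and/or pipes a programmed number of words between two queues."""

    def __init__(self, src, dst, dst_capacity):
        self.src, self.dst, self.cap = src, dst, dst_capacity
        self.drop_left = 0
        self.pipe_left = 0

    def busy(self):
        return self.drop_left > 0 or self.pipe_left > 0

    def program(self, drop=0, pipe=0):
        # The program must make sure the engine is idle before reprogramming.
        if self.busy():
            raise RuntimeError("move engine busy")
        self.drop_left, self.pipe_left = drop, pipe

    def tick(self):
        """Handle at most one word; simply wait if the source is empty."""
        if not self.src:
            return
        if self.drop_left > 0:
            self.src.popleft()
            self.drop_left -= 1
        elif self.pipe_left > 0 and len(self.dst) < self.cap:
            self.dst.append(self.src.popleft())
            self.pipe_left -= 1
```

The point of the model is that the program only sets counts and polls `busy()`; all the waiting on queue state happens inside the engine.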

>
> ;;; ------------------------------------------------------------------------
> ;;; mem_to_pci_xfer(r1: count, r2: cont)
> ;;;
> ;;; Transfer count words from memory to PCI, where the transferred block is
> ;;; assumed to be aligned on MEM_GRANULE_SIZE and to be confined to a single
> ;;; block.
>
>     frame
>       alias p0..p1 = r1..r2
>       protect r3..r31
>       move [MEM_READQ_DATA], r0
>       move r0, [PCI_TR_DATA]
>       move [MEM_READQ_DATA], r0
>       move r0, [PCI_TR_DATA]
>       move [MEM_READQ_DATA], r0
>       move r0, [PCI_TR_DATA]
>       move [MEM_READQ_DATA], r0
>       move r0, [PCI_TR_DATA]
>       move [MEM_READQ_DATA], r0
>       move r0, [PCI_TR_DATA]
>       move [MEM_READQ_DATA], r0
>       move r0, [PCI_TR_DATA]
>       move [MEM_READQ_DATA], r0
>       move r0, [PCI_TR_DATA]
>       move [MEM_READQ_DATA], r0
>       move r0, [PCI_TR_DATA]

This moves 8 words.  But you're not ensuring that 8 words are
available in the data queue.  (Are you?)  Also, it would be cool if
there were a move command that would move directly from I/O port to
I/O port.  I think this was an idea you had suggested a while back.

My idea was to turn this into a sort of select, where you read the
available count from the bridge read data queue, read the free count
from the pci return queue, take the min of the two, then use that
number to compute an address to jump to, from which exactly that many
words get moved.

Pseudo code:
    get count into r0
    r0 = 32 - 2*r0 + base_address  ; 16 move pairs, 2 instructions each
    jump r0
    noop
base_address:
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    move [source], r1
    move r1, [dest]
    loop back up and get another count

Although, if we have those meta commands, we don't need this.
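
The computed-entry trick can be checked with a few lines of Python. With 16 move pairs in the unrolled block, entering at offset 2*(16 - n) from the base executes exactly n pairs. The queue model, the capacity parameter, and the function names are assumptions for illustration.

```python
from collections import deque

UNROLL = 16  # number of move pairs in the unrolled block


def entry_offset(n):
    """Instruction offset from base_address that executes exactly n pairs."""
    return 2 * (UNROLL - n)


def transfer_once(src, dst, dst_capacity):
    """One pass: move min(available, free, UNROLL) words from src to dst."""
    n = min(len(src), dst_capacity - len(dst), UNROLL)
    offset = entry_offset(n)
    # Simulate falling through the pairs that remain after the entry point.
    for _ in range((2 * UNROLL - offset) // 2):
        dst.append(src.popleft())
    return n
```

Taking the min of the available and free counts is what guarantees that none of the unrolled moves reads an empty queue or writes a full one.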

> mem_to_pci_xfer:
>       jzero p0, p1
>         move [MEM_READQ_AVAIL], r0
>       sub p0, r0, p0  ; Decrement by the number of words we'll transfer.
>       ;; If the pending count is negative, we readjust to transfer the rest.
>       jnneg p0, mem_to_pci_xfer_not_last
>         noop
>       add p0, r0, r0  ; Set r0 to the remainder of the request.
>       move 0, p0      ; Clear the next remainder.
> mem_to_pci_xfer_not_last:
>       shift r0, 1, r0 ; Two instruction slots per transfer.
>       sub mem_to_pci_xfer, r0, r0
>       jump r0
>         noop
>     endframe
