On 2008-08-21, Timothy Normand Miller wrote:
> > Yes, and apropos of something you said below, we need some way to
> > initialise some of these. Could we map part of the HQ-internal memory
> > to its IO space? The host would, after loading the program, initialise
> > the globals by regular target writes to IO space. This presumes the IO
> > space is big enough. Otherwise, we could introduce data and address
> > ports and a dedicated IO function for writing to HQ's internal memory.
> > If the upload code itself depends on a few globals, those can be
> > initialised by a few instructions run after HQ starts up.
>
> Are you asking to give the host access to the scratch memory, or are
> you asking to give HQ data (as opposed to just program) access to the
> program file?
The former.
> The former could be implemented in microcode. What you could do is
> have an entry point in the microcode whose job is to initialize. It
> traps all engine and memory access and responds to certain commands
> that arrive in the form of engine access, except that in this mode it
> doesn't pass anything through. It just effectively maps the engine
> space to the scratch buffer by grabbing addresses and data and
> servicing the reads and writes out of the whole scratch space. Then
> when some particular address outside of that range is accessed, it
> bails out into the "real" main loop that can poll PCI properly. If
> the size of the code to handle this is smaller than the size of the
> code that would be necessary to fill it from program immediates, then
> it's a win, especially if you're tight on program space.
Yes, that's another option; I'm not sure which is best. My suggestion
of mapping scratch space to an IO port range has the advantage that it
can also be used after initialisation to update parameters. But, if the
whole IO address space is dictated by the VGA specs, my solution won't
work.
I think either solution will save us program memory, considering how
expensive it is to initialise using immediates:
move HIGH_BITS, r0 ; Adjust HIGH_BITS if LOW_BITS are negative!
shift r0, 16, r0
add r0, LOW_BITS, r0
move r0, [GLOBAL]
> Giving HQ data access to the program file could be quite a bit harder.
I agree.
> What we could do there is run the pci-program-microcode logic at
> engine speed, then we can have it listen to a fifo (how PCI fills the
> program file) and also to part of the I/O space.
I think this would require its own hardware logic, since the two BRAMs
are accessed from different parts of the RTL.
> >> Something occurs to me. When you get the INFO, you're then masking to
> >> get fields. How about this... for reading from the PCI command fifo,
> >> INFO could be split up so there's one port where you can read all the
> >> bits (command and flags/target), and there are two more ports where
> >> you can read them separately, so there are no instructions wasted on
> >> masking and shifting. (Although I like how you did the xor thing to
> >> test the commands.) For writing to the bridge, we don't need that,
> >> since they're encoded in the I/O address.
> >
> > I think that's better, and how about also dropping the combined
> > flags/target port. I don't think there is any case we need to deal with
> > the combination of the two. Wasn't the main reason for combining them
> > to reduce the number of ports, and thus reduce the MUX in the RTL?
>
> Yeah. But we're going to end up with a lot more ports anyhow. If
> there's no use in having the combined port, then ditch it. If we find
> a use for it, we can put it back later.
Good, I have prepared this to commit to the port decode:

    PCI_T_CMD_TYPE:
        hqio_inport = pci2hq_cmd_type & {32{pci2hq_cmd_valid}};
    PCI_T_CMD_FLAGS:
        hqio_inport = pci2hq_cmd_flags;
Then we also have the bit to avoid checking PCI_T_CMD_COUNT, as
discussed. I'm assuming pci2hq_cmd_valid is the same as
pci2hq_cmd_count != 0.
> > A straight forward jump table would be
> >
> > move [PCI_T_CMD_TYPE], r0
> > shift r0, 1, r0
> > add r0, poll_pci_switch, r0
> > jump r0
> > noop
> > poll_pci_switch:
> > jump p0 ; If idle, exit.
> > noop
> > jump poll_pci_addr
> > noop
> > jump poll_pci_rcount
> > noop
> > jump poll_pci_wdata
> > noop
> > ...
> >
> > but it's not that efficient.
>
> Yeah, because it's a sieve, and I'm looking for a computed branch,
> although see below...
>
> > We can get rid of the "shift" by keeping
> > PCI_T_CMD_TYPE shifted by one bit in hardware.
>
> True, but that bothers me. If we change granularity, then we'll have
> to do a shift anyhow.
The granularity of a jump table is fixed by the constraints of the HQ
architecture, but if we switch from a jump table to a table lookup,
then yes. So, we should make up our mind which of the two is best.
The if-then-else-if variant doesn't care.
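For comparison, a table-lookup variant would avoid the shift entirely,
since each table entry is one word rather than a jump/noop pair. A rough
sketch (poll_pci_table, the handler labels, and the ".word" directive
are invented for illustration, and it assumes HQ can fetch indirectly
through a register):

        move [PCI_T_CMD_TYPE], r0
        add r0, poll_pci_table, r0
        move [r0], r0           ; fetch the handler address from the table
        jump r0
        noop
    poll_pci_table:
        .word poll_pci_idle
        .word poll_pci_addr
        .word poll_pci_rcount
        .word poll_pci_wdata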
> > BTW, I just discovered some weirdness when a noop follows a fetch to r0.
> > It's due to the implementation of "noop" as "move r0, r0". Consider,
> >
> > move [SOME_ADDRESS], r0
> > move r0, r0
> >
> > When the fetch instruction reaches the IO unit, the address is on the
> > input port, and the value we want will only appear on the output on the
> > next cycle. But, at this point, the "move r0, r0" is ready to be
> > executed by the ALU, which enables register forwarding from the ALU
> > output/IO input to the ALU input. Thus, SOME_ADDRESS will be stored in
> > r0.
> >
> > This can be solved by the assembler by issuing different "noop"
> > instructions if the previous instruction was a fetch to r0. However,
> > that leaves one bad case when a delay slot contains a fetch to r0 and
> > the target of the jump starts with a "noop". That can't be detected.
>
> Don't you have an instruction bit that specifies whether or not a
> write-back is allowed? Can you turn that off in the noop?
The instruction word wasn't wide enough for a dedicated disable bit, but
write-back is disabled for stores, and branch instructions have a
write-back disable bit. Branch instructions are not suitable for noop,
but we can use a store:
move r0, [-1]
That is, we reserve port -1 for this purpose.
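With that, the problem case from above becomes safe (sketch):

        move [SOME_ADDRESS], r0
        move r0, [-1]   ; store-based noop: write-back is disabled, so
                        ; the forwarding path can't clobber r0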
> >> This would send two addresses when the test fails. But that's fine.
> >> You can send extraneous address commands over the bridge with no
> >> consequence, as long as the bridge is not busy.
> >
> > That was what I thought ;-)
>
> I didn't notice a check to make sure the bridge isn't busy. The
> situations where it matters are kinda rare, but you could have some
> long video read going on that clogs up the memory system, so the queue
> in the S3 fills, making the bridge busy, and then you can't issue
> writes or read requests. (I'm not sure if it's okay to issue
> addresses.)
But we are handling all access to memory, right? So if there is a long
video read, HQ will be stuck in a loop serving it.
> >> When do you need to read a block whose size is different from the
> >> request made by the address decoder? With engine addresses, you can
> >> only request the 1 anyhow, and with memory, you can just request as
> >> many as the address_decode asks for (which is 16). So I'm not
> >> entirely sure what this is for. Is this for handling uncached memory
> >> reads? If so, then it makes sense.
> >>
> >> I think if you were to request more than 1 word for an engine access,
> >> the bus logic would hang.
> >>
> >> Note that caching or not would be specific to the HQ program that's
> >> running. In VGA text mode, we can cache, so there's no need to do
> >> anything special with memory reads. In VGA graphics mode, there MAY
> >> be side-effects, so we may have to turn the cache off, in which case,
> >> we need to do what you're doing... but ONLY that. You'd read 8 words,
> >> possibly cache them, or possibly throw away the ones you don't want.
> >
> > But as you say, this will always be a request for a single word, right?
>
> What will be? Memory when the cache is off? Then yes. Otherwise, no.
Sorry, I was referring to the VGA graphics mode (cache off). I just want
to be sure there are no cases where we receive a read count other than 1
or 16. You already said that, but what confused me was that you
mentioned 8 word reads above.
> >> > jump mem_small_skip, r2
> >>
> >> I forget.... Is the second argument where the return address is stored?
> >
> > Yes, and for conditional jumps, it's the third argument.
>
> And if that argument is missing, then you just disable the write-back, right?
Right. That's why the branch instruction has a WB-disable bit.
> > BTW, this is maybe the most severe case of register dependency after a
> > fetch. If we don't introduce a special mechanism to deal with
> > transfers, we'll have to pair up the instructions like
> >
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r1
> > move r0, [PCI_TR_DATA]
> > move r1, [PCI_TR_DATA]
> > move [MEM_READQ_DATA], r0
> > move [MEM_READQ_DATA], r1
> > move r0, [PCI_TR_DATA]
> > move r1, [PCI_TR_DATA]
> > ...
> >
> > or some similar trick. That complicates the computed jump, but since
> > this code will only be used when we have 16 to fetch, I don't think it's
> > that bad.
>
> Yes, although in that case you would have to wait until the queue had
> exactly 16 entries in it or perhaps 8 and repeat. This is definitely
> going to be somewhat painful.
Or, we adjust the current code to round down the count to an even
number.
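Assuming the count has already been fetched into a register, rounding
down to an even number takes two shifts (a sketch; it assumes "shift"
accepts a negative count for right shifts):

        shift r0, -1, r0        ; drop the low bit ...
        shift r0, 1, r0         ; ... so r0 is rounded down to even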
> Can you forward from MEM to MEM?
Assuming your first MEM is chip RAM and the second MEM is scratch space?
That could be a way to implement 1 word reads: Transfer to scratch and
then just look up the right word. However, it's probably not the most
efficient, since it takes two instructions per read, even when the read
data is thrown away.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)