On Fri, Aug 22, 2008 at 9:44 AM, Petter Urkedal <[EMAIL PROTECTED]> wrote:

>> The former could be implemented in microcode.  What you could do is
>> have an entry point in the microcode whose job is to initialize.  It
>> traps all engine and memory access and responds to certain commands
>> that arrive in the form of engine access, except that in this mode it
>> doesn't pass anything through.  It just effectively maps the engine
>> space to the scratch buffer by grabbing addresses and data and
>> servicing the reads and writes out of the whole scratch space.  Then
>> when some particular address outside of that range is accessed, it
>> bails out into the "real" main loop that can poll PCI properly.  If
>> the size of the code to handle this is smaller than the size of the
>> code that would be necessary to fill it from program immediates, then
>> it's a win, especially if you're tight on program space.
>
> Yes, that's another option; I'm not sure which is best.  My suggestion
> of mapping scratch space to an IO port range has the advantage that it
> can also be used after initialisation to update parameters.  But, if the
> whole IO address space is dictated by the VGA specs, my solution won't
> work.
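To make what I was proposing concrete, here's a behavioral sketch of that init mode (Python standing in for microcode; the address range, scratch size, and tuple encoding are all made up for illustration):

```python
# Behavioral sketch of the proposed init mode: trap all engine/memory
# accesses and service them out of the scratch buffer, until an access
# outside the trapped range bails out into the "real" main loop.
# All constants here are illustrative, not the real HQ address map.

SCRATCH_SIZE = 256          # hypothetical scratch buffer size (words)
ENGINE_BASE  = 0x0000       # hypothetical start of trapped engine space
ENGINE_TOP   = ENGINE_BASE + SCRATCH_SIZE

def init_mode(access_stream, scratch):
    """Service reads/writes out of scratch until an out-of-range access.

    access_stream yields (is_write, addr, data) tuples; returns the
    access that caused the bail-out so the main loop can handle it.
    """
    for is_write, addr, data in access_stream:
        if not (ENGINE_BASE <= addr < ENGINE_TOP):
            return (is_write, addr, data)   # bail out to the main loop
        idx = addr - ENGINE_BASE
        if is_write:
            scratch[idx] = data             # host fills scratch this way
        # reads are serviced out of scratch; nothing is passed through
    return None
```

The point is that the whole trap loop is just a range check plus an indexed store, which is why it can come out smaller than filling scratch from program immediates.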

There's some confusion.  When I refer to I/O space, I mean the
non-scratch memory space as seen by HQ, like how we access fifos.

It occurs to me that, as long as HQ is either off or not accessing
scratch space at a given moment, we could at least write to it from
the host.  What we need is a fifo that hooks into the MEM stage.  The
MEM stage would look at the instruction word and the fifo.  If the
instruction has to access the memory, then that gets serviced.
Otherwise, it looks at the fifo.

But that's only if we're using both ports on the RAM.  If we're using
only one, then this is trivial... just hook to the other port, using
the same mechanism that we use to write to the program file from the
host.
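Roughly, the MEM-stage priority I have in mind looks like this (a Python sketch; the instruction always wins the port, and all names here are invented):

```python
# Sketch of the MEM-stage arbitration described above: an instruction
# that needs the memory port wins; otherwise a pending host write from
# the fifo gets the port that cycle.  Names and encodings are invented.
from collections import deque

def mem_stage_cycle(insn_uses_mem, insn_op, host_fifo, scratch):
    """One MEM-stage cycle.  insn_op is (is_write, addr, data) or None.

    host_fifo holds (addr, data) writes queued by the host.  Returns
    which requester was serviced this cycle: 'insn', 'host', or 'idle'.
    """
    if insn_uses_mem:
        is_write, addr, data = insn_op
        if is_write:
            scratch[addr] = data
        return 'insn'                 # instruction has priority
    if host_fifo:
        addr, data = host_fifo.popleft()
        scratch[addr] = data          # host write sneaks into the idle slot
        return 'host'
    return 'idle'
```

Since the instruction always wins, host writes only drain on cycles where HQ isn't touching scratch, which is exactly the "either off or not accessing scratch" condition above.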

> I think either solution will save us program memory, considering how
> expensive it is to initialise using immediates:
>
>        move HIGH_BITS, r0 ; Adjust HIGH_BITS if LOW_BITS are negative!
>        shift r0, 16, r0
>        add r0, LOW_BITS, r0
>        move r0, [GLOBAL]

I agree.  Let's hook up the scratch memory to PCI.
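For reference, picking HIGH_BITS and LOW_BITS so that the move/shift/add sequence above reproduces a given 32-bit value works like this (a sketch assuming 16-bit immediates where the add sign-extends; the helper names are made up):

```python
def split_imm32(value):
    """Split a 32-bit value into (high, low) such that
    ((high << 16) + sign_extend16(low)) & 0xFFFFFFFF == value.
    If low's sign bit is set, high must be bumped by one to cancel
    the sign extension -- hence the "Adjust HIGH_BITS" comment above.
    """
    low = value & 0xFFFF
    high = (value >> 16) & 0xFFFF
    if low & 0x8000:                  # low will sign-extend negative
        high = (high + 1) & 0xFFFF    # pre-adjust to compensate
    return high, low

def rebuild(high, low):
    """What the move/shift/add sequence actually computes."""
    sext = low - 0x10000 if low & 0x8000 else low
    return ((high << 16) + sext) & 0xFFFFFFFF
```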

>>  What we could do there is run the pci-program-microcode logic at
>> engine speed, then we can have it listen to a fifo (how PCI fills the
>> program file) and also to part of the I/O space.
>
> I think this would require its own hardware logic, since the two BRAMs
> are accessed from different parts of the RTL.

The RAMs are dual-ported.  One port is accessed in the fetch stage,
while the other is accessed in the MEM stage.

>> Yeah.  But we're going to end up with a lot more ports anyhow.  If
>> there's no use in having the combined port, then ditch it.  If we find
>> a use for it, we can put it back later.
>
> Good, I have prepared to commit this to the port decode:
>
>        PCI_T_CMD_TYPE:
>            hqio_inport = pci2hq_cmd_type & {32{pci2hq_cmd_valid}};
>        PCI_T_CMD_FLAGS:
>            hqio_inport = pci2hq_cmd_flags;
>
> Then we also have the bit to avoid checking PCI_T_CMD_COUNT, as
> discussed.  I'm assuming pci2hq_cmd_valid is the same as
> pci2hq_cmd_count != 0.

Yes.  valid means count != 0.  But the problem is that now you can't
dequeue a null command.  Have we decided not to do the null command?

>>
>> True, but that bothers me.  If we change granularity, then we'll have
>> to do a shift anyhow.
>
> The granularity of a jump table is fixed by the constraints of the HQ
> architecture, but if we switch from a jump table to a table lookup,
> then yes.  So, we should make up our mind which of the two is best.
> The if-then-else-if variant doesn't care.

The jump table approach appeals to me because it's so much more
flexible.  I think we should definitely have host access to scratch
(at least for writing).

Also, we do have more block RAMs available for program and scratch.
It just means more MUXing after the registered outputs.

>> Don't you have an instruction bit that specifies whether or not a
>> write-back is allowed?  Can you turn that off in the noop?
>
> The instruction word wasn't wide enough for a dedicated disable bit, but
> write-back is disabled for stores, and branch instructions have a
> write-back disable bit.  Branch instructions are not suitable for noop,
> but we can use a store:
>
>        move r0, [-1]
>
> That is, we reserve port -1 for this purpose.

As long as it never corresponds to anything, that's fine.  Hey, how
about a branch instruction that always comes out false?  How would
that work?  It would be odd to have a branch in a delay slot... but
this is one that would always do nothing.  What would happen?

>> I didn't notice a check to make sure the bridge isn't busy.  The
>> situations where it matters are kinda rare, but you could have some
>> long video read going on that clogs up the memory system, so the queue
>> in the S3 fills, making the bridge busy, and then you can't issue
>> writes or read requests. (I'm not sure if it's okay to issue
>> addresses.)
>
> But we are handling all access to memory, right?  So if there is a long
> video read, HQ will be stuck in a loop serving it.

I meant a long video read being done by the video controller in the
S3.  We're only intercepting PCI access to memory.

>>
>> What will be?  Memory when the cache is off?  Then yes.  Otherwise, no.
>
> Sorry, I was referring to the VGA graphics mode (cache off).  I just want
> to be sure there are no cases where we receive a read count other than 1
> or 16.  You already said that, but what confused me was that you
> mentioned 8 word reads above.

8 would just be the minimum you can fetch.  So if you wanted to cache,
you could cache 8 words.  Otherwise, throwing them away is fine.

>> Yes, although in that case you would have to wait until the queue had
>> exactly 16 entries in it or perhaps 8 and repeat.  This is definitely
>> going to be somewhat painful.
>
> Or, we adjust the current code to round down the count to an even
> number.

Yes, unless you get a count of 1, which you need to handle separately.
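Something like this is what I'd expect the count handling to reduce to (a sketch only; the 8/16 burst sizes come from this thread, but the splitting policy and the function name are my guesses):

```python
def plan_bursts(count):
    """Split a read count into bursts: 16s and 8s where possible,
    with the count rounded down to an even number first, and a
    trailing or lone count of 1 flagged for separate handling.
    """
    if count == 1:
        return ['single']             # must be handled separately
    bursts = []
    even = count & ~1                 # round down to an even number
    leftover = count - even           # 0 or 1
    while even >= 16:
        bursts.append(16)
        even -= 16
    while even >= 8:
        bursts.append(8)
        even -= 8
    if even:
        bursts.append(even)
    if leftover:
        bursts.append('single')       # trailing odd word
    return bursts
```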

>> Can you forward from MEM to MEM?
>
> Assuming your first MEM is chip RAM and the second MEM is scratch space?
> That could be a way to implement 1 word reads:  Transfer to scratch and
> then just look up the right word.  However, it's probably not the most
> efficient, since it takes two instructions per read, even when the read
> data is thrown away.

I meant I/O to I/O.  But never mind.  I think that what we really
want, ultimately, is a way to tell the fifos to move data in the
background.  This will be useful immediately for returning read data,
but later, it'll be critical for DMA.


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
