On 2008-08-22, Timothy Normand Miller wrote:
> On Fri, Aug 22, 2008 at 1:42 PM, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> >> But that's only if we're using both ports on the RAM.  If we're using
> >> only one, then this is trivial... just hook to the other port, using
> >> the same mechanism that we use to write to the program file from the
> >> host.
> >
> > Okay, that'll work.  Maybe it's not a big issue, but for ASIC it may
> > incur more logic since we could have gotten away with single-ported
> > memory.
> 
> Certain kinds of ASICs have dedicated RAM blocks too, and they're
> typically dual-ported.

Fair enough.

> >> The RAMs are dual-ported.  One port is accessed in the fetch stage,
> >> while the other is accessed in the MEM stage.
> >
> > Yes, but this does not have to be efficient at all, right?  Why not use
> > a HQIO handler?  That is, no changes to the RTL.
> 
> Yeah, that's basically what I had in mind, although I thought of it
> more like how we access scratch memory.
> 
> With the use of a fifo, we can have both data access to the program
> file from microcode as well as host access to the scratch memory.

Yes, using the dual ports, it's not as expensive as I first thought.
(It will prevent us from using that other port for direct bridge-to-scratch
transfers, as I proposed before we integrated HQ into the bridge
wrapper, but we probably won't take that path anyway.)

> For the moment, let's just do the easy thing, which is to hook the
> unused port of the scratch memory up to the same top-level HQ module
> ports that let us load the program file.  Just one extra address bit
> and an extra gate on each write-enable.
> 
> If space gets tight, we can look into program access to the program
> file, via the HQIO handler.

I think we're still speaking different languages, but I'm not sure where
the misunderstanding lies.  When I refer to the HQIO handler, I mean a
fragment of HQ-code which handles PCI target commands addressed to
TARGET_IO (as defined in pci_address_decode.v).  It has been my
assumption that TARGET_IO is the mechanism by which the host
communicates with and controls HQ.  Maybe that's my misunderstanding.
I have not at all considered accessing the program file from HQ code,
and I don't see how an HQIO handler, which is a host-to-HQ channel,
could do that.
 
> >> >> Yeah.  But we're going to end up with a lot more ports anyhow.  If
> >> >> there's no use in having the combined port, then ditch it.  If we find
> >> >> a use for it, we can put it back later.
> >> >
> >> > Good, I have prepared to commit this to the port decode:
> >> >
> >> >        PCI_T_CMD_TYPE:
> >> >            hqio_inport = pci2hq_cmd_type & {32{pci2hq_cmd_valid}};
> >> >        PCI_T_CMD_FLAGS:
> >> >            hqio_inport = pci2hq_cmd_flags;
> >> >
> >> > Then we also have the bit to avoid checking PCI_T_CMD_COUNT, as
> >> > discussed.  I'm assuming pci2hq_cmd_valid is the same as
> >> > pci2hq_cmd_count != 0.
> >>
> >> Yes.  valid means count != 0.  But the problem is that now you can't
> >> dequeue a null command.  Have we decided not to do the null command?
> >
> > I didn't think of the dequeuing issue.  But, yes, I don't see a use
> > for a null command if it's just to terminate writes.  Isn't it so
> > that a PCI target write of N words followed by one of M words in a
> > contiguous range should be considered equivalent to a single write
> > of N + M words?  If those are the PCI semantics, then I don't think
> > a termination command carries any meaning.
> 
> All the termination tells you is that the next thing you'll get is an
> address command.  But as you have designed this, you maintain enough
> state that it doesn't matter.  It looks like you grab the address into
> a global, then when you get write commands, you send the address over
> the bridge for the first one, then forward writes until you run out,
> at which point you bail.  If the next thing you get after an idle
> moment is a write command, you send the address again from the global.
> 
> Don't forget to count how many words you sent and increment the
> address.  The fact that you have to do this is unfortunate and makes
> me reconsider having the null command, so we don't have to bother
> storing and incrementing the address.

I have not forgotten the increment, but thanks for the reminder.  I
wouldn't worry about that single instruction to store the address, since
it's not part of the inner write loop.  If the inner write loop ever
exits, then the write commands are coming in slowly enough that we can
keep up anyway, and we could even benefit from exiting poll_pci() to do
some other work.

The address increment, on the other hand, costs us one instruction in
the inner loop.  But remember, we have optimised out two instructions:
one to fetch PCI_T_CMD_COUNT and one to test it for zero.
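For concreteness, here is roughly how I picture that forwarding logic, as
a small Python sketch.  The real thing is HQ-code, of course, and all the
names here (forward_target_writes, the command tuples, "set_addr") are
invented for illustration.  It also shows the point about N + M words:
two back-to-back write bursts continue from where the first left off,
because the address lives in a global and is incremented per word.

```python
# Sketch of the PCI target-write forwarding discussed above.
# All identifiers are illustrative stand-ins, not real HQ-code names.

def forward_target_writes(commands):
    """commands: list of ("addr", a) or ("write", [w0, w1, ...]) events.
    Returns the resulting bridge traffic as ("set_addr", a) and
    ("data", w) tuples."""
    bridge = []
    addr = None
    for kind, payload in commands:
        if kind == "addr":
            addr = payload                     # store address command in a "global"
        elif kind == "write":
            bridge.append(("set_addr", addr))  # send the saved address first
            for word in payload:
                bridge.append(("data", word))
                addr += 1                      # the per-word increment in the inner loop
    return bridge
```

A burst of 2 words followed by a burst of 1 word at the next address
produces the same bridge traffic as one 3-word burst would, apart from
the redundant second set_addr.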

> Yeah.  But who would ever need more than 512 words of program memory?
> Maybe we need 640.  :)

And like the internet will ever catch on. :)

> I vaguely recall from my computer arch. class some statistics about
> delay slots.  They're only usefully filled about 50% of the time.  I
> may have it backwards, but I think you can only put an instruction in
> there 80% of the time, and only about 60% of the time are the results
> of that instruction actually used.  Yeah, I may have that backwards...
> but the 50% is right.

Interesting.

> > some clever tricks with it like executing a single instruction somewhere
> > followed by an immediate jump to some other location.  E.g. the
> > following code executes an arithmetic operation which is encoded in r0,
> > applies it to r1 and r2 and stores the result in r1:
> >
> >        add r0, apply_operator, r0
> >        jump r0
> >          jump cont
> > cont:
> >        ...
> >
> > apply_operator:
> >        add r1, r2, r1  ; If r0 = 0, then add
> >        sub r1, r2, r1  ; If r0 = 1, then subtract
> >        and r1, r2, r1  ; etc
> >        or r1, r2, r1
> >        xor r1, r2, r1
> 
> 
> That sounds clever, although the semantics of a taken jump in the
> delay slot of another taken jump is something I'd have to think
> through.  Are you sure that the instruction at the target of the
> first jump will actually get executed?
> 
> Let's see...
> Jump 1 gets executed while jump 2 gets fetched
> Jump 2 gets executed while target gets fetched
> 
> Ok, so basically, the target of the first jump ends up filling the
> delay slot of the second jump?

Don't worry, I have tested it.  Have you managed to build the
assembler and simulator?  If so, try saving the following as
"test_delay_hack.asm" under tools/oga1hq:

include SIM

        ;; r0 is the operator to apply, r1 and r2 are the operands.
        move 77, r1
        move 70, r2
        move 1, r0
        add r0, apply_operator, r0
        jump r0
          jump cont
cont:
        noop
        noop
        noop
        jump SIM_DUMP
        noop
        jump SIM_HALT
        noop
apply_operator:
        add r1, r2, r1  ; If r0 = 0, then add
        sub r1, r2, r1  ; If r0 = 1, then subtract
        and r1, r2, r1  ; etc
        or r1, r2, r1
        xor r1, r2, r1

Then run

    shell$ ./runsim test_delay_hack.hex
    
This will invoke the assembler to compile the .asm file to .hex, then
run the simulator on it.  Pass the option "-s" to enter the debugger on
startup.
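If you can't build the simulator, the pipeline timing can also be
sketched in plain Python.  This is only a toy model, assuming a single
branch delay slot as you described, with hardcoded jump targets standing
in for the "add r0, apply_operator, r0" computation (here r0 = 1 selects
the "sub" entry):

```python
# Toy model of a machine with one branch delay slot, showing why
# "jump r0" followed by "jump cont" executes exactly one instruction
# at the first jump's target.  Instructions are ("jump", target) or
# ("op", name); this is a sketch, not the real ISA.

def run(program, max_steps=32):
    trace = []        # names of executed "op" instructions, in order
    pc = 0
    pending = None    # target of a taken jump, applied after its delay slot
    for _ in range(max_steps):
        if pc >= len(program):
            break
        kind, arg = program[pc]
        # The instruction after a taken jump (the delay slot) always
        # runs; only then does control transfer to the pending target.
        next_pc = pending if pending is not None else pc + 1
        pending = None
        if kind == "jump":
            pending = arg
        elif kind == "op":
            if arg == "halt":
                break
            trace.append(arg)
        pc = next_pc
    return trace

APPLY, CONT = 5, 3
prog = [
    ("jump", APPLY + 1),  # jump r0  (apply_operator + 1, the "sub" entry)
    ("jump", CONT),       #   jump cont  (in the first jump's delay slot)
    ("op", "unreached"),  # never executed
    ("op", "cont"),       # cont:
    ("op", "halt"),
    ("op", "add"),        # apply_operator:
    ("op", "sub"),
    ("op", "and"),
]
```

run(prog) returns ["sub", "cont"]: exactly one instruction at the first
jump's target executes, filling the second jump's delay slot, before
control lands at cont.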

> >> I meant a long video read being done by the video controller in the
> >> S3.  We're only intercepting PCI access to memory.
> >
> > In the current bridge wrapper HQ intercepts all or nothing.  Will we
> > change that, or will the driver switch modes at will even while the
> > bridge is active reading?
> 
> It's always all or nothing.  HQ becomes a gatekeeper for the bridge
> over to the S3.  It's only safe to switch when there's absolutely
> nothing pending on the bridge.  So we'll need some way, via PCI, to
> tell the microcode to go into a safe state, then we wait until it's in
> that state before changing modes.

Sorry, I misread your original comment.  But in any case, if the video
controller is doing a long read, does it affect HQ in any other way than
delaying the point at which MEM_READQ_AVAIL becomes non-zero?
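To make sure I understand the handshake you describe, here is a rough
Python model of it.  All the names (HqModel, request_quiesce, and so on)
are invented; I'm only modeling the "request safe state, wait until
nothing is pending on the bridge, then switch" sequence:

```python
# Toy model of the mode-switch handshake: the driver asks the microcode
# (via some PCI-visible control bit) to go quiescent, and only changes
# modes once the microcode reports the bridge is idle.  Names invented.

class HqModel:
    def __init__(self):
        self.pending_bridge_ops = 0   # outstanding work on the bridge
        self.quiesce_requested = False
        self.safe = False

    def request_quiesce(self):
        """Host side: set the hypothetical 'go quiescent' control bit."""
        self.quiesce_requested = True

    def step(self):
        """One microcode polling pass: drain pending bridge work, and
        only report the safe state once nothing is left pending."""
        if self.pending_bridge_ops > 0:
            self.pending_bridge_ops -= 1
        elif self.quiesce_requested:
            self.safe = True

def switch_mode(hq, max_steps=16):
    """Host side: request quiesce, poll until safe, then switch."""
    hq.request_quiesce()
    for _ in range(max_steps):
        if hq.safe:
            return True               # now safe to change modes
        hq.step()
    return False                      # gave up waiting
```

The point being that the mode change itself only ever happens after the
microcode has confirmed the safe state, never merely after requesting it.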
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)