On 2008-07-19, Timothy Normand Miller wrote:
> [...] We need to consider the consequences
> of having HQ intercept EVERY access to the bridge.
So if I understand this correctly, the current plan is full-intercept or
no intercept, but it's something we may need to reconsider. I guess for
VGA, full-intercept is okay since most data is translated, but if we use
HQ in GPU mode, then full-intercept would be a major bottleneck.
> Adding in HQ, some things change. The command queue (write data and
> read requests) can simply be reconnected to HQ (but with a bypass too,
> of course). A read return queue from HQ to PCI is only actually
> necessary because we're crossing clock domains; otherwise, it's wasted.
> HQ writes (and read commands) to the bridge could be direct, but the
> bridge can go busy, for instance if video is making a long read and
> the queues in the S3 fill, so we should have a queue there whose
> fullness we can read in software to decide whether to push writes (and
> read commands). Read data from the bridge to HQ has to be queued so
> that HQ can request reads, go off and do something else, then come
> back and read the data (or wait for it anyhow).
>
> So that leaves us with four queues:
>
> PCI write/cmd ---> HQ
> HQ write/cmd ---> bridge write
> bridge read data ---> HQ
> HQ read data --> PCI
>
> With hop-overs when HQ is disabled.
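As a sanity check on the routing, here is a toy software model of those four queues plus the hop-overs. All names are mine, not from the RTL, and the real things would of course be clocked hardware FIFOs:

```python
from collections import deque

class QueuePlumbing:
    """Toy model (hypothetical naming) of the four queues:
    PCI write/cmd -> HQ, HQ write/cmd -> bridge,
    bridge read data -> HQ, HQ read data -> PCI,
    with hop-overs when HQ is disabled."""

    def __init__(self, hq_enabled=True):
        self.hq_enabled = hq_enabled
        self.pci_to_hq = deque()
        self.hq_to_bridge = deque()
        self.bridge_to_hq = deque()
        self.hq_to_pci = deque()

    def pci_issue(self, cmd):
        # With HQ disabled, PCI traffic hops over straight to the bridge queue.
        (self.pci_to_hq if self.hq_enabled else self.hq_to_bridge).append(cmd)

    def bridge_return(self, data):
        # Likewise, returning read data hops over HQ straight back toward PCI.
        (self.bridge_to_hq if self.hq_enabled else self.hq_to_pci).append(data)
```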
That looks quite straightforward, so we could go with this for now. If
HQ does not run on the bridge clock, would it still be feasible to
re-use the PCI FIFOs for HQ? That is, can one end be switched between
the clock domains of HQ and the bridge?
PCI <===>|-----------|--> bridge ("<===>" means FIFO,
|-> HQ <===>| "|" indicates bypass/intercept switch)
But there is no way to do the clock-switch properly, is there?
> > I assume memory access goes through the bridge. So, we must extend
> > xp10_bridge_wrapper.v with an additional internal interface for HQ
> > memory operations. If we need high throughput, is there any
> > alternative to two extra FIFOs using two BRAMs? And yet another two
> > for PCI?
>
> Because of the combined need to be asynchronous and to cross clock
> domains, I can't see an alternative. These can just be 16-entry
> distributed-RAM FIFOs.
Good, so no BRAM is needed to cross clock domains.
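For what it's worth, the usual way such small asynchronous FIFOs cross clock domains is by synchronizing Gray-coded read/write pointers between the two sides, since adjacent Gray codes differ in exactly one bit and so a pointer sampled mid-change is off by at most one entry. Just the encoding, not the RTL:

```python
def bin_to_gray(n):
    # Gray code: consecutive values differ in exactly one bit.
    return n ^ (n >> 1)

def gray_to_bin(g):
    # Inverse: fold the Gray bits back down with repeated XOR.
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# For a 16-entry FIFO the pointers would be 5 bits wide (the extra
# wrap bit is what distinguishes full from empty).
```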
> > Since HQ's BRAM has an unused port with its own clock domain, it may
> > be possible to let the bridge read and write data directly to HQ
> > memory. That is, for memory-write, HQ prepares the data in a
> > subrange of its BRAM, and tells the bridge to transmit the range to
> > a given memory
> > address. For memory-read, HQ tells the bridge to transfer a memory
> > range to a BRAM range. That could also work for PCI, though we'd need
> > to extend HQ internal memory with another BRAM due to the separate clock
> > domains.
>
> That could be very useful, for performance and more asynchrony.
> However, the bypass won't work that way, so we'd have to implement
> both mechanisms. We should start with the dumber one that works with
> bypass and see if we can really benefit from the optimization
> afterwards.
So, data always passes through HQ's pipes and clock domain even in
bypass mode? That solves the clock-switching issue. It adds latency in
bypass mode, but it's probably negligible overall.
> BTW, there are some facts about the bus protocol that we might want to
> change. When accessing the bridge, the first cycle is the address,
> and the flag bits indicate the target (memory or config registers).
These flag bits sound like they would fit naturally as the highest bits
of the address.
> For reads, the subsequent cycle is the word count, after which the bus
> switches direction and waits.
>
> For writes, subsequent cycles are data, flags indicate which bytes are
> valid, and the address auto-increments.
So, these flags can't be combined with the other data. I guess the
common case is that all are 1, so shall we
* send an optional byte-enable word before the write, defaulting to
  1111, which then applies to all data, or
* add a write mode where byte-enables and data are interleaved?
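To compare the two options, here is a toy model of what a write burst would look like on the bus under each; the framing and names are made up for illustration:

```python
def burst_sticky_be(addr, words, be=0b1111):
    """Option 1: one optional byte-enable word up front (default 1111)
    that applies to every data word in the burst."""
    return [("addr", addr), ("be", be)] + [("data", w) for w in words]

def burst_interleaved_be(addr, pairs):
    """Option 2: a write mode where each data word is preceded by its
    own byte-enable, doubling the cycle count when enables vary."""
    out = [("addr", addr)]
    for be, w in pairs:
        out += [("be", be), ("data", w)]
    return out
```

So for N data words, option 1 costs N + 2 cycles regardless, while option 2 costs 2N + 1; option 1 wins whenever the enables are uniform across the burst.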
> The address counter in the S3 auto-increments, but it only increments
> the lower 7 bits of the word address. So every 128 32-bit words, it's
> required that a new address be sent. That happens automatically with
> PCI due to the way this target is designed, but HQ will have to
> enforce it in the program.
I think we can manage that.
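That is, HQ's program would split any long burst so it never crosses a 128-word boundary, re-sending the address for each piece. A sketch of that split (my code, not the actual HQ program):

```python
def split_at_128(word_addr, count):
    """Split a burst of `count` 32-bit words starting at word address
    `word_addr` into chunks that each stay inside one 128-word window,
    since the S3 only auto-increments the low 7 bits of the word
    address. A fresh address must be sent for each chunk."""
    chunks = []
    while count > 0:
        room = 128 - (word_addr & 0x7F)  # words left before the low 7 bits wrap
        n = min(room, count)
        chunks.append((word_addr, n))
        word_addr += n
        count -= n
    return chunks
```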
> One thing we may want to change is how the target flags are presented.
> Right now, they're separate from the address, but the address isn't
> 32 bits, so they could be prepended. However, it may actually be
> faster to make them separate, potentially saving some HQ code to
> extract them.
I'm not sure either. If the flags are encoded in the address in such a
way that they do not affect the use of the address, and if the common
usage is to test flags individually, then combining flags and
addresses can save register usage and fetch commands.
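For instance, if the flags rode in high bits the address doesn't use (the widths and bit assignments below are made up, since I don't know the real address width or flag layout):

```python
FLAG_SHIFT = 28            # hypothetical: address fits in the low 28 bits
ADDR_MASK = (1 << FLAG_SHIFT) - 1

def pack(flags, addr):
    # Combine target flags and address into one word.
    assert addr <= ADDR_MASK
    return (flags << FLAG_SHIFT) | addr

def target_is_config(word):
    # A flag can be tested without first extracting the address
    # (hypothetical: bit 28 selects config registers vs. memory).
    return bool((word >> FLAG_SHIFT) & 0x1)

def addr_of(word):
    # The address is usable as-is once the flag bits are masked off.
    return word & ADDR_MASK
```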
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)