Hi guys,

On Wednesday, 26. November 2003 18:54, you wrote:
> On Wed, 2003-11-26 at 00:51, Alan Cox wrote:
> > On Maw, 2003-11-25 at 20:26, Michel Dänzer wrote:
> > > * Our drivers do something which makes newer chips perform
> > >   very poorly with PCI GART, be they AGP or PCI
> > >
> > > The former wouldn't necessarily say anything about PCI cards, but
> > > I'm not sure how to determine which it is (and what exactly the
> > > actual cause is).
> >
> > Is there code reading from PCI space as it builds stuff or uploads
> > textures. Anything that ends up doing
> >
> > *pciaddr=foo
> > [blah]
> > if(x)
> > 	*pciaddr|=FOOFLAG;
> >
> > kills your performance on PCI bus.
>
> Thanks for the idea. I can't think of any code like this offhand. Not
> that this means a lot :), but it wouldn't explain the performance gap
> between newer and older chips, would it?
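To make the quoted point concrete: the |= forces a read from PCI space, and it is the read that hurts, because the CPU stalls until the card completes the target access. Here is a minimal sketch of the pattern and of the usual fix, keeping a shadow copy of the register in system RAM so nothing ever has to be read back across the bus (register name, flag and offsets below are purely hypothetical, not taken from any real driver):

#include <linux/types.h>
#include <asm/io.h>

#define FOO_CTRL	0x0040		/* hypothetical register offset */
#define FOOFLAG		(1 << 3)	/* hypothetical flag bit        */

/* Slow: the |= turns into a PCI read followed by a PCI write.
 * The read stalls the CPU until the target completes the cycle. */
static void set_flag_slow(void *mmio, int x)
{
	writel(0xdeadbeef, mmio + FOO_CTRL);
	if (x)
		writel(readl(mmio + FOO_CTRL) | FOOFLAG, mmio + FOO_CTRL);
}

/* Better: keep a shadow copy of the register in normal memory and only
 * ever post writes to the card (assuming the register has no per-write
 * side effects). */
static u32 foo_ctrl_shadow;

static void set_flag_fast(void *mmio, int x)
{
	foo_ctrl_shadow = 0xdeadbeef;
	if (x)
		foo_ctrl_shadow |= FOOFLAG;
	writel(foo_ctrl_shadow, mmio + FOO_CTRL);
}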
It could. As someone who designed and debugged the PCI programming model for a high-speed wireless LAN card from scratch, let me generalize this a bit and give you some insight into what actually goes on in a PCI transfer.

PCI uses a transport protocol which seems very simple at first glance, but it has several gotchas that can easily kill performance or cause misbehaviour on different machines. Whenever a transfer is initiated, there is a "master" and a "target"; basically all PCI devices can be both. During a control access (say, configuring a control register on the device), the CPU is the master and the addressed card is the target. This is commonly referred to as a "target access". During a DMA transfer, no matter whether it goes FROM or TO a peripheral device (say, an ethernet board in RX or TX direction), the peripheral device is the master (generally referred to as the busmaster, in a so-called "master" or "busmaster" access), and the host bridge on the mainboard plays the role of the target, piping the data from (ethernet example: TX direction) or to (RX direction) the main memory banks.

Since all modern peripherals that have to move large chunks of data support the latter (busmaster) mode of operation, host bridges are usually not optimized for quick target-mode accesses and tend to insert unnecessary wait cycles or limit burst transfers to only a few (<16) dwords of data. The target device can force waits and disconnects too, and therefore slow down such an operation as well. Depending on how much care the designer of the hardware in question put into the device (particularly into streamlining the read/write register pipelines), accesses will be slower or faster.

What happens on a standard MEMORY WRITE PCI access is the following (there are also IO read/write accesses, used by the inl/outl commands on i386, which are not explained in this mail):

1.) The master must request the bus if it is not already granted. It pulls a slot-specific REQUEST signal that tells the bus arbiter (which sits on the mainboard) that the device needs the bus.

2.) The bus arbiter grants access to the bus (after some time, if the bus is already in use) by pulling a GRANT signal. These two steps are omitted if the bus is "PARKED" at the device that wants it; PARK means the GRANT signal is active without a REQUEST. Normally the bus is parked at the host bridge for CPU accesses.

3.) Now that the master (for example the host bridge, i.e. the CPU) has the bus, it drives the address signals and the so-called FRAME signal, telling every other device on the bus which address it wants to write.

4.) PCI uses a distributed address decoding architecture, i.e. each device checks the address and decides whether it itself is addressed. (The address range of a device is determined with so-called config accesses, which are not the subject here.) When a device recognizes that it is addressed, it responds with the DEVSEL signal. It has a couple of bus-clock cycles to do so (I forget how many exactly, see the PCI spec for details); some devices are fast, some are slower. If a target responds with DEVSEL, the transaction proceeds. If not, a bus error exception is generated by the chipset (on decent hardware at least, assuming the master is the CPU/host bridge) and the transfer is aborted.

5.) The target device must have latched and decoded the requested address by now, so the address lines can be reused as data lines (PCI uses the same pins for address and data). The master applies the dword to be written (only 32-bit = dword transfers are considered here) to the A/D pins and pulls the IRDY (initiator ready) signal low to show that the data is valid. The target, when ready, pulls the TRDY (target ready) signal low; according to PCI 1.x and later, it has 16 cycles to do so. If both IRDY and TRDY were asserted during a bus cycle, that data word was written. A burst can be performed if the master simply keeps FRAME asserted, applies the next data word and keeps IRDY asserted; the target, again, can either accept immediately or introduce more wait cycles via TRDY. The target can abort the transfer (and MUST do so if it cannot guarantee the 16-wait-cycles-max limit) using another signal, STOP. In that case the bus transfer ends, and if the master still has data waiting it must initiate another transfer, including a new address cycle (and re-arbitration first, if the arbiter meanwhile revoked its bus grant). This process of the target ending a cycle with or without accepting the current data word is called a "target disconnect with/without data". A target can even block the very first data phase using STOP; a so-called retry must then occur, i.e. the master must retry until it succeeds. If the target NEVER accepts the data, the retries run forever, the bus locks up, and the CPU (assuming it is the master, via the host bridge) hangs on a single assembler instruction. On professional hardware there is sometimes a counter in the host bridge that counts the retries; if a transfer has not completed within a reasonable number of them (say 2^16 or so), the transfer is discarded and a system exception is generated, so the OS can disable the device and its driver. But this is very uncommon on PC-crap hardware.

6.) The transfer ends normally when the master drops the FRAME signal; the cycle after that is then the last one.
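In driver terms, the "target access" above is what every readl()/writel() on a mapped BAR turns into (CPU/host bridge as master, card as target), while the busmaster case is what you set up when you hand the card a bus address and let its own DMA engine fetch the data. A rough sketch of the two, with hypothetical register offsets and buffer size and all error handling omitted:

#include <linux/pci.h>
#include <linux/slab.h>
#include <asm/io.h>

/* Target accesses: the CPU/host bridge is the master, the card is the
 * target.  A read stalls the CPU until the card finally asserts TRDY. */
static void poke_registers(struct pci_dev *pdev)
{
	void *mmio = ioremap(pci_resource_start(pdev, 0),
			     pci_resource_len(pdev, 0));

	writel(0x1, mmio + 0x00);	/* hypothetical register offsets */
	(void)readl(mmio + 0x04);	/* the read is the expensive one */

	iounmap(mmio);
}

/* Busmaster transfer: the card is the master, the host bridge is the
 * target, and the card's DMA engine pulls the buffer out of main memory. */
static void start_dma(struct pci_dev *pdev, void *mmio)
{
	void *buf = kmalloc(4096, GFP_KERNEL);
	dma_addr_t bus_addr;

	pci_set_master(pdev);		/* let the card request the bus */
	bus_addr = pci_map_single(pdev, buf, 4096, PCI_DMA_TODEVICE);

	writel((u32)bus_addr, mmio + 0x10);	/* hypothetical: where    */
	writel(4096,          mmio + 0x14);	/* hypothetical: how much */
	writel(0x1,           mmio + 0x18);	/* hypothetical: go       */
}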
Notes:

- Read access is similar, only the side that drives the address/data bus changes.

- A lot of factors can influence burst speed by introducing wait cycles: DEVSEL timing, IRDY and TRDY non-assertion, and forced retries. The latter is especially common when register writes to (or, even worse, reads from) a PCI device hit a register that actually lives behind a clock domain boundary. To cross a clock domain safely, the data and strobe/acknowledge signals must be resampled several times through master/slave flipflops, which costs cycles.

- Host bridges are optimized for being the SLAVE when large chunks of data need to be transported. Hence the popularity of busmaster operation for peripherals.

- Even in slave mode they usually abort bursts after a transfer of 16 or so dwords, so only 75-95% of the theoretical PCI bus throughput is reached.

- Peripheral devices are usually optimized for being the MASTER during bulk data transfers, i.e. for being busmasters, but even then they can introduce grave amounts of wait cycles.

- A device that can be a master must, even while acting as a master, always be able to accept a target access. This can happen if its own master cycle is aborted by the target, it loses the bus, and another master performs a target access on its registers. It may not block that access until its own transaction has finished, otherwise you get a deadlock. This can happen with broken devices if they are not treated (= programmed) carefully.

- The length of a burst can also be limited by the so-called latency timer. After the arbiter removes the GRANT signal from the master's slot pin, the master need not give up the bus immediately; it may continue the transfer for up to (and including) the latency timer's value in cycles. The latency timer is a register in config space and is written by the OS. During boot the OS should read the device's "latency request" register and program the latency timer accordingly, after considering possible real-time requirements and the other master-capable devices on the bus (see the sketch below this list).

- On PC-crap hardware it is common that the arbiter on the mainboard removes the bus GRANT immediately after FRAME has been asserted by the new master, i.e. as soon as a transaction has been initiated.
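As a concrete illustration of that last pair of notes, here is a minimal sketch of how a Linux driver could inspect and adjust the latency timer from its probe routine. It assumes the "latency request" register mentioned above is the standard MIN_GNT config field, and the value 64 is only an example, not a recommendation:

#include <linux/kernel.h>
#include <linux/pci.h>

/* Hypothetical probe-time helper: look at what the device asks for
 * (MIN_GNT, in units of 250 ns at 33 MHz) and at what the firmware put
 * into the latency timer, then bump the timer if it looks too small. */
static void check_latency(struct pci_dev *pdev)
{
	u8 min_gnt, lat;

	pci_read_config_byte(pdev, PCI_MIN_GNT, &min_gnt);
	pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &lat);

	printk(KERN_INFO "PCI latency: MIN_GNT=%u, latency timer=%u\n",
	       min_gnt, lat);

	if (lat < 64)			/* example threshold/value only */
		pci_write_config_byte(pdev, PCI_LATENCY_TIMER, 64);
}

On many platforms pci_set_master() already programs a sane default latency timer, so in practice this is more useful for diagnosing a slow board than for tuning one.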
- A word about memory/cache consistency. Sooner or later every device driver programmer who works with busmaster-capable hardware on a PC-crap platform will run into the following problem/symptom:

  Memory data ->
    offset 0x0000: 0000 0000 0000 0000 0000 0000 0000 0000
           0x0010: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF

  1.)  Host CPU writes 0x0000 through 0x000F. Say this is a complete command packet for a SCSI controller or something.
  1a.) Host CPU tells the device to advance exactly one command packet, i.e. 16 bytes.
  2.)  The busmaster device reads 0x0000 through 0x000F in a burst transfer, using a 4-dword burst.
  3.)  Host CPU writes the next command packet at 0x0010:
           0x0010: 1111 1111 1111 1111 1111 1111 1111 1111
  3a.) Host CPU tells the device to read the next command packet, starting at 0x0010.
  4.)  Surprise! The busmaster device reads the next command packet, gets the old contents (the 0xFFFF stuff) and behaves accordingly, thus overwriting the holy MP3 collection and discarding the beloved Quake3 high score list that was meant to be presented to the guys at the next LAN party...

  What happened? A lot of things are possible. The simplest possibility is that the CPU only wrote to its cache (write-back mode) and the host bridge, which was the target during the busmaster transaction, fetched the data directly from the memory bank. This cannot happen on i386, though, because the architecture guarantees cache/memory/PCI consistency, even down to the CPU's L1 cache. (It's usually the first thing that fails when overclocking the CPU bus, by the way, but I won't get into that.) However, the same effect is sometimes caused by a buggy host bridge that thinks it can prefetch the second 16 bytes into an internal buffer under some conditions. Many cardbus bridges whose developers thought they were especially smart prefetch the second 16 bytes, too, speculating that the external device will request a read of them afterwards. This saves a lot of wait cycles, because then the bridge does not have to arbitrate the internal PCI bus and forward the request to the memory/cache controller. This behaviour shows up especially often with PCI masters that keep the FRAME signal asserted one cycle too long during a read burst and discard the last dword read. As a programmer you cannot change the circuit's behaviour (unless the bridge's parameters happen to be software-configurable), so you need to work around the problem, which may occur only once in a while, on certain hosts, at a particular moon phase.
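To make the handoff concrete, here is a rough sketch of the sequence above as driver code (consistent DMA memory from pci_alloc_consistent(), a hypothetical doorbell register at offset 0x20, no locking or error handling). The point is that even this "correct" code can lose: a bridge that speculatively prefetched bytes 0x10-0x1F during step 2 will hand the device stale data in step 4, and no barrier on the CPU side helps.

#include <linux/pci.h>
#include <linux/string.h>
#include <asm/io.h>
#include <asm/system.h>		/* wmb() on 2.4/2.6-era kernels */

#define CMD_SIZE	16

struct cmd_ring {
	void		*virt;	/* CPU view, from pci_alloc_consistent() */
	dma_addr_t	 bus;	/* the address the busmaster device sees */
	unsigned int	 next;	/* byte offset of the next packet        */
	void		*mmio;	/* BAR mapping for the doorbell write    */
};

/* Write one 16-byte command packet and tell the device to fetch it. */
static void submit_cmd(struct cmd_ring *ring, const void *cmd)
{
	memcpy(ring->virt + ring->next, cmd, CMD_SIZE);		   /* step 1/3   */
	wmb();						/* order the memory write */
	writel((u32)(ring->bus + ring->next), ring->mmio + 0x20);  /* step 1a/3a */
	ring->next += CMD_SIZE;
	/* Step 2/4: the device busmasters the packet out of main memory.
	 * If the bridge already prefetched the following 16 bytes during
	 * the previous burst, the device now sees stale data anyway. */
}

Given that, the workaround has to come from the driver side.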
One of the following might help:

- additional target accesses to the device before triggering the next command read
- somehow leaving a "guard interval" of several cache lines between command packets (of course only implementable with a linked-list structure or variable-length packets)
- IO (inl/outl) accesses to the device
- a bridge config read/write in between, which might cause an internal cache flush in the case of the cardbus-bridge problem

Now I wrote much more than I originally intended to. Anyway, I hope I gave you guys a little insight into what actually goes on on the PCI/AGP bus when the CPU executes your driver code. I have high respect for you having come this far without access to expensive logic analyzer equipment to debug your code.

Keep up the good work and thank you,

-- Jens

-- 
Jens David, DG1KJD
Email: [EMAIL PROTECTED]
http://www.afthd.tu-darmstadt.de/~dg1kjd
Work: +49 351 80800 527 --- Home/Mobile: +49 173 6394993