Hi guys,

On Wednesday, 26. November 2003 18:54, you wrote:
> On Wed, 2003-11-26 at 00:51, Alan Cox wrote:
> > On Maw, 2003-11-25 at 20:26, Michel Dänzer wrote:
> > > * Our drivers do something which makes newer chips perform
> > >   very poorly with PCI GART, be they AGP or PCI
> > >
> > > The former wouldn't necessarily say anything about PCI cards, but
> > > I'm not sure how to determine which it is (and what exactly the
> > > actual cause is).
> >
> > Is there code reading from PCI space as it builds stuff or uploads
> > textures. Anything that ends up doing
> >
> > *pciaddr=foo
> > [blah]
> > if(x)
> > 	*pciaddr|=FOOFLAG;
> >
> > kills your performance on PCI bus.
>
> Thanks for the idea. I can't think of any code like this offhand. Not
> that this means a lot :), but it wouldn't explain the performance gap
> between newer and older chips, would it?
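To make the quoted point concrete: the |= forces a read from PCI space, and it is the read that hurts, because the CPU stalls until the card completes the target access. Here is a minimal sketch of the pattern and of the usual fix, keeping a shadow copy of the register in system RAM so nothing ever has to be read back across the bus (register name, flag and offsets below are purely hypothetical, not taken from any real driver):

#include <linux/types.h>
#include <asm/io.h>

#define FOO_CTRL	0x0040		/* hypothetical register offset */
#define FOOFLAG		(1 << 3)	/* hypothetical flag bit        */

/* Slow: the |= turns into a PCI read followed by a PCI write.
 * The read stalls the CPU until the target completes the cycle. */
static void set_flag_slow(void *mmio, int x)
{
	writel(0xdeadbeef, mmio + FOO_CTRL);
	if (x)
		writel(readl(mmio + FOO_CTRL) | FOOFLAG, mmio + FOO_CTRL);
}

/* Better: keep a shadow copy of the register in normal memory and only
 * ever post writes to the card (assuming the register has no per-write
 * side effects). */
static u32 foo_ctrl_shadow;

static void set_flag_fast(void *mmio, int x)
{
	foo_ctrl_shadow = 0xdeadbeef;
	if (x)
		foo_ctrl_shadow |= FOOFLAG;
	writel(foo_ctrl_shadow, mmio + FOO_CTRL);
}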
It could. As someone who designed and debugged the PCI programming model for a high-speed wireless LAN card from scratch, let me generalize this a bit and give you some insight into what actually goes on in a PCI transfer.

PCI uses a transport protocol which seems very simple at first glance, but it has several gotchas that can easily kill performance or cause misbehaviour on different machines. Whenever a transfer is initiated, there is a "master" and a "target"; basically all PCI devices can be both. During a control access (say, configuring a control register on the device), the CPU is the master and the addressed card is the target. This is commonly referred to as a "target access". During a DMA transfer, no matter whether it goes FROM or TO a peripheral device (say, an ethernet board in RX or TX direction), the peripheral device is the master (generally referred to as the busmaster, in a so-called "master" or "busmaster" access), and the host bridge on the mainboard plays the role of the target, piping the data from (ethernet example: TX direction) or to (RX direction) the main memory banks.

Since all modern peripherals that have to move large chunks of data support the latter (busmaster) mode of operation, host bridges are usually not optimized for quick target-mode accesses and tend to insert unnecessary wait cycles or limit burst transfers to only a few (<16) dwords of data. The target device can force waits and disconnects too, and therefore slow down such an operation as well. Depending on how much care the designer of the hardware in question put into the device (particularly into streamlining the read/write register pipelines), accesses will be slower or faster.

What happens on a standard MEMORY WRITE PCI access is the following (there are also IO read/write accesses, used by the inl/outl commands on i386, which are not explained in this mail):

1.) The master must request the bus if it is not already granted. It pulls a slot-specific REQUEST signal that tells the bus arbiter (which sits on the mainboard) that the device needs the bus.

2.) The bus arbiter grants access to the bus (after some time, if the bus is already in use) by pulling a GRANT signal. These two steps are omitted if the bus is "PARKED" at the device that wants it; PARK means the GRANT signal is active without a REQUEST. Normally the bus is parked at the host bridge for CPU accesses.

3.) Now that the master (for example the host bridge, i.e. the CPU) has the bus, it drives the address signals and the so-called FRAME signal, telling every other device on the bus which address it wants to write.

4.) PCI uses a distributed address decoding architecture, i.e. each device checks the address and decides whether it itself is addressed. (The address range of a device is determined with so-called config accesses, which are not the subject here.) When a device recognizes that it is addressed, it responds with the DEVSEL signal. It has a couple of bus-clock cycles to do so (I forget how many exactly, see the PCI spec for details); some devices are fast, some are slower. If a target responds with DEVSEL, the transaction proceeds. If not, a bus error exception is generated by the chipset (on decent hardware at least, assuming the master is the CPU/host bridge) and the transfer is aborted.

5.) The target device must have latched and decoded the requested address by now, so the address lines can be reused as data lines (PCI uses the same pins for address and data). The master applies the dword to be written (only 32-bit = dword transfers are considered here) to the A/D pins and pulls the IRDY (initiator ready) signal low to show that the data is valid. The target, when ready, pulls the TRDY (target ready) signal low; according to PCI 1.x and later, it has 16 cycles to do so. If both IRDY and TRDY were asserted during a bus cycle, that data word was written. A burst can be performed if the master simply keeps FRAME asserted, applies the next data word and keeps IRDY asserted; the target, again, can either accept immediately or introduce more wait cycles via TRDY. The target can abort the transfer (and MUST do so if it cannot guarantee the 16-wait-cycles-max limit) using another signal, STOP. In that case the bus transfer ends, and if the master still has data waiting it must initiate another transfer, including a new address cycle (and re-arbitration first, if the arbiter meanwhile revoked its bus grant). This process of the target ending a cycle with or without accepting the current data word is called a "target disconnect with/without data". A target can even block the very first data phase using STOP; a so-called retry must then occur, i.e. the master must retry until it succeeds. If the target NEVER accepts the data, the retries run forever, the bus locks up, and the CPU (assuming it is the master, via the host bridge) hangs on a single assembler instruction. On professional hardware there is sometimes a counter in the host bridge that counts the retries; if a transfer has not completed within a reasonable number of them (say 2^16 or so), the transfer is discarded and a system exception is generated, so the OS can disable the device and its driver. But this is very uncommon on PC-crap hardware.

6.) The transfer ends normally when the master drops the FRAME signal; the cycle after that is then the last one.
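In driver terms, the "target access" above is what every readl()/writel() on a mapped BAR turns into (CPU/host bridge as master, card as target), while the busmaster case is what you set up when you hand the card a bus address and let its own DMA engine fetch the data. A rough sketch of the two, with hypothetical register offsets and buffer size and all error handling omitted:

#include <linux/pci.h>
#include <linux/slab.h>
#include <asm/io.h>

/* Target accesses: the CPU/host bridge is the master, the card is the
 * target.  A read stalls the CPU until the card finally asserts TRDY. */
static void poke_registers(struct pci_dev *pdev)
{
	void *mmio = ioremap(pci_resource_start(pdev, 0),
			     pci_resource_len(pdev, 0));

	writel(0x1, mmio + 0x00);	/* hypothetical register offsets */
	(void)readl(mmio + 0x04);	/* the read is the expensive one */

	iounmap(mmio);
}

/* Busmaster transfer: the card is the master, the host bridge is the
 * target, and the card's DMA engine pulls the buffer out of main memory. */
static void start_dma(struct pci_dev *pdev, void *mmio)
{
	void *buf = kmalloc(4096, GFP_KERNEL);
	dma_addr_t bus_addr;

	pci_set_master(pdev);		/* let the card request the bus */
	bus_addr = pci_map_single(pdev, buf, 4096, PCI_DMA_TODEVICE);

	writel((u32)bus_addr, mmio + 0x10);	/* hypothetical: where    */
	writel(4096,          mmio + 0x14);	/* hypothetical: how much */
	writel(0x1,           mmio + 0x18);	/* hypothetical: go       */
}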
Notes:

- Read access is similar, only the side that drives the address/data bus changes.

- A lot of factors can influence burst speed by introducing wait cycles: DEVSEL timing, IRDY and TRDY non-assertion, and forced retries. The latter is especially common when register writes to (or, even worse, reads from) a PCI device hit a register that actually lives behind a clock domain boundary. To cross a clock domain safely, the data and strobe/acknowledge signals must be resampled several times through master/slave flipflops, which costs cycles.

- Host bridges are optimized for being the SLAVE when large chunks of data need to be transported. Hence the popularity of busmaster operation for peripherals.

- Even in slave mode they usually abort bursts after a transfer of 16 or so dwords, so only 75-95% of the theoretical PCI bus throughput is reached.

- Peripheral devices are usually optimized for being the MASTER during bulk data transfers, i.e. for being busmasters, but even then they can introduce grave amounts of wait cycles.

- A device that can be a master must, even while acting as a master, always be able to accept a target access. This can happen if its own master cycle is aborted by the target, it loses the bus, and another master performs a target access on its registers. It may not block that access until its own transaction has finished, otherwise you get a deadlock. This can happen with broken devices if they are not treated (= programmed) carefully.

- The length of a burst can also be limited by the so-called latency timer. After the arbiter removes the GRANT signal from the master's slot pin, the master need not give up the bus immediately; it may continue the transfer for up to (and including) the latency timer's value in cycles. The latency timer is a register in config space and is written by the OS. During boot the OS should read the device's "latency request" register and program the latency timer accordingly, after considering possible real-time requirements and the other master-capable devices on the bus (see the sketch below this list).

- On PC-crap hardware it is common that the arbiter on the mainboard removes the bus GRANT immediately after FRAME has been asserted by the new master, i.e. as soon as a transaction has been initiated.
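As a concrete illustration of that last pair of notes, here is a minimal sketch of how a Linux driver could inspect and adjust the latency timer from its probe routine. It assumes the "latency request" register mentioned above is the standard MIN_GNT config field, and the value 64 is only an example, not a recommendation:

#include <linux/kernel.h>
#include <linux/pci.h>

/* Hypothetical probe-time helper: look at what the device asks for
 * (MIN_GNT, in units of 250 ns at 33 MHz) and at what the firmware put
 * into the latency timer, then bump the timer if it looks too small. */
static void check_latency(struct pci_dev *pdev)
{
	u8 min_gnt, lat;

	pci_read_config_byte(pdev, PCI_MIN_GNT, &min_gnt);
	pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &lat);

	printk(KERN_INFO "PCI latency: MIN_GNT=%u, latency timer=%u\n",
	       min_gnt, lat);

	if (lat < 64)			/* example threshold/value only */
		pci_write_config_byte(pdev, PCI_LATENCY_TIMER, 64);
}

On many platforms pci_set_master() already programs a sane default latency timer, so in practice this is more useful for diagnosing a slow board than for tuning one.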
- A word about memory/cache consistency. Sooner or later every device driver programmer who works with busmaster-capable hardware on a PC-crap platform will run into the following problem/symptom:

  Memory data ->
    offset 0x0000: 0000 0000 0000 0000 0000 0000 0000 0000
           0x0010: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF

  1.)  Host CPU writes 0x0000 through 0x000F. Say this is a complete command packet for a SCSI controller or something.
  1a.) Host CPU tells the device to advance exactly one command packet, i.e. 16 bytes.
  2.)  The busmaster device reads 0x0000 through 0x000F in a burst transfer, using a 4-dword burst.
  3.)  Host CPU writes the next command packet at 0x0010:
           0x0010: 1111 1111 1111 1111 1111 1111 1111 1111
  3a.) Host CPU tells the device to read the next command packet, starting at 0x0010.
  4.)  Surprise! The busmaster device reads the next command packet, gets the old contents (the 0xFFFF stuff) and behaves accordingly, thus overwriting the holy MP3 collection and discarding the beloved Quake3 high score list that was meant to be presented to the guys at the next LAN party...

  What happened? A lot of things are possible. The simplest possibility is that the CPU only wrote to its cache (write-back mode) and the host bridge, which was the target during the busmaster transaction, fetched the data directly from the memory bank. This cannot happen on i386, though, because the architecture guarantees cache/memory/PCI consistency, even down to the CPU's L1 cache. (It's usually the first thing that fails when overclocking the CPU bus, by the way, but I won't get into that.) However, the same effect is sometimes caused by a buggy host bridge that thinks it can prefetch the second 16 bytes into an internal buffer under some conditions. Many cardbus bridges whose developers thought they were especially smart prefetch the second 16 bytes, too, speculating that the external device will request a read of them afterwards. This saves a lot of wait cycles, because then the bridge does not have to arbitrate the internal PCI bus and forward the request to the memory/cache controller. This behaviour shows up especially often with PCI masters that keep the FRAME signal asserted one cycle too long during a read burst and discard the last dword read. As a programmer you cannot change the circuit's behaviour (unless the bridge's parameters happen to be software-configurable), so you need to work around the problem, which may occur only once in a while, on certain hosts, at a particular moon phase.
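To make the handoff concrete, here is a rough sketch of the sequence above as driver code (consistent DMA memory from pci_alloc_consistent(), a hypothetical doorbell register at offset 0x20, no locking or error handling). The point is that even this "correct" code can lose: a bridge that speculatively prefetched bytes 0x10-0x1F during step 2 will hand the device stale data in step 4, and no barrier on the CPU side helps.

#include <linux/pci.h>
#include <linux/string.h>
#include <asm/io.h>
#include <asm/system.h>		/* wmb() on 2.4/2.6-era kernels */

#define CMD_SIZE	16

struct cmd_ring {
	void		*virt;	/* CPU view, from pci_alloc_consistent() */
	dma_addr_t	 bus;	/* the address the busmaster device sees */
	unsigned int	 next;	/* byte offset of the next packet        */
	void		*mmio;	/* BAR mapping for the doorbell write    */
};

/* Write one 16-byte command packet and tell the device to fetch it. */
static void submit_cmd(struct cmd_ring *ring, const void *cmd)
{
	memcpy(ring->virt + ring->next, cmd, CMD_SIZE);		   /* step 1/3   */
	wmb();						/* order the memory write */
	writel((u32)(ring->bus + ring->next), ring->mmio + 0x20);  /* step 1a/3a */
	ring->next += CMD_SIZE;
	/* Step 2/4: the device busmasters the packet out of main memory.
	 * If the bridge already prefetched the following 16 bytes during
	 * the previous burst, the device now sees stale data anyway. */
}

Given that, the workaround has to come from the driver side.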
One of the following might help:

- additional target accesses to the device before triggering the next command read
- somehow leaving a "guard interval" of several cache lines between command packets (of course only implementable with a linked-list structure or variable-length packets)
- IO (inl/outl) accesses to the device
- a bridge config read/write in between, which might cause an internal cache flush in the case of the cardbus-bridge problem

Now I wrote much more than I originally intended to. Anyway, I hope I gave you guys a little insight into what actually goes on on the PCI/AGP bus when the CPU executes your driver code. I have high respect for you having come this far without access to expensive logic analyzer equipment to debug your code.

Keep up the good work and thank you,

-- Jens

-- 
Jens David, DG1KJD
Email: [EMAIL PROTECTED]
http://www.afthd.tu-darmstadt.de/~dg1kjd
Work: +49 351 80800 527 --- Home/Mobile: +49 173 6394993