On 15 June 2015 at 16:33, Bill M <[email protected]> wrote:
> After reading through a bit more in the TRM about the PRU UART, I don't
> think a PRU UART will be feasible since it looks like they top out at
> around 300Kbs
Hmm, where'd you get that number? The PRU UART looks like the highest-performance UART: it receives a 192 MHz functional clock and the datasheet specs 12 Mbps max (that would be using a /1 divider and 16x oversampling). The other UARTs receive a 48 MHz functional clock and spec a max of 3.6864 Mbps (/1 divider and 13x oversampling, which would actually get you 3.6923 Mbps to be precise).

I've also noticed that UART0 cannot cope with too many consecutive writes, even if there's enough fifo space: the fifo pointers seem to get corrupted or something (I'm guessing a bug in the synchronization logic between the interface and functional clock domains). This only appears as an issue when trying to rapidly fill the UART fifo from the Cortex-A8 in a tight loop (using posted writes). Inserting a dummy register write between consecutive data bytes fixes the issue, as does slowing down the loop in some other way. Using EDMA would probably also solve the problem. I haven't tested the other UARTs, but I'd guess they show the same behaviour, except for the PRUSS UART (due to its ick/fck ratio).

> I know things will run more slowly if I don't use caching, but if I disable
> caching, does that eliminate any pipelining? I'm a noob when it comes to
> pipelining and caching, since I've only ever hacked on AVR microcontrollers
> and a Cortex M3, where those weren't considerations.

Heh, yeah, I personally went from ARM7TDMI-based microcontrollers to the DM814x, a Cortex-A8 based TI SoC closely related to the AM335x... quite a bit of culture shock there. "Wait, is this still an ARM processor?" o.O

I'm not sure what you mean by "eliminate any pipelining": pipelining is an intrinsic part of the design of almost any modern CPU, even the AVR (2-stage) and Cortex-M3 (3-stage), although they pale in comparison to the Cortex-A8 (14-stage, plus 10 more for NEON instructions).
The PRU is a notable exception for being non-pipelined, which is deeply impressive considering it runs at 200 MHz and has 32-bit compare-and-branch instructions. In general, pipelining becomes most visible in unpredictable branches, which take 1 cycle on the PRU, 2 on the AVR, 3 on the M3, and 14 on the A8.

However, especially since the A8 executes strictly in order, memory accesses can stall the pipeline for quite a while, and I suspect this is what you mean. This is highly dependent on memory region attributes (including cacheability), which also means setting up the MMU and caches is absolutely essential on the Cortex-A8. This isn't very hard: for a baremetal application it typically suffices to set up the section translation table with the desired attributes (see http://community.arm.com/docs/DOC-10098 for an example), set the L2 cache enable bit in the Auxiliary Control Register (if not already set), and set the M, C, Z, and I bits of the Control Register (Z and I are already set by the boot ROM iirc).

One of the easiest ways to murder write performance is by marking memory as "strongly ordered", which is the default for data accesses if the MMU is disabled. This makes the cpu wait synchronously on writes, so then you're looking at about 150-200 ns (= cycles @ 1 GHz) for each write, depending on the "ping time" from the cpu to the target.

In contrast, writes to device or normal memory are buffered and therefore take 1 execution cycle as long as the buffer isn't full. The limiting factor in draining the buffer is that the cpu can only have one device write and one normal uncacheable (or write-through) write outstanding on the AXI bus, but almost immediately (afaik as soon as the write is accepted by the async bridge to the L3) the write is "posted" (i.e. becomes fire-and-forget) and acked to the cpu. In the case of normal memory, small writes to sequential addresses are automatically coalesced into larger writes when possible.
Such coalescing isn't done for device and strongly-ordered memory, so using aligned doubleword (strd) and quadword (Neon) writes where possible will get you a significant performance gain there.

In the case of non-Neon reads, the cpu has to wait for the data to become available, so caches obviously have a huge impact: L1 cache hit = 1 cycle, L1 miss but L2 hit = 9 cycles, L2 miss (or uncacheable) = ping time to the target. If they miss the caches, reads from normal memory still have the benefit of overtaking buffered writes, while device reads aren't allowed to overtake device writes.

The situation with Neon is more complicated and I never fully figured out what goes on there. For example, some timings for a simplistic memory copy using Neon (vld1, subs, vst1, bne) on a DM814x (A8 @ 600 MHz) targeting DDR3-800:

from strongly ordered to strongly ordered: 17.76 cycles/byte
from device to device: 12.77 cycles/byte
from device to uncacheable: 9.02 cycles/byte
from uncacheable to uncacheable: 1.31 cycles/byte
from uncacheable to device: 1.10 cycles/byte
from L2 miss to uncacheable: 1.06 cycles/byte
from L2 miss to device: 0.99 cycles/byte
from L2 hit to device or uncacheable: 0.50 cycles/byte

Here "L2 miss" refers to the first access of each cacheline (i.e. one out of four loads). Of course, for most peripheral targets caching is not an option. You could probably often get away with marking them normal uncacheable instead of device, though this may require introducing memory barriers and I don't know how expensive those are. It would also be highly Cortex-A8 specific: architecturally an ARM CPU is allowed to perform arbitrary reads from normal memory, and many perform speculative reads for example.

> Matthijs, does EDMA offer that big a performance boost?

After giving it more thought I'm actually not sure whether EDMA would achieve higher throughput than writes by a PRU core, since PRU is a direct initiator on the L3F while EDMA has to go through the L4HS to reach PRUSS.
Having EDMA perform the transfer would however free up PRU's precious time. After setting things up, PRU could request EDMA transfers with a single write to EDMA, or via the PRUSS interrupt controller.

Another point of some importance is that since EDMA uses non-posted writes, you can actually be sure the data has reached its destination when it signals completion. If PRU writes data to RAM and then signals the A8 with an interrupt, and the A8 subsequently proceeds to read from the same location, it is not guaranteed to actually read the data written by PRU: that data may still be in some queue en route from PRUSS to the EMIF, while the A8 has a private hotline to the EMIF that bypasses it.

For other situations the benefits are clearer: for example, EDMA can read data from a peripheral in response to its dma request, deliver it directly into PRUSS, and send a notification to PRU when a certain amount of data has been transferred. This can save PRU from having to perform reads over the L3 interconnect.

EDMA also has a staggering amount of bandwidth. While its reads are limited by latency just like those of other initiators, the max size of a single access by EDMA is 64 bytes, so for example it can slurp the whole contents of an ADC FIFO with a single read access. It is synchronous to the L3, avoiding the latency of an async bridge. Although it uses non-posted writes, it can have four writes plus a read outstanding simultaneously. And all this describes a single Transfer Controller (TC); EDMA has three of these. Total theoretical bandwidth is just under 8 gigabytes per second, though I don't know how much of that is achievable in practice.

I think I had more stuff I wanted to say, but this email is already long enough and has been sitting in Drafts for too long, so I'll just press "Send" now ;-)

Matthijs
