On 15 June 2015 at 16:33, Bill M <[email protected]> wrote:

> After reading through a bit more in the TRM about the PRU UART, I don't
> think a PRU UART will be feasible since it looks like they top out at
> around 300 kbps
>

Hmm, where'd you get that number? The PRU UART looks like the highest
performance UART: it receives a 192 MHz functional clock and the datasheet
specs 12 Mbps max (that would be using a /1 divider and 16x oversampling).
The other UARTs receive a 48 MHz functional clock and spec max 3.6864 Mbps
(/1 divider and 13x oversampling, so that would get you 3.6923 Mbps to be
precise).
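For reference, the divisor arithmetic is just fclk / (divisor × oversampling); a quick sketch (the helper name is mine, the clocks and oversampling ratios are the ones quoted above):

```c
#include <stdint.h>

/* Baud rate = functional clock / (divisor * oversampling ratio).
 * Helper name is mine; the clock and oversampling figures are the
 * ones from the TRM discussion above. */
static double uart_baud(double fclk_hz, unsigned divisor, unsigned oversample)
{
    return fclk_hz / (double)(divisor * oversample);
}
```

So the PRU UART at 192 MHz with a /1 divider and 16x oversampling gives 12 Mbps, and a 48 MHz UART with /1 and 13x gives about 3.692 Mbps.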

I've also noticed that UART0 cannot cope with too many consecutive writes,
even if there's enough fifo space: the fifo pointers seem to get corrupted or
something (I'm guessing a bug in the synchronization logic between the
interface and functional clock domains). This only shows up when trying to
rapidly fill the UART fifo from the Cortex-A8 in a tight loop (using posted
writes). Inserting a dummy register write between consecutive data bytes
fixes the issue, as does slowing down the loop in some other way. Using EDMA
would probably also avoid the problem.

I haven't tested the other UARTs, but I'd guess they show the same
behaviour, with the PRUSS UART as the exception (due to its ick/fck ratio).


I know things will run more slowly if I don't use caching, but if I disable
> caching, does that eliminate any pipelining? I'm a noob when it comes to
> pipelining and caching, since I've only ever hacked on AVR microcontrollers
> and a Cortex M3, where those weren't considerations.
>

Heh, yeah, I personally went from ARM7TDMI-based microcontrollers to the
DM814x, a Cortex-A8 based TI SoC closely related to the AM335x... quite a
bit of culture shock there.  "Wait, is this still an ARM processor?" o.O

I'm not sure what you mean by "eliminate any pipelining": pipelining is an
intrinsic part of the design of almost any modern CPU, even the AVR
(2-stage) and Cortex-M3 (3-stage), although they pale in comparison to the
Cortex-A8 (14-stage, plus 10 more for NEON instructions). The PRU is a
notable exception for being non-pipelined, which is deeply impressive
considering it runs at 200 MHz and has 32-bit compare-and-branch
instructions. In general pipelining becomes most visible in unpredictable
branches, which take 1 cycle on PRU, 2 on AVR, 3 on the M3, and 14 on the
A8.

However, especially since the A8 executes strictly in-order, memory
accesses can stall the pipeline for quite a while, and I suspect this is
what you mean. This is highly dependent on memory region attributes
(including cacheability), which also means setting up MMU and caches is
absolutely essential on the Cortex-A8. This isn't very hard: for a
baremetal application it typically suffices to set up the section
translation table with the desired attributes (see
http://community.arm.com/docs/DOC-10098 for an example), set the L2 cache
enable bit in the Auxiliary Control Register (if not already set), and set
the M, C, Z, and I bits of the Control Register (Z and I are already set by
the boot ROM iirc).
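For reference, a section entry in the short-descriptor translation table is just a 32-bit word; here's a sketch of building one (bit positions per the ARMv7-A architecture manual, helper name mine):

```c
#include <stdint.h>

/* Build an ARMv7-A short-descriptor *section* entry (maps 1 MiB).
 * The memory type is selected by TEX/C/B: 0/0/0 = strongly-ordered,
 * 0/0/1 = shareable device, 1/0/0 = normal non-cacheable,
 * 0/1/1 = normal write-back cacheable. Helper name is mine. */
static uint32_t section_entry(uint32_t phys_base, unsigned tex, unsigned c,
                              unsigned b, unsigned ap, unsigned domain)
{
    return (phys_base & 0xFFF00000u)        /* section base addr [31:20] */
         | ((uint32_t)(tex & 7u)    << 12)  /* TEX[2:0] */
         | ((uint32_t)(ap & 3u)     << 10)  /* AP[1:0] */
         | ((uint32_t)(domain & 15u) << 5)  /* domain */
         | ((uint32_t)(c & 1u)      << 3)   /* C bit */
         | ((uint32_t)(b & 1u)      << 2)   /* B bit */
         | 0x2u;                            /* descriptor type: section */
}
```

E.g. mapping DDR at 0x80000000 as normal non-cacheable with full access gives 0x80001C02, and a peripheral section at 0x44E00000 as device memory gives 0x44E00C06.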

One of the easiest ways to murder write-performance is by marking memory as
"strongly ordered", which is the default for data access if the MMU is
disabled. This makes the cpu wait synchronously on each write, so you're
looking at about 150-200 ns (i.e. 150-200 cycles at 1 GHz) per write,
depending on
the "ping time" from the cpu to the target. In contrast, writes to device
or normal memory are buffered and therefore take 1 execution cycle as long
as the buffer isn't full. The limiting factor in draining the buffer is
that the cpu can only have one device write and one normal uncacheable (or
write-through) write outstanding on the AXI bus, but almost immediately
(afaik as soon as the write is accepted by the async bridge to the L3) the
write is "posted" (i.e. becomes fire-and-forget) and acked to the cpu.
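Back-of-the-envelope, that penalty adds up fast; a sketch using the 150 ns figure above and a 64-byte FIFO as an illustrative (assumed) depth:

```c
/* Time to push n bytes with synchronous (strongly-ordered) writes,
 * given the per-write latency estimated above. Helper name is mine. */
static double fill_time_us(unsigned n_bytes, double ns_per_write)
{
    return n_bytes * ns_per_write / 1000.0;
}
```

So filling a 64-byte FIFO byte-by-byte through strongly-ordered memory costs around 9.6 µs, versus roughly 64 cycles (64 ns at 1 GHz) when the writes are buffered.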

In case of normal memory, small writes to sequential addresses are
automatically coalesced to larger writes when possible. This isn't done for
device and strongly-ordered memory, so using aligned dword (strd) and
quadword (neon) writes when possible will get you significant performance
gain there.
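As a sketch of what that looks like from C (using an ordinary buffer here so it's self-contained; for a real peripheral, dst would point into a device window, and you may need to check the generated code or use intrinsics to be sure the compiler actually emits strd/NEON stores):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy using aligned 64-bit loads/stores; on ARMv7 the compiler can
 * emit these as ldrd/strd, so device memory sees one 8-byte write
 * instead of eight 1-byte writes. Both pointers must be 8-byte
 * aligned. Function name is mine. */
static void copy_dwords(volatile uint64_t *dst, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```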

In case of non-Neon reads, the cpu has to wait for the data to become
available, so caches obviously have a huge impact: L1 cache hit = 1 cycle,
L1 miss L2 hit = 9 cycles, L2 miss (or uncacheable) = ping time to target.
If they miss the caches, reads from normal memory still have the benefit of
overtaking buffered writes, while device reads aren't allowed to overtake
device writes. The situation with Neon is more complicated and I never
fully figured out what goes on there. For example, some timings for a
simplistic memory copy using Neon (vld1, subs, vst1, bne) on a DM814x (A8 @
600 MHz) targeting DDR3-800:

from strongly ordered to strongly ordered: 17.76 cycles/byte
from device to device: 12.77 cycles/byte
from device to uncacheable: 9.02 cycles/byte
from uncacheable to uncacheable: 1.31 cycles/byte
from uncacheable to device: 1.10 cycles/byte
from L2 miss to uncacheable: 1.06 cycles/byte
from L2 miss to device: 0.99 cycles/byte
from L2 hit to device or uncacheable: 0.50 cycles/byte

"L2 miss" refers to the first access of each cacheline (i.e. one out of
four loads).
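To convert those figures to more familiar units (cycles/byte at the 600 MHz clock above; helper name is mine):

```c
/* Throughput in MB/s given the CPU clock and measured cycles/byte. */
static double mbyte_per_sec(double cpu_hz, double cycles_per_byte)
{
    return cpu_hz / cycles_per_byte / 1e6;
}
```

So the uncacheable-to-uncacheable case (1.31 cycles/byte) works out to roughly 458 MB/s, while the strongly-ordered-to-strongly-ordered case (17.76 cycles/byte) is only about 34 MB/s.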

Of course for most peripheral targets caching is not an option. You could
probably often get away with marking them normal uncacheable instead of
device, though this may require introducing memory barriers and I don't
know how expensive they are. It would also be highly Cortex-A8 specific:
architecturally an ARM CPU is allowed to perform arbitrary reads from normal
memory, and many do perform speculative reads, for example.



> Matthijs, does EDMA offer that big a performance boost?
>

After giving it more thought I'm actually not sure whether EDMA would
achieve higher throughput than writes by a PRU core, since PRU is a direct
initiator on the L3F while EDMA has to go through the L4HS to reach PRUSS.
Having EDMA perform the transfer would however free up PRU's precious time.
After setting things up, PRU could request EDMA transfers with a single
write to EDMA, or using the PRUSS interrupt controller.

Another point of some importance is that since EDMA uses non-posted writes
you would actually be sure the data has reached its destination when it
signals completion. If PRU writes data to RAM, then signals the A8 using an
interrupt, which subsequently proceeds to read from the same location, it
is not guaranteed to actually read the data written by PRU: this data may
still be in some queue en route from PRUSS to EMIF, while the A8 has a
private hotline to EMIF that bypasses it.

For other situations the benefits are clearer: for example, EDMA can read
data from a peripheral in response to its dma request and directly deliver
it into PRUSS, and send notification to PRU when a certain amount of data
has been transferred. This can save PRU from having to perform reads over
the L3 interconnect.

EDMA also has a staggering amount of bandwidth.  While its reads are
limited by latency just like other initiators, the max size of a single
access by EDMA is 64 bytes, so for example it can slurp the whole content
of an ADC FIFO with a single read access. It is synchronous to the L3,
avoiding the latency of an async bridge. Although it uses non-posted
writes, it can have four writes + a read outstanding simultaneously. And
all this describes a single Transfer Controller (TC); the EDMA has three of
them. Total theoretical bandwidth is just under 8 gigabytes per second,
though I don't know how much is achievable in practice.

I think I had more stuff I wanted to say, but this email is already long
enough and been sitting in Drafts for too long, so I'll just press "Send"
now ;-)

Matthijs

-- 
For more options, visit http://beagleboard.org/discuss
--- 
You received this message because you are subscribed to the Google Groups 
"BeagleBoard" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
