Wow. At this point I feel like I should be paying you tuition ^_^.

Apparently while I was falling asleep reading the TRM in bed late at 
night, I totally misread and misinterpreted the UART divisor tables on pg. 
238. Thanks for pointing that out, and for the heads-up about the pointer 
corruption issue. I'll probably still try to use one of the non-PRU UARTs 
first (in case I want to dedicate the other PRU to other sensors or 
processing), and fall back to the PRU one if I'm having too much trouble 
getting smooth real-time operation.

Before getting into microcontroller programming for robots about 2 years 
ago, I hadn't done any hardware level programming since I was a kid 30 
years ago on 6502 processors. Didn't really have to think much about 
pipelining, caching, or memory management back then ^_^. I do line of 
business desktop and web programming for my day job.

I'm probably using the term pipelining too casually/incorrectly. I know the 
hardware will simultaneously execute one instruction while decoding the 
next one and fetching the one after that. I was kind of including dealing 
with what is loaded in cache, how things slow down with cache misses, etc. 
The first couple of 'hello world' type programs I wrote for this didn't 
even use caching, and even now I'm only using instruction caching (since 
the SDK code for that is super easy, and enabling it sped things up 
considerably). I tried to set up the MMU, but it was hanging my program, 
and I didn't want to get bogged down trying to debug that yet, at least 
not until I learn a LOT more.

The way I'm trying to set things up now, just so I can see if the camera 
is working or will work, the PRU will only ever write to the picture 
memory, and the main core will only ever read it. So if the main core 
stalls while reading it, that is no big deal. What will be critical is 
that the PRU can dependably write the data coming from the camera (about 
9 MB a second) to memory.
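
For what it's worth, the handoff scheme I have in mind could be sketched 
with a seqlock-style frame counter (this is just my rough plan, in Python; 
the names and sizes are made up):

```python
class FrameBuffer:
    """Single-writer / single-reader picture memory (PRU writes, A8 reads)."""

    def __init__(self, size):
        self.seq = 0                    # even = stable, odd = write in progress
        self.data = bytearray(size)

    def write_frame(self, payload):     # writer side (the PRU's job)
        self.seq += 1                   # mark frame as in progress
        self.data[: len(payload)] = payload
        self.seq += 1                   # mark frame as complete

    def read_frame(self):               # reader side (the A8's job)
        while True:
            s1 = self.seq
            if s1 % 2:                  # writer is mid-frame; retry
                continue
            snapshot = bytes(self.data)
            if self.seq == s1:          # no write happened during the copy
                return s1 // 2, snapshot

fb = FrameBuffer(8)
fb.write_frame(b"frame001")
frame_no, frame = fb.read_frame()       # -> (1, b"frame001")
```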

I have a lot more to say/ask too, and I can't thank you enough for all the 
help and info you've given me so far, but I'm writing this from work and I 
think if I want to keep this job for a while longer I better get back to 
it. Talk more soon...

On Thursday, June 18, 2015 at 10:03:23 AM UTC-4, Matthijs van Duin wrote:

> On 15 June 2015 at 16:33, Bill M <[email protected]> wrote:
>
>> After reading through a bit more in the TRM about the PRU UART, I don't 
>> think a PRU UART will be feasible since it looks like they top out at 
>> around 300Kbs
>>
>
> Hmm, where'd you get that number? The PRU UART looks like the highest 
> performance UART: it receives a 192 MHz functional clock and the datasheet 
> specs 12 Mbps max (that would be using a /1 divider and 16x oversampling). 
> The other UARTs receive a 48 MHz functional clock and spec max 3.6864 Mbps 
> (/1 divider and 13x oversampling, so that would get you 3.6923 Mbps to be 
> precise).
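
Those numbers fall straight out of the usual 16550-style baud formula, 
baud = fclk / (oversampling x divisor); a quick sanity check in Python 
(the function name is mine):

```python
def uart_baud(fclk_hz, divisor, oversample):
    """Effective baud rate of a 16550-style UART: fclk / (oversample * divisor)."""
    return fclk_hz / (oversample * divisor)

# PRU UART: 192 MHz functional clock, 16x oversampling, /1 divider
print(uart_baud(192_000_000, 1, 16))        # 12000000.0 -> 12 Mbps
# Other UARTs: 48 MHz functional clock, 13x oversampling, /1 divider
print(round(uart_baud(48_000_000, 1, 13)))  # 3692308 -> ~3.6923 Mbps
```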
>
> I've also noticed that UART0 cannot cope with too many consecutive 
> writes, even if there's enough fifo space: the fifo pointers seem to get 
> corrupted or something (I'm guessing a bug in the synchronization logic 
> between the interface and functional clock domains). This only appears as 
> an issue when trying to rapidly fill the UART fifo from the cortex-a8 in 
> a tight loop (using posted writes). Inserting some dummy register write 
> between consecutive data bytes fixes the issue, as does slowing down the 
> loop in some other way. Using EDMA would probably also solve the problem.
>
> I haven't tested the other UARTs, but I'd guess the other UARTs will have 
> the same behaviour except for the PRUSS UART (due to ick/fck ratio).
>
>
> I know things will run more slowly if I don't use caching, but if I 
>> disable caching, does that eliminate any pipelining? I'm a noob when it 
>> comes to pipelining and caching, since I've only ever hacked on AVR 
>> microcontrollers and a Cortex M3, where those weren't considerations.
>>
>
> Heh, yeah I personally went from ARM7TDMI-based microcontrollers to the 
> DM814x, a Cortex-A8 based TI SoC closely related to the AM335x... quite a 
> bit of culture-shock there.  "Wait, is this still an ARM processor?" o.O
>
> I'm not sure what you mean by "eliminate any pipelining": pipelining is an 
> intrinsic part of the design of almost any modern CPU, even the AVR 
> (2-stage) and Cortex-M3 (3-stage), although they pale in comparison to the 
> Cortex-A8 (14-stage, plus 10 more for NEON instructions). The PRU is a 
> notable exception for being non-pipelined, which is deeply impressive 
> considering it runs at 200 MHz and has 32-bit compare-and-branch 
> instructions. In general pipelining becomes most visible in unpredictable 
> branches, which take 1 cycle on PRU, 2 on AVR, 3 on the M3, and 14 on the 
> A8.
>
> However, especially since the A8 executes strictly in-order, memory 
> accesses can stall the pipeline for quite a while, and I suspect this is 
> what you mean. This is highly dependent on memory region attributes 
> (including cacheability), which also means setting up MMU and caches is 
> absolutely essential on the Cortex-A8. This isn't very hard: for a 
> baremetal application it typically suffices to set up the section 
> translation table with the desired attributes (see 
> http://community.arm.com/docs/DOC-10098 for an example), set the L2 cache 
> enable bit in the Auxiliary Control Register (if not already set), and the 
> M, C, Z, and I bits of the Control Register (Z and I are already set by 
> the bootrom iirc).
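
To make the section-table part concrete, here's a sketch in Python of how 
ARMv7 short-descriptor section entries encode those attributes (the 
addresses and access-permission choices below are my own assumptions, not 
taken from the linked example):

```python
def section_entry(phys_addr, tex, c, b, ap=0b11, domain=0):
    """Build an ARMv7-A short-descriptor first-level section entry (1 MiB)."""
    assert phys_addr % 0x100000 == 0    # sections are 1 MiB aligned
    return (phys_addr                   # section base address, bits [31:20]
            | (tex & 0x7) << 12         # TEX[2:0]
            | (ap & 0x3) << 10          # AP[1:0] (0b11 = full access)
            | (domain & 0xF) << 5
            | (c & 1) << 3              # C (cacheable)
            | (b & 1) << 2              # B (bufferable)
            | 0b10)                     # descriptor type: section

# Memory-attribute encodings (TEX/C/B) for the cases discussed above
STRONGLY_ORDERED = dict(tex=0b000, c=0, b=0)
DEVICE           = dict(tex=0b000, c=0, b=1)   # shareable device
NORMAL_WB        = dict(tex=0b001, c=1, b=1)   # normal, write-back/write-allocate

ddr  = section_entry(0x80000000, **NORMAL_WB)  # DDR as normal cacheable
uart = section_entry(0x44E00000, **DEVICE)     # example peripheral window
```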
>
> One of the easiest ways to murder write-performance is by marking memory 
> as "strongly ordered", which is the default for data access if the MMU is 
> disabled. This makes the cpu wait synchronously on writes, so then you're 
> looking at about 150-200 ns (= cycles @ 1 GHz) for each write, depending on 
> the "ping time" from the cpu to the target. In contrast, writes to device 
> or normal memory are buffered and therefore take 1 execution cycle as long 
> as the buffer isn't full. The limiting factor in draining the buffer is 
> that the cpu can only have one device write and one normal uncacheable (or 
> write-through) write outstanding on the AXI bus, but almost immediately 
> (afaik as soon as the write is accepted by the async bridge to the L3) the 
> write is "posted" (i.e. becomes fire-and-forget) and acked to the cpu.
>
> In case of normal memory, small writes to sequential addresses are 
> automatically coalesced to larger writes when possible. This isn't done for 
> device and strongly-ordered memory, so using aligned dword (strd) and 
> quadword (neon) writes when possible will get you significant performance 
> gain there.
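
A toy model of that coalescing (purely illustrative; the real write 
buffer's merge rules are more subtle than this):

```python
def coalesce(writes, max_burst=16):
    """Merge writes to sequential addresses into larger bursts.

    `writes` is a list of (address, nbytes) pairs; adjacent writes are
    merged while the combined burst stays within max_burst bytes.
    """
    bursts = []
    for addr, n in writes:
        if bursts:
            last_addr, last_n = bursts[-1]
            if last_addr + last_n == addr and last_n + n <= max_burst:
                bursts[-1] = (last_addr, last_n + n)  # extend previous burst
                continue
        bursts.append((addr, n))
    return bursts

# Four sequential word stores merge into one 16-byte burst...
print(coalesce([(0x100, 4), (0x104, 4), (0x108, 4), (0x10C, 4)]))  # [(256, 16)]
# ...but non-sequential stores stay separate.
print(coalesce([(0x100, 4), (0x200, 4)]))  # [(256, 4), (512, 4)]
```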
>
> In case of non-Neon reads, the cpu has to wait for the data to become 
> available, so caches obviously have a huge impact: L1 cache hit = 1 cycle, 
> L1 miss L2 hit = 9 cycles, L2 miss (or uncacheable) = ping time to target. 
> If they miss the caches, reads from normal memory still have the benefit of 
> overtaking buffered writes, while device reads aren't allowed to overtake 
> device writes. The situation with Neon is more complicated and I never 
> fully figured out what goes on there. For example, some timings for a 
> simplistic memory copy using Neon (vld1, subs, vst1, bne) on a DM814x (A8 @ 
> 600 MHz) targeting DDR3-800:
>
> from strongly ordered to strongly ordered: 17.76 cycles/byte
> from device to device: 12.77 cycles/byte
> from device to uncacheable: 9.02 cycles/byte
> from uncacheable to uncacheable: 1.31 cycles/byte
> from uncacheable to device: 1.10 cycles/byte
> from L2 miss to uncacheable: 1.06 cycles/byte
> from L2 miss to device: 0.99 cycles/byte
> from L2 hit to device or uncacheable: 0.50 cycles/byte
>
> "L2 miss" refers to the first access of each cacheline (i.e. one out of 
> four loads).
>
> Of course for most peripheral targets caching is not an option. You could 
> probably often get away with marking them normal uncacheable instead of 
> device, though this may require introducing memory barriers and I don't 
> know how expensive they are. It would also be highly Cortex-A8 specific: 
> architecturally an ARM CPU is allowed to perform arbitrary reads from 
> normal memory, and many perform speculative reads for example.
>
>  
>
>> Matthijs, does EDMA offer that big a performance boost?
>>
>
> After giving it more thought I'm actually not sure whether EDMA would 
> achieve higher throughput than writes by a PRU core, since PRU is a direct 
> initiator on the L3F while EDMA has to go through the L4HS to reach PRUSS. 
> Having EDMA perform the transfer would however free up PRU's precious time. 
> After setting things up, PRU could request EDMA transfers with a single 
> write to EDMA, or using the PRUSS interrupt controller.
>
> Another point of some importance is that since EDMA uses non-posted writes 
> you would actually be sure the data has reached its destination when it 
> signals completion. If PRU writes data to RAM, then signals the A8 using an 
> interrupt, which subsequently proceeds to read from the same location, it 
> is not guaranteed to actually read the data written by PRU: this data may 
> still be in some queue en route from PRUSS to EMIF, while the A8 has a 
> private hotline to EMIF that bypasses it.
>
> For other situations the benefits are more clear: for example it can read 
> data from a peripheral in response to its dma request and directly deliver 
> it into PRUSS, and send notification to PRU when a certain amount of data 
> has been transferred. This can save PRU from having to perform reads over 
> the L3 interconnect.
>
> EDMA also has a staggering amount of bandwidth.  While its reads are 
> limited by latency just like other initiators, the max size of a single 
> access by EDMA is 64 bytes, so for example it can slurp the whole content 
> of an ADC FIFO with a single read access. It is synchronous to the L3, 
> avoiding the latency of an async bridge. Although it uses non-posted 
> writes, it can have four writes + a read outstanding simultaneously. And 
> all this describes a single Transfer Controller (TC); EDMA has three of 
> these. Total theoretical bandwidth is just under 8 gigabytes per second, 
> though I don't know how much is achievable in practice.
>
> I think I had more stuff I wanted to say, but this email is already long 
> enough and been sitting in Drafts for too long, so I'll just press "Send" 
> now ;-)
>
> Matthijs
>

-- 
For more options, visit http://beagleboard.org/discuss
--- 
You received this message because you are subscribed to the Google Groups 
"BeagleBoard" group.