I'd have to see actual code before I could even start to guess at the problem. What I mentioned earlier about running the executable from memory was really a stab in the dark, and is most likely not the case. My reasoning was that if the executable is running from flash media, part of it could *potentially* still be loading from flash while the program runs. But probably not.
Another thing: where are you copying the data to? A file on disk, or are you just dumping it with printf()? printf() will also slow you down considerably. The nice thing about printf(), though, is that you can pipe the executable's output to a file, and most of that bottleneck goes away, so it's easy to test for. If you're storing the data on disk, you'll probably want to avoid that by writing to a file on tmpfs instead, and then letting a separate process deal with the data outside of the main application loop. I know this may sound a bit odd, but it can actually increase performance: you pay a small process context-switching penalty, which is barely perceptible, while the data-collecting process is free to just plow right through the data.

On Tue, Apr 18, 2017 at 11:10 AM, Charles Steinkuehler <[email protected]> wrote:

> On 4/18/2017 6:38 AM, [email protected] wrote:
> > Hello,
> > I've encountered a problem with very slow reading speed from memory
> > allocated by the pru kernel driver uio_pruss, compared to reading from
> > usual address spaces. Here are performance tests on my BeagleBone Black:
> >
> > Average memcpy from pru DDR start address to application virtual address
> > (300 kB of data): 10.4781 ms
> > Average cv::Mat.copyTo (300 kB of data): 11.0681 ms
> > Average memcpy from one virtual address to another (300 kB of data):
> > 0.510001 ms
> >
> > Kernel version is 4.4.12-bone11
> >
> > Can somebody explain the issue? Maybe I should have used the new pru
> > rpmsg rproc driver?
>
> Like William said, we can't really answer your question without more
> detail, but I'll take a guess. The DRAM that's shared with the PRU is
> marked as non-cacheable memory since the PRU can modify it. That means
> for a typical memory copy loop *EACH* word read from DRAM is going to
> turn into a full round-trip CPU-to-DRAM-to-CPU read latency, rather
> than the first read triggering a cache-line fill.
>
> You probably want to use a memory copy that uses a bunch more
> registers and does burst reads from the PRU memory region (as big as
> you can for performance, but at least a cache line long). There are
> several useful routines from the ARM folks themselves:
>
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
>
> ...along with the benefits and drawbacks of each.
>
> --
> Charles Steinkuehler
> [email protected]
