Cool, thanks for the update Carlos! I thought a little about your problem 
and was wondering whether you are really at the limit of the PRU's 
capability:
You say you have roughly 20kSps with 10bit resolution for four outputs, so 
roughly 20M updates of the output pins per second (20k * 1024). At a 200MHz 
PRU clock, that means you need roughly 10 instructions for changing the PWM 
outputs to their next value. 
What do you think of the following implementation? It is not optimal, but I 
think its resolution should be comparable to yours while being roughly 10x 
faster:

START:
XIN 10,r0,120 //fetches data from scratchpad0 - each register holds 
//8x4bits, where bit 0,4,8,12 defines four consecutive values for output 0, 
//bit 1,5,9,13 for output 1 and so on. 
AND R30.b0, r0.b0, 0x0F  //move the bits 0:3 of r0.b0 to r30.b0 (the other 
//bits of R30.b0 are set to zero by the AND) - here it is assumed that your 
//output pins correspond to r30[0:3]. That can easily be adapted to any other 
//consecutive 4 bits by using an LSL instruction
LSR R30.b0,r0.b0,4      //move the bits 4:7 of r0.b0 to r30.b0 (the other 
//bits of R30.b0 are set to zero by the LSR)
AND R30.b0, r0.b1, 15  //same for r0.b1
LSR R30.b0,r0.b1,4      
//repeat this pair of instructions for all other bytes up to...
AND R30.b0, r29.b3, 15  
LSR R30.b0,r29.b3,4      
//up to here, 30*8=240 sets of 4 bits were written to the output
XIN 11,r0,120 //fetches data from scratchpad1
//repeat the write operations
XIN 12,r0,120 //fetches data from scratchpad2
//repeat the write operations
XIN 14,r0,120 //fetches data from other PRU
//repeat the write operations
JMP START
//here a total of 4*240=960 writes on 4 digital outputs r30[0:3] have been 
//executed. So we already have about 9.9 bits of resolution (log2(960)) and 
//- given an overhead of 5 cycles (4 XINs plus the JMP) - an average update 
//rate of 200MHz*960/965 = roughly 199MHz. That is, the actual output rate 
//will be roughly 200MHz/965 = 207kSps, i.e. about 200kSps. 
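
To make the numbers in that last comment easy to re-check, here is a small Python sketch of the arithmetic. It assumes, as the estimate above does, that every instruction (including each XIN and the JMP) takes exactly one cycle at 200MHz; the variable names are just for illustration.

```python
import math

PRU_CLOCK_HZ = 200_000_000      # PRU clock assumed in the estimate above
WRITES_PER_LOOP = 4 * 30 * 8    # 4 banks x 30 registers x 8 nibbles each
OVERHEAD_CYCLES = 5             # 4 XINs + 1 JMP, assumed one cycle each

cycles_per_loop = WRITES_PER_LOOP + OVERHEAD_CYCLES
resolution_bits = math.log2(WRITES_PER_LOOP)
sample_rate_hz = PRU_CLOCK_HZ / cycles_per_loop
pin_update_rate_hz = PRU_CLOCK_HZ * WRITES_PER_LOOP / cycles_per_loop

print(f"cycles per loop : {cycles_per_loop}")
print(f"resolution      : {resolution_bits:.2f} bits")
print(f"sample rate     : {sample_rate_hz / 1e3:.1f} kSps")
print(f"pin update rate : {pin_update_rate_hz / 1e6:.1f} MHz")
```

If an XIN turns out to cost more than one cycle on your setup, only OVERHEAD_CYCLES needs to change.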

The 960 cycles leave plenty of time for PRU1 to perform 4 LBBO operations 
that are 120 bytes wide each: typically each one should take less than 80 
cycles. So the code will look something like
LBBO from DRAM filling all registers r0:r29
XOUT 10,r0,120
LBBO from DRAM filling all registers r0:r29
XOUT 11,r0,120
LBBO from DRAM filling all registers r0:r29
XOUT 12,r0,120
LBBO from DRAM filling all registers r0:r29
XOUT 14,r0,120
//here the code will stall until PRU0 requests the data with the 
//corresponding XIN operation. 
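
For the data that PRU1 loads from DRAM, a host-side generator could look roughly like the Python sketch below. It is only an illustration of the bit layout described above (bit k of each nibble drives output k, low nibble of each byte goes out first, 120 bytes per bank). The function name pack_pwm_frame is made up, the simple "on for the first N slots" waveform is my choice, and I am assuming that LBBO places the lowest address in r0.b0.

```python
def pack_pwm_frame(duties, slots=960, bank_bytes=120):
    """Pack four PWM on-times into the nibble stream read by PRU0.

    duties: on-time of each of the four outputs, in slots (0..slots).
    Returns a list of byte strings, one per 120-byte bank, in the order
    PRU0 fetches them (scratchpad0, scratchpad1, scratchpad2, PRU1).
    """
    if len(duties) != 4:
        raise ValueError("expected exactly four duty values")
    if any(not 0 <= d <= slots for d in duties):
        raise ValueError("duty values must be within 0..slots")

    # One nibble per time slot: bit k is high while output k is on.
    nibbles = []
    for t in range(slots):
        nibble = 0
        for pin, duty in enumerate(duties):
            if t < duty:
                nibble |= 1 << pin
        nibbles.append(nibble)

    # Two slots per byte: the low nibble is written first (AND 0x0F),
    # the high nibble second (LSR 4).
    data = bytes(nibbles[i] | (nibbles[i + 1] << 4)
                 for i in range(0, slots, 2))
    return [data[i:i + bank_bytes] for i in range(0, len(data), bank_bytes)]


banks = pack_pwm_frame([0, 1, 2, 960])
print(len(banks), [len(b) for b in banks], hex(banks[0][0]))
```

How the four banks get into the memory that PRU1 reads is left open here, since that depends on your setup.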

This is the simplest implementation. You can make it more sophisticated by 
inserting some logic into the PRU1 code that modulates the output data over 
several cycles to get a higher output resolution. But even in this 
implementation, the only irregularity comes from the XINs and the final 
jump operation in PRU0, which should be taken into account in the algorithm 
that generates the data which fills all of the registers (the value of 
register r29[28:31] counts twice if it's in the scratchpad, and threefold if 
it's in PRU1, because the output holds that value during the following XIN 
and JMP). This is a complication, but it actually leads to a neat trick: 
if you want to invest some of the gained speed into higher resolution, you 
can increase the weight of any 4bit datapoint by inserting a loop after it 
that just holds that particular output value for that number of cycles (you 
need to reserve a register for the counter in that case, but you will still 
gain in resolution). If one loop of the PRU0 code takes much more than 1024 
cycles, you should also insert a loop of similar length into the PRU1 code 
in order to not have PRU1 stall more than 1024 cycles for PRU0's XIN 
command. 
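
The weighting of the datapoints can be sketched in Python as well. This is a rough model, assuming again one cycle per instruction, so that the last nibble of each scratchpad bank is held for 2 cycles and the last nibble of the PRU1 bank for 3. The names slot_weights, effective_duty and the extra_hold argument (extra hold cycles from an inserted loop) are hypothetical.

```python
def slot_weights(slots_per_bank=240, banks=4, extra_hold=None):
    """Number of cycles each 4bit datapoint stays on the output pins.

    extra_hold: optional dict {slot index: additional hold cycles},
    modelling a hold loop inserted after that datapoint.
    """
    weights = [1] * (slots_per_bank * banks)
    for bank in range(banks):
        last = (bank + 1) * slots_per_bank - 1
        weights[last] += 1      # value is held during the XIN that follows
    weights[-1] += 1            # final value is also held during the JMP
    for slot, cycles in (extra_hold or {}).items():
        weights[slot] += cycles
    return weights


def effective_duty(on_slots, weights):
    """Duty cycle of an output that is high for the first on_slots slots."""
    return sum(weights[:on_slots]) / sum(weights)


w = slot_weights()
print(sum(w), w[239], w[959])               # cycles per loop, bank-end weights
print(round(effective_duty(480, w), 4))     # roughly, but not exactly, 0.5
```

With extra_hold you can see how a hold loop stretches the period: for example, slot_weights(extra_hold={0: 59}) gives a 1024-cycle loop in which the first datapoint weighs 60 cycles.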
