Dear Mat,

sorry for the delay in answering.
In this article: http://www.acmesystems.it/?id=161 we say that foxbone syscalls are faster than exchanging data with the foxbone driver, which uses ioctl calls. The ioctl calls go through the standard GPIO kernel driver for I/O, so each I/O instruction carries more overhead. Foxbone syscalls instead pass data directly to the FPGA through a specialized kernel driver that tries to perform the operation as fast as possible. Nevertheless we are still passing data to the FPGA through the Axis I/O pins, so it is correct that the FPGA spends time waiting for data to arrive.

My tests say that with foxbone syscalls we can transfer 16 bits of data in 80 nsec as peak throughput. This is obtained with the fastest way we could implement in C to write/read a port I/O register of the Axis CPU.

Once you are inside the kernel driver, the OS is no longer the bottleneck: the bottleneck is only in the transfer of data between userspace and kernelspace. If you transfer one word from your userspace program to the FPGA, you have to pass that word to the kernel foxbone driver or to the kernel foxbone syscall driver, and then the driver writes the data to the FPGA and returns. Passing, for example, 16 words in a single call will instead dramatically improve the transfer rate, since each word then carries only 1/16 (more or less) of the OS overhead.

In your case, when you say you want to fill up several registers at a time, the approach could be to organize an addressing logic inside the FPGA like the following. Let's say you need to fill up 16 registers inside the FPGA (reading 16 registers can be done by reversing the approach):

1) Build two foxbone registers, say 0x1000 and 0x1001. 0x1000 will be the command register and 0x1001 the data register.
2) Writing 0x0000 to register 0x1000 resets the FPGA internal addressing scheme (which you have to build on purpose), so that the next write to 0x1001 stores data in the first of the 16 FPGA registers.
3) Writing to the data register 0x1001 fills up the registers consecutively.

In practice there will be a 4 bit counter that is reset to 0x0 when you write to the 0x1000 foxbone register. The output of the 4 bit counter is the address of the internal FPGA register. A write to foxbone register 0x1001 acts on the register addressed by the 4 bit counter, and the counter then advances by one. In the end you will not need a "real" 0x1000 foxbone register but only the 0x1000 decoding signal, to be able to reset the 4 bit counter.

In this way you cannot access a single register at random but only sequentially. However, if you always need to transfer 16 registers at a time this would be much faster, since in terms of foxbone instructions you need only 17 of them: one write to 0x1000 to reset the address (the 4 bit counter inside the FPGA) and 16 consecutive writes to the same foxbone register (0x1001) to transfer the real data. Of course the right way would be to extend the foxbone driver with a specialized call, so that you can pass all 16 words from userspace to the kernel foxbone syscall driver and flush all the data in a single call.

In general it is correct that FPGA registers are much faster than the CPU, since the FPGA has faster logic and does not have all the overhead of the CPU architecture. With the Fox + FoxVHDL interfacing we are limited by the fact that the FPGA is not connected directly to the CPU memory bus (we don't have the bus on J6 and J7 of the Fox Board); instead the FPGA is connected to I/O lines that have to be addressed through special registers of the CPU. This I/O interfacing is the main limit on the speed of the interface. However the Fox VHDL VGA driver (which uses the foxbone syscalls inner instructions approach) is able to transfer an average of 5 Megabytes/sec (40 Megabit/s) to the FPGA during normal working of the board, and this speed is more than adequate for a lot of applications.
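The sequential addressing scheme described above (reset on 0x1000, auto-increment on 0x1001) can be sketched in C as a small software model. This is only an illustration under stated assumptions: the struct fpga_model, model_foxbone_write() and fpga_burst_write() names are hypothetical, the FPGA side would really be written in VHDL, and model_foxbone_write() is a stand-in for one foxbone write, not the actual foxbone driver or syscall interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register addresses taken from the example in the text. */
#define FOXBONE_CMD_REG  0x1000u  /* any write resets the address counter */
#define FOXBONE_DATA_REG 0x1001u  /* a write stores a word, then advances the counter */
#define FPGA_NUM_REGS    16u

/* Hypothetical software model of the FPGA side (the real logic is VHDL). */
struct fpga_model {
    uint16_t regs[FPGA_NUM_REGS]; /* the 16 internal registers */
    uint8_t  counter;             /* 4 bit address counter */
    unsigned bus_writes;          /* number of foxbone writes seen so far */
};

/* Stand-in for a single foxbone write; NOT the real driver call. */
void model_foxbone_write(struct fpga_model *f, uint16_t addr, uint16_t data)
{
    f->bus_writes++;
    if (addr == FOXBONE_CMD_REG) {
        /* Only the address decode matters here; the data value is ignored. */
        f->counter = 0;
    } else if (addr == FOXBONE_DATA_REG) {
        f->regs[f->counter] = data;
        /* 4 bit counter: wraps from 15 back to 0. */
        f->counter = (uint8_t)((f->counter + 1u) & 0x0Fu);
    }
}

/* Fill all 16 registers: one reset write plus 16 data writes. */
void fpga_burst_write(struct fpga_model *f, const uint16_t words[FPGA_NUM_REGS])
{
    assert(f != NULL && words != NULL);
    model_foxbone_write(f, FOXBONE_CMD_REG, 0x0000);
    for (size_t i = 0; i < FPGA_NUM_REGS; i++)
        model_foxbone_write(f, FOXBONE_DATA_REG, words[i]);
}
```

Calling fpga_burst_write() on a zero-initialized model issues 17 writes in total, matching the count given above. In a real driver extension the loop would presumably live inside the kernel, so that userspace pays the OS overhead only once for all 16 words.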
I have not yet experimented with asm inlines or DMA to see if we can achieve more. I am working on a sort of double data rate way of transferring data between CPU and FPGA, where the writedata line is sampled by the FPGA on both edges, in order to reduce the number of I/O instructions in the inner loop.

I am open to collaborating in these forums to improve the foxbone driver, so please feel free to experiment and report on the fox-vhdl group for discussion on these matters.

Best regards,
Roberto Asquini

--- In [email protected], "matelectron" <[EMAIL PROTECTED]> wrote:
>
>
> I've tried to use the vhdl-board in order to extend some function
> using fpga-mapping.
> Now, using foxbone syscalls, it costs many cycles to store/read fpga
> registers. From my estimates (by mapping a simple adder into the
> fpga), I get that it takes about 900 fpga clock cycles to do one
> foxbone write.
>
> Is it faster to write an assembler inline which does the transfer?
> In one doc page you say syscalls are faster than /dev/foxbone driver.
> Or are syscalls just slow because of the OS, and they're just easier
> to write.
>
> Moreover, since I need to load a whole lot of fpga registers, one
> could think of using the AXIS processor in DMA mode.
>
> I am grateful for any explanation of embedded device programming.
>
> Thank You.
>
> Mat.
>
