[foxboard] Re: foxbone maximum performance

Roberto Asquini Sun, 20 Jan 2008 03:13:38 -0800

Dear Mat,

Sorry for the delay in answering.

In this article:
http://www.acmesystems.it/?id=161

we say that foxbone syscalls are faster than exchanging data with the
foxbone driver, which uses ioctl calls.
The ioctl calls go through the standard GPIO kernel driver for I/O, so
each I/O instruction carries more overhead.
Foxbone syscalls instead pass data directly to the FPGA through a
specialized kernel driver that tries to perform each operation as fast
as possible.

Nevertheless, we are still passing data to the FPGA through the Axis I/O
pins, so it is correct that the FPGA spends time waiting for data to
arrive.
My tests show that with foxbone syscalls we can transfer 16 bits of data
in 80 ns at peak throughput.
This figure is obtained with the fastest C code we could write to
read/write an I/O port register of the Axis CPU.
Once you are inside the kernel driver, the OS is no longer the
bottleneck: the OS overhead occurs only when data crosses between
userspace and kernelspace.

If you transfer one word from your userspace program to the FPGA, you
have to pass that word to the kernel foxbone driver or the kernel
foxbone syscall driver, and then the driver writes the data to the
FPGA and returns. Passing, for example, 16 words in a single call
instead will dramatically improve the transfer rate, since each word
then carries only about 1/16 of the OS overhead.
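
As a rough illustration of why batching helps, the per-word cost can be
modelled as a fixed per-call overhead spread over the words in the call,
plus the time to push one word to the FPGA. This is only a sketch: the
function name and the 1600 ns overhead used in the note below are
assumptions for illustration, not measurements. Only the 80 ns per word
comes from the peak figure quoted above.

```c
#include <assert.h>

/* Average cost per word, in nanoseconds, when `words` words are sent in
 * one call.
 * call_overhead_ns: ASSUMED fixed userspace/kernelspace cost of one call.
 * per_word_ns:      time the driver needs to push one word to the FPGA. */
static double cost_per_word_ns(double call_overhead_ns, double per_word_ns,
                               unsigned words)
{
    assert(words > 0);
    /* The call overhead is paid once and shared by every word in the call. */
    return call_overhead_ns / words + per_word_ns;
}
```

With an assumed overhead of 1600 ns per call and 80 ns per word, one word
per call costs 1680 ns per word, while 16 words per call cost 180 ns per
word.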

In your case, where you want to fill up several registers at a time, one
approach is to build an addressing logic inside the FPGA like the
following.

Let's say you need to fill up 16 registers inside the FPGA (reading 16
registers can be done by reversing the approach):
1) build two foxbone registers, say 0x1000 and 0x1001. 0x1000 will be the
command register and 0x1001 the data register.
2) writing 0x0000 to register 0x1000 resets the FPGA internal
addressing scheme (which you have to build on purpose), so that the next
write to 0x1001 stores data in the first of the 16 FPGA registers.
3) each following write to the data register 0x1001 fills up the next
register in sequence.
In practice there will be a 4 bit counter that is reset to 0x0 when you
write to the 0x1000 foxbone register and that increments after every
write to 0x1001. The output of the 4 bit counter is the address of the
internal FPGA register.
A write to foxbone register 0x1001 acts on the register addressed by the
4 bit counter.
In the end you will not need a "real" 0x1000 foxbone register, only the
0x1000 decoding signal, which is enough to reset the 4 bit counter.
This way you cannot access a single register at random, only
sequentially. However, if you always need to transfer 16 registers at a
time, this will be much faster, since in terms of foxbone instructions
you need only 17: one write to 0x1000 to reset the address (the 4 bit
counter inside the FPGA) and 16 consecutive writes to the same foxbone
register (0x1001) to transfer the real data.
Of course, the right way would be to extend the foxbone driver with a
specialized call, so that you can pass all 16 words from userspace to
the kernel foxbone syscall driver and flush all the data in a single
call.
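
A burst helper for this 17-write sequence could look like the sketch
below. It is not part of the existing foxbone driver: fb_write_fn is a
hypothetical hook standing in for whatever low-level write routine the
driver uses, and fb_trace_write is only a recording stub so the helper
can be tried without hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FB_CMD_REG  0x1000u  /* write resets the FPGA address counter */
#define FB_DATA_REG 0x1001u  /* write stores one word and advances    */

/* Hypothetical low-level hook: writes `value` to foxbone register `addr`. */
typedef void (*fb_write_fn)(void *ctx, uint16_t addr, uint16_t value);

/* Reset the address counter, then stream `count` words to the data
 * register. Returns the number of foxbone writes issued (count + 1). */
static size_t fb_burst_write(fb_write_fn write, void *ctx,
                             const uint16_t *words, size_t count)
{
    size_t i;

    assert(write != NULL && words != NULL);
    write(ctx, FB_CMD_REG, 0x0000);
    for (i = 0; i < count; i++)
        write(ctx, FB_DATA_REG, words[i]);
    return count + 1;
}

/* Recording stub for testing: counts writes instead of touching I/O. */
struct fb_trace {
    size_t   cmd_writes;
    size_t   data_writes;
    uint16_t last_value;
};

static void fb_trace_write(void *ctx, uint16_t addr, uint16_t value)
{
    struct fb_trace *t = ctx;

    if (addr == FB_CMD_REG)
        t->cmd_writes++;
    else if (addr == FB_DATA_REG)
        t->data_writes++;
    t->last_value = value;
}
```

Called with 16 words, fb_burst_write issues one command write and 16
data writes. Inside a real driver extension the same loop would run
entirely in kernelspace, so the userspace/kernelspace crossing is paid
once per burst.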

In general it is correct that FPGA registers are much faster than the
CPU, since the FPGA has faster logic and does not have all the overhead
of the CPU architecture. With the Fox + FoxVHDL interface we are limited
by the fact that the FPGA does not sit directly on the CPU memory bus
(the bus is not available on J6 and J7 of the Fox Board); instead the
FPGA is connected to I/O lines that have to be addressed through special
registers of the CPU. This I/O interfacing is the main limit on the
speed of the interface. However, the Fox VHDL VGA driver (which uses the
inner-instruction approach of the foxbone syscalls) is able to transfer
an average of 5 Megabytes/sec (40 Megabit/s) to the FPGA during normal
operation of the board, and this speed is more than adequate for a lot
of applications.
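
To put the two figures in this message side by side, the small sketch
below converts a transfer time into a byte rate. The helper names are
made up for illustration, and 1 Megabyte is taken as 10^6 bytes, which
matches the 5 Megabytes/sec = 40 Megabit/s equivalence above.

```c
#include <assert.h>

/* Bytes per second when `bits_per_transfer` bits move every `transfer_ns`
 * nanoseconds. */
static double bytes_per_second(unsigned bits_per_transfer, double transfer_ns)
{
    assert(transfer_ns > 0.0);
    return (bits_per_transfer / 8.0) * (1e9 / transfer_ns);
}

/* Convert a byte rate to Megabit/s, with 1 Megabit = 10^6 bits. */
static double megabit_per_second(double bytes_per_sec)
{
    return bytes_per_sec * 8.0 / 1e6;
}
```

For 16 bits every 80 ns this gives 25,000,000 bytes/s, that is 25
Megabytes/sec at peak, so the 5 Megabytes/sec average of the VGA driver
is about one fifth of the peak rate.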

I have not yet experimented with inline asm or DMA to see if we can
achieve more. I am working on a sort of double data rate way of
transferring data between CPU and FPGA, where the writedata line is
sampled by the FPGA on both the rising and the falling edge, in order to
reduce the number of I/O instructions in the inner loop.

I am open to collaborating in these forums to improve the foxbone
driver, so please feel free to experiment and report to the fox-vhdl
group for discussion of these matters.

Best regards,

Roberto Asquini

--- In [email protected], "matelectron" <[EMAIL PROTECTED]> wrote:
>
>
> I've tried to use the vhdl-board in order to extend some function
> using fpga-mapping.
> Now, using foxbone syscalls, it costs many cycles to store/read fpga
> registers. From my estimates (by mapping a simple adder into the
> fpga), I get that it takes about 900 fpga clock cycles to do one
> foxbone write.
>
> Is it faster to write an assembler inline which does the transfer?
> In one doc page you say syscalls are faster than /dev/foxbone driver.
> Or are syscalls just slow because of the OS, and they're just easier
> to write.
>
> Moreover, since I need to load a whole lot of fpga registers, one
> could think of using the AXIS processor in DMA mode.
>
> I am grateful for any explanation of embedded device programming.
>
> Thank You.
>
> Mat.
>
