Thanks Marcus!  I do know the root cause of the poor performance in the
OpenCL implementation.  Maybe some background will help.  (I've actually
spent about 4 months working on the gr-clenabled GNURadio blocks [in
pybombs now] and the OpenCL study I published a month or so ago.)  OpenCL
works well for massively parallel processing across a number of
lower-throughput cores, on data sets where all of the data can be processed
in parallel.  Take a calculation such as a[i] = b[i] + c[i]: every element
can be computed in parallel, and the lower performance of each core is
offset by having tens or hundreds of them running at the same time, for a
good throughput boost.
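
As a minimal sketch of that data-parallel case (plain Python standing in
for an OpenCL kernel; add_kernel and run_in_any_order are just names made
up for illustration), each output element depends only on its own index, so
the work items can be scheduled in any order and still give the same
answer:

```python
import random

def add_kernel(i, a, b, c):
    """Body of one hypothetical work item: reads and writes index i only."""
    a[i] = b[i] + c[i]

def run_in_any_order(b, c, seed=0):
    """Run every work item once, in a shuffled order, to mimic a scheduler
    that is free to execute independent work items however it likes."""
    a = [0.0] * len(b)
    order = list(range(len(b)))
    random.Random(seed).shuffle(order)
    for i in order:
        add_kernel(i, a, b, c)
    return a

b = [1.0, 2.0, 3.0, 4.0]
c = [10.0, 20.0, 30.0, 40.0]
print(run_in_any_order(b, c))  # prints [11.0, 22.0, 33.0, 44.0]
```

Because no work item reads another's result, the execution order doesn't
matter, which is what lets tens or hundreds of cores share the work.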

For calculations such as a Costas Loop, where an error is calculated for
each point and then used in the next calculation, you can't run the
calculations in parallel; they have to be done in order to get the right
results.  You can switch OpenCL to a task-parallel mode with a work set
size of 1, but because each GNURadio block just gets 1 thread, what that
really amounts to is running the same function on a single
lower-performance GPU core.  In that case the single-core GPU performance
is an order of magnitude worse than a general CPU core for the same task.
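
Roughly, the dependency looks like the sketch below.  This is a simplified
second-order BPSK-style loop written for illustration only, with made-up
gains (alpha, beta) and a made-up function name; it is not the code of the
actual GNURadio Costas Loop block:

```python
import cmath

def costas_loop(samples, alpha=0.1, beta=0.0025):
    """Simplified Costas-style loop (illustrative assumption, not GNURadio's).

    phase and freq are carried from one sample to the next, so sample n
    cannot be processed until sample n-1 has finished.
    """
    phase = 0.0
    freq = 0.0
    out = []
    for x in samples:
        y = x * cmath.exp(-1j * phase)   # de-rotate by the current estimate
        error = y.real * y.imag          # BPSK phase error detector
        freq += beta * error             # integral path of the loop filter
        phase += freq + alpha * error    # this update feeds the NEXT sample
        out.append(y)
    return out, phase

# Alternating +1/-1 symbols with a fixed 0.5 rad phase offset.
samples = [(1 if n % 2 == 0 else -1) * cmath.exp(0.5j) for n in range(500)]
out, phase = costas_loop(samples)
print(round(phase, 3))
```

The phase used for each sample is only known once the previous sample's
error has been computed, so there is no way to hand sample n and sample
n+1 to different cores.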

I know there are a number of DSP-focused IP cores for FPGAs, so my
thought / hope was that those algorithms that can't be done in parallel
would run faster when moved from CPU speed to hardware speed on an FPGA.
Kind of like with RFNoC, just for more general-purpose FPGAs.

I think I'd still be okay if I had to pull the DSP blocks together in an
FPGA dev environment like Xilinx Vivado, as long as it could help generate
the C++ interface code (I did see one article someone wrote on doing
something like this), so that I'd then just have to write the GNURadio
block to interface with it.  I just don't know FPGAs well enough (and I
know it's not a simple learning curve) to know.


---------- Forwarded message ----------
From: Marcus Müller <[email protected]>
Date: Wed, Apr 26, 2017 at 7:31 AM
Subject: Re: [Discuss-gnuradio] OpenCL FPGA Recommendation?
To: [email protected]


Dear Ghost,


On 04/26/2017 01:01 PM, GhostOp14 wrote:
> I tested it as a single task in OpenCL on a GPU and the performance
> was horrible so I want to get the same algorithm running on an FPGA
> and see if the performance significantly improves.
Gut feeling: I wouldn't spend any money on an FPGA implementation before
I have understood why it worked so terribly on GPU, and have a good
reason why it should work better on FPGA. Frankly, I don't think you
realize how hard it is to properly optimize things for specific
architectures, and OpenCL on an FPGA will not be easier to "get right"
than OpenCL on a GPU.
>
> Given some high-bandwidth goals, I'm actually thinking either USB 3.0
> or PCIe would be the requirement.  I was looking at the Opal Kelly
> line like the one they have based on the Xilinx Artix-7.  I actually
> think the USB 3.0 interface if I can transfer runtime data to/from it
> at USB 3.0 speeds would be more portable (say laptop/desktop).  I'm
> still new to FPGA's so any other thoughts are much appreciated.  It
> looks like I may still have to work in Vivado and build the FPGA code
> but then I could interface with it from C++ and a GNURadio block?
Probably! Don't know the FPGA manufacturer's OpenCL tools and whether
they offer an easy-to-use interface to PC software.
>
> Am I on the right track?
Don't know – again, I'd recommend going into a much deeper analysis of
why things work badly on your CPU and GPU, and why an FPGA should make
that better.

Best regards,
Marcus


_______________________________________________
Discuss-gnuradio mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio