Thank you very much for your reply. I have some more questions. I understand you won't be looking at or updating this branch anymore due to the release of OpenCL, but thanks for helping right now!
Before you dive in, one thing I have concerns about is this output, which appears every time I run CUDA-enabled code:

gr_vmcircbuf_cuda::copy_buf_to_buf() error cudaMemcpy() returned 0x3 = 3 initialization error
gr_vmcircbuf_cuda::copy_buf_to_buf recreating buffer d_base=0x3820000 size=32768 *2

Are these two messages normal? Are they supposed to happen? If not, how do I fix this?

2009/6/30 Martin DvH <[email protected]>

> On Tue, 2009-06-30 at 02:52 -0400, Yu-Hua Yang wrote:
> > Upon a closer look into cuda_multiply_const_ff_kernel.cu, there
> > exist 5 different kernel functions to do the multiplication, where
> > the default one,
> >
> > __global__ void
> > cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
> >                               const int noutput_items, const float konst)
> >
> > is completely blank. But regardless, nobody calls these kernel
> > functions.
> cuda_multiply_const_ff was never finished, or I accidentally removed some
> code here.
> Either way, this is clearly a bug.
> The empty kernel should call one of the implemented kernels.
> > Then, in the same file, which is called by cuda_multiply_const_ff.cc,
> > in this function
> >
> > int
> > get_cuda_multiply_const_ff_kernel_params
> >   ( cuda_multiply_const_ff_kernel_params *params )
> > {
> >   int result = 0;
> >   //const unsigned int max_num_threads_per_block =
> >   //    MAX_NUM_THREADS_ALL;  //can use the maximum number of threads if wanted
> >   //unsigned int max_num_blocks = MAX_NUM_BLOCKS_ALL;
> >
> >   unsigned int num_blocks = 4096;            // = gridDim.x;  //NUM_CUDABLOCKS
> >   unsigned int num_threads_per_block = 512;  // = blockDim.x; //NUM_THREADS;
> >   unsigned int num_outputs_per_block = num_threads_per_block;
> >
> >   const unsigned int num_outputs_per_grid =
> >       num_outputs_per_block * num_blocks;    // (blockDim.x)*gridDim.x
> >
> >   size_t dynamic_shared_mem_size =
> >       0;  //256*sizeof(float); //0; //num_threads_per_block*sizeof(gr_complex);
> >   dim3 griddim( num_blocks, 1, 1);
> >   dim3 threaddim( num_threads_per_block, 1, 1);
> >
> >   params->griddim = griddim;
> >   params->threaddim = threaddim;
> >   params->dynamic_shared_mem_size = dynamic_shared_mem_size;
> >   params->num_outputs_padded = num_outputs_per_grid;
> >   params->num_inputs_padded = num_outputs_per_grid;
> >   params->num_inputs = 0;   //num_outputs_per_grid; //num_outputs;
> >   params->num_outputs = 0;  //num_outputs_per_grid; //num_outputs;
> >
> >   //Now you can do the kernel invocation like this:
> >   //cuda_multiply_const_ff_filter_kernel<<< params->griddim,
> >   //    params->threaddim, params->dynamic_shared_mem_size
> >   //    >>>(g_idata, g_odata, params->num_outputs_padded*X, konst);
> >   return result;
> > }
> >
> > The kernel invocation is completely commented out! The result is
> > initialized as 0 at the top and returned. All the work in between to
> > specify and allocate thread and block sizes does not seem to matter. Not
> > sure why this code exists this way; did someone make an edit, or did
> > Martin specifically comment out the kernel invocation?
> Yes, the kernel invocation is specifically commented out.
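For reference, the launch-geometry arithmetic in the quoted function can be sketched in plain Python (a sketch only; the field names mirror the C struct above, and 4096/512 are the hard-coded defaults from the quoted code):

```python
def kernel_params(num_blocks=4096, num_threads_per_block=512):
    """Mirror of get_cuda_multiply_const_ff_kernel_params' arithmetic.

    Each thread produces one output, so one full grid covers
    num_blocks * num_threads_per_block outputs per kernel launch.
    """
    num_outputs_per_block = num_threads_per_block
    num_outputs_per_grid = num_outputs_per_block * num_blocks
    return {
        "griddim": (num_blocks, 1, 1),
        "threaddim": (num_threads_per_block, 1, 1),
        "dynamic_shared_mem_size": 0,
        "num_outputs_padded": num_outputs_per_grid,
        "num_inputs_padded": num_outputs_per_grid,
    }

params = kernel_params()
# 4096 blocks * 512 threads = 2097152 outputs covered per launch
assert params["num_outputs_padded"] == 4096 * 512
```

This makes it easy to see that the function only decides how much work one launch covers; the actual data movement and arithmetic live elsewhere.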
> The commented-out kernel invocation is only there as documentation on
> how to do the actual kernel invocation.
>
> There is still a small typo.
> It should refer to cuda_multiply_const_ff_kernel and not
> cuda_multiply_const_ff_filter_kernel
>
> This method's name is get_cuda_multiply_const_ff_kernel_params.
> It does just that: determine the needed params for the kernel
> invocation.
>
> The actual kernel invocation is done in:
> cuda_multiply_const_ff_work_device
>
> Both get_cuda_multiply_const_ff_kernel_params and
> cuda_multiply_const_ff_work_device are called from normal C++ code in
> cuda_multiply_const_ff_kernel.cu

Here you mean called by cuda_multiply_const_ff.cc, right? The kernel file itself seems to just define the kernel functions, and everything is called by cuda_multiply_const_ff.cc. Maybe I am wrong, but this leads me to my question: where in cuda_multiply_const_ff is cuda_multiply_const_ff_work_device actually called? It doesn't seem like it is, but maybe it's some OOP process that I missed... anyway, not that important at the moment.

> Better examples to look at and benchmark would be
> cuda_quadrature_demod_cuda_cf
> cuda_fir_filter_fff
>
> These are both used in
> testbed/wfm/cuda_wfm_rcv.py
> which calls the complete CUDA implementation of a WFM receiver in
> cuda_wfm_rcv_cuda.py
>
> You will notice a cuda_multiply_const_ff block is instantiated as
> volume_control but not used in the final connect, because it didn't work.
> Now I know this is because of the bug you found.
>
> > Is this supposed to be this way? I don't see how this can be a proper
> > benchmark if it seems that we just test allocating threads
> > and blocks on the device and memory access times, but don't really do
> > any computation.
> That is not a good benchmark indeed.
> cuda_multiply_const_ff should be fixed first.
>
> The fir-filter is however the thing you really want to benchmark.
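The calling structure Martin describes (a params function computes launch geometry, a work_device function launches the kernel, both driven from the block's host-side code) can be sketched with hypothetical Python stand-ins; none of this is actual gr-gpu code, just the shape of the call flow:

```python
# Hedged sketch of the host-side call flow (hypothetical Python stand-ins
# for the C++/CUDA pieces discussed in the thread; names are illustrative).

def get_kernel_params(noutput_items):
    """Analogue of get_cuda_multiply_const_ff_kernel_params: computes
    launch geometry only, never touches the sample data."""
    num_blocks, num_threads = 4096, 512
    return {"griddim": num_blocks, "threaddim": num_threads,
            "num_outputs_padded": num_blocks * num_threads}

def work_device(g_idata, konst, params):
    """Analogue of cuda_multiply_const_ff_work_device: the one place
    that actually launches the kernel (here, plain Python arithmetic)."""
    return [x * konst for x in g_idata]

def work(input_items, konst=2.0):
    """Analogue of the block's work() in cuda_multiply_const_ff.cc:
    fetch the params, then hand off to the device-side worker."""
    params = get_kernel_params(len(input_items))
    return work_device(input_items, konst, params)

assert work([1.0, 2.0, 3.0]) == [2.0, 4.0, 6.0]
```

The point of the split is that the scheduler-facing work() never needs to know CUDA launch details; it just forwards the params struct.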
> (Which is also in there.)
> The more taps the fir-filter has, the more computations it will have to
> make.
> Note that the CUDA implementation is at the moment limited in the number
> of taps it can support at a certain decimation factor.
> If I remember correctly, decimationfactor*numberoftaps should stay below 2048
> (or was it 512).
>
> Ways of improving the results include having GNU Radio do calculations in
> bigger chunks.
> This can be done in several ways.
> One way is to have a block at the end of the processing chain which has
> a large output_multiple requirement and needs 1 input for every
> output.
> In other words: connect a block to the end of the processing chain which
> has set_output_multiple(large_number) in its constructor.
>
> You can use gr_test for this.

I decided to abandon and comment out all the cuda.multiply_const_ff function calls and concentrate on cuda.fir_filter_fff as suggested. Things I have questions/concerns about:

1. I increased output_multiple by doing "options.output_multiple = xxx" and this has no effect on the computing time of either CUDA or CPU. Did I do something wrong?

2. I increased the taps by doing "taps = range(1,256)" and also increased the number of fir_filter blocks in the code, and voila, I am now able to get CUDA to be faster than the CPU alone. However, with something like "taps = range(1,512)" the CUDA part becomes extremely slow (~20 seconds) while the CPU is still fine (~2 sec). Why? This may be related to what you were saying about the maximum number of taps... although why is the CPU still able to compute?

3. I had to increase the number of fir_filter blocks to 14 before I started seeing CUDA out-perform the CPU. Experimentally that's fine, I achieved my objective, but how is this "increased computation" justified in a normal GNU Radio operation? I mean, when would a normal GNU Radio operation require a chain of 14 fir_filters?
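The tap limit Martin half-remembers can be checked up front before building a flowgraph. A minimal sketch, assuming the bound really is decimation * ntaps < 2048 (he says it might be 512; the exact value would need to be confirmed in the kernel source):

```python
KERNEL_LIMIT = 2048  # assumed bound from the thread; possibly 512 instead

def fits_cuda_fir(num_taps, decimation):
    """Return True if a taps/decimation combination should fit the CUDA
    fir_filter kernel's limit as described in the thread."""
    return decimation * num_taps < KERNEL_LIMIT

# taps = range(1, 256) gives 255 taps; with decimation 1 this fits:
assert fits_cuda_fir(255, 1)
# 511 taps still fits at decimation 1 under a 2048 bound (but not under 512),
# so the observed slowdown at range(1, 512) may point to the tighter limit:
assert fits_cuda_fir(511, 1)
assert not fits_cuda_fir(511, 8)
```

This would also explain why the CPU path keeps working at 511 taps: the limit is an artifact of the CUDA kernel's layout, not of the FIR algorithm itself.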
I guess this is going beyond just "benchmarking" and asking where else I can take advantage of CUDA's computation power in GNU Radio in a "normal" operation?

4. Looking at cuda_fir_fff_7_kernel, which I believe is the core of cuda_fir_filter, it seems you are using shared memory, right? Just making sure we are not using global or local memory, which would disastrously slow down the CUDA computations.

> There are other ways, which include changing the scheduler and the buffer
> instantiating code.
> I'd rather not pursue these ways because of the increased complexity.

I'd rather increase computation too; that seems the fastest/easiest route to me, I'm just not sure where.

> I hope this helps,
>
> Martin

It has been most helpful! Thank you again!

> > I am probably way off here, it doesn't make any sense... someone please
> > clarify!
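For context on why tap count drives the CUDA/CPU trade-off: each output of a decimating FIR filter is a dot product over the taps, so work per output scales linearly with the number of taps. A plain-Python reference of the arithmetic the kernel parallelizes (one output per thread), using one common tap/input indexing convention, not the actual CUDA code:

```python
def fir_filter(inputs, taps, decimation=1):
    """Direct-form decimating FIR:
    output[k] = sum_j taps[j] * inputs[k*decimation + j]

    Work per output is len(taps) multiply-adds, which is why longer tap
    vectors push the benchmark toward being computation-bound, where the
    GPU can win, instead of memory-transfer-bound, where it loses.
    """
    ntaps = len(taps)
    nout = (len(inputs) - ntaps) // decimation + 1
    return [
        sum(taps[j] * inputs[k * decimation + j] for j in range(ntaps))
        for k in range(nout)
    ]

# A 3-tap boxcar sums each window of a ramp:
assert fir_filter([1, 2, 3, 4, 5], [1, 1, 1]) == [6, 9, 12]
# Decimation 2 advances the window two samples per output:
assert fir_filter([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], decimation=2) == [6, 12, 18]
```

Chaining 14 such filters is just a way of multiplying the arithmetic per transferred sample; any application whose per-sample math is comparably heavy (e.g. long channelizer filters) would benefit similarly.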
_______________________________________________
Discuss-gnuradio mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/discuss-gnuradio
