Thank you very much for your reply. I have some more questions. I understand you won't be looking at or updating this branch anymore due to the release of OpenCL, but thanks for helping right now!
Before you dive in, one thing I have concerns about is this output, which appears every time I run CUDA-enabled code:

gr_vmcircbuf_cuda::copy_buf_to_buf() error cudaMemcpy() returned 0x3 = 3 initialization error
gr_vmcircbuf_cuda::copy_buf_to_buf recreating buffer d_base=0x3820000 size=32768 *2

Are these two messages normal? Are they supposed to happen? If not, how do I fix this?

2009/6/30 Martin DvH <[email protected]>

> On Tue, 2009-06-30 at 02:52 -0400, Yu-Hua Yang wrote:
> > Upon a closer look into cuda_multiply_const_ff_kernel.cu, there
> > exist 5 different kernel functions to do the multiplication, where
> > the default one,
> >
> > __global__ void
> > cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
> >                               const int noutput_items, const float konst)
> >
> > is completely blank. But regardless, nobody calls these kernel
> > functions.
> cuda_multiply_const_ff was never finished, or I accidentally removed some
> code here.
> Either way, this is clearly a bug.
> The empty kernel should call one of the implemented kernels.
> > Then, in the same file, which is called by cuda_multiply_const_ff.cc,
> > in this function
> >
> > int
> > get_cuda_multiply_const_ff_kernel_params
> >   ( cuda_multiply_const_ff_kernel_params *params )
> > {
> >   int result = 0;
> >   //const unsigned int max_num_threads_per_block =
> >   //    MAX_NUM_THREADS_ALL;  //can use the maximum number of threads if wanted
> >   //unsigned int max_num_blocks = MAX_NUM_BLOCKS_ALL;
> >
> >   unsigned int num_blocks = 4096;            // = gridDim.x;  //NUM_CUDABLOCKS
> >   unsigned int num_threads_per_block = 512;  // = blockDim.x; //NUM_THREADS;
> >   unsigned int num_outputs_per_block = num_threads_per_block;
> >
> >   const unsigned int num_outputs_per_grid =
> >       num_outputs_per_block * num_blocks;    // (blockDim.x)*gridDim.x
> >
> >   size_t dynamic_shared_mem_size =
> >       0;  //256*sizeof(float); //0; //num_threads_per_block*sizeof(gr_complex);
> >   dim3 griddim( num_blocks, 1, 1);
> >   dim3 threaddim( num_threads_per_block, 1, 1);
> >
> >   params->griddim = griddim;
> >   params->threaddim = threaddim;
> >   params->dynamic_shared_mem_size = dynamic_shared_mem_size;
> >   params->num_outputs_padded = num_outputs_per_grid;
> >   params->num_inputs_padded = num_outputs_per_grid;
> >   params->num_inputs = 0;   //num_outputs_per_grid; //num_outputs;
> >   params->num_outputs = 0;  //num_outputs_per_grid; //num_outputs;
> >
> >   //Now you can do the kernel invocation like this:
> >   //cuda_multiply_const_ff_filter_kernel<<< params->griddim,
> >   //    params->threaddim, params->dynamic_shared_mem_size
> >   //    >>>(g_idata, g_odata, params->num_outputs_padded*X, konst);
> >   return result;
> > }
> >
> > The kernel invocation is completely commented out! The result is
> > initialized as 0 at the top and returned. All the work in between to
> > specify and allocate thread and block sizes does not seem to matter. Not
> > sure why this code exists this way; did someone make an edit, or did
> > Martin specifically comment out the kernel invocation?
> Yes, the kernel invocation is specifically commented out.
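For reference, the launch-geometry arithmetic in the quoted function can be sketched in plain Python (a sketch only; the field names mirror the C struct above, and 4096/512 are the hard-coded defaults from the quoted code):

```python
def kernel_params(num_blocks=4096, num_threads_per_block=512):
    """Mirror of get_cuda_multiply_const_ff_kernel_params' arithmetic.

    Each thread produces one output, so one full grid covers
    num_blocks * num_threads_per_block outputs per kernel launch.
    """
    num_outputs_per_block = num_threads_per_block
    num_outputs_per_grid = num_outputs_per_block * num_blocks
    return {
        "griddim": (num_blocks, 1, 1),
        "threaddim": (num_threads_per_block, 1, 1),
        "dynamic_shared_mem_size": 0,
        "num_outputs_padded": num_outputs_per_grid,
        "num_inputs_padded": num_outputs_per_grid,
    }

params = kernel_params()
# 4096 blocks * 512 threads = 2097152 outputs covered per launch
assert params["num_outputs_padded"] == 4096 * 512
```

This makes it easy to see that the function only decides how much work one launch covers; the actual data movement and arithmetic live elsewhere.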
> The commented-out kernel invocation is only there as documentation on
> how to do the actual kernel invocation.
>
> There is still a small typo.
> It should refer to cuda_multiply_const_ff_kernel and not
> cuda_multiply_const_ff_filter_kernel
>
> This method's name is get_cuda_multiply_const_ff_kernel_params.
> It does just that: determine the needed params for the kernel
> invocation.
>
> The actual kernel invocation is done in:
> cuda_multiply_const_ff_work_device
>
> Both get_cuda_multiply_const_ff_kernel_params and
> cuda_multiply_const_ff_work_device are called from normal C++ code in
> cuda_multiply_const_ff_kernel.cu

Here you mean called by cuda_multiply_const_ff.cc, right? The kernel file itself seems to just define the kernel functions, and everything is called by cuda_multiply_const_ff.cc. Maybe I am wrong, but this leads me to my question: where in cuda_multiply_const_ff is cuda_multiply_const_ff_work_device actually called? It doesn't seem like it is, but maybe it's some OOP process that I missed... anyway, not that important at the moment.

> Better examples to look at and benchmark would be
> cuda_quadrature_demod_cuda_cf
> cuda_fir_filter_fff
>
> These are both used in
> testbed/wfm/cuda_wfm_rcv.py
> which calls the complete CUDA implementation of a WFM receiver in
> cuda_wfm_rcv_cuda.py
>
> You will notice a cuda_multiply_const_ff block is instantiated as
> volume_control but not used in the final connect, because it didn't work.
> Now I know this is because of the bug you found.
>
> > Is this supposed to be this way? I don't see how this can be a proper
> > benchmark if it seems that we just test allocating threads
> > and blocks on the device and memory access times, but don't really do
> > any computation.
> That is not a good benchmark indeed.
> cuda_multiply_const_ff should be fixed first.
>
> The fir-filter is however the thing you really want to benchmark.
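The calling structure Martin describes (a params function computes launch geometry, a work_device function launches the kernel, both driven from the block's host-side code) can be sketched with hypothetical Python stand-ins; none of this is actual gr-gpu code, just the shape of the call flow:

```python
# Hedged sketch of the host-side call flow (hypothetical Python stand-ins
# for the C++/CUDA pieces discussed in the thread; names are illustrative).

def get_kernel_params(noutput_items):
    """Analogue of get_cuda_multiply_const_ff_kernel_params: computes
    launch geometry only, never touches the sample data."""
    num_blocks, num_threads = 4096, 512
    return {"griddim": num_blocks, "threaddim": num_threads,
            "num_outputs_padded": num_blocks * num_threads}

def work_device(g_idata, konst, params):
    """Analogue of cuda_multiply_const_ff_work_device: the one place
    that actually launches the kernel (here, plain Python arithmetic)."""
    return [x * konst for x in g_idata]

def work(input_items, konst=2.0):
    """Analogue of the block's work() in cuda_multiply_const_ff.cc:
    fetch the params, then hand off to the device-side worker."""
    params = get_kernel_params(len(input_items))
    return work_device(input_items, konst, params)

assert work([1.0, 2.0, 3.0]) == [2.0, 4.0, 6.0]
```

The point of the split is that the scheduler-facing work() never needs to know CUDA launch details; it just forwards the params struct.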
> (Which is also in there.)
> The more taps the fir-filter has, the more computations it will have to
> make.
> Note that the CUDA implementation is at the moment limited in the number
> of taps it can support at a certain decimation factor.
> If I remember correctly, decimationfactor*numberoftaps should stay below 2048
> (or was it 512).
>
> Ways of improving the results include having GNU Radio do calculations in
> bigger chunks.
> This can be done in several ways.
> One way is to have a block at the end of the processing chain which has
> a large output_multiple requirement and needs 1 input for every
> output.
> In other words: connect a block to the end of the processing chain which
> has set_output_multiple(large_number) in its constructor.
>
> You can use gr_test for this.

I decided to abandon and comment out all the cuda.multiply_const_ff function calls and concentrate on cuda.fir_filter_fff as suggested. Things I have questions/concerns about:

1. I increased output_multiple by doing "options.output_multiple = xxx" and this has no effect on the computing time of either CUDA or CPU. Did I do something wrong?

2. I increased the taps by doing "taps = range(1,256)" and also increased the number of fir_filter blocks in the code, and voila, I am now able to get CUDA to be faster than the CPU alone. However, with something like "taps = range(1,512)" the CUDA part becomes extremely slow (~20 seconds) while the CPU is still fine (~2 sec). Why? This may be related to what you were saying about the maximum number of taps... although why is the CPU still able to compute?

3. I had to increase the number of fir_filter blocks to 14 before I started seeing CUDA out-perform the CPU. Experimentally that's fine, I achieved my objective, but how is this "increased computation" justified in a normal GNU Radio operation? I mean, when would a normal GNU Radio operation require a chain of 14 fir_filters?
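The tap limit Martin half-remembers can be checked up front before building a flowgraph. A minimal sketch, assuming the bound really is decimation * ntaps < 2048 (he says it might be 512; the exact value would need to be confirmed in the kernel source):

```python
KERNEL_LIMIT = 2048  # assumed bound from the thread; possibly 512 instead

def fits_cuda_fir(num_taps, decimation):
    """Return True if a taps/decimation combination should fit the CUDA
    fir_filter kernel's limit as described in the thread."""
    return decimation * num_taps < KERNEL_LIMIT

# taps = range(1, 256) gives 255 taps; with decimation 1 this fits:
assert fits_cuda_fir(255, 1)
# 511 taps still fits at decimation 1 under a 2048 bound (but not under 512),
# so the observed slowdown at range(1, 512) may point to the tighter limit:
assert fits_cuda_fir(511, 1)
assert not fits_cuda_fir(511, 8)
```

This would also explain why the CPU path keeps working at 511 taps: the limit is an artifact of the CUDA kernel's layout, not of the FIR algorithm itself.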
I guess this is going beyond just "benchmarking" and asking where else I can take advantage of CUDA's computation power in GNU Radio in a "normal" operation?

4. Looking at cuda_fir_fff_7_kernel, which I believe is the core of cuda_fir_filter, it seems you are using shared memory, right? Just making sure we are not using global or local memory, which would disastrously slow down the CUDA computations.

> There are other ways, which include changing the scheduler and the buffer
> instantiating code.
> I'd rather not pursue these ways because of the increased complexity.

I'd rather increase computation too; that seems the fastest/easiest route to me, I'm just not sure where.

> I hope this helps,
>
> Martin

It has been most helpful! Thank you again!

> > I am probably way off here, it doesn't make any sense... someone please
> > clarify!
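For context on why tap count drives the CUDA/CPU trade-off: each output of a decimating FIR filter is a dot product over the taps, so work per output scales linearly with the number of taps. A plain-Python reference of the arithmetic the kernel parallelizes (one output per thread), using one common tap/input indexing convention, not the actual CUDA code:

```python
def fir_filter(inputs, taps, decimation=1):
    """Direct-form decimating FIR:
    output[k] = sum_j taps[j] * inputs[k*decimation + j]

    Work per output is len(taps) multiply-adds, which is why longer tap
    vectors push the benchmark toward being computation-bound, where the
    GPU can win, instead of memory-transfer-bound, where it loses.
    """
    ntaps = len(taps)
    nout = (len(inputs) - ntaps) // decimation + 1
    return [
        sum(taps[j] * inputs[k * decimation + j] for j in range(ntaps))
        for k in range(nout)
    ]

# A 3-tap boxcar sums each window of a ramp:
assert fir_filter([1, 2, 3, 4, 5], [1, 1, 1]) == [6, 9, 12]
# Decimation 2 advances the window two samples per output:
assert fir_filter([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], decimation=2) == [6, 12, 18]
```

Chaining 14 such filters is just a way of multiplying the arithmetic per transferred sample; any application whose per-sample math is comparably heavy (e.g. long channelizer filters) would benefit similarly.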
_______________________________________________
Discuss-gnuradio mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/discuss-gnuradio
