I don't recommend using the extra blocks; that would probably cause more
overhead. Looking at gr_quadrature_demod_cf::work, it looks like you can
vectorize the operation of the conjugate multiply, then the atan, then
the gain scaler. So, that would be one for loop that operates on 4
samples at a time, and calls 3 volk functions.


Hi, Josh. I implemented a Volk-based quadrature_demod_cf block (please find it in the attachment). Since the Volk atan2 function is currently SSE-only, as Nick said, and there is no conjugate-multiply function for FC32 inputs, I use the built-in conj and GNU Radio's gr_fast_atan2f functions, plus two Volk multiply functions. The for loop is timed with gruel's high_res_timer. For comparison, I also timed the work function of the original gr_quadrature_demod_cf (also attached). Each of the two blocks is connected to a file_source that provides modulated data.
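
To make the three stages concrete, here is a minimal reference sketch in plain C++. It is not the attached block: it assumes std::atan2 in place of gr_fast_atan2f and uses ordinary loops where the attachment calls Volk kernels. The names quad_demod_staged and make_tone are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Reference quadrature demodulator written as three separate passes
// (conjugate multiply, phase, gain). Assumption: std::atan2 stands in for
// gr_fast_atan2f, and plain loops stand in for the Volk kernels.
// `in` must hold one more sample than the number of outputs (one sample of history).
std::vector<float> quad_demod_staged(const std::vector<std::complex<float>>& in, float gain)
{
    assert(!in.empty());
    const std::size_t n = in.size() - 1;
    std::vector<std::complex<float>> prod(n);
    std::vector<float> out(n);

    // Stage 1: prod[i] = in[i+1] * conj(in[i])
    for (std::size_t i = 0; i < n; ++i)
        prod[i] = in[i + 1] * std::conj(in[i]);

    // Stage 2: phase of each product
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::atan2(prod[i].imag(), prod[i].real());

    // Stage 3: scale by the gain
    for (std::size_t i = 0; i < n; ++i)
        out[i] *= gain;

    return out;
}

// Hypothetical test helper: unit-amplitude tone advancing `omega` rad per sample.
std::vector<std::complex<float>> make_tone(std::size_t n, float omega)
{
    std::vector<std::complex<float>> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = std::polar(1.0f, omega * static_cast<float>(i));
    return v;
}
```

For a tone that advances by 0.25 rad per sample, every output should be close to gain * 0.25, which gives a quick correctness check before any timings are compared.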

I tested the two blocks individually, first on a PC with an Intel processor, then on the E100. On the PC, the Volk-based block always takes less time to demodulate a buffer of the same size (e.g., for 4096 input items, the original quadrature_demod_cf block takes 0.185 ms, while the Volk-based block takes only 0.163 ms).
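
To put those PC numbers on a per-item basis, the short sketch below converts a buffer time into nanoseconds per item and computes the relative saving. The helper names ns_per_item and fraction_saved are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Average time per item in nanoseconds, given a buffer time in milliseconds.
double ns_per_item(double buffer_ms, std::size_t items)
{
    assert(items > 0);
    return buffer_ms * 1e6 / static_cast<double>(items);
}

// Fraction of time saved by the faster block relative to the baseline.
double fraction_saved(double baseline_ms, double faster_ms)
{
    assert(baseline_ms > 0.0);
    return (baseline_ms - faster_ms) / baseline_ms;
}
```

With the figures above (4096 items, 0.185 ms and 0.163 ms), this works out to roughly 45.2 ns per item for the original block, 39.8 ns per item for the Volk-based block, and a saving of about 12%.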

However, the results are different on the E100: sometimes the original block runs faster, sometimes the Volk-based block does. I ran the tests several times. The recorded times vary by some tens (occasionally a few hundreds) of nanoseconds, and neither block is consistently faster than the other.
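
Since a single reading varies from run to run, one option (not something the attached blocks do) is to time many buffers and compare medians instead of individual measurements. The sketch below uses std::chrono::steady_clock rather than gruel's high_res_timer so that it stands alone; median_ns and median_time_ns are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Median of a non-empty list of samples (upper median for even counts).
std::int64_t median_ns(std::vector<std::int64_t> samples)
{
    assert(!samples.empty());
    auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}

// Call `fn` `runs` times, timing each call, and return the median in nanoseconds.
template <typename Fn>
std::int64_t median_time_ns(Fn&& fn, int runs)
{
    assert(runs > 0);
    std::vector<std::int64_t> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int r = 0; r < runs; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    return median_ns(std::move(samples));
}
```

In use, `fn` would wrap one pass of the demodulation loop over a fixed input buffer, and the two medians (original versus Volk-based) would be compared on the same machine.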

Now I'm confused by the results, since I expected the volk-ified demodulator to be faster. Could you give me some help on this issue? Thanks.


Best Regards,

Terry
  // Volk-based work(): conjugate, Volk multiply, fast atan2, Volk gain scaling,
  // processed four samples at a time.
  int work(int noutput_items,
           gr_vector_const_void_star &input_items,
           gr_vector_void_star &output_items)
  {
    const std::complex<float> *in = reinterpret_cast<const std::complex<float> *>(input_items[0]);
    float *out = reinterpret_cast<float *>(output_items[0]);
    std::complex<float> conjin[4], prod[4];
    float angle[4];
    gruel::high_res_timer_type t1, t2, tit;

    t1 = gruel::high_res_timer_now();

    // Only whole groups of four samples go through the Volk path.
    const int nvec = noutput_items - (noutput_items % 4);
    int i = 0;
    for(; i < nvec; i += 4){
      for(int j = 0; j < 4; j++)
        conjin[j] = conj(in[i+j]);

      // NOTE: the _a kernels expect aligned buffers; &in[i+1], &out[i] and the
      // stack arrays above are not guaranteed to meet that requirement.
      volk_32fc_x2_multiply_32fc_a(prod, &in[i+1], conjin, 4);

      for(int j = 0; j < 4; j++)
        angle[j] = gr_fast_atan2f(imag(prod[j]), real(prod[j]));

      volk_32f_s32f_multiply_32f_a(&out[i], angle, d_gain, 4);
    }

    // Scalar tail, so the loop never touches samples past noutput_items
    // when noutput_items is not a multiple of four.
    for(; i < noutput_items; i++){
      std::complex<float> product = in[i+1] * conj(in[i]);
      out[i] = d_gain * gr_fast_atan2f(imag(product), real(product));
    }

    t2 = gruel::high_res_timer_now();

    tit = t2 - t1;

    printf(">> time to demod one [%d] buffer: %lld ticks\n", noutput_items, tit);

    return noutput_items;
  }

  // Original gr_quadrature_demod_cf work(), with timing added for comparison.
  int work(int noutput_items,
           gr_vector_const_void_star &input_items,
           gr_vector_void_star &output_items)
  {
    const std::complex<float> *in = reinterpret_cast<const std::complex<float> *>(input_items[0]);
    float *out = reinterpret_cast<float *>(output_items[0]);
    gruel::high_res_timer_type t1, t2, tit;
    
    t1 = gruel::high_res_timer_now();

    for(int i = 0; i < noutput_items; i++){
      std::complex<float> product = in[i+1] * conj(in[i]);
      out[i] = d_gain * gr_fast_atan2f(imag(product), real(product));
    }

    t2 = gruel::high_res_timer_now();

    tit = t2 - t1;

    printf(">> time to demod one [%d] buffer: %lld ticks\n", noutput_items, tit);

    return noutput_items;
  }
_______________________________________________
Discuss-gnuradio mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio