Thank you very much for your quick answers! Marcus (Leech), I found the function you mentioned minutes after I sent the mail. Although it apparently works, Performance Monitor is behaving really weird when I use it. I have to look up that. Marcus (Müller), a very informative answer indeed. I will see if I can get that endless fame you mention :-). In any case, I'll post what I finally did and the performance gain achieved. Best Federico
2016-05-11 17:47 GMT-03:00 Marcus Müller <[email protected]>: > Hi Federico, > > > On 11.05.2016 21:09, Federico Larroca wrote: > > Hello everyone, > We are on the stage of optimizing our project (gr-isdbt). > > Awesome! > > One of the most consuming blocks is OFDM synchronization, and in > particular the equalization phase. This is simply the division between the > input signal and the estimated channel gains (two modestly big arrays of > ~5000 complexes for each OFDM symbol). > Until now, this was performed by a for loop, so my plan was to change it > for a volk function. However, there is no complex division in VOLK. So I've > done a rather indirect operation using the property that a/b = > a*conj(b)/|b|^2, resulting in six lines of code (a multiply conjugate, a > magnitude squared, a deinterleave, a couple of float divisions and an > interleave). Obviously the performance gain (measured with the Performance > Monitor) is marginal (to be optimistic)... > > I have to admit, I'd expect your "simple" for loop doing something like > > void yourclass::normalize(std::complex<float> *a, std::complex<float> *b) { > for(size_t idx; idx < a_len; ++idx) > a[idx] /= b[idx]; > } > > > to be neatly optimizable by the compiler, at least if it knows that a and > b aren't pointing at the same memory- > > Your approach, > [image: $\frac ab = a \cdot \frac{b^*}{|b|^2}= a \cdot \frac{b^*}{b\,b^*} > = a \cdot \frac 1b$] > is correct; however, in C++ with std::complex<> > > a/b > > pretty much does that already (ugly std lib C++ ahead, from > /usr/include/c++/<version>/complex): > > // XXX: This is a grammar school implementation. > template<typename _Tp> > template<typename _Up> > complex<_Tp>& > complex<_Tp>::operator/=(const complex<_Up>& __z) > { > const _Tp __r = _M_real * __z.real() + _M_imag * __z.imag(); > const _Tp __n = std::norm(__z); > _M_imag = (_M_imag * __z.real() - _M_real * __z.imag()) / __n; > _M_real = __r / __n; > return *this; > } > > And the problem is that while doing that for every a and b separately > might mean you can't make full use of SIMD instructions to eg. do four > complex divisions at once, it avoids having to load and store original / > intermediate values from/to RAM. Basically, your CPU might not be the > bottleneck – RAM could be, and doing everything you need for a single > division at once, even if done without any optimization, might be faster > than incurring additional memory transfers. That's because your memory > controller pre-fetches whole cache lines worth of values when getting the > first elements of a and b, and working on values from cache is > significantly (read: factor > 50) than a single memory transfer. > > So, my immediate recommendation really is to keep your loop as minimal as > possible, giving your compiler a solid chance to see the potential for > optimization. There might not be much you can do. Even hand-written VOLK > kernels aren't always faster than automatically generated optimized machine > code. > > Does anyone has a better idea? Implementing a new kernel is simply out of > my knowledge scope. > > Ha! But it would mean endless (additional) fame! > Soooo: look at the volk_32fc_x2_multiply_conjugate_32fc.h kernel source. > Specifically, at the SSE3 implementation, > volk_32fc_x2_multiply_conjugate_32fc_u_sse3(…). > You'll notice line 134: > > z = _mm_complexconjugatemul_ps(x, y); > > > As you can see, there's a a "VOLK intrinsic", > > _mm_complexconjugatemul_ps > > which is defined in volk_intrinsics.h. That same file contains > _mm_magnitudesquared_ps_sse3 . Maybe you can make something clever out of > that :) > > Best regards, > Marcus > > > [1] https://gcc.gnu.org/onlinedocs/gcc/Restricted-Pointers.html > > _______________________________________________ > Discuss-gnuradio mailing list > [email protected] > https://lists.gnu.org/mailman/listinfo/discuss-gnuradio > >
_______________________________________________ Discuss-gnuradio mailing list [email protected] https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
