Yeah, that makes sense. With all the auto-vectorization and SIMD support in recent versions of gcc, it seems a better approach is to tailor the C code to work well with SIMD-aware compilers.
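As a rough illustration of what "plain C that a SIMD-aware compiler can work with" might look like, here is a minimal sketch. The function names mul_block() and half_gain() are hypothetical and not part of Pd, and plain float stands in for t_float. The assumptions are that the buffers do not overlap (which is what restrict promises) and that the loops stay free of branches and cross-iteration dependencies. Whether gcc actually vectorizes them depends on version, flags and target.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Pd: multiply two signal blocks.
 * `restrict` tells the compiler the three buffers never overlap, which
 * removes the aliasing doubt that often blocks auto-vectorization. */
static void mul_block(const float *restrict a, const float *restrict b,
                      float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];   /* independent per-sample work */
}

/* Hypothetical helper: apply a fixed gain of one half.
 * The `f` suffix keeps the literal in single precision, so the multiply
 * is not promoted to double and converted back to float. */
static void half_gain(const float *restrict in, float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 0.5f;
}
```

Writing 0.5 instead of 0.5f in half_gain() would make the multiplication a double-precision one, which is the kind of float precision conversion katja mentions below.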
.hc

On 01/12/2013 04:45 PM, katja wrote:
> It's interesting, but rather compiler-and-processor-specific. Such
> code is maintenance-intensive. At the moment, ARM processors are
> screaming loudest for optimization. Best thing for a community project
> is probably plain C code which reckons with parallel processing,
> because that won't go away for the next few decades. Functions like
> copy_perform8(), times_perform8() etc. can profit from SIMD
> instructions without a need for compiler intrinsics and asm code.
> Well-structured data storage and access can make a 50 % or more
> performance gain, in my experience.
>
> Another important thing: avoid float precision conversions. Throughout
> Pd there are many untyped float defines and literal constants which
> default to double, and I have introduced more when making libs
> double-ready. Not good. I'll come back to this in another thread.
>
> Katja
>
>
> On Sat, Jan 12, 2013 at 8:14 PM, Hans-Christoph Steiner <[email protected]> wrote:
>>
>> If you are interested, there is still the hand-coded SIMD stuff from
>> pd-devel:
>> https://pure-data.svn.sourceforge.net/svnroot/pure-data/branches/pd-devel/v0-39
>>
>> .hc
>>
>> On 01/12/2013 09:34 AM, katja wrote:
>>> Function copy_perform8() is also eligible for SIMD processing. I used
>>> memcpy() because it is straightforward to use, while Pd's functions
>>> pointed to the wrong locations for this case. On the reverb's total
>>> load there is no significant performance difference.
>>>
>>> Katja
>>>
>>>
>>> On Sat, Jan 12, 2013 at 1:00 AM, Hans-Christoph Steiner <[email protected]> wrote:
>>>>
>>>> I recently learned that libc's memcpy actually uses things like SSE2 or
>>>> SSSE3, so it can be quite fast on CPUs from the past 10 years, especially
>>>> those of the last 5 years.
>>>>
>>>> It would be worth profiling to see if that's noticeable.
>>>>
>>>> .hc
>>>>
>>>> On 01/11/2013 05:12 PM, katja wrote:
>>>>> Ok so I did the ugly thing with the right channel input and output
>>>>> pointers:
>>>>>
>>>>> memcpy(outR, inR, vectorsize * sizeof(t_float));
>>>>> inR = outR;
>>>>>
>>>>> Works like a charm, thanks again.
>>>>>
>>>>> Katja
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 11, 2013 at 10:05 PM, Miller Puckette <[email protected]> wrote:
>>>>>> copy_perform assumes the data is 4-byte aligned so might save a test
>>>>>> or two compared to memcpy() - but I really don't know. I never
>>>>>> benchmarked the two against each other :)
>>>>>>
>>>>>> M
>>>>>>
>>>>>> On Fri, Jan 11, 2013 at 09:36:41PM +0100, katja wrote:
>>>>>>> Hi Miller,
>>>>>>>
>>>>>>> Thanks for the solution. The routines are in place so copying the
>>>>>>> right channel input to output should do it. Is there any reason to
>>>>>>> prefer copy_perform() over memcpy()? I'm trying to make the most
>>>>>>> efficient reverb for RPi & Co.
>>>>>>>
>>>>>>> Katja
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 11, 2013 at 7:57 PM, Miller Puckette <[email protected]> wrote:
>>>>>>>> Hi Katja -
>>>>>>>>
>>>>>>>> There's one example of this in sigfft_dspx() - a complex FFT that
>>>>>>>> 'natively' works on 2 signals in-place but has to deal with various
>>>>>>>> cases in which buffers get re-used. It's ugly but the basic idea is
>>>>>>>> first to get the inputs copied to the outputs (unless they're already
>>>>>>>> there in the correct order, in which case nothing needs to be done)
>>>>>>>> and then run the in-place algorithm.
>>>>>>>>
>>>>>>>> If the algo only works out-of-place (i.e. you need 4 distinct buffers,
>>>>>>>> 2 in and 2 out) the only way out is to (at least conditionally)
>>>>>>>> allocate temporary copies of the inputs before writing to any outputs.
>>>>>>>>
>>>>>>>> I may be able to add an optional way tilde objects can request that
>>>>>>>> output buffers be distinct from input ones sometime in the future -
>>>>>>>> but this is a couple of steps away for me right now :)
>>>>>>>>
>>>>>>>> M
>>>>>>>>
>>>>>>>> On Fri, Jan 11, 2013 at 03:32:09PM +0100, katja wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm working on a Pd class with stereo channels (reverb), and the
>>>>>>>>> routine happens to be most efficient when iterating over the samples
>>>>>>>>> per channel, instead of left and right together in the perform loop.
>>>>>>>>> However, when doing two while loops in one object, one for left and
>>>>>>>>> one for right, the right channel samples get overwritten because of
>>>>>>>>> sample-wise in-place computation. Is this an inescapable truth? I
>>>>>>>>> mean, I could write a left channel class and a right channel class
>>>>>>>>> (actually did that to verify that it works), but it's inconvenient to
>>>>>>>>> use. What could be an efficient way to get them in one object?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Katja
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> [email protected] mailing list
>>>>>>>>> UNSUBSCRIBE and account-management ->
>>>>>>>>> http://lists.puredata.info/listinfo/pd-list
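
For reference, the copy-then-process-in-place idea discussed in this thread can be sketched as below. This is only an illustration under stated assumptions, not Pd's code and not katja's reverb: stereo_perform() and smooth_inplace() are hypothetical names, plain float stands in for t_float, and a trivial one-pole smoother stands in for the per-channel reverb routine. It assumes that any two buffers are either identical or fully disjoint, and that outR is not the same buffer as inL (the swapped case would need extra handling).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one channel of processing: a one-pole
 * smoother run in place over a block. `state` holds the filter memory. */
static void smooth_inplace(float *buf, size_t n, float *state)
{
    float y = *state;
    for (size_t i = 0; i < n; i++) {
        y += 0.5f * (buf[i] - y);
        buf[i] = y;
    }
    *state = y;
}

/* Hypothetical stereo routine that iterates per channel.
 * Assumes outR != inL; see the note above the code. */
static void stereo_perform(const float *inL, const float *inR,
                           float *outL, float *outR, size_t n,
                           float *stateL, float *stateR)
{
    /* Rescue the right input first: outL may share memory with inR,
     * so the left pass below could otherwise overwrite it. */
    if (outR != inR)
        memcpy(outR, inR, n * sizeof(float));
    if (outL != inL)
        memcpy(outL, inL, n * sizeof(float));

    /* Both channels now live in their output buffers, so each
     * per-channel loop can run in place without touching the other. */
    smooth_inplace(outL, n, stateL);
    smooth_inplace(outR, n, stateR);
}
```

The pointer comparisons skip the copy when a channel is already in place, which also avoids calling memcpy() with identical source and destination.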
