On May 02 14:29:26, h...@stare.cz wrote:
> On May 02 13:04:54, s...@spacehopper.org wrote:
> > On 2024/05/01 21:04, Jan Stary wrote:
> > > On May 01 11:00:12, s...@spacehopper.org wrote:
> > > > On 2024/05/01 11:21, Jan Stary wrote:
> > > > > Hi,
> > > > > 
> > > > > On Apr 26 20:46:51, b...@comstyle.com wrote:
> > > > > > Implement SSE2 lrint() and lrintf() on amd64.
> > > > > 
> > > > > I don't think this is worth the added complexity:
> > > > > seven more patches to have a different lrint()?
> > > > > Does it make the resampling noticeably better/faster?
> > > > 
> > > > Playing with the benchmark mentioned in
> > > > https://github.com/libsndfile/libsamplerate/issues/187
> > > > suggests that it's going to be *hugely* faster with clang (and a bit
> > > > faster with gcc).
> > > 
> > > This talks about a MSVC build compared to a MinGW64 build on windows.
> > > Is that also relevant to an AMD64 build on OpenBSD? I just rebuilt
> > > with the diff - what would be a good way to test the actual performance
> > > before and after?
> > 
> > oh, actually it was the bench in the linked PR that I played with,
> > 
> > https://github.com/libsndfile/libsndfile/pull/663
> > -> https://quick-bench.com/q/OabKT-gEOZ8CYDriy1JEwq1lEsg
> > 
> > where there's a huge difference in clang builds.
> 
> Sorry, I don't understand at all how this concerns
> the OpenBSD port of libsamplerate: the benchmark does not
> mention an OS or an architecture, so what is this being run on?
> 
> Anyway, just running it (Run Benchmark) gives the result
> of cpu_time of 722.537 for BM_d2les_array (using lrint)
> and cpu_time of 0 for BM_d2les_array_sse2 (using psf_lrint),
> reporting a speedup ratio of 200,000,000.
> 
> That's not an example of what I have in mind: a simple application
> of libsamplerate, sped up by the usage of the new SSE2 lrint,
> as in Brad's diff.
> 
> I am not sure my naive test is a test at all, as it operates on floats,
> so perhaps lrint() never even comes into play. That is libsamplerate's
> "simple API" - I will try to come up with something that actually
> converts to ints while resampling. But maybe someone already has
> a good example of such a speedup.

OK, here is a test that's a modified version of what Stuart linked,
testing the performance of the lrint() itself (code below).

This is a current/amd64 PC, clang version 16.0.6
It is actually _slower_ than the standard version.

With the standard lrint():

0m10.90s real     0m03.75s user     0m05.66s system
0m11.01s real     0m03.56s user     0m05.96s system
0m10.99s real     0m03.67s user     0m05.82s system

With the SSE2 lrint():

0m12.92s real     0m05.74s user     0m05.63s system
0m12.77s real     0m05.15s user     0m06.12s system
0m12.66s real     0m05.57s user     0m05.62s system

Can you please confirm it is also slower
on your clang machine with SSE2?

        Jan



#include <immintrin.h>
#include <math.h>
#include <stdlib.h>

/* SSE2 version: convert with the cvtsd2si instruction */
static inline int
psf_lrint(double const x)
{
        return _mm_cvtsd_si32(_mm_load_sd(&x));
}

/* convert an array with the standard lrint() */
static void
d2l(const double *src, long *dst, size_t len)
{
        for (size_t i = 0; i < len; i++)
                dst[i] = lrint(src[i]);
}

/* convert an array with the SSE2 psf_lrint() */
static void
d2l_sse(const double *src, long *dst, size_t len)
{
        for (size_t i = 0; i < len; i++)
                dst[i] = psf_lrint(src[i]);
}

int
main(void)
{
        size_t i, len = 500000000;
        double *src = NULL;
        long *dst = NULL;

        src = calloc(len, sizeof(double));
        dst = calloc(len, sizeof(long));

        for (i = 0; i < len; i++) {
                /*src[i] = sin(i);*/
                src[i] = 0;
        }

        /* switch between d2l() and d2l_sse() for the two timings above */
        /*d2l(src, dst, len);*/
        d2l_sse(src, dst, len);

        return 0;
}
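
For a test that goes through libsamplerate itself - the "something that
actually converts to ints while resampling" above - a rough sketch is
below. If I read the library right, src_float_to_short_array() is one
place where the lrintf() from Brad's diff would get used; the signal,
the ratio and the buffer sizes here are arbitrary, and I have not timed
this.

#include <stdio.h>
#include <stdlib.h>
#include <samplerate.h>

int
main(void)
{
        size_t i, inlen = 1 << 24, outlen;
        float *in, *out;
        short *pcm;
        SRC_DATA sd = { 0 };

        /* arbitrary input signal: a ramp in [-1, 1) */
        in = calloc(inlen, sizeof(float));
        for (i = 0; i < inlen; i++)
                in[i] = (float)i / inlen * 2.0f - 1.0f;

        /* resample mono by an arbitrary ratio (48k -> 44.1k) */
        sd.src_ratio = 44100.0 / 48000.0;
        outlen = (size_t)(inlen * sd.src_ratio) + 1;
        out = calloc(outlen, sizeof(float));
        pcm = calloc(outlen, sizeof(short));

        sd.data_in = in;
        sd.data_out = out;
        sd.input_frames = inlen;
        sd.output_frames = outlen;

        if (src_simple(&sd, SRC_SINC_FASTEST, 1) != 0)
                return 1;

        /* the float -> short conversion is where lrintf() comes in */
        src_float_to_short_array(out, pcm, sd.output_frames_gen);

        printf("%ld frames\n", sd.output_frames_gen);
        return 0;
}

Built against audio/libsamplerate with and without the diff, this
should show whether the SSE2 lrintf() makes any difference in practice.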
