Hi Jeff,

You'll want to compile with optimization enabled; otherwise you're making the native `sqrt` slower than it would be in a real application. Add `-O2` or `-O3` to your compiler invocation. Also, you're working on floats, not doubles, so call `sqrtf` in your C code, not `sqrt`! (Your code is C, not necessarily how you'd write the same program in C++.)
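
To illustrate the `sqrt` versus `sqrtf` point, here's a minimal sketch in plain C (the helper names are made up by me, they're not from VOLK or libm). It only inspects result types: in C, `sqrt` takes and returns `double`, so a `float` argument gets promoted, while `sqrtf` stays in single precision. If you build the same file as C++, overloads can change this, so treat it as a C-only demonstration.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helpers, for illustration only.
 * sizeof does not evaluate its operand, so no square root is actually
 * computed here; we only look at the type of the call expression. */

/* Size of the result when a float is passed to sqrt(): the argument is
 * promoted to double and the result is a double. */
static size_t result_size_sqrt(float x)
{
    return sizeof(sqrt(x));
}

/* Size of the result when a float is passed to sqrtf(): float in, float out. */
static size_t result_size_sqrtf(float x)
{
    return sizeof(sqrtf(x));
}
```

In your benchmark loop the corresponding change would be `out[ii] = sqrtf(in[ii]);`, with something like `-O2` added to the compile command.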

Also, compared to the time the math itself takes, in both the VOLK and the libm sqrt case, your time measurement's uncertainty is large. (Taking the square root of only 16k values is nearly nothing.) You need to run that in a loop of many iterations, preferably with some warm-up to get the caches filled and the branch predictors trained. (Assuming branch prediction matters much on your CPU; on a small core like the ARM1176JZ-S it is far simpler than on modern ones, as far as I know.)
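
As a rough sketch of what I mean by "a loop with warm-up" (all names here are hypothetical, and `scale_by_two` is just a stand-in kernel so the example is self-contained): run the kernel a few times untimed, then time many calls and divide. I'm assuming a POSIX system with `clock_gettime` and `CLOCK_MONOTONIC` here.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature: any kernel that maps n floats from in to out. */
typedef void (*kernel_fn)(float* out, const float* in, unsigned int n);

/* Stand-in kernel for illustration; replace with the function under test. */
static void scale_by_two(float* out, const float* in, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}

/* Monotonic clock in seconds; unlike wall-clock time it never jumps back. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average seconds per call over `iters` timed calls, after `warmup`
 * untimed calls. */
static double seconds_per_call(kernel_fn fn, float* out, const float* in,
                               unsigned int n, unsigned int warmup,
                               unsigned int iters)
{
    assert(fn != NULL && iters > 0);

    for (unsigned int i = 0; i < warmup; ++i)
        fn(out, in, n);

    double start = now_seconds();
    for (unsigned int i = 0; i < iters; ++i)
        fn(out, in, n);
    double stop = now_seconds();

    /* Read one result through a volatile so the output is observably used;
     * this is a hint to the optimizer, not a guarantee. */
    volatile float sink = out[0];
    (void)sink;

    return (stop - start) / iters;
}
```

You would call it as, e.g., `seconds_per_call(scale_by_two, out, in, N, 10, 1000)`. As far as I can tell, `volk_32f_sqrt_32f` has a matching signature and could be passed in the same way, but I haven't tried that with this exact sketch.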

Hey, luckily your VOLK already ships with such a loop-running benchmark mockup: `volk_profile -R sqrt` will do exactly that. The `generic` implementation literally just calls `sqrtf`. Could you share the output of `volk_profile -R sqrt` with us?

Furthermore, I'm **highly** confused by your results: the ARM1176JZ-S is a 32 bit processor, developed somewhere in the early 2000s; so it is, by modern standards, a painfully slow 32 bit armv6 CPU. It predates both aarch64 and NEON! So, I'm pretty sure that either cpu_features is wrong, or this is not the CPU you're using. In this rare case, I think it's you who is wrong and not the software, because you're also using a /usr/local/lib64 library path, which would quite unambiguously point to a 64 bit OS, which couldn't run on an ARM11.

Could you double-check and *confirm* you're using an ARM1176JZ-S processor? If you are, are you perhaps running this with qemu-aarch64 on your armv6 (32 bit!) machine? Can you send us the `volk_sqrt` binary you're getting, or at least share what `file volk_sqrt` says about it? We would then need to help you file a bug upstream against cpu_features, because it'd be impossible for us to build a working VOLK if cpu_features goes and miscategorizes an ancient 32 bit machine as aarch64.
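
One quick cross-check that is independent of cpu_features: ask the compiler what it is building for. The sketch below relies on the usual GCC/Clang predefined macros (`__aarch64__`, `__arm__`, `__ARM_NEON`); the function names are hypothetical. Note that it reports what the toolchain targets, not what the silicon actually is, so it complements `file volk_sqrt` rather than replacing it.

```c
#include <limits.h>
#include <stdio.h>

/* Name of the instruction set the compiler is generating code for,
 * judged by the predefined macros of GCC and Clang. */
static const char* target_arch(void)
{
#if defined(__aarch64__)
    return "aarch64 (64 bit ARM)";
#elif defined(__arm__)
    return "arm (32 bit ARM)";
#elif defined(__x86_64__)
    return "x86_64";
#else
    return "other";
#endif
}

/* Pointer width in bits: 32 in a 32 bit ARM userland, 64 on aarch64. */
static unsigned int pointer_bits(void)
{
    return (unsigned int)(sizeof(void*) * CHAR_BIT);
}

/* 1 if the compiler assumes NEON / Advanced SIMD is available, else 0. */
static int built_with_neon(void)
{
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* Prints one summary line, e.g. to paste into a mail. */
static void print_target(void)
{
    printf("target=%s pointer_bits=%u neon=%d\n",
           target_arch(), pointer_bits(), built_with_neon());
}
```

Calling `print_target()` from `main` on the machine in question should settle whether you have a 32 bit or a 64 bit userland.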

Best regards,
Marcus

On 08.10.23 00:22, Jeff R wrote:

I modified a simple Volk sqrt program for an ARM1176JZ-S processor to test performance, and the results are puzzling. The following program prints:

dur_VolkSqrt=(0.000000)0.001721 dur_CRTLSqrt=(0.000000)0.000318

The following processor information is displayed. It appears as though NEON is supported.


~/volk-3.0.0/build# cpu_features/list_cpu_features
arch : aarch64
implementer :  65 (0x41)
variant :   0 (0x00)
part : 3336 (0xD08)
revision :   3 (0x03)
flags : asimd,cpuid,crc32,fp

Why are the numbers so slow for Volk versus the CRTL? I may be missing something obvious. Thank you in advance.

Here’s the test program:

// g++ -I /usr/local/include/volk volk_sqrt.cpp -o volk_sqrt -L /usr/local/lib64/ -lvolk
// export LD_LIBRARY_PATH=/usr/local/lib64; ./volk_sqrt


#include <stdio.h>
#include <math.h>
#include <volk.h>
#include <limits.h>
#include <time.h>
#include <sys/time.h>

double get_wall_time()
{
    struct timeval time;

    if (gettimeofday(&time,NULL))
    {
        //  Handle error
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

int main(int argc, char* args[])
{
    double walStop;
    double walStart;
    double dur_VolkSqrt;
    double dur_CRTLSqrt;
    int N = 1024*16;

    unsigned int alignment = volk_get_alignment();
    float* in = (float*)volk_malloc(sizeof(float)*N, alignment);
    float* out = (float*)volk_malloc(sizeof(float)*N, alignment);

    for(unsigned int ii = 0; ii < N; ++ii)
    {
        in[ii] = (float)(ii*ii);
    }

    walStart = get_wall_time();
    volk_32f_sqrt_32f_a(out, in, N);
    //volk_32f_sqrt_32f(out, in, N);
    walStop = get_wall_time();
    dur_VolkSqrt = walStop - walStart;

    walStart = get_wall_time();
    for(unsigned int ii = 0; ii < N; ++ii)
    {
        out[ii] = sqrt(in[ii]);
    }
    walStop = get_wall_time();
    dur_CRTLSqrt = walStop - walStart;

    printf("dur_VolkSqrt=(%f)%f dur_CRTLSqrt=(%f)%f\n", dur_VolkSqrt/N, dur_VolkSqrt, dur_CRTLSqrt/N, dur_CRTLSqrt);
    volk_free(in);
    volk_free(out);
    return 0;
}
