Re: Benchmark memchar (with GCC builtins)

2015-11-03 Thread Andrei Alexandrescu via Digitalmars-d

On 11/02/2015 09:33 PM, Iakh wrote:

-
Naive:21.46 TickDuration(132842482)
SIMD: 1.161 TickDuration(7188211)
(was)SIMD: 3.04 TickDuration(18920182)
C:1 TickDuration(6189222)


Looks like the current memchr is well optimized. Not much blood left in 
that stone. -- Andrei




Re: Benchmark memchar (with GCC builtins)

2015-11-02 Thread Iakh via Digitalmars-d
On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu 
wrote:
Could you please take a look at GCC's generated code and 
implementation of memchr? -- Andrei


So i did. I rewrite code to do main work in cacheLineSize chunks. 
And this

is what GLIBC version do.
So main loop looks this:

-
do
{
// ptr16 is aligned 64
ubyte16 r1 = __builtin_ia32_pcmpeqb128(ptr16[0], niddles);
ubyte16 r2 = __builtin_ia32_pcmpeqb128(ptr16[1], niddles);
ubyte16 r3 = __builtin_ia32_pcmpeqb128(ptr16[2], niddles);
ubyte16 r4 = __builtin_ia32_pcmpeqb128(ptr16[3], niddles);

r3 = __builtin_ia32_pmaxub128(r1, r3);
r4 = __builtin_ia32_pmaxub128(r2, r4);
r4 = __builtin_ia32_pmaxub128(r3, r4);
mask = __builtin_ia32_pmovmskb128(r4);

if (mask != 0)
{
mask = __builtin_ia32_pmovmskb128(r1);
mixin(CheckMask); // Check and return value

++ptr16; num -= 16;
mask = __builtin_ia32_pmovmskb128(r2);
mixin(CheckMask);

++ptr16; num -= 16;
r3 = __builtin_ia32_pcmpeqb128(*ptr16, niddles);
mask = __builtin_ia32_pmovmskb128(r3);
mixin(CheckMask);

++ptr16; num -= 16;
r4 = __builtin_ia32_pcmpeqb128(*ptr16, niddles);
mask = __builtin_ia32_pmovmskb128(r4);
mixin(CheckMask);
}

num -= 64;
ptr16 += 4;
}
while (num > 0);
-

and my best result:

-
Naive:21.46 TickDuration(132842482)
SIMD: 1.161 TickDuration(7188211)
(was)SIMD: 3.04 TickDuration(18920182)
C:1 TickDuration(6189222)



Re: Benchmark memchar (with GCC builtins)

2015-10-31 Thread rsw0x via Digitalmars-d

On Friday, 30 October 2015 at 21:29:47 UTC, Iakh wrote:

...


I got it to 1.5 the running time of C using SSE2 but couldn't get 
GDC to emit the correct aligned loads, if I used 
__builtin_assume_aligned the optimizer started being really off.


Re: Benchmark memchar (with GCC builtins)

2015-10-31 Thread Dmitry Olshansky via Digitalmars-d

On 31-Oct-2015 00:29, Iakh wrote:

I continue to play with SIMD. So I was trying to use std.simd
But it has lots of thing to be implemented. And I also gave up with
  core.simd.__simd due to problems with PMOVMSKB instruction (it is not
implemented).

Today I was playing with memchr for gdc:
memchr: http://www.cplusplus.com/reference/cstring/memchr/
My implementations with benchmark:
http://dpaste.dzfl.pl/4c46c0cf340c

Benchmark results:
-
Naive:21.9 TickDuration(136456491)
SIMD: 3.04 TickDuration(18920182)
SIMDM:2.44 TickDuration(15232176)
SIMDU: 1.8 TickDuration(11210454)
C:   1 TickDuration(6233963)



More or less matches my experience with C's memchr vs something else - 
it's fastest portable solution to find a byte in a chunk of memory.


--
Dmitry Olshansky


Re: Benchmark memchar (with GCC builtins)

2015-10-31 Thread Iakh via Digitalmars-d

On Saturday, 31 October 2015 at 08:37:23 UTC, rsw0x wrote:

I got it to 1.5 the running time of C using SSE2 but couldn't


Can you share your solution?


Re: Benchmark memchar (with GCC builtins)

2015-10-31 Thread Iakh via Digitalmars-d
On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu 
wrote:
Could you please take a look at GCC's generated code and 
implementation of memchr? -- Andrei


Copy-and-paste from glibc's memchr(runGLibC) gaves the result 
below.

-
Naive: 21.4 TickDuration(132485705)
SIMD:  3.17 TickDuration(19629892)
SIMDM: 2.49 TickDuration(15420462)
C:1 TickDuration(6195504)
runGLibC:  4.32 TickDuration(26782585)
SIMDU:  1.8 TickDuration(11128618)

ASM shows memchr is realy called. There is neither compiler magic 
nor

local memchr imlementation.
Aligned versions of memchr use aligned load from memory and 
unaligned

one uses unaligned load. So at this point optimisation done well.


Benchmark memchar (with GCC builtins)

2015-10-30 Thread Iakh via Digitalmars-d

I continue to play with SIMD. So I was trying to use std.simd
But it has lots of thing to be implemented. And I also gave up 
with
 core.simd.__simd due to problems with PMOVMSKB instruction (it 
is not implemented).


Today I was playing with memchr for gdc:
memchr: http://www.cplusplus.com/reference/cstring/memchr/
My implementations with benchmark:
http://dpaste.dzfl.pl/4c46c0cf340c

Benchmark results:
-
Naive:21.9  TickDuration(136456491)
SIMD: 3.04  TickDuration(18920182)
SIMDM:2.44  TickDuration(15232176)
SIMDU: 1.8  TickDuration(11210454)
C:   1  TickDuration(6233963)

Mid colon is duration relative to C implementation 
(core.stdc.string).


memchrSIMD splits an input into three parts: unaligned begin, 
unaligned end, and aligned mid.


memchrSIMDM instead of pmovmskb use this code:
--
if (Mask mask = *cast(Mask*)(result.array.ptr))
{
return ptr + bsf(mask) / BitsInByte;
}
else if (Mask mask = *cast(Mask*)(result.array.ptr + 
Mask.sizeof))

{
return ptr + bsf(mask) / BitsInByte + 
cast(int)Mask.sizeof;

}
--

memchrSIMDU (unaligned) applay SIMD instructions from first array 
elements


SIMD part of function:
--
ubyte16 niddles;
niddles.ptr[0..16] = value;
ubyte16 result;
ubyte16 arr;

for (; ptr < alignedEnd; ptr += 16)
{
arr.ptr[0..16] = ptr[0..16];
result = __builtin_ia32_pcmpeqb128(arr, niddles);
int i = __builtin_ia32_pmovmskb128(result);
if (i != 0)
{
return ptr + bsf(i);
}
}
--


Re: Benchmark memchar (with GCC builtins)

2015-10-30 Thread Iakh via Digitalmars-d
On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu 
wrote:
Could you please take a look at GCC's generated code and 
implementation of memchr? -- Andrei


glibc uses something like pseudo-SIMD with ordinal x86 
instructions (XOR magic, etc).

Deap comarison I left for next time :)


Re: Benchmark memchar (with GCC builtins)

2015-10-30 Thread Andrei Alexandrescu via Digitalmars-d

On 10/30/2015 05:29 PM, Iakh wrote:

I continue to play with SIMD. So I was trying to use std.simd
But it has lots of thing to be implemented. And I also gave up with
  core.simd.__simd due to problems with PMOVMSKB instruction (it is not
implemented).

Today I was playing with memchr for gdc:
memchr: http://www.cplusplus.com/reference/cstring/memchr/
My implementations with benchmark:
http://dpaste.dzfl.pl/4c46c0cf340c

Benchmark results:
-
Naive:21.9 TickDuration(136456491)
SIMD: 3.04 TickDuration(18920182)
SIMDM:2.44 TickDuration(15232176)
SIMDU: 1.8 TickDuration(11210454)
C:   1 TickDuration(6233963)

Mid colon is duration relative to C implementation (core.stdc.string).


Could you please take a look at GCC's generated code and implementation 
of memchr? -- Andrei




Re: Benchmark memchar (with GCC builtins)

2015-10-30 Thread Marco Leise via Digitalmars-d
Am Fri, 30 Oct 2015 22:31:54 +
schrieb Iakh :

> On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu 
> wrote:
> > Could you please take a look at GCC's generated code and 
> > implementation of memchr? -- Andrei
> 
> glibc uses something like pseudo-SIMD with ordinal x86 
> instructions (XOR magic, etc).
> Deap comarison I left for next time :)

Instead of providing a library implementation, these functions
are often best left to compiler intrinsics which can provide
one of several instruction sequences based on the arguments
and the target CPU. In particular compilers often know runtime
length arguments from const-folding and can chose to use the
best matching SIMD instruction set if compiling for a specific
CPU. In regular D or C code you don't have access to this
information.

GlibC might use that particular implementation in its source,
but GCC will override these generic functions with builtin
intrinsics unless you disable them via command-line switch
(-fno-builtins). Same goes for e.g. ctlz (count leading
zeroes): It will be emulated somehow if compiled for older
CPUs and use fast native instructions on more recent CPUs.
Yes, sometimes the intrinsics are lacking, but in general I
trust the compiler devs more than myself, especially since
they likely tested it on several target architectures while I
mostly only do one.

Just ask yourself how you would select the optimal memchr
function matching the -march= (gdc) or -mcpu (ldc) flags at
compile-time.

-- 
Marco