Re: Playing SIMD

2015-10-28 Thread Marco Leise via Digitalmars-d
Am Mon, 26 Oct 2015 14:04:18 +0100
schrieb Iain Buclaw via Digitalmars-d
:

> > Yeah but PMOVMSKB not implemented in core.simd.
> >
> 
> Don't use core.simd, push for getting std.simd in, then leverage the
> generics exposed through that module.
 
Yeah, but PMOVMSKB is not implemented in std.simd.

-- 
Marco



Re: Playing SIMD

2015-10-28 Thread Andrei Alexandrescu via Digitalmars-d

On 10/28/2015 10:27 AM, Marco Leise wrote:

Am Mon, 26 Oct 2015 14:04:18 +0100
schrieb Iain Buclaw via Digitalmars-d
:


Yeah but PMOVMSKB not implemented in core.simd.



Don't use core.simd, push for getting std.simd in, then leverage the
generics exposed through that module.


Yeah, but PMOVMSKB is not implemented in std.simd.


Guess an enh report would help remembering the matter. -- Andrei



Re: Playing SIMD

2015-10-26 Thread John Colvin via Digitalmars-d

On Monday, 26 October 2015 at 11:10:31 UTC, Iakh wrote:

On Monday, 26 October 2015 at 09:49:00 UTC, Iakh wrote:

On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
runBinary calls naiveIndexOf. You're not testing 
binaryIndexOf.


You are right.
This is fixed example:
http://dpaste.dzfl.pl/f7a54b789a21

and results at dpaste.dzfl.pl:
-
SIMD:   TickDuration(151000)
Binary: TickDuration(255000)
Naive:  TickDuration(459000)

So SIMD version ~1.68 faster than binary


At home with defult dub config "dub run --build=release":
-
SIMD:TickDuration(350644)
Binary:  TickDuration(434014)
Naive:   TickDuration(657548)


~1.24 times faster than binary and
~1.87 times faster than naive


It's good to see work like this. Have you tried with gdc/ldc 
(-march=native / -mcpu=native respectively if you want to use the 
full available instruction set)?


Re: Playing SIMD

2015-10-26 Thread Andrei Alexandrescu via Digitalmars-d

On 10/26/2015 05:48 AM, Iakh wrote:

On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:

runBinary calls naiveIndexOf. You're not testing binaryIndexOf.


You are right.
This is fixed example:
http://dpaste.dzfl.pl/f7a54b789a21

and results at dpaste.dzfl.pl:
-
SIMD:   TickDuration(151000)
Binary: TickDuration(255000)
Naive:  TickDuration(459000)

So SIMD version ~1.68 faster than binary


That's a healthy margin. It may get eroded by startup/finish codes that 
need to get to the first aligned chunk and handle the misaligned data at 
the end, etc. But it's a solid proof of concept. -- Andrei


Re: Playing SIMD

2015-10-26 Thread Iakh via Digitalmars-d

On Monday, 26 October 2015 at 09:49:00 UTC, Iakh wrote:

On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:

runBinary calls naiveIndexOf. You're not testing binaryIndexOf.


You are right.
This is fixed example:
http://dpaste.dzfl.pl/f7a54b789a21

and results at dpaste.dzfl.pl:
-
SIMD:   TickDuration(151000)
Binary: TickDuration(255000)
Naive:  TickDuration(459000)

So SIMD version ~1.68 faster than binary


At home with defult dub config "dub run --build=release":
-
SIMD:TickDuration(350644)
Binary:  TickDuration(434014)
Naive:   TickDuration(657548)


~1.24 times faster than binary and
~1.87 times faster than naive


Re: Playing SIMD

2015-10-26 Thread Iakh via Digitalmars-d

On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:

runBinary calls naiveIndexOf. You're not testing binaryIndexOf.


You are right.
This is fixed example:
http://dpaste.dzfl.pl/f7a54b789a21

and results at dpaste.dzfl.pl:
-
SIMD:   TickDuration(151000)
Binary: TickDuration(255000)
Naive:  TickDuration(459000)

So SIMD version ~1.68 faster than binary


Re: Playing SIMD

2015-10-26 Thread Iakh via Digitalmars-d

On Monday, 26 October 2015 at 12:35:39 UTC, Don wrote:


You need to be very careful with doing benchmarks on tiny test 
cases, they can be very misleading.


Can you advice some reading about benchmark?


Re: Playing SIMD

2015-10-26 Thread Kagamin via Digitalmars-d

On Monday, 26 October 2015 at 12:10:41 UTC, rumbu wrote:

Interpolation search is even faster :)

http://dpaste.dzfl.pl/4443c5753454


For a funny name it's called Newton search: you can use 
interpolation of higher order, can be more efficient if the data 
has a trend.


Re: Playing SIMD

2015-10-26 Thread Iakh via Digitalmars-d
On Monday, 26 October 2015 at 11:47:56 UTC, Andrei Alexandrescu 
wrote:

On 10/26/2015 05:48 AM, Iakh wrote:

On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
runBinary calls naiveIndexOf. You're not testing 
binaryIndexOf.


You are right.
This is fixed example:
http://dpaste.dzfl.pl/f7a54b789a21

and results at dpaste.dzfl.pl:
-
SIMD:   TickDuration(151000)
Binary: TickDuration(255000)
Naive:  TickDuration(459000)

So SIMD version ~1.68 faster than binary


That's a healthy margin. It may get eroded by startup/finish 
codes that need to get to the first aligned chunk and handle 
the misaligned data at the end, etc. But it's a solid proof of 
concept. -- Andrei


(Binary)Searching in large sorted continuous array e. g. 1 MB of 
bytes

1 MB = 2 ^^ 20 bytes
Imagine 4 binary levels pass in 4 ticks. So 20 levels in 20 
ticks. With SIMD last 4 levels would be done in 2 or 3 ticks. For 
20 levels 18 - 19 ticks. So overall improvement is 5-10%. 
Furthermore think of cache and memory pages misses on firs levels.


IMO SIMD needed for unsorted data(strings?) or for trees.


Re: Playing SIMD

2015-10-26 Thread Iain Buclaw via Digitalmars-d
On 26 Oct 2015 1:40 pm, "Don via Digitalmars-d" 
wrote:
>
> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>>
>> Here is my implementatation of SIMD find. Function returns index of
ubyte in static 16 byte array with unique values.
>
>
> [snip]
>
> You need to be very careful with doing benchmarks on tiny test cases,
they can be very misleading.
>
> Be aware that the speed of bsf() and bsr() is very very strongly
processor dependent. On some machines, it is utterly pathetic.

Some compilers are smart about uses of bsf and bsr.  :-)


Re: Playing SIMD

2015-10-26 Thread rumbu via Digitalmars-d
On Monday, 26 October 2015 at 11:47:56 UTC, Andrei Alexandrescu 
wrote:

On 10/26/2015 05:48 AM, Iakh wrote:

On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
runBinary calls naiveIndexOf. You're not testing 
binaryIndexOf.


You are right.
This is fixed example:
http://dpaste.dzfl.pl/f7a54b789a21

and results at dpaste.dzfl.pl:
-
SIMD:   TickDuration(151000)
Binary: TickDuration(255000)
Naive:  TickDuration(459000)

So SIMD version ~1.68 faster than binary


That's a healthy margin. It may get eroded by startup/finish 
codes that need to get to the first aligned chunk and handle 
the misaligned data at the end, etc. But it's a solid proof of 
concept. -- Andrei


Interpolation search is even faster :)

http://dpaste.dzfl.pl/4443c5753454

-
SIMD:  TickDuration(144000)
Binary:TickDuration(229000)
Naive: TickDuration(472000)
Interpolation: TickDuration(92000)
-
SIMD:  TickDuration(145000)
Binary:TickDuration(247000)
Naive: TickDuration(463000)
Interpolation: TickDuration(91000)





Re: Playing SIMD

2015-10-26 Thread Iain Buclaw via Digitalmars-d
On 25 Oct 2015 11:50 pm, "Iakh via Digitalmars-d" <
digitalmars-d@puremagic.com> wrote:
>
> On Sunday, 25 October 2015 at 22:17:58 UTC, Matthias Bentrup wrote:
>>
>> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>>>
>>> Is it optimal and how do you implement this stuf?
>>>
>>
>> I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result
in memory and you need only one BSF instruction.
>
>
> Yeah but PMOVMSKB not implemented in core.simd.
>

Don't use core.simd, push for getting std.simd in, then leverage the
generics exposed through that module.


Re: Playing SIMD

2015-10-26 Thread Don via Digitalmars-d

On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
Here is my implementatation of SIMD find. Function returns 
index of ubyte in static 16 byte array with unique values.


[snip]

You need to be very careful with doing benchmarks on tiny test 
cases, they can be very misleading.


Be aware that the speed of bsf() and bsr() is very very strongly 
processor dependent. On some machines, it is utterly pathetic. eg 
AMD K7, BSR is 23 micro-operations, on original pentium is was up 
to 73 (!), even on AMD Bobcat it is 11 micro-ops, but on recent 
Intel it is one micro-op. This fact of 73 can totally screw up 
your performance comparisons.


Just because it is a single machine instruction does not mean it 
is fast.




Re: Playing SIMD

2015-10-26 Thread Andrei Alexandrescu via Digitalmars-d

On 10/26/2015 08:35 AM, Don wrote:

On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:

Here is my implementatation of SIMD find. Function returns index of
ubyte in static 16 byte array with unique values.


[snip]

You need to be very careful with doing benchmarks on tiny test cases,
they can be very misleading.

Be aware that the speed of bsf() and bsr() is very very strongly
processor dependent. On some machines, it is utterly pathetic. eg AMD
K7, BSR is 23 micro-operations, on original pentium is was up to 73 (!),
even on AMD Bobcat it is 11 micro-ops, but on recent Intel it is one
micro-op. This fact of 73 can totally screw up your performance
comparisons.

Just because it is a single machine instruction does not mean it is fast.


One other note: don't compare with binary search, it's not an 
appropriate baseline. You should use it only if you implemented 
SIMD-based binary search.


Good baselines: std.find, memchr, a naive version with pointers (no 
bounds checking).



Andrei




Re: Playing SIMD

2015-10-26 Thread Iakh via Digitalmars-d

On Monday, 26 October 2015 at 15:03:12 UTC, Iakh wrote:

(Binary)Searching in large sorted continuous array e. g. 1 MB 
of bytes

1 MB = 2 ^^ 20 bytes
Imagine 4 binary levels pass in 4 ticks. So 20 levels in 20 
ticks. With SIMD last 4 levels would be done in 2 or 3 ticks. 
For 20 levels 18 - 19 ticks. So overall improvement is 5-10%. 
Furthermore think of cache and memory pages misses on firs 
levels.


IMO SIMD needed for unsorted data(strings?) or for trees.


But yeah... Who needs 1000_000 0..256 values(sorted :D )?


Re: Playing SIMD

2015-10-25 Thread Matthias Bentrup via Digitalmars-d

On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:

Is it optimal and how do you implement this stuf?



I think it's better to use PMOVMSKB to avoid storing the PCMPEQB 
result in memory and you need only one BSF instruction.


Re: Playing SIMD

2015-10-25 Thread anonymous via Digitalmars-d

On 25.10.2015 20:37, Iakh wrote:

Full exaple with comparation of algorithms (SIMD, naive, binary search):
http://dpaste.dzfl.pl/f3b8989841e3


From there:

void runBinary()
{
static int i = 0;
naiveIndexOf(cast(ubyte)(i++/ArraySize + 1), arr);
}


runBinary calls naiveIndexOf. You're not testing binaryIndexOf.


Re: Playing SIMD

2015-10-25 Thread Iakh via Digitalmars-d
On Sunday, 25 October 2015 at 22:17:58 UTC, Matthias Bentrup 
wrote:

On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:

Is it optimal and how do you implement this stuf?



I think it's better to use PMOVMSKB to avoid storing the 
PCMPEQB result in memory and you need only one BSF instruction.


Yeah but PMOVMSKB not implemented in core.simd.

Bit more comprehensive here:
http://forum.dlang.org/post/20150923115833.054fdb09@marco-toshiba


Re: Playing SIMD

2015-10-25 Thread Andrei Alexandrescu via Digitalmars-d

On 10/25/15 6:57 PM, Iakh wrote:

On Sunday, 25 October 2015 at 21:13:56 UTC, Andrei Alexandrescu wrote:

[...]
This is compelling but needs a bit of work to integrate. Care to work
on a PR that makes std.algorithm use it? Thanks! -- Andrei


First of all I need sort of investigation about PRs and std.algorithm.
But in general challenge accepted.


Thanks! -- Andrei


Re: Playing SIMD

2015-10-25 Thread Iakh via Digitalmars-d
On Sunday, 25 October 2015 at 21:13:56 UTC, Andrei Alexandrescu 
wrote:

[...]
This is compelling but needs a bit of work to integrate. Care 
to work on a PR that makes std.algorithm use it? Thanks! -- 
Andrei


First of all I need sort of investigation about PRs and 
std.algorithm. But in general challenge accepted.


Re: Playing SIMD

2015-10-25 Thread Andrei Alexandrescu via Digitalmars-d

On 10/25/2015 03:37 PM, Iakh wrote:

Here is my implementatation of SIMD find. Function returns index of
ubyte in static 16 byte array with unique values.

--
immutable size_t ArraySize = 16;
int simdIndexOf(ubyte niddle, ref const ubyte[ArraySize] haystack)
{
 ubyte16 arr;
 arr.array = haystack[];
 ubyte16 niddles;
 niddles.array[] = niddle;
 ubyte16 result;
 result = __simd_sto(XMM.PCMPEQB, arr, niddles);
 alias Mask = ulong;
 static assert(2 * Mask.sizeof == result.sizeof);
 immutable BitsInByte = 8;

 if (Mask mask = *cast(Mask*)(result.array.ptr))
 {
 return bsf(mask) / BitsInByte;
 }
 else if (Mask mask = *cast(Mask*)(result.array.ptr + Mask.sizeof))
 {
 return bsf(mask) / BitsInByte + cast(int)Mask.sizeof;
 }
 else
 {
 return -1;
 }

}
--

Is it optimal and how do you implement this stuf?

Full exaple with comparation of algorithms (SIMD, naive, binary search):
http://dpaste.dzfl.pl/f3b8989841e3

Benchmark result on dpaste.dzfl.pl:
SIMD:   TickDuration(157000)
Binary: TickDuration(472000)
Naive:  TickDuration(437000)

At home with defult dub config "dub run --build=release":
SIMD:TickDuration(241566)
Binary:  TickDuration(450515)
Naive:   TickDuration(450371)


This is compelling but needs a bit of work to integrate. Care to work on 
a PR that makes std.algorithm use it? Thanks! -- Andrei




Playing SIMD

2015-10-25 Thread Iakh via Digitalmars-d
Here is my implementatation of SIMD find. Function returns index 
of ubyte in static 16 byte array with unique values.


--
immutable size_t ArraySize = 16;
int simdIndexOf(ubyte niddle, ref const ubyte[ArraySize] haystack)
{
ubyte16 arr;
arr.array = haystack[];
ubyte16 niddles;
niddles.array[] = niddle;
ubyte16 result;
result = __simd_sto(XMM.PCMPEQB, arr, niddles);
alias Mask = ulong;
static assert(2 * Mask.sizeof == result.sizeof);
immutable BitsInByte = 8;

if (Mask mask = *cast(Mask*)(result.array.ptr))
{
return bsf(mask) / BitsInByte;
}
else if (Mask mask = *cast(Mask*)(result.array.ptr + 
Mask.sizeof))

{
return bsf(mask) / BitsInByte + cast(int)Mask.sizeof;
}
else
{
return -1;
}

}
--

Is it optimal and how do you implement this stuf?

Full exaple with comparation of algorithms (SIMD, naive, binary 
search):

http://dpaste.dzfl.pl/f3b8989841e3

Benchmark result on dpaste.dzfl.pl:
SIMD:   TickDuration(157000)
Binary: TickDuration(472000)
Naive:  TickDuration(437000)

At home with defult dub config "dub run --build=release":
SIMD:TickDuration(241566)
Binary:  TickDuration(450515)
Naive:   TickDuration(450371)