Re: Any SIMD experts?

2014-12-10 Thread Martin Nowak via Digitalmars-d

On 12/09/2014 05:22 PM, John Colvin wrote:


which of course Kenji already has a pull for, less than 3 hours later :)


It's right on time, it's right on time


Re: Any SIMD experts?

2014-12-10 Thread John Colvin via Digitalmars-d
On Wednesday, 10 December 2014 at 15:53:40 UTC, Martin Nowak 
wrote:

On 12/09/2014 05:22 PM, John Colvin wrote:


which of course Kenji already has a pull for, less than 3 
hours later :)


It's right on time, it's right on time


Still doesn't work though: 
https://issues.dlang.org/show_bug.cgi?id=13852


you can't even use core.imd.__simd because it doesn't have 
PCMPGTQ https://issues.dlang.org/show_bug.cgi?id=8047


Re: Any SIMD experts?

2014-12-09 Thread via Digitalmars-d

On Monday, 8 December 2014 at 16:44:37 UTC, Martin Nowak wrote:
actually slowed down some GC benchmarks by 3%. The branch 
predictor had more trouble with the single branch because that 
resulted in a fifty-fifty chance. There is some correlation


Consider rewriting the code to use non-branching comparison, i.e. 
use the condition test result directly in an expression or cmov.


Re: Any SIMD experts?

2014-12-09 Thread John Colvin via Digitalmars-d

On Monday, 8 December 2014 at 22:10:09 UTC, Martin Nowak wrote:

On 12/08/2014 06:05 PM, John Colvin wrote:

Well gcc gives me:


Tried that with dmd, it gave me.

bug.d(5): Error: incompatible types for ((a) = (l)): 
'__vector(ulong[4])' and '__vector(ulong[4])'
bug.d(5): Error: incompatible types for ((a)  (h)): 
'__vector(ulong[4])' and '__vector(ulong[4])'


Yours looks better.


It's unfortunate it can't work it out automatically like gcc can.

Anyway, you can write it out manually:

auto foo(ulong2 a, ulong2 l, ulong h)
{
static long2 shift = long.min; //-2^63
long2 aS = cast(long2)(a - shift);
long2 lS = cast(long2)(l - shift);
long2 hS = cast(long2)(h - shift);
return (hS  aS)  (lS  aS);
}

but that doesn't actually compile, the compiler just sits in 
semantic3 forever (see 
https://issues.dlang.org/show_bug.cgi?id=13841)


Re: Any SIMD experts?

2014-12-09 Thread John Colvin via Digitalmars-d

On Tuesday, 9 December 2014 at 12:36:29 UTC, John Colvin wrote:

On Monday, 8 December 2014 at 22:10:09 UTC, Martin Nowak wrote:

On 12/08/2014 06:05 PM, John Colvin wrote:

Well gcc gives me:


Tried that with dmd, it gave me.

bug.d(5): Error: incompatible types for ((a) = (l)): 
'__vector(ulong[4])' and '__vector(ulong[4])'
bug.d(5): Error: incompatible types for ((a)  (h)): 
'__vector(ulong[4])' and '__vector(ulong[4])'


Yours looks better.


It's unfortunate it can't work it out automatically like gcc 
can.


Anyway, you can write it out manually:

auto foo(ulong2 a, ulong2 l, ulong h)
{
static long2 shift = long.min; //-2^63
long2 aS = cast(long2)(a - shift);
long2 lS = cast(long2)(l - shift);
long2 hS = cast(long2)(h - shift);
return (hS  aS)  (lS  aS);
}

but that doesn't actually compile, the compiler just sits in 
semantic3 forever (see 
https://issues.dlang.org/show_bug.cgi?id=13841)


which of course Kenji already has a pull for, less than 3 hours 
later :)


Any SIMD experts?

2014-12-08 Thread Martin Nowak via Digitalmars-d

I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time.

ulong2 vval = [v0, v1];
ulong2 vlow = [low, low];
ulong2 vhigh = [high, high];

int res = PMOVMSKB(vval = vlow  vval  vhigh);

I figured out sort of a solution, but it seems way too complicated, 
because there is only signed comparison.


Usually (scalar) I'd use this, which makes use of unsigned wrap to safe 
one conditional


immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow)  size) {}
if (cast(ulong)(v1 - vlow)  size) {}

over

if (v0 = vlow  v0  vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin


Re: Any SIMD experts?

2014-12-08 Thread Martin Nowak via Digitalmars-d

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
Usually (scalar) I'd use this, which makes use of unsigned wrap 
to safe one conditional


immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow)  size) {}
if (cast(ulong)(v1 - vlow)  size) {}

over

if (v0 = vlow  v0  vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin


This is for the most performance critical instructions during GC 
marking.
If we can come up with some good SIMD this will result in a good 
speedup.


Yesterday I was surprised to learn that my unsigned wrap trick 
actually slowed down some GC benchmarks by 3%. The branch 
predictor had more trouble with the single branch because that 
resulted in a fifty-fifty chance. There is some correlation 
between the 2 branch bounds checks and one of them could be 
predicted fairly well, resulting in a better combined result.

Always profile!


Re: Any SIMD experts?

2014-12-08 Thread John Colvin via Digitalmars-d

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
a time.


ulong2 vval = [v0, v1];
ulong2 vlow = [low, low];
ulong2 vhigh = [high, high];

int res = PMOVMSKB(vval = vlow  vval  vhigh);

I figured out sort of a solution, but it seems way too 
complicated, because there is only signed comparison.


Usually (scalar) I'd use this, which makes use of unsigned wrap 
to safe one conditional


immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow)  size) {}
if (cast(ulong)(v1 - vlow)  size) {}

over

if (v0 = vlow  v0  vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin


I don't quite understand what you mean by save one conditional.


Re: Any SIMD experts?

2014-12-08 Thread ponce via Digitalmars-d
Using the fact that a negative 64-bit integer is also a negative 
32-bit integer when truncated, you could use 2x SSE2 32-bit 
comparison.
- one for [v0 , v1] - [vlow, vlow] compared to [0, 0] (result 
should be = 0, signed comparison)
- one for [v0 , v1] - [vhigh, vhigh] compared to [0, 0] (result 
should be  0, signed comparison)


Then apply logical and on the masks.
This doesn't work if vhigh - vlow spans a too large area.

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
a time.


ulong2 vval = [v0, v1];
ulong2 vlow = [low, low];
ulong2 vhigh = [high, high];

int res = PMOVMSKB(vval = vlow  vval  vhigh);

I figured out sort of a solution, but it seems way too 
complicated, because there is only signed comparison.


Usually (scalar) I'd use this, which makes use of unsigned wrap 
to safe one conditional


immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow)  size) {}
if (cast(ulong)(v1 - vlow)  size) {}

over

if (v0 = vlow  v0  vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin




Re: Any SIMD experts?

2014-12-08 Thread ponce via Digitalmars-d

On Monday, 8 December 2014 at 16:57:20 UTC, ponce wrote:
Using the fact that a negative 64-bit integer is also a 
negative 32-bit integer when truncated,


Oops. This assumption is wrong.


Re: Any SIMD experts?

2014-12-08 Thread John Colvin via Digitalmars-d

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
a time.


ulong2 vval = [v0, v1];
ulong2 vlow = [low, low];
ulong2 vhigh = [high, high];

int res = PMOVMSKB(vval = vlow  vval  vhigh);

I figured out sort of a solution, but it seems way too 
complicated, because there is only signed comparison.


Usually (scalar) I'd use this, which makes use of unsigned wrap 
to safe one conditional


immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow)  size) {}
if (cast(ulong)(v1 - vlow)  size) {}

over

if (v0 = vlow  v0  vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin


Well gcc gives me:

typedef unsigned long ulong4 __attribute__ ((vector_size (32)));

ulong4 foo(ulong4 a, ulong4 l, ulong4 h)
{
return (a = l)  (a  h);
}


foo(unsigned long __vector, unsigned long __vector, unsigned long 
__vector):

vmovdqa .LC0(%rip), %ymm3
vpsubq  %ymm3, %ymm0, %ymm0
vpsubq  %ymm3, %ymm2, %ymm2
vpsubq  %ymm3, %ymm1, %ymm1
vpcmpgtq%ymm0, %ymm2, %ymm2
vpcmpgtq%ymm0, %ymm1, %ymm1
vpandn  %ymm2, %ymm1, %ymm0
ret
.LC0:
.quad   -9223372036854775808
.quad   -9223372036854775808
.quad   -9223372036854775808
.quad   -9223372036854775808


Re: Any SIMD experts?

2014-12-08 Thread John Colvin via Digitalmars-d

On Monday, 8 December 2014 at 17:05:09 UTC, John Colvin wrote:

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) 
at a time.


ulong2 vval = [v0, v1];
ulong2 vlow = [low, low];
ulong2 vhigh = [high, high];

int res = PMOVMSKB(vval = vlow  vval  vhigh);

I figured out sort of a solution, but it seems way too 
complicated, because there is only signed comparison.


Usually (scalar) I'd use this, which makes use of unsigned 
wrap to safe one conditional


immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow)  size) {}
if (cast(ulong)(v1 - vlow)  size) {}

over

if (v0 = vlow  v0  vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin


Well gcc gives me:

typedef unsigned long ulong4 __attribute__ ((vector_size (32)));

ulong4 foo(ulong4 a, ulong4 l, ulong4 h)
{
return (a = l)  (a  h);
}


foo(unsigned long __vector, unsigned long __vector, unsigned 
long __vector):

vmovdqa .LC0(%rip), %ymm3
vpsubq  %ymm3, %ymm0, %ymm0
vpsubq  %ymm3, %ymm2, %ymm2
vpsubq  %ymm3, %ymm1, %ymm1
vpcmpgtq%ymm0, %ymm2, %ymm2
vpcmpgtq%ymm0, %ymm1, %ymm1
vpandn  %ymm2, %ymm1, %ymm0
ret
.LC0:
.quad   -9223372036854775808
.quad   -9223372036854775808
.quad   -9223372036854775808
.quad   -9223372036854775808


To conceptually get what it's doing here, the trick is that it's 
offsetting the values so as to simulate unsigned comparisons 
using signed instructions.


Re: Any SIMD experts?

2014-12-08 Thread Etienne via Digitalmars-d

On 2014-12-08 11:44 AM, Martin Nowak wrote:

This is for the most performance critical instructions during GC marking.
If we can come up with some good SIMD this will result in a good speedup.

Yesterday I was surprised to learn that my unsigned wrap trick actually
slowed down some GC benchmarks by 3%. The branch predictor had more
trouble with the single branch because that resulted in a fifty-fifty
chance. There is some correlation between the 2 branch bounds checks and
one of them could be predicted fairly well, resulting in a better
combined result.
Always profile!


Are you using it for the binary search part (find pool) ?

http://cs.nyu.edu/~lerner/spring12/Preso06-SIMDTree.pdf

The savings are would be best when the tree (pool info?) is under 64KB, 
savings being up to 30%. It could be worth doing it. Here's a quick 
solution:


http://stackoverflow.com/questions/20616605/using-simd-avx-sse-for-tree-traversal

Most of the code circulating on the web uses the _mm256 or _mm128 SIMD 
format, which I ported to multiple platforms already and I can share it 
through boost license too if you want:


https://github.com/etcimon/botan/blob/master/source/botan/utils/simd/immintrin.d



Re: Any SIMD experts?

2014-12-08 Thread Martin Nowak via Digitalmars-d

On 12/08/2014 06:21 PM, Etienne wrote:

Are you using it for the binary search part (find pool) ?


Nope the binary search is surprisingly fine, but I had it on my list of 
subjects for quite a while too.


Re: Any SIMD experts?

2014-12-08 Thread Martin Nowak via Digitalmars-d

On 12/08/2014 05:55 PM, John Colvin wrote:

I don't quite understand what you mean by save one conditional.


Instead of
  if (v = low  v  high)
it's
  if (cast(size_t)(v - low)  size)

So there is only one branch.
There are other ways to eliminate one branch, for example this.

  if ((v = low)  (v  high))

But the interesting point was that the second branch actually helped the 
branch predictor. Because it seems to know that when the first branch is 
true the second will likely be true as well, respectively if it was 
false the 2nd one will be as well.


Re: Any SIMD experts?

2014-12-08 Thread Martin Nowak via Digitalmars-d

On 12/08/2014 05:57 PM, ponce wrote:

This doesn't work if vhigh - vlow spans a too large area.


It won't unless you allocate more than 4GB of RAM.


Re: Any SIMD experts?

2014-12-08 Thread Martin Nowak via Digitalmars-d

On 12/08/2014 06:05 PM, John Colvin wrote:

Well gcc gives me:


Tried that with dmd, it gave me.

bug.d(5): Error: incompatible types for ((a) = (l)): 
'__vector(ulong[4])' and '__vector(ulong[4])'
bug.d(5): Error: incompatible types for ((a)  (h)): 
'__vector(ulong[4])' and '__vector(ulong[4])'


Yours looks better.


Re: Any SIMD experts?

2014-12-08 Thread ponce via Digitalmars-d

On Monday, 8 December 2014 at 22:37:08 UTC, ponce wrote:

  vresult - mask  31 (for each dword)


Correction:

   vresult - mask  63 (for each qword)


Re: Any SIMD experts?

2014-12-08 Thread ponce via Digitalmars-d

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
a time.


ulong2 vval = [v0, v1];
ulong2 vlow = [low, low];
ulong2 vhigh = [high, high];

int res = PMOVMSKB(vval = vlow  vval  vhigh);

I figured out sort of a solution, but it seems way too 
complicated, because there is only signed comparison.


Usually (scalar) I'd use this, which makes use of unsigned wrap 
to safe one conditional


immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow)  size) {}
if (cast(ulong)(v1 - vlow)  size) {}

over

if (v0 = vlow  v0  vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin


Another solution, in SSE pseudo-code:

  values - [vval0, vval1] // two value to bound-check, would be 
4 in AVX


  low - [vlow, vlow]
  high - [vhigh, vhigh]

  subl - values - low // using the PSUBQ instruction (SSE2)
  subh - values - high // using the PSUBQ instruction (SSE2)
  mask - subh andnot subl // using the PANDN instruction (SSE2)

  // only logical shift is available for 64-bit integers until 
AVX-512

  // using PSRLQ to shift the sign bit to the right (SSE2)
  vresult - mask  31 (for each dword)


Now, each 64-bit word in vresult contains exactly 1 if bound were 
checked, or 0 else.




Re: Any SIMD experts?

2014-12-08 Thread Martin Nowak via Digitalmars-d

On 12/08/2014 06:20 PM, John Colvin wrote:


To conceptually get what it's doing here, the trick is that it's
offsetting the values so as to simulate unsigned comparisons using
signed instructions.


All too easy, but would've taken me a pen and paper to realize :).


Re: Any SIMD experts?

2014-12-08 Thread John Colvin via Digitalmars-d

On Monday, 8 December 2014 at 23:40:21 UTC, Martin Nowak wrote:

On 12/08/2014 06:20 PM, John Colvin wrote:


To conceptually get what it's doing here, the trick is that 
it's
offsetting the values so as to simulate unsigned comparisons 
using

signed instructions.


All too easy, but would've taken me a pen and paper to realize 
:).


I wouldn't have seen it so easily but I've spent the last few 
weeks doing discrete Fourier analysis, so I've got very used to 
spotting modulo arithmetic tricks of exactly this kind.