Re: From a C++/JS benchmark

2011-08-08 Thread bearophile
Eric Poggel (JoeCoder):

 determinism can be very important when it comes to 
 reducing network traffic.  If you can achieve it, then you can make sure 
 all players have the same game state and then only send user input 
 commands over the network.

It seems a hard thing to achieve, but I agree that it would be useful.

For me, having some FP determinism is useful for debugging: it avoids results 
changing randomly when a tiny change in the source code triggers a change in 
what optimizations the compiler performs.

But there are several situations (what if I am writing a ray tracer?) where FP 
determinism is not required in my release build. I was not arguing for 
removing FP rules from the D compiler, just that there are situations where 
relaxing those FP rules, on request, doesn't seem to do harm. I am not an expert 
on the risks Walter was talking about, so maybe I'm just walking on thin ice 
(but no one will get hurt if my little ray tracer produces some errors in its 
images).

You don't post often in this newsgroup; thank you for the link :-)

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-08 Thread Eric Poggel (JoeCoder)

On 8/8/2011 3:02 PM, bearophile wrote:

Eric Poggel (JoeCoder):


determinism can be very important when it comes to
reducing network traffic.  If you can achieve it, then you can make sure
all players have the same game state and then only send user input
commands over the network.


It seems a hard thing to achieve, but I agree that it would be useful.

For me, having some FP determinism is useful for debugging: it avoids results 
changing randomly when a tiny change in the source code triggers a change in 
what optimizations the compiler performs.

But there are several situations (what if I am writing a ray tracer?) where FP 
determinism is not required in my release build. I was not arguing for 
removing FP rules from the D compiler, just that there are situations where 
relaxing those FP rules, on request, doesn't seem to do harm. I am not an expert 
on the risks Walter was talking about, so maybe I'm just walking on thin ice 
(but no one will get hurt if my little ray tracer produces some errors in its 
images).

You don't post often in this newsgroup; thank you for the link :-)

Bye,
bearophile


You'd be surprised how much I lurk here.  I agree there are some 
interesting areas where fast floating point may indeed be worth it, but 
I also don't know enough.


I've also wondered about creating a Fixed!(long, 8) struct that would 
let me work with longs and 8 bits of precision after the decimal point 
as a way of having equal precision anywhere in a large universe and 
achieving determinism at the same time.  But I don't know how 
performance would compare vs floats or doubles.
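For what it's worth, the core of that idea fits in a few lines. Here is a minimal C++ analogue of the proposed Fixed!(long, 8) (the type name and everything else is illustrative, not an existing library): a 64-bit integer holding the value scaled by 2^8, so all arithmetic is integer arithmetic and therefore bit-exact on every machine.

```cpp
#include <cstdint>

// Hypothetical fixed-point type: a 64-bit integer with 8 fractional
// bits, analogous to the proposed Fixed!(long, 8). raw stores
// value * 256, so + and - are plain integer ops and * rescales once.
struct Fixed8 {
    int64_t raw;

    static Fixed8 fromInt(int64_t v)   { return {v << 8}; }
    static Fixed8 fromDouble(double v) { return {(int64_t)(v * 256.0)}; }

    Fixed8 operator+(Fixed8 o) const { return {raw + o.raw}; }
    Fixed8 operator-(Fixed8 o) const { return {raw - o.raw}; }
    // One operand's scale factor must be divided back out.
    Fixed8 operator*(Fixed8 o) const { return {(raw * o.raw) >> 8}; }

    double toDouble() const { return raw / 256.0; }
};
```

The multiply can overflow for large magnitudes; a real implementation would widen to 128 bits or saturate, and performance versus floats would depend on how often such rescaling shifts appear in the hot loop.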


Re: From a C++/JS benchmark

2011-08-07 Thread Trass3r
Anyways, I've tweaked the GDC codegen, and program speed meets that of  
C++ now (on my system).


Implementation: http://ideone.com/0j0L1

Command-line:
gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

Best times:
G++-32bit:  1140 vps
GDC-32bit:  1135 vps


Regards
Iain


64Bit:

C++:
4501
4427
4274
4390
4468
4349
4239

GDC:
4290
4401
4400
4401
4401
4400

GDC with -fno-bounds-check:
4328
4442
4434
4445


Re: From a C++/JS benchmark

2011-08-07 Thread Eric Poggel (JoeCoder)

On 8/6/2011 8:34 PM, bearophile wrote:

Walter:


On 8/6/2011 4:46 PM, bearophile wrote:

Walter is not a lover of that -ffast-math switch.


No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking
IEEE conformance is something very, very few should even consider.


I have read several papers about FP arithmetic, but I am not an expert yet on 
them. Both GDC and LDC have compilation switches to perform those unsafe FP 
optimizations, so even if you don't like them, most D compilers today have them 
optional, and I don't think those switches will be removed.

If you want to simulate a flock of boids (http://en.wikipedia.org/wiki/Boids ) 
on the screen using D, and you use floating point values to represent their 
speed vector, introducing unsafe FP optimizations will not harm so much. Video 
games are a significant purpose for D language, and in them FP errors are often 
benign (maybe some parts of the game are able to tolerate them and some other 
part of the game needs to be compiled with strict FP semantics).

Bye,
bearophile


Floating point determinism can be very important when it comes to 
reducing network traffic.  If you can achieve it, then you can make sure 
all players have the same game state and then only send user input 
commands over the network.


Glenn Fiedler has an interesting writeup on it, but I haven't had a 
chance to read all of it yet:


http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/


Re: From a C++/JS benchmark

2011-08-06 Thread Iain Buclaw
== Quote from bearophile (bearophileh...@lycos.com)'s article
 Trass3r:
  C++ no SIMD:
  Skinned vertices per second: 4242
 
 ...
  D gdc:
  Skinned vertices per second: 2345
 Are you able and willing to show me the asm produced by gdc? There's a problem
there.
 Bye,
 bearophile


Notes from me:

- The options -fno-bounds-check and -frelease can be just as important in GDC as 
they are in DMD in certain instances.
- You can output asm in Intel dialect using -masm=intel if AT&T syntax is that 
difficult for you to read. 8-)

I will look into this later from my workstation.


Re: From a C++/JS benchmark

2011-08-06 Thread bearophile
Iain Buclaw:

 I will look into this later from my workstation.

The remaining thing to look at is just the small performance difference between 
the D-GDC version and the C++-G++ version.

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-06 Thread Iain Buclaw
== Quote from bearophile (bearophileh...@lycos.com)'s article
 Iain Buclaw:
  I will look into this later from my workstation.
 The remaining thing to look at is just the small performance difference 
 between
the D-GDC version and the C++-G++ version.
 Bye,
 bearophile

Three things that helped improve performance in a minor way for me:
1) using pointers over dynamic arrays. (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
3) using core.stdc.time over std.datetime. (1.6% speedup)

Point one is a pretty well-known issue in D as far as I'm aware.
Point two is not an issue with inlining (all methods are marked 'inline'), but it
did help remove quite a few movss instructions being emitted.
Point three is interesting: it seems that sw.peek().msecs slows down the number
of iterations in the while loop.
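On point three, a common workaround when polling a timer each iteration slows the loop itself is to consult the clock only every N iterations. A hedged sketch in C++ (the helper name and the granularity are made up for illustration, not from the benchmark code):

```cpp
#include <ctime>

// Run a benchmark loop for roughly `seconds`, but poll the clock only
// once every `checkEvery` iterations so the timer call itself doesn't
// dominate the loop body.
long countIterations(double seconds, long checkEvery = 1024) {
    long iters = 0;
    std::clock_t limit = std::clock()
                       + (std::clock_t)(seconds * CLOCKS_PER_SEC);
    for (;;) {
        // ... one unit of benchmark work would go here ...
        ++iters;
        if (iters % checkEvery == 0 && std::clock() >= limit)
            break;
    }
    return iters;
}
```

The same trick applies to sw.peek().msecs in the D version: check it every few hundred loop bodies instead of every one.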


With those changes, the D implementation is still 21% slower than the C++ 
implementation without SIMD.

http://ideone.com/4PP2D


Re: From a C++/JS benchmark

2011-08-06 Thread bearophile
Iain Buclaw:

Are you using GDC2-64 bit on Linux?

 Three things that helped improve performance in a minor way for me:
 1) using pointers over dynamic arrays. (5% speedup)
 2) removing the calls to CalVector4's constructor (5.7% speedup)
 3) using core.stdc.time over std.datetime. (1.6% speedup)
 
 Point one is pretty well known issue in D as far as I'm aware.

Really? I don't remember discussions about it. What is its cause?


 Point two is not an issue with inlining (all methods are marked 'inline'), 
 but it
 did help remove quite a few movss instructions being emitted.

This too is something worth fixing. Is this issue in Bugzilla already?


 Point three is interesting, it seems that sw.peek().msecs slows down the 
 number
 of iterations in the while loop.

This needs to be fixed.


 With those changes, D implementation is still 21% slower than C++ 
 implementation
 without SIMD.
 http://ideone.com/4PP2D

This is a lot still.

Thank you for your work. I think all three issues are worth fixing, eventually.

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-06 Thread Iain Buclaw
== Quote from bearophile (bearophileh...@lycos.com)'s article
 Iain Buclaw:
 Are you using GDC2-64 bit on Linux?

GDC2-32 bit on Linux.


  Three things that helped improve performance in a minor way for me:
  1) using pointers over dynamic arrays. (5% speedup)
  2) removing the calls to CalVector4's constructor (5.7% speedup)
  3) using core.stdc.time over std.datetime. (1.6% speedup)
 
  Point one is pretty well known issue in D as far as I'm aware.
 Really? I don't remember discussions about it. What is its cause?

I can't remember the exact discussion, but it was something about a benchmark of
passing by value vs passing by ref vs passing by pointer.

  Point two is not an issue with inlining (all methods are marked 'inline'), 
  but it
  did help remove quite a few movss instructions being emitted.
 This too is something worth fixing. Is this issue in Bugzilla already?

I don't think it's an issue really. But of course, there is a difference between
what you say and what you mean with regards to the code here (that being, with the
first version, lots of temp vars get created and moved around the place).


Regards
Iain


Re: From a C++/JS benchmark

2011-08-06 Thread bearophile
Iain Buclaw:

 1) using pointers over dynamic arrays. (5% speedup)
 2) removing the calls to CalVector4's constructor (5.7% speedup)

With DMD I have seen 180k - 190k vertices/sec replacing this:

struct CalVector4 {
    float X, Y, Z, W;

    this(float x, float y, float z, float w = 0.0f) {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
}

With:

struct CalVector4 {
    float X, Y, Z, W = 0.0f;
}

I'd like the D compiler to optimize better there.



 http://ideone.com/4PP2D

This line of code is not good:
auto vertices = cast(Vertex *) new Vertex[N];

This is much better, it's less bug-prone, simpler and shorter:
auto vertices = (new Vertex[N]).ptr;

But in practice in this program it is enough to allocate dynamic arrays 
normally, and then perform the call like this (with DMD it gives the same 
performance):
calculateVerticesAndNormals(boneTransforms.ptr, N, vertices.ptr, 
influences.ptr, output.ptr);

I don't know why passing pointers gives some more performance here, compared to 
passing dynamic arrays (but I have seen the same behaviour in other D programs 
of mine).

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-06 Thread Walter Bright

On 8/6/2011 3:19 PM, bearophile wrote:

I don't know why passing pointers gives some more performance here, compared
to passing dynamic arrays (but I have seen the same behaviour in other D
programs of mine).


A dynamic array is two values being passed, a pointer is one.
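In C++ terms, Walter's point looks roughly like this: a D slice parameter behaves like a small two-word struct passed by value, while a raw pointer is a single word, and two same-length arrays passed as pointers can share one length argument, as in the foo(int* a1, int* a2, size_t a1a2len) rewrite discussed elsewhere in the thread. A sketch (type and function names are illustrative only):

```cpp
#include <cstddef>

// Rough picture of the ABI difference: a D T[] parameter is a
// (length, pointer) pair, i.e. two machine words per argument,
// while a raw pointer is one word.
struct IntSlice {
    size_t length;
    int*   ptr;
};

// Two words cross the call boundary per slice argument.
int sumSlice(IntSlice a) {
    int s = 0;
    for (size_t i = 0; i < a.length; ++i)
        s += a.ptr[i];
    return s;
}

// With raw pointers, two same-length arrays can share one length
// word: three words total instead of four.
int sumBoth(const int* a, const int* b, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] + b[i];
    return s;
}
```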


Re: From a C++/JS benchmark

2011-08-06 Thread Iain Buclaw
== Quote from bearophile (bearophileh...@lycos.com)'s article
 Iain Buclaw:
  1) using pointers over dynamic arrays. (5% speedup)
  2) removing the calls to CalVector4's constructor (5.7% speedup)
 With DMD I have seen 180k - 190k vertices/sec replacing this:
 struct CalVector4 {
 float X, Y, Z, W;
 this(float x, float y, float z, float w = 0.0f) {
 X = x;
 Y = y;
 Z = z;
 W = w;
 }
 }
 With:
 struct CalVector4 {
 float X, Y, Z, W=0.0f;
 }
 I'd like the D compiler to optimize better there.
  http://ideone.com/4PP2D
 This line of code is not good:
 auto vertices = cast(Vertex *) new Vertex[N];
 This is much better, it's less bug-prone, simpler and shorter:
 auto vertices = (new Vertex[N]).ptr;
 But in practice in this program it is enough to allocate dynamic arrays
normally, and then perform the call like this (with DMD it gives the same
performance):
 calculateVerticesAndNormals(boneTransforms.ptr, N, vertices.ptr, 
 influences.ptr,
output.ptr);

I was playing about with heap vs stack. Must've forgotten to remove that, sorry. :)

Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ now 
(on
my system).

Implementation: http://ideone.com/0j0L1

Command-line:
gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

Best times:
G++-32bit:  1140 vps
GDC-32bit:  1135 vps


Regards
Iain


Re: From a C++/JS benchmark

2011-08-06 Thread bearophile
Walter:

 A dynamic array is two values being passed, a pointer is one.

I know, but I think there are many optimization opportunities. An example:


private void foo(int[] a2) {}

void main() {
    int[100] a1;
    foo(a1);
}


In code like that I think a D compiler is free to compile like this, because 
foo is private, so it's free to perform optimizations based on just the code 
inside the module:

private void foo(ref int[100] a2) {}

void main() {
    int[100] a1;
    foo(a1);
}


I think there are several cases where a D compiler is free to replace the two 
values with just a pointer.


Another example, to optimize code like this:

private void foo(int[] a1, int[] a2) {}

void main() {
    int n = 100; // run-time value
    auto a3 = new int[n];
    auto a4 = new int[n];
    foo(a3, a4);
}


Into something like this:

private void foo(int* a1, int* a2, size_t a1a2len) {}

void main() {
    int n = 100;
    auto a3 = new int[n];
    auto a4 = new int[n];
    foo(a3.ptr, a4.ptr, n);
}

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-06 Thread bearophile
Iain Buclaw:

 Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ 
 now (on
 my system).

Are you willing to explain your changes (and maybe give a link to the changes)? 
Maybe Walter is interested for DMD too.


 Command-line:
 gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
 g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

In newer versions of GCC -Ofast means -ffast-math too.

Walter is not a lover of that -ffast-math switch.
But I now think that the combination of D strongly pure functions with unsafe 
FP optimizations offers optimization opportunities that maybe not even GCC is 
able to use now when it compiles C/C++ code (do you see why?). Not using this 
opportunity is a waste, in my opinion.
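For readers wondering what the pure-function angle buys: a strongly pure function's result depends only on its arguments, so a compiler may merge repeated calls, and D carries that guarantee in the type system rather than requiring a manual annotation. The closest C/C++ analogue I know of is GCC's __attribute__((const)); a small hedged sketch:

```cpp
// GCC/Clang analogue of a D strongly pure function:
// __attribute__((const)) promises the result depends only on the
// arguments (no memory reads, no side effects), so the optimizer
// may fold the two calls below into one.
__attribute__((const))
static double poly(double x) {
    return x * x * x - 2.0 * x + 1.0;
}

double twice(double x) {
    // Eligible for common-subexpression elimination thanks to the
    // attribute on poly.
    return poly(x) + poly(x);
}
```

In C/C++ the programmer must remember (and be trusted) to add the attribute; in D, purity is checked by the compiler, which is presumably the opportunity bearophile is hinting at.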

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-06 Thread Walter Bright

On 8/6/2011 4:46 PM, bearophile wrote:

Walter is not a lover of that -ffast-math switch.


No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking 
IEEE conformance is something very, very few should even consider.


Re: From a C++/JS benchmark

2011-08-06 Thread bearophile
Walter:

 On 8/6/2011 4:46 PM, bearophile wrote:
  Walter is not a lover of that -ffast-math switch.
 
 No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking 
 IEEE conformance is something very, very few should even consider.

I have read several papers about FP arithmetic, but I am not an expert yet on 
them. Both GDC and LDC have compilation switches to perform those unsafe FP 
optimizations, so even if you don't like them, most D compilers today have them 
optional, and I don't think those switches will be removed.

If you want to simulate a flock of boids (http://en.wikipedia.org/wiki/Boids ) 
on the screen using D, and you use floating point values to represent their 
speed vector, introducing unsafe FP optimizations will not harm so much. Video 
games are a significant purpose for D language, and in them FP errors are often 
benign (maybe some parts of the game are able to tolerate them and some other 
part of the game needs to be compiled with strict FP semantics).

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-05 Thread bearophile
Trass3r:

  are you willing and able to show me the asm before it gets assembled?  
  (with gcc you do it with the -S switch). (I also suggest to use only the  
  C standard library, with time() and printf() to produce a smaller asm  
  output: http://codepad.org/12EUo16J ).

You are a person of few words :-) Thank you for the asm.

Apparently the program was not compiled in release mode (or with bounds checks 
disabled; with DMD that's the same thing, maybe with gdc it's not). It contains 
the calls, but they aren't calls to the next instruction; they were for the array 
bounds checks:

call_d_assert
call_d_array_bounds
call_d_array_bounds
call_d_assert_msg
call_d_array_bounds
call_d_array_bounds
call_d_array_bounds
call_d_array_bounds
call_d_array_bounds
call_d_array_bounds
call_d_assert_msg

But I think this doesn't fully explain the low performance; I have seen too 
many instructions like:

movss   DWORD PTR [rsp+32], xmm1
movss   DWORD PTR [rsp+16], xmm2
movss   DWORD PTR [rsp+48], xmm3

If you want to go on with this exploration, then I suggest you find a way to 
disable bounds checks.

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-05 Thread Trass3r
If you want to go on with this exploration, then I suggest you to find a  
way to disable bound tests.


Ok, now I get up to 3293 skinned vertices per second.
Still a bit worse than LDC.


Re: From a C++/JS benchmark

2011-08-05 Thread Trass3r

Am 04.08.2011, 04:07 Uhr, schrieb Trass3r u...@known.com:


C++:
Skinned vertices per second: 4866

C++ no SIMD:
Skinned vertices per second: 4242


D dmd:
Skinned vertices per second: 159046

D gdc:
Skinned vertices per second: 2345



D ldc:
Skinned vertices per second: 3791

ldc2 -O3 -release -enable-inlining dver.d



D gdc with added -frelease -fno-bounds-check:
Skinned vertices per second: 3771


Re: From a C++/JS benchmark

2011-08-05 Thread bearophile
Trass3r:

  C++ no SIMD:
  Skinned vertices per second: 4242
...
 D gdc with added -frelease -fno-bounds-check:
 Skinned vertices per second: 3771

I'd like to know why the GCC back-end is able to produce a more efficient 
binary from the C++ code (compared to the D code), but now the problem is not 
as large as before.

It seems I've found a benchmark coming from real-world code that's a worst case 
for DMD (GDC here produces code about 237 times faster than DMD).

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-05 Thread Trass3r
I'd like to know why the GCC back-end is able to produce a more  
efficient binary from the C++ code (compared to the D code), but now the  
problem is not as large as before.


I attached both asm versions ;)

cppver.s
Description: Binary data


dver.s
Description: Binary data


Re: From a C++/JS benchmark

2011-08-04 Thread Marco Leise

Am 03.08.2011, 21:52 Uhr, schrieb David Nadlinger s...@klickverbot.at:


On 8/3/11 9:48 PM, Adam D. Ruppe wrote:

System: Windows XP, Core 2 Duo E6850


Is this Windows XP 32 bit or 64 bit? That will probably make
a difference on the longs I'd expect.


It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).

David


I thought he was referring to the processor being able to handle 64-bit  
ints more efficiently in 64-bit operation mode on a 64-bit OS with 64-bit  
executables.


Re: From a C++/JS benchmark

2011-08-04 Thread Trass3r

Are you able and willing to show me the asm produced by gdc? There's a
problem there.

bla.rar
Description: application/rar-compressed


Re: From a C++/JS benchmark

2011-08-04 Thread Adam Ruppe
Marco Leise wrote:
 I thought he was referring to the processor being able to handle
 64-bit ints more efficiently in 64-bit operation mode on a 64-bit OS
 with 64-bit executables.

I was thinking a little of both, but this is the main thing. My
suspicion was that Java might have been using a 64-bit JVM while
everything else was compiled in 32 bit, causing it to win in that case.

But with a 32 bit OS that means 32 bit programs all around.


Re: From a C++/JS benchmark

2011-08-04 Thread bearophile
 Trass3r:
 are you able and willing to show me the asm produced by gdc? There's a
 problem there.
 [attach bla.rar]

In the bla.rar attachment there's the unstripped Linux binary, so to read the asm I 
have used the objdump disassembler. But are you willing and able to show me the 
asm before it gets assembled? (With gcc you do it with the -S switch.) (I also 
suggest using only the C standard library, with time() and printf(), to produce 
a smaller asm output: http://codepad.org/12EUo16J ).

Using objdump I see it uses 16 xmm registers; this is the main routine. But 
what's the purpose of those callq? They seem to call the successive asm 
instruction. The x86 asm of this routine contains jumps only and no calls.
The asm of this routine is also very long, I don't know why yet. I see too many 
instructions like "movss 0x80(%rsp), %xmm7"; this looks like a problem.


_calculateVerticesAndNormals:
push   %r15
push   %r14
push   %r13
push   %r12
push   %rbp
push   %rbx
sub$0x268, %rsp
mov0x2a0(%rsp), %rax
mov%rdi, 0xe8(%rsp)
mov%rsi, 0xe0(%rsp)
mov%rcx, 0x128(%rsp)
mov%r8, 0x138(%rsp)
mov%rax, 0xf0(%rsp)
mov0x2a8(%rsp), %rax
mov%rdi, 0x180(%rsp)
mov%rsi, 0x188(%rsp)
mov%rcx, 0x170(%rsp)
mov%rax, 0xf8(%rsp)
mov0x2b0(%rsp), %rax
mov%r8, 0x178(%rsp)
mov%rax, 0x130(%rsp)
mov0x2b8(%rsp), %rax
mov%rax, 0x140(%rsp)
mov%rcx, %rax
add%rax, %rax
cmp0x130(%rsp), %rax
je 74d _calculateVerticesAndNormals+0xcd
mov$0x57, %edx
mov$0x6, %edi
mov$0x0, %esi
movq   $0x6, 0x190(%rsp)
movq   $0x0, 0x198(%rsp)
callq  74d _calculateVerticesAndNormals+0xcd
cmpq   $0x0, 0x128(%rsp)
je 1317 _calculateVerticesAndNormals+0xc97
movq   $0x1, 0x120(%rsp)
xor%r15d, %r15d
movq   $0x0, 0x100(%rsp)
movslq %r15d, %r12
cmp%r12, 0xf0(%rsp)
movq   $0x0, 0x108(%rsp)
jbef1d _calculateVerticesAndNormals+0x89d
nopl   0x0(%rax)
lea(%r12, %r12, 2), %rax
shl$0x2, %rax
mov%rax, 0x148(%rsp)
mov0xf8(%rsp), %rax
add0x148(%rsp), %rax
movss  0x4(%rax), %xmm9
movzbl 0x8(%rax), %r13d
movslq (%rax), %rax
cmp0xe8(%rsp), %rax
jaef50 _calculateVerticesAndNormals+0x8d0
lea(%rax, %rax, 2), %rax
shl$0x4, %rax
mov%rax, 0x110(%rsp)
mov0xe0(%rsp), %rbx
add0x110(%rsp), %rbx
je 12af _calculateVerticesAndNormals+0xc2f
movss  (%rbx), %xmm7
test   %r13b, %r13b
movss  0x4(%rbx), %xmm8
movss  0x8(%rbx), %xmm6
mulss  %xmm9, %xmm7
movss  0xc(%rbx), %xmm11
mulss  %xmm9, %xmm8
movss  0x10(%rbx), %xmm4
mulss  %xmm9, %xmm6
movss  0x14(%rbx), %xmm5
mulss  %xmm9, %xmm11
movss  0x18(%rbx), %xmm3
mulss  %xmm9, %xmm4
movss  0x1c(%rbx), %xmm10
mulss  %xmm9, %xmm5
movss  0x20(%rbx), %xmm1
mulss  %xmm9, %xmm3
movss  0x24(%rbx), %xmm2
mulss  %xmm9, %xmm10
movss  0x28(%rbx), %xmm0
mulss  %xmm9, %xmm1
mulss  %xmm9, %xmm2
mulss  %xmm9, %xmm0
mulss  0x2c(%rbx), %xmm9
jnecdb _calculateVerticesAndNormals+0x65b
add$0x1, %r12
mov%r14, %rax
lea(%r12, %r12, 2), %r13
shl$0x2, %r13
jmpq   99e _calculateVerticesAndNormals+0x31e
nopl   (%rax)
mov%r13, %rax
mov0xf8(%rsp), %rdx
add%rax, %rdx
movss  0x4(%rdx), %xmm12
movzbl 0x8(%rdx), %r14d
movslq (%rdx), %rdx
cmp%rdx, 0xe8(%rsp)
jbeaa0 _calculateVerticesAndNormals+0x420
mov0xe0(%rsp), %rbx
lea(%rdx, %rdx, 2), %rbp
shl$0x4, %rbp
add%rbp, %rbx
je baf _calculateVerticesAndNormals+0x52f
movss  (%rbx), %xmm13
add$0x1, %r12
add$0xc, %r13
test   %r14b, %r14b
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm7
movss  0x4(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm8
movss  0x8(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm6
movss  0xc(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm11
movss  0x10(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm4
movss  0x14(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm5
movss  0x18(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm3
movss  0x1c(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm10
movss  0x20(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm1
movss  0x24(%rbx), %xmm13
mulss  %xmm12, %xmm13
addss  %xmm13, %xmm2
movss  0x28(%rbx), %xmm13
mulss  %xmm12, %xmm13
mulss  0x2c(%rbx), %xmm12
addss  %xmm13, %xmm0
addss  %xmm12, %xmm9
jnecd8 _calculateVerticesAndNormals+0x658
add$0x1, %r15d
cmp%r12, 0xf0(%rsp)
ja 890 _calculateVerticesAndNormals+0x210
mov$0x63, %edx
mov$0x6, %edi
mov$0x0, %esi
mov%rax, 0xc8(%rsp)
movss  %xmm0, (%rsp)
movss  %xmm1, 0x20(%rsp)
movss  %xmm2, 0x10(%rsp)
movss  %xmm3, 0x30(%rsp)
movss  %xmm4, 0x50(%rsp)
movss  %xmm5, 0x40(%rsp)
movss  %xmm6, 0x60(%rsp)
movss  %xmm7, 0x80(%rsp)
movss  %xmm8, 0x70(%rsp)
movss  %xmm9, 0x90(%rsp)
movss  %xmm10, 0xa0(%rsp)
movss  %xmm11, 0xb0(%rsp)
movq   $0x6, 0x1c0(%rsp)
movq   $0x0, 0x1c8(%rsp)
callq  a3b _calculateVerticesAndNormals+0x3bb
mov0xc8(%rsp), %rax
movss  (%rsp), %xmm0
movss  0x20(%rsp), %xmm1
movss  0x10(%rsp), %xmm2
movss  0x30(%rsp), %xmm3
movss  

Re: From a C++/JS benchmark

2011-08-04 Thread Adam Ruppe
 But what's the purpose of those callq? They seem to call the
 successive asm instruct

I find AT&T syntax to be almost impossible to read, but it looks
like they are comparing the instruction pointer for some reason.

call works by pushing the instruction pointer on the stack, then
jumping to the new address. By calling the next thing, you can
then pop the instruction pointer off the stack and continue on where
you left off.

I don't know why they want this though. That AT&T syntax really
messes with my brain...


Re: From a C++/JS benchmark

2011-08-03 Thread Denis Shelomovskij

03.08.2011 18:20, bearophile:

The benchmark info:
http://chadaustin.me/2011/01/digging-into-javascript-performance/

The code, in C++, JS, Java, C#:
https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core.

D2 version translated from the C# version (the C++ version uses struct 
inheritance!):
http://ideone.com/kf1tz

Bye,
bearophile


Compilers:
C++:  cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:1024
Java: Oracle Java 1.6 with hm... Oracle default settings
C#:   Csc /optimize+
D2:   dmd -O -noboundscheck -inline -release

Type column: working scalar type
Other columns: vertices per second (inaccuracy is about 1%) by language 
(tests from bearophile's message, C++ test is skinning_test_no_simd.cpp).


System: Windows XP, Core 2 Duo E6850

---
  Type  |C++ |Java| C# | D2
---
float   | 31_400_000 | 17_000_000 | 14_700_000 |168_000
double  | 32_300_000 | 16_000_000 | 14_100_000 |166_000
real| 32_300_000 |   no real  |   no real  |203_000
int | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long| 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000
---

JavaScript vs C++ speed is at the first link of original bearophile's 
post and JS is about 10-20 times slower than C++.
Looks like a spiteful joke... In other words: WTF?! JavaScript is about 
10 times faster than D in floating point calculations!? Please, tell me 
that I'm mistaken.


Re: From a C++/JS benchmark

2011-08-03 Thread Ziad Hatahet
I believe that long in this case is 32 bits in C++, and 64 bits in the
remaining languages, hence the same result for int and long in C++. Try with
long long maybe? :)
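That is indeed the trap: the C++ standard only requires long to be at least 32 bits, and on LLP64 Windows it stays 32-bit even in 64-bit builds, while D's long is always 64-bit. A small compile-time check of the portable alternatives (assuming a C++11 compiler):

```cpp
#include <cstdint>
#include <climits>

// C++ `long` is 32-bit on LLP64 (64-bit Windows) but 64-bit on LP64
// (typical 64-bit Linux); D's `long` is 64-bit everywhere. To match
// D in the benchmark, `long long` or `int64_t` is the portable choice.
static_assert(sizeof(long long) * CHAR_BIT >= 64,
              "long long is guaranteed at least 64 bits");
static_assert(sizeof(int64_t) == 8,
              "int64_t is exactly 64 bits");
```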


--
Ziad


2011/8/3 Denis Shelomovskij verylonglogin@gmail.com

 03.08.2011 18:20, bearophile:

  The benchmark info:
 http://chadaustin.me/2011/01/digging-into-javascript-performance/

 The code, in C++, JS, Java, C#:
 https://github.com/chadaustin/Web-Benchmarks/
 The C++/JS/Java code runs on a single core.

 D2 version translated from the C# version (the C++ version uses struct
 inheritance!):
 http://ideone.com/kf1tz

 Bye,
 bearophile


 Compilers:
 C++:  cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:1024
 Java: Oracle Java 1.6 with hm... Oracle default settings
 C#:   Csc /optimize+
 D2:   dmd -O -noboundscheck -inline -release

 Type column: working scalar type
 Other columns: vertices per second (inaccuracy is about 1%) by language
 (tests from bearophile's message, C++ test is skinning_test_no_simd.cpp).

 System: Windows XP, Core 2 Duo E6850

 ---
  Type  |C++ |Java| C# | D2
 ---
 float   | 31_400_000 | 17_000_000 | 14_700_000 |168_000
 double  | 32_300_000 | 16_000_000 | 14_100_000 |166_000
 real| 32_300_000 |   no real  |   no real  |203_000
 int | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
 long| 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000
 ---

 JavaScript vs C++ speed is at the first link of original bearophile's post
 and JS is about 10-20 times slower than C++.
 Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10
 times faster than D in floating point calculations!? Please, tell me that
 I'm mistaken.



Re: From a C++/JS benchmark

2011-08-03 Thread Denis Shelomovskij

03.08.2011 22:15, Ziad Hatahet:

I believe that long in this case is 32 bits in C++, and 64-bits in the
remaining languages, hence the same result for int and long in C++. Try
with long long maybe? :)


--
Ziad


2011/8/3 Denis Shelomovskij verylonglogin@gmail.com

03.08.2011 18:20, bearophile:

The benchmark info:
http://chadaustin.me/2011/01/digging-into-javascript-performance/

The code, in C++, JS, Java, C#:
https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core.

D2 version translated from the C# version (the C++ version uses
struct inheritance!):
http://ideone.com/kf1tz

Bye,
bearophile


Compilers:
C++:  cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:1024
Java: Oracle Java 1.6 with hm... Oracle default settings
C#:   Csc /optimize+
D2:   dmd -O -noboundscheck -inline -release

Type column: working scalar type
Other columns: vertices per second (inaccuracy is about 1%) by
language (tests from bearophile's message, C++ test is
skinning_test_no_simd.cpp).

System: Windows XP, Core 2 Duo E6850

---
  Type  |C++ |Java| C# | D2
---
float   | 31_400_000 | 17_000_000 | 14_700_000 |168_000
double  | 32_300_000 | 16_000_000 | 14_100_000 |166_000
real| 32_300_000 |   no real  |   no real  |203_000
int | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long| 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000
---

JavaScript vs C++ speed is at the first link of original
bearophile's post and JS is about 10-20 times slower than C++.
Looks like a spiteful joke... In other words: WTF?! JavaScript is
about 10 times faster than D in floating point calculations!?
Please, tell me that I'm mistaken.




Good! This is my first blunder (it's so easy to completely forget 
language design that is illogical to me). So, corrected last row:


 Type  |C++ |Java| C# | D2
-
long| 5_500_000 |  6_600_000 |  4_400_000 |  5_800_000


Java is the fastest long language :)


Re: From a C++/JS benchmark

2011-08-03 Thread David Nadlinger

On 8/3/11 9:48 PM, Adam D. Ruppe wrote:

System: Windows XP, Core 2 Duo E6850


Is this Windows XP 32 bit or 64 bit? That will probably make
a difference on the longs I'd expect.


It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).

David


Re: From a C++/JS benchmark

2011-08-03 Thread Adam D. Ruppe
 System: Windows XP, Core 2 Duo E6850

Is this Windows XP 32 bit or 64 bit? That will probably make
a difference on the longs I'd expect.


Re: From a C++/JS benchmark

2011-08-03 Thread Denis Shelomovskij

03.08.2011 22:48, Adam D. Ruppe wrote:

System: Windows XP, Core 2 Duo E6850


Is this Windows XP 32 bit or 64 bit? That will probably make
a difference on the longs I'd expect.


I meant Windows XP 32 bit (5.1 (Build 2600: Service Pack 3)) (according 
to Wikipedia's definition of Windows XP).


Re: From a C++/JS benchmark

2011-08-03 Thread bearophile
Denis Shelomovskij:

 (tests from bearophile's message, C++ test is skinning_test_no_simd.cpp).

For a more realistic test I suggest you also time the C++ version that uses the 
intrinsics (only for float).


 Looks like a spiteful joke... In other words: WTF?! JavaScript is about 
 10 times faster than D in floating point calculations!? Please, tell me 
 that I'm mistaken.

Languages aren't slow or fast, their implementations produce assembly that's 
more or less efficient.

A D1 version fit for LDC V1 with Tango:
http://codepad.org/ewDy31UH

Vertices (millions), Linux 32 bit:
  C++ no simd:  29.5
  D:27.6

LDC based on DMD v1.057 and llvm 2.6, ldc -O3 -release -inline

G++ V4.3.3, -s -O3 -mfpmath=sse -ffast-math -msse3

It's a bit slower than the C++ version, but for most people that's an 
acceptable difference (and maybe by porting the C++ code to D instead of the C# 
one, and using a more modern LLVM, you'd reduce that loss a bit).

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-03 Thread Trass3r
Looks like a spiteful joke... In other words: WTF?! JavaScript is about  
10 times faster than D in floating point calculations!? Please, tell me  
that I'm mistaken.


I'm afraid not. dmd's backend isn't good at floating point calculations.


Re: From a C++/JS benchmark

2011-08-03 Thread bearophile
Trass3r:

 I'm afraid not. dmd's backend isn't good at floating point calculations.

Studying the asm a bit, it's not hard to find the cause, because this benchmark 
is quite pure (synthetic, though I think it comes from real-world code).

This is what G++ generates from the C++ code without intrinsics (the version 
that uses SIMD intrinsics has a similar look but it's shorter):

movl  (%eax), %edx
movss  4(%eax), %xmm0
movl  8(%eax), %ecx
leal  (%edx,%edx,2), %edx
sall  $4, %edx
addl  %ebx, %edx
testl  %ecx, %ecx
movss  12(%edx), %xmm1
movss  20(%edx), %xmm7
movss  (%edx), %xmm5
mulss  %xmm0, %xmm1
mulss  %xmm0, %xmm7
movss  4(%edx), %xmm6
movss  8(%edx), %xmm4
movss  %xmm1, (%esp)
mulss  %xmm0, %xmm5
movss  28(%edx), %xmm1
movss  %xmm7, 4(%esp)
mulss  %xmm0, %xmm6
movss  32(%edx), %xmm7
mulss  %xmm0, %xmm1
movss  16(%edx), %xmm3
mulss  %xmm0, %xmm7
movss  24(%edx), %xmm2
movss  %xmm1, 16(%esp)
mulss  %xmm0, %xmm4
movss  36(%edx), %xmm1
movss  %xmm7, 8(%esp)
mulss  %xmm0, %xmm3
movss  40(%edx), %xmm7
mulss  %xmm0, %xmm2
mulss  %xmm0, %xmm1
mulss  %xmm0, %xmm7
mulss  44(%edx), %xmm0
leal  12(%eax), %edx
movss  %xmm7, 12(%esp)
movss  %xmm0, 20(%esp)


This is what DMD generates for the same (or quite similar) piece of code:

movsd
mov  EAX,068h[ESP]
imul  EDX,EAX,030h
add  EDX,018h[ESP]
fld  float ptr [EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 038h[ESP]
fld  float ptr 4[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 03Ch[ESP]
fld  float ptr 8[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 040h[ESP]
fld  float ptr 0Ch[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 044h[ESP]
fld  float ptr 010h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 048h[ESP]
fld  float ptr 014h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 04Ch[ESP]
fld  float ptr 018h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 050h[ESP]
fld  float ptr 01Ch[EDX]
mov  CL,070h[ESP]
xor  CL,1
fmul  float ptr 06Ch[ESP]
fstp  float ptr 054h[ESP]
fld  float ptr 020h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 058h[ESP]
fld  float ptr 024h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 05Ch[ESP]
fld  float ptr 028h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 060h[ESP]
fld  float ptr 02Ch[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 064h[ESP]

I think the DMD back-end already contains logic to use XMM registers as true 
registers (not as a floating-point stack, or as temporary slots to push and pop 
FP values through), so I suspect it wouldn't take too much work to modify it to 
emit FP asm with a single optimization: just keep the values inside registers. 
In my uninformed opinion all other FP optimizations are almost insignificant 
compared to this one :-)

Bye,
bearophile


Re: From a C++/JS benchmark

2011-08-03 Thread Trass3r

C++:
Skinned vertices per second: 4866

C++ no SIMD:
Skinned vertices per second: 4242


D dmd:
Skinned vertices per second: 159046

D gdc:
Skinned vertices per second: 2345



Compilers:

gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)
g++ -s -O3 -mfpmath=sse -ffast-math -march=native

DMD64 D Compiler v2.054
dmd -O -noboundscheck -inline -release dver.d

gcc version 4.6.1 20110627 (gdc 0.30, using dmd 2.054) (GCC)
gdc -s -O3 -mfpmath=sse -ffast-math -march=native dver.d


Ubuntu 11.04 x64
Core2 Duo E6300


Re: From a C++/JS benchmark

2011-08-03 Thread Trass3r

C++:
Skinned vertices per second: 4866

C++ no SIMD:
Skinned vertices per second: 4242


D dmd:
Skinned vertices per second: 159046

D gdc:
Skinned vertices per second: 2345



D ldc:
Skinned vertices per second: 3791

ldc2 -O3 -release -enable-inlining dver.d


Re: From a C++/JS benchmark

2011-08-03 Thread bearophile
Trass3r:

 C++ no SIMD:
 Skinned vertices per second: 4242
 
...
 D gdc:
 Skinned vertices per second: 2345

Are you able and willing to show me the asm produced by gdc? There's a problem 
there.

Bye,
bearophile