Re: From a C++/JS benchmark
Eric Poggel (JoeCoder): determinism can be very important when it comes to reducing network traffic. If you can achieve it, then you can make sure all players have the same game state and then only send user input commands over the network.

It seems a hard thing to obtain, but I agree that it is useful. For me, having some FP determinism is useful for debugging: it avoids results changing randomly when a tiny change in the source code triggers a change in what optimizations the compiler performs. But there are several situations (say, if I am writing a ray tracer) where FP determinism is not required in my release build. I was not arguing for removing FP rules from the D compiler, just that there are situations where relaxing those FP rules, on request, doesn't seem to do harm. I am no expert on the risks Walter was talking about, so maybe I'm just walking on thin ice (but no one will get hurt if my little raytracer produces some errors in its images). You don't post often in this newsgroup; thank you for the link :-)

Bye, bearophile
Re: From a C++/JS benchmark
On 8/8/2011 3:02 PM, bearophile wrote:
Eric Poggel (JoeCoder): determinism can be very important when it comes to reducing network traffic. If you can achieve it, then you can make sure all players have the same game state and then only send user input commands over the network.
It seems a hard thing to obtain, but I agree that it is useful. For me, having some FP determinism is useful for debugging: it avoids results changing randomly when a tiny change in the source code triggers a change in what optimizations the compiler performs. But there are several situations (say, if I am writing a ray tracer) where FP determinism is not required in my release build. I was not arguing for removing FP rules from the D compiler, just that there are situations where relaxing those FP rules, on request, doesn't seem to do harm. I am no expert on the risks Walter was talking about, so maybe I'm just walking on thin ice (but no one will get hurt if my little raytracer produces some errors in its images). You don't post often in this newsgroup; thank you for the link :-)
Bye, bearophile

You'd be surprised how much I lurk here. I agree there are some interesting areas where fast floating point may indeed be worth it, but I also don't know enough. I've also wondered about creating a Fixed!(long, 8) struct that would let me work with longs and 8 bits of precision after the decimal point, as a way of having equal precision anywhere in a large universe and achieving determinism at the same time. But I don't know how its performance would compare vs floats or doubles.
Re: From a C++/JS benchmark
Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ now (on my system).

Implementation: http://ideone.com/0j0L1

Command-lines:
gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

Best times:
G++-32bit: 1140 vps
GDC-32bit: 1135 vps

64-bit:
C++: 4501 4427 4274 4390 4468 4349 4239
GDC: 4290 4401 4400 4401 4401 4400
GDC with -fno-bounds-check: 4328 4442 4434 4445

Regards, Iain
Re: From a C++/JS benchmark
On 8/6/2011 8:34 PM, bearophile wrote:
Walter: On 8/6/2011 4:46 PM, bearophile wrote: Walter is not a lover of that -ffast-math switch. No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking IEEE conformance is something very, very few should even consider.
I have read several papers about FP arithmetic, but I am not an expert on them yet. Both GDC and LDC have compilation switches to perform those unsafe FP optimizations, so even if you don't like them, most D compilers today offer them as options, and I don't think those switches will be removed. If you want to simulate a flock of boids (http://en.wikipedia.org/wiki/Boids ) on the screen using D, and you use floating point values to represent their speed vectors, introducing unsafe FP optimizations will not do much harm. Video games are a significant use case for the D language, and in them FP errors are often benign (maybe some parts of a game can tolerate them while other parts need to be compiled with strict FP semantics).
Bye, bearophile

Floating point determinism can be very important when it comes to reducing network traffic. If you can achieve it, then you can make sure all players have the same game state and then only send user input commands over the network. Glenn Fiedler has an interesting writeup on it, but I haven't had a chance to read all of it yet: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/
Re: From a C++/JS benchmark
== Quote from bearophile (bearophileh...@lycos.com)'s article
Trass3r:
C++ no SIMD: Skinned vertices per second: 4242
...
D gdc: Skinned vertices per second: 2345
Are you able and willing to show me the asm produced by gdc? There's a problem there.
Bye, bearophile

Notes from me:
- Options -fno-bounds-check and -frelease can be just as important in GDC as they are in DMD in certain instances.
- You can output asm in the Intel dialect using -masm=intel, if AT&T syntax is that difficult for you to read. 8-)

I will look into this later from my workstation.
Re: From a C++/JS benchmark
Iain Buclaw: I will look into this later from my workstation. The remaining thing to look at is just the small performance difference between the D-GDC version and the C++-G++ version. Bye, bearophile
Re: From a C++/JS benchmark
== Quote from bearophile (bearophileh...@lycos.com)'s article
Iain Buclaw:
I will look into this later from my workstation.
The remaining thing to look at is just the small performance difference between the D-GDC version and the C++-G++ version.
Bye, bearophile

Three things that helped improve performance in a minor way for me:
1) using pointers over dynamic arrays (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
3) using core.stdc.time over std.datetime (1.6% speedup)

Point one is a pretty well-known issue in D as far as I'm aware. Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted. Point three is interesting: it seems that sw.peek().msecs slows down the number of iterations in the while loop.

With those changes, the D implementation is still 21% slower than the C++ implementation without SIMD. http://ideone.com/4PP2D
Re: From a C++/JS benchmark
Iain Buclaw:

Are you using GDC2 64-bit on Linux?

Three things that helped improve performance in a minor way for me:
1) using pointers over dynamic arrays (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
3) using core.stdc.time over std.datetime (1.6% speedup)
Point one is a pretty well-known issue in D as far as I'm aware.

Really? I don't remember discussions about it. What is its cause?

Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted.

This too is something worth fixing. Is this issue in Bugzilla already?

Point three is interesting: it seems that sw.peek().msecs slows down the number of iterations in the while loop.

This needs to be fixed.

With those changes, the D implementation is still 21% slower than the C++ implementation without SIMD. http://ideone.com/4PP2D

This is still a lot. Thank you for your work. I think all three issues are worth fixing, eventually.

Bye, bearophile
Re: From a C++/JS benchmark
== Quote from bearophile (bearophileh...@lycos.com)'s article
Iain Buclaw:
Are you using GDC2 64-bit on Linux?

GDC2 32-bit on Linux.

Three things that helped improve performance in a minor way for me:
1) using pointers over dynamic arrays (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
3) using core.stdc.time over std.datetime (1.6% speedup)
Point one is a pretty well-known issue in D as far as I'm aware.
Really? I don't remember discussions about it. What is its cause?

I can't remember the exact discussion, but it was something about a benchmark of passing by value vs passing by ref vs passing by pointer.

Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted.
This too is something worth fixing. Is this issue in Bugzilla already?

I don't think it's an issue, really. But of course, there is a difference between what you say and what you mean with regards to the code here (that being, with the first version, lots of temp vars get created and moved around the place).

Regards, Iain
Re: From a C++/JS benchmark
Iain Buclaw:
1) using pointers over dynamic arrays (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)

With DMD I have seen 180k - 190k vertices/sec replacing this:

struct CalVector4 {
    float X, Y, Z, W;
    this(float x, float y, float z, float w = 0.0f) {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
}

with:

struct CalVector4 {
    float X, Y, Z, W = 0.0f;
}

I'd like the D compiler to optimize better there.

http://ideone.com/4PP2D
This line of code is not good:

auto vertices = cast(Vertex*)new Vertex[N];

This is much better; it's less bug-prone, simpler and shorter:

auto vertices = (new Vertex[N]).ptr;

But in practice in this program it is enough to allocate dynamic arrays normally, and then perform the call like this (with DMD it gives the same performance):

calculateVerticesAndNormals(boneTransforms.ptr, N, vertices.ptr, influences.ptr, output.ptr);

I don't know why passing pointers gives some more performance here, compared to passing dynamic arrays (but I have seen the same behaviour in other D programs of mine).

Bye, bearophile
Re: From a C++/JS benchmark
On 8/6/2011 3:19 PM, bearophile wrote: I don't know why passing pointers gives some more performance here, compared to passing dynamic arrays (but I have seen the same behaviour in other D programs of mine). A dynamic array is two values being passed, a pointer is one.
Re: From a C++/JS benchmark
== Quote from bearophile (bearophileh...@lycos.com)'s article
Iain Buclaw:
1) using pointers over dynamic arrays (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
With DMD I have seen 180k - 190k vertices/sec replacing this:
struct CalVector4 {
    float X, Y, Z, W;
    this(float x, float y, float z, float w = 0.0f) { X = x; Y = y; Z = z; W = w; }
}
with:
struct CalVector4 {
    float X, Y, Z, W = 0.0f;
}
I'd like the D compiler to optimize better there.
http://ideone.com/4PP2D
This line of code is not good:
auto vertices = cast(Vertex*)new Vertex[N];
This is much better; it's less bug-prone, simpler and shorter:
auto vertices = (new Vertex[N]).ptr;
But in practice in this program it is enough to allocate dynamic arrays normally, and then perform the call like this (with DMD it gives the same performance):
calculateVerticesAndNormals(boneTransforms.ptr, N, vertices.ptr, influences.ptr, output.ptr);

I was playing about with heap vs stack. Must've forgotten to remove that, sorry. :)

Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ now (on my system).

Implementation: http://ideone.com/0j0L1

Command-lines:
gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

Best times:
G++-32bit: 1140 vps
GDC-32bit: 1135 vps

Regards, Iain
Re: From a C++/JS benchmark
Walter:
A dynamic array is two values being passed, a pointer is one.

I know, but I think there are many optimization opportunities. An example:

private void foo(int[] a2) {}
void main() {
    int[100] a1;
    foo(a1);
}

In code like that I think a D compiler is free to compile it as below, because foo is private, so the compiler may perform optimizations based on just the code inside the module:

private void foo(ref int[100] a2) {}
void main() {
    int[100] a1;
    foo(a1);
}

I think there are several cases where a D compiler is free to replace the two values with just a pointer. Another example, to optimize code like this:

private void foo(int[] a1, int[] a2) {}
void main() {
    int n = 100; // run-time value
    auto a3 = new int[n];
    auto a4 = new int[n];
    foo(a3, a4);
}

into something like this:

private void foo(int* a1, int* a2, size_t a1a2len) {}
void main() {
    int n = 100;
    auto a3 = new int[n];
    auto a4 = new int[n];
    foo(a3.ptr, a4.ptr, n);
}

Bye, bearophile
Re: From a C++/JS benchmark
Iain Buclaw:
Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ now (on my system).

Are you willing to explain your changes (and maybe give a link to them)? Maybe Walter is interested for DMD too.

Command-lines:
gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

In newer versions of GCC, -Ofast implies -ffast-math too. Walter is not a lover of that -ffast-math switch. But I now think that the combination of D strongly pure functions with unsafe FP optimizations offers optimization opportunities that maybe not even GCC is able to use when it compiles C/C++ code (do you see why?). Not using this opportunity is a waste, in my opinion.

Bye, bearophile
Re: From a C++/JS benchmark
On 8/6/2011 4:46 PM, bearophile wrote: Walter is not a lover of that -ffast-math switch. No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking IEEE conformance is something very, very few should even consider.
Re: From a C++/JS benchmark
Walter:
On 8/6/2011 4:46 PM, bearophile wrote: Walter is not a lover of that -ffast-math switch.
No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking IEEE conformance is something very, very few should even consider.

I have read several papers about FP arithmetic, but I am not an expert on them yet. Both GDC and LDC have compilation switches to perform those unsafe FP optimizations, so even if you don't like them, most D compilers today offer them as options, and I don't think those switches will be removed. If you want to simulate a flock of boids (http://en.wikipedia.org/wiki/Boids ) on the screen using D, and you use floating point values to represent their speed vectors, introducing unsafe FP optimizations will not do much harm. Video games are a significant use case for the D language, and in them FP errors are often benign (maybe some parts of a game can tolerate them while other parts need to be compiled with strict FP semantics).

Bye, bearophile
Re: From a C++/JS benchmark
Trass3r:
are you willing and able to show me the asm before it gets assembled? (With gcc you do it with the -S switch.) (I also suggest using only the C standard library, with time() and printf(), to produce a smaller asm output: http://codepad.org/12EUo16J ).

You are a person of few words :-) Thank you for the asm. Apparently the program was not compiled in release mode (or with bounds checks disabled; with DMD it's the same thing, maybe with gdc it's not the same thing). It contains the calls, but they aren't to the next line; they were for the array bounds:

call _d_assert
call _d_array_bounds
call _d_array_bounds
call _d_assert_msg
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_assert_msg

But I think this doesn't fully explain the low performance; I have seen too many instructions like:

movss DWORD PTR [rsp+32], xmm1
movss DWORD PTR [rsp+16], xmm2
movss DWORD PTR [rsp+48], xmm3

If you want to go on with this exploration, then I suggest you find a way to disable bounds tests.

Bye, bearophile
Re: From a C++/JS benchmark
If you want to go on with this exploration, then I suggest you to find a way to disable bound tests. Ok, now I get up to 3293 skinned vertices per second. Still a bit worse than LDC.
Re: From a C++/JS benchmark
On 04.08.2011 04:07, Trass3r (u...@known.com) wrote:
C++: Skinned vertices per second: 4866
C++ no SIMD: Skinned vertices per second: 4242
D dmd: Skinned vertices per second: 159046
D gdc: Skinned vertices per second: 2345
D ldc: Skinned vertices per second: 3791
ldc2 -O3 -release -enable-inlining dver.d

D gdc with added -frelease -fno-bounds-check: Skinned vertices per second: 3771
Re: From a C++/JS benchmark
Trass3r:
C++ no SIMD: Skinned vertices per second: 4242
...
D gdc with added -frelease -fno-bounds-check: Skinned vertices per second: 3771

I'd like to know why the GCC back-end is able to produce a more efficient binary from the C++ code (compared to the D code), but now the problem is not as large as before. It seems I've found a benchmark, coming from real-world code, that's a worst case for DMD (GDC here produces code about 237 times faster than DMD).

Bye, bearophile
Re: From a C++/JS benchmark
I'd like to know why the GCC back-end is able to produce a more efficient binary from the C++ code (compared to the D code), but now the problem is not as large as before.

I attached both asm versions ;)

Attachments: cppver.s, dver.s
Re: From a C++/JS benchmark
On 03.08.2011 21:52, David Nadlinger (s...@klickverbot.at) wrote:
On 8/3/11 9:48 PM, Adam D. Ruppe wrote:
System: Windows XP, Core 2 Duo E6850
Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.
It doesn't, long is 32 bits wide on Windows x86_64 too (LLP64).
David

I thought he was referring to the processor being able to handle 64-bit ints more efficiently in 64-bit operation mode, on a 64-bit OS, with 64-bit executables.
Re: From a C++/JS benchmark
Are you able and willing to show me the asm produced by gdc? There's a problem there.

Attachment: bla.rar
Re: From a C++/JS benchmark
Marco Leise wrote: I thought he was referring to the processor being able to handle 64-bit ints more efficiently in 64-bit operation mode on a 64-bit OS with 64-bit executables. I was thinking a little of both but this is the main thing. My suspicion was that Java might have been using a 64 bit JVM and everything else was compiled in 32 bit, causing it to win in that place. But with a 32 bit OS that means 32 bit programs all around.
Re: From a C++/JS benchmark
Trass3r: are you able and willing to show me the asm produced by gdc? There's a problem there. [attach bla.rar] In the bla.rar attach there's the unstripped Linux binary, so to read the asm I have used the objdump disassembler. But are you willing and able to show me the asm before it gets assembled? (with gcc you do it with the -S switch). (I also suggest to use only the C standard library, with time() and printf() to produce a smaller asm output: http://codepad.org/12EUo16J ). Using objdump I see it uses 16 xmm registers, this is the main routine. But what's the purpose of those callq? They seem to call the successive asm instruction. The x86 asm of this routine contains jumps only and no call. The asm of this routine is also very long, I don't know why yet. I see too many instructions like movss 0x80(%rsp), %xmm7 this looks like a problem. _calculateVerticesAndNormals: push %r15 push %r14 push %r13 push %r12 push %rbp push %rbx sub$0x268, %rsp mov0x2a0(%rsp), %rax mov%rdi, 0xe8(%rsp) mov%rsi, 0xe0(%rsp) mov%rcx, 0x128(%rsp) mov%r8, 0x138(%rsp) mov%rax, 0xf0(%rsp) mov0x2a8(%rsp), %rax mov%rdi, 0x180(%rsp) mov%rsi, 0x188(%rsp) mov%rcx, 0x170(%rsp) mov%rax, 0xf8(%rsp) mov0x2b0(%rsp), %rax mov%r8, 0x178(%rsp) mov%rax, 0x130(%rsp) mov0x2b8(%rsp), %rax mov%rax, 0x140(%rsp) mov%rcx, %rax add%rax, %rax cmp0x130(%rsp), %rax je 74d _calculateVerticesAndNormals+0xcd mov$0x57, %edx mov$0x6, %edi mov$0x0, %esi movq $0x6, 0x190(%rsp) movq $0x0, 0x198(%rsp) callq 74d _calculateVerticesAndNormals+0xcd cmpq $0x0, 0x128(%rsp) je 1317 _calculateVerticesAndNormals+0xc97 movq $0x1, 0x120(%rsp) xor%r15d, %r15d movq $0x0, 0x100(%rsp) movslq %r15d, %r12 cmp%r12, 0xf0(%rsp) movq $0x0, 0x108(%rsp) jbef1d _calculateVerticesAndNormals+0x89d nopl 0x0(%rax) lea(%r12, %r12, 2), %rax shl$0x2, %rax mov%rax, 0x148(%rsp) mov0xf8(%rsp), %rax add0x148(%rsp), %rax movss 0x4(%rax), %xmm9 movzbl 0x8(%rax), %r13d movslq (%rax), %rax cmp0xe8(%rsp), %rax jaef50 _calculateVerticesAndNormals+0x8d0 
lea(%rax, %rax, 2), %rax shl$0x4, %rax mov%rax, 0x110(%rsp) mov0xe0(%rsp), %rbx add0x110(%rsp), %rbx je 12af _calculateVerticesAndNormals+0xc2f movss (%rbx), %xmm7 test %r13b, %r13b movss 0x4(%rbx), %xmm8 movss 0x8(%rbx), %xmm6 mulss %xmm9, %xmm7 movss 0xc(%rbx), %xmm11 mulss %xmm9, %xmm8 movss 0x10(%rbx), %xmm4 mulss %xmm9, %xmm6 movss 0x14(%rbx), %xmm5 mulss %xmm9, %xmm11 movss 0x18(%rbx), %xmm3 mulss %xmm9, %xmm4 movss 0x1c(%rbx), %xmm10 mulss %xmm9, %xmm5 movss 0x20(%rbx), %xmm1 mulss %xmm9, %xmm3 movss 0x24(%rbx), %xmm2 mulss %xmm9, %xmm10 movss 0x28(%rbx), %xmm0 mulss %xmm9, %xmm1 mulss %xmm9, %xmm2 mulss %xmm9, %xmm0 mulss 0x2c(%rbx), %xmm9 jnecdb _calculateVerticesAndNormals+0x65b add$0x1, %r12 mov%r14, %rax lea(%r12, %r12, 2), %r13 shl$0x2, %r13 jmpq 99e _calculateVerticesAndNormals+0x31e nopl (%rax) mov%r13, %rax mov0xf8(%rsp), %rdx add%rax, %rdx movss 0x4(%rdx), %xmm12 movzbl 0x8(%rdx), %r14d movslq (%rdx), %rdx cmp%rdx, 0xe8(%rsp) jbeaa0 _calculateVerticesAndNormals+0x420 mov0xe0(%rsp), %rbx lea(%rdx, %rdx, 2), %rbp shl$0x4, %rbp add%rbp, %rbx je baf _calculateVerticesAndNormals+0x52f movss (%rbx), %xmm13 add$0x1, %r12 add$0xc, %r13 test %r14b, %r14b mulss %xmm12, %xmm13 addss %xmm13, %xmm7 movss 0x4(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm8 movss 0x8(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm6 movss 0xc(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm11 movss 0x10(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm4 movss 0x14(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm5 movss 0x18(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm3 movss 0x1c(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm10 movss 0x20(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm1 movss 0x24(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm2 movss 0x28(%rbx), %xmm13 mulss %xmm12, %xmm13 mulss 0x2c(%rbx), %xmm12 addss %xmm13, %xmm0 addss %xmm12, %xmm9 jnecd8 _calculateVerticesAndNormals+0x658 add$0x1, %r15d cmp%r12, 
0xf0(%rsp) ja 890 _calculateVerticesAndNormals+0x210 mov$0x63, %edx mov$0x6, %edi mov$0x0, %esi mov%rax, 0xc8(%rsp) movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movq $0x6, 0x1c0(%rsp) movq $0x0, 0x1c8(%rsp) callq a3b _calculateVerticesAndNormals+0x3bb mov0xc8(%rsp), %rax movss (%rsp), %xmm0 movss 0x20(%rsp), %xmm1 movss 0x10(%rsp), %xmm2 movss 0x30(%rsp), %xmm3 movss
Re: From a C++/JS benchmark
But what's the purpose of those callq? They seem to call the successive asm instruction.

I find AT&T syntax to be almost impossible to read, but it looks like they are comparing the instruction pointer for some reason. call works by pushing the instruction pointer on the stack, then jumping to the new address. By calling the next instruction, you can then pop the instruction pointer off the stack and continue on where you left off. I don't know why they want this, though. That AT&T syntax really messes with my brain...
Re: From a C++/JS benchmark
03.08.2011 18:20, bearophile:
The benchmark info: http://chadaustin.me/2011/01/digging-into-javascript-performance/
The code, in C++, JS, Java, C#: https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core. D2 version translated from the C# version (the C++ version uses struct inheritance!): http://ideone.com/kf1tz
Bye, bearophile

Compilers:
C++: cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:1024
Java: Oracle Java 1.6 with hm... Oracle default settings
C#: Csc /optimize+
D2: dmd -O -noboundscheck -inline -release

Type column: working scalar type. Other columns: vertices per second (inaccuracy is about 1%) by language (tests from bearophile's message; the C++ test is skinning_test_no_simd.cpp).

System: Windows XP, Core 2 Duo E6850

Type   | C++        | Java       | C#         | D2
-------|------------|------------|------------|-----------
float  | 31_400_000 | 17_000_000 | 14_700_000 |    168_000
double | 32_300_000 | 16_000_000 | 14_100_000 |    166_000
real   | 32_300_000 | no real    | no real    |    203_000
int    | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long   | 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000

JavaScript vs C++ speed is at the first link of bearophile's original post; JS is about 10-20 times slower than C++.

Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.
Re: From a C++/JS benchmark
I believe that long in this case is 32 bits in C++, and 64 bits in the remaining languages, hence the same result for int and long in C++. Try with long long maybe? :)

-- Ziad

2011/8/3 Denis Shelomovskij (verylonglogin@gmail.com):

03.08.2011 18:20, bearophile:
The benchmark info: http://chadaustin.me/2011/01/digging-into-javascript-performance/
The code, in C++, JS, Java, C#: https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core. D2 version translated from the C# version (the C++ version uses struct inheritance!): http://ideone.com/kf1tz
Bye, bearophile

Compilers:
C++: cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:1024
Java: Oracle Java 1.6 with hm... Oracle default settings
C#: Csc /optimize+
D2: dmd -O -noboundscheck -inline -release

Type column: working scalar type. Other columns: vertices per second (inaccuracy is about 1%) by language (tests from bearophile's message; the C++ test is skinning_test_no_simd.cpp).

System: Windows XP, Core 2 Duo E6850

Type   | C++        | Java       | C#         | D2
-------|------------|------------|------------|-----------
float  | 31_400_000 | 17_000_000 | 14_700_000 |    168_000
double | 32_300_000 | 16_000_000 | 14_100_000 |    166_000
real   | 32_300_000 | no real    | no real    |    203_000
int    | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long   | 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000

JavaScript vs C++ speed is at the first link of bearophile's original post; JS is about 10-20 times slower than C++.

Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.
Re: From a C++/JS benchmark
03.08.2011 22:15, Ziad Hatahet:
I believe that long in this case is 32 bits in C++, and 64 bits in the remaining languages, hence the same result for int and long in C++. Try with long long maybe? :)
-- Ziad

2011/8/3 Denis Shelomovskij (verylonglogin@gmail.com):

03.08.2011 18:20, bearophile:
The benchmark info: http://chadaustin.me/2011/01/digging-into-javascript-performance/
The code, in C++, JS, Java, C#: https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core. D2 version translated from the C# version (the C++ version uses struct inheritance!): http://ideone.com/kf1tz
Bye, bearophile

Compilers:
C++: cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:1024
Java: Oracle Java 1.6 with hm... Oracle default settings
C#: Csc /optimize+
D2: dmd -O -noboundscheck -inline -release

Type column: working scalar type. Other columns: vertices per second (inaccuracy is about 1%) by language (tests from bearophile's message; the C++ test is skinning_test_no_simd.cpp).

System: Windows XP, Core 2 Duo E6850

Type   | C++        | Java       | C#         | D2
-------|------------|------------|------------|-----------
float  | 31_400_000 | 17_000_000 | 14_700_000 |    168_000
double | 32_300_000 | 16_000_000 | 14_100_000 |    166_000
real   | 32_300_000 | no real    | no real    |    203_000
int    | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long   | 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000

JavaScript vs C++ speed is at the first link of bearophile's original post; JS is about 10-20 times slower than C++.

Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.

Good! This is my first blunder (it's so easy to completely forget illogical (for me) language design).

So, corrected last row:

Type   | C++       | Java      | C#        | D2
-------|-----------|-----------|-----------|----------
long   | 5_500_000 | 6_600_000 | 4_400_000 | 5_800_000

Java is the fastest long language :)
Re: From a C++/JS benchmark
On 8/3/11 9:48 PM, Adam D. Ruppe wrote:
System: Windows XP, Core 2 Duo E6850
Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.

It doesn't, long is 32 bits wide on Windows x86_64 too (LLP64).

David
Re: From a C++/JS benchmark
System: Windows XP, Core 2 Duo E6850 Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.
Re: From a C++/JS benchmark
03.08.2011 22:48, Adam D. Ruppe wrote:
System: Windows XP, Core 2 Duo E6850
Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.

I meant 32-bit Windows XP (5.1, Build 2600: Service Pack 3), going by Wikipedia's description of Windows XP.
Re: From a C++/JS benchmark
Denis Shelomovskij:
(tests from bearophile's message, C++ test is skinning_test_no_simd.cpp).

For a more realistic test I suggest you also time the C++ version that uses the intrinsics (only for float).

Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.

Languages aren't slow or fast; their implementations produce assembly that's more or less efficient. A D1 version fit for LDC V1 with Tango: http://codepad.org/ewDy31UH

Vertices (millions), Linux 32 bit:
C++ no simd: 29.5
D: 27.6

LDC based on DMD v1.057 and llvm 2.6: ldc -O3 -release -inline
G++ V4.3.3: -s -O3 -mfpmath=sse -ffast-math -msse3

It's a bit slower than the C++ version, but for most people that's an acceptable difference (and maybe by porting the C++ code to D instead of the C# one, and by using a more modern LLVM, you'd reduce that loss a bit).

Bye, bearophile
Re: From a C++/JS benchmark
Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken. I'm afraid not. dmd's backend isn't good at floating point calculations.
Re: From a C++/JS benchmark
Trass3r: I'm afraid not. dmd's backend isn't good at floating point calculations. Studying a bit the asm it's not hard to find the cause, because this benchmark is quite pure (synthetic, despite I think it comes from real-world code). This is what G++ generates from the C++ code without intrinsics (the version that uses SIMD intrinsics has a similar look but it's shorter): movl (%eax), %edx movss 4(%eax), %xmm0 movl 8(%eax), %ecx leal (%edx,%edx,2), %edx sall $4, %edx addl %ebx, %edx testl %ecx, %ecx movss 12(%edx), %xmm1 movss 20(%edx), %xmm7 movss (%edx), %xmm5 mulss %xmm0, %xmm1 mulss %xmm0, %xmm7 movss 4(%edx), %xmm6 movss 8(%edx), %xmm4 movss %xmm1, (%esp) mulss %xmm0, %xmm5 movss 28(%edx), %xmm1 movss %xmm7, 4(%esp) mulss %xmm0, %xmm6 movss 32(%edx), %xmm7 mulss %xmm0, %xmm1 movss 16(%edx), %xmm3 mulss %xmm0, %xmm7 movss 24(%edx), %xmm2 movss %xmm1, 16(%esp) mulss %xmm0, %xmm4 movss 36(%edx), %xmm1 movss %xmm7, 8(%esp) mulss %xmm0, %xmm3 movss 40(%edx), %xmm7 mulss %xmm0, %xmm2 mulss %xmm0, %xmm1 mulss %xmm0, %xmm7 mulss 44(%edx), %xmm0 leal 12(%eax), %edx movss %xmm7, 12(%esp) movss %xmm0, 20(%esp) This is what DMD generates for the same (or quite similar) piece of code: movsd mov EAX,068h[ESP] imul EDX,EAX,030h add EDX,018h[ESP] fld float ptr [EDX] fmul float ptr 06Ch[ESP] fstp float ptr 038h[ESP] fld float ptr 4[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 03Ch[ESP] fld float ptr 8[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 040h[ESP] fld float ptr 0Ch[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 044h[ESP] fld float ptr 010h[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 048h[ESP] fld float ptr 014h[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 04Ch[ESP] fld float ptr 018h[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 050h[ESP] fld float ptr 01Ch[EDX] mov CL,070h[ESP] xor CL,1 fmul float ptr 06Ch[ESP] fstp float ptr 054h[ESP] fld float ptr 020h[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 058h[ESP] fld float ptr 024h[EDX] fmul float ptr 06Ch[ESP] fstp float 
ptr 05Ch[ESP] fld float ptr 028h[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 060h[ESP] fld float ptr 02Ch[EDX] fmul float ptr 06Ch[ESP] fstp float ptr 064h[ESP] I think DMD back-end already contains logic to use xmm registers as true registers (not as a floating point stack or temporary holes where to push and pull FP values), so I suspect it doesn't take too much work to modify it to emit FP asm with a single optimization: just keep the values inside registers. In my uninformed opinion all other FP optimizations are almost insignificant compared to this one :-) Bye, bearophile
Re: From a C++/JS benchmark
C++: Skinned vertices per second: 4866
C++ no SIMD: Skinned vertices per second: 4242
D dmd: Skinned vertices per second: 159046
D gdc: Skinned vertices per second: 2345

Compilers:
gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)
g++ -s -O3 -mfpmath=sse -ffast-math -march=native

DMD64 D Compiler v2.054
dmd -O -noboundscheck -inline -release dver.d

gcc version 4.6.1 20110627 (gdc 0.30, using dmd 2.054) (GCC)
gdc -s -O3 -mfpmath=sse -ffast-math -march=native dver.d

Ubuntu 11.04 x64, Core2 Duo E6300
Re: From a C++/JS benchmark
C++: Skinned vertices per second: 4866
C++ no SIMD: Skinned vertices per second: 4242
D dmd: Skinned vertices per second: 159046
D gdc: Skinned vertices per second: 2345
D ldc: Skinned vertices per second: 3791
ldc2 -O3 -release -enable-inlining dver.d
Re: From a C++/JS benchmark
Trass3r: C++ no SIMD: Skinned vertices per second: 4242 ... D gdc: Skinned vertices per second: 2345 Are you able and willing to show me the asm produced by gdc? There's a problem there. Bye, bearophile