[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

Iain Buclaw changed:

           What    |Removed                     |Added
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #16 from Iain Buclaw ---
cfloat_unary_add: 15 secs, 195 ms, 935 μs, and 5 hnsecs
std_cfloat_unary_add: 2 secs, 491 ms, 834 μs, and 9 hnsecs
cfloat_unary_sub: 14 secs, 926 ms, 587 μs, and 6 hnsecs
std_cfloat_unary_sub: 4 secs, 858 ms, 349 μs, and 4 hnsecs
cfloat_binary_add: 22 secs, 363 ms, 951 μs, and 9 hnsecs
std_cfloat_binary_add: 5 secs, 403 ms, 108 μs, and 9 hnsecs
cfloat_binary_sub: 22 secs, 236 ms, and 902 μs
std_cfloat_binary_sub: 5 secs, 266 ms, 697 μs, and 6 hnsecs
cfloat_binary_mul: 24 secs, 858 ms, 63 μs, and 7 hnsecs
std_cfloat_binary_mul: 7 secs, 186 ms, 291 μs, and 8 hnsecs
cfloat_binary_div: 30 secs, 225 ms, 114 μs, and 4 hnsecs
std_cfloat_binary_div: 17 secs, 900 ms, 164 μs, and 6 hnsecs
cfloat_binary_div(FastMath): 29 secs, 230 ms, 821 μs, and 5 hnsecs
std_cfloat_binary_div(FastMath): 12 secs, 208 ms, 118 μs, and 7 hnsecs

cdouble_unary_add: 2 secs, 788 ms, 525 μs, and 6 hnsecs
std_cdouble_unary_add: 2 secs, 922 ms, 224 μs, and 1 hnsec
cdouble_unary_sub: 2 secs, 502 ms, and 734 μs
std_cdouble_unary_sub: 2 secs, 915 ms, 203 μs, and 9 hnsecs
cdouble_binary_add: 2 secs, 869 ms, 820 μs, and 1 hnsec
std_cdouble_binary_add: 3 secs, 108 ms, 545 μs, and 4 hnsecs
cdouble_binary_sub: 2 secs, 836 ms, 796 μs, and 5 hnsecs
std_cdouble_binary_sub: 3 secs, 159 ms, 209 μs, and 3 hnsecs
cdouble_binary_mul: 4 secs, 785 ms, 197 μs, and 6 hnsecs
std_cdouble_binary_mul: 5 secs, 197 ms, 572 μs, and 9 hnsecs
cdouble_binary_div: 14 secs, 238 ms, 332 μs, and 6 hnsecs
std_cdouble_binary_div: 15 secs, 933 ms, 301 μs, and 8 hnsecs
cdouble_binary_div(FastMath): 10 secs, 700 ms, and 32 μs
std_cdouble_binary_div(FastMath): 11 secs, 8 ms, 868 μs, and 5 hnsecs

creal_unary_add: 8 secs, 183 ms, 254 μs, and 3 hnsecs
std_creal_unary_add: 14 secs, 72 ms, 96 μs, and 2 hnsecs
creal_unary_sub: 8 secs, 425 ms, 681 μs, and 9 hnsecs
std_creal_unary_sub: 10 secs, 854 ms, 312 μs, and 8 hnsecs
creal_binary_add: 3 minutes, 50 secs, 877 ms, 637 μs, and 6 hnsecs
std_creal_binary_add: 3 minutes, 57 secs, 397 ms, 952 μs, and 4 hnsecs
creal_binary_sub: 4 minutes, 4 secs, 982 ms, 715 μs, and 2 hnsecs
std_creal_binary_sub: 4 minutes, 11 secs, 485 ms, 74 μs, and 8 hnsecs
creal_binary_mul: 11 minutes, 31 secs, 328 ms, 600 μs, and 7 hnsecs
std_creal_binary_mul: 11 minutes, 46 secs, 26 ms, 451 μs, and 2 hnsecs
creal_binary_div: 20 minutes, 48 secs, 778 ms, and 747 μs
std_creal_binary_div: 20 minutes, 2 secs, 439 ms, and 535 μs
creal_binary_div(FastMath): 18 minutes, 38 secs, 613 ms, 679 μs, and 6 hnsecs
std_creal_binary_div(FastMath): 18 minutes, 42 secs, 400 ms, 343 μs, and 7 hnsecs

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

Iain Buclaw changed:

           What    |Removed                     |Added
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #15 from Iain Buclaw ---
Not sure if this should really be marked as resolved/fixed, but anyhow...

With the following (lazy) function generator:
---
import std.complex : C = Complex;
import std.meta : AliasSeq;
import std.format : format;

static foreach (T; AliasSeq!(cfloat, cdouble, creal))
{
    // Unary operators
    mixin(format!"%s %s_unary_add(%s a) { return +a; }"
          (T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_unary_sub(%s a) { return -a; }"
          (T.stringof, T.stringof, T.stringof));
    // Binary operators
    mixin(format!"%s %s_binary_add(%s a, %s b) { return a + b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_sub(%s a, %s b) { return a - b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_mul(%s a, %s b) { return a * b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_div(%s a, %s b) { return a / b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
}

static foreach (T; AliasSeq!(float, double, real))
{
    // Unary operators
    mixin(format!"C!%s std_c%s_unary_add(C!%s a) { return +a; }"
          (T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_unary_sub(C!%s a) { return -a; }"
          (T.stringof, T.stringof, T.stringof));
    // Binary operators
    mixin(format!"C!%s std_c%s_binary_add(C!%s a, C!%s b) { return a + b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_sub(C!%s a, C!%s b) { return a - b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_mul(C!%s a, C!%s b) { return a * b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_div(C!%s a, C!%s b) { return a / b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
}
---

On x86_64/GDC, the results are:

cfloat_unary_add:
    movq    %xmm0, -8(%rsp)
    movss   -8(%rsp), %xmm0
    movss   %xmm0, -16(%rsp)
    movss   -4(%rsp), %xmm0
    movss   %xmm0, -12(%rsp)
    movq    -16(%rsp), %xmm0
    ret
---
std_cfloat_unary_add:
    ret

cdouble_unary_add:
    ret
---
std_cdouble_unary_add:
    ret

creal_unary_add:
    fldt    8(%rsp)
    fldt    24(%rsp)
    fxch    %st(1)
    ret
---
std_creal_unary_add:
    movdqa  8(%rsp), %xmm0
    movdqa  24(%rsp), %xmm1
    movq    %rdi, %rax
    movaps  %xmm0, (%rdi)
    movaps  %xmm1, 16(%rdi)
    ret

cfloat_unary_sub:
    movq    %xmm0, -8(%rsp)
    movss   -8(%rsp), %xmm0
    movss   .LC4(%rip), %xmm2
    movaps  %xmm0, %xmm1
    movss   -4(%rsp), %xmm0
    xorps   %xmm2, %xmm1
    xorps   %xmm2, %xmm0
    movss   %xmm1, -16(%rsp)
    movss   %xmm0, -12(%rsp)
    movq    -16(%rsp), %xmm0
    ret
.LC4:
    .long   -2147483648
    .long   0
    .long   0
    .long   0
---
std_cfloat_unary_sub:
    movq    .LC7(%rip), %xmm1
    xorps   %xmm1, %xmm0
    ret
.LC7:
    .long   -2147483648
    .long   -2147483648

cdouble_unary_sub:
    movq    .LC5(%rip), %xmm2
    xorpd   %xmm2, %xmm1
    xorpd   %xmm2, %xmm0
    ret
.LC5:
    .long   0
    .long   -2147483648
    .long   0
    .long   0
---
std_cdouble_unary_sub:
    movq    %xmm0, -24(%rsp)
    movq    %xmm1, -16(%rsp)
    movapd  -24(%rsp), %xmm2
    xorpd   .LC8(%rip), %xmm2
    movaps  %xmm2, -24(%rsp)
    movsd   -16(%rsp), %xmm1
    movsd   -24(%rsp), %xmm0
    ret
.LC8:
    .long   0
    .long   -2147483648
    .long   0
    .long   -2147483648

creal_unary_sub:
    fldt    8(%rsp)
    fchs
    fldt    24(%rsp)
    fchs
    fxch    %st(1)
    ret
---
std_creal_unary_sub:
    fldt    24(%rsp)
    movq    %rdi, %rax
    fchs
    fldt    8(%rsp)
    fchs
    fstpt   (%rdi)
    fstpt   16(%rdi)
    ret

cfloat_binary_add:
    movq    %xmm0, -8(%rsp)
    movq    %xmm1, -16(%rsp)
    movss   -8(%rsp), %xmm1
    movss   -16(%rsp), %xmm0
    addss   %xmm0, %xmm1
    movss   -12(%rsp), %xmm0
    addss   -4(%rsp), %xmm0
    movss   %xmm1, -24(
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627 --- Comment #14 from ponce --- No problem from here, a lot of our complex code is now SIMD. I doubt we'll see a practical problem apart from the transition work. It's easy to recreate the desired division algorithm manually if ever needed. --
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #13 from Iain Buclaw ---
(In reply to ponce from comment #4)
> RESULTS
>
> * With ldc 1.8.0 64-bit:
>
> $ ldc2.exe -O3 -enable-inlining -release divide.d -m64
> $ divide.exe
>
> With cfloat: 7 secs, 623 ms, 829 μs, and 9 hnsecs
> With cdouble: 7 secs, 594 ms, 449 μs, and 8 hnsecs
> With Complex!float: 7 secs, 988 ms, 642 μs, and 4 hnsecs
> With Complex!double: 15 secs, 501 ms, 128 μs, and 4 hnsecs
>
>
> * With ldc 1.8.0 32-bit:
>
> $ ldc2.exe -O3 -enable-inlining -release divide.d -m32
> $ divide.exe
>
> With cfloat: 7 secs, 618 ms, 202 μs, and 1 hnsec
> With cdouble: 7 secs, 593 ms, 777 μs, and 2 hnsecs
> With Complex!float: 7 secs, 958 ms, 692 μs, and 9 hnsecs
> With Complex!double: 15 secs, 414 ms, and 344 μs
>
>
> This shows that even with latest LDC you can have a regression.
>
> I appreciate that std.complex gives more precision in the divide operation,
> it's also something that is _different_ from builtin complex it replaces.

A bug probably should be raised against LDC for not using range reduction (i.e. Smith's algorithm) in their native complex division implementation.

The slowdown is not a regression; LDC is just using the wrong algorithm by default (i.e. the "fast" naive version should be generated only when compiling with `-ffast-math`).

GDC and LDC could coordinate with each other and predefine `version(FastMath)` when the `-ffast-math` flag is given on the command line.

--
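For readers unfamiliar with the two algorithms contrasted in comment #13, here is a minimal sketch in D. It is not the Phobos or LDC implementation: `Cx`, `naiveDivide`, `smithDivide` and `divide` are hypothetical names made up for illustration, and `version (FastMath)` is used only in the sense proposed in the comment (assume it has to be set by hand, for example with a compiler's user version switch, unless a compiler actually predefines it).

```d
import std.math : fabs;

/// Stand-in complex type for illustration only (hypothetical, not std.complex).
struct Cx { double re, im; }

/// Textbook formula: forms c*c + d*d, so very large or very small
/// divisor components can overflow or underflow in the intermediate.
Cx naiveDivide(Cx x, Cx y)
{
    immutable d = y.re * y.re + y.im * y.im;
    return Cx((x.re * y.re + x.im * y.im) / d,
              (x.im * y.re - x.re * y.im) / d);
}

/// Smith's algorithm: scale by the ratio of the smaller divisor
/// component to the larger one, so no squares are formed. The price
/// is the two fabs() calls and a branch.
Cx smithDivide(Cx x, Cx y)
{
    if (fabs(y.re) >= fabs(y.im))
    {
        immutable r = y.im / y.re;
        immutable d = y.re + y.im * r;
        return Cx((x.re + x.im * r) / d, (x.im - x.re * r) / d);
    }
    else
    {
        immutable r = y.re / y.im;
        immutable d = y.re * r + y.im;
        return Cx((x.re * r + x.im) / d, (x.im * r - x.re) / d);
    }
}

/// Select the algorithm at compile time, as the comment proposes.
Cx divide(Cx x, Cx y)
{
    version (FastMath)
        return naiveDivide(x, y);
    else
        return smithDivide(x, y);
}

void main()
{
    import std.stdio : writeln;

    // Components near 1e200: their squares do not fit in a double.
    auto big = Cx(1e200, 1e200);
    auto q = divide(big, big);
    // Without FastMath the Smith path is taken and the quotient is 1 + 0i.
    writeln(q.re, " ", q.im);
}
```

The extra comparison and branch in `smithDivide` are the likely source of the speed difference measured in the thread, which is why gating the naive form behind an explicit fast-math opt-in is a reasonable trade.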
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

Dlang Bot changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #12 from Dlang Bot ---
dlang/phobos pull request #7814 "fix Issue 18627 - Use cephes algorithm for complex divide" was merged into master:

- 70595f5d51011a6258d001523c8749411b9d8152 by Iain Buclaw:
  fix Issue 18627 - Use cephes algorithm for complex divide

https://github.com/dlang/phobos/pull/7814

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

Dlang Bot changed:

           What    |Removed                     |Added
           Keywords|                            |pull

--- Comment #11 from Dlang Bot ---
@ibuclaw created dlang/phobos pull request #7814 "fix Issue 18627 - Use cephes algorithm for complex divide" fixing this issue:

- fix Issue 18627 - Use cephes algorithm for complex divide

https://github.com/dlang/phobos/pull/7814

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #10 from Iain Buclaw ---
(In reply to ponce from comment #4)
> This benchmark is a variation that does only division.
>
> --- divide.d

* With gdc -O2 -frelease -m64

With cfloat: 11 secs, 204 ms, 475 μs, and 2 hnsecs
With cdouble: 13 secs, 420 ms, 497 μs, and 6 hnsecs
With Complex!float: 4 secs, 689 ms, 546 μs, and 2 hnsecs
With Complex!double: 8 secs, 903 ms, 172 μs, and 4 hnsecs

* With gdc -O2 -frelease -m32

With cfloat: 29 secs, 471 ms, 678 μs, and 9 hnsecs
With cdouble: 29 secs, 176 ms, 189 μs, and 2 hnsecs
With Complex!float: 13 secs, 379 ms, 856 μs, and 8 hnsecs
With Complex!double: 18 secs, 240 ms, 975 μs, and 5 hnsecs

Native complex floating point must die.

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627 --- Comment #9 from ponce --- I think at the very least std.complex should contain a function to divide Complex without the additional precision provided by the check with the 2 fabs(). People that want speed could opt-in, and others will enjoy increased precision without noticing. --
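As a rough illustration of the opt-in suggested in comment #9, the sketch below divides two `Complex!T` values with the textbook formula and no `fabs()` comparison. `fastDivide` is a hypothetical name, not a function that exists in std.complex; it trades the range protection of the default `/` operator for fewer operations.

```d
import std.complex : Complex, complex;

/// Hypothetical helper (not part of Phobos): textbook complex division,
/// (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c*c + d*d), with no
/// magnitude check on the divisor.
Complex!T fastDivide(T)(Complex!T x, Complex!T y) @safe pure nothrow @nogc
{
    immutable T denom = y.re * y.re + y.im * y.im;
    return Complex!T((x.re * y.re + x.im * y.im) / denom,
                     (x.im * y.re - x.re * y.im) / denom);
}

void main()
{
    import std.stdio : writeln;

    // (1 + 2i) / (3 + 4i) is (11 + 2i) / 25, i.e. 0.44 + 0.08i.
    auto q = fastDivide(complex(1.0, 2.0), complex(3.0, 4.0));
    writeln(q.re, " ", q.im);
}
```

Callers that know their operands are well scaled could use such a helper in hot loops, while everything else keeps using the operator and its extra precision.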
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #8 from ponce ---
@Seb: It's not only about DMD; there is a 2x performance regression with Complex!double vs cdouble using LDC. There are probably more I haven't exposed yet.

And yes, I use cdouble for designing IIR filters, in a real-time program. Our main product uses builtin complexes, and it's downloaded 2000 times per month.

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

Seb changed:

           What    |Removed                     |Added
                 CC|                            |greensunn...@gmail.com

--- Comment #7 from Seb ---
> Division with DMD 32-bit:

Using DMD for any performance arguments is a bit of a moot point as DMD's optimizer is pretty bad. So this would halt almost all development as there are many many performance regressions with DMD.

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627 --- Comment #6 from ponce --- Conversely complex divide seems faster with DMD with std.complex than builtins. --
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #5 from ponce ---
Division with DMD 32-bit:

With cfloat: 1 minute, 18 secs, 451 ms, 932 μs, and 9 hnsecs
With cdouble: 1 minute, 19 secs, 747 ms, 70 μs, and 5 hnsecs
With Complex!float: 27 secs, 412 ms, 926 μs, and 5 hnsecs
With Complex!double: 25 secs, 39 ms, 159 μs, and 2 hnsecs

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #4 from ponce ---
This benchmark is a variation that does only division.

--- divide.d
import std.string;
import std.datetime;
import std.datetime.stopwatch : benchmark, StopWatch;
import std.complex;
import std.stdio;
import std.math;

void main()
{
    int[] divider = new int[1024];
    cfloat[] A = new cfloat[1024];
    cdouble[] B = new cdouble[1024];
    Complex!float[] C = new Complex!float[1024];
    Complex!double[] D = new Complex!double[1024];

    foreach(i; 0..1024)
    {
        divider[i] = (i*69060) / 1;
        // Initialize with something
        A[i] = i + 1i;
        B[i] = i + 1i;
        C[i] = Complex!float(i, 1);
        D[i] = Complex!double(i, 1);
    }

    void justDivide(ComplexType)(ComplexType[] arr)
    {
        int size = cast(int)(arr.length);
        for (int i = 0; i < size; ++i)
        {
            arr[i] = divider[i] / arr[i];
        }
    }

    void fA() { justDivide!(cfloat)(A); }
    void fB() { justDivide!(cdouble)(B); }
    void fC() { justDivide!(Complex!float)(C); }
    void fD() { justDivide!(Complex!double)(D); }

    auto r = benchmark!(fA, fB, fC, fD)(100);
    {
        writefln("With cfloat: %s", r[0] );
        writefln("With cdouble: %s", r[1] );
        writefln("With Complex!float: %s", r[2] );
        writefln("With Complex!double: %s", r[3] );
    }
}
---

RESULTS

* With ldc 1.8.0 64-bit:

$ ldc2.exe -O3 -enable-inlining -release divide.d -m64
$ divide.exe

With cfloat: 7 secs, 623 ms, 829 μs, and 9 hnsecs
With cdouble: 7 secs, 594 ms, 449 μs, and 8 hnsecs
With Complex!float: 7 secs, 988 ms, 642 μs, and 4 hnsecs
With Complex!double: 15 secs, 501 ms, 128 μs, and 4 hnsecs

* With ldc 1.8.0 32-bit:

$ ldc2.exe -O3 -enable-inlining -release divide.d -m32
$ divide.exe

With cfloat: 7 secs, 618 ms, 202 μs, and 1 hnsec
With cdouble: 7 secs, 593 ms, 777 μs, and 2 hnsecs
With Complex!float: 7 secs, 958 ms, 692 μs, and 9 hnsecs
With Complex!double: 15 secs, 414 ms, and 344 μs

This shows that even with the latest LDC you can have a regression.

I appreciate that std.complex gives more precision in the divide operation, but it's also something that is _different_ from the builtin complex it replaces.

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627

Iain Buclaw changed:

           What    |Removed                     |Added
                 CC|                            |ibuc...@gdcproject.org

--- Comment #3 from Iain Buclaw ---
FYI, GDC is missing, but I'll post it anyway, along with DMD as a comparative benchmark, because each machine is different and DMD may optimize weirdly for one CPU but is perfectly fine for another (see for instance issue 5100).

DMD64 D Compiler v2.076.1
---
$ dmd complex.d -O -inline -release
With cfloat: 75 ms, 688 μs, and 2 hnsecs
With cdouble: 61 ms, 546 μs, and 7 hnsecs
With Complex!float: 161 ms, 816 μs, and 8 hnsecs
With Complex!double: 109 ms, 66 μs, and 1 hnsec
---

There seems to be room for improvement in dmd or the general phobos implementation.

gdc (GCC) 8.0.1 20180226 (2.076.1 library and patches)
---
$ gdc complex.d -O2 -frelease
With cfloat: 154 ms, 871 μs, and 8 hnsecs
With cdouble: 59 ms, 205 μs, and 7 hnsecs
With Complex!float: 32 ms, 566 μs, and 5 hnsecs
With Complex!double: 34 ms, 961 μs, and 6 hnsecs
---

However with gdc, std.complex is /faster/ than native.

--
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627 --- Comment #2 from ponce --- I've posted there, thanks. --
[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching
https://issues.dlang.org/show_bug.cgi?id=18627 greenify changed: What|Removed |Added CC||greeen...@gmail.com --- Comment #1 from greenify --- See also: https://github.com/dlang/dmd/pull/7640 --