Hi Bill, thanks for answering so fast!
I still think that architecture-specific optimizations are not the source of the problem, since the two generic C versions also show the 8% performance discrepancy. If the C code is unchanged, what about memory allocation? Were there any changes there?

I'll start looking into the code changes myself to see whether I can identify anything that might cause the problem.

Marcus

On Sunday, 5 March 2017 at 02:56:47 UTC+1, Bill Hart wrote:
>
> I can't think of anything specific that we changed that should affect the
> speed in this way. We literally didn't touch the assembly code for any of
> those architectures and the C code has only changed for powering mod n.
>
> You are probably noticing a slowdown due to the library being rearranged
> slightly at the level of linking, or something like that.
>
> Of course there is no Broadwell specific code, and perhaps when there is,
> you will again notice a speedup.
>
> Bill.
>
> On 4 March 2017 at 21:57, <kess...@schema.de> wrote:
>
>> Hi,
>>
>> first of all, thanks for all the great MPIR work! I've been using it for
>> about 4 years to compute visually compelling deep Mandelbrot zoom videos.
>>
>> Yesterday I've downloaded 3.0.0 and compiled it using VS 2015 U3 on an
>> Intel Core i7 6900K (8 cores, Broadwell) on Windows 10 64.
>>
>> Unfortunately, 3.0.0 seems to be slower than 2.7.2 by about 8% when small
>> floats are used. By small floats I mean a precision of up to 256 bits (4
>> limbs on x64).
>>
>> Compilation worked flawlessly for all of the 10 architectures I've
>> selected. Just to make sure Visual Studio updates are not the source of the
>> problem, I also recompiled the 7 architectures I've been testing with 2.7.2.
>>
>> The stats below are based on several hundred million Mandelbrot
>> iterations for each data point. All 16 threads of the 6900K are used and
>> all of them are at 100% capacity.
>>
>> I get the following speedup matrix for 128 precision floats over all
>> compiled versions and architectures:
>>
>> Results from file: Run_2017-03-04T20_10_18.xml; number model: GMP128
>>
>>  1 mpir_3_0_0_x64_gc                     MFlops: 252.6
>>  2 mpir_2_7_2_x64_gc                     MFlops: 274.7
>>    Speedup:  8.73%
>>  3 mpir_3_0_0_x64_haswell_avx            MFlops: 357.9
>>    Speedup: 41.67% 30.30%
>>  4 mpir_3_0_0_x64_skylake_avx            MFlops: 365.9
>>    Speedup: 44.82% 33.20%  2.22%
>>  5 mpir_3_0_0_x64_haswell                MFlops: 368.1
>>    Speedup: 45.72% 34.02%  2.86%  0.62%
>>  6 mpir_3_0_0_x64_skylake                MFlops: 371.0
>>    Speedup: 46.84% 35.05%  3.65%  1.39%  0.77%
>>  7 mpir_3_0_0_x64_core2                  MFlops: 377.0
>>    Speedup: 49.23% 37.26%  5.34%  3.05%  2.41%  1.63%
>>  8 mpir_3_0_0_x64_sandybridge_ivybridge  MFlops: 386.7
>>    Speedup: 53.07% 40.79%  8.05%  5.70%  5.05%  4.25%  2.57%
>>  9 mpir_3_0_0_x64_nehalem_westmere       MFlops: 389.3
>>    Speedup: 54.10% 41.74%  8.78%  6.41%  5.76%  4.95%  3.26%  0.67%
>> 10 mpir_3_0_0_x64_nehalem                MFlops: 389.5
>>    Speedup: 54.19% 41.82%  8.84%  6.47%  5.82%  5.01%  3.32%  0.73%  0.06%
>> 11 mpir_3_0_0_x64_sandybridge            MFlops: 395.1
>>    Speedup: 56.39% 43.84% 10.39%  7.99%  7.33%  6.51%  4.80%  2.17%  1.48%  1.43%
>> 12 mpir_2_7_2_x64_haswell                MFlops: 398.3
>>    Speedup: 57.66% 45.01% 11.28%  8.87%  8.20%  7.37%  5.65%  3.00%  2.31%  2.25%  0.81%
>> 13 mpir_2_7_2_x64_sandybridge_ivybridge  MFlops: 404.3
>>    Speedup: 60.04% 47.20% 12.97% 10.51%  9.83%  8.99%  7.24%  4.55%  3.85%  3.79%  2.33%  1.51%
>> 14 mpir_2_7_2_x64_sandybridge            MFlops: 405.2
>>    Speedup: 60.40% 47.52% 13.22% 10.76% 10.07%  9.23%  7.48%  4.78%  4.08%  4.02%  2.56%  1.74%  0.22%
>> 15 mpir_2_7_2_x64_nehalem_westmere       MFlops: 417.3
>>    Speedup: 65.16% 51.91% 16.58% 14.05% 13.35% 12.48% 10.67%  7.90%  7.18%  7.12%  5.61%  4.76%  3.20%  2.97%
>> 16 mpir_2_7_2_x64_core2                  MFlops: 419.0
>>    Speedup: 65.85% 52.54% 17.07% 14.53% 13.82% 12.95% 11.14%  8.35%  7.62%  7.56%  6.05%  5.20%  3.63%  3.40%  0.42%
>> 17 mpir_2_7_2_x64_nehalem                MFlops: 422.8
>>    Speedup: 67.37% 53.94% 18.14% 15.58% 14.86% 13.99% 12.16%  9.34%  8.61%  8.55%  7.02%  6.16%  4.58%  4.35%  1.34%  0.92%
>>
>>                  1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16
>>
>> I've taken these measurements three times with the same results.
>>
>> The six fastest versions are all 2.7.2.
>>
>> Note that architectural compilation and the Broadwell CPU do not seem to
>> be the issue, since the slowest two versions, the generic C
>> mpir_3_0_0_x64_gc and mpir_2_7_2_x64_gc also differ by about 8%. Both
>> compiled on the same machine within 5 minutes of each other with VS 2015.
>>
>> Another hint that architectural compilation and optimization is working
>> fine, is that once I test with 1024 bits precision, the fastest version is
>> mpir_3_0_0_x64_skylake_avx (the Broadwell CPU used in this test already has
>> most of the improvements of Skylake). Unfortunately, I very rarely zoom
>> down to a magnification that needs 1024 bits.
>>
>> I have not done any tuning yet, but my understanding is that for limb
>> sizes 1, 2 or 3 it should not matter anyway.
>>
>> Any hints or ideas on what I may be doing wrong?
>>
>> Does this also happen on other OSes/CPUs?
>>
>> Thanks and best regards,
>>
>> Marcus
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "mpir-devel" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to mpir-devel+...@googlegroups.com.
>> To post to this group, send email to mpir-...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/mpir-devel.
>> For more options, visit https://groups.google.com/d/optout.
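
A note on reading the quoted speedup matrix: each "Speedup" row appears to give the relative gain of that build over every slower build listed before it, in column order. The sketch below assumes the formula (MFlops_fast / MFlops_slow - 1) * 100; the benchmark tool's actual code is not shown in this thread, so the function names and the formula are assumptions. Because it starts from the rounded MFlops values printed in the mail, its results agree with the quoted percentages only to within a few hundredths of a percent.

```python
# Hypothetical reconstruction of the speedup matrix from the quoted results.
# Only a subset of the 17 builds is listed, for brevity.
results = [
    ("mpir_3_0_0_x64_gc", 252.6),
    ("mpir_2_7_2_x64_gc", 274.7),
    ("mpir_3_0_0_x64_haswell_avx", 357.9),
    ("mpir_2_7_2_x64_nehalem", 422.8),
]


def speedup_percent(faster, slower):
    """Relative speedup of `faster` over `slower` in percent (assumed formula)."""
    return (faster / slower - 1.0) * 100.0


def speedup_matrix(entries):
    """Sort builds by MFlops; row i holds the speedup of build i over each slower build."""
    ordered = sorted(entries, key=lambda entry: entry[1])
    return [
        (name, [speedup_percent(mflops, prev) for _, prev in ordered[:i]])
        for i, (name, mflops) in enumerate(ordered)
    ]


if __name__ == "__main__":
    for name, row in speedup_matrix(results):
        print(f"{name:28s}", " ".join(f"{value:6.2f}%" for value in row))
```

Read this way, the first column of the matrix compares every build against the slowest one (the generic C build of 3.0.0), and the second column compares against the generic C build of 2.7.2.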