Re: [mpir-devel] Mpir 3.0.0 seems to be about 8% slower than 2.7.2 for small floats (1-4 limbs)

kesseler Sun, 05 Mar 2017 02:29:09 -0800

Hi Bill,

Thanks for answering so quickly!


I still think that architecture-specific optimizations are not the source 
of the problem, since the two generic C versions also show the 8% 
performance discrepancy.

If the C code is unchanged, what about memory allocation? Any changes there?

I'll start looking into code changes myself, to see if I can identify any 
changes that might cause the problem.

Marcus


On Sunday, 5 March 2017 at 02:56:47 UTC+1, Bill Hart wrote:
>
> I can't think of anything specific that we changed that should affect the 
> speed in this way. We literally didn't touch the assembly code for any of 
> those architectures and the C code has only changed for powering mod n.
>
> You are probably noticing a slowdown due to the library being rearranged 
> slightly at the level of linking, or something like that.
>
> Of course there is no Broadwell specific code, and perhaps when there is, 
> you will again notice a speedup.
>
> Bill.
>
> On 4 March 2017 at 21:57, <kess...@schema.de> wrote:
>
>> Hi,
>>
>> first of all, thanks for all the great MPIR work! I've been using it for 
>> about 4 years to compute visually compelling deep Mandelbrot zoom videos.
>>
>> Yesterday I downloaded 3.0.0 and compiled it using VS 2015 U3 on an 
>> Intel Core i7 6900K (8 cores, Broadwell) on 64-bit Windows 10.
>>
>> Unfortunately, 3.0.0 seems to be slower than 2.7.2 by about 8% when small 
>> floats are used. By small floats I mean a precision of up to 256 bits (4 
>> limbs on x64).
>>
>> Compilation worked flawlessly for all 10 architectures I selected. 
>> Just to make sure Visual Studio updates are not the source of the 
>> problem, I also recompiled the 7 architectures I had been testing with 2.7.2.
>>
>> The stats below are based on several hundred million Mandelbrot 
>> iterations for each data point. All 16 threads of the 6900K are used and 
>> all of them are at 100% capacity.
>>
>> I get the following speedup matrix for 128-bit precision floats over all 
>> compiled versions and architectures:
>>
>> Results from file: Run_2017-03-04T20_10_18.xml; number model: GMP128
>>    1 mpir_3_0_0_x64_gc                          MFlops:    252.6
>>    2 mpir_2_7_2_x64_gc                          MFlops:    274.7 Speedup:     8.73%
>>    3 mpir_3_0_0_x64_haswell_avx                 MFlops:    357.9 Speedup:    41.67%    30.30%
>>    4 mpir_3_0_0_x64_skylake_avx                 MFlops:    365.9 Speedup:    44.82%    33.20%     2.22%
>>    5 mpir_3_0_0_x64_haswell                     MFlops:    368.1 Speedup:    45.72%    34.02%     2.86%     0.62%
>>    6 mpir_3_0_0_x64_skylake                     MFlops:    371.0 Speedup:    46.84%    35.05%     3.65%     1.39%     0.77%
>>    7 mpir_3_0_0_x64_core2                       MFlops:    377.0 Speedup:    49.23%    37.26%     5.34%     3.05%     2.41%     1.63%
>>    8 mpir_3_0_0_x64_sandybridge_ivybridge       MFlops:    386.7 Speedup:    53.07%    40.79%     8.05%     5.70%     5.05%     4.25%     2.57%
>>    9 mpir_3_0_0_x64_nehalem_westmere            MFlops:    389.3 Speedup:    54.10%    41.74%     8.78%     6.41%     5.76%     4.95%     3.26%     0.67%
>>   10 mpir_3_0_0_x64_nehalem                     MFlops:    389.5 Speedup:    54.19%    41.82%     8.84%     6.47%     5.82%     5.01%     3.32%     0.73%     0.06%
>>   11 mpir_3_0_0_x64_sandybridge                 MFlops:    395.1 Speedup:    56.39%    43.84%    10.39%     7.99%     7.33%     6.51%     4.80%     2.17%     1.48%     1.43%
>>   12 mpir_2_7_2_x64_haswell                     MFlops:    398.3 Speedup:    57.66%    45.01%    11.28%     8.87%     8.20%     7.37%     5.65%     3.00%     2.31%     2.25%     0.81%
>>   13 mpir_2_7_2_x64_sandybridge_ivybridge       MFlops:    404.3 Speedup:    60.04%    47.20%    12.97%    10.51%     9.83%     8.99%     7.24%     4.55%     3.85%     3.79%     2.33%     1.51%
>>   14 mpir_2_7_2_x64_sandybridge                 MFlops:    405.2 Speedup:    60.40%    47.52%    13.22%    10.76%    10.07%     9.23%     7.48%     4.78%     4.08%     4.02%     2.56%     1.74%     0.22%
>>   15 mpir_2_7_2_x64_nehalem_westmere            MFlops:    417.3 Speedup:    65.16%    51.91%    16.58%    14.05%    13.35%    12.48%    10.67%     7.90%     7.18%     7.12%     5.61%     4.76%     3.20%     2.97%
>>   16 mpir_2_7_2_x64_core2                       MFlops:    419.0 Speedup:    65.85%    52.54%    17.07%    14.53%    13.82%    12.95%    11.14%     8.35%     7.62%     7.56%     6.05%     5.20%     3.63%     3.40%     0.42%
>>   17 mpir_2_7_2_x64_nehalem                     MFlops:    422.8 Speedup:    67.37%    53.94%    18.14%    15.58%    14.86%    13.99%    12.16%     9.34%     8.61%     8.55%     7.02%     6.16%     4.58%     4.35%     1.34%     0.92%
>>                                                                                   1         2         3         4         5         6         7         8         9        10        11        12        13        14        15        16
>>
>> I've taken these measurements three times with the same results.
>>
>> The six fastest versions are all 2.7.2.
>>
>> Note that architecture-specific compilation and the Broadwell CPU do not 
>> seem to be the issue, since the two slowest versions, the generic C builds 
>> mpir_3_0_0_x64_gc and mpir_2_7_2_x64_gc, also differ by about 8%. Both 
>> were compiled on the same machine within 5 minutes of each other with VS 2015.
>>
>> Another hint that architecture-specific compilation and optimization are 
>> working fine is that once I test with 1024 bits of precision, the fastest 
>> version is mpir_3_0_0_x64_skylake_avx (the Broadwell CPU used in this test 
>> already has most of the improvements of Skylake). Unfortunately, I very 
>> rarely zoom down to a magnification that needs 1024 bits.
>>
>> I have not done any tuning yet, but my understanding is that for sizes 
>> of 1, 2 or 3 limbs it should not matter anyway.
>>
>> Any hints or ideas on what I may be doing wrong?
>>
>> Does this also happen on other OSes/CPUs?
>>
>> Thanks and best regards,
>>
>> Marcus

-- 
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mpir-devel+unsubscr...@googlegroups.com.
To post to this group, send email to mpir-devel@googlegroups.com.
Visit this group at https://groups.google.com/group/mpir-devel.
For more options, visit https://groups.google.com/d/optout.
