On Jul 22, 11:27 am, Jason Moxham <[email protected]> wrote:
> On Friday 17 July 2009 08:32:30 Cactus wrote:
>
> > On Jul 16, 11:55 pm, Jason Moxham <[email protected]> wrote:
> > > A comparison of MPIR-1.2.1 on an Intel Nehalem, built as core2, on Linux
> > > and Windows.
>
> > > Timings are in cycles; the first column is Linux and the second column
> > > is Windows. Ignore the # marks, they are wrong.
>
> > > ./speed -c -s 1-40 mpn_add_n colfile=1,win64_bat/time_add
> > > overhead 5.71 cycles, precision 1000000 units of 3.75e-10 secs, CPU freq
> > > 2664.77 MHz
> > >             mpn_add_n colfile=1,win64_bat/time_add
> > > 1              #12.38         12.38
> > > 2              #11.43         12.38
> > > 3              #13.09         13.50
> > > 4              #18.41         18.10
> > > 5              #23.81         23.81
> > > 6              #24.00         24.45
> > > 7              #27.76         27.25
> > > 8              #26.19         26.19
> > > 9              #28.42         30.00
> > > 10             #29.62         31.43
> > > 11             #32.31         32.86
> > > 12             #34.50         34.29
> > > 13             #37.39         37.14
> > > 14             #40.00         38.73
> > > 15             #41.91         42.86
> > > 16             #40.48         42.22
> > > 17             #44.77         45.72
> > > 18             #45.85         46.77
> > > 19             #48.39         48.19
> > > 20             #49.53         49.76
> > > 21             #53.66         53.53
> > > 22             #53.81         54.65
> > > 23             #55.24         56.33
> > > 24             #56.91         57.31
> > > 25             #60.48         60.96
> > > 26             #60.84         62.65
> > > 27             #62.60         63.43
> > > 28             #65.72         64.76
> > > 29             #68.58         67.46
> > > 30             #67.39         69.80
> > > 31             #72.67         69.53
> > > 32             #72.39         72.96
> > > 33             #75.72         77.15
> > > 34             #76.91         78.81
> > > 35             #76.20         78.26
> > > 36             #79.06         80.72
> > > 37             #83.34         83.43
> > > 38             #84.77         84.90
> > > 39             #85.62         85.34
> > > 40             #87.49         88.58
>
> > > ./speed -c -s 1-40 mpn_addmul_1.333 colfile=1,win64_bat/time_addmul1
> > > overhead 5.71 cycles, precision 1000000 units of 3.75e-10 secs, CPU freq
> > > 2664.77 MHz
> > >         mpn_addmul_1.333 colfile=1,win64_bat/time_addmul1
> > > 1              #11.43         37.14
> > > 2              #17.15         28.33
> > > 3              #23.15         31.43
> > > 4              #29.05         39.53
> > > 5              #34.62         46.11
> > > 6              #40.52         49.23
> > > 7              #45.72         54.29
> > > 8              #51.75         57.60
> > > 9              #57.15         65.12
> > > 10             #63.09         67.21
> > > 11             #69.00         72.63
> > > 12             #74.97         75.98
> > > 13             #80.01         82.07
> > > 14             #76.77         85.72
> > > 15             #80.32         90.48
> > > 16             #84.24         94.29
> > > 17             #91.04        101.52
> > > 18             #95.12        104.77
> > > 19             #98.95        110.62
> > > 20            #103.40        114.86
> > > 21            #108.45        118.74
> > > 22            #112.95        124.58
> > > 23            #116.22        129.26
> > > 24            #120.21        133.34
> > > 25            #127.04        138.46
> > > 26            #131.78        142.97
> > > 27            #135.26        147.81
> > > 28            #140.71        152.63
> > > 29            #146.01        156.67
> > > 30            #149.29        161.39
> > > 31            #151.82        167.20
> > > 32            #157.58        171.12
> > > 33            #163.28        177.38
> > > 34            #168.20        181.00
> > > 35            #173.00        185.66
> > > 36            #176.42        191.12
> > > 37            #183.57        195.69
> > > 38            #185.68        200.07
> > > 39            #190.67        204.46
> > > 40            #191.45        209.37
>
> > > ./speed -c -s 1-40 mpn_addmul_2 colfile=1,win64_bat/time_addmul2
> > > overhead 5.71 cycles, precision 1000000 units of 3.75e-10 secs, CPU freq
> > > 2664.77 MHz
> > >          mpn_addmul_2 colfile=1,win64_bat/time_addmul2
> > > 1                 n/a         #0.00
> > > 2              #25.10         31.82
> > > 3              #33.71         39.53
> > > 4              #41.34         49.40
> > > 5              #51.20         57.57
> > > 6              #57.15         65.53
> > > 7              #65.72         73.81
> > > 8              #74.00         81.95
> > > 9              #83.00         90.01
> > > 10             #91.31         97.15
> > > 11             #99.70        104.77
> > > 12            #108.22        115.16
> > > 13            #116.44        122.20
> > > 14            #121.93        130.01
> > > 15            #131.60        139.69
> > > 16            #139.31        147.51
> > > 17            #149.05        155.11
> > > 18            #156.21        164.43
> > > 19            #165.05        171.73
> > > 20            #172.72        180.22
> > > 21            #181.55        188.48
> > > 22            #189.37        199.52
> > > 23            #199.07        203.65
> > > 24            #206.69        212.16
> > > 25            #212.50        221.57
> > > 26            #222.27        229.50
> > > 27            #231.47        237.38
> > > 28            #238.32        246.91
> > > 29            #247.81        254.30
> > > 30            #255.24        261.65
> > > 31            #261.85        272.16
> > > 32            #271.55        279.69
> > > 33            #280.37        285.91
> > > 34            #288.60        295.80
> > > 35            #297.46        302.80
> > > 36            #304.37        311.85
> > > 37            #312.94        320.60
> > > 38            #320.62        327.36
> > > 39            #328.93        337.28
> > > 40            #338.25        343.47
>
> > > ./speed -c -s 1-40 mpn_mul_basecase colfile=1,win64_bat/time_mulbase
> > > overhead 5.71 cycles, precision 1000000 units of 3.75e-10 secs, CPU freq
> > > 2664.77 MHz
> > >         mpn_mul_basecase colfile=1,win64_bat/time_mulbase
> > > 1               #9.52          8.57
> > > 2              #28.79         48.57
> > > 3              #50.85         72.54
> > > 4              #82.68        100.00
> > > 5             #114.62        135.24
> > > 6             #155.92        178.51
> > > 7             #210.50        232.87
> > > 8             #267.02        289.22
> > > 9             #341.31        438.35
> > > 10            #416.70        431.50
> > > 11            #504.12        523.84
> > > 12            #592.83        612.98
> > > 13            #698.18        727.01
> > > 14            #803.90        828.87
> > > 15            #927.52        951.48
> > > 16           #1044.89       1069.58
> > > 17           #1197.26       1222.67
> > > 18           #1328.40       1350.14
> > > 19           #1480.56       1504.67
> > > 20           #1633.17       1652.56
> > > 21           #1818.39       1838.12
> > > 22           #1988.85       2004.68
> > > 23           #2185.06       2194.37
> > > 24           #2355.79       2372.55
> > > 25           #2589.94       2599.87
> > > 26           #2769.85       2792.36
> > > 27           #2993.14       3007.09
> > > 28           #3205.00       3403.94
> > > 29           #3445.63       3475.97
> > > 30           #3678.54       3708.50
> > > 31           #3929.52       3976.62
> > > 32           #4184.47       4208.07
> > > 33           #4478.52       4500.55
> > > 34           #4726.07       4756.50
> > > 35           #5034.90       5048.47
> > > 36           #5295.16       5317.34
> > > 37           #5626.73       5646.50
> > > 38           #5902.33       5917.02
> > > 39           #6225.96       6252.33
> > > 40           #6541.47       6563.06
>
> > > ./speed -c -s 1-40 mpn_sqr_basecase colfile=1,win64_bat/time_sqrbase
> > > overhead 5.71 cycles, precision 1000000 units of 3.75e-10 secs, CPU freq
> > > 2664.77 MHz
> > >         mpn_sqr_basecase colfile=1,win64_bat/time_sqrbase
> > > 1               #7.62          7.62
> > > 2              #17.14         17.14
> > > 3              #43.81         47.53
> > > 4              #74.29         85.72
> > > 5             #106.19        114.45
> > > 6             #133.83        145.72
> > > 7             #171.69        183.59
> > > 8             #213.36        223.73
> > > 9             #259.08        270.49
> > > 10            #310.52        322.39
> > > 11            #367.15        376.37
> > > 12            #426.64        439.26
> > > 13            #494.58        508.89
> > > 14            #562.71        572.72
> > > 15            #641.03        649.69
> > > 16            #713.74        728.12
> > > 17            #803.52        812.86
> > > 18            #890.97        899.63
> > > 19            #982.98        995.49
> > > 20           #1082.37       1092.61
> > > 21           #1183.96       1204.66
> > > 22           #1291.96       1300.94
> > > 23           #1405.15       1405.02
> > > 24           #1521.08       1531.02
> > > 25           #1651.29       1651.48
> > > 26           #1769.72       1778.50
> > > 27           #1895.14       1911.57
> > > 28           #2030.37       2525.85
> > > 29           #2195.53       2188.53
> > > 30           #2317.56       2333.57
> > > 31           #2470.87       2481.63
> > > 32           #2627.25       2636.25
> > > 33           #2791.54       2805.56
> > > 34           #2936.54       2958.32
> > > 35           #3118.37       3126.90
> > > 36           #3280.58       3298.26
> > > 37           #3454.08       3479.90
> > > 38           #3648.92       3655.95
> > > 39           #3844.89       3844.73
> > > 40           #4031.90       4078.56
>
> > > The very small differences between mpn_add_n on Linux and Windows show
> > > that the other differences are not just down to how we call the CPU timer
> > > or to function call overheads; they are real timing differences
> > > and not some artifact. So hopefully we can improve this. I would have
> > > preferred AMD timings as I am more familiar with that chip.
>
> > Hi Jason,
>
> > We have done these comparisons before and none of these figures are
> > surprising since they reflect the three different strategies that I
> > have to use for *nix to Windows assembler code conversion.
>
> > 1. If a *nix assembler function (a) doesn't use the stack, and (b) can
> > leave two scratch registers unused, it can be converted by simply
> > remapping the registers and, except for use of different registers, it
> > will be identical on *nix and Windows.  This applies, for example, to
> > mpn_add_n.  This is a Windows leaf function that does not need to
> > support exception handling and stack unwinding. Other conversions
> > require Windows frame functions with exception support.
>
> > 2. If not enough scratch registers are available, I have to save and
> > restore registers on the stack but when the function is simple enough
> > I can still remap the registers.  This gives a constant (independent
> > of limb count) overhead.
>
> > 3. For complex functions remapping the registers can be too hard to do
> > so in such cases I save registers on the stack and then move input
> > parameters from their Windows registers to where the *nix assembler
> > expects them to be.  This again gives a constant overhead, one that is
> > a bit higher than in 2 above.
>
> > To reduce the overhead in 2 it is necessary to change the assembler to
> > use fewer registers. This is because rsi and rdi are scratch
> > registers on *nix but not on Windows.
>
> > Some overhead can be saved on 3 by remapping registers, but these are
> > functions where the overhead is typically a small proportion of the
> > function's average cost, since these are generally the functions where
> > the cost is quadratic in limb count.
>
> >      Brian
>
> Hi
>
> Just got my internet back, phew...
>
> I was aware of the different methods and why; I was just surprised by the
> size of the differences. It seems like the Windows versions are using more
> cycles than they "should". Even for the Linux code I only really concentrated
> on the inner loops; there is no point in tuning the setup code while the
> inner loops are changing. I plan to change the inner loops yet again, so
> there is little point in changing the outer code now.

Hi Jason

I wrote a Python curve fitter for the speed output from all the assembler
code and it gave the right 'cost per limb' when compared with your
Linux figures.

I agree that there is a constant fixed overhead for the extra Windows
operations but it didn't seem unusually high when I looked at it -
admittedly some time ago.
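
For illustration, here is a minimal sketch of that kind of fit. It is not
the original script: the names parse_speed_rows and fit_line are made up
here, and the straight-line model cycles = overhead + per_limb * n is an
assumption that only suits functions whose cost is linear in limb count
(mpn_add_n, mpn_addmul_1 and similar), not the basecase multiplies. The
sample rows are copied from the mpn_addmul_1 table quoted above.

```python
# Illustrative sketch only, not the original curve fitter.
# Assumed model: cycles = overhead + per_limb * n.

# Rows copied from the quoted mpn_addmul_1 table (limbs, Linux, Windows).
SAMPLE = """\
20            #103.40        114.86
25            #127.04        138.46
30            #149.29        161.39
35            #173.00        185.66
40            #191.45        209.37
"""


def parse_speed_rows(text):
    """Return (limbs, col1, col2) tuples from two-column speed output.

    The '#' markers are dropped; header lines and 'n/a' rows are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.replace("#", "").split()
        if len(fields) != 3:
            continue
        try:
            rows.append((int(fields[0]), float(fields[1]), float(fields[2])))
        except ValueError:
            continue  # header text or an 'n/a' entry
    return rows


def fit_line(ns, cycles):
    """Ordinary least-squares fit; returns (overhead, per_limb)."""
    count = len(ns)
    mean_n = sum(ns) / count
    mean_c = sum(cycles) / count
    sxx = sum((n - mean_n) ** 2 for n in ns)
    sxy = sum((n - mean_n) * (c - mean_c) for n, c in zip(ns, cycles))
    per_limb = sxy / sxx
    overhead = mean_c - per_limb * mean_n
    return overhead, per_limb


if __name__ == "__main__":
    rows = parse_speed_rows(SAMPLE)
    limbs = [r[0] for r in rows]
    for label, column in (("Linux", 1), ("Windows", 2)):
        overhead, per_limb = fit_line(limbs, [r[column] for r in rows])
        print(f"{label}: overhead {overhead:.2f} cycles, "
              f"{per_limb:.2f} cycles per limb")
```

In this model the slope is the 'cost per limb' and the intercept is the
fixed overhead, so comparing the intercepts of the two columns is one way
to estimate the extra constant cost on Windows.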

     Brian
