Re: [PATCH 09/20] Prepare for creating hidden aliases of all routines
Richard Henderson r...@twiddle.net writes: index 1b27998..ff0dc45 100644 --- a/gmp-h.in +++ b/gmp-h.in @@ -251,6 +251,10 @@ typedef __mpq_struct *mpq_ptr; __GMP_PUBLIC_DATA - for declaring data variables __GMP_PUBLIC_ALIAS - for re-declaring symbols with another name + When using GCC and building an ELF shared library, we'll arrange for + the rest of GMP to see a hidden alias. The true public symbol will + be created in the individual source files. + This is supposed to be an implementation detail, right? I.e., it should depend only on if gmp *itself* is compiled with gcc, and work the same no matter which compiler is used for the application *using* gmp (and which will be processing the public gmp.h header)? I admit I don't understand the fine details here. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 01/20] Delete mpn/generic/sizeinbase.c
Richard Henderson r...@twiddle.net writes: As far as I can tell it hasn't been used since 2002-02-09 Kevin Ryde ke...@swox.se * configure.in, mpn/Makefile.am, gmp-impl.h (mpn_sizeinbase): Remove. * mpn/generic/sizeinbase.c: Remove file. removed it from MPN_OBJECTS. It's certainly never built. Does anyone remember why it was deleted back then? I think it makes a lot of sense as a public mpn function. The file has seen some non-trivial updates by Torbjörn and Marco in recent years. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so
Richard Henderson r...@twiddle.net writes: Although I'm frankly a bit confused as to why we're using m4 for the assembly macro-isation, as opposed to the C preprocessor. Different taste, I guess. I don't know why GMP moved to m4, but personally, I wouldn't use cpp for anything that isn't C code, it's too deeply tied to the notion of a C token. And I think the additional power of m4 is used in gmp in a couple of places, e.g., ifelse tests applied to macro arguments, and macros expanding to macro definitions. And I don't think the macrology would get much prettier with cpp-style token pasting... In particular, when -g is passed through from gcc to gas, gas sees and records the # file line markers that the preprocessor leaves, and puts that into the auto-generated line number debugging info. With our m4 scheme we get no such # markers and instead gas records line number info assocated with a no longer extant temp file. The non-existant temp file definately is annoying, you have to comment out a rm command in the m4-ccas script when debugging gmp assembly files. As for #line markers, you can get them with m4 -s (not sure if it's a GNUism or traditional m4). But when using non-trivial M4 macros, I find it's often more helpful to have gdb guide you in the .s file after macro expansion. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so
David Miller da...@davemloft.net writes: And it causes the debugging problem Richard mentioned too. I really want to step in the orignal source file, the thing I'm going to edit to fix the bug, not some intermediate file. Does it help to just add -s to the m4 invocation? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so
Richard Henderson r...@twiddle.net writes: But perhaps more importantly, everyone who programs in C understands how the preprocessor works. Actually, I think that's incorrect. Everyone has some *familiarity* with the C preprocessor, which surely is an advantage. And maybe most C programmers think they they understand it. But in my experience, very few understand the fine details of cpp macro expansion. E.g., you probably wrote those GLUE and GLU1 macros without any difficulty, but understanding how that works and why really makes you a Great Guru to the vast majority of C programmers. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so
Torbjorn Granlund t...@gmplib.org writes: ni...@lysator.liu.se (Niels Möller) writes: I would expect #line to cause syntax problem for macines where # is not a comment charachter. Like ARM, where #17 is the small constant argument 17. GNU as on my arm doesn't complain about #line as generated by m4 -s. Not sure if it generates any useful debug info from it, though. But I imagine one might need some configure check to disable -s under certain circumstances. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so
Torbjorn Granlund t...@gmplib.org writes: Ehum, I don't understand with which cpp quirk that indirection is coping... The point of the indirection is to get macro arguments expanded *before* substitution. Which matters only (I think) when using the # and ## cpp operators. Example: gcc -E on this file #define S1(x) prefix_##x #define S(x) S1(x) #define FOO bar S(FOO) S1(FOO) produces prefix_bar prefix_FOO Perhaps we should use m4 for preprocessing C instead? :-) That would certainly cause some additional confusion. Any suggestion for appropriate m4 quote characters to use? ;-) Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: gmp-devel list
Zimmermann Paul paul.zimmerm...@loria.fr writes: the gmp-devel list is for Technical discussions between developers. We have seen recently several patches posted, which I believe do no match the list definition. If there is no other way to transfer source code, maybe one should create a gmp-patch list? We need a forum for both posting patches, and discussing the same patches. To me, it seems easiest to have both on the same list. I take it the *huge* patch containing the generated configure script was a mistake. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 2/2] Optimize 64-bit mpn_add_N and mpn_sub_N for sparc T3 and later.
Torbjorn Granlund t...@gmplib.org writes: I optimised submul_1.asm, and then edited both addmul_1 and submul_1 to use as similar operand order as possible. So remaining differences are necessary? I don't remember much sparc assembly, but it seems carry handling is done slightly differently. But appearantly there was no need for negation-on-the-fly tricks in submul_1, you just use subcc instead of addcc for the final subtraction (maybe we discussed that trick in some other context?). Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_cnd_add_n
Torbjorn Granlund t...@gmplib.org writes: OK with me, but either test on powerpc64 before checking in, or keep an eye on nightbuild regressions and fix any problem there. I checked it now, but darn, it did break on powerpc. Register r6 was used instead of the symbolic name n in a couple of places. With the following additional changes (complete file in the tree ~nisse/hack/gmp on shell) it appears to work again. It would be good if someone who actually understands powerpc assembly could have a look before I check it in. And maybe it's better to replace the hardcoded r7 when used in the loop by some symbolic name? Something like define(`ulimb', `r7') C Overlap with n input argument Regards, /Niels --- a/mpn/powerpc64/mode64/aorscnd_n.asmSun Mar 10 10:00:12 2013 +0100 +++ b/mpn/powerpc64/mode64/aorscnd_n.asmSun Mar 10 12:30:31 2013 +0100 @@ -64,11 +64,11 @@ subfic cnd, cnd, 0 subfe cnd, cnd, cnd - rldicl. r0, r6, 0,62C r0 = n 3, set cr0 + rldicl. r0, n, 0,62 C r0 = n 3, set cr0 cmpdi cr6, r0, 2 - addir6, r6, 3 C compute count... - srdir6, r6, 2 C ...for ctr - mtctr r6 C copy count into ctr + addin, n, 3 C compute count... + srdin, n, 2 C ...for ctr + mtctr n C copy count into ctr beq cr0, L(b00) blt cr6, L(b01) beq cr6, L(b10) @@ -122,7 +122,7 @@ b L(ret) L(b00):CLRCB C clear/set cy -L(go): ld r6, 0(up) C load s1 limb +L(go): ld r7, 0(up) C load s1 limb ld r27, 0(vp) C load s2 limb ld r8, 8(up) C load s1 limb ld r9, 8(vp) C load s2 limb @@ -139,8 +139,8 @@ addiup, up, 32 addivp, vp, 32 -L(top):ADDSUBC r28, r27, r6 - ld r6, 0(up) C load s1 limb +L(top):ADDSUBC r28, r27, r7 + ld r7, 0(up) C load s1 limb ld r27, 0(vp) C load s2 limb ADDSUBC r29, r9, r8 ld r8, 8(up) C load s1 limb @@ -164,7 +164,7 @@ and r0, r0, cnd bdnzL(top) C decrement ctr and loop back -L(end):ADDSUBC r28, r27, r6 +L(end):ADDSUBC r28, r27, r7 ADDSUBC r29, r9, r8 ADDSUBC r30, r11, r10 ADDSUBC r31, r0, r12 -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_cnd_add_n
ni...@lysator.liu.se (Niels Möller) writes: Here's a patch that reorders the arguments for mpn_addcnd_n and mpn_subcnd_n (I think it's best to keep this change separate from the renaming, since the potential problems are quite different). This seems to work now, after additional fixes for powerpc64 and x86_64 w*ndows. Here's the next patch, doing the rename of functions and files. I generated it with hg diff --git, to get concise diffs for the renamed files (to apply, I think one need a recent patch program, or use git apply, where the latter apparently doesn't require any git repo). I plan to check this in fairly soon. Regards, /Niels diff --git a/ChangeLog b/ChangeLog --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,32 @@ +2013-03-12 Niels Möller ni...@lysator.liu.se + + New names mpn_cnd_add_n and mpn_cnd_sub_n. + * mpn/generic/cnd_add_n.c (mpn_cnd_add_n): Renamed file and + function, from addcnd.c:mpn_addcnd_n. + * mpn/generic/cnd_sub_n.c (mpn_cnd_sub_n): Renamed, from + subcnd.c:mpn_subcnd_n. + * mpn/arm/cnd_aors_n.asm: Renamed file, from aorscnd.asm, and + renamed functions. + * mpn/x86_64/cnd_aors_n.asm: Analogous renaming. + * mpn/powerpc64/mode64/cnd_aors_n.asm: Analogous renaming. + * gmp-impl.h (mpn_cnd_add_n, mpn_cnd_add_n): Updated prototypes + with new names. + * configure.ac: Updated for new names. + * tests/refmpn.c (refmpn_cnd_add_n): Renamed, from refmpn_addcnd_n. + (refmpn_cnd_sub_n): Renamed, from refmpn_subcnd_n. + * tests/tests.h (refmpn_cnd_add_n, refmpn_cnd_sub_n): Updated + prototypes with new names. + * tune/common.c (speed_mpn_cnd_add_n): Renamed, from + speed_mpn_addcnd_n, call mpn_cnd_add_n. + (speed_mpn_cnd_sub_n): Renamed, from speed_mpn_subcnd_n, call + mpn_cnd_sub_n. + * tune/speed.h (speed_mpn_cnd_add_n, speed_mpn_cnd_sub_n): Updated + prototypes with new names. + * tune/speed.c (routine): Updated list with new names. + * tests/devel/try.c: Updated for new mpn_cnd_* names. + * mpn/generic/sbpi1_div_sec.c: Likewise. + * mpn/generic/powm_sec.c: Likewise. + 2013-03-12 Torbjorn Granlund t...@gmplib.org * configure.ac: Add missing to extra_functions for coreibwl. diff --git a/configure.ac b/configure.ac --- a/configure.ac +++ b/configure.ac @@ -2718,7 +2718,7 @@ add_n_sub_n addaddmul_1msb0 gmp_mpn_functions=$extra_functions \ - add add_1 add_n sub sub_1 sub_n addcnd_n subcnd_n neg com \ + add add_1 add_n sub sub_1 sub_n cnd_add_n cnd_sub_n neg com \ mul_1 addmul_1 submul_1 \ add_err1_n add_err2_n add_err3_n sub_err1_n sub_err2_n sub_err3_n \ lshift rshift dive_1 diveby3 divis divrem divrem_1 divrem_2 \ @@ -2775,7 +2775,7 @@ tmp_mulfunc=aors_err2_n ;; add_err3_n|sub_err3_n) tmp_mulfunc=aors_err3_n ;; - addcnd_n|subcnd_n) tmp_mulfunc=aorscnd_n ;; + cnd_add_n|cnd_sub_n) tmp_mulfunc=cnd_aors_n ;; addmul_1|submul_1) tmp_mulfunc=aorsmul_1 ;; mul_2|addmul_2)tmp_mulfunc=aormul_2 ;; popcount|hamdist) tmp_mulfunc=popham;; @@ -3268,7 +3268,6 @@ #undef HAVE_NATIVE_mpn_add_n_sub_n #undef HAVE_NATIVE_mpn_add_nc #undef HAVE_NATIVE_mpn_addaddmul_1msb0 -#undef HAVE_NATIVE_mpn_addcnd_n #undef HAVE_NATIVE_mpn_addlsh1_n #undef HAVE_NATIVE_mpn_addlsh2_n #undef HAVE_NATIVE_mpn_addlsh_n @@ -3301,6 +3300,8 @@ #undef HAVE_NATIVE_mpn_bdiv_dbm1c #undef HAVE_NATIVE_mpn_bdiv_q_1 #undef HAVE_NATIVE_mpn_pi1_bdiv_q_1 +#undef HAVE_NATIVE_mpn_cnd_add_n +#undef HAVE_NATIVE_mpn_cnd_sub_n #undef HAVE_NATIVE_mpn_com #undef HAVE_NATIVE_mpn_copyd #undef HAVE_NATIVE_mpn_copyi @@ -3357,7 +3358,6 @@ #undef HAVE_NATIVE_mpn_sqr_diag_addlsh1 #undef HAVE_NATIVE_mpn_sub_n #undef HAVE_NATIVE_mpn_sub_nc -#undef HAVE_NATIVE_mpn_subcnd_n #undef HAVE_NATIVE_mpn_sublsh1_n #undef HAVE_NATIVE_mpn_sublsh2_n #undef HAVE_NATIVE_mpn_sublsh_n diff --git a/gmp-impl.h b/gmp-impl.h --- a/gmp-impl.h +++ b/gmp-impl.h @@ -1555,10 +1555,10 @@ __GMP_DECLSPEC mp_size_t mpn_powm_sec_itch (mp_size_t, mp_size_t, mp_size_t); #define mpn_tabselect __MPN(tabselect) __GMP_DECLSPEC void mpn_tabselect (volatile mp_limb_t *, volatile mp_limb_t *, mp_size_t, mp_size_t, mp_size_t); -#define mpn_addcnd_n __MPN(addcnd_n) -__GMP_DECLSPEC mp_limb_t mpn_addcnd_n (mp_limb_t, mp_ptr, mp_srcptr, mp_srcptr, mp_size_t); -#define mpn_subcnd_n __MPN(subcnd_n) -__GMP_DECLSPEC mp_limb_t mpn_subcnd_n (mp_limb_t, mp_ptr, mp_srcptr, mp_srcptr, mp_size_t); +#define mpn_cnd_add_n __MPN(cnd_add_n) +__GMP_DECLSPEC mp_limb_t mpn_cnd_add_n (mp_limb_t, mp_ptr, mp_srcptr, mp_srcptr, mp_size_t); +#define mpn_cnd_sub_n __MPN(cnd_sub_n) +__GMP_DECLSPEC mp_limb_t mpn_cnd_sub_n (mp_limb_t, mp_ptr, mp_srcptr, mp_srcptr, mp_size_t); #define mpn_sb_div_qr_sec __MPN
Re: mpn_cnd_add_n
ni...@lysator.liu.se (Niels Möller) writes: I plan to check this in fairly soon. Checked in this renaming now. Next, I'd like to make mpn_cnd_add_n, mpn_cnd_sub_n and mpn_zero_p public (I guess that also implies some documentation...). mpn_zero_p is an inline function. In gmp.h, setting up inlining looks pretty hairy, with __GMP_EXTERN_INLINE defined in a couple of different ways depending on compiler. On the other hand, gmp-impl.h seems to simply use static inline, unconditionally. If this is good enough for all supported compilers, maybe it's good enough also in gmp.h? Potential problems: * We may still need some __GMP_FORCE_* logic for binary compatibility with applications which expect the functions to be exported. * We might break pointer equality (different compilation units might get different copies of non-inlined versions). I haven't thought very carefully about it, but I guess we already have that problem for compilers where __GMP_EXTERN_INLINE expands to static inline (DEC, SCO, and SunPro C compilers). If it is a problem. I think this can be worked around with some macro indirection to make sure that pointers will refer to a unique function in the library. Anyway, unless someone things cleaning this up now is important, I guess I'll try to do mpn_zero_p in the same way as other inline functions in gmp.h. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
ARM neon pseudo op
On my pandaboard (with a cortex-a9), I run Debian GNU/Linux, and the assembler calls itself $ as --version GNU assembler (GNU Binutils for Debian) 2.22 It refuses to assemble the new shiny mpn/arm/neon/*.asm files. With somewhat confusing error messages like tmp-lshift.s: Assembler messages: tmp-lshift.s:111: Error: selected processor does not support ARM mode `vdup.32 d6,r3' tmp-lshift.s:113: gcc -std=gnu99 -c -DHAVE_CONFIG_H -I. -I/home/nisse/hack/gmp/mpn -I.. -D__GMP_WITHIN_GMP -I/home/nisse/hack/gmp -DOPERATION_rshift -marm -O2 -pedantic -fomit-frame-pointer -march=armv7-a -mtune=cortex-a9 -g -Wa,--noexecstack tmp-rshift.s -o rshift.o Error: selected processor does not support ARM mode `vdup.32 d7,r3' tmp-lshift.s:119: Error: selected processor does not support ARM mode `vshl.u64 d18,d19,d7' I had to patch the files like diff -r ea84e60fd6dc mpn/arm/neon/lorrshift.asm --- a/mpn/arm/neon/lorrshift.asmTue Apr 02 07:40:32 2013 +0200 +++ b/mpn/arm/neon/lorrshift.asmTue Apr 02 11:32:05 2013 +0200 @@ -21,6 +21,8 @@ dnl along with the GNU MP Library. If include(`../config.m4') + .fpuneon + C cycles/limb cycles/limb cycles/limb good to convince the assembler. Not sure if it's best to do this at the top of each file, or if it should somehow be handled by ASM_START. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: ARM neon pseudo op
Torbjorn Granlund t...@gmplib.org writes: I need the first few dozen lines from the configure output to have a guess about what might go wrong. You might actually compare the output to that of panda.gmplib.org yourself. http://gmplib.org/devel/testmachines/build/success/panda.gmplib.org-stat:standard.txt Hmm, it's supposed to be a flag (-mfpu=neon) on the gcc command line? Then it's my fault... I had set CFLAGS manually to what gmp's configure selected for me some month or two ago, plus a -g. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: ARM public key benchmark
ni...@lysator.liu.se (Niels Möller) writes: I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd like to try. That wasn't a clear win... I use addmul_1 and submul_1 as a fallback (and I always do in-place operation, so that works). Now, cnd_sub_n beats submul_1 (except for n == 2, which I don't use): $ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_submul_1.1 mpn_cnd_sub_n clock_gettime is 1.000ns accurate overhead 8.87 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz mpn_submul_1.1 mpn_cnd_sub_n 1#19.8927 21.6831 2#10.9752 12.4106 3 9.5514 #8.9371 4 8.5227 #6.6696 5 7.8316 #6.7412 6 7.1571 #6.0339 7 7.2859 #5.3320 8 6.8553 #4.8715 9 6.6945 #5.0376 10 6.3129 #4.8351 1005.5065 #3.2110 But for addition, mpn_addmul_1 beats mpn_cnd_add_n for many small sizes, $ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_addmul_1.1 mpn_cnd_add_n clock_gettime is 1.000ns accurate overhead 8.94 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz mpn_addmul_1.1 mpn_cnd_add_n 1#19.8927 21.2256 2#10.8574 11.6940 3 #8.02358.5240 4 #6.45616.5216 5 #6.03086.5071 6 #5.49375.9282 7 #5.20635.3603 8 4.8838 #4.7493 9 #4.92494.9533 10#4.53644.8244 1003.4846 #3.2842 Some questions: 1. I guess one can expect submul_1 to always be a bit slower than addmul_1, since submul_1 needs additional arithmetics besides the umaal? One could perhaps do some negations on the fly, a - b C = - ((-a) + b*C), maybe that would be advantageous? 2. cnd_add_n should be at least as fast as addmul_1, shouldn't it? It appears to be 0.25 c/l faster for larger operands, so maybe it's only a question of optimizing loop setup and feedin? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: ARM public key benchmark
ni...@lysator.liu.se (Niels Möller) writes: So it should be doable with the addmul_1 loop and two additional, non-recurrency, not instructions per limb, and then maybe some extra logic for the return value. One could aim for 4.25 c/l, I guess. The below seems to give correct results. But still 5.25 c/l. Maybe scheduling can be improved, I just put the new mvn instructions immediately preceding umaal and str. Regards, /Niels dnl ARM mpn_submul_1. dnl Copyright 2012 Free Software Foundation, Inc. dnl This file is part of the GNU MP Library. dnl The GNU MP Library is free software; you can redistribute it and/or modify dnl it under the terms of the GNU Lesser General Public License as published dnl by the Free Software Foundation; either version 3 of the License, or (at dnl your option) any later version. dnl The GNU MP Library is distributed in the hope that it will be useful, but dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public dnl License for more details. dnl You should have received a copy of the GNU Lesser General Public License dnl along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. include(`../config.m4') Ccycles/limb C StrongARM: - C XScale - C Cortex-A7 ? C Cortex-A8 ? C Cortex-A9 5.25 C Cortex-A15 ? C TODO C * Micro-optimise feed-in code. C * Optimise for n=1,2 by delaying register saving. C * Try using ldm/stm. define(`rp',`r0') define(`up',`r1') define(`n', `r2') define(`v0',`r3') ASM_START() PROLOGUE(mpn_submul_1) stmfd sp!, { r4, r5, r6, r7 } andsr6, n, #3 mov r12, #0 beq L(fi0) cmp r6, #2 bcc L(fi1) beq L(fi2) L(fi3): ldr r4, [up], #4 ldr r6, [rp, #0] ldr r5, [up], #4 b L(lo3) L(fi0): ldr r5, [up], #4 ldr r7, [rp], #4 ldr r4, [up], #4 b L(lo0) L(fi1): ldr r4, [up], #4 ldr r6, [rp], #8 subsn, n, #1 beq L(1) ldr r5, [up], #4 b L(lo1) L(fi2): ldr r5, [up], #4 ldr r7, [rp], #12 ldr r4, [up], #4 b L(lo2) ALIGN(16) L(top): ldr r6, [rp, #-8] ldr r5, [up], #4 mvn r7, r7 str r7, [rp, #-12] L(lo1): mvn r6, r6 umaal r6, r12, r4, v0 ldr r7, [rp, #-4] ldr r4, [up], #4 mvn r6, r6 str r6, [rp, #-8] L(lo0): mvn r7, r7 umaal r7, r12, r5, v0 ldr r6, [rp, #0] ldr r5, [up], #4 mvn r7, r7 str r7, [rp, #-4] L(lo3): mvn r6, r6 umaal r6, r12, r4, v0 ldr r7, [rp, #4] ldr r4, [up], #4 mvn r6, r6 str r6, [rp], #16 L(lo2): mvn r7, r7 umaal r7, r12, r5, v0 subsn, n, #4 bhi L(top) ldr r6, [rp, #-8] mvn r7, r7 str r7, [rp, #-12] L(1): mvn r6, r6 umaal r6, r12, r4, v0 mvn r6, r6 str r6, [rp, #-8] mov r0, r12 ldmfd sp!, { r4, r5, r6, r7 } bx lr EPILOGUE() -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: ARM public key benchmark
Torbjorn Granlund t...@gmplib.org writes: Have you considered complementing C instead? Not until now. Actually looks nice: A - b C = A + b (~C) + b - b B^n So this saves one not instruction, and we have to add and subtract the scalar b from incoming and outgoing carry. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: ARM public key benchmark
ni...@lysator.liu.se (Niels Möller) writes: I'll also try using fewer updates of the up pointer, that seems to save half a cycle, and could perhaps speed up addmul_1 too. No speedup for addmul_1, unfortunately, but a saving for submul_1. Here are new versions of both files (for mpn/arm/v6). I wonder if this submul_1 complement trick is useful on some other platforms too, e.g., 64-bit sparc? Running at 3.25 and 3.9 c/l on A9: $ GMP_CPU_FREQUENCY=1e9 ./speed -s1-1000 -f 1.2 -C mpn_addmul_1.17 mpn_submul_1.17 clock_gettime is 1.000ns accurate overhead 8.98 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz mpn_addmul_1.17 mpn_submul_1.17 1 19.9985 #18.7576 2 10.9929 #10.8108 3 #7.98808.6664 4 #6.40996.9205 5 #5.92516.3792 6 #5.43846.1228 7 #5.23365.9634 8 #4.86485.4154 9 #4.83595.2633 10#4.54235.2216 12#4.31224.8634 14#4.18764.8111 16#4.08814.6616 19#4.00454.5861 22#3.85924.4916 26#3.71914.4362 31#3.71944.3459 37#3.64374.2051 44#3.83684.3953 52#3.50394.1120 62#3.64974.2448 74#3.60154.1965 88#3.53764.1343 105 #3.50874.0832 126 #3.49884.1222 151 #3.43973.9997 181 #3.39003.9654 217 #3.34833.9500 260 #3.29943.9175 312 #3.33133.8723 374 #3.31733.8869 448 #3.26573.9315 537 #3.27993.9140 644 #3.30773.8574 772 #3.26213.9059 926 #3.26733.8317 dnl ARM mpn_submul_1. dnl Copyright 2012, 2013 Free Software Foundation, Inc. dnl This file is part of the GNU MP Library. dnl The GNU MP Library is free software; you can redistribute it and/or modify dnl it under the terms of the GNU Lesser General Public License as published dnl by the Free Software Foundation; either version 3 of the License, or (at dnl your option) any later version. dnl The GNU MP Library is distributed in the hope that it will be useful, but dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public dnl License for more details. dnl You should have received a copy of the GNU Lesser General Public License dnl along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. include(`../config.m4') Ccycles/limb C StrongARM: - C XScale - C Cortex-A7 ? C Cortex-A8 ? C Cortex-A9 3.9 C Cortex-A15 ? C TODO C * Micro-optimise feed-in code. C * Optimise for n=1,2 by delaying register saving. C * Try using ldm/stm. define(`rp',`r0') define(`up',`r1') define(`n', `r2') define(`v0',`r3') ASM_START() PROLOGUE(mpn_submul_1) stmfd sp!, { r4, r5, r6, r7 } andsr6, n, #3 mov r12, v0 beq L(fi0) cmp r6, #2 bcc L(fi1) beq L(fi2) L(fi3): ldr r4, [up], #12 mvn r4, r4 ldr r6, [rp, #0] ldr r5, [up, #-8] b L(lo3) L(fi0): ldr r5, [up], #16 mvn r5, r5 ldr r7, [rp], #4 ldr r4, [up, #-12] b L(lo0) L(fi1): ldr r4, [up], #20 mvn r4, r4 ldr r6, [rp], #8 subsn, n, #1 beq L(1) ldr r5, [up, #-16] b L(lo1) L(fi2): ldr r5, [up], #8 mvn r5, r5 ldr r7, [rp], #12 ldr r4, [up, #-4] b L(lo2) ALIGN(16) L(top): ldr r6, [rp, #-8] ldr r5, [up], #16 str r7, [rp, #-12] L(lo1): umaal r6, r12, r4, v0 mvn r5, r5 ldr r7, [rp, #-4] ldr r4, [up, #-12] str r6, [rp, #-8] L(lo0): umaal r7, r12, r5, v0 mvn r4, r4 ldr r6, [rp, #0] ldr r5, [up, #-8] str r7, [rp, #-4] L(lo3): umaal r6, r12, r4, v0 mvn r5, r5 ldr r7, [rp, #4] ldr r4, [up, #-4] str r6, [rp], #16 L(lo2): umaal r7, r12, r5, v0 mvn r4, r4 subsn, n, #4 bhi L(top) ldr r6, [rp, #-8] str r7, [rp, #-12] L(1): umaal r6, r12, r4, v0 str r6, [rp, #-8] sub r0, v0, r12 ldmfd sp!, { r4, r5, r6, r7 } bx lr EPILOGUE() dnl ARM mpn_addmul_1. dnl Copyright 2012 Free Software Foundation, Inc. dnl This file is part of the GNU MP Library. dnl The GNU MP Library is free
Re: ARM public key benchmark
Torbjorn Granlund t...@gmplib.org writes: I sometimes get better A9 performance with *discrete* pointer updates, not one-out-of-four autoincrement pointer updates like used here. I think the code you started with had that one-out-of-four trick for str, already? Right, it uses a single update of rp (which is used for both loads and stores), I just changed it to handle up updates in a similar way. I had on the other hand not realised David's ones complement + pre-invert carry trick. Not sure I understand what you are referring to here. I haven't been following the sparc developments very closely (and I don't remember much of sparc assembly). Cool! Looks like it is actually faster than 3.9 for some alignments/sizes. It seems one iteration takes 15.5 cycles. I guess that means that even and odd iterations are executed differently? It would be nice to get it down to 15 cycles (3.75 c/l) (the addmul_1 iteration takes 13 cycles, there's no good reason the four additional mvn instructions should cost more than two cycles). But I find instruction scheduling both very hard and tedious. Did you time this on some other CPU too? No. When I get home (I don't log in to the gmp machines from the office network), I might get time to try it on the appropriate gmp machine. 1. Use descrete ptr updates for up and/or rp. Maybe. Costs additional instructions, but with more freedom on where to place the pointer increment. Loopmixer would help. A digression: I'm running Debian GNU/Linux on my pandaboard system. The Linux way to get access to the instruction counter seems to be via perf_event_open. However, when I tried it, it seems no hardware-based events exist (I do get access to the software-based ones though, so the interface is partially working). Also, clock_get_time with CLOCK_PROCESS_CPUTIME_ID gives very poor accuracy, so maybe the entire high res timers-subsystem is non-working. Any clues on where to look for solving this problem is appreciated. The obvious (to me) things seem to be enabled in the kernel config $ zgrep 'PERF_EV\|HIGH_RES' /proc/config.gz CONFIG_HIGH_RES_TIMERS=y CONFIG_HAVE_PERF_EVENTS=y CONFIG_PERF_EVENTS=y CONFIG_HW_PERF_EVENTS=y And it's no use to even think of porting the loop mixer to arm without access to cycle-accurate timing. 2. Move the one-out-of-four autoincrement updates to other ldr/str insns. Could try that. 3. Use ldm/stm. Often an A9 win. If we want to schedule loads early, that seems to rule out using a single ldm to load all values used in the loop. Right? But two ldm, loading two limbs at a time, could work. stm seems easier. Any changes of this type will break the current loop setup logic, I'm afraid. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: ARM public key benchmark
Richard Henderson r...@twiddle.net writes: Looking around the web it seems that what most folks do is write a minimal kernel module that toggles the bit that allows userspace access to the cycle counter MSRs. I've seen some code snippets to do that too. I have a dual core system; one known issue when I tried searching for this some month ago was that it set the needed register bits only on the core on which the kernel module happened to be loaded. So that seems like a partial solution only. And I have no idea about how to write a Linux kernel module that makes a certain function be executed on all available cores; if I can find out, I'll definitely try that. I havn't fiddled with this myself, because the A15 includes a real high-resolution clock, which is accessed via CLOCK_MONOTONIC. On my system, both CLOCK_MONOTONIC and CLOCK_PROCESS_CPUTIME_ID seem to tick in units of roughly 30 us. Which is disappointingly poor resolution. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: ARM public key benchmark
Torbjorn Granlund t...@gmplib.org writes: The newer sparc adds 64-bit carrying adds, but they still don't have corresponding subtraction instructions. Se David sets carry before entering the loop, and ones complements the subtrahend. Ah. I think I even suggested that trick, for mpn_sub_n. Not obviously applicable to submul_1, though, which made me think you were referring to something different. I assume that ldm loads the registers in some secific order, such as lowest numbered first. I guess it's lowest numbered first (and lowest memory address). But a loop with use r7 ldm up!, {r4,r5,r6,r7} use r4 looks like poor scheduling betwen load of r4 and use of it, and the ldm can't be moved earlier since it clobbers r7. But I have a pretty vague idea about how this really works. Then, it could lift the screboard bit for availale register values while ldm executes. I'm not sure if I understand, but if ldm is executed in the background, with r4 loaded first, there's still tight between load of r4 and use. From what I've learnt so far about A9 scheduling, I'd want to place the load of r4 before the last use of r7, and then I can't use ldm. Using ldm with just two register might be pointless. Also, it will for 50% of alignments take 2 cycles. Doing three registers is (as we've discussed in the past) more applealing. Right, rewriting the loop with 3-way unrolling would be an interesting experiment. But I don't think I'll look into that soon. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Better tabselect
Torbjorn Granlund t...@gmplib.org writes: The multitude and pattern of mask computations make side channel leakage worse if the mask computation is made stupidly. I don't trust compilers here, since they might use a conditional move or other leaky method. One possible variant, LIMB_HIGHBIT_TO_MASK ( ~((i-k) | (k-i))) BTW, a comment on LIMB_HIGHBIT_TO_MASK, which I have intended to write for some time. Current definition uses a condition when arithmetic shift is unavailable, #define LIMB_HIGHBIT_TO_MASK(n) \ (((mp_limb_signed_t) -1 1) 0 \ ? (mp_limb_signed_t) (n) (GMP_LIMB_BITS - 1) \ : (n) GMP_LIMB_HIGHBIT ? MP_LIMB_T_MAX : CNST_LIMB(0)) I'd suggest changing the else part to a shift + negation, : - ((n) (GMP_LIMB_BITS - 1)) But I don't know if there's still any C implementation which lack arithmetic shift, so I don't know how to test that change. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: mini-gmp mpz_addmul_ui and mpz_submul_ui
Torbjorn Granlund t...@gmplib.org writes: ni...@lysator.liu.se (Niels Möller) writes: I'm considering adding mpz_addmul_ui and mpz_submul_ui to mini-gmp. OK...due to popular demand? I just added mini-gmp in nettle. I don't use it for the librry itself (I might make that an option later), but for generating ecc-related tables at build time. This simplifies cross compilation, since the program in question is run on the build system, so by using mini-gmp, I get away without depending on gmp being installed in the build system, and without any configure tests for figuring out how to link with it. To do this, I had to replace gmp_printf by calls to mpz_out_str (which seems unavoidable; I don't want a printf implementation in mini-gmp), but I also had to replace a call to mpz_submul_ui, which seems easy enough to implement. I have no strong opinion, but we should not forget the meaning of the mini moniker. Right. I'm leaning towards the simpler implementation. Avoiding some temporary storage is not terribly important for the sizes mini-gmp is intended for. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Public mpz_ptr and mpz_srcptr
ni...@lysator.liu.se (Niels Möller) writes: Here's another possible use of mpz pointers: And another one, I just stumbled upon: Passing mpz_t to varargs functions, mpz_srcptr n = va_arg(args, mpz_srcptr); (I just edited that line to use mpz_srcptr. When I wrote the previous version 10 years ago I used to be const MP_INT *, since that seemed to be a more supported piece of GMP API. But mpz_srcptr looks less ugly). Is there any other supported way to do that? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: _basecase or _sec? [
Torbjorn Granlund t...@gmplib.org writes: I see the need of the following: function mul gcdext add, sub mod (div_r_sec) is more important than general division. And modular inverse is more important than general gcdext. I've also seen some need for add_1/sub_1. Modular inverse is a bit tricky, I have an implementation (at http://git.lysator.liu.se/nettle/nettle/blobs/master/sec-modinv.c) which is some 50 time slower than mpn_gcdext. As far as I'm aware, this is a novel algorithm. I think it could be extended to return the gcd and/or a success/fail indication without leaking. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: _basecase or _sec? [
Torbjorn Granlund t...@gmplib.org writes: Hmm...can one use Newton for any a^(-1) mod b (when that exists)? (I listed gcdext for the sake of inverses, and omitted gcd.) I think Newton analogues exist only when b is a power, not in general. And the most important case is prime b. I am not familiar with the Jebelean (or Möller!) criteria for gcd, but my intuition suggests that there exists a O(k)-bit matrix which reduces k bits from the operands. And the same intuition suggests that it can be computed from O(k) high bits. Am I wrong? You need to have roughly *2k* bits of each operand to make k bits progress. And you get into trouble at least in the case that those 2k bits are equal (you need to subtract one operand from the other, but which depends on the other bits). Let me think aloud. As a start, one should try to construct an algorithm which makes (at least) two bits of progress per iteration, with all problematic special cases handled in a side-channel silent manner. I looked into it briefly before writing my current code, but I couldn't find anything easy. And I'm afraid the table lookup even for just two bits of progress needs 5 bits from each operand, i.e., 1024 entries, which might leak information through the cache instead. Right-to-left algorithms will likely have easier bookkeeping than left-to-right algorithms. Instead of a lehmer-like algorithms which eliminates k bits of each operand using a vector/matrix multiply, it might be better (smaller table) to eliminate k bits from the larger operand only. But then we need to keep track of which operand is largest. I think this simplest appproach was the one I looked a bit into, and it didn't seem very promising. I think one can always eliminate two bits, by setting m -- min(a,b) M -- max(a,b) a -- (M + k m) / 4 b -- m where k = ± 1, which ever is needed to make M + k m a multiple of 4. And then some additional logic for the case of even numbers (if M = 0 (mod 4), just set k = 0, and if M = 2 (mod 4), set k = 2). But to go beyond two bits, one would need something like k-ary gcd, which might introduce false factors. The k \in {0, 1, 2, -1} could be a conditional subtraction and addition 1. Subtract: d -- a - [a odd] b (if a odd) 2. Swap: b -- a, d -- -d (if underflow above, d 0) 3. Add: a -- a + 2 b (under certain conditions) 4. Shift:a -- a / 4 Always Current code does (1) and (2) only (and a shift by one bit). If one wants a single addmul_1, one could choose k \in {0,1,2,3} instead, but one still *needs* to take sign (a-b) into account; it's no use to add the larger operand to the smaller. Analysis of the guaranteed progress after n iterations is going to get more complicated with k 0. For any variant, I think the iteration needs to take into account: Low bits of a and b (one can try to maintain b odd at all times, but one needs to handle the case of even a). Or high bits for a left-to-right algorithm, but I think that's going to be less convenient. Sign of (a - b). The straight-forward way to determine the sign of (a - b) is to actually compute the difference. The result of this subtraction is directly useful for the plain binary algorithm, but a bit wasted if we try something more sophisticated. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: _basecase or _sec? [
Torbjorn Granlund t...@gmplib.org writes: I think it exists also for b can be factorised into prime powers... Sure, any factorization can be used with CRT, but it's only powers which allow newton-like hensel-lifts, as far as I'm aware. And the most important case is when b is prime. You need to have roughly *2k* bits of each operand to make k bits progress. And you get into trouble at least in the case that those 2k bits are equal (you need to subtract one operand from the other, but which depends on the other bits). We don't need to insist on keeping operands positive. Hmm. In general, one needs to replace the largest number, to make progres. But I guess in the case of many high bits being equal, it might not matter. My thought was not to have a table, but plain loop which executed a limb-size dependent number of iterations, and has not internal branches. I see. Makes some sense. To get a nice progress guarantee, one such large iteration would need to apply the matrix, and then do a small number of additional reductions. I have not looked at right-to-left algorithms for gcdext, at least not in a very long time. Do they resolve gcd(a,b) = ax + by or something like gcd(a,b)*2^t = ax + by? I haven't looked at more sophisticated right-to-left gcd for a long time (Stehlé and Zimmermann). But the plain binary algorithm shifts out the zero bits as they appear. And the co-factors (to-be inverse) are divided by 2 (mod b) in each step, by a shift and a conditional add. It's possible, and it might be more efficient, to divide the cofactors by 2^k (mod b) at the end, instead of one bit at a time. Right, answering my question above. Cannot false factors of 2 be handled? As long as the iteration only makes the update a -- a + k b, no false factors can appear. The problem is the k-ary algorithm (Weber's accelarated gcd), which uses an update of the form a -- u a + v b, with |u| != 1. And as for factors of two, I have only considered inversion modulo an odd b. I suspect non-leaky code has to compute the sign as a full subtraction. It should be possible to do a non-leaky loop examining all limbs, without carry propagation. Just doing comparisions a[k] b[k] and a[k] == b[k], combining results with logic operations. No idea if that can be faster than a plain subtraction. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Changes to mini-gmp and 5.1.2
Torbjorn Granlund t...@gmplib.org writes: Should we move any of the mini-gmp changes to 5.1.2? I think the following would make sense to include: 2013-02-25 Niels Möller ni...@lysator.liu.se * mini-gmp/tests/t-double.c (testmain): Declare double variables as volatile, to drop extended precision. * mini-gmp/tests/testutils.c (testfree): New function. Use it everywhere where test programs deallocate storage allocated via the mini-gmp allocation functions, including uses of mpz_get_str for various test failure messages. IIRC, the second change fixes crashes when displaying messages about failing tests. 2013-02-19 Marco Bodrato bodr...@mail.dm.unipi.it * mini-gmp/mini-gmp.c: Move asserts to work-around a compiler bug. * mini-gmp/tests/t-reuse.c: Fix typo causing the same negation condition to be applied to all operands. (See 2013-02-03, Torbjorn) Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: caching of transforms used for large multiplications
Daniel Lichtblau d...@wolfram.com writes: My question(s): Are these NTT intermediate results cached? No. I have a tentative interface for this in my small-prime fft code which haven't yet been integrated with gmp. I think Paul Zimmermann may also have some patches along those lines for the Schönhage-Strassen fft which is used in GMP now. If not, is there a way to tell GMP to keep them around for subsequent reuse? I suspect such caching might be a generally useful thing to have if it is not already present. Interfaces for operations on invariant numbers have been discussed for a long time, but no public interfaces have been implemented (and few internal). Caching transforms may be useful also for smaller multiplications (I think I saw a speedup of a few % already with Karatsuba, when I played with that). And for division for all sizes. For fast GCD it could amount to a notable speed gain, I think. (I believe Niels remarked specifically on this but do not have the reference handy. Sorry.) In GCD, there are certain large unbalanced multiplication. Splitting the large factor and reusing the transform of the small factor is one possible approach to optimization. Another is transforming the 2x2 matrix elements once (since they're used twice for a matrix x vector multiplication), that's done in the small-prime fft code. And yet another is using wraparound arithmetic. (And there are surely other tricks as well). Hmm, looking at the GCD code now, it makes some limited use of wraparound arithmetic. Only for the matrix x vector multiplication for the first of the two recursive reductions in hgcd. The gmp function to look for is mpn_mulmod_bnm1. I also dug up some old slides. Below are some estimates extracted from a presentation I made at KTH, May 15, 2008. I don't remember much of the details, but for the largest savings, you'd want to not only cache transforms, but also do some linear operations in the transform domain. E.g,, to multiply two 2x2 matrices, first FFT-transform the 4 elements of each matrix, *then* expand them linearly to two 7-element vectors for Strassen multiplication. Do the 7 pointwise multiplications, linearly transform them back to four elements, and finally, four inverse FFTs, one for each element of the result matrix. Regards, /Niels \section{Further work} \titleslide{Further work} \begin{frame} \frametitle{Matrix multiplication} \begin{equation*} M_1 \cdot M_2 \quad \text{$2\times2$ matrices} \end{equation*} \bigskip Assume \FFT and sizes such that transforms and pointwise multiplication take equal time. \bigskip \begin{tabular}{l|} \FFT \IFFT Pointwise Saving \\ \hline Naive 16 8 8 0\% \\ Schönhage-Strassen 14 7 7 12\% \\ Invariance 8 4 8 37\% \\ S.-S. + invariance 8 4 7 40\% \\ \end{tabular} \end{frame} \begin{frame} \frametitle{Matrix-vector multiplication} \begin{itemize} \item If $\alpha, \beta$ are returned: $M$ of size $n/4$, $A', B'$ of size $n/2$. \begin{equation*} M^{-1} \cdot \M{A \\ B} = 2^p \M{\alpha \\ \beta} + M^{-1} \cdot \M{A' \\ B'} \end{equation*} \begin{tabular}{l|rrl} \#Mults. Prod. size \\ \hline Naive 4 $3n/4$ Wins in \FFT range \\ Block 8 $n/2$ Can use invariance \\ S.-S. 7 $n/2$ Wins in Karatsuba range \\ \end{tabular} \pause \item If only matrix is returned: $M$ of size $n/4$, $A, B$ of size $n$. \begin{equation*} \M{\alpha \\ \beta} = M^{-1} \cdot \M{A \\ B} \end{equation*} $\alpha, \beta$ are of size $3n/4$ (cancellation!). Compute $\bmod (2^k \pm 1)$, with transform size $\approx 3n/4$. \pause \item \alert{Same transform size}, $3n/4$, no matter if reduced numbers are available or not! \end{itemize} \end{frame} -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: caching of transforms used for large multiplications
Daniel Lichtblau d...@wolfram.com writes: Re GCD usage, mostly I had in mind the matrix multiplications (which I believe are balanced) and a few others that, You may want to have a look at http://gmplib.org:8000/gcd-nisse/file/tip/mpn/generic/matrix22_mul.c. This code reuses transforms, and it does a few additions in the transform domain. So 8 forward transforms, 8 pointwise muls, and 4 inverse transforms. One pointwise multiplication could be eliminated using Strassen's trick, at the cost of more complicated operations in the transform domain. Possibly one value would be a factor of two larger in size than the other (i.e. the smaller on the order of the square root of the bigger). This might be what Niels had in mind with splitting the larger factor... Exactly. Another trick, suggested by Paul, iirc: Say you have the 2x2 matrix M, elements of size n, and a vector (x;y) of size 2n, and you want the (unbalanced) matrix-vector product M (x;y) Then split x and y as x = x_1 B^n + x_0, y = y_1 B^n + y_0, and compute the balanced *matrix* product M (x_0, x_1; y_0, y_1) using any optimization tricks available. The two resulting columns can be combined cheaply to get the desired vector. It was indeed Niels' Subquadratic GCD, dated October 30, 2008-- did not say where the presentation was given but it could well be an update of the slides referred to in Niels' response. I wonder where I have those files. The timing matches the two-month period where I worked at KTH, sharing a room with Torbjörn and working with integration and optimization of the subquadratic gcd code. I made a longer presentation at the department towards the end of that project. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: caching of transforms used for large multiplications
Daniel Lichtblau d...@wolfram.com writes: I'm fairly sure readers will understand that no offense was intended. Well, on a public and technical mailing list, jokes at the expense of other developers are not appropriate. Except possibly when you hold a *correct* belief that the subject of the joke will not be offended. That said, I'll not trouble you with further communication. It appears you and Torbjörn are no longer on speaking terms. Which seems unnecessary. Respectfull and productive communication is possible even with some degree of personal dislike. It requires a bit more care in writing than communication between best friends, of course. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: intel core i5 ivy bridge
Hamid Shayesteh shirazch...@yahoo.com writes: can i install gmp library on laptop model asus k55vd(cpu: core i5 ivy bridge), Most likely yes, but it also depends on which operating system you use. if yes, please help me and send instruction. i use gcc for programming. The gmp-devel list is for discussions relating to development of the gmp library itself, not general questions from gmp users. Follow the installation instructions at https://gmplib.org/manual/Installing-GMP.html#Installing-GMP. If you encounter any problems, please send any further questions to the gmp-discuss or the gmp-bugs list as appropriate (see https://gmplib.org/#MAILINGLISTS). Or get a precompiled package for your operating system, e.g., apt-get install libgmp-dev on Debian GNU/Linux. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Licensing issue with gnutls and gmp
Tomas Mraz tm...@redhat.com writes: Would it be possible to dual license the next GMP release under the LGPLv3+ and GPLv2+ licenses? I heard that it might be possible to do that. That has been discussed. I think it makes a lot of sense, but it's unlikely to happen any time soon. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: ARM public key benchmark
ni...@lysator.liu.se (Niels Möller) writes: David Miller da...@davemloft.net writes: #include linux/smp.h ... on_each_cpu(my_func, my_func_args, 1); Cool. I'll have to play with that, but probably not until next week. And some months later... I tried that now. I use the below linux kernel module which modifies the user enable register when loaded. On all cores. Nevertheless, when I read the enable bit in a user space program it is zero (and instructions trying to access the cycle counter fails with illegal instruction). I then added some code to the module to display the value of the bit before and after I try to write it. If I load and immediately unload the module, the values are as I expect: root@arm:/home/nisse# insmod ~nisse/incoming/enableccnt.ko ; rmmod enableccnt root@arm:/home/nisse# dmesg |tail -14 [5728897.345672] enableccnt starting [5728897.349487] enableccnt enableccnt_write: 1 [5728897.354186] enableccnt before: 0 [5728897.357910] enableccnt after: 1 [5728897.361572] enableccnt enableccnt_write: 1 [5728897.366271] enableccnt before: 0 [5728897.369995] enableccnt after: 1 [5728897.385314] enableccnt stopping [5728897.392120] enableccnt enableccnt_write: 0 [5728897.399871] enableccnt before: 1 [5728897.406616] enableccnt after: 0 [5728897.413146] enableccnt enableccnt_write: 0 [5728897.420806] enableccnt before: 1 [5728897.427429] enableccnt after: 0 (Note that there are two cores). But if I delay unload just a second, something else resets the flag in the mean time: root@arm:/home/nisse# insmod ~nisse/incoming/enableccnt.ko ; sleep 1; rmmod enableccnt root@arm:/home/nisse# dmesg |tail -14 [5729055.679229] enableccnt starting [5729055.685974] enableccnt enableccnt_write: 1 [5729055.693603] enableccnt before: 0 [5729055.700286] enableccnt after: 1 [5729055.706848] enableccnt enableccnt_write: 1 [5729055.714508] enableccnt before: 0 [5729055.721221] enableccnt after: 1 [5729056.752655] enableccnt stopping [5729056.759399] enableccnt enableccnt_write: 0 [5729056.767028] enableccnt before: 0 [5729056.773742] enableccnt after: 0 [5729056.780364] enableccnt enableccnt_write: 0 [5729056.788024] enableccnt before: 0 [5729056.794738] enableccnt after: 0 Which maybe isn't too surprising, after all the kernel owns all registers. But I greped the kernel sources, and I really find no accesses to this status register anywhere. So I'm lost now. If anyone on this list have some hint on where to look, that's appreciated. The platform is still an OMAP4 Panda board. Otherwise I guess I'll have to give up this track. Regards, /Niels /* Adapted from http://bench.cr.yp.to/cpucycles/netwalker.html */ #include linux/module.h #include linux/kernel.h MODULE_LICENSE(Dual BSD/GPL); #define DEVICE_NAME enableccnt static void enableccnt_write (void *info) { int val_write = *(int *) info; int val_read; printk (KERN_INFO DEVICE_NAME enableccnt_write: %d\n, val_write); asm volatile (mrc p15, 0, %0, c9, c14, 0: =r(val_read)); printk (KERN_INFO DEVICE_NAME before: %d\n, val_read); asm volatile (mcr p15, 0, %0, c9, c14, 0 :: r(val_write)); asm volatile (mrc p15, 0, %0, c9, c14, 0: =r(val_read)); printk (KERN_INFO DEVICE_NAME after: %d\n, val_read); } static int enableccnt_init(void) { int val = 1; printk(KERN_INFO DEVICE_NAME starting\n); on_each_cpu (enableccnt_write, val, 1); return 0; } static void enableccnt_exit(void) { int val = 0; printk(KERN_INFO DEVICE_NAME stopping\n); on_each_cpu (enableccnt_write, val, 1); } module_init(enableccnt_init); module_exit(enableccnt_exit); -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: automake -a
Marc Glisse marc.gli...@inria.fr writes: It seems that it would be easiest to do what the comment in .bootstrap says, remove those files from the repository and use automake -a to make sure we have a consistent version of everything (no -f so it doesn't override our specific INSTALL, config.guess, etc). Makes sense to me, as long as removing files in the repo is decided on a case-by-case basis. E.g., I think the COPYING file should stay in the repo, even if automake -a currently installs an identical file. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: division-free binary-to-decimal conversion
Zimmermann Paul paul.zimmerm...@inria.fr writes: with Cyril Bouvier we have written a preprint describing several division-free algorithms to convert from binary to decimal (the mp?_get_str routines): http://www.loria.fr/~zimmerma/papers/get_str.pdf I've had a first reading. Seems to give impressive speedups. And nice with another application of mulmid. I think I noticed one typo: On page 3, the second inequality just after Eq. (3), should probably say 0 = y/2^n 1, not 0 = y b^k / 2^n 1. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_divexact_1 comments
Mark Sofroniou ma...@wolfram.com writes: If the quotient will be one word shorter than the dividend then set the top word to zero: if ((divisor 1) == 0) { if ((dst != src) (src[size - 1] divisor)) dst[size - 1] = 0; ... I don't understand. Doesn't the current loop always write all quotient limbs, including dst[size-1]? And why the condition (dst != src) ??? If the divisor is a power of two then skip the division code and just shift or copy: Low-level gmp functions, like this one, usually don't do that type of optimizations. The caller can do that check, if powers of two are likely enough that it matters. And I magine constant d is an important usecase. Furthermore, this function is implemented in assembly on several platforms, and adding a power-of-two check and a call to mpn_rshift to all implementations of mpn_divexact_1 would get a bit messy. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
div_qr_1 interface
of the License, or (at your + option) any later version. + + The GNU MP Library is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + License for more details. + + You should have received a copy of the GNU Lesser General Public License + along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. */ + + #include gmp.h + #include gmp-impl.h + #include longlong.h + + #if GMP_NAIL_BITS 0 + #error Nail bits not supported + #endif + + mp_limb_t + mpn_divf_qr_1n_pi1 (mp_ptr qp, mp_size_t n, mp_limb_t u, + mp_limb_t d, mp_limb_t dinv) + { + ASSERT (n 0); + ASSERT (d GMP_NUMB_HIGHBIT); + ASSERT (u d); + + do + { + mp_limb_t q; + udiv_qrnnd_preinv (q, u, u, 0, d, dinv); + qp[--n] = q; + } + while (n 0); + + return u; + } diff -Nrc2 gmp.4fa4b9b52bd5/mpn/generic/div_qr_1n_pi1.c gmp/mpn/generic/div_qr_1n_pi1.c *** gmp.4fa4b9b52bd5/mpn/generic/div_qr_1n_pi1.c1970-01-01 01:00:00.0 +0100 --- gmp/mpn/generic/div_qr_1n_pi1.c 2013-10-16 21:15:19.0 +0200 *** *** 0 --- 1,58 + /* mpn_div_qr_1n_pi1 + +Contributed to the GNU project by Niels Möller + +THIS FILE CONTAINS INTERNAL FUNCTIONS WITH MUTABLE INTERFACES. IT IS +ONLY SAFE TO REACH THEM THROUGH DOCUMENTED INTERFACES. IN FACT, IT IS +ALMOST GUARANTEED THAT THEY'LL CHANGE OR DISAPPEAR IN A FUTURE GNU MP +RELEASE. + + + Copyright 2013 Free Software Foundation, Inc. + + This file is part of the GNU MP Library. + + The GNU MP Library is free software; you can redistribute it and/or modify + it under the terms of the GNU Lesser General Public License as published by + the Free Software Foundation; either version 3 of the License, or (at your + option) any later version. + + The GNU MP Library is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + License for more details. + + You should have received a copy of the GNU Lesser General Public License + along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. */ + + #include gmp.h + #include gmp-impl.h + #include longlong.h + + #if GMP_NAIL_BITS 0 + #error Nail bits not supported + #endif + + /* Divides (uh B^n + {up, n}) by d, storing the quotient at {qp, n}. +Requires that uh d. */ + mp_limb_t + mpn_div_qr_1n_pi1 (mp_ptr qp, mp_srcptr up, mp_size_t n, mp_limb_t uh, + mp_limb_t d, mp_limb_t dinv) + { + ASSERT (n 0); + ASSERT (uh d); + ASSERT (d GMP_NUMB_HIGHBIT); + ASSERT (MPN_SAME_OR_SEPARATE_P (qp, up, n)); + + do + { + mp_limb_t q, ul; + + ul = up[--n]; + udiv_qrnnd_preinv (q, uh, uh, ul, d, dinv); + qp[n] = q; + } + while (n 0); + + return uh; + } diff -Nrc2 gmp.4fa4b9b52bd5/mpn/generic/divrem_1.c gmp/mpn/generic/divrem_1.c *** gmp.4fa4b9b52bd5/mpn/generic/divrem_1.c 2013-10-16 21:15:19.0 +0200 --- gmp/mpn/generic/divrem_1.c 2013-10-16 21:15:19.0 +0200 *** *** 138,154 invert_limb (dinv, d); ! for (i = un - 1; i = 0; i--) ! { ! n0 = up[i] GMP_NAIL_BITS; ! udiv_qrnnd_preinv (*qp, r, r, n0, d, dinv); ! r = GMP_NAIL_BITS; ! qp--; ! } ! for (i = qxn - 1; i = 0; i--) ! { ! udiv_qrnnd_preinv (*qp, r, r, CNST_LIMB(0), d, dinv); ! r = GMP_NAIL_BITS; ! qp--; ! } return r; } --- 138,149 invert_limb (dinv, d); ! /* Undo the offsetting */ ! qp -= (n-1); ! ! if (un 0) ! r = mpn_div_qr_1n_pi1 (qp + qxn, up, un, r, d, dinv); ! ! if (qxn 0) ! r = mpn_divf_qr_1n_pi1 (qp, qxn, r, d, dinv); return r; } -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
ni...@lysator.liu.se (Niels Möller) writes: To get going, I've written C implementations of mpn_div_qr_1n_pi1 and mpn_divf_qr1n_pi1, and made divrem_1 call them. Below, also an mpn_div_qr_1, using these primitives (and with some inspiration from divrem_1). For return value, I use the type typedef struct { mp_limb_t r; mp_limb_t qh; } gmp_qr_1_t; The order here is a micro-optimization. I expect that on most ABI:s, for the typical code fragment gmp_qr_1_t res; ... res.r = mpn_foo (...); return res; the return value from mpn_foo will already be in the right register, and only qh needs to be moved from some callee-save register to the second return value register. Are there any problems with using a struct return values like here, in terms of performance, portability, style...? If you agree this makes sense, I'm considering checking in the div_qr_1* functions fairly soon (maybe divrem_1 changes can wait until div_qr_1*_pi1 has settled. The interface should allow for additional precomputed values). I also found some old x86_64 assembly code for the new div_qr_1 algorithm. With perfect scheduling, it could run at 10 cycles on opteron (the main cpu of interest at the time). Main loop is 28 instructions, two independent mul, and decoder limited). Regards, /Niels /* mpn_div_qr_1 -- mpn by limb division. Contributed to the GNU project by Niels Möller and Torbjörn Granlund Copyright 1991, 1993, 1994, 1996, 1998, 1999, 2000, 2002, 2003, 2013 Free Software Foundation, Inc. This file is part of the GNU MP Library. The GNU MP Library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. The GNU MP Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. */ #include gmp.h #include gmp-impl.h #include longlong.h #ifndef DIV_QR_1_NORM_THRESHOLD #define DIV_QR_1_NORM_THRESHOLD 3 #endif #ifndef DIV_QR_1_UNNORM_THRESHOLD #define DIV_QR_1_UNNORM_THRESHOLD 3 #endif #if GMP_NAIL_BITS 0 #error Nail bits not supported #endif /* Divides {up, n} by d. Writes the n-1 low quotient limbs at {qp, * n-1}. Returns remainder and high quotient limb. */ gmp_qr_1_t mpn_div_qr_1 (mp_ptr qp, mp_srcptr up, mp_size_t n, mp_limb_t d) { gmp_qr_1_t res; unsigned cnt; mp_limb_t uh; ASSERT (n 0); ASSERT (d 0); if (d GMP_NUMB_HIGHBIT) { /* Normalized case */ mp_limb_t dinv, q; uh = up[--n]; q = (uh = d); res.qh = q; uh -= (-q) d; if (BELOW_THRESHOLD (n, DIV_QR_1_NORM_THRESHOLD)) { cnt = 0; plain: while (n 0) { mp_limb_t ul = up[--n]; udiv_qrnnd (qp[n], uh, uh, ul, d); } res.r = uh cnt; return res; } invert_limb (dinv, d); res.r = mpn_div_qr_1n_pi1 (qp, up, n, uh, d, dinv); return res; } else { /* Unnormalized case */ mp_limb_t dinv, ul; if (! UDIV_NEEDS_NORMALIZATION BELOW_THRESHOLD (n, DIV_QR_1_UNNORM_THRESHOLD)) { uh = up[--n]; udiv_qrnnd (res.qh, uh, 0, uh, d); cnt = 0; goto plain; } count_leading_zeros (cnt, d); d = cnt; #if HAVE_NATIVE_div_qr_1u_pi1 /* FIXME: Call loop doing on-the-fly normalization */ #endif /* Shift up front, use qp area for shifted copy. A bit messy, since we have only n-1 limbs available, and shift the high limb manually. */ uh = up[--n]; ul = (uh cnt) | mpn_lshift (qp, up, n, cnt); uh = (GMP_LIMB_BITS - cnt); if (UDIV_NEEDS_NORMALIZATION BELOW_THRESHOLD (n, DIV_QR_1_UNNORM_THRESHOLD)) { udiv_qrnnd (res.qh, uh, uh, ul, d); up = qp; goto plain; } invert_limb (dinv, d); udiv_qrnnd_preinv (res.qh, uh, uh, ul, d, dinv); res.r = mpn_div_qr_1n_pi1 (qp, qp, n, uh, d, dinv) cnt; return res; } } -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: I am not too enthusiastic about struct return types for critical functions. I expect this to be returned via a stack slot everywhere o almost everywhere. As far as I understand, the most common ABIs for x86_64 and ARM (which is pretty close to almost everywhere...) both return structs of this form in two registers: %rax/%rdx, and %r0/%r1. Consider the test compilation unit typedef struct { unsigned long q; unsigned long r; } qr_t; qr_t divrem (unsigned long u, unsigned long d) { qr_t res; res.q = u/d; res.r = u - res.q*d; return res; } On x86_64 (and gnu/linux), gcc -c -O compiles this to 0: 48 89 f8mov%rdi,%rax 3: 31 d2 xor%edx,%edx 5: 48 f7 f6div%rsi 8: 48 89 famov%rdi,%rdx b: 48 0f af f0 imul %rax,%rsi f: 48 29 f2sub%rsi,%rdx 12: c3 retq Both inputs and outputs are passed in registers. The return value is the only thing stored on the stack. I recall to have seen some code for that. How fast does it run currently on the various CPUs? Don't know yet. Code comment: I think we cannot afford to do a separate lshift of the dividend operand when the divisor is just a few limbs. We need to to shifting on-the- fly, however irksome that might be. AN mpn_div_qr_1u_pi1 is called-for. I think we'll definitely want mpn_div_qr_1u_pi1 for the most common platforms. I was thinking, that maybe we could let it be an optional function, with no C implementation, and resort to a separate mpn_lshift if the function is missing. But if needed, it's no big deal to extract a C mpn_div_qr_1u_pi1 from divrem_1.c, with on-the-fly shifting. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
ni...@lysator.liu.se (Niels Möller) writes: Consider the test compilation unit typedef struct { unsigned long q; unsigned long r; } qr_t; qr_t divrem (unsigned long u, unsigned long d) { qr_t res; res.q = u/d; res.r = u - res.q*d; return res; } On x86_64 (and gnu/linux), gcc -c -O compiles this to 0: 48 89 f8mov%rdi,%rax 3: 31 d2 xor%edx,%edx 5: 48 f7 f6div%rsi 8: 48 89 famov%rdi,%rdx b: 48 0f af f0 imul %rax,%rsi f: 48 29 f2sub%rsi,%rdx 12: c3 retq And I guess it is relevant to compare this to the same function, reorganized to return the second value via a pointer: unsigned long divrem (unsigned long u, unsigned long d, unsigned long *qp) { unsigned long q; q = u/d; *qp = q; return u - q * d; } which is compiled to 0: 48 89 d1mov%rdx,%rcx 3: 48 89 f8mov%rdi,%rax 6: 31 d2 xor%edx,%edx 8: 48 f7 f6div%rsi b: 48 0f af f0 imul %rax,%rsi f: 48 89 01mov%rax,(%rcx) 12: 48 89 f8mov%rdi,%rax 15: 48 29 f0sub%rsi,%rax 18: c3 retq If the caller is going to store the returned value directly in memory anyway, there's little difference. And if the caller is going to operate on the return value, and needs it in a register, I think struct return should be epsilon more efficient. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
ni...@lysator.liu.se (Niels Möller) writes: I'm about to push the first step, with C implementations of mpn_div_qr_1 and mpn_div_qr_1n_pi1. Done now, including some tuning code. It would be interesting to have DIV_QR_1N_PI1_METHOD DIV_QR_1_NORM_THRESHOLD DIV_QR_1_UNNORM_THRESHOLD added to the tuning reports on the web. I'll try to get the x86_64 assembly for mpn_div_qr_1n_pi1 in soon. I think my plan is to do things in this order: 1. Get div_qr_1 working efficiently. 2. Get divf_qr_1 (fraction) working efficiently. 3. Let divrem_1.c use the new code. 4. Check for regressions, and delete divrem_1.asm for onw platform at a time. but this will most likely take a long time to complete. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: Which tail call? In the normalized case, the checked in mpn_div_qr_1 does something like *qh = ...; ... return mpn_div_qr_1n_pi1(...); Which is a nice tail call. With the struct-returning version one gets instead res.qh = ...; ... res.r = mpn_div_qr_1n_pi1(...); return res; which I don't think the compiler has any chance of turning into a tail call. Should we unconfuse ourselves and users about what type of inverse is to be passed? The pi1 moniker might be replaced. How many one-limb inverse types do we currently consider? It would make some sense to me to do like the mod_1_* functions, and have a corresponding _cps function precomputing anything needed. The ABI then says nothing about the details, only that a maximum of 4 (say) precomputed limbs will be used. For the fancy div_qr_1 algorithm, what's needed is the same old reciprocal dinv = (B^2-1)/d - B, and B^2 - d*(B+dinv), same as for mod_1_1. And shiftcount, in the unnormalized case. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
ni...@lysator.liu.se (Niels Möller) writes: I'll try to get the x86_64 assembly for mpn_div_qr_1n_pi1 in soon. Pushed first working version now, see http://gmplib.org:8000/gmp/file/tip/mpn/x86_64/div_qr_1n_pi1.asm On my core2 laptop: $ ./speed -s 2-10,100,500 -C mpn_divrem_1.0x mpn_div_qr_1.0x overhead 6.13 cycles, precision 1 units of 8.33e-10 secs, CPU freq 1200.00 MHz mpn_divrem_1.0x mpn_div_qr_1.0x 2 60.6420 #39.9427 3#40.9839 55.0469 4#43.7667 44.4534 5 44.6333 #38.9055 6 39.6259 #34.4167 7 34.0063 #32.4018 8 30.1364 #28.5745 9 29.6472 #27.4599 1029.1270 #26.7300 100 24.7920 #20.6700 500 24.4400 #19.7600 So here it's a clear win, except an ugly regression for n = 3. On shell, the same command gives: 2#37.4379 51.1157 3#30.0256 61.0904 4#25.8058 27.0781 5#23.2717 24.2831 6#21.7520 22.4346 7#20.5219 21. 8#19.4783 20.1101 9#18.7726 19.3369 10 #18.3271 18.7228 100 #13.8063 13.8175 500 #13.2670 13.2750 So here the new code is epsilon slower for the larger sizes. Maybe the loopmixer can help. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: On Intel chips, op-to-mem is expensive. Even op-from-memory is often slower than load+op. (I understand the register shortage problem.) The following (untested) variant needs one register too many. UP, QP, UN: Load, store, loop counter. DINV, B2, B2md: Loop invariant constants. U2, U1I, U0, Q1I, Q0: Inputs. U1O, Q1O: Outputs. Q2, %rax, %rdx: Locals. Also U1I - U1O recurrency chain (with opteron cycle counts) mov U2, Q2 mov U2, Q1O neg Q2 mov DINV, %rax and DINV, Q1O mul U1I add Q0, Q1O adc $0, Q2 mov %rax, Q0 add %rdx, Q1O adc $0, Q2 mov B2, %rax and B2, U2 mul U1I C 0 6 lea (U0, B2md), U1O add U0, U2 cmovnc U2, U1O adc U1I, Q1O adc Q1I, Q2 mov Q2, (QP, UN, 8) jc L(incr) L(incr_done): mov (UP, UN, 8), U0 add %rax, U0C 4 adc %rdx, U1O C 5 sbb U2, U2 C 6 25 instructions (27 K10 decoder slots) excluding loop overhead. But one variable must be moved out of the registers. Maybe B2md (used once) is the best candidate. Then lea (U0, B2md), U1O would have to be replaced by mov (%rsp), U1O C Can be done very early ... add U0, U1O We then have 26 instructions + loop overhead, or 54 instructions for 2 iterations. Or possibly DINV, if one thinks the quotient logic is less critical. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: I looked at the logic following this: sbb U2, U2 C 7 13 You negate the U2 copy in Q2. It seems that three adc by sbb could avoid the neg. The problem is the final use, where Q2 is added, with carry, to a different register. It's tempting to replace adc Q1I, Q2 with sbb Q2, Q1I and negated Q2, but I'm afraid that will get the sense of the carry wrong. Do you see any trick to get that right without negating Q2 somewhere along the way? I might also be possible to replace the early loop and stuff by cmov. Maybe, but the simple way to do conditional addition with lea + cmov won't to, since we also need carry out. Does it matter if we do mov B2, r and mask, r or mov $0, r cmovc B2, r ? To optimise register usage, I sometimes annotate the code with live ranges for each register. That will help with register coalescing. There are lots of possibilities, since the computations for Q and U are mostly independent. The data flow is something like load U limb | _V_ U2, U1I, U0 - |___| - U2, U1O, U0 \ |__/ cy _V__V___V_ Q1I, Q0- |__| - Q1O, Q0 \ V store Q limb (T is rather shot-lived, perhaps its register could serve two usages?) It could perhaps eliminated. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: * The code is no win for AMD k10/k8 (although close to 10 c/l might well be possible) I tried replacing one masking op by cmov, as you suggested. We then get down to 11.25 c/l on K10. I put this modified version in the k10 subdirectory, since it was a significant slowdown on some other processors. Next thing to try is to delay the Q1 store, but that's a bit more work. After that, I guess I should try the loop mixer. I benchmarked the code on the k8, k10, core2, sandybridge, nehalem and nano machines. I couldn't log in to haswell and piledriver. /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: I turned out the code was a bit slower on k8. This patch changes that. With it applied, things takes 11 c/l on both pipelines. This is also a 2 c/l improvement for piledriver. Cool. I have not tested that this is correct. If you like the patch, please consider putting the result in the k8 subdir. I'll try to get that done reasonably soon. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: But I lack unit testing code for the function, making hacking quite cumbersome. I don't feel safe hacking *any* GMP assembly code without tests/devel/try.c's function and access checks. tests/mpn/t-div.c includes tests for mpn_div_qr_1, including single guard limbs before and after the q area. That provides for some reasonable testing, and I was running while GMP_CHECK_RANDOMIZE=0 ./t-div ; : ; done over night in the weekend. But sure, support also in try.c would be good. (1) Shorten a dep chain, and avoid keeping CF live over an inc instruction. The cmov doesn't really depend on sbb, since the latter insn never really changes carry. (This might btw be useful to teach loppmixer!) Right, using cmov (instead of masking) allows some more reordering. (2) Reallocate Q2 to an old register (not r8-r15) and then use the 32-bit form of adc $0,reg. That form is shorter. Clever! (3) Offet UP to avoid the offset in the loop. Should be straight forward. Or one could offset UN, but I think it's nice to be able to use dec / jnz for the loop. Or is the obscure loop decrement and branch instruction useful? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: But sure, support also in try.c would be good. Added now. And sure enough, it detects some bugs in the new assembly code. For size n==1, there's a missing mov. I'll add that shortly. Then there's another problem with n==2, which needs a bit more debugging. These sizes aren't exercised by div_qr_1. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: Basically qp = up won't work, but qp = up + k for any positive k will? Does the C code share that property? I think the C code has the same problem. Hmm, and it looks like in both the C and asm code, it's only the first iteration, before the main loop, which has this problem. Then it should be easy to fix, in the asm case, it's one additional mov and not in the main loop. I think it would be good to fix that, since it is surely a common usage scenario. I agree. If/when the code is reorganized to delay the q stores (to avoid the adc Q2,8(QP, UN, 8) instruction), I think the overlap issue will also get fixed in the process. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: A contribution to GMP
Torbjorn Granlund t...@gmplib.org writes: We might want to re-phrase it somewhat, perhaps just consider - seriously consider, or even New projects should use mpfr since mpf is slower, less complete, and not actively developed. I guess in theory, mpf could be slightly faster than mpfr since mpf doesn't make any effort to have precisely defined rounding. But if it isn't faster in practice, and doesn't have any other advantages over mpfr, it would make sense to clearly deprecate all use of mpf. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: div_qr_1 interface
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: Basically qp = up won't work, but qp = up + k for any positive k will? Does the C code share that property? [...] I think it would be good to fix that, since it is surely a common usage scenario. I agree. I've checked in some bugfixes. Now it seems to work according to try, and without the no-overlap patch which it turned out I never pushed into the repo. If/when the code is reorganized to delay the q stores (to avoid the adc Q2,8(QP, UN, 8) instruction), Here's a first crude version, which seems to work. I could eliminate other uses of T, and then use that register to the previous Q1 limb around I had to change the logic lea (B2md, U0), T ... and B2, U2 add U2, U0 cmovnc U0, T adc U1, Q1 Here, B2md holds B^2 (mod d) - d, and U2 is initially a mask. To use U2 as temporary (it dies at the add), I had to change it to and B2, U2 add U2, U0 lea (B2md, U0), U2 cmovnc U0, U2 adc U1, Q1 which gives the same result, provided that B2md is adjusted to just hold -d. This sequence has less friendly dependencies. Maybe it would be better to stick some of the loop-invariant constants in memory instead. It's getting late, so I'll not do any benchmarking right now. Regards, /Niels dnl x86-64 mpn_div_qr_1n_pi1 dnl -- Divide an mpn number by a normalized single-limb number, dnl using a single-limb inverse. dnl Contributed to the GNU project by Niels Möller dnl Copyright 2013 Free Software Foundation, Inc. dnl This file is part of the GNU MP Library. dnl The GNU MP Library is free software; you can redistribute it and/or modify dnl it under the terms of the GNU Lesser General Public License as published dnl by the Free Software Foundation; either version 3 of the License, or (at dnl your option) any later version. dnl The GNU MP Library is distributed in the hope that it will be useful, but dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public dnl License for more details. dnl You should have received a copy of the GNU Lesser General Public License dnl along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. include(`../config.m4') C c/l C AMD K8,K9 13 C AMD K10 13 C AMD bull 16.5 C AMD pile 15 C AMD steam ? C AMD bobcat16 C AMD jaguar ? C Intel P4 47 poor C Intel core19.25 C Intel NHM 18 C Intel SBR 15 poor C Intel IBR 13 C Intel HWL 11.7 C Intel BWL ? C Intel atom52 very poor C VIA nano 19 C INPUT Parameters define(`QP', `%rdi') define(`UP', `%rsi') define(`UN_INPUT', `%rdx') define(`U1', `%rcx')C Also in %rax define(`D', `%r8') define(`DINV', `%r9') C Invariants define(`B2', `%rbp') define(`B2md', `%rbx') C Variables define(`UN', `%r8') C Overlaps D input define(`T', `%r10') define(`U0', `%r11') define(`U2', `%r12') define(`Q0', `%r13') define(`Q1', `%r14') define(`Q2', `%r15') ABI_SUPPORT(STD64) ASM_START() TEXT ALIGN(16) PROLOGUE(mpn_div_qr_1n_pi1) FUNC_ENTRY(6) IFDOS(` mov 56(%rsp), %r8 ') IFDOS(` mov 64(%rsp), %r9 ') dec UN_INPUT jnz L(first) C Just a single 2/1 division. C T, U0 are allocated in scratch registers lea 1(U1), T mov U1, %rax mul DINV mov (UP), U0 add U0, %rax adc T, %rdx mov %rdx, T imulD, %rdx sub %rdx, U0 cmp U0, %rax lea (U0, D), %rax cmovnc U0, %rax sbb $0, T cmp D, %rax jc L(single_div_done) sub D, %rax add $1, T L(single_div_done): mov T, (QP) FUNC_EXIT ret L(first): C FIXME: Could delay some of these until we enter the loop. push%r15 push%r14 push%r13 push%r12 push%rbx push%rbp mov D, B2 imulDINV, B2 neg B2 C mov B2, B2md C sub D, B2md mov D, B2md neg B2md C D not needed until final reduction pushD mov UN_INPUT, UNC Clobbers D mov DINV, %rax mul U1 mov %rax, Q0 add U1, %rdx mov %rdx, Q1 mov B2, %rax mul U1 mov -8(UP, UN, 8), U0 mov (UP, UN, 8), U1 C mov Q1, (QP, UN, 8) add %rax, U0 adc %rdx, U1 sbb U2, U2 dec UN mov U1, %rax jz L(final) ALIGN(16) C Loop is 28 instructions, 30 decoder slots
Re: div_qr_1 interface
Torbjorn Granlund t...@gmplib.org writes: ni...@lysator.liu.se (Niels Möller) writes: The interesting thing is that the next higher function, mpn_div_qr_1, should return the high quotient limb separately. I am not sure I agree. Please explain. A long time ago, we choose an interface for sbpi1_div_qr which does *not* store the most significant limb; instead it returns it. I think it was the intention that a new top-level mpn_div_qr should follow that convention, and not store the top limb. I don't remember if that was purely for esthetic reasons, or if it also simplified the divideconquer division code. Having some of the special division case functions, like mpn_div_qr_1, storing the top quotient limb leads to unnecessary complications when writing the general mpn_div_qr; it forces mpn_div_qr for n/1 to do the top limb itself and call mpn_div_qr_1 with (n-1)/1, which seems like a case of bad impedance mismatch. But then, requirements for div_qr_1 don't necessarily translate to requirements for the primitive division loops we're discussing now. For the recent mpn_div_qr_1 and mpn_div_qr_1n_pi1, it turned out to work fine to have mpn_div_qr_1 for n/1 do the top quotient limb up front, and let mpn_div_qr_1n_pi1 generate n-1 quotient limbs (but n-1 *full* limbs, since we pass in a high u limb, reduced to uh d). It's unclear to me if the best design for the unnormalized case is to follow mpn_div_qr_1n_pi1 or not. It get be clearer after writing a decent mpn_div_qr_1u_pi1. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Amd64 relocation R_X86_64_32S in a static lib
Torbjorn Granlund t...@gmplib.org writes: Now we have (at least) two OpenBSD ABIs for AMD64, pre 5.3 and now 5.3, 5.4. To make sense of things, I would not be surprised to see R_X86_64_64 banned from 5.5 and on, creating a 3rd OpenBSD AMD64 ABI. I don't understand the fine details of which reloc types make sense in pic code, but if I understand Philip correctly, the main problem is not a ABI change, but changed compiler default. And then you get link errors when linking together pic objects created by the compiler, with non-pic objects created from the gmp assembly files. I think you'd get similar problems as if a user configures gmp with, e.g., --disable-shared CFLAGS=-fpic. Do you agree with this analysis? To figure out if an invocation $(CC) $(CFLAGS) generates pic code or not, one needs a configure test like AC_TRY_COMPILE( ... #if __PIC__ #error bla bla #endif ... ) Using the result of that test when deciding whether or not assembly files should use pic mode should give the correct behavior, both with compilers doing pic by default, and users enabling it explicitly. /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Amd64 relocation R_X86_64_32S in a static lib
Torbjorn Granlund t...@gmplib.org writes: If one tries to assemble and link the former code on a 5.3/5.4 system, one gets again the error relocation R_X86_64_32S can not be used when making a shared object; recompile with -fPIC. What other objects were involved in the link? I guess something like crt.o was compiled as non-pic on pre-5.3 systems, but as pic-code on 5.3? I'd like to know if the problem is that the linker input files are an somehow incompatible mix of pic and non-pic objects, or if the problem is is that by default the linker insists on creating a position independent executable (PIE). I'd call this an ABI change since code valid under the AMD64 ELF ABI as well as on older OpenBSD releases is no longer supported. If you move an *executable* from pre-5.3 to 5.3, it might still work fine, if non-position-independent-executables are still supported, and compatible crt.o and the like are included in the executable. It's very clear that the ABI-spec defines which executables and shared libraries are valid, but I imagine one could have a long flamewa^H^H^H^H^H^Hdebate on whether or not the ABI really specifies which object files and static libraries are valid when used with the default compiler settings. But in any case, it's going to be inconvenient for users if static libraries can't be copied over to a later OS release. (There are pros and cons with feature tests, though. I consider the free software OSes out there to be primary targets for GMP, and I'd like to avoid creating GMP objects that cannot be moved between FooBSD x.y and FooBSD x.y+k. I think it would be a good test to compile gmp on pre-5.3, and use that library when compiling and linking some simple program on openbsd-5.3. Trying both with a static and a shared gmp library. And maybe trying also with --disable-assembly. If any of that fails, that's clearly demonstrates some type of ABI incompatibility. I'd also like to see the motivation for the discontinued support for movl$17, %edi call*jumptab(,%rax,8) Which seems to be a very sensible thing to use in pic code. Offsets between text and data segment are usually known at link time. Does openbsd use some flavor of address space randomization which adds a random load-time offset between .text and .data? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Amd64 relocation R_X86_64_32S in a static lib
Philip Guenther guent...@gmail.com writes: That's bad. But then why not allow R_X86_64_32S too? Because there's no guarantee that the target is within 2GB, as the reference may resolve to the symbol in a different shared object which is too far away. Can you use ELF visibility to provide the needed guarantee that it will never resolve to a symbol in a different shared object? I think references *within* a shared object is an important and common case, worthy of some optimization. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH] Support powerpc64le-linux platform
Ulrich Weigand uweig...@de.ibm.com writes: - Update configfsf.guess/sub to current upstream versions. We should definitely do that, independent of the other fixes. Please remind if we don't get to that within a week or so (I can't do it right away). - Change mpn/powerpc64/elf.m4 to generate appropriate assembler code for the new ELFv2 ABI. The relevant change here is that we no longer require function descriptors; instead, functions that use the TOC register need to provide a pair of global and local entry points. For more details, see the GCC patch messages here: http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01144.html http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01141.html Looks sane to me (without understanding the fine details). Maybe Torbjörn has some comments. I'm not really familiar neither with how ABI variants are handled in GMP, nor with powerpc assembly. Would this be OK for GMP mainline? I think so. (I haven't submitted a GMP patch before; if there's anything else I should be doing to facilitate this process, please let me know!) We ask for FSF copyright assignments for all substantial changes. Yours is maybe borderline, but I think we'd still like to have some paperwork. We can arrange it off-list. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: porting gmp to pnacl
Kovács Zoltán zol...@geogebra.org writes: 3. When I try to compile a simple demo manually, this is what I get: kovzol@kovzol:~/tmp/nacl_sdk/naclports/src/out/repository/gmp-5.1.3/demos$ make isprime /home/kovzol/tmp/nacl_sdk/pepper_31/toolchain/linux_x86_newlib/bin/x86_64-nacl-gcc -std=gnu99 -O2 -pedanticisprime.c -o isprime /tmp/ccBeMP1x.o: In function `main': isprime.c:(.text+0x11c): undefined reference to `__gmpz_init' At least here the problem is obvious: You don't link with libgmp at all. I have no idea why the Makefile generates a command like that. If you think that is a probem with GMP configure and Makefiles, submit a bug report to gmp-b...@gmplib.org, carefully following the instructions in the manual. For help with building gmp, gmp-devel is not the right forum. Perhaps there's someone on gmp-discuss who can help you, otherwise I think you'll have better luck in a forum more specific to your platform and toolchain. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Side-channel silent modular inverse
Torbjorn Granlund t...@gmplib.org writes: I suppose I already suggested that one computes a^{-1} mod b as a^{b-1} mod b, using a plain old modexp. For prime b (which is an important special case). I realise that this will be asymptotically slower, in this setting O(n^3) vs O(n^2), but it ought have a much lower constant factor. I think powm actually was slower when I tried, for the sizes of a few limbs which were relevant for ecdsa, but I'm not sure. Some benchmarking is needed. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Side-channel silent modular inverse
Torbjorn Granlund t...@gmplib.org writes: * mpn_sec_add_1 I'd say use the obvious algorithm: Create vector of n-1 zeros and then the input limb arg at index 0, invoke mpn_add_n. That's good enough, I guess, at the cost of some extra scratch space. * mpn_cnd_neg Create zero vector, invoke mpn_sub_n. That doesn't make it conditional. And I see no obvious way to do conditional negation on top of mpn_cnd_sub_n. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Including limits.h in gmp-impl.h
Marc Glisse marc.gli...@inria.fr writes: Then I am considering pushing the attached patch soon. Do we really need a configure test? Which C compilers lack limits.h? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Side-channel silent modular inverse
Torbjorn Granlund t...@gmplib.org writes: Compute T = 2 x A using mpn_add_n or mpn_lshift. Use mpn_cnd_sub_n with A, T as arguments. Should work (except if T is computed mod B^n, one doesn't get the correct carry out, but that isn't needed here). But it's a bit awkward, and this is a performacne critical function; some 30% of the time to create a side-channel silent ecdsa signature is spent doing the modular inversion. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Side-channel silent modular inverse
Torbjorn Granlund t...@gmplib.org writes: I had neglected the significance of modular inversion for elliptic curve arithmetic. In my implementation, it's needed in two places. * For ecdsa signatures, the random nonce k is inverted mod q, the ecc group order. * When converting coordinates back from jacobian representation to affine representation. Then the z coordinate is inverted mod p. My suggestion was just for a reasonably efficient fall-back. Fair enough. /nisse -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: TODO for 5.2
Torbjorn Granlund t...@gmplib.org writes: I notice you make this non-public. Is it premature to make it part of the public interface? I just put the declarations together with the other mpn_sec_* functions. I think it makes sense to make mpn_sec_div_*, mpn_sec_minvert and mpn_sec_powm public together. Does mpn_sec_powm need more work (besides the rename) before made public? Also about naming, I noticed the inconsistency between mpz_invert and mpn_invert, which do very different things. If/when we add an mpz function corresponding to mpn_sec_minvert, I guess that would be named mpz_sec_invert? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 01/20] Delete mpn/generic/sizeinbase.c
Torbjorn Granlund t...@gmplib.org writes: Checked old mail, but it is only mentioned 7 months earlier, when Kevin aded it. It does not make sense as an internal function, the MPN_SIZEINBASE macro plays its role. I suppose we could preserve it, and make it public. Below a patch to that effect. I think it's good to make this function public. Needs documentation. *Maybe* it doesn't need its own tests, since it's the same as mpz_sizeinbase but with different argument type. Regards, /Niels diff -r ffe8c8da8c90 configure.ac --- a/configure.ac Wed Jan 01 16:45:58 2014 +0100 +++ b/configure.ac Wed Jan 01 23:07:15 2014 +0100 @@ -2800,7 +2800,8 @@ mul mul_fft mul_n sqr mul_basecase sqr_basecase nussbaumer_mul \ mulmid_basecase toom42_mulmid mulmid_n mulmid \ random random2 pow_1\ - rootrem sqrtrem get_str set_str scan0 scan1 popcount hamdist cmp\ + rootrem sqrtrem sizeinbase get_str set_str \ + scan0 scan1 popcount hamdist cmp\ perfsqr perfpow \ gcd_1 gcd gcdext_1 gcdext gcd_subdiv_step \ gcdext_lehmer \ diff -r ffe8c8da8c90 gmp-h.in --- a/gmp-h.in Wed Jan 01 16:45:58 2014 +0100 +++ b/gmp-h.in Wed Jan 01 23:07:15 2014 +0100 @@ -1575,6 +1575,9 @@ #define mpn_set_str __MPN(set_str) __GMP_DECLSPEC mp_size_t mpn_set_str (mp_ptr, const unsigned char *, size_t, int); +#define mpn_sizeinbase __MPN(sizeinbase) +__GMP_DECLSPEC size_t mpn_sizeinbase (mp_srcptr, mp_size_t, int); + #define mpn_sqrtrem __MPN(sqrtrem) __GMP_DECLSPEC mp_size_t mpn_sqrtrem (mp_ptr, mp_ptr, mp_srcptr, mp_size_t); -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Including limits.h in gmp-impl.h
Marc Glisse marc.gli...@inria.fr writes: So, do you agree with the change? Yes. I don't understand why the C spec doesn't define USHORT_MAX the way it does, but we should nevertheless define our constant in the same way as C's USHORT_MAX. (care to add a suitable comment?) I think the comment could be improved. But if you intend to remove the definition soon anyway, it doesn't matter. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: [PATCH 01/20] Delete mpn/generic/sizeinbase.c
Torbjorn Granlund t...@gmplib.org writes: This makes sense to me. Please apply (and please write documentation!). Done. Not sure about the right interface. Now, the docs says that it is required that n 0 and that the input must be normalized (i.e., top limb non-zero), even though the current implementation also allows n = 0. I'm thinking that maybe it makes sense to generalize the function to allow unnormalized input. If so, one also has to specify the return value in case the input is zero. I'm leaning towards returning 0 in this case, but maybe consistency with the mpz function (which returns 1) is more important. That would be a compatible extension of the currently documented interface, so it doesn't have to be decided before release. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: TODO for 5.2 v3
[0] = b; + MPN_ZERO (scratch + 1, n-1); + return mpn_sub_n (rp, ap, scratch, n); +} diff -r 84343784aa3d mpn/x86_64/cnd_neg.asm --- /dev/null Thu Jan 01 00:00:00 1970 + +++ b/mpn/x86_64/cnd_neg.asmTue Jan 07 15:13:37 2014 +0100 @@ -0,0 +1,66 @@ +dnl AMD64 mpn_cnd_neg + +dnl Copyright 2014 Free Software Foundation, Inc. + +dnl This file is part of the GNU MP Library. + +dnl The GNU MP Library is free software; you can redistribute it and/or modify +dnl it under the terms of the GNU Lesser General Public License as published +dnl by the Free Software Foundation; either version 3 of the License, or (at +dnl your option) any later version. + +dnl The GNU MP Library is distributed in the hope that it will be useful, but +dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +dnl License for more details. + +dnl You should have received a copy of the GNU Lesser General Public License +dnl along with the GNU MP Library. If not, see https://www.gnu.org/licenses/. + +include(`../config.m4') + +C INPUT PARAMETERS +define(`cnd', `%rdi') dnl rcx +define(`rp', `%rsi') dnl rdx +define(`up', `%rdx') dnl r8 +define(`n',`%rcx') dnl r9 +C scratch parameter is ignored + +ABI_SUPPORT(DOS64) +ABI_SUPPORT(STD64) + +ASM_START() + TEXT + ALIGN(16) +PROLOGUE(mpn_cnd_neg) + FUNC_ENTRY(4) + + lea (up,n,8), up + lea (rp,n,8), rp + + neg n + + neg cnd + sbb cnd, cndC make cnd mask, also copy to cy + +L(loop): + mov (up, n, 8), %r8 + sbb R32(%rax), R32(%rax)C Save carry + xor cnd, %r8C Clears carry, very annoying. + add R32(%rax), R32(%rax)C Restore carry + adc $0, %r8 + mov %r8, (rp, n, 8) + inc n + jne L(loop) + C Generate carry out, if cnd and x != 0 + inc R32(%rax) + and R32(cnd), R32(%rax) + FUNC_EXIT() + ret +EPILOGUE() + + +PROLOGUE(mpn_cnd_neg_itch) + xor R32(%rax), R32(%rax) + ret +EPILOGUE() diff -r 84343784aa3d mpn/x86_64/cnd_swap.asm --- /dev/null Thu Jan 01 00:00:00 1970 + +++ b/mpn/x86_64/cnd_swap.asm Tue Jan 07 15:13:37 2014 +0100 @@ -0,0 +1,61 @@ +dnl AMD64 mpn_cnd_swap + +dnl Copyright 2014 Free Software Foundation, Inc. + +dnl This file is part of the GNU MP Library. + +dnl The GNU MP Library is free software; you can redistribute it and/or modify +dnl it under the terms of the GNU Lesser General Public License as published +dnl by the Free Software Foundation; either version 3 of the License, or (at +dnl your option) any later version. + +dnl The GNU MP Library is distributed in the hope that it will be useful, but +dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +dnl License for more details. + +dnl You should have received a copy of the GNU Lesser General Public License +dnl along with the GNU MP Library. If not, see https://www.gnu.org/licenses/. + +include(`../config.m4') + +C INPUT PARAMETERS +define(`cnd', `%rdi') dnl rcx +define(`up', `%rsi') dnl rdx +define(`vp', `%rdx') dnl r8 +define(`n',`%rcx') dnl r9 +C scratch parameter is ignored + +ABI_SUPPORT(DOS64) +ABI_SUPPORT(STD64) + +ASM_START() + TEXT + ALIGN(16) +PROLOGUE(mpn_cnd_swap) + FUNC_ENTRY(4) + + neg cnd + sbb cnd, cndC make cnd mask + + lea (up,n,8), up + lea (vp,n,8), vp + + neg n + +L(loop): + mov (up, n, 8), %r8 + mov (vp, n, 8), %r9 + mov %r8, %r10 + xor %r9, %r8 + and cnd, %r8 + xor %r8, %r10 + xor %r8, %r9 + mov %r10, (up, n, 8) + mov %r9, (vp, n, 8) + inc n + jne L(loop) + + FUNC_EXIT() + ret +EPILOGUE() -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: TODO for 5.2 v3
bodr...@mail.dm.unipi.it writes: Ciao, Il Mar, 7 Gennaio 2014 4:58 pm, Niels Möller ha scritto: Here's a first patch adding a couple of other functions. Benchmarking and testing is missing (except that the sec_minvert tests still pass). Interesting... Another thing I was about to ask, but forgot, is use of volatile. I added it to the mpn_cnd_swap and mpn_sec_eq_ui prototypes, in an attempt to tell the compiler to not be too clever. But I'm not entirely sure where it is useful. diff -r 84343784aa3d mpn/x86_64/cnd_neg.asm --- /dev/nullThu Jan 01 00:00:00 1970 + +++ b/mpn/x86_64/cnd_neg.asm Tue Jan 07 15:13:37 2014 +0100 What about _itch functions when the .asm source is used? They're supposed to exist, and return zero. E.g., +PROLOGUE(mpn_cnd_neg_itch) + xor R32(%rax), R32(%rax) + ret +EPILOGUE() +C scratch parameter is ignored _itch should return 0... And some obvious patch like the following would take advantage of this. Nice, but won't make any difference until we have mpn_sec_add_1 assembly, right? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mini-gmp
Zimmermann Paul paul.zimmerm...@inria.fr writes: Here are the division functions I had to re-implement on top of mini-gmp (I know mpn_divrem is obsolete, but its interface is better suited for MPFR): Do you need fraction limbs? From a quick look, it seems like you implemented that for mpn_divrem_1, but not for mpn_divrem and mpn_tdiv_qr? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Cleaning out varargs
Torbjorn Granlund t...@gmplib.org writes: Since we cleaned out KR stuff a few years back, we could require stdarg.h without causing additional portability problems, right? Right. There's also some $U left in Makefile.am, which as far as I understand is a remnant of ansi2knr support. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Cleanups
Torbjorn Granlund t...@gmplib.org writes: We have a non-public mpn_invert function family, which actually computes a fraction. I suggest that we rename them to something like mpn_finvert* (at least before they are made public). In our papers, I think we ended up using the term reciprocal. Maybe some abbreviation of that would make sense. Also applies to invert_limb. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
Zimmermann Paul paul.zimmerm...@inria.fr writes: we have looked at the mpz_limbs interface, Thanks for the review. it seems it won't be sufficient for our needs in MPFR. Consider for example the following function, which generates nbits random bits into mp[]: I see. In this particular case, I think the right gmp interface change is to add mpn_urandomb and mpn_rrandomb (similar to current mpn_random and mpn_random2, but with a randstate argument). If I understand this correctly, the main obstacle is that random number internals use mpz functions, which is an undesirable dependency for mpn functions. Any other problematic cases? MPFR_ASSERTN (nbits = 1); /* To be sure to avoid the potential allocation of mpz_urandomb */ ALLOC(z) = SIZ(z) = MPFR_PREC2LIMBS (nbits); PTR(z) = mp; [...] mpz_urandomb (z, rstate, nbits); I think this is brittle, and I don't think this style should be officially supported in the public gmp interface. mpz_urandomb is not included in mini-gmp, but if you, e.g., try the same with other functions, it will most likely break with mini-gmp which assigns many of the output parameters using mpz_swap. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
Torbjorn Granlund t...@gmplib.org writes: ni...@lysator.liu.se (Niels Möller) writes: I see. In this particular case, I think the right gmp interface change is to add mpn_urandomb and mpn_rrandomb (similar to current mpn_random and mpn_random2, but with a randstate argument). If I understand this correctly, the main obstacle is that random number internals use mpz functions, which is an undesirable dependency for mpn functions. We should indeed do this. After 5.2? For testing the result, I'd write a wrapper C program that run old and new code in separate processes, and compare that the results are identical for the same (now and then regenerated) seeds. If the generation is intended to be machine-independent, it would also make sense to also add tests generating some random numbers from a few fixed seeds, and compare to expected (constant) numbers. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
Marc Glisse marc.gli...@inria.fr writes: We already have function mpz_array_init which encourages thinking of the buffer as belonging to a variable, There's a ChangeLog entry from the day before yesterday: * doc/gmp.texi: Undocument mpz_array_init. I take it Torbjörn really wants to deprecate this function. Are you using it in mpfr? Gratuitous extra alloc+dealloc pairs can have a horrible performance effect. One can still use mpz_realloc2, which can be viewed as an allocation hint to the gmp implementation (so in mini-gmp, it's harmless but also mostly useless, due to the frequent use of mpz_swap). Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: New mpn random generators
Torbjorn Granlund t...@gmplib.org writes: ni...@lysator.liu.se (Niels Möller) writes: After 5.2? Probably, but you're welcome to fix it today also. :-) Or would you suggest that we postpone the release? No, I donät think the release whould wait for new randomness functions. As you say, the design is not entirely obvious, and maybe that has to take some time. Actually, there are two parts: (i) Rewrite current code to not use mpz internally, and (ii) define a new public mpn interface. First part has no interface design issues. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
Marc Glisse marc.gli...@inria.fr writes: I was talking of the case where I already have a preexisting buffer I want to use as _mp_d, I don't want to allocate a new one with mpz_realloc2. That seems to be a few steps further than the new limb interface goes, and I can understand not wanting to support that. No, the new limb interface can only do it the other way round. If you want to use a buffer as mpz_t output, and access it with low-level fucntions too, you have give the reponsibility for allocating it to _mpz_realloc, and write the code to tolerate that any use of the buffer as an mpz_t output might do an unexpected reallocation. When does this situation arise? In the given example, the root cause was a missing mpn-level function in gmp. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
sec_invert performance
I'm confused by the pretty bad performance of sec_invert. Try e.g., ./tune/speed -s 1-200 -P invert mpn_gcdext mpn_sec_invert mpz_powm_sec Say we have inputs of n bits. Then my understadning of these algorithms is: * mpn_gcdext, for sizes of interest, is Lehmer's algorithm, O(n^2), with a reasonably small constant since it reduces roughly one limb per iteration. * mpn_sec_powm does n squarings + redc (and then some n/window_size multiplies and table lookups). And the squarings and redc are O(n^2), so total running time should be O(n^3). * mpn_sec_invert has a loop of 2n iterations. Each iterations makes 9 calls to various O(n) functions, like mpn_cnd_add_n, mpn_rshift, etc. So it ought to be O(n^2), and a constant factor slower than mpn_gcdext. But when I run speed, *both* ratios mpn_invert / mpn_gcdext and mpz_sec_powm / mpn_gcdext keep growing with n. So mpn_invert is not a constant factor slower than mpn_gcdext, as I would have expected, and it's significantly slower than mpn_sec_powm for largish sizes like 100 limbs. So it looks like it's O(n^3), not O(n^2). Any idea what's going on? And then, there's also a sharp step at n = 113 or 114 (maybe that's an artifact of the testmachine, but it's reproducible here). $ ./speed -s 100-120 -c -r mpn_gcdext mpn_sec_invert mpz_powm_sec overhead 5.42 cycles, precision 1 units of 6.25e-10 secs, CPU freq 1600.00 MHz mpn_gcdext mpn_sec_invert mpz_powm_sec 100#286753.00 97.9985 68.7822 101#290981.00 96.3812 69.3512 102#297710.00 94.5395 68.9482 103#301772.00 94.9747 69.6444 104#308297.00 94.8706 69.0619 105#310131.00 96.1709 70.7960 106#312842.00 96.6907 70.8039 107#321814.00 96.2753 70.4176 108#313583.00 100.2901 73.2001 109#321223.00 100.3469 72.9247 110#328283.00 99.1086 72.5408 111#333812.00 99.4312 72.9408 112#335619.00 100.7185 73.4388 113#343734.00 46.7804 73.1944 114#352574.00 46.8327 72.9897 115#356299.00 46.9179 73.3826 116#366523.00 46.0239 72.2455 117#371422.00 46.5541 72.7283 118#367404.00 50.3468 74.7026 119#374872.00 48.6011 74.7945 120#376131.00 48.3491 75.4658 Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: sec_invert performance
Torbjorn Granlund t...@gmplib.org writes: I quick guess is that the exponent is fixed for powm, not a function of the input size. Ah, its hardcoded to always use 6 limbs. If I change it to use the same size as b and m, results are a bit different: $ ./speed -s 1-64 -c -r mpn_gcdext mpn_sec_invert mpz_powm_sec overhead 5.43 cycles, precision 1 units of 6.25e-10 secs, CPU freq 1600.00 MHz mpn_gcdext mpn_sec_invert mpz_powm_sec 1#1393.539.25632.8281 2#3700.838.45302.8537 3#5628.829.22964.7198 4#7387.089.87056.5449 5#9270.67 12.22719.4494 6 #11102.89 13.2985 11.9077 7 #12923.54 14.2668 14.8305 8 #14985.50 14.9848 17.2423 9 #16826.96 17.1405 21.4808 10 #18998.00 17.9614 24.7252 11 #20390.00 19.6006 29.7402 12 #23244.21 19.9373 32.0728 13 #24235.58 22.4859 39.6380 14 #26027.70 23.5316 44.4109 15 #29157.89 23.7497 47.7260 16 #30368.62 26.6497 53.3113 17 #32534.43 28.4042 60.1591 18 #33580.67 29.1482 69.0387 19 #34424.60 31.2023 78.6099 20 #36470.00 32.4313 84.8543 21 #36405.25 35.8625 99.2162 22 #41335.25 33.8338 97.9815 23 #39861.67 38.2781 116.4425 24 #41421.67 39.4835 124.2736 25 #44087.67 41.0079 138.0154 26 #48386.33 39.9403 139.1569 27 #40868.00 50.5410 184.1573 So if you know that the modulo is prime (or has some other type of known factorization), using powm actually beats sec_invert for sizes up to 6 limbs. I still puzzled why the ratio between sec_invert and gcdext doesn't approach some constant (I'd expect the ratio to start growing again for sizes above the threshold for subquadratic gcdext). Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: sec_invert performance
Torbjorn Granlund t...@gmplib.org writes: All linear-time functions really take Bn+C time, and quadratic-time functions take An²+Bn+C, of course. If the sec_invert function has small B and/or C while gcdext has larger B and/or C, then the observed pattern seem even more reasonable. Do you agree? Not sure. I had expected the quadratic terms to dominate (and hence the ratio almost converging to the ratio between the n^2 coefficients), well before we cross the GCDEXT_DC_THRESHOLD which is on the order of 500 limbs. But I haven't had any deep thoughts about it. (I guess one could also try to estimate all of A, B and C from the measurements. Putting such numerics into speed is something I'd like to do sometime). Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
ni...@lysator.liu.se (Niels Möller) writes: For mpn_set_d, I think it would make some sense to have it return a base-2 exponent, and write the mantissa to a few limbs. Number of limbs would be a constant, part of the ABI, similar to LIMBS_PER_DOUBLE but renamed for external use. mp_bitcnt_t mpn_set_d (mp_limb_t rp[LIMBS_PER_BOUBLE], d); Below is a patch to do this (and return value is long, not mp_bitcnt_t, since it needs to be signed). What do you think? Regards, /Niels diff -Nprc gmp.2462ec1bc65c/gmp-h.in gmp/gmp-h.in *** gmp.2462ec1bc65c/gmp-h.in 2014-02-06 09:29:33.665957246 +0100 --- gmp/gmp-h.in2014-02-06 09:29:33.669957220 +0100 *** __GMP_DECLSPEC mp_bitcnt_t mpn_scan1 (mp *** 1584,1589 --- 1584,1593 #define mpn_set_str __MPN(set_str) __GMP_DECLSPEC mp_size_t mpn_set_str (mp_ptr, const unsigned char *, size_t, int); + #define MPN_SET_D_SIZE ((53 + GMP_NUMB_BITS - 1) / GMP_NUMB_BITS) + #define mpn_set_d __MPN(set_d) + __GMP_DECLSPEC long mpn_set_d (mp_ptr, double); + #define mpn_sizeinbase __MPN(sizeinbase) __GMP_DECLSPEC size_t mpn_sizeinbase (mp_srcptr, mp_size_t, int); diff -Nprc gmp.2462ec1bc65c/mpn/generic/set_d.c gmp/mpn/generic/set_d.c *** gmp.2462ec1bc65c/mpn/generic/set_d.c1970-01-01 01:00:00.0 +0100 --- gmp/mpn/generic/set_d.c 2014-02-06 09:29:33.669957220 +0100 *** *** 0 --- 1,160 + /* mpn_set_d convert from double to array of mp_limb_t. + + Copyright 1996, 1999-2002, 2006, 2012, 2014 Free Software Foundation, Inc. + + This file is part of the GNU MP Library. + + The GNU MP Library is free software; you can redistribute it and/or modify + it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free Software + Foundation; either version 2 of the License, or (at your option) any + later version. + + or both in parallel, as here. + + The GNU MP Library is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License + for more details. + + You should have received copies of the GNU General Public License and the + GNU Lesser General Public License along with the GNU MP Library. If not, + see https://www.gnu.org/licenses/. */ + + #include gmp.h + #include gmp-impl.h + + #ifdef XDEBUG + #undef _GMP_IEEE_FLOATS + #endif + + #ifndef _GMP_IEEE_FLOATS + #define _GMP_IEEE_FLOATS 0 + #endif + + /* Stores the mantissa as MPN_SET_D_SIZE limbs at rp, and returns the +exponent. Defined so that after e = mpn_set_d (rp, double d), + + d = {rp, MPN_SET_D_SIZE} 2^e + +Always returns with rp[MPN_SET_D_SIZE - 1] 0, but besides this +requirement, the precise normalization is unspecified. + */ + long + mpn_set_d (mp_ptr rp, double d) + { + long exp; + ASSERT (d 0); /* Exclude negatives */ + ASSERT (d == d);/* Exclude NaN */ + ASSERT (d != 0.5*d);/* Exclude infinities */ + + MPN_ZERO (rp, MPN_SET_D_SIZE); + + #if _GMP_IEEE_FLOATS + { + #if defined (__alpha) __GNUC__ == 2 __GNUC_MINOR__ == 8 + /* Work around alpha-specific bug in GCC 2.8.x. */ + volatile + #endif + union ieee_double_extract x; + mp_size_t i; + uint_least32_t manl, manh; + mp_bitcnt_t shift; + x.d = d; + exp = x.s.exp; + manl = x.s.manl; + manh = x.s.manh; + + if (exp 0) + /* Set implicit one bit */ + manh |= 1UL 20; + else + /* Denormalized case, not optimized */ + while ( (manh (1UL 22)) == 0) + { + manh = (manh 1) | (manl 31); + manl = (manl 1) 0x; + exp--; + } + + /* Assign mantissa */ + #if GMP_NUMB_BITS = 32 + rp[0] = manl; + #else + for (i = 0; manl 0; manl = GMP_NUMB_BITS, i++) + rp[i] = manl GMP_NUMB_MASK; + #endif + + shift = 32 % GMP_NUMB_BITS; + i = 32 / GMP_NUMB_BITS; + + rp[i++] |= ( (mp_limb_t) manh shift) GMP_NUMB_MASK; + if (GMP_NUMB_BITS - shift 20) + { + manh = (GMP_NUMB_BITS - shift); + #if GMP_NUMB_BITS = 32 + rp[i] = manh; + #else + for (; manh 0; manh = GMP_NUMB_BITS, i++) + rp[i] = manh GMP_NUMB_MASK; + #endif + } + + exp -= (1023 + 52); + } + #else /* Non-ieee-floats */ + { + mp_size_t i; + mp_limb_t x; + + exp = 0; + /* Normalize to range 0.5 = d 1.0. Arbitrarily, start with 32-bit steps. */ + if (d 1.0) + do + { + d *= 4294967296.0; /* 2^32 */ + exp -= 32; + } + while (d 1.0); + else if (d = 4294967296.0) + do + { + d *= 1.0 / 4294967296.0; + exp += 32; + } + while (d
Re: mpz_limbs interface
Marc Glisse marc.gli...@inria.fr writes: Why not return int, since int is what we use for _mp_size? int would be sufficient for all known float formats I guess. I choose long because that was the signed type closest to mp_bit_cnt_t. Is 53 really safe for non-IEEE double? Maybe something based on DBL_MANT_DIG, assuming that FLT_RADIX==2? The existing code was limited to 53 bits (e.g., in the definition of LIMBS_PER_DOUBLE), I didn't try to improve on that part. I don't think we are still supporting gcc-2.8... That volatile hack? I just copied it from __gmp_extract_double. Probably it's no longer needed. From a performance POV, it may not be optimal to split the sign/infinity/nan/zero test from the rest, but I agree it makes the interface simpler. I'm not sure what's the right thing for mpn_set_d to do with such inputs. It seemed simplest to disallow them, and then we need no special error return, and we can produce a normalized output in all cases. It's a bit strange to disallow zero, though. One alternative might be to allow zero, and specify that rp[MPN_SET_D_SIZE-1] == 0 iff the input is zero. + ASSERT (d != 0.5*d); /* Exclude infinities */ That excludes more than infinities, it might also exclude FLT_TRUE_MIN, no? I would have expected that FLT_TRUE_MIN * 0.5 == 0.0. And then it's not excluded by that assert. But I'm not familiar with those fine floating point details. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
Vincent Lefevre vinc...@vinc17.net writes: In rounding mode toward +inf (FE_UPWARD), FLT_TRUE_MIN * 0.5 gives FLT_TRUE_MIN. I see. You may need: ASSERT (d - d == 0); That should exclude both infinities and NaN:s, right? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
Vincent Lefevre vinc...@vinc17.net writes: Yes. If you want to differentiate between NaN and infinities: d != d is true only for NaN. I'm looking at the definition of DOUBLE_NAN_INF_ACTION in gmp-impl.h. Maybe it could be simplified to a single, unconditional, definition #define DOUBLE_NAN_INF_ACTION(x, a_nan, a_inf) \ do { \ if (UNLIKELY ((x) - (x) != 0.0))\ { \ if ((x) != (x)) { a_nan; } \ else { a_inf; } \ } \ } while (0) Does any of the current three definitions have any advantage over the above portable one? I would hope that the above is * no slower then the _GMP_IEEE_FLOATS definition (which extracts the exponent via a union). * equivalent to the NOP definition for VAX; then x - x != 0.0 should be always false, and the compiler ought to know. * faster than the generic version, which implies a sequence of three (false) tests in the common case. I'm tempted to 1. Replace the DOUBLE_NAN_INF_ACTION macro. 2. Check in the mpn_set_d changes (possibly with smaller tweaks). 3. Make mpn_get_d public, and document both functions. Ok? Or should this wait until after 5.2? Ah, and about mpn_get_d. Maybe it would make sense to drop the sign argument, and and use the sign of the size argument, or limit to non-negative numbers only (it's trivial for the caller to negate the output when desired). Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
Marc Glisse marc.gli...@inria.fr writes: On Thu, 6 Feb 2014, Niels Möller wrote: I'm looking at the definition of DOUBLE_NAN_INF_ACTION in gmp-impl.h. Maybe it could be simplified to a single, unconditional, definition Note that there exist standard functions like isfinite. But so far, we don't use any libm functions in gmp. * no slower then the _GMP_IEEE_FLOATS definition (which extracts the exponent via a union). Are you sure about that? I'm not sure about anything... I'd expect that moving floating point values to integer registers (or to memory) is often slow. If x is a denormal, x-x may take a very long time to compute. I don't think the case of denormals is very important for performance. And there's a bit-by-bit normalization loop for that case later on. How long is very long? 10 cycles? 100 cycles? Ok? Or should this wait until after 5.2? I would have been in favor of avoiding new features less than a couple months before the release, but since there are already plenty of *sec* changes going on... Personally, I feel a bit easier about adding feature close to release, than about rewriting code for existing features. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
Torbjorn Granlund t...@gmplib.org writes: But then the shifting done in __gmp_extract_double will be needed in every caller. Or am I mistaken? I think you are right, for all current callers but one. The exception is mpq_set_d, which needs to remove trailing zero bits for normalization, using a different shift count. Other possible exceptions: * The various cmp_d functions should in the most likely case not need shifting of all of the mantissa bits. They could be arranged to first compare whichever top bits are the easiest to shift into the right position (in some cases, that would include only the most significant one bit of the mantissa), followed by a full comparison to all mantissa bits, followed by more and more unlikely comparisons for lower limbs being zero. (For this trick to work, I suspect mpn_set_d can't be allowed any normalization freedom; the most significant one bit must be placed in a fix position). * mpfr, which as far as I understand doesn't do limbification in the same way as mpf. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
bodr...@mail.dm.unipi.it writes: Il Gio, 6 Febbraio 2014 9:56 am, Niels ha scritto: What do you think? void mpz_set_d (mpz_ptr r, double d) { int negative; ! mp_limb_t tp[MPN_SET_D_SIZE]; ! mpz_t t = MPZ_ROINIT_N (tp, MPN_SET_D_SIZE); Uhm, in tests/mpz/t-limbs.c we test MPZ_ROINIT_N only #if __STDC_VERSION__ = 199901 I wasn't sure about whether or not that type of initializer, with one member initialized with a pointer to a local variable, is allowed in c89. But I wrote a small test program which compiled with no complaints from gcc -std=c89. My interpretation of that was that a pointer to a local variable is constant enough to use in an initializer. But of course, if that doesn't work portably, MPZ_ROINIT_N mustn't be used here. exp = mpn_set_d (mpz_limbs_write (r, MPN_SET_D_SIZE), d); Not sure if it really matters, but my intention was to try to avoid doing realloc twice. With the above, you may realloc once growing the allocation to MPN_SET_D_SIZE (which could be 2), and then realloc again in mpn_mul_2exp. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_limbs interface
bodr...@mail.dm.unipi.it writes: So, maybe we can discuss about adding a new function to the _limbs interface: mp_ptr mpz_init_limbs_write (mpz_t x, mp_size_t n) { ALLOC (x) = n; PTR (x) = (mp_ptr) (*__gmp_allocate_func) (n*GMP_LIMB_BYTES); SIZ (x) = 0; return PTR (x); } What is this intended for? Looks a bit like like mpz_init2. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mpn_sec_div tests
I tried to improve the mpn_sec_div_* tests, by adding the following to the t-div test: --- a/tests/mpn/t-div.c Thu Feb 06 08:14:38 2014 +0100 +++ b/tests/mpn/t-div.c Sat Feb 08 14:57:23 2014 +0100 @@ -308,8 +308,15 @@ main (int argc, char **argv) MPN_COPY (qp, junkp, nn - dn + 1); qp[nn - dn] = mpn_sec_div_qr (qp, rp, nn, dup, dn, scratch); ASSERT_ALWAYS (ran == scratch[itch]); - check_one (qp, rp, np, nn, dup, dn, mpn_sec_div_qr, 0); - + check_one (qp, rp, np, nn, dup, dn, mpn_sec_div_qr (unnorm), 0); +#if 1 + MPN_COPY (rp, np, nn); + if (nn = dn) + MPN_COPY (qp, junkp, nn - dn + 1); + qp[nn - dn] = mpn_sec_div_qr (qp, rp, nn, dnp, dn, scratch); + ASSERT_ALWAYS (ran == scratch[itch]); + check_one (qp, rp, np, nn, dnp, dn, mpn_sec_div_qr (norm), 0); +#endif /* Test mpn_sec_div_r */ itch = mpn_sec_div_r_itch (nn, dn); if (itch + 1 alloc) But I then get the following failure, from the mpn_sec_div_r test just after this change, *** mpn_sec_div_r failed test -300: q too small N=19e6b016cd8e4b6a ... D=0001 Q= 33cd602d9b1c96d5 98653fa4c9c6d254 cf3580b66c725b56 ... c56d5d8653fa4c9c 752544f3580b66c7 15b576194fe93271 d49513cd602d9b1c R= N-Q*D= 19e6b016cd8e4b6a cc329fd264e3692a 679ac05b36392dab ... 3a92a279ac05b363 8adabb0ca7f49938 ea4a89e6b016cd8e 2b6aec329fd264e3 nn = 3136, dn = 1, qn = 3136 I'm a bit puzzled. I can't spot any error in the test code (which is a copy of the test just above, but using dnp (normalized divisor) rather than dup. So maybe the mpn_sec_div_qr call overwrites something in this case? I'll try to debug this fairly soon. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_sec_div tests
ni...@lysator.liu.se (Niels Möller) writes: I'm a bit puzzled. I can't spot any error in the test code (which is a copy of the test just above, but using dnp (normalized divisor) rather than dup. Found it now. Comment a few lines down: pass qp[] from the previous function, mpn_sec_div_qr. */ So reordering the tests should work. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mpn_sec_powm
After some discussion with Torbjörn, I intend to change mpn_sec_powm to take the exponent size argument in bits, rather than limbs (because the current code may leak high bit of the exponent, which can cause serious problems for some applications, e.g., dsa signatures). But first, I'd like to fix a more minor issue. The window size k should always be smaller than exponent size, k = eb. I.e, if you have an exponent of eb bits, the largest window size should correspond to computing a table of all 2^{eb} powers, and then select one of them. This works fine, except for corner cases in the tuning of adding a corresponding assert gives an assertion failure from tuneup. Say the POWM_SEC_TABLE table (created by tuneup) starts with 1, then the win_size function uses a table like x[] = { 0, 1, 17, ..., ~(mp_bitcnt_t) 0} The window size k (k = 1) is chosen so that x[k-1] eb = x[k]. So we get eb = 1 == k = 1 eb = 2 == k = 2 eb = 3 == k = 2 On the other hand, if the tuned table starts with 2, we get x[] = { 0, 2, 17, ..., ~(mp_bitcnt_t) 0} eb = 1 == k = 1 eb = 2 == k = 1 eb = 3 == k = 2 So it differs only for eb = 2: are going to use a window size of either 1 or 2 bits in this case. But if the tuneup program generates a first entry of 1, or a larger value, is based on comparison of exponent size of 1 bit with the two window sizes k == 1 and k == 2. And eb = 1 and k = 2 violates the k = eb condition (and any corresponding assert), and generally makes no sense; it means we compute a table of 4 entries, and then use the exponent bit to select one of the two first entries. So there's some kind of off-by-one error here. I think tuning should start with nbits = 2 (that's tuneup's name for the exponent size, and the first size where it makes sense to try a window k 1), and time k = 1 and k = 2. And if k = 2 was faster, output nbits - 1 == 1, not nbits like the current code. It might also be interesting to note that there's only a single gmp-mparam.h file where POWM_SEC_TABLE starts with 1: x86_64/bobcat. There are a larger number of gmp-mparam.h files where it starts with 2, which would shrink to 1 if the code is fixed to emit nbits - 1. All this might not be terribly important, but there is a conditional for eb windowsize (before the loop, i.e., for the initial value of eb), not exercised by the testsuite, but needed because of this tuneup peculiarity (see https://gmplib.org/devel/lcov/shell/tmp/lcov/gmp/mpn/sec_powm.c.gcov.html). And I'd like to eliminate that test. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_roinit_n documentation
bodr...@mail.dm.unipi.it writes: Should we add to the documentation of mpz_roinit_n (mpz_t x, const mp_limb_t *xp, mp_size_t xs) that the array pointed by xp must have at least a readable limb even if xs==0? If that's not obvious from other documentation of mpz (which I doubt it is), I think it deserves documentation. Can you add that? Also applies to the MPZ_ROINIT_N macro. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_sec_powm
ni...@lysator.liu.se (Niels Möller) writes: After some discussion with Torbjörn, I intend to change mpn_sec_powm to take the exponent size argument in bits, rather than limbs (because the current code may leak high bit of the exponent, which can cause serious problems for some applications, e.g., dsa signatures). Any comments on the below patch? Regards, /Niels diff -Nrc2 gmp.133eee634d4a/doc/gmp.texi gmp/doc/gmp.texi *** gmp.133eee634d4a/doc/gmp.texi Mon Feb 10 22:12:32 2014 --- gmp/doc/gmp.texiMon Feb 10 22:12:32 2014 *** *** 5744,5761 ! @deftypefun void mpn_sec_powm (mp_limb_t *@var{rp}, const mp_limb_t *@var{bp}, mp_size_t @var{bn}, const mp_limb_t *@var{ep}, mp_size_t @var{en}, const mp_limb_t *@var{mp}, mp_size_t @var{n}, mp_limb_t *@var{tp}) ! @deftypefunx mp_size_t mpn_sec_powm_itch (mp_size_t @var{bn}, mp_size_t @var{en}, size_t @var{n}) Set @var{R} to @m{B^E \bmod @var{M}, (@var{B} raised to @var{E}) modulo @var{M}}, where @var{R} = @{@var{rp},@var{n}@}, @var{M} = @{@var{mp},@var{n}@}, ! and @var{E} = @{@var{ep},@var{en}@}. ! It is required that @math{@var{B} 0}, that @math{@var{E} 0} specifically ! with @m{@var{ep}[@var{en}-1] @neq 0, @var{ep}[@var{en}-1] != 0}, and that ! @math{@var{M} 0} is odd. No overlapping between @var{R} and the input operands is allowed. This function requires scratch space of @code{mpn_sec_powm_itch(@var{bn}, ! @var{en}, @var{n})} limbs to be passed in the @var{tp} parameter. The scratch space requirements are guaranteed to increase monotonously in the operand sizes. --- 5744,5759 ! @deftypefun void mpn_sec_powm (mp_limb_t *@var{rp}, const mp_limb_t *@var{bp}, mp_size_t @var{bn}, const mp_limb_t *@var{ep}, mp_bitcnt_t @var{ebits}, const mp_limb_t *@var{mp}, mp_size_t @var{n}, mp_limb_t *@var{tp}) ! @deftypefunx mp_size_t mpn_sec_powm_itch (mp_size_t @var{bn}, mp_bitcnt_t @var{ebits}, size_t @var{n}) Set @var{R} to @m{B^E \bmod @var{M}, (@var{B} raised to @var{E}) modulo @var{M}}, where @var{R} = @{@var{rp},@var{n}@}, @var{M} = @{@var{mp},@var{n}@}, ! and @var{E} consists of the least @var{ebits} in the area pointed to by @var{ep}. ! It is required that @math{@var{B} 0}, and that @math{@var{M} 0} is odd. No overlapping between @var{R} and the input operands is allowed. This function requires scratch space of @code{mpn_sec_powm_itch(@var{bn}, ! @var{ebits}, @var{n})} limbs to be passed in the @var{tp} parameter. The scratch space requirements are guaranteed to increase monotonously in the operand sizes. diff -Nrc2 gmp.133eee634d4a/gmp-h.in gmp/gmp-h.in *** gmp.133eee634d4a/gmp-h.in Mon Feb 10 22:12:32 2014 --- gmp/gmp-h.inMon Feb 10 22:12:32 2014 *** *** 1660,1666 #define mpn_sec_powm __MPN(sec_powm) ! __GMP_DECLSPEC void mpn_sec_powm (mp_ptr, mp_srcptr, mp_size_t, mp_srcptr, mp_size_t, mp_srcptr, mp_size_t, mp_ptr); #define mpn_sec_powm_itch __MPN(sec_powm_itch) ! __GMP_DECLSPEC mp_size_t mpn_sec_powm_itch (mp_size_t, mp_size_t, mp_size_t) __GMP_ATTRIBUTE_PURE; #define mpn_sec_tabselect __MPN(sec_tabselect) --- 1660,1666 #define mpn_sec_powm __MPN(sec_powm) ! __GMP_DECLSPEC void mpn_sec_powm (mp_ptr, mp_srcptr, mp_size_t, mp_srcptr, mp_bitcnt_t, mp_srcptr, mp_size_t, mp_ptr); #define mpn_sec_powm_itch __MPN(sec_powm_itch) ! __GMP_DECLSPEC mp_size_t mpn_sec_powm_itch (mp_size_t, mp_bitcnt_t, mp_size_t) __GMP_ATTRIBUTE_PURE; #define mpn_sec_tabselect __MPN(sec_tabselect) diff -Nrc2 gmp.133eee634d4a/mpn/generic/sec_powm.c gmp/mpn/generic/sec_powm.c *** gmp.133eee634d4a/mpn/generic/sec_powm.c Mon Feb 10 22:12:32 2014 --- gmp/mpn/generic/sec_powm.c Mon Feb 10 22:12:32 2014 *** *** 257,265 void mpn_sec_powm (mp_ptr rp, mp_srcptr bp, mp_size_t bn, ! mp_srcptr ep, mp_size_t en, mp_srcptr mp, mp_size_t n, mp_ptr tp) { mp_limb_t ip[2], *mip; - mp_bitcnt_t ebi; int windowsize, this_windowsize; mp_limb_t expbits; --- 257,264 void mpn_sec_powm (mp_ptr rp, mp_srcptr bp, mp_size_t bn, ! mp_srcptr ep, mp_bitcnt_t ebi, mp_srcptr mp, mp_size_t n, mp_ptr tp) { mp_limb_t ip[2], *mip; int windowsize, this_windowsize; mp_limb_t expbits; *** *** 268,272 int cnd; ! ASSERT (en 0 ep[en - 1] != 0); ASSERT (n = 1 ((mp[0] 1) != 0)); /* The code works for bn = 0, but the defined scratch space is 2 limbs --- 267,271 int cnd; ! ASSERT (ebi 0); ASSERT (n = 1 ((mp[0] 1) != 0)); /* The code works for bn = 0, but the defined scratch space is 2 limbs *** *** 274,279 ASSERT (bn = 1); - MPN_SIZEINBASE_2EXP(ebi, ep, en, 1); - windowsize = win_size (ebi); --- 273,276 *** *** 416,420 mp_size_t ! mpn_sec_powm_itch (mp_size_t bn, mp_size_t en, mp_size_t n) { int windowsize; --- 413,417
Re: mpn_sec_powm
Torbjorn Granlund t...@gmplib.org writes: The exponent size argument docs implies that bits in the leading limb beyond the bit count argument are to be ignored. I am not sure that is a wise promise. Given the current implementation, it's natural. But we could document that it is required that any left over bits in the top limb must be zero. Would that be better? Please use something else than ebits, since that sounds like the arguments contains bits with individual meaning. IIRC enb would follow conventions used elsewhere in the manual. Ok, then I should switch to enb both in the docs and the code. /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_sec_powm
Torbjorn Granlund t...@gmplib.org writes: Please use something else than ebits, since that sounds like the arguments contains bits with individual meaning. IIRC enb would follow conventions used elsewhere in the manual. Naming is hard ;-) The docs for mpn_sec_invert use nbcnt (I guess that's your choice). Docs for mpf_urandomb use nbits. Other mp_bitcnt_t arguments in the manual give little guidance, names like n, op, starting_bit, bit_index, prec, bit, m2exp. So enb here is as good as any. /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_sec_powm
ni...@lysator.liu.se (Niels Möller) writes: Maybe one problem is that to make a fair comparison between window sizes k and k+1, the exponent bit size should be divisible by both k and k+1. I'm testing the below patch. Output from four consecutive runs: #define POWM_SEC_TABLE 1,29,203,839,1409 #define POWM_SEC_TABLE 1,11,203,399,899 #define POWM_SEC_TABLE 1,5,203,579,1289 #define POWM_SEC_TABLE 1,17,203,839,1409 At least the middle value is consistent ;-) It means that we should increase the window size from 3 to 4 bits when enb 203. I.e., benchmarking was done at 204 = 17*12 bits. Increasing the window size here saves 17 modular multiplications (on four limbs) and table lookups (out of 68), at the cost of 8 additional multiplies in the precomputation, and each of the remaining 51 table lookups getting twice as expensive. Which sounds more or less sane to me. /Niels --- a/tune/tuneup.c Tue Feb 11 22:26:02 2014 +0100 +++ b/tune/tuneup.c Tue Feb 11 22:27:40 2014 +0100 @@ -1868,7 +1868,7 @@ tune_powm_sec (void) mp_size_t n; int k, i; mp_size_t itch; - mp_bitcnt_t nbits, nbits_next, possible_nbits_cutoff; + mp_bitcnt_t nbits, possible_nbits_cutoff; const int n_max = 3000 / GMP_NUMB_BITS; const int n_measurements = 5; mp_ptr rp, bp, ep, mp, tp; @@ -1914,7 +1914,10 @@ tune_powm_sec (void) /* For nbits == 1, we should always use k == 1, so no need to tune that. Starting with nbits == 2 also ensure that nbits always is larger than the windowsize k+1. */ - for (nbits = 2; nbits = n_max * GMP_NUMB_BITS; ) + for (nbits = 2; + nbits = n_max * GMP_NUMB_BITS k 10; + /* Increment, and round upwards to a multiple of k(k+1) */ + nbits = k*(k+1)*(1+nbits/(k*(k+1 { n = (nbits - 1) / GMP_NUMB_BITS + 1; @@ -1972,9 +1975,6 @@ tune_powm_sec (void) } else possible_nbits_cutoff = 0; - - nbits_next = nbits * 65 / 64; - nbits = nbits_next + (nbits_next == nbits); } printf (\n); TMP_FREE; -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: sec_karatsuba
bodr...@mail.dm.unipi.it writes: I tried to write a _sec_ version of Karatsuba multiplication. It is not intended for 5.2, of course, but it is ready and I like to share it. It is attached. Cool. It obviously is slower than the non_sec toom22, but not terribly: 180.00635 #0.90791.0035 190.00677 #0.97221.0510 200.00761 #0.88290.9771 210.00833 #0.92910.9859 So threshold around 20 limbs, or 1280 bits. So 4096-bit rsa (with 2048-bit secret primes) is one potential application. I used evaluation in +1 instead of -1, to avoid branches due to the sign, For the -1 point, one could use mpn_sub + mpn_cnd_neg to compute |a-b|. And then also conditonally negate and sign-extend the corresponding product. I don't know what's best, it's a couple of negations vs an extra limb in the product. By the way, if you have the time to look at the code and comment, I'll appreciate. I haven't read it very carefully, so only one minor point, ASSERT (an = bn); s = an 1; n = an - s; t = bn - n; ASSERT (mpn_sec_mul_itch (an, bn) = 4*n + 4); if (UNLIKELY (s + t = n)) { mpn_mul_basecase (rp, ap, an, bp, bn); return; } I think I'd prefer to do the check for the too unbalanced case earlier, and not rely on t being signed. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Bug found in nightbuilds
Torbjorn Granlund t...@gmplib.org writes: I suspect this is a new bug, and that the error is in the test code. It seems to happen for many seeds, around one in 25, for s390. If the bug were old it'd had happened before. Hmm. Look like it's the returning of the high limb via the separate *qp which is broken? And there's no (non-inline) assembly involved, its generic/mpn_div_qr_1.c. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Bug found in nightbuilds
Torbjorn Granlund t...@gmplib.org writes: Also, keeping DIV_QR_1_UNNORM_THRESHOLD = 1 surely does the job, but it puts another test it the critical path. Perhaps that's unavoidable, I haven't checked. I'm not entirely sure I follow you, but I guess the alternative is something like if (nn 0) if (BELOW_THRESHOLD (...)) { do { ... nn-- ...} while (nn 0); } else { invert_limb(...); return div_qr_1n_pi1 (...) cnt; } return uh; Then the BELOW_THRESHOLD test can be always false when the threshold is 0. A clever enough compiler could optimize it away also if the threshold is 1, but I have no idea if gcc does that. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel