Re: gcd_11 asm
Hi,

On Sat, 24 August 2019 12:14 am, Torbjörn Granlund wrote:
> "Marco Bodrato" writes:
>   It is not elegant, I agree, but maybe joining them both in a single
>   .asm file, so that the jump is local?
>
> We might do that, but it makes things a lot more complicated as there

The mpn_gcd_11_for_gcd_22 entry point proposed by Niels is much more
elegant than my workaround...

>   Even if... maybe I'd call them _1o1o and _2o2o, as someone suggested :-)
>
> Do you want an entry point also for _22 which accepts even operand(s)?

No, but I usually like a little bit of coherence in the naming
conventions, so that I can more easily remember the exceptions in the
calling conventions :-)

Bye,
m
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
Re: gcd_11 asm
"Marco Bodrato" writes:

  It is not elegant, I agree, but maybe joining them both in a single
  .asm file, so that the jump is local?

We might do that, but it makes things a lot more complicated as there
will be more variants (almost a cross product of the current _22 and _11
variants).

  > I'd say this is a very generic problem. Our fine library will misbehave
  > too if somebody overrode mpn_mul_basecase with something incompatible.

  If _mul_basecase does not behave as the documented function is supposed
  to, of course. But in this case the library will misbehave if somebody
  overrides the assembly version of mpn_gcd_11 with the C code we
  distribute in mpn/generic/gcd_11.c ...

I think the situation with our private gcd_11 is really no worse than
with some other internal functions or tables. It is perhaps a bit of a
mine for our own development: if we e.g. remove the toplevel
x86_64/gcd_11.asm and as a result configure ends up with the C gcd_11,
things will likely break.

  For now, I can say that they both are great pieces of code.
  Even if... maybe I'd call them _1o1o and _2o2o, as someone suggested :-)

Do you want an entry point also for _22 which accepts even operand(s)?

-- 
Torbjörn
Please encrypt, key id 0xC8601622
Re: gcd_11 asm
t...@gmplib.org (Torbjörn Granlund) writes:

> (I'm not 100% happy with our private ABI between _22 and _11, but for
> now it will serve its purpose well enough, I think.)

To get some protection, maybe one could define a function with the
desired API, say

  mp_double_limb_t
  mpn_gcd_11_for_gcd_22 (mp_limb_t u, mp_limb_t v)
  {
    mp_double_limb_t g;
    g.d0 = gcd_11 (u, v);
    g.d1 = 0;
    return g;
  }

Then gcd_22 could tail call that function, and gcd_11 assembly
supporting that return type could override it to just be an alias.

As a bonus, one would get a more sane behavior if for some reason one
tries to use asm gcd_22 but C gcd_11 or vice versa.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
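The wrapper proposed in the message above can be tried out in isolation. The following is only a sketch under stated assumptions: the typedefs are stand-ins for GMP's internal types, the plain binary gcd stands in for the C gcd_11, and gcd_22_finish is a hypothetical name for the point where a gcd_22 would hand over to the single-limb case. It is not GMP's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for GMP's internal types (assumption: the real definitions
   live in GMP's private headers and may differ). */
typedef uint64_t mp_limb_t;
typedef struct { mp_limb_t d0, d1; } mp_double_limb_t;

/* Plain binary gcd of two odd limbs, standing in for the C gcd_11.
   It returns a single limb and makes no promise about a high word. */
mp_limb_t
gcd_11 (mp_limb_t u, mp_limb_t v)
{
  assert ((u & 1) && (v & 1));          /* both operands must be odd */
  while (u != v)
    {
      if (u > v)
        {
          u -= v;                       /* odd - odd: even and non-zero */
          do u >>= 1; while ((u & 1) == 0);
        }
      else
        {
          v -= u;
          do v >>= 1; while ((v & 1) == 0);
        }
    }
  return u;
}

/* The wrapper from the message: same inputs, two-limb return type,
   high limb explicitly cleared. */
mp_double_limb_t
mpn_gcd_11_for_gcd_22 (mp_limb_t u, mp_limb_t v)
{
  mp_double_limb_t g;
  g.d0 = gcd_11 (u, v);
  g.d1 = 0;
  return g;
}

/* Hypothetical end of a gcd_22: once both high limbs are zero, finish
   with a call in tail position whose return type matches. */
mp_double_limb_t
gcd_22_finish (mp_limb_t u0, mp_limb_t v0)
{
  return mpn_gcd_11_for_gcd_22 (u0, v0);
}
```

Because caller and callee now share a return type, the call in gcd_22_finish is an ordinary tail call at the C level, and an assembly gcd_11 that already leaves the high word zero could take the place of the wrapper, as the message suggests.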
Re: gcd_11 asm
Hi,

On Fri, 23 August 2019 3:18 pm, Torbjörn Granlund wrote:
>   Is it safe to tail-call (jump to, in asm) another function that may not
>   have the same return type?
>
> A private call can have any private ABI.

I agree, if it is private.

> Dynamically or statically linked, yes, gcd_11 can be overridden.

I always imagine the static library as a single block, but I should know
(and you are obviously right) that it is not :-)

> The "hidden" attribute could take care of a dynamic library. For
> static, I am not aware of anything which could help.

It is not elegant, I agree, but maybe joining them both in a single .asm
file, so that the jump is local?

> I'd say this is a very generic problem. Our fine library will misbehave
> too if somebody overrode mpn_mul_basecase with something incompatible.

If _mul_basecase does not behave as the documented function is supposed
to, of course. But in this case the library will misbehave if somebody
overrides the assembly version of mpn_gcd_11 with the C code we
distribute in mpn/generic/gcd_11.c ...

> (I'm not 100% happy with our private ABI between _22 and _11, but for
> now it will serve its purpose well enough, I think.)

For now, I can say that they both are great pieces of code.
Even if... maybe I'd call them _1o1o and _2o2o, as someone suggested :-)

Bye,
m
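For the dynamic-library case discussed above, this is roughly what the "hidden" attribute looks like in C with GCC or Clang. It is a minimal sketch: PRIVATE_ABI, private_gcd_11 and public_gcd_odd are hypothetical names chosen for illustration, the Euclidean loop is just a placeholder body, and nothing here is taken from GMP's sources.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;     /* stand-in for GMP's limb type */

/* With GCC or Clang, a hidden symbol is not exported from the shared
   object, so calls to it from inside the library bind locally and the
   program cannot interpose its own definition at run time. */
#if defined(__GNUC__)
#define PRIVATE_ABI __attribute__ ((visibility ("hidden")))
#else
#define PRIVATE_ABI             /* no equivalent assumed elsewhere */
#endif

/* Hypothetical internal helper whose calling contract is private to the
   library.  The body is a plain Euclidean gcd used as a placeholder. */
PRIVATE_ABI mp_limb_t
private_gcd_11 (mp_limb_t u, mp_limb_t v)
{
  while (v != 0)
    {
      mp_limb_t t = u % v;
      u = v;
      v = t;
    }
  return u;
}

/* Exported entry point; its call to the helper stays inside the
   shared object. */
mp_limb_t
public_gcd_odd (mp_limb_t u, mp_limb_t v)
{
  assert ((u & 1) && (v & 1));  /* documented precondition: odd inputs */
  return private_gcd_11 (u, v);
}
```

As the quoted reply says, this only covers a shared library. Objects pulled from a static archive are resolved by the ordinary link step, where the attribute gives no such protection.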
Re: gcd_11 asm
"Marco Bodrato" writes:

  One question about PIC/PLT and the gcd_22 to gcd_11 tail-call.

3 questions, actually...

  Is it safe to tail-call (jump to, in asm) another function that may not
  have the same return type?

A private call can have any private ABI.

  I mean, is it possible, for a program dynamically-linked to the library,
  to force the usage of another gcd_11 function returning its result in
  %rax as expected but not taking care of %rdx?
  I mean: such a gcd_11 function would be completely valid...

Dynamically or statically linked, yes, gcd_11 can be overridden.

  Can the tail-call be forced as internal to the library?
  Or maybe it is so, by default...

The "hidden" attribute could take care of a dynamic library. For
static, I am not aware of anything which could help.

I'd say this is a very generic problem. Our fine library will misbehave
too if somebody overrode mpn_mul_basecase with something incompatible.
:-)

(I'm not 100% happy with our private ABI between _22 and _11, but for
now it will serve its purpose well enough, I think.)

-- 
Torbjörn
Re: gcd_11 asm
Hi,

On Fri, 23 August 2019 7:37 am, Niels Möller wrote:
> I've had a look at the latest gcd_11 asm, and it's really neat,
> including naturally getting %rdx zero on return.

One question about PIC/PLT and the gcd_22 to gcd_11 tail-call.

Is it safe to tail-call (jump to, in asm) another function that may not
have the same return type?

I mean, is it possible, for a program dynamically-linked to the library,
to force the usage of another gcd_11 function returning its result in
%rax as expected but not taking care of %rdx?
I mean: such a gcd_11 function would be completely valid...

Can the tail-call be forced as internal to the library?
Or maybe it is so, by default...

Bye,
m
Re: gcd_11 asm
t...@gmplib.org (Torbjörn Granlund) writes:

> The really weird thing is that tzcnt, encoded identically to rep;bsf is
> much faster than a bare bsf on AMD platforms.

Weird indeed!

Regards,
/Niels
Re: gcd_11 asm
ni...@lysator.liu.se (Niels Möller) writes:

  I've had a look at the latest gcd_11 asm, and it's really neat,
  including naturally getting %rdx zero on return.

  One question: the bd2 and bd4 versions use

    L(top): rep;bsf %rdx, %rcx      C tzcnt!

  I've not seen this before, but a quick web search indicates that tzcnt
  is the same as bsf, except that it has a well defined result also when
  the input is zero. But in these loops, we should get to this
  instruction only for non-zero %rdx. So are there any other subtleties?

That, and that the flags are set quite differently. But the code cares
neither about the flags nor about what might or might not happen for
zero input.

The really weird thing is that tzcnt, encoded identically to rep;bsf, is
much faster than a bare bsf on AMD platforms.

One would think (1) an instruction B which works like instruction A
except that B has some additional corner case requirements would not be
faster than A, and (2) why didn't they just make instruction A faster
and also define it for 0?, and (3) why didn't they make bsf faster too,
it should be trivial.

You might say "but thanks to rep;bsf we KNOW that it is well-defined for
0". Good point. Except that rep;bsf will run just like bsf on any older
x86 CPU out there, i.e. be undefined for 0. Therefore safe use of
rep;bsf requires knowledge of whether this instruction is specifically
handled as tzcnt (which asks for e.g. running the cpuid instruction).
Conclusion: silently making bsf well-defined is really the same as
silently making rep;bsf well-defined (except that the latter has a byte
longer encoding).

The performance difference is huge. E.g., AMD Ryzen can execute 6 times
more rep;bsf than plain bsf per cycle, each with 2/3 of the latency.

See also: https://gmplib.org/~tege/x86-timing.pdf

-- 
Torbjörn
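The point that the loop never feeds a zero into the bit scan can be illustrated in portable C. This is a hedged sketch, not the bd2/bd4 assembly: ctz64_defined and gcd_odd are hypothetical names, and ctz64_defined merely models the tzcnt convention of returning the operand width (64) for a zero input, which a bare bsf does not guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zeros with tzcnt-like semantics: defined for every
   input, returning the operand width (64) when the input is zero. */
unsigned
ctz64_defined (uint64_t x)
{
  unsigned n = 0;
  if (x == 0)
    return 64;
  while ((x & 1) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* gcd of two odd limbs.  The count is taken only of the difference of
   two unequal odd numbers, which is even and non-zero, so the zero
   input case of the bit scan is never reached. */
uint64_t
gcd_odd (uint64_t u, uint64_t v)
{
  assert ((u & 1) && (v & 1));
  while (u != v)
    {
      uint64_t d = u > v ? u - v : v - u;   /* |u - v|, non-zero here */
      uint64_t m = u < v ? u : v;           /* min (u, v) */
      assert (d != 0);
      u = d >> ctz64_defined (d);           /* strip zeros: odd again */
      v = m;
    }
  return u;
}
```

Since d is never zero inside the loop, a scan that is undefined for zero (as bsf is, or GCC's __builtin_ctzll) would compute the same results here. That matches the remark that the code does not care what happens for zero input.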
gcd_11 asm
Hi,

I've had a look at the latest gcd_11 asm, and it's really neat,
including naturally getting %rdx zero on return.

One question: the bd2 and bd4 versions use

  L(top): rep;bsf %rdx, %rcx      C tzcnt!

I've not seen this before, but a quick web search indicates that tzcnt
is the same as bsf, except that it has a well defined result also when
the input is zero. But in these loops, we should get to this instruction
only for non-zero %rdx. So are there any other subtleties?

Regards,
/Niels