Re: gcd_11 asm

2019-08-24 Thread Marco Bodrato
Ciao,

On Sat, 24 August 2019 12:14 am, Torbjörn Granlund wrote:
> "Marco Bodrato"  writes:

>   It is not elegant, I agree, but maybe joining them both in a single
>   .asm file, so that the jump is local?
>
> We might do that, but it makes things a lot more complicated as there

The mpn_gcd_11_for_gcd_22 entry point proposed by Niels is much more
elegant than my workaround...

>   Even if... maybe I'd call them _1o1o and _2o2o, as someone suggested :-)
>
> Do you want an entry point also for _22 which accepts even operand(s)?

No, but I usually like a little bit of coherence in the naming conventions,
so that I do more easily remember the exceptions in the calling
conventions :-)

Ĝis,
m

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: gcd_11 asm

2019-08-23 Thread Torbjörn Granlund
"Marco Bodrato"  writes:

  It is not elegant, I agree, but maybe joining them both in a single .asm
  file, so that the jump is local?

We might do that, but it makes things a lot more complicated as there
will be more variants (almost a cross product of the current _22 and _11
variants).

  > I'd say this is a very generic problem.  Our fine library will misbehave
  > too if somebody overrode mpn_mul_basecase with something incompatible.

  If _mul_basecase does not behave as the documented function is supposed
  to, of course. But in this case the library will misbehave if somebody
  overrides the assembly version of mpn_gcd_11 with the C code we distribute
  in mpn/generic/gcd_11.c ...

I think the situation with our private gcd_11 is really no worse than
with some other internal functions or tables.  It is perhaps a bit of a
mine for our own development, as if we e.g. remove the toplevel
x86_64/gcd_11.asm and as a result configure ends up with the C gcd_11,
things will likely break.

  For now, I can say that they both are great pieces of code.
  Even if... maybe I'd call them _1o1o and _2o2o, as someone suggested :-)

Do you want an entry point also for _22 which accepts even operand(s)?

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: gcd_11 asm

2019-08-23 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

> (I'm not 100% happy with our private ABI between _22 and _11, but for
> now it will serve its purpose well enough, I think.)

To get some protection, maybe one could define a function with the
desired api, say

mp_double_limb_t
mpn_gcd_11_for_gcd_22(mp_limb_t u, mp_limb_t v)
{
  mp_double_limb_t g;
  g.d0 = gcd_11 (u, v);
  g.d1 = 0;
  return g;
}

Then gcd_22 could tail call that function, and gcd_11 assembly
supporting that return type could override it to just be an alias.

As a bonus, one would get a more sane behavior if for some reason one
tries to use asm gcd_22 but C gcd_11 or vice versa.
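
A minimal C sketch of how this could fit together. Several things here are
assumptions and not taken from the GMP sources: 64-bit limbs, a plain Euclid
loop standing in for the real gcd_11, and a hypothetical gcd_22_finish
standing in for the point where gcd_22 hands over to the one-limb code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;                 /* assumption: 64-bit limbs */
typedef struct { mp_limb_t d0, d1; } mp_double_limb_t;

/* Stand-in for the one-limb gcd (not the GMP code); operands must be odd. */
static mp_limb_t
gcd_11 (mp_limb_t u, mp_limb_t v)
{
  assert ((u & 1) && (v & 1));
  while (v != 0)
    {
      mp_limb_t t = u % v;
      u = v;
      v = t;
    }
  return u;
}

/* Wrapper with the two-limb return type; the high limb is set explicitly,
   so it is correct whichever gcd_11 implementation ends up being linked. */
static mp_double_limb_t
mpn_gcd_11_for_gcd_22 (mp_limb_t u, mp_limb_t v)
{
  mp_double_limb_t g;
  g.d0 = gcd_11 (u, v);
  g.d1 = 0;
  return g;
}

/* Hypothetical final step of gcd_22 once both high limbs are zero.  The
   return types match, so a compiler is free to emit this as a tail call. */
static mp_double_limb_t
gcd_22_finish (mp_limb_t u0, mp_limb_t v0)
{
  return mpn_gcd_11_for_gcd_22 (u0, v0);
}
```

With this shape, gcd_22_finish (9, 15) yields d0 = 3 and d1 = 0 without
relying on what gcd_11 happens to leave in any other register.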

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: gcd_11 asm

2019-08-23 Thread Marco Bodrato
Ciao,

On Fri, 23 August 2019 3:18 pm, Torbjörn Granlund wrote:
>   Is it safe to tail-call (jump to, in asm) another function that may not
>   have the same return type?
>
> A private call can have any private ABI.

I agree, if it is private.

> Dynamically or statically linked, yes, gcd_11 can be overridden.

I always imagine the static library as a single block, but I should know
(and you are obviously right) that it is not :-)

> The "hidden" attribute could take care of a dynamic library.  For
> static, I am not aware of anything which could help.

It is not elegant, I agree, but maybe joining them both in a single .asm
file, so that the jump is local?

> I'd say this is a very generic problem.  Our fine library will misbehave
> too if somebody overrode mpn_mul_basecase with something incompatible.

If _mul_basecase does not behave as the documented function is supposed
to, of course. But in this case the library will misbehave if somebody
overrides the assembly version of mpn_gcd_11 with the C code we distribute
in mpn/generic/gcd_11.c ...

> (I'm not 100% happy with our private ABI between _22 and _11, but for
> now it will serve its purpose well enough, I think.)

For now, I can say that they both are great pieces of code.
Even if... maybe I'd call them _1o1o and _2o2o, as someone suggested :-)

Ĝis,
m



Re: gcd_11 asm

2019-08-23 Thread Torbjörn Granlund
"Marco Bodrato"  writes:

  One question about PIC/PLT and the gcd_22 to gcd_11 tail-call.

3 questions, actually...

  Is it safe to tail-call (jump to, in asm) another function that may not
  have the same return type?

A private call can have any private ABI.

  I mean, is it possible, for a program dynamically-linked to the library,
  to force the usage of another gcd_11 function returning its result in %rax
  as expected but not taking care of %rdx?
  I mean: such a gcd_11 function would be completely valid...

Dynamically or statically linked, yes, gcd_11 can be overridden.

  Can the tail-call be forced as internal to the library? Or maybe it is so,
  by default...

The "hidden" attribute could take care of a dynamic library.  For
static, I am not aware of anything which could help.
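
For illustration, a minimal sketch of what the "hidden" attribute looks like
in C with GCC or Clang. The names HIDDEN, lib_internal_gcd_11 and
lib_public_gcd are hypothetical and not GMP symbols, and the Euclid loop is
only a stand-in. A hidden symbol is not exported from the shared object, so
a call to it from inside the library is bound within the library and cannot
be replaced by a definition in the program. As noted above, this does
nothing for a static library.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang spelling; other compilers would need something else. */
#if defined(__GNUC__)
# define HIDDEN __attribute__ ((visibility ("hidden")))
#else
# define HIDDEN
#endif

typedef uint64_t mp_limb_t;                 /* assumption: 64-bit limbs */

/* Internal helper, kept out of the shared object's exported symbols. */
HIDDEN mp_limb_t lib_internal_gcd_11 (mp_limb_t u, mp_limb_t v);

mp_limb_t
lib_internal_gcd_11 (mp_limb_t u, mp_limb_t v)
{
  assert ((u & 1) && (v & 1));              /* both operands odd */
  while (v != 0)
    {
      mp_limb_t t = u % v;
      u = v;
      v = t;
    }
  return u;
}

/* Exported entry point; its call to the helper stays inside the library. */
mp_limb_t
lib_public_gcd (mp_limb_t u, mp_limb_t v)
{
  return lib_internal_gcd_11 (u, v);
}
```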

I'd say this is a very generic problem.  Our fine library will misbehave
too if somebody overrode mpn_mul_basecase with something incompatible.
:-)

(I'm not 100% happy with our private ABI between _22 and _11, but for
now it will serve its purpose well enough, I think.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: gcd_11 asm

2019-08-23 Thread Marco Bodrato
Ciao,

On Fri, 23 August 2019 7:37 am, Niels Möller wrote:
> I've had a look at the latest gcd_11 asm, and it's really neat,
> including naturally getting %rdx zero on return.

One question about PIC/PLT and the gcd_22 to gcd_11 tail-call.

Is it safe to tail-call (jump to, in asm) another function that may not
have the same return type?

I mean, is it possible, for a program dynamically-linked to the library,
to force the usage of another gcd_11 function returning its result in %rax
as expected but not taking care of %rdx?
I mean: such a gcd_11 function would be completely valid...

Can the tail-call be forced as internal to the library? Or maybe it is so,
by default...

Ĝis,
m



Re: gcd_11 asm

2019-08-23 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

> The really weird thing is that tzcnt, encoded identically to rep;bsf is
> much faster than a bare bsf on AMD platforms.

Weird indeed!

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: gcd_11 asm

2019-08-23 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  I've had a look at the latest gcd_11 asm, and it's really neat,
  including naturally getting %rdx zero on return.

  One question: the bd2 and bd4 versions use

  L(top): rep;bsf %rdx, %rcx  C tzcnt!

  I've not seen this before, but a quick web search indicates that tzcnt
  is the same as bsf, except that it has a well defined result also when
  the input is zero. But in these loops, we should get to this instruction
  only for non-zero %rdx. So are there any other subtleties?

That, and that the flags are set quite differently.

But the code cares neither about the flags nor about what might or might
not happen for zero input.
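
To see why the zero case never arises, here is a minimal C sketch of a
binary gcd loop for two odd operands. This is not the GMP assembly: the name
gcd_odd_odd is hypothetical, and __builtin_ctzll is the GCC/Clang builtin,
whose result is undefined for a zero argument, much like bsf.

```c
#include <assert.h>
#include <stdint.h>

/* Binary gcd of two odd 64-bit values. */
static uint64_t
gcd_odd_odd (uint64_t u, uint64_t v)
{
  assert ((u & 1) && (v & 1));
  while (u != v)
    {
      /* u != v, so d is nonzero; odd minus odd, so d is even. */
      uint64_t d = u > v ? u - v : v - u;
      uint64_t m = u < v ? u : v;

      /* The count-trailing-zeros input is therefore never zero here,
         and the shift makes d odd again. */
      d >>= __builtin_ctzll (d);

      u = d;
      v = m;
    }
  return u;
}
```

The loop only reaches the trailing-zero count with a nonzero difference, so
the behaviour of the instruction for zero input does not matter.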

The really weird thing is that tzcnt, encoded identically to rep;bsf is
much faster than a bare bsf on AMD platforms.  One would think (1) an
instruction B which works like instruction A except that B has some
additional corner case requirements would not be faster than A, and (2)
why didn't they just make instruction A faster and also define it for
0?, and (3) why didn't they make bsf faster too, it should be trivial.

You might say "but thanks to rep;bsf we KNOW that it is well-defined for
0".  Good point.  Except that rep;bsf will run just like bsf on any
older x86 CPU out there, i.e. be undefined for 0.  Therefore safe use of
rep;bsf requires knowledge of whether this instruction is specifically
handled as tzcnt (which asks for e.g. running the cpuid instruction).
Conclusion: silently making bsf well-defined is really the same as
silently making rep;bsf well-defined (except that the latter has a byte
longer encoding).
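
A hedged sketch of what such a check could look like in C. The function
names cpu_has_bmi1 and ctz64_defined are hypothetical. The sketch assumes
GCC or Clang on x86, a <cpuid.h> recent enough to provide
__get_cpuid_count, and that the BMI1 feature flag (CPUID leaf 7, subleaf 0,
EBX bit 3) is the right indicator for tzcnt. On other targets it simply
reports 0.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
# include <cpuid.h>
#endif

/* Returns 1 if CPUID reports BMI1 (the extension containing tzcnt), else 0. */
static int
cpu_has_bmi1 (void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  /* __get_cpuid_count returns 0 if the requested leaf is not supported. */
  if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return (ebx >> 3) & 1;
#endif
  return 0;
}

/* Portable alternative that does not depend on the CPU: make the zero
   case explicit, returning the operand width as tzcnt does. */
static unsigned
ctz64_defined (uint64_t x)
{
  return x ? (unsigned) __builtin_ctzll (x) : 64u;
}
```

Code that needs a defined result for zero can either branch on
cpu_has_bmi1 () at startup or use a wrapper like ctz64_defined and accept
the extra test.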

The performance difference is huge.  E.g., AMD Ryzen can execute 6 times
more rep;bsf than plain bsf per cycle, each with 2/3 of the latency.

See also: https://gmplib.org/~tege/x86-timing.pdf

-- 
Torbjörn
Please encrypt, key id 0xC8601622


gcd_11 asm

2019-08-22 Thread Niels Möller
Hi,

I've had a look at the latest gcd_11 asm, and it's really neat,
including naturally getting %rdx zero on return.

One question: the bd2 and bd4 versions use

L(top): rep;bsf %rdx, %rcx  C tzcnt!

I've not seen this before, but a quick web search indicates that tzcnt
is the same as bsf, except that it has a well defined result also when
the input is zero. But in these loops, we should get to this instruction
only for non-zero %rdx. So are there any other subtleties?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
