Re: How to calculate cycles/limb in assembly routines

2024-04-04 Thread Torbjörn Granlund
Albin Ahlbäck  writes:

  I am looking at Torbjörn's `aorsmul_1.asm' for Apple M1, and I am
  having trouble understanding how the cycles per limb number was
  calculated.

It was not calculated.  It was measured.

But that asm code was written with understanding of the M1 pipeline.
Here, 1.25 c/l was actually what was expected.

  As I understand it, the cycles per limb number represents the loop(s)
  in any routine. Looking at the main loop, it seems like it should
  scale at 10 cycles per loop (of which 2 cycles are lost due to latency
  from loading x4, I believe), for which it treats four limbs from `up'
  at a time. However, the given number is 1.25 which is half the size of
  my calculated 10 / 4.

I cannot follow your reasoning here, but as it seems incorrect from
measurements, I am not too worried.  :-)

The performance limit of 1.25 c/l (5 cycles per loop) comes from these
lines:

ADDSUB  x8, x8, CY        C W0  carry-in
ADDSUBC x4, x4, x9        C W1
ADDSUBC x5, x5, x10       C W2
ADDSUBC x6, x6, x11       C W2
csinc   CY, x7, x7, COND  C W3  carry-out

These 5 instructions form a circular dependency, and each instruction
has a one cycle latency, so 5 cycles in total.

There are other parts of the loop which also set performance limits;
e.g., mul/umulh have a throughput of 2 per cycle, so even with the
above circular dependency removed, the loop would need 4 cycles.
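
Spelling out the arithmetic implied above (my summary; it assumes one mul
and one umulh per limb, as in an addmul_1-style loop handling four limbs
per iteration):

  carry recurrence:  5 cycles / 4 limbs                          = 1.25 c/l
  multiplier limit:  8 mul+umulh / (2 per cycle) = 4 cycles / 4  = 1.00 c/l

so the circular carry dependency is the binding limit, giving 1.25 c/l.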

I hope this reply helps.  Modern CPUs have complex pipelines, with lots
of buffering to cater for dependencies and very wide execution in the
absence of dependencies.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Has there been historical work done to investigate small integer optimization?

2024-02-12 Thread Torbjörn Granlund
marco.bodr...@tutanota.com writes:

  But implementing it with the current mpz type is "impossible".
  I mean, one should break the current interface.
  Currently, _mp_d is always a pointer AND it always points to a readable
  limb, even if _mp_alloc is zero.

If we set alloc = 0 and size >= 2^30, then that means that the pointer
field is actually a numeric value, and perhaps the low 30 bits of the
size field are more bits for the numeric value.  :-)
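
A minimal sketch of what such a tagged representation could look like (my
own illustration of the idea, not the mpz_t layout nor a proposed GMP
interface; the struct and helper names are made up):

  #include <stdint.h>

  typedef struct {
    int   alloc;   /* 0: no allocated limbs                          */
    int   size;    /* >= 2^30: value is immediate; the low 30 bits
                      could extend the payload, as suggested above   */
    void *d;       /* normally a limb pointer, here the raw payload  */
  } tagged_int;

  static int
  is_immediate (const tagged_int *x)
  {
    return x->alloc == 0 && (uint32_t) x->size >= ((uint32_t) 1 << 30);
  }

  static uint64_t
  immediate_payload (const tagged_int *x)
  {
    return (uint64_t) (uintptr_t) x->d;  /* pointer bits hold the value */
  }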

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: What's a reasonable size ratio for toom32?

2023-11-19 Thread Torbjörn Granlund
Niels Möller  writes:

  I'd like to get the changes back in, piece by piece...

Sounds good!

  The below patch is the change to add /r syntax, and enable it only for
  mpn_mul and mpn_mul_basecase. Seems to work for me, tested by running
  ./tuneup and

./speed -r -s 10-500 -f 1.2  -C mpn_mul mpn_mul/0.7 mpn_mul/0.4

  Does that look reasonable?

It does!

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: What's a reasonable size ratio for toom32?

2023-10-19 Thread Torbjörn Granlund
Niels Möller  writes:

  Looks like tuning MUL_TOOM42_TO_TOOM63_THRESHOLD crashes. Even though I
  can measure these independently using speed.

  I can't debug this further at the moment, so I'm reverting these changes
  for now.

I can confirm that the tuneup program now works.  (It took a few days
extra as a FreeBSD upgrade had broken the autotool.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: What's a reasonable size ratio for toom32?

2023-10-16 Thread Torbjörn Granlund
Niels Möller  writes:

  I can't debug this further at the moment, so I'm reverting these changes
  for now.

Thank you!

The abort() happens in tuneup.c's one() function.  (I didn't analyse it
beyond running gdb to see which abort() was called.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: What's a reasonable size ratio for toom32?

2023-10-15 Thread Torbjörn Granlund
Niels Möller  writes:

  Pushed now. I've done some benchmarks on shell, on tip-of-tree GMP (no
  local changes). See numbers at the end of this message, comparing
  mpn_mul_basecase, mpn_toom22_mul and mpn_toom32_mul. 

Every single tuneup invocation made by the nightly builds in the last
few weeks has failed.  The last time it worked was around 2023-09-27.

The changes made around that time ought to be the culprit.

Failures with tuneup cause the autobuild system to rerun things 5 times;
that's my electricity bill...

Please test any changes before checking them in.  And please also follow up
on test results at, e.g., https://gmplib.org/devel/tm/gmp/date!

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: [PATCH] Fix typo in z13 bdiv_dbm1c.

2023-10-10 Thread Torbjörn Granlund
Stefan Liebler  writes:

  In my case, before calling mpn_divexact_by3c v6 was {0x1, 0x0} and thus
  mpn_bdiv_dbm1c returns 0x1, which is ANDed with 3 and then also returned
  by mpn_divexact_by3c. Therefore the test fails by chance.

OK.  We should clearly have tests/*call.asm and tests/*check.c for
s390_64 and s390_32.  (But the check file might want to re-randomise the
magic register values after checking them.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: [PATCH] Fix typo in z13 bdiv_dbm1c.

2023-09-30 Thread Torbjörn Granlund
Stefan Liebler  writes:

  The returned limb was retrieved from v6 instead of v2.
  This is also observable in failing testcase t-fat.

While I agree the code looks suspicious, I fail to trigger any test
suite failure for any Z/arch configuration.

How exactly did you trigger it?

(I try to avoid fixing any GMP bug without making sure we can trigger it
with our testsuite and test system setup.)


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: [PATCH] Revert "Move popcount and hamdist back from z14 to z13 after needed edits."

2023-08-03 Thread Torbjörn Granlund
I committed your patch (while also impersonating you).

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: [PATCH] Revert "Move popcount and hamdist back from z14 to z13 after needed edits."

2023-08-03 Thread Torbjörn Granlund
Stefan Liebler  writes:

  Thanks for that. In the meantime, gmp 6.3.0 + this patch should be used.
  Do you have plans for 6.3.1?

We tend to make point releases.

  For me "qemu-system-s390x --cpu help" at least lists some z13 models,
  but to be honest, I don't use it. I have to check.

Yes, that list I get. But passing almost any of the suggested options
causes an error message.  (I think -cpu z990 was the only one with which
I had success.  But no, I haven't tried every listed option.)

I stick to "user emulation" with qemu these days, as I cannot afford the
electricity needed for full system emulation.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: [PATCH] Revert "Move popcount and hamdist back from z14 to z13 after needed edits."

2023-08-03 Thread Torbjörn Granlund
Stefan Liebler  writes:

  Unfortunately not only the extended mnemonics are not available with z13,
  but also vpopct M3=1-3 is reserved. Thus you'll get an illegal-instruction
  if run on z13 as vector enhancement facility 1 (introduced with z14) is
  not available.

Ah, darn.  This will need to be fixed in a 6.3.1 when the dust settles.

I only have a z15 to run tests on.  I don't have a reliable way of
telling what instructions run on a particular implementation.  An
instruction table with checkmarks per implementation would avoid this
sort of problem.  IBM should be able to provide that.

A simple test I have used is to see what instructions the assembler
accepts under various command-line options.  That depends on the GNU
binutils developers to have made no mistake.

I also tried to make qemu help by passing -cpu XXX, as that has proven
itself historically for other evolving ISAs.  The GMP test setup can then
autotest things for several architectures, quickly triggering any
incorrect instruction availability assumptions.  Unfortunately for
qemu-s390x, all I get from passing say "-cpu z13" is an arm-long
non-parsable error message.  I have had no luck with any -cpu argument
whatsoever.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Why different runtimes for constant exponent and (huge) modulus for mpz_powm()?

2023-07-26 Thread Torbjörn Granlund
This thread does not seem relevant to gmp-devel readers.  Please move it
elsewhere, perhaps to gmp-discuss.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: AMD Ryzen 5 7600X gmpbench submission (result 8010.2) -- new: 9176.2

2023-07-08 Thread Torbjörn Granlund
  The result is 14.5% increase in GMPbench value, new value is 9176.2 !

That's much better, but still not as good as it should be with some
optimisation.  Specifically, there seems to be something slightly off
with the multiply performance.

Unfortunately, I don't have Zen4 hardware, nor do I have the time to optimise
GMP for Zen4 anytime soon.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: AMD Ryzen 5 7600X gmpbench submission (result 8010.2)

2023-07-08 Thread Torbjörn Granlund
herm...@stamm-wilbrandt.de writes:

  I recently bought a new 7600X PC that was not-too-expensive (619$),
  but is rank 18 on PassMark's Single Thread performance list of >3100
  CPUs:

[snip]

  GMPbench: 8010.2

Interesting!  I would hope Zen 4 could perform much better than that,
though.

Could you please repeat the test with the latest GMP snapshot?

https://gmplib.org/download/snapshot/gmp-next/

I am not sure GMP 6.2 recognises Zen4 well enough.  A recent snapshot
should do better, hopefully resulting in greater performance.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: [PATCH v3 0/4] Add addmul_1, addmul_2, and mul_basecase for IBM z13 and later

2023-06-23 Thread Torbjörn Granlund
These improvements are now (finally!) in GMP repo.

I have not run any timing tests, as I trust you to worry about the
performance.

A mistake we GMP developers have made in the past is counting cycles for
inner loops at quite large trip counts, and then accidentally adding
overhead as a side effect of beating down the cycle count.  One should
never forget that most bignum computations probably use moderately large
numbers, which makes decreasing overhead the more important goal.

Running commands like

  tune/speed -p1000 -C -s1-100 mpn_mul_basecase
  tune/speed -p1000 -C -s1-100 mpn_addmul_1.0xcafecafecafecafe

is helpful.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Requests from Microsoft IP Addresses

2023-06-19 Thread Torbjörn Granlund
Pedro Gimeno  writes:

  Stupid question maybe but, couldn't the online getbundle command be disabled?

We don't want to disable functionality for legitimate use.

If I ever get the time, I will assign more CPU cores to handling clone
commands.  It is not just bumping a number somewhere; it requires
running mercurial from nginx in a quite different manner.

Still, even with a dozen CPU cores, Github-type behaviour would be able
to bog down our web server.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Requests from Microsoft IP Addresses

2023-06-19 Thread Torbjörn Granlund
I think we now understand what happened.

It has everything to do with how Github works.

It started when the FFmpeg project changed their CI script to clone GMP
for every one of their checkins.  This is a bad idea, but not a terrible
idea.  (The right way would be to keep a local checkout and pull changes
to it.)

Now, FFmpeg is "forked" in Github, and Github works in a very peculiar
way with such forks; it pulls in changes from the project from which a
fork was created, and runs GI tests on the result.

There seems to be several hundreds of forks of FFmpeg.  So each FFmpeg
checkin triggered hundreds of clone requests to the GMP server.

The conclusion is that Github performed a DDoS attack on us for each
FFmpeg commit.  Poor design.  Horrible design.

Now FFmpeg changed their scripts again to instead download the tar file
from gnu.org.  That's less stupid, but gnu.org will get hundreds of
pointless downloads for each FFmpeg checkin.  (A plain download costs a
small fraction of a clone, so this is an improvement.  But it could
still be considered to be a DDoS attack; I leave it to the folks at FSF
to deal with this.)

The result of this all is that we lost many hours, and that Microsoft
(who owns Github and provides it with servers) now cannot reach
gmplib.org.  We won't be removing the Microsoft blocks as we don't
expect Github to change the way they are operating.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Requests from Microsoft IP Addresses

2023-06-18 Thread Torbjörn Granlund
I added "Usage conditions" to .

  "These resources are open to the public. Yet, we expect the usage of
   these resources to be used responsibly. Repeated clone command should be
   avoided. Scripting of clone commands is strongly discouraged.

   As a guideline, more than one clone command per day is considered
   excessive. More than 10 is considered abuse, and will result in having
   your IP address or entire network blocked."


One would think that anybody who is beyond the hello_world.c skillz
level would understand how to clone and then (infrequently) pull in any
changes.  And would also understand that 1400 clones in a few hours is
outrageous.

Alas, apparently not.

Case closed.  The firewall works awesomely.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Requests from Microsoft IP Addresses

2023-06-18 Thread Torbjörn Granlund
Niels Möller  writes:

  Would have been helpful with a specific reference to the user and script
  in question. I would guess this revert is intended to stop the hg clones
  (instead getting release tarballs from ftp.gnu.org):

  
https://github.com/BtbN/FFmpeg-Builds/commit/d75466340a9a5985e546fff9756b976727c4ab98

Interesting.

So the GMP server is "way too flakey" as it cannot accept bursts of
several clone request per second...!

We won't get an apology, I think.  :-)

I will keep the firewall rules blocking Microsoft as I really don't have
time for some more sophisticated approach here.  (I suppose a better
approach would be to slow down abusively repeated accesses, similar to
how TCP behaves when it encounters packet loss.  A problem here was that
a large number of IP addresses were involved in the attack; it was only
after "whois" revealed they all came from Microsoft that we understood
that they likely had the same root cause.  DDoS is hard to deal with.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Requests from Microsoft IP Addresses

2023-06-18 Thread Torbjörn Granlund
Marc Glisse  writes:

  One thing that should be doable is set up a mirror of GMP's repository
  on github, and advertise that one for CI purposes. Any user could do
  that (there are already a few), but if it was advertised on the GMP
  website, it would be more likely to be used by more people. Of course
  it would be good if people didn't set up their CI in such a wasteful
  way, but the trend doesn't make me very optimistic.

We don't know if Mike Blacker's message is completely accurate.  I assume
it was worded by a lawyer: the abusive traffic is not the fault of any
large corporation, the fault lies with "a user" and...the victim of the
abusive traffic.

It is abundantly clear that neither Github nor Microsoft think they need
to do anything about this.  The traffic continues as I write this; in
fact we had to add a few more Microsoft IP ranges to our firewall this
morning.

We certainly could make our infrastructure more resilient to this type
of abusive traffic.  The repo requests are currently handled by a single
CPU core, and that has worked fine for a very long time, but with a
clone request every few seconds, hour after hour, that CPU gets bogged
down running compressions.

So how many CPU cores should we assign to this?  Clearly, if the abusive
traffic just scaled up a little, no number of CPU cores would suffice.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Requests from Microsoft IP Addresses

2023-06-18 Thread Torbjörn Granlund
Niels Möller  writes:

  Do we have more capacity to serve the nightly tarballs to lots of
  clients (at least, they shouldn't require any CPU compression work per
  request)?

We do indeed have much more capacity for that.  But the 1 GbE connection
would quickly be saturated if a few more people adopted the same
download strategy as we now see from Microsoft's network.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Requests from Microsoft IP Addresses

2023-06-17 Thread Torbjörn Granlund
Mike Blacker  writes:

  Microsoft and GitHub have investigated the issue and determined that a
  Github user updated a script within the FFMPeg-Builds project that pulled
  content from https://gmplib.org. This build was configured to run parallel
  simultaneous tests on 100 different types of computers/architectures. This
  activity does not appear to be nefarious. GMPLIB appears to have limited
  infrastructure that could not sustain the limited, yet simultaneous
  requests.

Note that this abusive traffic is still ongoing, but it is subsiding as
I keep adding more and more Microsoft subnets to the firewall rules.  I
have much better things to do than defend a public service web server
against corporate abuse!

What would you advise me to do, should I contact a US lawyer and have
them send a cease and desist letter?

I've copied a short excerpt from the current logs (these subnets are now
going into the firewall rules too, of course).  Last afternoon (UTC) it
was about 10 times worse than shown here.

These logs are only the ones which repeatedly download large parts of
the repository.  There are many other indefinitely repeated requests,
but these below are the ones which really cause significant CPU load and
network traffic.

20.57.73.47 - - [17/Jun/2023:15:28:49 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788851 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.172.10.64 - - [17/Jun/2023:15:28:58 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788842 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.185.158.94 - - [17/Jun/2023:15:29:45 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5789473 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.57.73.47 - - [17/Jun/2023:15:30:06 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 14612259 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.185.158.94 - - [17/Jun/2023:15:30:15 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788753 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.172.10.64 - - [17/Jun/2023:15:30:35 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788836 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
172.176.188.82 - - [17/Jun/2023:15:31:31 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 579 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.49.37.18 - - [17/Jun/2023:15:31:36 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788866 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.172.10.64 - - [17/Jun/2023:15:31:45 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788864 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.185.158.94 - - [17/Jun/2023:15:31:50 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788767 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
172.177.96.47 - - [17/Jun/2023:15:32:09 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5790455 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.49.37.18 - - [17/Jun/2023:15:32:31 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788838 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.172.10.64 - - [17/Jun/2023:15:33:22 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788838 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
20.49.37.18 - - [17/Jun/2023:15:33:44 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 5788851 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"
65.52.35.13 - - [17/Jun/2023:15:33:50 +0200] "GET /repo/gmp/?cmd=getbundle HTTP/1.1" 200 14613459 "-" "mercurial/proto-1.0 (Mercurial 6.3.2)"


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Requests from Microsoft IP Addresses

2023-06-17 Thread Torbjörn Granlund
Mike Blacker  writes:

  Microsoft and GitHub have investigated the issue and determined that a
  Github user updated a script within the FFMPeg-Builds project that pulled
  content from https://gmplib.org. This build was configured to run parallel
  simultaneous tests on 100 different types of computers/architectures. This
  activity does not appear to be nefarious. GMPLIB appears to have limited
  infrastructure that could not sustain the limited, yet simultaneous
  requests.

While I appreciate getting an explanation, I find your reply really
curious.

Our machine is pretty powerful, it is a server class machine with many
cores and lots of RAM, and its connection is 1 GbE at a top class data
centre.

What we experienced was tens of thousands of requests from 20ish different
Microsoft subnets, many of which were apparently repo clone commands
which required our server to compress the contents.  In total about 8
GiB of compressed data were requested, surely many times more for the
server to compress.  All in just a few hours, before I firewalled the
attacking IP addresses off.

This is NOT legitimate use of any server on the Internet.  Your reply
seems to suggest that it is our fault, that we ought to have more
powerful servers to accommodate this behaviour.  Really?

I beg to disagree.  This traffic was, if not nefarious, very far from
acceptable.  We will keep the GMP server infrastructure, and we will
defend it from irresponsible usage like this in order to keep it
available for responsible usage.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


GMP servers are under DoS attack from Microsoft

2023-06-16 Thread Torbjörn Granlund
The GMP servers are under attack by several hundred IP addresses owned
by the Microsoft corporation. We do not know if this is done with malice by
Microsoft, if it is some sort of mistake, or if some of their cloud
customers are running the attack. The attack targets the GMP repo, with
thousands of identical requests.  The requests are cleverly chosen so as
to cause heavy system load.

We're firewalling off all of Microsoft's IP addresses as an emergency
response.

The GMP repositories are intermittently slow to access, but we expect to
resume normal operation once we have identified all Microsoft's IP
ranges.

If some Microsoft employee with enough clout reads this, please notice
that we *completely* block Microsoft now as that is the fastest remedy.
That means that email will likely not reach us.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Status of latest release for macos and arm64?

2023-05-10 Thread Torbjörn Granlund
Niels Möller  writes:

  Hi, do I get it right that latest stable release (6.2.1) is not quite
  working on Macos arm64? With the "register x18" known issue listed at
  https://gmplib.org/#STATUS, fixed in
  https://gmplib.org/repo/gmp-6.2/rev/f4ff6ff711ed (and similar fix in the
  main repo).

  Would it make sense to release a 6.2.2 just to get this fix out?

I think we should perhaps release a 6.3 instead.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-09-04 Thread Torbjörn Granlund
Marco Bodrato  writes:

  We should start writing mpn_sec_binvert :-)

I think mpn_binvert is almost sec_ naturally.

The exception is when sbpi1_bdiv_q.c or dbpi1_bdiv_q.c are invoked.  The
former has some < on data (for carry computations) and the latter has an
mpn_incr_u which is very leaky.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-09-01 Thread Torbjörn Granlund
/* FIXME: Using mpz_invert is cheating. Instead, first compute m' =
   m^-1 (mod 2^k) via Newton/Hensel. We can then get the inverse via

   2^{-k} (mod m) = ((2^k - m') * m + 1)/2^k. */
mpz_invert (t, t, m);
mpn_copyi (info->ip, mpz_limbs_read (t), mpz_size (t));

You might want to use mpn_binvert here.
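
For reference, a quick check (mine, not from the original message) of why
the formula in the quoted comment gives the right value: with
m' = m^-1 (mod 2^k), the numerator (2^k - m') * m + 1 is congruent to
-m'*m + 1 = 0 (mod 2^k), so the division by 2^k is exact; and since that
numerator is congruent to 1 (mod m), the quotient q satisfies
q * 2^k = 1 (mod m), i.e. q is indeed 2^{-k} (mod m).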

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-08-31 Thread Torbjörn Granlund
  Much more unclear to me how close it might be to the typical or average
  number of iterations needed.

That's perhaps not very interesting, as early exit is not an option
here.  (Unless this algorithm would beat plain, "leaky" inverse.)

  Currently uses exponentiation for the field inverse, and nettle's
  version of sec_invert for inverses mod the group order, as needed by
  ecdsa. Except for the somewhat obscure gost curves (Russian standards),
  where sec_invert is used for all inverses, and A/B for gost-gc512a is
  almost 30%.

Why do you use sec_invert when inverting mod the group order, when that
is of prime order?  (Yes, this question will become moot, I suppose, with
this new algorithm.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-08-31 Thread Torbjörn Granlund
  > count = (49 * bits + 57) / 17;
  >
  > Odd.

  For sure. This isn't based on local progress of the algorithm (there
  ain't no guaranteed progress for a short sequence of reduction steps),
  but based on rather complex analysis of the number of steps needed for
  the complete 2-adic quotient sequence.

That count is a proven "upper bound" of the numbver of iterations.  Does
the paper discuss how tight it is, i.e., is it close to a "lower bound"
of the required number of iterations.

  > I think your measurements are a bit optimistic, though.  My measurements
  > from slightly modified timing code suggest 4.5x slowdown, and more for
  > really small operands.

  Maybe, I tried to keep the timing really simple and portable.

I simply up'ed the number of iterations (actually made them depend on
the operand size).

  I've now implemented inverse too. See updated code below. When I run it,
  I get

  invert size 1, ref 0.293 (us), djb 1.145 (us)
  invert size 2, ref 0.721 (us), djb 1.659 (us)
  invert size 3, ref 0.994 (us), djb 2.375 (us)
  [...]
  invert size 17, ref 5.157 (us), djb 18.538 (us)
  invert size 18, ref 5.207 (us), djb 19.064 (us)
  invert size 19, ref 5.702 (us), djb 21.286 (us)

  so a slowdown of 3--4 compared to mpz_invert. This timing excludes the
  precomputation of a few needed per-modulo constants.

Very very nice!

People not familiar with what we're trying to do here might be
unimpressed with these results.  "Look, I've written new code which is 4
times slower than what we have, HURRAY!"

Comparing to the existing side-channel-silent inversion code would 
impress more, I think.

  I think the inverse flavor is still rather neat, main loop is

for (delta = 1;count > STEPS;count -= STEPS)
  {
delta = steps (, STEPS, delta, fp[0], gp[0]);
matrix_vector_mul (, STEPS, n+1, fp, gp, tp);
matrix_vector_mul_mod (, Bp, n+2, up, vp, tp);
  }

I can make it even neater:

for (delta = 1;count > STEPS;count -= STEPS)
  {
do_it_all (fp, gp, up, vp, n);
  }

:-)

  I'm thinking I'll try to implement and benchmark this for Nettle's ecc
  functions first, before attempting to update GMP function. The reason is
  that (i) I really want to do needed precomputations for all moduli of
  interest for Nettle at compile time, which would complicate the GMP
  interface if it is supported at all, and (ii) in some cases I want
  inversion to operate on numbers in montgomery representation, and then
  I'd like to fold any needed additional power of two into the precomputed
  constant.

What percentage of EC scalar mul is used for the final inversion?  Not
much, I suppose.  Does Nettle use exponentiation there today, or
mpn_sec_invert?


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-08-24 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

count = (49 * bits + 57) / 17;

Odd.

(For production code, one will need to cap the intermediates there,
at least for 32-bit machines.  Perhaps something like this:

count = (49 * bits + 57) / 17
      = (51 * bits - 2 * bits + 57) / 17
      = 3 * bits - (2 * bits - 57) / 17 )
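
A small sketch (mine, not from the thread) of the capped computation in C;
the function name and the assert are just for illustration:

  #include <assert.h>

  /* Upper bound on iterations, written so that no intermediate exceeds
     a few times `bits', keeping 32-bit arithmetic safe.  With truncating
     division this is at most one larger than (49 * bits + 57) / 17,
     which is harmless for an upper bound. */
  static unsigned long
  iteration_bound (unsigned long bits)
  {
    assert (bits >= 29);        /* keeps 2 * bits - 57 non-negative */
    return 3 * bits - (2 * bits - 57) / 17;
  }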

  About a factor 3 slower than plain mpz_gcd on my machine. It will be
  interesting to see if slowdown for modular inversion is about the same.

Cool!

I think your measurements are a bit optimistic, though.  My measurements
from slightly modified timing code suggest 4.5x slowdown, and more for
really small operands.

This will still completely outclass the current sec code.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


GMP testing

2022-06-19 Thread Torbjörn Granlund
The automated GMP testing has had a momentary hiatus due to a disruptive
thunderstorm, further extended by a buggy BIOS image.

Furthermore, we decided to make testing sparser in order to save
electricity.  Now, each configuration is tested within 10 days, up from
7 days.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-06-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Extract least significant 96 bits of each number.

Is that 3 32-bit limbs or 1.5 64-bit limbs?

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-06-03 Thread Torbjörn Granlund
Any algorithm with these properties would be a huge improvement compared
to what we have today:

0. No side-channel leakage of anything but bit counts of input operands.
   (I suppose there are usages where the mod argument is not sensitive,
   such as for the elliptic curve usages).  This is already true for the
   present, slow code.

1. Computation of a 2 x 2 matrix in O(1) time.  Not necessarily fast.

2. Guaranteed reduction of intermediate operand bit count by a full limb
   each.

3. If the reduction in (2) happens to be greater than one limb, the
   subsequent iteration of (1) must still work.  I.e., it must allow the
   bits it depends on to be zero and should not need to locate, say, the
   least significant 1 bit of the full operands.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: mul_fft, cleaning up some details of the code

2022-03-06 Thread Torbjörn Granlund
Marco Bodrato  writes:

  This is not really needed to solve a bug.
  The comment before the mpn_mul_fft_decompose function says "We must
  have nl <= 2*K*l", this means that we should never have ((dif = nl -
  Kl) > Kl), and the code in that branch should never be used.

I noticed that at some point too, and have had a patch lying around for
quite a while:

*** /tmp/extdiff.kfdtyubq/gmp.1cbf890d4bff/mpn/generic/mul_fft.c	Thu Mar  3 21:50:34 2022
--- /home/tege/prec/gmp/mpn/generic/mul_fft.c	Tue Apr  6 10:57:52 2021
***
*** 59,62 
--- 59,65 
  */
  
+ #include 
+ 
+ 
  #ifdef TRACE
  #undef TRACE
***
*** 267,270 
--- 270,277 
  	  cc = 0;
  	}
+ #if 0
+   /* FIXME: Remove rd assignments above. */
+   rd = ~r[m];
+ #endif
  
/* now complement {r, m}, subtract cc from r[0], subtract rd from r[m] */
***
*** 682,726 
TMP_MARK;
  
if (nl > Kl) /* normalize {n, nl} mod 2^(Kl*GMP_NUMB_BITS)+1 */
  {
mp_size_t dif = nl - Kl;
!   mp_limb_signed_t cy;
  
tmp = TMP_BALLOC_LIMBS(Kl + 1);
  
!   if (dif > Kl)
! 	{
! 	  int subp = 0;
! 
! 	  cy = mpn_sub_n (tmp, n, n + Kl, Kl);
! 	  n += 2 * Kl;
! 	  dif -= Kl;
! 
! 	  /* now dif > 0 */
! 	  while (dif > Kl)
! 	{
! 	  if (subp)
! 		cy += mpn_sub_n (tmp, tmp, n, Kl);
! 	  else
! 		cy -= mpn_add_n (tmp, tmp, n, Kl);
! 	  subp ^= 1;
! 	  n += Kl;
! 	  dif -= Kl;
! 	}
! 	  /* now dif <= Kl */
! 	  if (subp)
! 	cy += mpn_sub (tmp, tmp, Kl, n, dif);
! 	  else
! 	cy -= mpn_add (tmp, tmp, Kl, n, dif);
! 	  if (cy >= 0)
! 	cy = mpn_add_1 (tmp, tmp, Kl, cy);
! 	  else
! 	cy = mpn_sub_1 (tmp, tmp, Kl, -cy);
! 	}
!   else /* dif <= Kl, i.e. nl <= 2 * Kl */
! 	{
! 	  cy = mpn_sub (tmp, n, Kl, n + Kl, dif);
! 	  cy = mpn_add_1 (tmp, tmp, Kl, cy);
! 	}
tmp[Kl] = cy;
nl = Kl + 1;
--- 689,703 
TMP_MARK;
  
+   ASSERT_ALWAYS (nl <= 2*Kl);
+ 
if (nl > Kl) /* normalize {n, nl} mod 2^(Kl*GMP_NUMB_BITS)+1 */
  {
mp_size_t dif = nl - Kl;
!   mp_limb_t cy;
  
tmp = TMP_BALLOC_LIMBS(Kl + 1);
  
!   cy = mpn_sub (tmp, n, Kl, n + Kl, dif);
!   cy = mpn_add_1 (tmp, tmp, Kl, cy);
tmp[Kl] = cy;
nl = Kl + 1;

(Not sure about the separate FIXME change therein; it looks strange to
me now.)

Feel free to clean this up.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: binvert_limb speedup on 64 bit machines with UHWtype

2022-02-27 Thread Torbjörn Granlund
John Gatrell  writes:

  I think you missed why the 0x7F is unnecessary. If you start with 8 bits
  and divide by 2 then the top bit must become zero. gcc does this itself and
  suppresses the 0x7F. So this idea will help other compilers start with
  8-bits to achieve the same. The same trick can be used in hand-crafted
  assembler implementations.

You misread the code; n is a full limb.  Masking by 0x7F is absolutely
not unnecessary!

Replacing the bitwise logical AND masking with a cast to "unsigned
char" is not portable.  (But swapping the division and the mask
operation, adjusting the mask accordingly, is semantically equivalent to
the original code.  As I said in my previous reply, this might help some
compiler.)
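
To make the comparison concrete (my own illustration; the function is
made up, and `n' stands for the full limb discussed above):

  static void
  masking_forms (unsigned long n)
  {
    unsigned i1 = (n / 2) & 0x7F;         /* original: divide, then mask   */
    unsigned i2 = (n & 0xFF) / 2;         /* equivalent: mask, then divide */
    unsigned i3 = (unsigned char) n / 2;  /* equals the others only when
                                             char is 8 bits; not portable  */
    (void) i1; (void) i2; (void) i3;
  }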

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: binvert_limb speedup on 64 bit machines with UHWtype

2022-02-27 Thread Torbjörn Granlund
John Gatrell  writes:

  I noticed that replacing '(n/2)&0x7F' with '(unsigned char)n/2', may give a
  hint to assembler implementers that the 7F mask is unnecessary.
  For your consideration

It is necessary to portably extract the least significant bits.

Perhaps one could write it (n & 0xff)/2 and get better code from some
compilers, as there are special instructions for "& 0xff" which then
would be obvious to compilers.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: New mulmod_bknp1

2022-02-20 Thread Torbjörn Granlund
Very nice speedups there!

I am too busy to examine the code to see what you've done.  Perhaps you
could outline the algorithms here?

Is n = 3^t-k now slower than n' = 3^t for small k (with k mod 3 != 0)?
Then we could zero-pad such operands...


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: mpq_mul_ui

2022-01-24 Thread Torbjörn Granlund
Marc Glisse  writes:

  What would you think of adding mpq_mul_ui, mpq_div_ui, mpq_ui_div, and
  also the _z versions?

That would make sense to me.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


[ADMIN] GMP mailing list

2022-01-15 Thread Torbjörn Granlund
The GMP mailing lists have been down for several months, but I am now
working on resuming their operation.

If you get this message, chances are that things now work again.

The background is that a security update to the (FreeBSD) virtual server
broke the mailing list software (mailman).  I usually manually update
applications from FreeBSD's ports source tree, but that did not work
with the obsolete FreeBSD version which the server ran.  Therefore, I
installed a precompiled mailman update, but that did not work with the
MTA software (postfix).  A mess, and I did not have time to try to fix
it until now.

The fix was a complete reinstall of the server.

-- 
Torbjörn


Re: Suggested tune/tuneup.c patch

2021-10-31 Thread Torbjörn Granlund
Marco Bodrato  writes:

  Is this still active?
  I can access https://gmplib.org/devel/thres/2021-10-30/ , but I'd like
  to check how FAC_ODD_THRESHOLD evolves after my last commit to
  fac_ui... and I find an empty page if I look at
  https://gmplib.org/devel/thres/2021-10-30/FAC_ODD_THRESHOLD

Still there, but for some reason nginx had stopped serving gz pages, and
there were no non-gzip pages on the server.  I made a quick workaround.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-08 Thread Torbjörn Granlund
Torbjörn Granlund  writes:

  zen1    2      2         2           =       all equal (saturated mul)

Typo.  The last number should be 4, not 2.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-08 Thread Torbjörn Granlund
I created an "ununrolled" version, and a 4x unrolled version.  I then
compared these with some other variants.  Here are the results:

         mul_1  addmul_1  addaddmul   result  best variant
zen1     2      2         2           =       all equal (saturated mul)
zen2     1.7    2.1       2.8         +       4x unrolled
zen3     1.5    1.5       3.9         -       1x/2x/4x unrolled
bwl      1.5    1.7       3.1         +       1x
skl      1.5    1.7       3.1         +       1x
rkl      1.1    1.6       2.7         =       1x

The only CPU which sees great improvements is zen2.  Both bwl and skl see
very minor improvement.  I don't see your 20% figure for bwl.

The results on zen3 are really poor.  I believe this CPU has quite some cost
for indexed memrefs.  I think that's also true for bwl and perhaps skl, even
if the new code runs OK there.

We might want to produce variants which use plainer memory ops, code which
updates baseregs with lea.  That will require unrolling to at least 4x.
(The present addmul_1 code which is used for bwl, skl, rkl, and zen3 is 8x
unrolled without indexed memrefs.  It was loopmixed for bwl, but runs
surprisingly well elsewhere.)

We really should move the present code to the k8 subdir.  It is a major
slowdown everywhere I have tested except k8 and k10.  (I have not tested bd1
through bd4, but they have k8 in their asm paths (which might be good or
bad).)

If we want to pursue this more, this is what I would suggest:

1. Move the present code to x86_64/k8

2. Create a non-indexed variant, perhaps 4x unrolled.  I suspect this might
   give some real speedups for bwl, skl, perhaps rkl, and likely zen3.
 
3. If 2 is successful, commit the code to the bwl subdir, create a relevant grabber for
   the appropriate zen CPUs.  If 2 is not successful, commit "ununrolled"
   code for bwl subdir, make zen1/zen2 use it (but make sure zen3 does not
   use that variant!).

4. Consider loopmixing.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-08 Thread Torbjörn Granlund
Your version is faster than my versions (where I tested them).

I made some minor changes to your code.

1. Got rid of c1 by moving two adox earlier.  That also made for a speedup.

2. Simplified the feed-in code by jumping into the loop for the odd n case.

3. Use rbx for the bp variable as rbp is not a great base register (yes x86
   coding is absurd).

4. Use some 32-bit operations for code size.  (More could be done along
   those lines, i.e. use 8-bit test $1,R8(n), add n instead of $0 for
   the final carry add.)

Note that the loop now contains two identical copies of the same code
block.  One might unroll more or less with quite limited effort.  :-)



addaddmul_1msb0-mulx-bynisse.asm
Description: Binary data

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund

  And will cause an interesting failure if one can ever afford enough RAM
  to use an input size larger than 2^63 limbs ;-)

Nobody in his right mind will ever need more than 2 EiB of memory.  :-)

  Attaching a version that actually passes some tests (I should commit the
  unit tests, but not today). The loop is only 2-way, and there are three
  spare registers:

  L(top):
        mov     (ap, n, 8), %rdx
        mulx    u0, l0, hi
        mov     (bp, n, 8), %rdx
        mulx    v0, l1, c1
        adox    c0, l0
        adox    hi, c1
        adc     l0, l1
        mov     l1, (rp, n, 8)

        inc     n
        mov     (ap, n, 8), %rdx
        mulx    u0, l0, hi
        mov     (bp, n, 8), %rdx
        mulx    v0, l1, c0
        adox    c1, l0
        adox    hi, c0
        adc     l0, l1
        mov     l1, (rp, n, 8)
        inc     n

        jnz     L(top)

Neat!

  I've eliminated the zero adds by adding together the high half of the
  products earlier (thanks to the "msb0" condition, there's no carry out
  from the second adox in the pairs). I think the critical recurrency
  involves the alternating carry limbs c0, c1, and the OF flag which is
  live only between the adjacent adox instructions.

  The 3 cycles/limb means 6 cycles for the two-way loop of 19
  instructions, or 3.1 instructions per cycle. To reach the ideal of 2
  cycles/limb (critical path limit, one iteration of this 2-way loop in 4
  cycles), one would need to issue almost 5 instructions per cycle, and
  that's not very realistic, right?

4 non-regmov insns/cycle is the limit.  Regmoves are usually free.

  A few instructions can be shaved off by going to k-way for larger k, and
  moving the invariant u0/v0 to %rdx, as you suggested. Is that the way to
  go?

Perhaps.

  This structure doesn't work as is for addsubmul_1msb0, one would need to
  either organize it differently, or negate on the fly (but adding 4 not
  instructions into the loop sounds rather expensive, if speed is
  already limited by the number of instruction).

For many recent CPUs, submul_1 is slow.  So the competition is not that
fierce.  :-)

  Ah, and one final question: Where should mulx-code go? I tried putting
  this in x86_64/mulx/adx/, but it wasn't picked up automagically by
  configure, I had to link it in manually.
   
It is messy, and I don't think it could be any other way.

You could place it in the x86_64/coreibwl directory.  Then all later
Intel CPUs will find it.  You then need to put an include_mpn file for
bd4 and later.

Two versions which I cobbled together attached.  The first works for all
sizes, the 2nd only for even sizes.  The first one has false
dependencies, the second one fixes that.



addaddmul_1msb0-mulx-nisse.asm
Description: Binary data


addaddmul_1msb0-mulx-nisse-nofdep.asm
Description: Binary data

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
Torbjörn Granlund  writes:

  Problem: the adc will write a useless value to the O flag.  That is then
  read by the first adox, yielding incorrect results.  Clearing O without
  creating any (too bad false) dependencies could perhaps be done with an
  additional dummy adox zero, zero.

On 2nd thought, the bookkeeping inc will clear the O flag and leave the C
flag alone.  So, except for a typo in addressing, your code looks
correct.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  L(top):
        mov     (ap, n, 8), %rdx
        mulx    %r8, alo, hi
        adox    ahi, alo
        mov     hi, ahi         C 2-way unroll.
        adox    zero, ahi       C Clears O

        mov     (bp, n), %rdx
        mulx    %r9, blo, hi
        adox    bhi, blo
        mov     hi, bhi
        adox    zero, bhi       C clears O

        adc     blo, alo        C Or sbb, for addsubmul_1msb0
        mov     alo, (rp, n, 8)
        inc     n
        jnz     top

Problem: the adc will write a useless value to the O flag.  That is then
read by the first adox, yielding incorrect results.  Clearing O without
creating any (too bad false) dependencies could perhaps be done with an
additional dummy adox zero, zero.

Another remedy would be to use adcx instead of adc, but then we cannot
easily make a addsubmul_1msb0 variant.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Here's a sketch of a loop, that should work for both addaddmul_1msb0 and
  addsubmul_1msb0:

  L(top):
        mov     (ap, n, 8), %rdx
        mulx    %r8, alo, hi
        adox    ahi, alo
        mov     hi, ahi         C 2-way unroll.
        adox    zero, ahi       C Clears O

        mov     (bp, n), %rdx
        mulx    %r9, blo, hi
        adox    bhi, blo
        mov     hi, bhi
        adox    zero, bhi       C clears O

        adc     blo, alo        C Or sbb, for addsubmul_1msb0
        mov     alo, (rp, n, 8)
        inc     n
        jnz     top

  L(done):
        adc     bhi, ahi        C No carry out, thanks to msb0
        mov     ahi, %rax       C Return value

Neat!

Some unrolling would save several instructions:

Put r8 (aka u0) in rdx over k ways of unrolling.  Supply ap[...] limbs
to mulx directly.  Then accumulate with an adox chain, reducing the adox
zero need.  Same for r9/v0.

  (BTW, do I get operand order right for mulx? I'm confused by the docs
  that use the generally different intel conventions).

Your use looks right.

  Now, question is if it can beat mul_1 + addmul_1. I don't know.

It surely has potential.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Gave it a run on my closest x86_64 (intel broadwell, no mulx), and
  numbers for mpn_addaddmul_1msb0 are not impressive. Also, it appears
  mpn_addmul_2 is significantly slower than two addmul_1.

I believe addmul_2 is inhibited for that CPU.  It might still appear in
the compiled library, though.  :-(

  79   #1.5617   1.8006   4.3277   4.6949
  86   #1.5702   1.7883   4.3290   4.7031
  94   #1.5441   1.7743   4.3321   4.7018

  So there's definitely some room for improvement.

The odd instruction order of the present loop suggests it was optimised
for K8.  In fact, it runs almost optimally there.

(32 loop instructions, the 6 muls need a double slot, so 38.  3-way
issue, 6 way unrolled gives (32+6)/3/6 = 2.111...  Very close to the
stated 2.167.)

Beating mul_1 + addmul_1 elsewhere without loopmixing will probably be
hard.  We should probably move the present code into the k8 subdir.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Risc V greatly underperforms

2021-10-06 Thread Torbjörn Granlund
Hans Petter Selasky  writes:

  Then you get a penalty. But the penalty might not be so big assuming
  random input. Adding one to a number is pretty cheap and you only need 
  to continue traversing the data words making up the number when the
  increment overflows. Which in turn gets you a variable number of
  iterations.

Not good, side-channel leakage.

  How microcode works and what instruction sequences are optimal for a
  bignum adder, I will not go into. My point is just that x86
  instructions are parsed before they are executed. Almost like a VM.

Ahum.  But Risc V instructions are not "parsed", you say?

  I would guess that if RISC-V executed "N" instructions at a time on
  the same logical core w/o using microcode, the performance would be 
  comparable to x86. Then it would be up to the compiler to layout the
  instructions correctly and not the microcode.

You guess wrong.

Most instructions today have a latency of a single cycle, be it Risc V,
some x86 core, or Arm.

Arm has the most powerful instruction set.  But x86 also has powerful
instructions, albeit very messy from both a programmer's perspective and
from the hardware's perspective.

Now you claim that something magic (parsing, microcode) slows things
down on x86.  Somehow, a single-cycle instruction on x86 is really
magically slower than a single-cycle instruction on Risc V.

You're dead wrong.

X86 will use many fewer instructions than Risc V for any task, since X86
has many more instructions and many instructions are also more powerful.
Typically, instructions run in a single cycle and do not involve
"microcode".

Risc V will never compete with Arm or x86 for integer scientific tasks
(including cryptography).  It won't even come close.  It would need to
run at clock speeds several times higher than the competition to come
close.

(Modern CPUs are complex, and surely many instructions are not executed
as simply as a plain add.  Some instructions are internally split, e.g.,
"add mem,reg" might be split into a load and a register-based add.  But
the opposite is also true: some instruction pairs are fused so that
later stages see them as a single instruction.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-06 Thread Torbjörn Granlund
I haven't followed this discussion very closely, and did not see if you
have considered the following.

OK, so the code is 3-way unrolled.  That's always a bit inconvenient
and tends to cause some code bloat.

I am pretty sure we have that in at least some other place, but still
do all the work in one loop, switching into appropriate places from the
feed-in code.

It is not expensive to compute something like

  ((3^(-1) mod 2^32) * n mod 2^32) / 2^30

in the feed-in code.  (3^(-1) mod 2^32) = 0xaaaaaaab, so we can do the
above with two instructions (imul and shr).  The latency of imul+shr is
<= 4 on modern architectures.

Since addaddmul_1msb0 is strictly internal, and since it presumably is
used for very limited values of n, I assume 32-bit arithmetic on n is
suffficient.

(Note that the tricky mod computation above "maps" the remainder 1 to 2
and the remainder 2 to 1.)
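
A small sketch (my own, not from the message) of that feed-in computation
in C; the function name is made up and the assert reflects the "limited
values of n" assumption above:

  #include <stdint.h>
  #include <assert.h>

  /* Returns n mod 3, except that remainders 1 and 2 come out swapped
     (1 -> 2, 2 -> 1), exactly as noted above.  0xaaaaaaab is
     3^(-1) mod 2^32; the top two bits of the 32-bit product give the
     classification.  Intended for small n (safe for n < 2^30). */
  static unsigned
  mod3_swapped (uint32_t n)
  {
    assert (n < ((uint32_t) 1 << 30));
    return (uint32_t) (n * UINT32_C (0xaaaaaaab)) >> 30;
  }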

Other ideas:
* Use xor r,r instead of mov $0,r (considering that xor messes with the
  carry bit).

* Use one more register for accumulation, with 4x unrolling.  That would
  save the 0xaaab magic mul.

* Provide variant with mulx.

* Accumulate differently, say 4 consecutive limbs at a time, with carry
  being alive.  That will require more registers for sure.  By using
  adcx and adox, one may accumulate to the same registers in two chains
  semi-simultaneously.

* Use rbx instead of r12 to save a byte or two.

I suspect the present code is far from optimal on modern x86 CPUs which
can sustain 1 64x64->128 multiply per cycle.  I feel confident that we
could reach close to 1 c/l.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Risc V greatly underperforms

2021-10-06 Thread Torbjörn Granlund
Hans Petter Selasky  writes:

  If the GMP could utilitize multiple cores when doing bignum
  multiplication and addition, I think the picture would look different.

  For example for addition, you could split the number in two parts, and
  then speculate if there is an addition for the higher part or not.

And if the guess is wrong, then what?

It is well known, in a model which ignores caches and memory bandwidth,
that one can get 2n/k + log(k) word-operation steps for n-word addition
on k execution agents.  Agent k computes the sum of block k both with
carry-in = 1 and with carry-in = 0 and saves both results.  The log(k)
term is for serially choosing the proper block depending on whether
carry-in happened to specific blocks.

On a cached system, I would expect this algorithm to just slow things
down.
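
A compact sketch (mine, not from the message) of the k = 2 case of that
scheme in plain C; names and the fixed block bound are just for
illustration:

  #include <stdint.h>
  #include <stddef.h>
  #include <assert.h>

  typedef uint64_t word;

  /* {rp,n} = {ap,n} + {bp,n} + cy, returning the carry-out. */
  static word
  add_block (word *rp, const word *ap, const word *bp, size_t n, word cy)
  {
    for (size_t i = 0; i < n; i++)
      {
        word s = ap[i] + cy;
        word c = s < cy;            /* carry from adding cy          */
        rp[i] = s + bp[i];
        cy = c + (rp[i] < s);       /* plus carry from adding bp[i]  */
      }
    return cy;
  }

  /* {rp,2h} = {ap,2h} + {bp,2h}.  The high block is summed with both
     possible carry-ins; the right copy is selected once the low block's
     carry-out is known.  The two halves could run in parallel. */
  static word
  add_carry_select (word *rp, const word *ap, const word *bp, size_t h)
  {
    word hi0[64], hi1[64];          /* fixed bound, sketch only */
    assert (h <= 64);
    word clo = add_block (rp, ap, bp, h, 0);
    word c0 = add_block (hi0, ap + h, bp + h, h, 0);
    word c1 = add_block (hi1, ap + h, bp + h, h, 1);
    const word *sel = clo ? hi1 : hi0;
    for (size_t i = 0; i < h; i++)
      rp[h + i] = sel[i];
    return clo ? c1 : c0;
  }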

  I thought that RISC-V would produce cheaper and more cores, and that
  single core performance was not that critical.

Slow cores are useful in some applications, sure.

  Talking about x86, don't forget that there is microcode below each
  instruction.

This is a false sattement.  Even it it were true, how is that relevant
for this discusson?  The relevant instructions run in one cycle.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Risc V greatly underperforms

2021-09-21 Thread Torbjörn Granlund
A carry bit helps for some codes, GMP being a prime example.

Keeping carry/borrow conditions in plain registers can be made to work
well too.  But then you need good ways of computing carry/borrow, and
good ways of inputting the carry/borrow result to dependent add/subtract
instructions.

Risc V has OK ways of computing borrow but not carry.  Risc V lacks good
ways of inputting carry/borrow to dependent add/subtract instructions.

For the subtraction c = a - b we could compare a and b independent of
the subtraction using sltu.  The sltu instruction is of course a
subtraction which puts the borrow-out in its result register.

But for the addition c = a + b we don't have anything which computes the
carry-out; a "compute carry-out from add" would be needed.  We now need
to first perform the add, then use sltu on the result, thus creating a
very unfortunate dependency.  Instruction dependencies are a major
performance killer for super-scalar processors.

We also don't have any way of efficiently consuming a computed
carry/borrow result.  3-input add/subtract would have solved that
(together with 3-input sltu and a 3-input "compute carry from add").

This all means that on Risc V, multi-word subtraction could be made to run
at 2 cycles/word while multi-word addition is limited to 3 cycles/word,
in both cases assuming a very wide super-scalar core.  Remember that
other concurrent CPUs do these in 1+epsilon cycles/word, and that
without needing to do wide super-scalar dispatch.


I use multi-word add/subtract here as an example of the inefficiencies
of Risc V.  But the weak instruction set of Risc V shows in any
integer-heavy application, as others have pointed out before me.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Risc V greatly underperforms

2021-09-20 Thread Torbjörn Granlund
It seems safe to assume that most people on this list have heard of
Risc V by now, the license-free instruction set.

I trust that much fewer have looked at the technical details.  I have,
though, as we implement critical inner loops for GMP in assembly.

My conclusion is that Risc V is a terrible architecture.  It has a
uniquely weak instruction set.  Any task will require more Risc V
instructions than any contemporary instruction set.  Sure, it is
"clean", but there was no reason to be naive just to make it clean.

I believe that an average computer science student could come up with
a better instruction set than Risc V in a single term project.  It is,
more-or-less a watered down version of the 30 year old Alpha ISA after
all.  (Alpha made sense at its time, with the transistor budget
available at the time.)

Let's look at some examples of how Risc V underperforms.  First,
addition of a double-word integer with carry-out:

add     t0, a4, a6  // add low words
sltu    t6, t0, a4  // compute carry-out from low add
add     t1, a5, a7  // add hi words
sltu    t2, t1, a5  // compute carry-out from high add
add     t4, t1, t6  // add low-word carry to high result
sltu    t3, t4, t1  // compute carry-out from the carry add
add     t6, t2, t3  // combine carries

Same for 64-bit arm:

adds    x12, x6, x10
adcs    x13, x7, x11

Same for 64-bit x86:

add %r8, %rax
adc %r9, %rdx
(Some additional move insn might be needed for x86 due to the
2-operand nature of this arch.)

If we generalise this to GMP's arbitrarily wide addition, we will end
up with 2 to 3 times more instructions, and go from just over 1 cycle
per 64-bit result word to 3 cycles per word.  The 3 cycles will happen
even for a wide implementation which could execute a large number of
instructions in parallel.  The critical path will be add->sltu->add
which are three dependent instructions i.e. 3 cycles.
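
Spelled out, a sketch of one word per iteration of such a wide addition
(register assignment and loop termination invented for illustration)
might look like this; the loop-carried chain through the carry register
t6 is exactly the add->sltu->add path:

1:	ld	a4, 0(a1)	# load source words
	ld	a6, 0(a2)
	add	t0, a4, a6	# sum of source words
	sltu	t1, t0, a4	# carry-out of that add
	add	t0, t0, t6	# add incoming carry      (critical path)
	sltu	t2, t0, t6	# carry-out of carry add  (critical path)
	add	t6, t1, t2	# carry for the next word (critical path)
	sd	t0, 0(a0)
	addi	a0, a0, 8
	addi	a1, a1, 8
	addi	a2, a2, 8
	bne	a1, a3, 1b	# a3 = end of first source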

I have heard that Risc V proponents say that these problems are known
and could be fixed by having the hardware fuse dependent instructions.
Perhaps that could lessen the instruction set shortcomings, but will
it fix the 3x worse performance for cases like the one outlined here?
Why not provide a decent instruction set instead?

I don't think designing a decent ISA is terribly difficult.  Designing
a great ISA is.  Designing something like Risc V is trivial.


Full disclosure: I have no financial or other interest in any computer
architecture mentioned here or not mentioned here.  I really like the
idea of a license-free ISA.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-07-09 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Same as in the current (from 2013) version. Delaying the write is a bit
  tricky, since we already use all registers. But it would be better to
  update the quotient limbs in memory only in the unlikely
  carry-propagation case. I figure adc to memory is no worse than explicit
  load, adc, store (or adc from memory, store)?

Which is worse depends on CPU and magic.

I did not realise that the register pressure was so bad.  Perhaps trying
to decrease that would be most helpful.  Sometimes, when values tend to
naturally migrate, some unrolling can reduce register pressure.

When I struggle with register pressure, I usually annotate the code with
what regs are live and where each dies.  That then can steer pressure
reducing transformations.

If requiring mulx helps, I would for now forget about mul.  All relevant
CPUs have mulx.

  I could try moving the dec. I often try to insert independent
  instructions between depending ones, but perhaps that's bad in this case
  (and generally not very helpful on processors with powerful out-of-order
  capabilities).

Insn fusion happens only with branches (and perhaps cmov) and iirc only
if the insns are adjacent.

  Does that mean that 1.5 (2100 / 1400) is the right factor? Then it's
  more like 11.0 c/l vs 11.5.

You could explore this by testing some plain loop, e.g.

.text
.globl main
main:   mov $ASSUMED_FREQUENCY, %rax
1:  dec %rdx
dec %rdx
dec %rdx
dec %rdx
dec %rdx
dec %rdx
dec %rdx
dec %rdx
dec %rdx
dec %rdx
dec %rax
jnz 1b
ret

If ASSUMED_FREQUENCY is right, it should take 10 seconds.  Else, I leave
it as an exercise to compute the actual frequency.


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-07-08 Thread Torbjörn Granlund
I think you should delay writing through QP to avoid adc to a memory
place, and have just one plain write through QP per iteration.

The dec UN and the branch might run faster if put adjacent to each
other, as many CPUs fuse these into a single instruction.

Your cycle numbers should probably be multiplied by a factor

  ("turbo" frequency) / (nominal frequency)

as 7.x c/l seems faster than we ever measured.
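
(For example, with a 1400 MHz nominal TSC clock and a 2100 MHz actual
"turbo" clock, a measured 7.5 c/l would really be 7.5 * 2100/1400 =
11.25 c/l; the frequencies here are just for illustration.)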

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-07-04 Thread Torbjörn Granlund
Perhaps we should write a little low-level, m4-based asm compiler?

We could define ad-hoc primitives, like some equivalent to the inline
asm umul_ppmm, add/subtract with carry, loads, stores, branches, etc.
The set of primitives should be separated into must-define and optional.
Writing a definition file for a CPU is not much work.

If we stay within the must-define primitives, source code would be
legible and simple to write.  As soon as we use ifdefs on optional
primitives things could become hairy of course.

A neat trick would be to have this structure,

include(`mdep.m4')
include(`default.m4')

where default.m4 could define all hitherto undefined optional primitives
to generate an error.  That would allow us to write function variants
using optional instructions without messing with ifdef.  Instead,
compilation would fail for machine where mdep.m4 has not defined all
used primitives.  That is an OK failure mode.

Then we could define things such as your division methods using these
macros, and at least use it as a quick way of generating asm code, or
perhaps even use it as real GMP sources as an intermediate between C and
asm.
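
As a sketch of what I mean (all file and macro names invented, and the
expansions are mere placeholders):

dnl  mdep.m4 for some CPU: the must-define primitives.
define(`ADD_CO',  `adds	$1, $2, $3')	dnl add, produce carry
define(`ADC_CIO', `adcs	$1, $2, $3')	dnl add with carry in and out

dnl  default.m4: any optional primitive left undefined becomes an error.
ifdef(`UMUL_HILO', `',
`define(`UMUL_HILO',
`errprint(`optional primitive UMUL_HILO not provided by mdep.m4
')m4exit(1)')')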

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-07-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Looks better, but few tuneup results yet. Method 3 wins on a
  few machines (https://gmplib.org/devel/thres/DIV_QR_1N_PI1_METHOD), with
  all of method 1, 2 and 4 appearing as runner up on some machine.

The presentation in those "/thres/" pages is a bit strange when it is
not operand size cutoff values but, like here, methods.  I.e., measured
method 3 vs configured method is marked as a major difference.  And the
average is 0.975...

I suppose it would be somewhat interesting to know how close the
measured best and measured 2nd best methods are, or how much better the
measured best method is than the configured method.

And, when there is asm, the scary performance ratio between C and asm
which would suggest that C-to-C comparisons of tight GMP loops might not
be terribly relevant.  :-)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-07-02 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Danger of "easy" last minute changes... Fix pushed now.

Thanks, let's see that things clean up.  (The autobuild system is a bit daft
and still does not run anything as a result of a repo change.  It is
calendar triggered but can also be run manually.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-06-30 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  I'm tempted to commit this code. I.e., new variants (not enabled) +
  tuneup changes. To see which variants are favorites on the various test
  machines. Should give some guidance as to what's most promising for
  assembly implementation.

  What do you think?

Please do!

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mul_fft

2021-06-30 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  I think add_n_sub_n was originally motivated by improved locality (could
  apply at different levels of memory hierarcy).

  But maybe we could get close to twice the speed using newer instructions
  with multiple carry flags (I guess that's what
  powerpc64/mode64/p9/add_n_sub_n.asm is doing)? We could probably do
  something similar on x86_64 with adox and adcx (if corresponding
  subtract instructions are missing, on-the-fly negation should be fairly
  efficient).

Yes, p9/add_n_sub_n.asm takes advantage of new addex instructions which
use alternative carry bits.

The POWER processors have since P4 a unique pipeline where all
instructions have 2t cycles latency for some integer t.  Plain integer
ops have a latency of 2. (It is really as if the CPUs run at half the
impressive advertised clock frequency with twice the assumed dispatch
width.)

Because of this, add_n and sub_n cannot be made to run at less than 2
c/l.  And add_n_sub_n is close at 2.25 c/l on P9.

Unfortunately, x86 is not as easy.  The subtract-with-borrow
instructions clobber all flags, making the new adcx/adox pretty
pointless.  And no, there are no non-clobbering negation (or one's
complement) instructions.  One needs to save and restore carry.  There
will be many register-to-register copy instructions, because of x86's
habit of overwriting source operands.  (Such copies are essentially free
on AMD processors, as they only manipulate the register rename map.)

Another example where there is just one carry bit is arm64.  For Apple
M1, as you can see at https://gmplib.org/devel/asm, we have an
add_n_sub_n which runs at just 1.3 c/l.  (Almost all GMP loops,
including add_n and sub_n run at exactly 1 c/l there, amazingly.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-06-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  I've tried it out. Works nicely, but no speedup on my machine. I'm
  attaching another patch. There are then 4 methods:

  method 1: Old loop around udiv_qrnnd_preinv.

  method 2: The clever code from 10 years ago, with the microoptimization
I commited the other day.

  method 3: More or less the same as I posted a few days ago.

  method 4: Postpones the update u1 -= u2 d, off the critical recurrency
chain. Instead, conditionally adds in the constant B2 (B - d) to the
lower u limbs.

Let me throw some additional complication into this project:

The fastest way to do a division by a single limb will involve more
precomputation, assuming we have good throughput of the multiply
instruction.

Even in some scenario where one has a 1-limb inverse precomputed, it
will be faster to first do a Newton round to get a 2-limb inverse, and
then run a loop which develops two quotient limbs at a time.  I expect
the dividend limb count cutoff point to be small, surely less than 10.

With full precomputation, the dividend limb count cutoff point would be
2, I presume.

The function div_qr_1n_pi1 is useful mainly for CPUs without a pipelined
multiply instruction.  And of course also useful until we have great
div_qr_1n_piN for N >= 2.

Improvements to div_qr_1n_piN are of course theoretically interesting
too.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-06-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  And I don't quite trust these cycle numbers, they should probably be
  twice as large, on the order of 10 cycles/limb for all variants. Less
  than 5 cycles is too good to be true, right?

Yes.

"Turbo" messes things up.  The TSC cycle counterstays it base clock.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-06-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Maybe we should have some macrology for that? Or do all relevant
  processors and compilers support efficient cmov these days? I'm sticking
  to masking expressions for now.

Let's not trust results from compiler generated code for these things.
The mixture of inline asm and plain code is hard for compilers to deal
with.  Very subtle things can make a huge cycle count difference.

For conclusive results, asm is needed, unfortunately.

(That's not always the case; Marco and I have played with AVX3/AVX512
lately with both asm and C using intrinsics.  C behaved well there, but
that was for non-arithmetic loops.)

So what about cmov's performance?  Intel fixed its latency for their
high-end cores with broadwell, which is about 6 years ago.  Their
low-power cores still have 2 cycles though.  AMD's cores always have had
1 cycle latency and good throughput.

  Worries about side-channel leakage of cmov isn't so relevant for these
  particular functions, since the use of MPN_INCR_U is a data dependent
  loop anyway.

Granted.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-06-03 Thread Torbjörn Granlund
Don't forget about the adcx and adox instructions.  They might come in
handy here.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1n_pi1

2021-06-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  The critical path, via the u1 variable, is 

umul_ppmm (p1, p0, u1, B2);
add_mss (cy, u1, u0, u0, up[j], p1, p0);
u1 -= cy & d;

Nice!

The (cy & d) term is multiplied in the next iteration by B2, i.e., we
have either the invariant d * B2 or 0 as the contribution to the p1,p0
product.  If we can somehow add that to u0,up[j] during the umul_ppmm
then we could save several more cycles.

Your snippet above should translate to the following x86 instructions
(assuming we unroll by 2x to avoid the read-after-write of u0 in the
add_mss macro; u0 will have to alternate between two registers):

    INSN  CYCLE
    mul   0,8
    add   3
    adc   4
    sbb   5
    and   6
    sub   7

(A mov or two might be missing; these are free with modern x86 CPUs as
they only modify the rename map.)
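
As a rough sketch with an invented register assignment (u1 in %rbp, u0 in
%r8, B2 in %r13, d in %r14, up[j] addressing simplified; a real 2x-unrolled
loop would alternate u0 between two registers):

	mov	%rbp, %rax
	mul	%r13			C cycle 0: {%rdx,%rax} = u1 * B2
	add	8(%rsi), %rax		C cycle 3: new u0 = p0 + up[j]
	adc	%r8, %rdx		C cycle 4: new u1 = p1 + u0 + carry
	sbb	%rcx, %rcx		C cycle 5: %rcx = -cy
	and	%r14, %rcx		C cycle 6: cy & d
	sub	%rcx, %rdx		C cycle 7: u1 -= cy & d
	mov	%rax, %r8		C essentially free renames
	mov	%rdx, %rbp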

Judging from https://gmplib.org/devel/asm, this should give about 20%
boosts for current AMD and Intel CPUs.

If we dare use cmov (and its presumed side-channel leakage) we could
probably shorten the critical path by a cycle.  The "sbb" and "and"
would go away.

The Arm code (both v7 and v8) should get really neat using their
conditional execution.  Again, that might be a side-channel leakage
hazard.

(I am a bit fixated with side-channel leakage; our present
implementations of these particular functions are not side-channel
silent.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: 2-adic Svoboda

2021-05-03 Thread Torbjörn Granlund
Paul Zimmermann  writes:

  I tried to implement Montgomery-Svoboda at the C level, but did not manage
  to beat the mpn_redc_x routines. I'm very interested to see your results!

Without invariance from e.g modexp, I don't believe one can beat
sbpi1_bdiv_r (or the older redc_1).  Newer implementation of
sbpi1_bdiv_r make use of the observation that the next quotient can be
computed early, actually almost a whole innerloop invocation early,
which makes its cost very low.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: 2-adic Svoboda

2021-05-02 Thread Torbjörn Granlund
Paul Zimmermann  writes:

  yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it
  "Montgomery-Svoboda". The quotient selection becomes trivial, which means
  one can reduce the latency between two mpn_addmul_1 calls.

It really becomes a mul_basecase except that the first round there is
done with mul_1.  If we replace that mul_1 by addmul_1, and
then called the resulting function like this,

mul_basecase'(rp, up, un, rp, rn-un)

then we will get the desired remainder at mul_basecase speed.

If we do this, I think it might make sense to have two entry points in
mul_basecase as the bulk of the functions (which is sometimes very
large) could be shared.

(I completely separate improvement would be to chain multiplication and
squaring with Montgomery reduction.)

  If you reduce k limbs at a time, by precomputing m = D^(-1) mod \beta^k,
  then you can use mpn_addmul_k at each step. But to reduce the last k limbs,
  you need to revert to classical (Montgomery) division, since mN has n+k limbs.

The importance of addmul_k for k > 1 has decreased in the last few
years.  But sure, using a "larger" inverse is an option when
e.g. addmul_2 exists.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


2-adic Svoboda

2021-05-02 Thread Torbjörn Granlund
IIRC, Svoboda's division trick for N/D is to find a small multiplier m
such that, for the division mN/(mD) we have mD = 10...XX... with the
high limb of mD being 1000...0.

This idea works also for 2-adic division.  Find m = D^(-1) mod \beta
where \beta is the limb base.  Then do mN/(mD) or mN mod (mD) with
2-adic norm.  Now mD will end in 0...0001, and the quotient limbs will
be the low limbs of the intermediate remainder.

That should mean that we could use mul_basecase directly for
sbpi1_bdiv_r (or its sibling redc_1).

Surely this is not a new observation.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add Zhaoxin x86 processor support

2021-04-08 Thread Torbjörn Granlund
  I try to use paths for zen and k8 compiler parameter. This runs great in
  test. I now realize that this is a wrong setup. I'm going to modify it with
  the strategy you recommend.

I have attached a script which might be useful for choosing the best
existing asm file for the new CPUs.

Please read the comments in the script before considering to use it.  It
has defaults which work for the automated GMP build-and-test facility,
i.e., its paths and the convention that the configure && make output is
put in the file make.log.

I think you might find the script to be useful and that it is not hard
to adapt and use.

The type of output to be expected can be seen at the end of any "tuneup"
results file from GMP's autobuild system.  See e.g., this one:
https://gmplib.org/devel/tm/gmp/tuneup/success/verm.gmplib.org-stat:64.txt




compare-asm-variants.sh
Description: Bourne shell script


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add Zhaoxin x86 processor support

2021-04-07 Thread Torbjörn Granlund
Let me make this clear: The changes look basically sound.  My comments
concentrated on things that looked odd to me.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add Zhaoxin x86 processor support

2021-04-07 Thread Torbjörn Granlund
Some comments about the patches:

(1) Why do you set up paths for zen (as a fallback)?

Doing that seems wrong unless all these 3 CPUs support every zen
instruction.  Do they?

Also, passing k8 to the compiler and choosing zen asm code makes very
little sense to me.  If zen makes sense for asm code, why does it not
make sense for compiler-generated code?

A more conservative strategy is to pick file-by-file  from the desired
x86_64 subdir.  E.g., if mpn_mul_basecase from x86_64/foo runs great for
the kx6000 CPU, do

MULFUNC_PROLOGUE(mpn_mul_basecase)
include_mpn(`x86_64/foo/mul_basecase.asm')

in x86_64/kx6000/mul_basecase.asm.

(At some point, providing specific asm code for these CPUs might be
desirable, of course.)


(2) Testing specific stepping.

You check for stepping 0xE specifically.  I assume the "nano" CPU has
lower stepping values than 0xE, and Zhaoxin variants currently have
exactly 0xE.  It then feels more future-proof to check >= 0xE, as that
would allow Zhaoxin to do stepping iterations without GMP silently
falling back to "nano" for such steppings.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13

2021-03-10 Thread Torbjörn Granlund
Marius Hillenbrand  writes:

  z14 introduced "alignment hints" for vector loads, where 8-byte aligned
  reads have more bandwidth (e.g., "vl %v,,3" # 3 for 8-byte
  alignment, 4 for 16-byte alignment). vlerg does not take these hints.
  Empirically, I observe a slight advantage for vlerg nonetheless (~5%).

How does z13 interpret these hints?  Ignore them, I hope!

  My 8x unrolled addmul_1 with extensive pipelining gets to within ~20% of
  mlgr throughput. Though, my current implementation only applies for >=
  18 limbs (8 lead-in, 8x-unrolled, 2 limbs wrap-up) -- not very useful
  for addmul_1, besides making the case for going for addmul_2.

~20% is very good, indeed.

I see that the price is deep static instruction scheduling.  That's
sometimes necessary, but it often adds significant O(1) overhead.

If we're really crazy, we could make what I sometimes refer to as
overlapped software pipelining.  With tha, I mean that the outer loop of
e.g. mul_basecase combines the inner loop's wind-down code for outer
loop iteration j with the inner loop's feed-in code of iteration j+1.

But code complexity will probably be lower with addmul_2 or some such,
as we now have an inner loop with more inherent parallelism.

  I'm looking at the addmul_2/3/4 variants, exploring parameters.

  Given that mpn_mul requires s1n >= s2n, mul_basecase will always call
  any variant of addmul_k with n > k (if I read the code correctly). Is
  that an assumption that addmul_k can make in general?

Yes, it is.  Or more correctly n >= k.

Note a trickiness of mul_k which has bitten me before: It is allowed to
have the same source and destination, i.e., mul_2 (ap, ap, n, bp).  My
mul_2 code therefore preloads from ap in a manner which my addmul_2 does
not.  Of course, any slight software pipelining tends to take care of
this problem automatically.

And tests/devel/try knows how to trip code which gets this wrong.

I would suggest that you concentrate on an addmul_2 to see how close you
can get to mlgs's throughput without making its software pipeline overly
deep.  Going to addmul_k for larger k tends to diminish the returns, and
furthermore requires mul_1, mul_2, mul_(k-1) in order to create the
*_basecase functions.

I made my addmul_2 actually work for all limb counts.  (I think it even
unnecessarily handles n = 1.)  Code attached.




z14-addmul_2.asm
Description: Binary data

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13

2021-03-09 Thread Torbjörn Granlund
Marius Hillenbrand  writes:

  One minor proposal (patch to follow): Some versions of GCC only accept
  the -march=arch variant of the most recent CPU level they support
  (e.g., GCC-9 accepts -march=arch13 but not -march=z15; 13 as in the 13th
  edition of the Principles of Operations that describe the ISA
  extensions; both parameters are equivalent in later versions of GCC).

Ah, OK.  Will apply!

  > The main change I made to your suggested change is that I added z14 and
  > z15 to the recognised cpu types.  I also made z13 a fallback for z14 (if
  > the latter is not understood by tools), and analogously made z14 and z13
  > fallbacks for z15.

  That absolutely makes sense. When I wrote my patches initially, it was
  not yet clear that it is worthwhile to differentiate.

It is not clear, but it does not hurt to make config.guess be accurate,
and then treat the CPUs the same way.  In my experience, people can get
confiused when GMP claims they have CPU foo-k when they actually have
foo-(k+1).

I tried vlerg on the system here, and it works fine.  Very little timing
difference though, but then again I didn't try very hard.

I am not aware of any timing differences between z13, z14, z15 for the
L1 cache-hit cases.  Are there any?  And the only GMP-relevant ISA
difference of which I am aware is the presence of vlerg in z15.


How's it going with the various addmul_k variants?  My completely
non-scheduled addmul_2 seems to run 37% slower than the mlgr throughput.
That's not bad.  Some fiddling around with the schedule got me to just
25% slower.  That was with 2x unrolling.  I haven't tried anything
sophisticated.

How far is your best addmul_1 from mlgr's throughput?

I believe it to be possible to get pretty close to mlgr's throughput, if
not by any other means by going to addmul_k for k > 2.  I think 8-way
addmul_1 makes little sense, but I think 2-way or 4-way addmul_2, or
2-way addmul_3 or 2-way addmul_4 does make sense if they run close to
mlgr's throughput.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13

2021-03-09 Thread Torbjörn Granlund
I applied patch 1/4 and 2/4 with modifications.
Please take a look at the repo code when you have the time.

The main change I made to your suggested change is that I added z14 and
z15 to the recognised cpu types.  I also made z13 a fallback for z14 (if
the latter is not understood by tools), and analogously made z14 and z13
fallbacks for z15.

I don't know how relevant the 31-bit ABI is today.  I do not plan to
write any new asm for that, and so did not add special paths for newer
CPUs for that ABI.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-06 Thread Torbjörn Granlund
Marius Hillenbrand  writes:

  z13: introduced the vector extensions
PoP: Vector Facility for z/Architecture
Linux: vx / HWCAP_S390_VX

  z14:
Vector-Enhancements Facility 1
vxe / HWCAP_S390_VXE

  z15:
Vector-Enhancements Facility 2 (adds VLERG and VSTERG, among others)
vxe2 / HWCAP_S390_VXRS_EXT2

You forgot to mention z16?  :-)

IIRC, your config* patches added z13 by grepping /proc/cpuinfo.  It did
not add z14 or z15.  Any particular reason for that?  I think it makes
sense to have accurate recognition code in GMP, even if we then treat
the CPUs exactly the same.

BTW, it turns out that the system I am playing with is a z15.

Here is /proc/cpuinfo (cut after one CPU):

vendor_id   : IBM/S390
# processors: 2
bogomips per cpu: 3241.00
max thread id   : 0
features: esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt sie
facilities  : 0 1 2 3 4 6 7 8 9 10 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 61 73 74 75 76 77 80 81 82 128 129 130 131 133 134 135 146 147 148 150 151 152 155 156 168
cache0  : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1  : level=1 type=Instruction scope=Private size=128K line_size=256 associativity=8
cache2  : level=2 type=Data scope=Private size=4096K line_size=256 associativity=8
cache3  : level=2 type=Instruction scope=Private size=4096K line_size=256 associativity=8
cache4  : level=3 type=Unified scope=Shared size=262144K line_size=256 associativity=32
cache5  : level=4 type=Unified scope=Shared size=983040K line_size=256 associativity=60
processor 0: version = FF,  identification = 13,  machine = 8561
processor 1: version = FF,  identification = 23,  machine = 8561

cpu number  : 0
physical id : 0
core id : 0
book id : 0
drawer id   : 0
dedicated   : 0
address : 0
siblings: 1
cpu cores   : 1
version : FF
identification  : 13
machine : 8561
cpu MHz dynamic : 5200
cpu MHz static  : 5200

I measure the actual clock to 3800 MHz, not 5200.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-05 Thread Torbjörn Granlund
  From 4x unrolling to 8x unrolling with pipelining we gain ~40%.

Wow, that's much more than I would have expected (or actually seen at my
end).

Which brings up another question: Which zarch pipelines does it make
sense to optimise for?  I thought I had z196 access, but that is
probably not true as I discovered that it supports the vector insns.  It
must be a z13 or later, I think.

I don't mind adding code for older CPUs at all.  I want to avoid slowing
down GMP for any CPU, old or new, so adding code which affects a CPU
should not be done without benchmarking on that CPU.

However, on that machine, a 4x variant of my code runs addmul_1 at about
3.4 cycles/limb and an 8x variant runs at about 3.3 cycles/limb.  That
difference is far smaller than what you see, either because we are on
different CPUs or because your 8x variant is far better than mine.

I tested a mlgr-only loop without any dependencies.  It indicates an
mlgr throughput of 0.5 insn/cycle.  Thus, we cannot hope for anymul_1 to
run at less than 2 cycles/limb.

If your cycles/limb numbers are not better than mine, there is still
considerable room for improvements, 3.3 is still far from 2.0!

I haven't thought carefully about it, but an addmul_2 with vector
accumulation might help.  An addmul_2 loop use half the number of loads
and stores, and use fewer vpdi, compared to addmul_1.  And perhaps one
could accumulate cleverly to avoid some of the carries.  Or not.

By the way, have you tried using vlerg/vsterg to avoid vpdi?

(I am having problems understanding which instructions are implemented
in a particular CPU.  The "Principles of Operations" manual seems to skip
that aspect.  I tried vlerg but it was not even recognised by the
assembler.)

  > Be careful about the lead-in times too.  With deep unrolling, one needs
  > a table indexed by n % W, where W is the unrolling arity.

  Since I found code complexity (and debugging effort, to be honest) raise
  sharply when supporting flexible lead-ins, I am currently wrapping up a
  hybrid variant for addmul_1 and mul_1: use the much simpler 2x unrolled
  variant from my first patch when N is less than the pipelined lead-in +
  one iteration + lead-out; when N is above the threshold, use that loop
  for the n%W and then use the pipelined variant. As a plus, we need fewer
  registers below the threshold. That variant achieves ~2x speedup for N <
  18 and gets to within 20% of peak multiplication throughput above that.

Interesting!

A mistake can be made here, and that is optimising addmul_k for too
large limb counts.  We've made that mistake a lot with GMP in the past.

So why is that a mistake?  Because the use of divide-and-conquer
algorithms make large trip counts exceedingly rare.

I tested on the z13(?) and found that Karatsuba's algorithm is used
already from 8 limbs for multiplication and 11 limbs for squaring.
That's with the 4x vector accumulation addmul_1.

I expect mul_basecase/sqr_basecase would increase these numbers a bit,
as they cut O(n) of the execution time.

I would still be surprised if 8x addmul_1 unrolling help in practice,
but I would love to be proven wrong.

For completeness of my reasoning, there is one rare case where addmul
trip counts will be large: a very unbalanced multiplication where one
operand is >> than the other.

To make mul_basecase as low-overhead as possible, one needs to do the un
mod k (k being the unrolling factor) before the k outer loops.  Yes,
outer loops, plural.  The lead-in of the k inner loops then can be
condition-less straight-line code.  Yes, major code expansion, in
particular for large k, but mul_basecase and sqr_basecase are super
important.

The structure of sqr_basecase will become different, as the inner loop
trip counts will decrease by one in the outer loop.  Still, one should
not do un mod k except in the function header code.  Instead, one needs to
branch into k different places inside the one outer loop.  Tricky, yes.
Neat, not at all.

I am all for fat support for zarch.


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-02 Thread Torbjörn Granlund
Marius Hillenbrand  writes:

  Most notably, I changed the order so that the mlgr's are next to each
  other. The reason is that decode and dispatch happens in two "groups" of
  up to three instructions each, with each group going into one of the two
  "issue sides" of the core (both are symmetric and have  the same set of
  issue ports since z13). For some instructions, grouping is restricted --
  that includes mlgr, which will be alone in a group. Thus, placing two
  mlgr's next to each other will ensure that they will spread across both
  issue sides and exploit both multiply units.

This sort of restrictions are hard to deal with.

What happens around branches with these grouping restrictions?  Could an
incoming branch (such as the loop branch) have one group from before the
branch and one after the branch, thus invalidating the assumption of
adjacent mlgr's going to different groups?

I've seen cases (with other pipelines) where a loop can take completely
different times depending on some parity-like issue condition upon loop
entry.  It never recovers in the "bad" case.

  In my experiments, incrementing and comparing a single "idx" turned out
  beneficial over incrementing the pointers and decrementing n separately.

Doesn't brctg with its awareness of the induction variable help branch
prediction in such a way that not only is the branch back accurately
predicted, but also the final fall-through?

OK, using brctg and whether to use idx is perhaps orthogonal.

  Similarly, using 128-bit adds in vector registers performs better than
  alcgr + algr. One factor is that alcgr must be alone in a dispatch
  group, same as mlgr. Given the number of alcgrs we would need, the
  128-bit adds wins. For comparison, vacq and vacccq also have a grouping
  limitation -- only two of them can be in a group. However, that means we
  can fit a 128-bit add with carry in and out in one dispatch group,
  instead of just a 64-bit add.

I wrote a 4x unrolled addmul_1 a while back, timing it on a z196 (yes,
old system, but that's the hardware to which I have convenient access).
It is 60% faster than the existing code; it takes 5 cycles/limb whereas
the old code takes 8 cycles/limb.  The code is attached.

(I hope more recent machines get much better cycles/limb numbers.  Many
machines (x86, POWER, Apple M1) today are approaching 1 cycle/limb.)

  To improve performance notably (~40% over my initial patch on z15), my
  currently best-performing implementation maintains the same instruction
  sequence (mlgr + vacq for two limbs at a time) as our previous attempts,
  yet unrolls for 8 limbs per iteration with software pipelining of the
  different stages (load, multiply, add, and so on).

  Unrolling even more did not improve performance.

Did you get rid of the lgr of the carry limb?  That should not be too
hard.  The code attached does that.

What is the performance improvement for going from 4x to 8x unrolling?

Be careful about the lead-in times too.  With deep unrolling, one needs
a table indexed by n % W, where W is the unrolling arity.

I split up my 4-way code into two similar loop blocks.  That makes entry
into the loop middle possible.  For 8x, using such an approach would
avoid huge feed-in code.  (Code attached.)

  While this variant helped a lot in debugging and tweaking parameters and
  schedule, it is hackish and brittle (e.g., the empty asm("")s help
  define instruction scheduling, yet GCC may change how it handles them
  over time). Further, I suspect there may be performance gains left in
  hand-tweaking the assembly code.

I agree that we should use asm to avoid the performance brittleness of C
code.

  So, for integrating this implementation into GMP, I propose adding both
  the resulting assembly variant and that C code for reference or future
  improvements.

  What do you think?

We might include the C code as a comment in the asm.

Two attachments, 4-way code with possible mid-loop-entry, and a 4-way
addmul_1 using just plain registers.



z14-addmul_1-ur4b.asm
Description: Binary data


s390-addmul_1-ur4.asm
Description: Binary data


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-01 Thread Torbjörn Granlund
Torbjörn Granlund  writes:

  I played a bit with an addmul_1 of my own, with some ideas from your
  code.  I don't plan to do more work on this.  Does this perform well on
  hardware?

I now realise that the instruction sequence of my example is essentially
the same as in your code, except with more unrolling.  That's a bit
surprising, as I wrote it by understanding the instruction set, starting
at the vac* insn which your code mase me look at.  I did actually not
look at your code while writing my code.


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-01 Thread Torbjörn Granlund
I played a bit with an addmul_1 of my own, with some ideas from your
code.  I don't plan to do more work on this.  Does this perform well on
hardware?

Note that it only works for n = 0 (mod 4).




z14-addmul_1-ur.asm
Description: Binary data


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-01 Thread Torbjörn Granlund
My analysis and reports in this thread had several problems.  For
example, I had made shared-lib builds of some qemu images used for "user
mode" emulation; that does not work unless the host dynlibs are made
available in the guest file system.

Trying again: your submul_1 works fine on all tested versions of qemu,
which are 4.1.x through 5.2.0.  (But I did not test every version in
that range.)

The problems I saw with qemu and submul_1 were actually with the
existing s390 submul_1 which has been part of GMP for several years.
All tested qemu versions exhibit that same problem.  I have access to a
z196 system, and there I cannot provoke any error.

So, the old submul_1 code seems to work on actual hardware.  The code
looks good to me.  There is a bug in qemu.

I only tested "user mode" qemu for this experiment.  Sometimes full
system emulation runs instructions correctly when "user mode" emulation
does not.  Sometimes vice versa.

(I've encountered lots of bugs of this kind in qemu over the years.  I
have tried to help the qemu project as I have expertise in the areas of
CPUs, emulation and arithmetic, but I was discouraged enough by their
culture to no longer even try to take the time needed to report my
findings.)

I have dealt with qemu's bugginess by keeping many qemu versions
installed.  I then try to locate the most recent which works for running
GMP in either full system emulation or on "user mode" emulation.

  In the meantime, I've been working on a software pipelined variant of
  addmul_1, which improves significantly over my previous patch. That is C
  with inline assembly, which helps somewhat with the increased
  complexity. It needs more tuning and stress testing, though.

Great!

Usually, multiplication insn throughput is the limiting factor for
addmul_1 and friends.  Therefore, understanding its throughput is a
great place to start.  Once that is understood, one knows what to aim
for.

One usually can get quite close to the multiplication insn throughput in
addmul_1 (or in some cases addmul_2, or addmul_k for some small k > 2).
But usually, and in particular if that throughput is great, the end
performance will be up to 50% worse.

I only know of one CPU where addmul_1 runs at exactly the multiplication
insn throughput: Apple M1.

I looked briefly at your code after David sent a link to the s390 ISA
manual.  If I understand it correctly, you rely on 128 bit addition.  I
have no idea of the throughput or latency of those 128-bit instructions,
but if those numbers are good, I agree that they might be very useful.

An alternative is to stick to plain 64-bit (non-vector) instructions for
unrolled addmul_1.  We do that for many CPUs already.  One will need to
run through partial products twice, for s390 using alcg(r).  The most
significant 64-bit partial product of an unroll group can work as a
carry trap.

If we pull out all the stops, perhaps addmul_4 or something like that,
combined with the vector instructions could yield the best performance.
In the end, balancing complexity and performance (and effort!) will
decide.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-26 Thread Torbjörn Granlund
Torbjörn Granlund  writes:

  Ehum.  That failure was apparently due to a qemu bug.  Sorry.

I tested qemu 4.2.0 and qemu 5.2.0.  The latter does not work at all,
not even for "ls" or "cat".  The former runs submul_1 for a while, but
ultimately reports an error.  The error is not reproducible.

This is on a server system with ECC.  It is either some problems in your
submul_1, e.g. incorrect handling of callee-saves registers, or an
uninitialised register or flag, or a bug in qemu.

I cannot provoke any similar problems with your addmul_1.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-26 Thread Torbjörn Granlund
Torbjörn Granlund  writes:

  Please make sure to test GMP contributions properly.  And being honest
  about the tests performed is critically important.

Ehum.  That failure was apparently due to a qemu bug.  Sorry.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-26 Thread Torbjörn Granlund
Marius Hillenbrand  writes:

  Done, completed successfully for both mpn_addmul_1 and mpn_submul1.

Are you sure you did that?

I tried your submul_1 under qemu (version 4.1.1).  it does NOT pass any
tests with tests/devel/try.

Please make sure to test GMP contributions properly.  And being honest
about the tests performed is critically important.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-20 Thread Torbjörn Granlund
Marius Hillenbrand  writes:

  Done, completed successfully for both mpn_addmul_1 and mpn_submul1.

Good!

  I measure ~40% reduction in cycles per limb on z14 and ~60% reduction on
  z15, for both addmul_1 and submul_1 (smaller gains for small n <10 but
  no regressions), yet ...

Great speedup!

Smaller gains for small loop trip counts are expected.

  > 2. Is the measured performance close to what you would hope for, given
  > some hard pipeline limits?  If not, would you be willing to try to
  > improve things?

  ... I assumed that I hit max multiplication throughput, yet I just
  learned that there is potential to increase that further. I'll accept
  that challenge and keep working on the patch :)

Good!

  I went through the process before and already have copyright assignments
  in place for multiple GNU projects, including GMP, so we should be good
  from that side.

Ah, that will simplify things.

Your addmul_1 and submul_1 will be great contributions.  Please do a
mul_1 too, as that should be very straightforward by removing some insn
from addmul_1.

If you really want to make these CPUs perform near their maximum
capability, consider writing mul_basecase, sqr_basecase, and perhaps
mpn_sbpi1_bdiv_r too.

The last one might not be too clear, but it is really very similar to
mul_basecase, only that one innerloop invariant multiplier comes from a
Hensel limb quotient.

A fast sqr_basecase is tricky, and GMP uses a plethora of algorithms for
it.  The most recent saves a lot of cycles and looks like below in C.  In
asm, one would inline addmul_1, of course.  And for both mul_basecase
and in particular sqr_basecase, one should implement straight line code
for small operands (this also tends to simplify the main code as it
doesn't need to consider small operands).  For sqr_basecase, one would
exit the outer loop long before un becomes zero, and fall into straight
line code.

void
mpn_sqr_basecase (mp_ptr rp, mp_srcptr up, mp_size_t un)
{
  mp_limb_t u0;
  mp_limb_t cin;

  u0 = up[0];
  umul_ppmm (cin, rp[0], u0, u0);
  ++rp;

  if (--un) {
u0 = u0 << 1;
up += 1;

rp[un] = mpn_mul_1c (rp, up, un, u0, cin);

// error: up[0]:63 x up[(n-1)...1], analogous after each iteration below
do {
  u0 = up[0];
  mp_limb_t ci = -(up[-1] >> (GMP_NUMB_BITS-1)) & u0; // correction term
  mp_limb_t x0 = rp[1] + ci;
  mp_limb_t c0 = x0 < ci;
  mp_limb_t hi, lo;

  umul_ppmm (hi, lo, u0, u0);
  mp_limb_t x1 = x0 + lo;
  mp_limb_t c1 = x1 < lo;
  cin = hi + c0 + c1;
      ASSERT (cin >= hi);
  rp[1] = x1;
  rp += 2;

  if (--un == 0) break;
  u0 = (up[-1] >> (GMP_NUMB_BITS-1)) + (u0 << 1);
  up += 1;

  rp[un] = mpn_addmul_1c (rp, up, un, u0, cin);
} while (1);
  }

  rp[0] = cin;
}


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-17 Thread Torbjörn Granlund
Thanks for contributing to GMP!

Marius Hillenbrand  writes:

  These patches add IBM z13 as a new s390_64 CPU level to mpn and add optimized
  versions of addmul_1 and submul_1 that exploit the SIMD extensions that were
  introduced with the IBM z13 generation. Both implementations share the same
  structure and use 128-bit add/subtract ops in vector registers with
  carry/borrow bits in registers.

  Tested with the regression test suite and stressed with tests/devel/anymul_1.c

Please use tests/devel/try too.  It checks for access outside of defined
operands, which is a common bug in GMP asm.

  I got started with addmul_1 since it felt challenging and instructive enough,
  yet will look into the other ops, as well.

We are careful about not adding asm code which does not give real,
adequate speedup.  It has happened in the past that seemingly useful new
instructions do not easily lead to improved performance.

A 2nd goal is to try to approach the theoretical maximum performance,
e.g., to saturate the multiply unit or some other unavoidable part of a
CPU pipeline.

Therefore, I have two questions:

1. What is the measured speed difference of the existing code and the
new code?

2. Is the measured performance close to what you would hope for, given
some hard pipeline limits?  If not, would you be willing to try to
improve things?

As a rough measure, if the code is within 20% of theoretical maximum for
the target CPU pipeline, we're happy.  If not, more unrolling, better
scheduling, a different instruction choice, might be tried.  Code
complexity is also an issue, for sure.  But addmul_1 is extremely
important for GMP's performance (in particular in the absence of special
mul_basecase and sqr_basecase) so complexity there is particularly
tolerated.


Your contribution is significant enough to need copyright paperwork from
you and IBM.  IBM is FOSS-friendly, so I don't expect any problems, but
it might take a while.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: t-constants FAILs with GMP 6.2.1 on aarch64

2020-12-28 Thread Torbjörn Granlund
[Sorry for the slow reply to your report!  I have been quite busy.]

I think I understand the problem, but how to fix it is not obvious to
me.

  And GMP_ASM_RODATA likely arrives at the .text result since CFLAGS
  includes -flto.  Is there any reason to use CFLAGS in those tests?

It is common that CFLAGS affect the ABI and as a result the way things
are defined in asm.

  The other tests seem to just switch on target systems or use
  TRY_ASSEMBLE - that looks less fragile to me.

TRY_ASSEMBLE might not work too well with C snippets which we need to
use in GMP_ASM_RODATA.  :-)

  Defaulting to .text on aarch64 when using GOT relocations in asm
  is probably asking for trouble (see my case), I think the test
  should fail when running into

else
  echo "Couldn't find label: 
  ^${tmp_gsym_prefix}foo$gmp_cv_asm_label_suffix" >_FD_CC

  rather than falling back to .text.  Alternatively all targets that
  use explicit references to [RO]DATA should have a safer default.

For some platforms, the correct segment for RODATA is .text.

  What's the collection of RODATA results "supported"?  As said a
  switch like for GMP_ASM_GLOBL looks less error-prone here.

Perhaps, but hardwiring things is against the spirit of autoconf and
creates a maintenance burden.

The syntax for rodata varies between arm64 platforms.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: 答复: [PATCH] Add Zhaoxin x86 processor support

2020-12-18 Thread Torbjörn Granlund
DylanFan-oc  writes:

  We are willing to sign the papers.

Great!

  However, we would like to know what we need to do and where to
  download the FSF paperwork.

I will provide paperwork for you.

Please bear with me for a couple of weeks; we have holidays coming up
over here, and my day work needs my attention.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH] Add Zhaoxin x86 processor support

2020-12-17 Thread Torbjörn Granlund
Thanks for the GMP contribution!

This change is significant enough that we will require FSF paperwork
from you and your employer.  The paperwork gives the GNU project the
legal right to distribute your code.

Are you and your employer willing to sign such paperwork?


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: gcd_11 without abs

2020-11-21 Thread Torbjörn Granlund
Niels Möller  writes:

  And then there will be some conditional operations somewhere. For
  shortest path, it's best if the code can be arranged to do those
  operations in parallel with count trailing zeros.

  Current ARM code (I'm looking at the v6t2 version) does that, with two
  conditional instructions (rsbcc, movcs) independent of the ctz. But then
  count trailing zeros is done with two depending instructions, rbit, clz,
  so the critical path is then 4 instructions. Unclear to me why it runs
  faster than 4 cycles on some processors, maybe the rbit+clz combination
  is executed in a single step on some of the execution units?

The numbers are cycles per bit eliminated; they are not per-iteration cycle counts.

I never bothered to try to measure cycle counts.  One would need to
measure with specially selected operands with known behaviour for the
algorothm, but that would be fragile.  I mean, slight algorithm
variations could reduce slightly less or slightly more bits per
iteration.

  Current x86_64 code (I'm looking at the core2 version) also looks pretty
  tight, also with all conditional moves independent of the count trailing
  zeros instruction. Dependency graph is more complex, but as I read it it
  depth would be 4, with each of these groups of instructions executed in
  parallel in one cycle:

  sub     u0, %rdx          C v - u

  bsf     %rdx, %rcx
  mov     u0, %rax
  sub     v0, u0            C u - v   <

  jnz     L(top)
  L(top): cmovc   %rdx, u0  C u = |u - v|
          cmovc   %rax, v0  C v = min(u,v)

  shr     R8(%rcx), u0
  L(odd): mov     v0, %rdx

  I wonder if it's possible to get down to 3 cycles. Scheduling the second
  sub instruction in parallell with the first would be a start.

It might be.  Good luck!  :-)

The last few generations of x86 processors have free mov insns.
Therefore, one can mimic 3-operand operations with a mov; op pair.  That
might help.

  Back to the no-abs version. That seems to require cmovs on the critical
  path *after* the shift, and that would tend to make the critical path
  longer.

The current code got a lot of attention a year ago.  It will not be easy
to improve it.

One thing which I never really explored is to evaluate at least a-b and
a+b and choose the one which is divisible by 4.  If that costs just one
extra critical insn, it should be faster.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: gcd_11 without abs

2020-11-21 Thread Torbjörn Granlund
Niels Möller  writes:

  One could get rid of the absolute value by letting one of the working
  values be negative. Something like this, in pseudo code

   b = - b

   while ( (sum = a + b) != 0)
     {
       if (sum > 0) a = sum;
       else b = sum;
     }

[snip]

  That's one instruction less than the current code, but looks quite
  clumsy. It's annoying that bsf clobbers the carry flag. Since we can't
  use carry from the first addition anyway, we can use lea for that and
  save a mov. But then we still need a mov + add to get the carry later
  on; here negation works against us, since that makes the compare
  instruction useless. 

The way to improve the current code on almost any processor is by
shortening the critical dependency path.  reducing the insn count helps
very few CPUs these days.

Do you think you can shorten the dependency path?

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Revert from backup of main GMP repo

2020-11-21 Thread Torbjörn Granlund
Seth Troisi  writes:

  This dropped the mpz_prevprime commits (the final commit previously had a
  hash of 970b7221873f)
  When fires are out I'd appreciate it if they could be committed again.

It will be re-committed.  I hope Marco will take care of that.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Revert from backup of main GMP repo

2020-11-18 Thread Torbjörn Granlund
I've reverted the main gmp repository /var/hg/gmp from backup in order
to resolve broad breakage.  The public mirror will be auto-synched soon.

I don't think a repo which has been synchronised with /var/hg/gmp since
2020-11-15 will work properly.

For those of you that have a checkout locally which you have synched
with the main repository since or during 2020-11-15, you might also want
to restore from /home/.zfs/snapshot/2020-11-15.

Please re-commit tested changes to the restored /var/hg/gmp which got
reverted as a result of my action.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: State of PRNG code in GMP

2020-06-09 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  It's a constraint on what the algorithm internal struct can look like,
  e.g., it can't have internal pointers (but it could have offsets). So
  not necessarily a show-stopper, but we should be aware when designing
  the interfaces.

The latest code added a copy function pointer and removed the datasize
field.  We should probably also have a cleanup function pointer.

  I do see some use cases for copying, to be able to save generator state
  and restart in the same state later. It may even be useful to be able to
  save generator state to disk, if you want to write periodic
  "checkpoints" for an algorithm that needs to run for a long time.

Oh, so serialising seed + buffered data would be useful.  Another
algorithm specific pointer.  :-)

(We could allow that part to be defined only for certain algorithms.  We
could, e.g., define the interface as int gmp_prng_getstate(buf, const
gmp_prng_t*) and document that the serialising worked iff it returns
non-zero.)
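
A rough sketch of how those pieces might hang together -- names invented
for illustration, not an agreed interface:

#include <stddef.h>
#include <gmp.h>   /* for mp_limb_t */

struct gmp_prng_appdata {
  void *data;        /* algorithm specific constants, buffers, state */
};

struct gmp_prng_funcs {
  void (*seed) (struct gmp_prng_appdata *, const mp_limb_t *, size_t);
  void (*run) (mp_limb_t *, size_t, struct gmp_prng_appdata *);
  void (*copy) (struct gmp_prng_appdata *, const struct gmp_prng_appdata *);
  void (*cleanup) (struct gmp_prng_appdata *);
  /* May be NULL; returns non-zero iff serialising into buf succeeded.  */
  int (*getstate) (unsigned char *buf, size_t bufsize,
                   const struct gmp_prng_appdata *);
};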

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: State of PRNG code in GMP

2020-06-09 Thread Torbjörn Granlund
Marco Bodrato  writes:

  Mersenne Twister only uses mpz for initialization. Moreover there is a
  little "bug" in the initialization procedure, so that the sequence can
  be the same even if the seed is different (in the range where it is
  supposed to generate different sequences).

Oops.

  That's why some years ago we started rewriting that init function.
  Of course this will yield different sequences too.

  https://gmplib.org/list-archives/gmp-bugs/2017-March/004106.html

  Here is the almost mpz-free init function using the xxtea scrambler.

Cool!  Its mpz use should indeed be fairly straightforward to eliminate.

  > Here is some code I have tinkered with.

  > typedef enum {
  >   GMP_PRNG_ALG_LC,  GMP_PRNG_ALG_MT,  GMP_PRNG_ALG_AES,
  >   GMP_PRNG_ALG_LIM,  GMP_PRNG_ALG_DEFAULT = GMP_PRNG_ALG_AES
  > } gmp_prng_alg;

  What's LIM?

LIMIT.  Intended for a future-safe loop through the algorithms.
Probably useless.  :-)

It turned out to be  a bit tricky to get an efficient AES based PRNG
which also has some desired properties:

1. Well-documented effect of seeding (so users can check we do what we
promise)

2. Also, we want n bits of random data to be the exact same n bits of data
on any machine, given the same state.  (This follows from (1) I suppose,
but we might generate different well-documented sequences on 32-bit and
64-bit machines, and on little- and big-endian machines, etc.)

3. Not too much copying and byte swiveling.

I got some code running over the weekend which I think meets all these
criteria.  The C code does not fulfil a fourth criterion, though:

4. Nice and clean.

:-)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: State of PRNG code in GMP

2020-06-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  > /* PRNG algorithm specific data for any constants, buffered random bits, or
  >other state.  The _mp_data field should typically point to an algorithm
  >specific struct.  The _mp_datasize field is used by generic code for
  >de-allocating and copying a gmp_prng_t in an algorithm agnostic manner.  */
  > struct appdata {
  >   void* _mp_data;
  >   size_t _mp_datasize;
  > };

  Does generic code really need to copy the algorithm specific data?

It seems nicer to have generic code do as much as possible.  And it
saves having pointers in the struct for copy and cleanup functions.

Do you see any disadvantage?
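
For illustration, the generic copy could be as simple as the sketch
below (not GMP code; malloc stands in for GMP's allocation hooks, and it
of course only works if the state contains no internal pointers):

#include <stdlib.h>
#include <string.h>

struct appdata {
  void *_mp_data;
  size_t _mp_datasize;
};

/* Duplicate algorithm specific state without knowing its layout.  */
static void
prng_copy_appdata (struct appdata *dst, const struct appdata *src)
{
  dst->_mp_datasize = src->_mp_datasize;
  dst->_mp_data = malloc (src->_mp_datasize);
  memcpy (dst->_mp_data, src->_mp_data, src->_mp_datasize);
}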

  > /* Generic PRNG init function.  Is this really a good idea?  A problem is that
  >this will pull in all PRNG code into a binary which uses just one
  >algorithm.  It even pulls in mpz functions in an mpn-only program if any
  >PRNG uses mpz. */

  I'd prefer to not have it. I think it's rare to initialize with an
  algorithm enum selected at run time.

I agree.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: State of PRNG code in GMP

2020-06-03 Thread Torbjörn Granlund
Bradley Lucier  writes:

  I don't know whether this is of interest to GMP developers.

We probably don't have time for a very large project around random
numbers.

Independent random number generators are already possible today at the
mpz level.  Just initialise any needed number of gmp_randstate_t variables!
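
For example (standard mpz-level calls; the seeds here are arbitrary):

#include <gmp.h>

int
main (void)
{
  gmp_randstate_t rs1, rs2;
  mpz_t x, y;

  gmp_randinit_default (rs1);       /* two independent generators */
  gmp_randinit_default (rs2);
  gmp_randseed_ui (rs1, 12345);
  gmp_randseed_ui (rs2, 67890);

  mpz_inits (x, y, NULL);
  mpz_urandomb (x, rs1, 256);       /* 256 random bits from each stream */
  mpz_urandomb (y, rs2, 256);
  gmp_printf ("%Zx\n%Zx\n", x, y);

  mpz_clears (x, y, NULL);
  gmp_randclear (rs1);
  gmp_randclear (rs2);
  return 0;
}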

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: State of PRNG code in GMP

2020-06-02 Thread Torbjörn Granlund
Thanks Pedro for a quick and thorough answer!  Much appreciated!

Pedro Gimeno  writes:

  > Question 1: Why is _mp_lc wrapped in a union?

  Historical reasons. It was that way when I implemented MT.

I see.

  > Question 2: "_lc" = Linear Congruential? This is supposed to be a
  > generic structure, right?

  Historical reasons again :)

Surely the field _lc could be renamed to something less confusing
without dire compatibility consequences.  :-)

  Yes, LC is Linear Congruential. It was the only available generator
  when I started working on the MT one. The struct had the exact same
  shape before I took it, and I fitted the new generator into that
  struct trying to use the available fields in a backwards compatible
  way. Kevin went quite paranoid about ABI compatibility, to the point
  of not wanting to add another pointer member to the union for fear of
  changing the size of the structure. I think his words were along the
  lines of "I don't know if it would be compatible, it should be, but
  better safe than sorry". So my patch included these changes:

I see.  I don't know if it is a valid compatibility concern; perhaps the
C standard does not pin down that pointers to two different struct types
need to be the same size, but I wouldn't have worried about that.

  The pre-existing random structure was very limited, and building a
  compatible generic generator layer on top of it while keeping
  backwards compatibility needed some compromises, like having to
  allocate a function pointers struct instead of using the random state
  struct itself to store them; abusing the pointer field in the seed's
  mpz_t for the state, and using the macros RNG_FNPTR() and RNG_STATE()
  to access those in a more programmer-friendly way.

I understand.

  The plan was to get rid of the struct issues in the next ABI breakage,
  but I guess no one took note of that with respect to the generators,
  or at least no one did anything about it (we've broken the ABI since
  GMP 4.0.1, right?)

What breakage happened for 4.0.1?

  LC (2exp) is the default and the only choice for gmp_randinit, which
  is considered obsolete. When I implemented the MT patch, there was
  also a disabled and probably non-functional Blum-Blum-Shub generator.

  I'm not too sure why I removed gmp_randinit_lc, but something about it
  being broken rings a bell, so it's possible that Kevin told me to do
  that.

It's strange, but I cannot recall the reasons either.

  >gmp_randinit (my_randstate, GMP_RAND_ALG_DEFAULT)
  >
  > to also choose Mersenne Twister.  But GMP_RAND_ALG_DEFAULT =
  > GMP_RAND_ALG_LC as per the enum definition.

  No, gmp_randinit is declared obsolete. You're supposed to use the
  algorithm-specific gmp_randinit_ functions (e.g. gmp_randinit_mt) for any
  supported algorithm.

  I don't know for sure why gmp_randinit was not adapted to the new
  generators. Probably Kevin told me to do that but I don't remember the
  rationale.

Perhaps it was to avoid pulling more code than necessary into a user
binary?  With a generic init function, there will be a broad link
dependency on all init_[ALGORITHM] functions, which may in turn pull in
the actual generators.

  > Question 3: What algorithm does the call
  >gmp_randinit (my_randstate, GMP_RAND_ALG_DEFAULT)
  > choose?  Is it Mersenne Twister?

  No, it should be LC if it hasn't changed since the time I wrote the patch.

I don't think it makes any sense to have gmp_randinit_default and
gmp_randinit (..., GMP_RAND_ALG_DEFAULT) give different algorithms.

  > Perhaps we should start anew with a new set of random state functions
  > and a new type (where all fields are actually used!).  The name
  > gmp_prngstate_t might be useful.

  That would not be off the mark. Trying to build on the existing
  facilities is what led to the current mess.

I'd like to have a coherent interface which also provides reentrant
random functions to the mpn layer.  And in no case should mpn pull in
mpz.

I'd like to keep Mersenne Twister, and add block cipher based PRNGs (AES
and perhaps Salsa20r12).  We might also keep LC for historical or
compatibility reasons.

Unfortunately, Mersenne Twister uses mpz, and I am not saying that that
was a bad choice when you implemented it.  But in order to provide
reentrant mpn PRNG functions, we either rewrite the relevant parts of MT
to use mpn, or exclude it from a new mpn PRNG interface.

Here is some code I have tinkered with.

/* PRNG algorithm specific data for any constants, buffered random bits, or
   other state.  The _mp_data field should typically point to an algorithm
   specific struct.  The _mp_datasize field is used by generic code for
   de-allocating and copying a gmp_prng_t in an algorithm agnostic manner.  */
struct appdata {
  void* _mp_data;
  size_t _mp_datasize;
};

struct prng {
  void (*_prng_seed_fn) (struct appdata*, const mp_limb_t*, size_t);
  void (*_prng_run_fn) (mp_limb_t*, size_t, struct appdata*);
  struct appdata _mp_app;
};
typedef struct prng 

State of PRNG code in GMP

2020-06-01 Thread Torbjörn Granlund
I am looking into adding AES CTR as a new, fast PRNG in GMP.
Unfortunately, the current code is somewhat confusing.

The main structure for storing random state is the following:

typedef struct
{
  mpz_t _mp_seed; /* _mp_d member points to state of the generator. */
  gmp_randalg_t _mp_alg;  /* Currently unused. */
  union {
void *_mp_lc; /* Pointer to function pointers structure.  */
  } _mp_algdata;
} __gmp_randstate_struct;
typedef __gmp_randstate_struct gmp_randstate_t[1];

Question 1: Why is _mp_lc wrapped in a union?

Question 2: "_lc" = Linear Congruential? This is supposed to be a
generic structure, right?

A few lines before the above declaration, we have this:

/* Available random number generation algorithms.  */
typedef enum
{
  GMP_RAND_ALG_DEFAULT = 0,
  GMP_RAND_ALG_LC = GMP_RAND_ALG_DEFAULT /* Linear congruential.  */
} gmp_randalg_t;

So LC is the default?  And that's the one choice?  The manual says about
Mersenne Twister: "Randomness properties are also very good and this is
the default algorithm used by GMP."

Let's look at what functions we have for initialising a gmp_randstate_t:

void gmp_randinit (gmp_randstate_t, gmp_randalg_t, ...);

void gmp_randinit_default (gmp_randstate_t);

I have understood that the latter, gmp_randinit_default, chooses Mersenne
Twister.  One would expect

  gmp_randinit (my_randstate, GMP_RAND_ALG_DEFAULT)

to also choose Mersenne Twister.  But GMP_RAND_ALG_DEFAULT =
GMP_RAND_ALG_LC as per the enum definition.

Question 3: What algorithm does the call
  gmp_randinit (my_randstate, GMP_RAND_ALG_DEFAULT)
choose?  Is it Mersenne Twister?  Good, I guess as we're then consistent
with what we call the default PRNG algorithm.  Bad, since then the call
  gmp_randinit (my_randstate, GMP_RAND_ALG_LC)
also chooses Mersenne Twister, which is really confusing!

Question 4: How should a user who really wants LC choose that?

This all looks very strange to me.
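
(For comparison, the explicitly named initialisers do let a caller pick
an algorithm directly; a minimal snippet using the documented functions,
with an arbitrary size argument for the LC variant:)

#include <gmp.h>

int
main (void)
{
  gmp_randstate_t mt, lc;

  gmp_randinit_mt (mt);                  /* explicitly Mersenne Twister */
  gmp_randinit_lc_2exp_size (lc, 64);    /* explicitly linear congruential */

  gmp_randclear (mt);
  gmp_randclear (lc);
  return 0;
}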


Perhaps we should start anew with a new set of random state functions
and a new type (where all fields are actually used!).  The name
gmp_prngstate_t might be useful.


A side note about using AES or other block ciphers as PRNGs: While
Mersenne Twister is considered a good PRNG, piggybacking on well-studied
block cipher algorithms seems even better.  Furthermore, AES is
extremely fast; proof-of-concept code I wrote generates about 100
Gbit/s on a current AMD CPU per core (and about 50 Gbit/s on a current
Intel CPU per core).

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


gmplib.org change

2020-05-26 Thread Torbjörn Granlund
We've reconfigured our web server to suppress .html from URLs.  It seems
to work fine, but please let me know if you encounter any problems.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: GMP 6.2.0 doesn't build on C89 compilers - Patch attached

2020-05-07 Thread Torbjörn Granlund
Colin Finck  writes:

  We build GMP as part of a GCC build for a build environment and have
  just upgraded to GMP 6.2.0.

  Unfortunately, this version fails to build using C89 compilers or under
  Linux distributions that don't advertise C99 support. In particular, one
  of our developers couldn't compile GMP 6.2.0 under Slackware 14.1 (see
  https://jira.reactos.org/browse/ROSBE-164).

  Fortunately, there is only one file in GMP 6.2.0 making use of C99-style
  variable declarations inside a for loop. Patching these out fixes the build.
  A corresponding patch is attached.

  I'd be glad to see this fixed upstream, so that we don't have to ship a
  patched version of GMP.

We added a test configuration to the automated GMP testing setup.
In fact, it catches a few C89 discrepancies.

The GMP developers have in the past discussed moving to C99.  I don't
recall that we decided to make the move yet, which means that
compile_powtab.c is not correct.  (But moving to C99 would probably be a
wise thing to do for the next major release.)
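
For illustration, the kind of construct involved and its C89-compatible
rewrite (not the actual patch to compile_powtab.c; the function below is
made up):

/* C99 style, rejected by strict C89 compilers: */
static void
fill_tab_c99 (int *tab, int n)
{
  for (int i = 0; i < n; i++)   /* declaration inside the for header */
    tab[i] = i * i;
}

/* C89-compatible rewrite: */
static void
fill_tab_c89 (int *tab, int n)
{
  int i;                        /* declaration moved to the block start */
  for (i = 0; i < n; i++)
    tab[i] = i * i;
}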


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


  1   2   3   4   5   6   >