Re: [curves] Distribution-ready optimized code

Mike Hamburg Wed, 18 Mar 2015 23:16:04 -0700

I really wish C were a good option here.

I was trying to get C working with vectorization hints when that's possible, and intrinsics/asm when it isn't. Unfortunately, the current clang can't vectorize its way out of a paper bag, and GCC isn't much better. So I've had to fall back to extensions and intrinsics a lot. But if there were some set of carefully written constructs that would make most of the things that need to be fast, fast, then it would lessen the amount of hairy, processor specific code to deal with the rest.
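As a rough illustration of the kind of "extension" fallback meant here (a sketch only, not code from any library mentioned in this thread), the GCC/Clang vector_size attribute lets plain C express lane-wise arithmetic without per-CPU intrinsics. The helper name limbs_add and the choice of four 32-bit lanes are assumptions made for the example, and the attribute itself is a compiler extension, not standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: four 32-bit lanes in one 16-byte value.
 * This is a compiler extension, not ISO C. */
typedef uint32_t u32x4 __attribute__((vector_size(16)));

/* Hypothetical helper: out[i] = a[i] + b[i] (mod 2^32) for n limbs.
 * Assumes n is a multiple of 4; carries between limbs are NOT handled. */
static void limbs_add(uint32_t *out, const uint32_t *a,
                      const uint32_t *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        u32x4 va, vb, vr;
        /* memcpy avoids alignment and aliasing problems; compilers
         * typically turn it into a plain vector load/store. */
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;                 /* lane-wise addition */
        memcpy(out + i, &vr, sizeof vr);
    }
}
```

The compiler lowers the lane-wise addition to whatever SIMD instructions the target offers, or to scalar code if it offers none, so the source stays the same across processors. Whether the generated code is actually fast still depends on the compiler and flags.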

If anyone has tips for making fast platform-independent C code, though, I'm all ears. Or a competitive numerics library for C++.


Or maybe I should try FORTRAN?

-- Mike

On 3/18/2015 10:53 PM, Samuel Neves wrote:

Suppose you have some amazing new CPU-specific code for your favorite field,
curve, key exchange, or whatever. How do you distribute it in a way that
minimizes its users' effort to integrate it into their own applications
(presumably in C or via some FFI)?

As I see it, there are 4 possible approaches:

1. Distribute the assembly. This is the obvious reply, and arguably the best.
Nevertheless, this option leaves something to be desired:
  - ABIs / calling conventions vary between operating systems and/or
    languages, e.g., SysV ABI vs Windows ABI. This requires either
    preprocessor usage or some sort of trampoline (e.g.,
    https://github.com/floodyberry/asm-opt) to adjust parameters to the
    implemented convention.
  - Syntaxes also vary, e.g., Intel vs AT&T x86 syntax, Plan9 assembler
    syntax, etc. This requires either a single assembler that works with
    all syntaxes, or distributing multiple versions of the same function.

2. Heavy preprocessor use / code generator. This is the OpenSSL approach,
using Perl scripts to output suitable assembly for the relevant platform.
Crypto++ does something similar, but abuses the C preprocessor for this
instead. This approach is not too bad, but it easily makes the code
unreadable when supporting multiple instruction sets, platforms, or other
options, and it may require fluency in some otherwise unnecessary language.

3. Use compiler intrinsics. This is not always practical, since some
instructions do not have suitable compiler intrinsics to take advantage of.
When it is, however, it is still problematic for anything more than
prototyping: performance is wildly dependent on the compiler, version, and
switches used. In some cases the compiler does not even support the
intrinsics. This is OK when the user can control these, but that is not
always the case.

4. Use a "smart" assembler. This is an assembler that is slightly higher
level, and acts as a middle-ground between 1-2 and 3. Besides automatic
register allocation, such tools may also easily accommodate things like
syntax and ABI if necessary. Examples of what I'm thinking here are qhasm
(http://cr.yp.to/qhasm.html) or PeachPy
(https://bitbucket.org/MDukhan/peachpy). I like this approach, but the
current tools are prototypes at best, and therefore are not exactly suitable
for distribution in their current state.
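To make option 3 concrete, here is a minimal sketch of intrinsics code that guards against the "compiler does not even support the intrinsics" case with a portable fallback. The function name xor_blocks is hypothetical and the XOR workload is chosen only for brevity; the SSE2 intrinsics used (_mm_loadu_si128, _mm_xor_si128, _mm_storeu_si128) come from <emmintrin.h>. The guard relies on the __SSE2__ macro, which GCC and Clang define but some other compilers do not, so those compilers would silently take the scalar path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out = a XOR b over n bytes.
 * Assumes n is a multiple of 16. */
static void xor_blocks(uint8_t *out, const uint8_t *a,
                       const uint8_t *b, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
#if defined(__SSE2__)
        /* Unaligned loads/stores, so callers need no special alignment. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_xor_si128(va, vb));
#else
        /* Portable fallback for targets or compilers without SSE2. */
        for (size_t j = 0; j < 16; j++)
            out[i + j] = (uint8_t)(a[i + j] ^ b[i + j]);
#endif
    }
}
```

Even with such a guard, the objection above still applies: the quality of the generated code depends on the compiler, its version, and the switches the user happens to pass.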

So what do you guys think? Are there other options I failed to list here?
Which do you like best?

Best regards,
Samuel Neves

_______________________________________________
Curves mailing list
[email protected]
https://moderncrypto.org/mailman/listinfo/curves

