I really wish C were a good option here.
I was trying to get C working with vectorization hints when that's
possible, and intrinsics/asm when it isn't. Unfortunately, the current
clang can't vectorize its way out of a paper bag, and GCC isn't much
better. So I've had to fall back to extensions and intrinsics a lot.
But if there were some set of carefully written constructs that would
make most of the things that need to be fast, fast, then it would lessen
the amount of hairy, processor-specific code needed for the rest.
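As a rough illustration of the kind of "carefully written construct" meant here, this is a minimal sketch assuming C99. The function name limbs_add and the 32-bit limb layout are hypothetical and not taken from any particular library. It shows the loop shape auto-vectorizers tend to handle most readily: unit stride, independent iterations, and restrict-qualified pointers that promise no aliasing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: limb-wise addition of two field elements stored as
 * arrays of 32-bit limbs, with no carry propagation (carries are assumed
 * to be handled later by a separate reduction step). */
void limbs_add(uint32_t *restrict out,
               const uint32_t *restrict a,
               const uint32_t *restrict b,
               size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);

    /* restrict tells the compiler that out, a, and b do not overlap, so
     * each iteration is independent and can be mapped onto SIMD lanes. */
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Whether a given compiler actually vectorizes this is not guaranteed. With clang, -O3 -Rpass=loop-vectorize reports what was vectorized; GCC has -fopt-info-vec for the same purpose.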
If anyone has tips for making fast platform-independent C code, though,
I'm all ears. Or a competitive numerics library for C++.
Or maybe I should try FORTRAN?
-- Mike
On 3/18/2015 10:53 PM, Samuel Neves wrote:
Suppose you have some amazing new CPU-specific code for your favorite field,
curve, key exchange, or whatever. How do you distribute it in a way that
minimizes users' effort to integrate it into their own applications
(presumably in C or via some FFI)?
As I see it, there are 4 possible approaches:
1. Distribute the assembly. This is the obvious reply, and arguably the best.
Nevertheless, this option leaves something to be desired:
- ABIs / calling conventions vary between operating systems and/or languages,
e.g., SysV ABI vs Windows ABI. This requires either preprocessor usage or
some sort of trampoline (e.g., https://github.com/floodyberry/asm-opt) to
adjust parameters to the implemented convention.
- Syntaxes also vary, e.g., Intel vs AT&T x86 syntax, Plan9 assembler syntax,
etc. This requires either a single assembler that works with all syntaxes,
or distributing multiple versions of the same function.
2. Heavy preprocessor use / code generator. This is the OpenSSL approach, using
Perl scripts to output suitable assembly for the relevant platform. Crypto++
does something similar, but abuses the C preprocessor for this instead. This
approach is not too bad, but it easily makes the code unreadable when
supporting multiple instruction sets, platforms, or other options. It may
also require fluency in some otherwise unnecessary language.
3. Use compiler intrinsics. This is not always practical, since some
instructions have no corresponding compiler intrinsics. When it is, however,
it is still problematic for anything more than prototyping: performance is
wildly dependent on the compiler, version, and switches used. In some cases
the compiler does not even support the intrinsics. This is OK when the user
can control these, but that is not always the case.
4. Use a "smart" assembler. This is an assembler that is slightly higher level,
and acts as a middle ground between 1-2 and 3. Besides automatic register
allocation, such tools may also easily accommodate things like syntax and ABI
if necessary. Examples of what I'm thinking of here are qhasm
(http://cr.yp.to/qhasm.html) or PeachPy
(https://bitbucket.org/MDukhan/peachpy). I like this approach, but the current
tools are prototypes at best, and therefore are not exactly suitable for
distribution in their current state.
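To make the trade-off in option 3 concrete, here is a minimal sketch of the usual way intrinsics code is shipped: an SSE2 path selected at compile time, with a portable C fallback for compilers or targets that lack it. The function name xor_bytes is hypothetical, and the sketch assumes the GCC/clang convention of defining __SSE2__ when SSE2 is enabled; other compilers may need a different feature test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)        /* assumption: GCC/clang-style feature macro */
#include <emmintrin.h>       /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] ^= src[i] for len bytes. */
void xor_bytes(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

#if defined(__SSE2__)
    /* Process 16 bytes per iteration; unaligned loads and stores keep
     * the function usable on arbitrary buffers. */
    for (; i + 16 <= len; i += 16) {
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(d, s));
    }
#endif

    /* Portable tail, which is also the whole loop when SSE2 is absent. */
    for (; i < len; i++) {
        dst[i] ^= src[i];
    }
}
```

The fallback keeps the code building everywhere, but which path is taken, and how well it is compiled, still depends on the user's compiler and flags, which is exactly the drawback described in option 3.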
So what do you guys think? Are there other options I failed to list here?
Which do you like best?
Best regards,
Samuel Neves
_______________________________________________
Curves mailing list
[email protected]
https://moderncrypto.org/mailman/listinfo/curves