bearophile wrote:
An interesting thread on Reddit:
http://www.reddit.com/r/programming/comments/92tnd/concepts_removed_from_c0x/
Regarding the compilation and use of generic code, different languages work in
different ways.
D/C++ (and ShedSkin, which uses full type inference) compile a separate function
for each distinct set of template types.
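For illustration, a minimal D sketch of that instantiation model (the function
name twice is just a placeholder): one template in the source becomes one
compiled function per distinct type argument in the binary.

import std.stdio;

// One template in the source...
T twice(T)(T x) { return x + x; }

void main()
{
    // ...but two separate compiled functions in the binary:
    writeln(twice(21));   // instantiates twice!(int)
    writeln(twice(1.5));  // instantiates twice!(double)
}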
Java uses a single function: even though it now has generics, they are erased, so
at run time there's a single function that works on Objects.
[...]
In theory the D strategy is the one that leads to faster code at runtime (and it
doesn't require a VM, unlike the C# case), but the binary has to contain all the
compiled versions of a templated function, and this can grow the binary size a
lot. A big binary can also be a problem because more code has to come through the
instruction half of the L1 cache (often just 32 KB), and this can and does reduce
performance.
I can see two possible solutions to this problem:
- Many functions in D programs aren't performance-critical. If you "compile" one
of those templated functions into a single dynamically typed function, the
overall performance doesn't change, while the binary size is kept low (see the
first sketch after this list). I think this is what the Java HotSpot does. LLVM,
which sits behind LDC, may be able to do this, and it could probably also allow
templates to be instantiated at run time, but this looks like a better fit for a
VM-based language, something different from the current D2 design (because it's
generally not easy to know at compile time which templates must be compiled and
which ones can be implemented as a dynamically typed piece of code; you need
profile-guided compilation, or a VM that monitors the code at runtime, as in the
Java/C# case).
- Another possible solution is to find identical asm code shared by the different
functions generated from a single template, and put such pieces of code into the
compiled binary only once (see the second sketch after this list). This requires
splitting the functions into sections, and such sections have to be joined by
unconditional jumps. You may think such jumps slow down the code (and sometimes
they surely do, so you have to avoid jumping in the middle of inner loops). Some
time ago I read an article about this; it shows that this strategy also reduces
binary size, and the resulting drop in L1 instruction-cache misses is usually
enough to balance the slowdown caused by the jumps (in many examples there was
even a performance gain).
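A minimal D sketch of the first idea (showTemplated/showDynamic are just
placeholder names): the templated function gets one compiled copy per type it's
used with, while the Variant-based one exists exactly once and pays with runtime
dispatch instead.

import std.stdio, std.variant;

// Template version: the compiler emits one copy per instantiated type.
void showTemplated(T)(T x) { writeln(typeid(T), ": ", x); }

// Dynamically typed version: a single compiled function for every type.
void showDynamic(Variant x) { writeln(x.type, ": ", x.toString()); }

void main()
{
    showTemplated(42);           // emits showTemplated!(int)
    showTemplated(3.14);         // emits showTemplated!(double)
    showDynamic(Variant(42));    // the same single function both times
    showDynamic(Variant(3.14));
}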
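And a small candidate case for the second idea (my own example, not taken from
that article): sum!(int) and sum!(uint) typically compile to identical
instructions, so a pass that merges duplicate function bodies could keep a
single copy.

// sum!(int) and sum!(uint) usually produce the same machine code on common
// backends, which makes them natural candidates for identical-code merging.
T sum(T)(T[] a)
{
    T total = 0;
    foreach (x; a)
        total += x;
    return total;
}

void main()
{
    auto a = sum([1, 2, 3]);     // instantiates sum!(int)
    auto b = sum([1u, 2u, 3u]);  // instantiates sum!(uint)
    assert(a == 6 && b == 6u);
}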
I commonly find that I need to use templates for integral types, and
that's only because of the silly implicit conversion rules inherited from C.
For example, initially I write
void foo(long x) { ... }
and this covers long, int, uint, short, ushort, byte, ubyte.
But ulong needs to be treated specially. So I add:
void foo(ulong x) { ... }
But now, thanks to those implicit conversion rules, foo(1) is
ambiguous! So I need to add overloads for all of the other integral
types. That's ridiculous, so I make foo() a template. And I get code bloat.
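A compilable sketch of that progression (foo is just the placeholder name from
above):

void foo(long x)  { /* covers long, int, uint, short, ushort, byte, ubyte */ }
void foo(ulong x) { /* the special ulong handling */ }

void main()
{
    foo(2L);    // fine: exact match for foo(long)
    foo(2UL);   // fine: exact match for foo(ulong)
    // foo(1); // error: the int literal converts implicitly to both long and
               // ulong, neither overload is a better match, so the call is
               // ambiguous
}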
If there were a way to say:
void foo(explicit ulong x) { ... } // MUST really be ulong; don't allow
                                   // implicit integral conversions
then a huge fraction of those code-bloating templates would disappear.
(for 'explicit', substitute any keyword of your choice).
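In the meantime a partial workaround (just a sketch, not a language feature) is
a template constraint that only a genuine ulong can satisfy; a regular function
and a constrained function template can overload each other in current D, so the
bloat is limited to the single ulong instantiation:

// The common path: one non-template function for all the integral types
// that implicitly convert to long.
void foo(long x)
{
    // ... general handling ...
}

// The ulong-only path: the constraint rejects every other type, so foo(1)
// still resolves to foo(long) without ambiguity.
void foo(T)(T x) if (is(T == ulong))
{
    // ... ulong-specific handling ...
}

void main()
{
    foo(1);    // calls foo(long): the template's constraint rejects int
    foo(1UL);  // calls the ulong-constrained template
}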