I've been keeping in contact with Stefan and providing him example code to test with his CTFE engine. He's been saying for a while that templates are slow. So I decided to finally work out just how slow we're talking about here.

I can't show the exact code I'm running with, but suffice it to say this particular test case crashes the 32-bit dmd.exe that comes with the official downloads. I've had to build my own 64-bit version... which also eventually crashes, but only after consuming 8 gigabytes of memory.

Using Visual Studio 2015's built-in sample-based profiler, I decided to see just what a release build of the compiler was doing with the problem code.

http://pastebin.com/dcwwCp28

This is a copy of the call tree where DMD spends most of its time. If you don't know how to read these profiles, the key thing to notice is that it's 130+ functions deep in the call stack. Plenty of template instances, plenty of static if, plenty of static foreach... I'm doing quite a bit to tax the compiler, in other words.

Which got me thinking. I've been rolling things over to CTFE-generated string mixins lately in anticipation of the speedups Stefan will get us; the sketch just below shows the general shape of that conversion. But there's one bit of template code that I have not touched at all, and that's the file linked after the sketch.
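
As a rough illustration of what I mean by the conversion (a minimal sketch with made-up names, not Binderoo code): a CTFE function builds the declarations as one string and a single mixin injects them, rather than instantiating a template per member.

// Minimal sketch, not Binderoo code: generate member accessors with a CTFE
// function and one string mixin instead of a template instantiation per member.
string generateGetters(T)()
{
    string code;
    foreach (member; __traits(allMembers, T))
    {
        static if (is(typeof(__traits(getMember, T, member))))
        {
            code ~= "auto get_" ~ member ~ "(ref " ~ T.stringof
                ~ " v) { return v." ~ member ~ "; }\n";
        }
    }
    return code;
}

struct Vec3 { float x, y, z; }
mixin(generateGetters!Vec3()); // declares get_x, get_y, get_z at compile time

The compiler still instantiates generateGetters once per type, but the per-member work happens inside CTFE rather than in the template machinery, which is exactly the work Stefan's engine is meant to make fast.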

https://github.com/Remedy-Entertainment/binderoo/blob/master/binderoo_client/d/src/binderoo/binding/serialise.d

This is a simple set of templated functions that parse objects and serialise them to JSON (the reason I'm not just using std.json is that I need custom handling for pointers and reference types). But it turns out this is the killer. As part of binding an object for Binderoo's rapid iteration purposes, it generates a serialise/deserialise call that instantiates those templates for every type it finds. If I turn that code generation off, the code compiles. If I remove an auto-generated file of 1000+ structs, which end up embedded tree-like in the only object I apply Binderoo's binding to in that entire module, it compiles in 45% of the time (12 seconds versus 26 seconds).
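
To be clear about the pattern (this is a simplified sketch, not the actual serialise.d code): the serialiser recurses through fields, so serialising one root object instantiates serialise!T for every type reachable from it. With 1000+ auto-generated structs reachable from the bound object, that's thousands of instances for the compiler to create, cache and match.

import std.array : join;
import std.conv : to;
import std.traits : FieldNameTuple, isPointer;

// Simplified sketch of the recursive pattern; not the actual Binderoo code.
string serialise(T)(ref T value)
{
    static if (is(T == struct))
    {
        string[] fields;
        foreach (member; FieldNameTuple!T)
        {
            // Each distinct field type pulls in another serialise!FieldType instance.
            fields ~= `"` ~ member ~ `":` ~ serialise(__traits(getMember, value, member));
        }
        return "{" ~ fields.join(",") ~ "}";
    }
    else static if (isPointer!T)
    {
        // Pointers get custom handling, which is why std.json wasn't an option here.
        return value is null ? "null" : serialise(*value);
    }
    else
    {
        return value.to!string;
    }
}

struct Leaf { int id; }
struct Root { Leaf a; Leaf* b; float weight; }

unittest
{
    // Serialising one Root instantiates serialise!Root, serialise!Leaf,
    // serialise!(Leaf*), serialise!int and serialise!float.
    Root r = Root(Leaf(1), null, 0.5f);
    assert(serialise(r) == `{"a":{"id":1},"b":null,"weight":0.5}`);
}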

The hot path without that 1000+ struct file actually goes through the AttribDeclaration.semantic and UserAttributeDeclaration.semantic code path, with the single most expensive function sitting in the OS itself thanks to Outbuffer::writeString needing to realloc string memory in dmd\backend\outbuf.c.

The hot path with that 1000+ struct file spends the most time in TemplateInstance.semantic, specifically with calls to TemplateDeclaration.findExistingInstance, TemplateInstance.tryExpandMembers, and TemplateInstance.findBestMatch taking up 90%+ of its time. findExistingInstance spends most of its time in arrayObjectMatch in dtemplate.d, which subsequently spends most of its time in the match function in the same file (which calls virtuals on RootObject to do comparisons).
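
For those who haven't poked at dtemplate.d, the shape of that comparison is roughly the following (a hedged sketch with invented names, not DMD's actual code): matching one argument list against another walks the elements one by one, every element comparison pays for a virtual call, and that cost repeats for every candidate instance that gets compared.

// Hedged sketch with invented names, not DMD's code: the cost profile of
// comparing template argument lists when each comparison is a virtual call.
abstract class CtArg
{
    abstract bool matches(const CtArg other) const; // stand-in for the RootObject virtuals
}

bool argumentsMatch(const CtArg[] a, const CtArg[] b)
{
    if (a.length != b.length)
        return false;
    foreach (i, arg; a)
        if (!arg.matches(b[i])) // one virtual dispatch per argument, per candidate compared
            return false;
    return true;
}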

At the very least, I now have an idea of which parts of the compiler I'm taxing and can attempt to write around that. But I'm also tempted to go in and optimise those parts of the compiler.
