On Saturday, August 25, 2012 3:51 AM, Richard Smith wrote:
> Hi Andy,
>
> Editorializing somewhat, I think it's unfortunate that people expect
> constexpr evaluation to be asymptotically more efficient than the
> runtime evaluation of the same functions, and I'm a little concerned
> that this caching will lead people to write code which has surprisingly
> poor performance in the cases where a constexpr function gets invoked
> dynamically. But... g++ already implements a similar optimization, so
> we'll probably need this to handle code written for it.
>
> Have you measured the impact of this cache on "well-written" constexpr
> functions (those not requiring memoization for efficiency)? What's the
> memory usage and performance impact on, say,
> test/SemaCXX/constexpr-turing.cpp, with a more complex Turing machine?
> Can you provide some figures for typical and worst-case slowdowns?
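For context before the numbers: the canonical example of the asymptotic gap you mention is a doubly-recursive Fibonacci (my illustration, not code from the patch). Naive constant evaluation of fib(n) performs an exponential number of calls, while a memoising evaluator performs a linear number:

```cpp
// Classic double recursion: each call spawns two more, so evaluating
// fib(n) without memoisation costs O(2^n) calls; a cache that remembers
// each fib(k) reduces this to O(n).
constexpr unsigned long long fib(unsigned n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

static_assert(fib(10) == 55, "");
static_assert(fib(20) == 6765, "");
```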
Ok, I've done some performance testing, and here are some rough figures. Two clang test-cases are interesting with regard to the memoisation feature, namely constexpr-turing and constexpr-nqueens. I imagine that you would consider these "well written" -- it seems, at any rate, that you are the author, so I am guessing so! :o)

With constexpr-turing, we have a test-case that is unable to make use of the cached evaluations. In this case, I see a 5% slow-down in compilation, and also a moderate ~200Kb of additional memory consumption. With constexpr-nqueens, the result is much more positive: a 19% speed-up in compilation, but at a cost of ~13Mb of additional memory.

In my own code, I have seen anything from a 2% slow-down up to a 25% speed-up -- this is, of course, only for "well written" constexpr functions, since I get a 2000% to 5000% speed-up for the rest, at least for those I was willing to test without caching (the others being too slow)! In terms of additional compiler memory, I've seen up to a 100Mb increase, but this really comes down to the constexpr algorithms in use. As a comparison, constexpr-torture consumes an additional ~10Mb, but finishes in ~1.5 seconds as opposed to upwards of 3 hours (the point at which I terminated it).

We could still take the route of an attribute/f-switch pair to override the behaviour of the cache. My suggestion, if you feel the performance changes are too costly, would be a -fconstexpr-cache on/off switch for global use and an __attribute__((constexpr-cache on/off)) for local use.

> There is a nasty corner case where the value of a global changes
> between successive calls to a constexpr function (this happens if a
> variable is read after it's zero-initialized but before it's
> dynamically initialized, from a constexpr function called within its
> own initializer). Your patch doesn't seem to handle that (though
> perhaps I just missed it).

I fixed the code to support your corner case.
I'm still a little surprised that it isn't outlawed, but I've seen your other posts on the net about what's permitted and what's not, so I'm more than willing to accept your better understanding. For myself, I'll class it as "don't try this at home"!

> Have you considered using llvm::FoldingSet for the cache? That would
> be particularly useful if we ever want to allow arbitrary literal
> types as non-type template arguments, or other such cases which might
> require reducing an APValue to a FoldingSetNodeID.

I believe this means supporting FoldingSetTrait for APValues, which is not currently done -- I don't have the time to do this at present. Can it be left to another day?

> There seems to be a fair amount of APValue copying going on in this
> change. For large APValues, that's not cheap. Perhaps you could
> allocate the cache entry up-front, and evaluate the function arguments
> and return value directly into its APValue slots, or swap them in?

Done -- this took the constexpr-nqueens test-case from a ~11% to a ~19% speed-up, for example.

> Does this have an impact on the constexpr recursion limit? For instance,
> if f(500) recurses to depth 500, and g(500) recurses to depth 500 then
> calls f(500), will the result of calling g(500) depend on whether I've
> called f(500) already in this translation unit? That seems like a highly
> unfortunate property, but it's consistent with how template
> instantiations behave at least.

Yes, there is an impact on the recursion limit, but not in a negative way (in my opinion!): if f(500) is already cached, then calling g(500) doesn't run to a recursion depth of 1000, only 501, since f(500) is picked out of the cache rather than recursively evaluated.

> Thanks for looking into this!

No problem. I'll try and split out the revised patches and post them shortly... Anything further before I do so?
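A concrete sketch of the recursion-depth point, with smaller depths than in your example so that it also evaluates under the default -fconstexpr-depth without any caching (again my illustration, not code from the patch):

```cpp
// f(n) recurses to depth n; g(n) recurses to depth n and then calls f(100).
// With the cache primed by an earlier evaluation of f(100), evaluating
// g(100) peaks at a depth of ~101 rather than ~201, since f(100) is
// fetched from the cache instead of being re-evaluated.
constexpr int f(int n) { return n == 0 ? 0 : f(n - 1) + 1; }
constexpr int g(int n) { return n == 0 ? f(100) : g(n - 1) + 1; }

static_assert(f(100) == 100, "");   // primes the cache for f(100)
static_assert(g(100) == 200, "");   // depth ~101 with the cache, ~201 without
```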
Cheers,
Andy

_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
