https://issues.dlang.org/show_bug.cgi?id=12690
--- Comment #3 from Dmitry Olshansky <[email protected]> --- (In reply to Diez from comment #2) > Created attachment 1351 [details] > Source code for more complex test case (UTF-8) > > Thank you for your help. I verified the compilation options of my real-world > program and in fact the -inline option was missing (-O was already set). > However, it didn't change anything at the overall performance, it's still 5 > seconds (backtracking) versus 47 seconds (Thompson). With this magnitude in > performance difference there's no chance to switch to the Thompson matcher. > > The attached source is almost the same as before, but with average > expression and haystack from my real-world program (confidential > informations scrambled). Now it's not a small performance difference, but a > factor of more than 5. > I see. The fact that -inline doesn't change a thing is troublesome but lets leave that to compiler guys. The new pattern makes things much clearer - as I suspected the preparation stage of the Thompson engine is the one to blame. Preparing an internal freelist alone takes ~1/3 of total run-time in this case. The figures are much higher this time about x3 difference. I have a hunch that it allocates much more then needed in this case, might be a bug in counting up repetitions. Need to investigate further. > As I understand the meaning of the word "discouraged", the backtracking > matcher is to be removed in some future. Please don't. As long as the > Thompson matcher has this poor performance (for certain use cases), there's > need for both matchers. In fact, using the Thompson matcher is way slower > than an equivalent Java 1.7 program, while using the backtracking matcher is > slightly faster. Of course, optimally there's only one matcher which > combines the advantages of all underlaying matchers. Discouraged means exactly what you say - picking the type of underlying matcher should be done internally based on pattern properties (or read that as one meta-matcher). It occurred to me that - There are more types of matchers, some are very limited in what they can. Thus should't push it on user to understand when he/she can use them. - More matchers - more functions doesn't scale. - Also replace functions have the same NxM combinatorics (n methods, m functions), there is also splitter. I envision that when std.regex turns package, it will gradually expose internals for those who wants to make customize/their own matcher. > In my real-world program, there's about 100 times more haystacks than > patterns (940 vs. 112_000). As soon as a pattern matches, the next haystack > is examined. Each haystack needed about 23 tries (avg) until a pattern > matched, actually there were 2_584_925 match calls in total of which 111_675 > matched the haystack (a "few" haystacks didn't match any pattern). > > Note that the attached test case source file needs to be saved in UTF-8 > encoding, since the regular expression string contains UTF-8 characters > (hopefully, they don't get destroyed during upload into the issue tracking > system). The UTF-8 signature comment line should help Windows editors to > automatically switch to UTF-8 encoding when opening the source file (like an > UTF-8 BOM). > > The idea of matching multiple patterns with only one call of the matcher is > very interesting (though I didn't test it yet). Well, it's not yet implemented ;) But I expect to get to it done somewhere in May. > Consider a 5 times faster > matcher with 5 matches at once ... --
