It has been ~12 years since I looked at this, but when benchmarking the Perl 5.8 regexp implementation (which is a modified Boyer-Moore, if memory serves) it was significantly faster than Java 6's inbuilt java.util.regex classes.
However, the Perl implementation has pathological special cases that are rare, but by no means purely theoretical. At this distance, all I can recall is that it was something to do with multiple | conditions and backtracking, but the extreme slowdown could be reliably provoked - and the conclusion we came to was that it was a side effect of the algorithm's design. I'm not sure that I have a point here, beyond the usual Caveat Emptor, but there we are. Ben On Fri, 21 Dec 2018 at 19:02, Michael Barker <[email protected]> wrote: > > Boyer-Moore seems to be one that is commonly used for fast sub-string > matching (GNU grep uses it internally). Works best when the same substring > is reapplied multiple times. I've used the simplified version in the past. > The Wikipedia page > (https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm) > has an implementation in Java. It would need to be reworked slightly to be > useful. I.e. the charTable and offsetTable should be precalculated from the > search substring. I'd make it into a class, pass the search term (needle) as > a constructor parameter and store the tables as fields. The input (haystack) > is the passed into a non-static indexOf function (without the needle). > > Mike. > > On Thu, 15 Nov 2018 at 14:08, Shevek <[email protected]> wrote: >> >> Dear wizards, please advise. >> >> I need to offer a user configuration feature for pattern matching, to >> exclude objects from my billion object sort-merge (which is now working >> fairly well, thank you all). >> >> What we're mostly trying to do is exclude any record which contains any >> one of a number of substrings. The computer science textbooks give >> various fast-string-searching algorithms with pre-computed tables, any >> of which would suit our use case, but I don't see a practical >> implementation of any of them floating around... >> >> Current practical options: >> * java.util.regex, precompiled patterns >> - Reputedly slow at matching, but our patterns are simple. >> - We are using a single regex containing a large alternation. >> - perf says this regex matcher is 50% of our runtime. >> - WHY isn't Matcher.usePattern() allocation-free? It totally could >> be. This means that the int[] array allocation is a major drain on the GC. >> - If we use ThreadLocal<Matcher> instead, the ThreadLocal can't be >> static, and hits the previously discussed issue with blowing out the >> ThreadLocalMap. >> * rej2 >> - Not tried yet - has anyone tried this? >> * brics >> - Trying and failing on brics may make sense before falling back to >> java regex. >> * Groovy Closure >> - Faster than regex/pattern assuming you have a better strategy for >> matching. But now you're down to repeated contains() calls. >> >> What are the suggestions? >> >> Thank you. >> >> S. >> >> -- >> You received this message because you are subscribed to the Google Groups >> "mechanical-sympathy" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected]. >> For more options, visit https://groups.google.com/d/optout. > > -- > You received this message because you are subscribed to the Google Groups > "mechanical-sympathy" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > For more options, visit https://groups.google.com/d/optout. -- You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
