On Tue, Mar 5, 2013 at 5:19 AM, Nick Wellnhofer <[email protected]> wrote:

> I implemented the POSIX RegexTokenizer because it was very easy to do. PCRE
> is next on my list. Maybe we should support multiple regex flavors:
>
>     RegexTokenizer_new(CharBuf *pattern, CharBuf *flavor)
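(For context, the core loop of a POSIX-flavored regex tokenizer is short, which matches the observation that it was easy to implement. Here's a minimal sketch using the C library's regcomp()/regexec(); the tokenize() helper and its fixed-size token buffers are invented for illustration and are not Lucy's actual API.)

```c
#include <regex.h>
#include <string.h>

/* Sketch of a POSIX-flavored tokenize step: repeatedly match the token
 * pattern against the remaining input and copy out each match.
 * Returns the number of tokens found, or -1 if the pattern won't compile. */
int tokenize(const char *text, const char *pattern,
             char tokens[][64], int max_tokens) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    int count = 0;
    const char *p = text;
    regmatch_t m;
    while (count < max_tokens && regexec(&re, p, 1, &m, 0) == 0) {
        int len = (int)(m.rm_eo - m.rm_so);
        if (len >= 64) { len = 63; }        /* truncate oversized tokens */
        memcpy(tokens[count], p + m.rm_so, len);
        tokens[count][len] = '\0';
        count++;
        if (m.rm_eo == m.rm_so) {           /* empty match: step forward */
            if (*p == '\0') { break; }
            p++;
        }
        else {
            p += m.rm_eo;
        }
    }
    regfree(&re);
    return count;
}
```

With the token pattern "[[:alnum:]]+", the input "Hello, world! 42" splits into "Hello", "world", and "42".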
In the context of tokenizing, regular expression engines are more notable for producing subtly incompatible results than for offering substantially different functionality. I don't think that offering a choice of regex flavors at runtime lets our users accomplish much in the way of tokenization that they couldn't accomplish otherwise.

What might theoretically be useful is specifying a regex engine for the sake of index portability across hosts -- for example, specifying that a Perl build of Lucy use PCRE instead of Perl's regex engine. There are a couple of ways we could do that. One option would be to offer a compile-time configuration option for RegexTokenizer. However, incompatible configurations would fail silently, producing subtly different results under the inappropriate engine rather than bombing out. A more reliable technique would be to provide dedicated classes such as "PCRETokenizer" which are associated with specific regex engines. However, such an approach has a notable cost, because the regex engine code would need to be bundled to protect against incompatibilities across regex engine versions.

> That might be useful for other host languages, too. But for interoperability
> between host languages, it would be better to have a single, universally
> supported syntax.

The only way we're going to get reliable interoperability across host languages for tokenization based on regular expressions is to bundle a regex engine such as PCRE with every Lucy build, which doesn't seem worthwhile at this time. Even between different versions of the same host language, interop will be imperfect -- index compatibility will be compromised on host language upgrades as the host language devs fix bugs, add features, update to the latest and greatest version of Unicode, and so on. But that's fine. That's what RegexTokenizer is now -- a wrapper around the host's regex facilities which allows users to leverage their existing expertise.
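(To make the silent-failure point concrete, here is a hypothetical sketch of the compile-time configuration approach. LUCY_USE_PCRE is an invented build flag and compile_pattern() an invented helper -- neither exists in Lucy. Both branches accept the same pattern string, so a build with the "wrong" engine produces subtly different tokens rather than an error.)

```c
#include <stddef.h>

#ifdef LUCY_USE_PCRE
#include <pcre.h>

/* PCRE semantics: "\\w" matches word characters, etc. */
static void *compile_pattern(const char *pattern) {
    const char *err;
    int erroff;
    return pcre_compile(pattern, 0, &err, &erroff, NULL);
}
#else
#include <regex.h>
#include <stdlib.h>

/* POSIX ERE semantics: "\\w" is unspecified; the portable spelling is
 * "[[:alnum:]_]".  A pattern written for PCRE still compiles here, but
 * may match differently -- the silent failure mode described above. */
static void *compile_pattern(const char *pattern) {
    regex_t *re = malloc(sizeof(regex_t));
    if (re != NULL && regcomp(re, pattern, REG_EXTENDED) != 0) {
        free(re);
        re = NULL;
    }
    return re;
}
#endif
```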
We prize host integration, and RegexTokenizer is an expression of that. People just need to be aware that they may need to regenerate their indexes when the regex engine's behavior changes.

Marvin Humphrey
