On Tue, Mar 5, 2013 at 5:19 AM, Nick Wellnhofer <[email protected]> wrote:
> I implemented the POSIX RegexTokenizer because it was very easy to do. PCRE
> is next on my list. Maybe we should support multiple regex flavors:
>
>     RegexTokenizer_new(CharBuf *pattern, CharBuf *flavor)

In the context of tokenizing, regular expression engines are more notable
for producing subtly incompatible results than for offering substantially
different functionality.  I don't think that offering a choice of regex
flavor at runtime lets our users accomplish much, tokenization-wise, that
they couldn't accomplish otherwise.

What might theoretically be useful is specifying a regex engine for the sake
of index portability across hosts -- for example, specifying that a Perl build
of Lucy use PCRE instead of Perl's regex engine.  There are a couple of ways
we could do that.

One option would be to offer a compile-time configuration option for
RegexTokenizer.  However, incompatible configurations would fail silently,
producing subtly different results under the inappropriate engine rather than
bombing out.

A more reliable technique would be to provide dedicated classes such as
"PCRETokenizer" which are associated with specific regex engines.  However,
such an approach has notable cost because the regex engine code would need to
be bundled to protect against incompatibilities across regex engine versions.

> That might be useful for other host languages, too. But for interoperability
> between host languages, it would be better to have a single, universally
> supported syntax.

The only way we're going to get reliable interoperability across host
languages for tokenization based on regular expressions is to bundle a regex
engine such as PCRE with every Lucy build.  Which doesn't seem worthwhile at
this time.

Even between different versions of the host language, interop will be
imperfect -- index compatibility will be compromised on host language upgrades
as the host language devs fix bugs, add features, update to the latest and
greatest version of Unicode, etc.

But that's fine.  That's what RegexTokenizer is now -- a wrapper around the
host's regex facilities which allows users to leverage their existing
expertise.  We prize host integration, and RegexTokenizer is an expression of
that.  People just need to be aware that they may need to regenerate their
indexes when the regex engine behavior changes.

Marvin Humphrey
