c-bindings-wip2 - Implement POSIX RegexTokenizer

Marvin Humphrey Mon, 04 Mar 2013 20:06:11 -0800

On Sat, Mar 2, 2013 at 12:06 PM,  <[email protected]> wrote:

> Implement POSIX RegexTokenizer


> Project: http://git-wip-us.apache.org/repos/asf/lucy/repo
> Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/24d06ccd
> Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/24d06ccd
> Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/24d06ccd

> +#if defined(CHY_HAS_REGEX_H)
> +  #include <regex.h>
> +#elif defined(CHY_HAS_PCREPOSIX_H)
> +  #include <pcreposix.h>
> +#else
> +  #error No regex headers found.
> +#endif

We may want to consider allowing builds without a fully functional
RegexTokenizer in the future.  At some point, we'll publish a public API for
extending Analyzer from C, and it's not hard to imagine people creating their
own tokenizer for a dedicated app which doesn't need RegexTokenizer.
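
To make that concrete, here is a minimal sketch of how the header check could degrade gracefully instead of hitting #error.  The names LUCY_HAS_REGEX_TOKENIZER and regex_tokenizer_available() are made up for illustration; nothing like them exists in Lucy or Charmonizer today.

```c
/* Hypothetical sketch: LUCY_HAS_REGEX_TOKENIZER and
 * regex_tokenizer_available() are illustrative names only. */
#if defined(CHY_HAS_REGEX_H)
  #include <regex.h>
  #define LUCY_HAS_REGEX_TOKENIZER 1
#elif defined(CHY_HAS_PCREPOSIX_H)
  #include <pcreposix.h>
  #define LUCY_HAS_REGEX_TOKENIZER 1
#else
  /* No regex headers found: keep building, with RegexTokenizer disabled. */
  #define LUCY_HAS_REGEX_TOKENIZER 0
#endif

/* Returns 1 if this build was compiled with regex support, 0 otherwise. */
int
regex_tokenizer_available(void) {
    return LUCY_HAS_REGEX_TOKENIZER;
}
```

A RegexTokenizer constructor could consult such a check and throw an error at runtime, so that only apps which actually ask for RegexTokenizer would be affected by the missing headers.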

> +    // TODO: Make sure that we use a UTF-8 locale.

PCRE has a UTF-8 mode, if I recall correctly.  Would things be easier if we
made PCRE a mandatory prerequisite for a functioning RegexTokenizer?
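
For comparison, if we stay with plain POSIX regex.h, the TODO quoted above presumably means checking the codeset of the current locale before compiling any pattern.  A rough sketch of such a check follows, assuming a POSIX system with nl_langinfo(); the helper names codeset_is_utf8() and ctype_locale_is_utf8() are hypothetical and not part of the commit.

```c
#include <langinfo.h>
#include <stddef.h>
#include <strings.h>

/* Returns 1 if `codeset` names UTF-8, 0 otherwise.  Platforms spell the
 * name differently ("UTF-8", "utf8"), so compare case-insensitively and
 * accept both forms. */
int
codeset_is_utf8(const char *codeset) {
    if (codeset == NULL) { return 0; }
    return strcasecmp(codeset, "UTF-8") == 0
           || strcasecmp(codeset, "UTF8") == 0;
}

/* Returns 1 if the locale currently in effect uses UTF-8.  This only
 * reports the state; it does not change the locale. */
int
ctype_locale_is_utf8(void) {
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

The result depends on the host application having called setlocale(LC_CTYPE, "") or similar beforehand, which a library can't really count on.  That dependence on process-wide state is part of why a regex engine with its own UTF-8 mode looks attractive.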

I'm not totally up to speed on the standards, but it seems to me that it would
be better to prefer Unicode regular expressions over POSIX, if we have to
choose.

    http://www.unicode.org/reports/tr18/

Marvin Humphrey

[lucy-dev] Re: [lucy-commits] [15/15] git commit: refs/heads/c-bindings-wip2 - Implement POSIX RegexTokenizer
