On Sat, Mar 2, 2013 at 12:06 PM, <[email protected]> wrote:
> Implement POSIX RegexTokenizer
>
> Project: http://git-wip-us.apache.org/repos/asf/lucy/repo
> Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/24d06ccd
> Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/24d06ccd
> Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/24d06ccd

> +#if defined(CHY_HAS_REGEX_H)
> +  #include <regex.h>
> +#elif defined(CHY_HAS_PCREPOSIX_H)
> +  #include <pcreposix.h>
> +#else
> +  #error No regex headers found.
> +#endif

We may want to consider allowing builds without a fully functional
RegexTokenizer in the future.  At some point, we'll publish a public API for
extending Analyzer from C, and it's not hard to imagine people creating their
own tokenizer for a dedicated app which doesn't need RegexTokenizer.

> +    // TODO: Make sure that we use a UTF-8 locale.

PCRE has a UTF-8 mode, if I recall correctly.  Would things be easier if we
make PCRE a mandatory prerequisite for a functioning RegexTokenizer?

I'm not totally up to speed on the standards, but it seems to me that it would
be better to prefer Unicode regular expressions over POSIX, if we have to
choose.

http://www.unicode.org/reports/tr18/

Marvin Humphrey
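P.S. To make the "builds without a fully functional RegexTokenizer" idea
concrete, here is a rough sketch, not a patch. It swaps the #error for a
feature flag, so that code depending on regex support can report itself as
unavailable at run time instead of breaking the build. Everything prefixed
SKETCH_ or sketch_ is a hypothetical name invented for this illustration, and
the __has_include check is only a stand-in for the configuration probe that
would normally define CHY_HAS_REGEX_H. It assumes the POSIX-style
regcomp/regexec/regfree interface, which as far as I know pcreposix.h also
provides.

```c
#include <stddef.h>

/* Stand-in for the configuration probe (assumption): lets this sketch build
 * on its own with compilers that support __has_include. */
#if !defined(CHY_HAS_REGEX_H) && !defined(CHY_HAS_PCREPOSIX_H)
  #if defined(__has_include)
    #if __has_include(<regex.h>)
      #define CHY_HAS_REGEX_H
    #endif
  #endif
#endif

#if defined(CHY_HAS_REGEX_H)
  #include <regex.h>
  #define SKETCH_HAS_REGEX 1
#elif defined(CHY_HAS_PCREPOSIX_H)
  #include <pcreposix.h>
  #define SKETCH_HAS_REGEX 1
#else
  /* Instead of #error: compile without a working regex tokenizer. */
  #define SKETCH_HAS_REGEX 0
#endif

/* Hypothetical helper: count non-overlapping, non-empty matches of
 * `pattern` in `text`.  Returns -1 if regex support was compiled out
 * or the pattern fails to compile. */
static int
sketch_count_tokens(const char *pattern, const char *text) {
#if SKETCH_HAS_REGEX
    regex_t    re;
    regmatch_t match;
    int        count  = 0;
    int        eflags = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    while (*text != '\0' && regexec(&re, text, 1, &match, eflags) == 0) {
        if (match.rm_eo == match.rm_so) {
            text++;                 /* Empty match: step ahead one byte. */
        }
        else {
            count++;
            text += match.rm_eo;    /* Resume right after this token. */
        }
        eflags = REG_NOTBOL;        /* No longer at the start of the input. */
    }
    regfree(&re);
    return count;
#else
    (void)pattern;
    (void)text;
    return -1;
#endif
}
```

With regex support compiled in, sketch_count_tokens("[a-z]+", "foo bar baz")
counts three tokens; with it compiled out, every call returns -1. Note that
this steps through bytes, not characters, so it does nothing about the UTF-8
question raised above.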
