On 05/03/2013 05:05, Marvin Humphrey wrote:
> On Sat, Mar 2, 2013 at 12:06 PM, <[email protected]> wrote:
> We may want to consider allowing builds without a fully functional
> RegexTokenizer in the future. At some point, we'll publish a public API for
> extending Analyzer from C, and it's not hard to imagine people creating their
> own tokenizer for a dedicated app which doesn't need RegexTokenizer.

Yes, we could make RegexTokenizer optional. I don't see a problem with that.

>> + // TODO: Make sure that we use a UTF-8 locale.
>
> PCRE has a UTF-8 mode, if I recall correctly. Would things be easier if we
> make PCRE a mandatory prerequisite for a functioning RegexTokenizer?

I implemented the POSIX RegexTokenizer because it was very easy to do.
PCRE is next on my list. Maybe we should support multiple regex flavors:

    RegexTokenizer_new(CharBuf *pattern, CharBuf *flavor)

That might be useful for other host languages, too. But for
interoperability between host languages, it would be better to have a
single, universally supported syntax.
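
To illustrate the flavor idea, here is a minimal sketch of how a flavor name could be mapped to an engine choice. Everything in it is hypothetical: the RegexFlavor enum, the regex_flavor_from_name helper, and the "posix"/"pcre" names are made up for illustration, and a plain C string stands in for the CharBuf argument.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flavor identifiers; not part of any published API. */
typedef enum {
    REGEX_FLAVOR_UNKNOWN = 0,
    REGEX_FLAVOR_POSIX,
    REGEX_FLAVOR_PCRE
} RegexFlavor;

/* Map a flavor name to an identifier. A plain C string stands in for the
 * CharBuf argument of the proposed constructor. NULL or unrecognized names
 * yield REGEX_FLAVOR_UNKNOWN so the caller can reject them explicitly. */
static RegexFlavor
regex_flavor_from_name(const char *name) {
    static const struct {
        const char  *name;
        RegexFlavor  flavor;
    } table[] = {
        { "posix", REGEX_FLAVOR_POSIX },
        { "pcre",  REGEX_FLAVOR_PCRE  },
    };

    if (name == NULL) {
        return REGEX_FLAVOR_UNKNOWN;
    }
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(name, table[i].name) == 0) {
            return table[i].flavor;
        }
    }
    return REGEX_FLAVOR_UNKNOWN;
}
```

A constructor along these lines could fail early on REGEX_FLAVOR_UNKNOWN, which would also cover builds where a given engine is not compiled in.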

> I'm not totally up to speed on the standards, but it seems to me that it would
> be better to prefer Unicode regular expressions over POSIX, if we have to
> choose.
>
> http://www.unicode.org/reports/tr18/

Unicode TR18 doesn't specify a particular regex syntax. It only says how
a regex engine should behave with regard to Unicode.
POSIX regexes should work with UTF-8 strings when using a UTF-8 locale.
Other than that, they probably don't support much of TR18. Also note
that even Perl's support for TR18 isn't complete:
http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level
But most other regex engines aside from ICU are a lot worse, AFAIK.
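
For the UTF-8 locale point, a rough sketch of what the POSIX side might look like follows. The helper names use_utf8_locale and count_tokens are made up for this mail and are not the code from the commit. The candidate locale names are assumptions too, since availability varies by system, and comparing the codeset against the exact string "UTF-8" is an assumption about what nl_langinfo reports.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Try to switch LC_CTYPE to a UTF-8 locale. Returns 1 on success, 0 if none
 * of the candidates yields a UTF-8 codeset. The candidate names are
 * assumptions; "" means "whatever the environment specifies". */
static int
use_utf8_locale(void) {
    static const char *candidates[] = { "", "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL
            && strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
            return 1;
        }
    }
    return 0;
}

/* Count non-overlapping matches of a POSIX extended regex in text.
 * Returns -1 if the pattern does not compile. */
static int
count_tokens(const char *pattern, const char *text) {
    regex_t     re;
    regmatch_t  m;
    const char *p      = text;
    int         count  = 0;
    int         eflags = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    while (*p != '\0' && regexec(&re, p, 1, &m, eflags) == 0) {
        if (m.rm_eo == m.rm_so) {
            break;              /* Stop on an empty match to avoid looping. */
        }
        count++;
        p += m.rm_eo;           /* Continue after the end of this match. */
        eflags = REG_NOTBOL;    /* Later searches don't start at line start. */
    }
    regfree(&re);
    return count;
}
```

The idea would be to call use_utf8_locale() once before compiling any patterns, so that classes like [[:alnum:]] are interpreted per character rather than per byte. How well that works for non-ASCII input depends on the platform's regex implementation.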
Nick