FYI, Perl's support is moving along pretty quickly: http://perl5.git.perl.org/perl.git/blob/HEAD:/pod/perlunicode.pod#l969
- Kurt

On Tue, Mar 5, 2013 at 8:19 AM, Nick Wellnhofer <[email protected]> wrote:

> On 05/03/2013 05:05, Marvin Humphrey wrote:
>
>> On Sat, Mar 2, 2013 at 12:06 PM, <[email protected]> wrote:
>>
>> We may want to consider allowing builds without a fully functional
>> RegexTokenizer in the future. At some point, we'll publish a public API for
>> extending Analyzer from C, and it's not hard to imagine people creating their
>> own tokenizer for a dedicated app which doesn't need RegexTokenizer.
>
> Yes, we could make RegexTokenizer optional. I don't see a problem with
> that.
>
>>> + // TODO: Make sure that we use a UTF-8 locale.
>>
>> PCRE has a UTF-8 mode, if I recall correctly. Would things be easier if we
>> make PCRE a mandatory prerequisite for a functioning RegexTokenizer?
>
> I implemented the POSIX RegexTokenizer because it was very easy to do.
> PCRE is next on my list. Maybe we should support multiple regex flavors:
>
>     RegexTokenizer_new(CharBuf *pattern, CharBuf *flavor)
>
> That might be useful for other host languages, too. But for
> interoperability between host languages, it would be better to have a
> single, universally supported syntax.
>
>> I'm not totally up to speed on the standards, but it seems to me that it
>> would be better to prefer Unicode regular expressions over POSIX, if we
>> have to choose.
>>
>> http://www.unicode.org/reports/tr18/
>
> Unicode TR18 doesn't specify a particular regex syntax. It only says how a
> regex engine should behave with regard to Unicode.
>
> POSIX regexes should work with UTF-8 strings when using a UTF-8 locale.
> Other than that, they probably don't support much of TR18. Also note that
> even Perl's support for TR18 isn't complete:
>
> http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level
>
> But most other regex engines aside from ICU are a lot worse, AFAIK.
>
> Nick
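
For reference, the POSIX approach discussed above (regexes matched against UTF-8 text under a UTF-8 locale) might look roughly like the sketch below. It is only an illustration, not the code from the commit: `TokenSpan`, `use_utf8_locale`, and `tokenize_posix` are hypothetical names, and the locale names tried are assumptions that may not exist on a given system.

```c
#include <langinfo.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: byte offsets of one token within the input text. */
typedef struct {
    size_t start;
    size_t end;
} TokenSpan;

/* Try to switch LC_CTYPE to a UTF-8 locale. The candidate names are
 * assumptions; which locales are installed is system-dependent.
 * Returns 1 on success, 0 if no UTF-8 locale could be selected. */
int use_utf8_locale(void) {
    const char *candidates[] = { "", "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL
            && strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
            return 1;
        }
    }
    return 0;
}

/* Scan `text` for non-overlapping matches of the POSIX extended regex
 * `pattern` and record up to `max_spans` byte ranges in `spans`.
 * Returns the number of tokens found, or -1 if the pattern is invalid. */
int tokenize_posix(const char *pattern, const char *text,
                   TokenSpan *spans, size_t max_spans) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }

    size_t len = strlen(text);
    size_t offset = 0;
    size_t count = 0;
    regmatch_t m;

    while (offset <= len && count < max_spans) {
        /* After the first match we are no longer at the start of the
         * text, so '^' must not match at the current position. */
        int flags = offset > 0 ? REG_NOTBOL : 0;
        if (regexec(&re, text + offset, 1, &m, flags) != 0) {
            break;
        }
        if (m.rm_eo == m.rm_so) {
            /* Empty match: emit nothing and step past it to avoid an
             * infinite loop. This steps one byte; a real tokenizer
             * would step one whole UTF-8 character instead. */
            offset += (size_t)m.rm_eo + 1;
            continue;
        }
        spans[count].start = offset + (size_t)m.rm_so;
        spans[count].end = offset + (size_t)m.rm_eo;
        count++;
        offset += (size_t)m.rm_eo;
    }

    regfree(&re);
    return (int)count;
}
```

A caller would run `use_utf8_locale()` once at startup and then pass a pattern such as `[[:alpha:]]+` to `tokenize_posix`. Whether a bracket class like `[[:alpha:]]` matches non-ASCII letters depends on the C library's locale data, which is one reason the thread leans towards PCRE or ICU.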
