FYI, Perl's support is moving along pretty quickly:

http://perl5.git.perl.org/perl.git/blob/HEAD:/pod/perlunicode.pod#l969

- Kurt


On Tue, Mar 5, 2013 at 8:19 AM, Nick Wellnhofer <[email protected]> wrote:

> On 05/03/2013 05:05, Marvin Humphrey wrote:
>
>> On Sat, Mar 2, 2013 at 12:06 PM,  <[email protected]> wrote:
>>
>> We may want to consider allowing builds without a fully functional
>> RegexTokenizer in the future.  At some point, we'll publish a public API
>> for extending Analyzer from C, and it's not hard to imagine people creating
>> their own tokenizer for a dedicated app which doesn't need RegexTokenizer.
>>
>
> Yes, we could make RegexTokenizer optional. I don't see a problem with
> that.
>
>
>>> +    // TODO: Make sure that we use a UTF-8 locale.
>>
>> PCRE has a UTF-8 mode, if I recall correctly.  Would things be easier if
>> we make PCRE a mandatory prerequisite for a functioning RegexTokenizer?
>>
>
> I implemented the POSIX RegexTokenizer because it was very easy to do.
> PCRE is next on my list. Maybe we should support multiple regex flavors:
>
>     RegexTokenizer_new(CharBuf *pattern, CharBuf *flavor)
>
> That might be useful for other host languages, too. But for
> interoperability between host languages, it would be better to have a
> single, universally supported syntax.
>
>
>> I'm not totally up to speed on the standards, but it seems to me that it
>> would be better to prefer Unicode regular expressions over POSIX, if we
>> have to choose.
>>
>>     http://www.unicode.org/reports/tr18/
>>
>
> Unicode TR18 doesn't specify a particular regex syntax. It only says how a
> regex engine should behave with regard to Unicode.
>
> POSIX regexes should work with UTF-8 strings when using a UTF-8 locale.
> Other than that, they probably don't support much of TR18. Also note that
> even Perl's support for TR18 isn't complete:
>
> http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level
>
> But most other regex engines aside from ICU are a lot worse, AFAIK.
>
> Nick
>
>
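
For reference, here is a rough sketch of the POSIX approach discussed above:
check that the process is running in a UTF-8 locale, then walk a string with
regcomp()/regexec() and count the matches. This is not Lucy's RegexTokenizer
code. The helper names ensure_utf8_locale() and count_tokens() are made up
for illustration, and the codeset string returned by nl_langinfo() may be
spelled differently on some platforms.

```c
#include <langinfo.h>
#include <locale.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: adopt the environment's locale and report whether
 * its character encoding is UTF-8.  Assumes the platform spells the
 * codeset "UTF-8" (glibc and macOS do; others may differ). */
static int
ensure_utf8_locale(void) {
    if (setlocale(LC_CTYPE, "") == NULL) { return 0; }
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}

/* Hypothetical helper: count non-empty matches of a POSIX extended regex
 * in `text`.  Returns -1 if the pattern does not compile. */
static int
count_tokens(const char *pattern, const char *text) {
    regex_t     re;
    regmatch_t  match;
    const char *cursor = text;
    int         flags  = 0;
    int         count  = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) { return -1; }

    while (*cursor != '\0' && regexec(&re, cursor, 1, &match, flags) == 0) {
        /* Stop on an empty match rather than risk looping forever. */
        if (match.rm_eo == match.rm_so) { break; }
        count++;
        cursor += match.rm_eo;
        /* Later searches do not start at the beginning of the line. */
        flags = REG_NOTBOL;
    }

    regfree(&re);
    return count;
}
```

A caller would run ensure_utf8_locale() once at startup and fall back (or
refuse to build a tokenizer) when it returns 0, then call something like
count_tokens("[[:alnum:]]+", text). Whether character classes such as
[[:alnum:]] cover non-ASCII letters depends on the C library and the active
locale, which is the portability concern raised in the thread.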
