On Thu, Apr 29, 2010 at 6:56 PM, bill lam <[email protected]> wrote:
> On Thu, 29 Apr 2010, Raul Miller wrote:
>> On Tue, Apr 27, 2010 at 7:53 PM, bill lam <[email protected]> wrote:
>> > regex defaults to utf-8 encoding, but those html files use latin-1.
>> > Either convert the text to utf-8 or set regex to non-utf8 mode.
>> >
>> >   rxutf8 0
>>
>> After reading
>>    open'regex'
>> and
>>    http://www.pcre.org/pcre.txt
>>
>> What I thought I would want
>>    rxutf8 do_jregex_ 'PCRE_UTF8 23 b. PCRE_NO_UTF8_CHECK'
>>
>> Unfortunately, PCRE_NO_UTF8_CHECK is not defined, and when
>> I look for its value, I find
>> http://read.pudn.com/downloads126/sourcecode/delphi_control/536510/PCRE/pcre.h__.htm
>>
>> which suggests
>> PCRE_UTF8=: 16b800
>> PCRE_NO_UTF8_CHECK=: 16b2000
>>
>> So now I know that I am confused.
>>
>> Can anyone suggest how I might be able to use pcre's ability to
>> recognize word forming utf8 characters without also losing access
>> to latin1 content?
>
> rxutf8 is intended to be called as either 'rxutf8 0' or 'rxutf8 1'. Do
> you mean that the constant for enabling/disabling the utf8 option is
> incorrect inside jregex?

No, not exactly.

I was thinking that RX_OPTIONS_UTF8 should be PCRE_UTF8.

It is not incorrect.
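
For reference, here is a small sketch of what the expression quoted
above would evaluate to if both constants had the values shown in that
pcre.h listing. The two names are defined locally here as an assumption;
this is not what jregex itself defines. 23 b. is bitwise OR and 17 b. is
bitwise AND.

```j
NB. Values copied from the pcre.h listing quoted above (assumed, not read from jregex).
PCRE_UTF8=: 16b800
PCRE_NO_UTF8_CHECK=: 16b2000

NB. 23 b. is bitwise OR: combine the two option bits into one mask.
mask=: PCRE_UTF8 (23 b.) PCRE_NO_UTF8_CHECK   NB. 10240, i.e. 16b2800

NB. 17 b. is bitwise AND: a nonzero result means that bit is set in the mask.
has_utf8=: 0 ~: mask (17 b.) PCRE_UTF8        NB. 1
```

Whether rxutf8 would accept such a combined mask is a separate question;
as noted in the quoted reply, it is meant to be called with 0 or 1 only.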

Nevertheless, the current design does not allow me to use
pcre's recognition of utf8 character sets without also losing
the ability to deal with latin1 in that regular expression.
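
One possible workaround, following the earlier suggestion to convert the
text to utf-8, is to re-encode latin1 input before matching. The sketch
below assumes the whole input is known to be latin1; the verb name
latin1_to_utf8 is hypothetical and not part of jregex.

```j
NB. Hypothetical helper: treat each byte as a Latin-1 code point
NB. (a. i. y), turn the code points into 2-byte characters (4 u:),
NB. then encode those as UTF-8 (8 u:).
latin1_to_utf8=: 3 : '8 u: 4 u: a. i. y'

NB. Byte 233 is e-acute in Latin-1; in UTF-8 it becomes the pair 195 169.
a. i. latin1_to_utf8 233 { a.

NB. Plain ASCII text passes through unchanged.
latin1_to_utf8 'abc'
```

With the text converted this way, the regex could presumably be left in
utf-8 mode (rxutf8 1). This does not help when a single input mixes both
encodings.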

Here is where I ran into this issue:
http://rosettacode.org/wiki/Inverted_index

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
