On Sep 14, 2012, at 3:26 PM, Christian Persch (GNOME) <[email protected]> wrote:

> ...Since UTF-32 only occupies 21 bits of the 32-bit characters, it's useful
> for implementations to use the upper bits to store extra info (flags, etc).
> Since it's more efficient to pass the unmodified strings to pcre32, I aim to
> make pcre32 mask out those upper bits. This is done in the code but hasn't
> been debugged yet (it's not working yet).

I suggest that such masking behavior should not be the default, but only 
enabled, if at all, by explicitly setting some configuration option.

If a 32-bit string contains a code unit such as 0x10000021, the safer 
assumption is that it is *not* equivalent to U+0021.
0x10000021 might trigger a warning that the string is not valid UTF-32, or it 
might just be treated as a different character. But to treat it by default as 
matching U+0021 would be just as wrong as an ASCII-based program treating 0xA1 
as equivalent to 0x21.
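
To make the distinction concrete, here is a minimal sketch in C of the two
safer behaviours described above: rejecting 0x10000021 as invalid UTF-32, or
comparing it exactly so that it is simply a different character. The names
MAX_CODE_POINT, is_valid_utf32_unit and same_character are my own, chosen for
illustration; they are not part of PCRE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest Unicode code point; UTF-32 code units above it are invalid. */
#define MAX_CODE_POINT 0x10FFFFu

/* Hypothetical helper: true if `unit` is a valid UTF-32 code unit,
   i.e. at most U+10FFFF and not a surrogate (U+D800..U+DFFF). */
static bool is_valid_utf32_unit(uint32_t unit)
{
    if (unit > MAX_CODE_POINT)
        return false;
    if (unit >= 0xD800u && unit <= 0xDFFFu)
        return false;
    return true;
}

/* Hypothetical helper: compare code units exactly, with no masking,
   so that 0x10000021 and 0x21 are different characters. */
static bool same_character(uint32_t a, uint32_t b)
{
    return a == b;
}
```

With these definitions, is_valid_utf32_unit(0x10000021) is false and
same_character(0x10000021, 0x21) is false, whichever of the two policies a
library prefers.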

The originally ASCII-based programs that continue to work well today (for 
Latin1, UTF-8, etc.) are the ones that treat the byte 0xA1 differently from 
0x21, and refrain from masking/bending/folding/mutilating it.

Using the upper bits of 32-bit code units for flags, etc., risks 
incompatibility with future use of code points beyond U+10FFFF (such as for 
extended private use); developers need to weigh the risks and benefits of such 
an approach carefully. Anyway, if they do it, they should at least be 
responsible for setting an option instructing PCRE to mask the high bits. In 
general, most libraries shouldn't be expected to mask or ignore those bits.
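
As a rough sketch of what "opt-in" could look like, assuming a per-call
options struct: the struct match_options, its mask_high_bits field and the
function prepare_unit below are hypothetical names invented for this
example, not anything in the current pcre32 code. Masking happens only when
the caller has asked for it; otherwise a code unit with high bits set is
reported as invalid.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low 21 bits: enough to hold any code point up to U+10FFFF. */
#define LOW_21_BITS 0x1FFFFFu

/* Hypothetical options struct; masking is off unless the caller asks. */
struct match_options {
    bool mask_high_bits;
};

/* Hypothetical helper: produce the value to match against.
   Returns 0 and stores the result in *out on success, or -1 if the
   (possibly masked) unit is above U+10FFFF or is a surrogate. */
static int prepare_unit(uint32_t unit, const struct match_options *opts,
                        uint32_t *out)
{
    if (opts->mask_high_bits)
        unit &= LOW_21_BITS;   /* drop the caller's flag bits */
    if (unit > 0x10FFFFu || (unit >= 0xD800u && unit <= 0xDFFFu))
        return -1;             /* not a valid UTF-32 code unit */
    *out = unit;
    return 0;
}
```

Note that the validity check still runs after masking, since 21 bits can hold
values up to 0x1FFFFF, which is above U+10FFFF.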

I hope this suggestion is helpful. A 32-bit PCRE is likely to be useful for the 
long-term future, especially if code points beyond U+10FFFF are eventually 
employed.

Best wishes,

Tom

文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: [email protected]     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯




-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
