Re: [pcre-dev] [Bug 1295] New: add 32-bit library

Zoltán Herczeg Sun, 16 Sep 2012 00:11:48 -0700

We have only a few (two) masks, as far as I remember. In practice the 16-bit
and 32-bit modes are basically useless without UTF, since you can set character
types only for the first 256 code points. And UTF is not likely to go beyond
0x10FFFF in the foreseeable future, since only a small fraction of that range
is filled with characters, and that already covers nearly all spoken and dead
(but real!) languages. UTF is not intended to be a picture library for bored
graphics designers, so we can say PCRE only supports characters <= 0xfffffff
without limiting any practical use cases.


I agree with the checks.
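
For illustration, the two policies discussed in this thread (rejecting code
units above U+10FFFF versus masking off the upper bits on request) could be
sketched roughly as below. This is only a sketch: the names MAX_UNICODE_CP,
CP_MASK, first_out_of_range and mask_code_unit are hypothetical and are not
part of the PCRE API, and surrogate checks are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; not PCRE identifiers. */
#define MAX_UNICODE_CP 0x10FFFFu /* highest code point Unicode defines */
#define CP_MASK        0x1FFFFFu /* low 21 bits of a 32-bit code unit */

/* Strict policy: return the index of the first code unit above U+10FFFF,
 * or -1 if every unit is in range. Surrogates are not checked here. */
static long first_out_of_range(const uint32_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] > MAX_UNICODE_CP)
            return (long)i;
    }
    return -1;
}

/* Opt-in policy: discard the upper 11 bits, which the caller is assumed
 * to be using for its own flags. */
static uint32_t mask_code_unit(uint32_t unit)
{
    return unit & CP_MASK;
}
```

With these definitions, 0x10000021 is reported as out of range by the strict
check, and becomes 0x21 only if the caller explicitly asks for masking.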

Regards,
Zoltan

"Tom Bishop, Wenlin Institute" <tan...@wenlin.com> wrote:
> On Sep 14, 2012, at 3:26 PM, Christian Persch (GNOME) <c...@gnome.org> wrote:
>
> > ...Since UTF-32 only occupies 21 bits of the 32-bit characters, it's useful
> > for implementations to use the upper bits to store extra info (flags, etc).
> > Since it's more efficient to pass the unmodified strings to pcre32, I aim
> > to make pcre32 mask out those upper bits. This is done in the code but
> > hasn't been debugged yet (it's not working yet).
>
> I suggest that such masking behavior should not be the default, but only
> enabled, if at all, by explicitly setting some configuration option.
>
> If a 32-bit string contains a code unit such as 0x10000021, the safer
> assumption is that it is *not* equivalent to U+0021. 0x10000021 might
> trigger a warning that the string is not valid UTF-32, or it might just be
> treated as a different character. But to treat it by default as matching
> U+0021 would be just as wrong as an ASCII-based program treating 0xA1 as
> equivalent to 0x21.
>
> The originally ASCII-based programs that continue to work well today (for
> Latin1, UTF-8, etc.) are the ones that treat the byte 0xA1 differently from
> 0x21, and refrain from masking/bending/folding/mutilating it.
>
> Using the upper bits of 32-bit code units for flags, etc., risks
> incompatibility with future use of code points beyond U+10FFFF (such as for
> extended private use); developers need to weigh the risks and benefits of
> such an approach carefully. Anyway, if they do it, they should at least be
> responsible for setting an option instructing PCRE to mask the high bits. In
> general, most libraries shouldn't be expected to mask or ignore those bits.
>
> I hope this suggestion is helpful. A 32-bit PCRE is likely to be useful for
> the long-term future, especially if code points beyond U+10FFFF are
> eventually employed.
>
> Best wishes,
>
> Tom


-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev