Re: [pcre-dev] [Bug 1295] add 32-bit library

Tom Bishop, Wenlin Institute Tue, 18 Sep 2012 11:56:06 -0700

On Sep 18, 2012, at 9:05 AM, Zoltán Herczeg <[email protected]> wrote:


> Hi,
> 
> this still looks quite theoretical to me. However, if you come up with a 
> patch that has negligible performance overhead, I am willing to review it.

Thank you! I guess I should wait until the code is in the svn repository.

Best wishes,

Tom

> 
> Regards,
> Zoltan
> 
> "Tom Bishop, Wenlin Institute" <[email protected]> wrote:
> 
> On Sep 17, 2012, at 5:09 PM, Christian Persch (GNOME) <[email protected]> wrote:
>> 
>> It's absolutely certain that there will never be unicode characters > 10ffff,
>> so there's no forward compatibility problem.
>> 
> A few years ago it was "absolutely certain" there would never be Unicode 
> characters > U+FFFF. As a result, a lot of supposedly Unicode-based software 
> still in widespread use fails for characters outside the Basic Multilingual 
> Plane. Should we learn from such mistakes, or repeat them?
>> 
>> Now you seem to want some sort of "UCS-4" mode that would allow any characters
>> from the 31-bit range (up to 7fffffff) of UCS-4? I don't see how that would be
>> useful; for example, which properties would those characters beyond the UTF-32
>> range have?
>> 
> By default, the same properties as for unassigned code points less than 
> U+110000. Especially relevant to this discussion, an essential property for 
> each character is that it shouldn't be matched with some other character 
> without a valid reason.
>> 
> One application of code points beyond U+10FFFF is for extended private use. 
> Properties for all unassigned characters could be specified by the same 
> protocols as for ordinary private-use characters. It should be possible to 
> specify custom properties for each character, including those in the current 
> private-use ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. 
> For example, depending on the application, people may want to treat some 
> private-use characters as letters, numbers, whitespace, or combining marks. 
> (This is an ability PCRE really should have anyway.)
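To make the private-use idea concrete, here is a minimal sketch in C. It is only an illustration under stated assumptions: `is_private_use`, `category_t`, `override_t`, and `lookup_category` are hypothetical names invented for this example, not part of the PCRE API, and the override table is just one possible way an application might supply custom properties.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True for the three private-use ranges listed above. */
bool is_private_use(uint32_t cp)
{
    return (cp >= 0xE000u   && cp <= 0xF8FFu)
        || (cp >= 0xF0000u  && cp <= 0xFFFFDu)
        || (cp >= 0x100000u && cp <= 0x10FFFDu);
}

/* Hypothetical application-defined categories (not PCRE property names). */
typedef enum {
    CAT_NO_OVERRIDE,  /* fall back to the default properties */
    CAT_LETTER,
    CAT_NUMBER,
    CAT_SPACE,
    CAT_MARK
} category_t;

/* One entry of a hypothetical override table: an inclusive code point range. */
typedef struct {
    uint32_t   first;
    uint32_t   last;
    category_t cat;
} override_t;

/* Linear search of the table; the first matching range wins. */
category_t lookup_category(uint32_t cp, const override_t *table, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (cp >= table[i].first && cp <= table[i].last)
            return table[i].cat;
    }
    return CAT_NO_OVERRIDE;
}
```

Nothing in the lookup depends on the code point being at most U+10FFFF, so the same table could describe ranges beyond it if those were allowed.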
>> 
>> (And if an actual use case for that UCS-4 mode ever arises, we can
>> just add it at that point as a _new_ flag/mode.)
>> 
> It might be best to design the API and add a few lines of code now, while all 
> the authors are alive, and before assumptions about PCRE have been hard-coded 
> into applications that depend on it.
>> 
> Three possible behaviors are under consideration, when a 32-bit string 
> contains a code unit > 0x0010FFFF:
>> 
> (1) trigger an error for invalid UTF-32;
> (2) mask it with 0x001FFFFF; or
> (3) treat it as a character in its own right.
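The three behaviors can be sketched in a few lines of C. This is an illustration only, not the code in the PCRE tree: `policy_t` and `decode_unit` are hypothetical names made up for this example, and the surrogate check in the strict case is my assumption about what "invalid UTF-32" should cover (surrogate code points are not valid in UTF-32 under the Unicode Standard).

```c
#include <stdint.h>

/* Hypothetical policy names for this sketch; they are not PCRE options. */
typedef enum {
    POLICY_STRICT,   /* behavior (1): report invalid UTF-32 */
    POLICY_MASK,     /* behavior (2): keep only the low 21 bits */
    POLICY_LITERAL   /* behavior (3): use the code unit unchanged */
} policy_t;

/* Stores the resulting character in *out and returns 0,
   or returns -1 when the strict policy rejects the unit. */
int decode_unit(uint32_t unit, policy_t policy, uint32_t *out)
{
    switch (policy) {
    case POLICY_STRICT:
        /* Reject values above 0x10FFFF and the surrogate range. */
        if (unit > 0x10FFFFu || (unit >= 0xD800u && unit <= 0xDFFFu))
            return -1;
        *out = unit;
        return 0;
    case POLICY_MASK:
        *out = unit & 0x001FFFFFu;
        return 0;
    case POLICY_LITERAL:
        *out = unit;
        return 0;
    }
    return -1;  /* unknown policy value */
}
```

For the code unit 0x10000021, the strict policy returns -1, the masking policy produces 0x21, and the literal policy produces 0x10000021 unchanged.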
>> 
> I think I understand that (1) will be the default (which is good), and that 
> (2) can currently be obtained by turning on the PCRE_NO_UTF32_CHECK option. 
> You said that the masking is "only a temporary measure while developing 
> this". It's not clear what that implies: once the development is complete, 
> would the PCRE_NO_UTF32_CHECK option still produce behavior (2), or would the 
> masking code be removed and the PCRE_NO_UTF32_CHECK option produce behavior 
> (3)?
>> 
> It seems that there are three possible purposes for someone to specify an 
> option named PCRE_NO_UTF32_CHECK:
>> 
> (A) simply to speed up the code a bit, since they're absolutely certain that 
> their strings are valid UTF-32;
> (B) to obtain behavior (2) since they've included extra information in the 
> eleven highest bits; or
> (C) to obtain behavior (3) to support characters beyond U+10FFFF.
>> 
> For purpose (A), suppose on some rare occasion the absolute certainty is 
> mistaken; then the best behavior for PCRE is (3), since 0x10000021 isn't a 
> valid code for an exclamation point (U+0021) and PCRE shouldn't report a 
> match when in reality there isn't a match.
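The false match is easy to reproduce with a toy search loop. This is a sketch only, assuming a naive linear scan rather than anything PCRE actually does; `find_unit` is a hypothetical helper invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first code unit that equals `want` after
   being ANDed with `mask`, or -1 if there is none.
   Pass 0x001FFFFF as the mask for behavior (2),
   or 0xFFFFFFFF (no masking) for behavior (3). */
ptrdiff_t find_unit(const uint32_t *s, size_t len, uint32_t want, uint32_t mask)
{
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & mask) == want)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

Searching the three code units 0x48, 0x69, 0x10000021 for U+0021 reports a match at index 2 when the mask is 0x001FFFFF, and no match when the mask is 0xFFFFFFFF.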
>> 
> The difference between behaviors (2) and (3) is huge. If only one or the 
> other is supported, (3) is more appropriate -- again, PCRE shouldn't report a 
> match when in reality there isn't a match. If the masking is considered a 
> useful option for the long term and not only a temporary measure, then there 
> could be two options in addition to the default (strict UTF-32 checking). 
> They might be named:
>> 
> PCRE_MASK_UTF32_BEYOND_1FFFFF for behavior (2)
>> 
> and
>> 
> PCRE_ALLOW_UTF32_BEYOND_10FFFF for behavior (3).
>> 
> This might only require a few additional lines of code. I'm happy to help 
> with the implementation.
>> 
> Best wishes,
>> 
> Tom
>> 
> 文林 Wenlin Institute, Inc.        Software for Learning Chinese
> E-mail: [email protected]     Web: http://www.wenlin.com
> Telephone: 1-877-4-WENLIN (1-877-493-6546)
> ☯
>> 
> -- 
> ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
> 
> 


文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: [email protected]     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯




-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
