On Sep 18, 2012, at 9:05 AM, Zoltán Herczeg <[email protected]> wrote:
> Hi,
>
> this still looks quite theoretical for me. However, if you come up with a
> patch which has negligible performance overhead, I am willing to review.

Thank you! I guess I should wait until the code is in the svn repository.

Best wishes,

Tom

> Regards,
> Zoltan
>
> "Tom Bishop, Wenlin Institute" <[email protected]> wrote:
> >
> > On Sep 17, 2012, at 5:09 PM, Christian Persch (GNOME) <[email protected]> wrote:
> >
> >> It's absolutely certain that there will never be unicode characters > 10ffff,
> >> so there's no forward compatibility problem.
> >
> > A few years ago it was "absolutely certain" there would never be Unicode
> > characters > U+FFFF. As a result, a lot of supposedly Unicode-based software
> > still in widespread use fails for characters outside the Basic Multilingual
> > Plane. Should we learn from such mistakes, or repeat them?
> >
> >> Now you seem to want some sort of "UCS-4" mode that would allow any characters
> >> from the 31-bit range (up to 7fffffff) of UCS-4? I don't see how that would be
> >> useful; for example, which properties would those characters beyond the UTF-32
> >> range have?
> >
> > By default, the same properties as for unassigned code points less than
> > U+110000. Especially relevant to this discussion, an essential property for
> > each character is that it shouldn't be matched with some other character
> > without a valid reason.
> >
> > One application of code points beyond U+10FFFF is for extended private use.
> > Properties for all unassigned characters could be specified by the same
> > protocols as for ordinary private-use characters. It should be possible to
> > specify custom properties for each character, including those in the current
> > private-use ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD.
> > For example, depending on the application, people may want to treat some
> > private-use characters as letters, numbers, whitespace, or combining marks.
> > (This is an ability PCRE really should have anyway.)
> >
> >> (And if an actual use case for that UCS-4 mode ever arises, we can
> >> just add it at that point as a _new_ flag/mode.)
> >
> > It might be best to design the API and add a few lines of code now, while all
> > the authors are alive, and before assumptions about PCRE have been hard-coded
> > into applications that depend on it.
> >
> > Three possible behaviors are under consideration, when a 32-bit string
> > contains a code unit > 0x0010FFFF:
> >
> > (1) trigger an error for invalid UTF-32;
> > (2) mask it with 0x001FFFFF; or
> > (3) treat it as a character in its own right.
> >
> > I think I understand that (1) will be the default (which is good), and that
> > (2) can currently be obtained by turning on the PCRE_NO_UTF32_CHECK option.
> > You said that the masking is "only a temporary measure while developing
> > this". It's not clear what that implies: once the development is complete,
> > would the PCRE_NO_UTF32_CHECK option still produce behavior (2), or would the
> > masking code be removed and the PCRE_NO_UTF32_CHECK option produce behavior
> > (3)?
> >
> > It seems that there are three possible purposes for someone to specify an
> > option named PCRE_NO_UTF32_CHECK:
> >
> > (A) simply to speed up the code a bit, since they're absolutely certain that
> > their strings are valid UTF-32;
> > (B) to obtain behavior (2) since they've included extra information in the
> > eleven highest bits; or
> > (C) to obtain behavior (3) to support characters beyond U+10FFFF.
> >
> > For purpose (A), suppose on some rare occasion the absolute certainty is
> > mistaken; then the best behavior for PCRE is (3), since 0x10000021 isn't a
> > valid code for an exclamation point (U+0021) and PCRE shouldn't report a
> > match when in reality there isn't a match.
> >
> > The difference between behaviors (2) and (3) is huge. If only one or the
> > other is supported, (3) is more appropriate -- again, PCRE shouldn't report a
> > match when in reality there isn't a match.
> > If the masking is considered a useful option for the long term and not only
> > a temporary measure, then there could be two options in addition to the
> > default (strict UTF-32 checking). They might be named:
> >
> > PCRE_MASK_UTF32_BEYOND_1FFFFF for behavior (2)
> >
> > and
> >
> > PCRE_ALLOW_UTF32_BEYOND_10FFFF for behavior (3).
> >
> > This might only require a few additional lines of code. I'm happy to help
> > with the implementation.
> >
> > Best wishes,
> >
> > Tom
> >
> > 文林 Wenlin Institute, Inc. Software for Learning Chinese
> > E-mail: [email protected] Web: http://www.wenlin.com
> > Telephone: 1-877-4-WENLIN (1-877-493-6546)
> > ☯
> >
> > --
> > ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
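To make the difference between behaviors (1), (2), and (3) concrete, here is a minimal sketch in C. It is not PCRE code: the names cu_mode, CU_STRICT, CU_MASK, CU_ALLOW, and decode_unit are hypothetical and exist only for this illustration. It also looks only at the "code unit > 0x0010FFFF" case discussed above and ignores every other UTF-32 validity rule (surrogates, for example).

```c
#include <stdint.h>

/* Hypothetical names for illustration only; none of this is PCRE's API. */
typedef enum {
    CU_STRICT, /* behavior (1): a code unit > 0x10FFFF is an error      */
    CU_MASK,   /* behavior (2): mask the code unit with 0x001FFFFF      */
    CU_ALLOW   /* behavior (3): keep the code unit as its own character */
} cu_mode;

/*
 * Decide what character a single 32-bit code unit stands for.
 * Returns 0 and stores the character in *out on success,
 * or -1 if the unit is rejected (strict mode only).
 * Only the "> 0x10FFFF" case is handled; other validity checks
 * are deliberately left out of this sketch.
 */
static int decode_unit(uint32_t unit, cu_mode mode, uint32_t *out)
{
    if (unit <= 0x0010FFFFu) {
        *out = unit;                /* in range: same result in all modes */
        return 0;
    }
    switch (mode) {
    case CU_STRICT:
        return -1;                  /* (1) report invalid UTF-32 */
    case CU_MASK:
        *out = unit & 0x001FFFFFu;  /* (2) drop the eleven highest bits */
        return 0;
    case CU_ALLOW:
        *out = unit;                /* (3) a character in its own right */
        return 0;
    }
    return -1;                      /* unknown mode */
}
```

With the code unit 0x10000021 from the example above, CU_STRICT rejects the input, CU_MASK yields 0x21 (so a pattern containing "!" would match it), and CU_ALLOW yields 0x10000021, which does not compare equal to U+0021.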
