http://bugs.exim.org/show_bug.cgi?id=1295

--- Comment #3 from Christian Persch (GNOME) <[email protected]> 2012-09-17 22:09:55 ---
Tom Bishop wrote:
> > ...Since UTF-32 only occupies 21 bits of the 32-bit characters,
> > it's useful for implementations to use the upper bits to store
> > extra info (flags, etc). Since it's more efficient to pass the
> > unmodified strings to pcre32, I aim to make pcre32 mask out those
> > upper bits. This is done in the code but hasn't been debugged yet
> > (it's not working yet).
>
> I suggest that such masking behavior should not be the default, but
> only enabled, if at all, by explicitly setting some configuration
> option.

I don't see a problem with masking the values. Unless the UTF-32 check
is disabled by PCRE_NO_UTF32_CHECK, these values will still be faulted
(IOW, the masking in _pcre32_valid_utf() is only a temporary measure
while developing this); we'll allow these bits through only when that
check is bypassed. And making this masking optional would only make the
code more complicated without any gain, IMO.

> If a 32-bit string contains a code unit such as 0x10000021, the safer
> assumption is that it is *not* equivalent to U+0021. 0x10000021 might
> trigger a warning that the string is not valid UTF-32, or it might
> just be treated as a different character. But to treat it by default
> as matching U+0021 would be just as wrong as an ASCII-based program
> treating 0xA1 as equivalent to 0x21.
>
> The originally ASCII-based programs that continue to work well today
> (for Latin1, UTF-8, etc.) are the ones that treat the byte 0xA1
> differently from 0x21, and refrain from
> masking/bending/folding/mutilating it.

There will be a non-UTF 32-bit mode where you can pass any characters;
we don't need to complicate the UTF mode with this. This 'masking' is
purely a convenience for the API user; you don't *have to* use it.
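
To make the two behaviours discussed above concrete, here is a minimal
sketch in C. It is not taken from the pcre32 branch: the function and
macro names are hypothetical, and it only assumes that a valid UTF-32
code unit is a value up to 0x10FFFF outside the surrogate range, with
the payload held in the low 21 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the PCRE sources. */
#define UTF32_PAYLOAD_MASK 0x1FFFFFu /* low 21 bits carry the character */
#define UNICODE_MAX        0x10FFFFu /* highest Unicode code point */

/* Drop any application flags stored in the upper 11 bits. */
static uint32_t strip_flag_bits(uint32_t unit)
{
    return unit & UTF32_PAYLOAD_MASK;
}

/* Strict check: fault anything that is not a Unicode scalar value. */
static bool is_valid_utf32_unit(uint32_t unit)
{
    if (unit > UNICODE_MAX)
        return false;                 /* includes units with flag bits set */
    if (unit >= 0xD800u && unit <= 0xDFFFu)
        return false;                 /* surrogates are not valid in UTF-32 */
    return true;
}
```

With the strict check enabled, 0x10000021 is faulted as is; only a
caller that bypasses the check and masks would see it as U+0021.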

> Using the upper bits of 32-bit code units for flags, etc., risks
> incompatibility with future use of code points beyond U+10FFFF (such
> for extended private use); developers need to weigh the risks and
> benefits of such an approach carefully. Anyway, if they do it, they
> should at least be responsible for setting an option instructing PCRE
> to mask the high bits. In general, most libraries shouldn't be
> expected to mask or ignore those bits.
>
> I hope this suggestion is helpful. A 32-bit PCRE is likely to be
> useful for the long-term future, especially if code points beyond
> U+10FFFF are eventually employed.

It's absolutely certain that there will never be Unicode characters
beyond U+10FFFF, so there's no forward compatibility problem.

Now you seem to want some sort of "UCS-4" mode that would allow any
character from the 31-bit range of UCS-4 (up to 0x7FFFFFFF)? I don't
see how that would be useful; for example, which properties would the
characters beyond the UTF-32 range have? (And if an actual use case for
such a UCS-4 mode ever arises, we can just add it at that point as a
_new_ flag/mode.)

(In reply to comment #1)
> The html docs are created automatically from the man pages when the
> script PrepareRelease is run. I will check this all out once your
> patches make it into the svn repo. I guess we'll have to do a bit of
> merging because independent changes are happening. (I'm currently
> tidying up code for OP_HSPACE and OP_VSPACE so that the case lists of
> values are defined only once, in a macro.)

I'll 'git rebase' the branch whenever new svn commits happen (I have
already done so for the OP_[HV]SPACE changes).

> I don't know if you've already picked this up, but I recently noticed in
> the code a few places where
>
>   #ifdef COMPILE_PCRE16
>
> should be changed to
>
>   #ifndef COMPILE_PCRE8
>      ^
>      ^
>
> so that it applies to 32-bit as well as 16-bit.

In my patch, I have generally been changing

  #ifdef COMPILE_PCRE16

to

  #if defined COMPILE_PCRE16 || defined COMPILE_PCRE32

which I find more readable, but if you prefer I can switch to

  #ifndef COMPILE_PCRE8

instead.

(In reply to comment #2)
> > The JIT compiler also works in pcre32; I only had to comment out the use
> > of the fast_forward_first_two_chars() function since I couldn't figure
> > out how to port it to 32-bit; help appreciated there (and for everything
> > else too :-).
>
> I have implemented a less platform dependent forward search, which should be
> compatible with any machine and any supported code format. Those ugly ifdefs
> are gone forever.

Thanks! I rebased the branch, and the #warning is now gone :-)

> > To check out the code, get the "pcre32" branch from my gitorious
> > repository at https://gitorious.org/~chpe/pcre/chpe-pcre . (It'll be
> > frequently rebased for updates from svn.)
> > (BTW, I've also set up a (manually updated) git-svn clone of the PCRE svn
> > repository at https://gitorious.org/pcre/pcre ).
>
> I think we should also setup a branch as we did when the 16 bit mode was
> developed.

Would you prefer that I push a branch to svn, instead of keeping the
work on gitorious until it lands in svn trunk?
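
On the #ifdef question above: the two spellings select the same code as
long as exactly one of the three width macros is defined per build. The
sketch below checks that at compile time. It is only an illustration:
the WIDE_* macros and wide_path_enabled() are hypothetical names, and
COMPILE_PCRE32 is defined in the snippet itself solely so that it
compiles standalone.

```c
/* Assumption: a real build defines exactly one of COMPILE_PCRE8,
 * COMPILE_PCRE16 or COMPILE_PCRE32. Pick the 32-bit library here only
 * so that this sketch compiles on its own. */
#if !defined COMPILE_PCRE8 && !defined COMPILE_PCRE16 && !defined COMPILE_PCRE32
#define COMPILE_PCRE32
#endif

/* Spelling 1: list the wide libraries explicitly. */
#if defined COMPILE_PCRE16 || defined COMPILE_PCRE32
#define WIDE_BY_EXPLICIT_LIST 1
#else
#define WIDE_BY_EXPLICIT_LIST 0
#endif

/* Spelling 2: everything that is not the 8-bit library. */
#ifndef COMPILE_PCRE8
#define WIDE_BY_EXCLUSION 1
#else
#define WIDE_BY_EXCLUSION 0
#endif

/* Fail the build if the two guards ever disagree. */
#if WIDE_BY_EXPLICIT_LIST != WIDE_BY_EXCLUSION
#error "the two guards select different code"
#endif

/* Hypothetical helper so the result can be inspected at run time. */
static int wide_path_enabled(void)
{
    return WIDE_BY_EXPLICIT_LIST;
}
```

The practical difference is in maintenance: the explicit list has to be
extended by hand if another code unit width is ever added, whereas the
#ifndef form would pick it up automatically.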
