I don't know if anyone has had a chance yet to study the change I proposed in 
an earlier message. In case it's gotten lost in the shuffle, here's the message 
again. Also, please note that the Unicode conformance requirements that 
currently appear to be violated include these:

"C4 A process shall interpret a coded character sequence according to the 
character semantics established by this standard, if that process does 
interpret that coded character sequence."

"C10 When a process interprets a code unit sequence which purports to be in a 
Unicode character encoding form, it shall treat ill-formed code unit sequences 
as an error condition and shall not interpret such sequences as characters."

The standard also notes, "There are important security issues associated with 
encoding conversion, especially with the conversion of malformed text." See 
also <http://www.unicode.org/reports/tr36/>.
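
To make C10 concrete for UTF-32: a code unit is well-formed only if it is a Unicode scalar value, that is, no greater than 0x10FFFF and not in the surrogate range 0xD800..0xDFFF. The following is a minimal sketch of that test. The function name is hypothetical and is not part of PCRE.

```c
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): returns 1 if the UTF-32
   code unit is a Unicode scalar value, 0 if it is ill-formed. */
static int is_well_formed_utf32_unit(uint32_t c)
{
  if (c > 0x10FFFFu) return 0;                  /* beyond the code space */
  if (c >= 0xD800u && c <= 0xDFFFu) return 0;   /* surrogate code point */
  return 1;
}
```

A unit that fails this test should never be interpreted as a character, whether or not an error is reported for it.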

It's fine to have explicit ways to override these conformance requirements (I 
propose a macro PCRE_MASK_UTF32_BEYOND_1FFFFF), but PCRE_NO_UTF32_CHECK in 
itself should only turn off the error reporting, not turn off the conformance. 
In other words, if the user specifies PCRE_NO_UTF32_CHECK, of course PCRE is 
freed from the responsibility to report an error for ill-formed UTF-32, but it 
still has the responsibility not to report a regex as matching when (according 
to the standard) it doesn't match. For example, the regex "A" matches the code 
for the letter "A" (0x00000041 in UTF-32), but not a code such as 0x10000041.
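
The difference can be sketched in a few lines of C. The two comparison functions below are hypothetical illustrations, not PCRE code: one masks the subject code unit with 0x1fffff before comparing, the other compares it unchanged.

```c
#include <stdint.h>

/* Hypothetical: compare after masking off the bits above 0x1fffff. */
static int matches_literal_masked(uint32_t unit, uint32_t literal)
{
  return (unit & 0x1fffffu) == literal;
}

/* Hypothetical: compare the code unit exactly as it appears. */
static int matches_literal_unmasked(uint32_t unit, uint32_t literal)
{
  return unit == literal;
}

/* An ill-formed unit whose low 21 bits happen to equal the code for 'A'
   (assuming an ASCII-compatible execution character set). */
static uint32_t ill_formed_lookalike(void)
{
  return 0x10000000u | (uint32_t)'A';
}
```

With masking, the ill-formed unit compares equal to "A" and would be reported as a match; without masking, it does not.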

--------

It's good that the masking with 0x1fffff now only occurs if PCRE_NO_UTF32_CHECK 
is specified. The Unicode conformance can be improved, and the code made 
slightly smaller, faster, and more flexible, with a simple change to 
pcre_internal.h. By default, PCRE_NO_UTF32_CHECK should disable checking 
without enabling masking. Masking can be enabled by a compile-time option. The 
definition of UTF32_MASK can be replaced by the following:

#if defined PCRE_MASK_UTF32_BEYOND_1FFFFF
#define ADJUST_UTF32_CODE_UNIT(c) ((c) & 0x1fffffu)
#else
#define ADJUST_UTF32_CODE_UNIT(c) (c)
#endif

and these macros can be revised as follows:

#define GETCHAR(c, eptr) \
c = ADJUST_UTF32_CODE_UNIT(*(eptr));

#define GETCHARTEST(c, eptr) \
c = *(eptr); \
if (utf) c = ADJUST_UTF32_CODE_UNIT(c);

#define GETCHARINC(c, eptr) \
c = ADJUST_UTF32_CODE_UNIT(*(eptr)++);

#define GETCHARINCTEST(c, eptr) \
c = *(eptr)++; \
if (utf) c = ADJUST_UTF32_CODE_UNIT(c);

#define RAWUCHAR(eptr) \
ADJUST_UTF32_CODE_UNIT(*(eptr))

#define RAWUCHARINC(eptr) \
ADJUST_UTF32_CODE_UNIT(*(eptr)++)

#define RAWUCHARTEST(eptr) \
(utf ? (ADJUST_UTF32_CODE_UNIT(*(eptr))) : *(eptr))

#define RAWUCHARINCTEST(eptr) \
(utf ? (ADJUST_UTF32_CODE_UNIT(*(eptr)++)) : *(eptr)++)
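
As a rough, self-contained illustration of the default behaviour (PCRE_MASK_UTF32_BEYOND_1FFFFF not defined), the sketch below uses the proposed ADJUST_UTF32_CODE_UNIT together with a GETCHARINC of the same shape. The count_literal function is a hypothetical stand-in for the matcher and is not part of PCRE.

```c
#include <stddef.h>
#include <stdint.h>

/* Proposed compile-time switch; leave it undefined for no masking. */
#if defined PCRE_MASK_UTF32_BEYOND_1FFFFF
#define ADJUST_UTF32_CODE_UNIT(c) ((c) & 0x1fffffu)
#else
#define ADJUST_UTF32_CODE_UNIT(c) (c)
#endif

#define GETCHARINC(c, eptr) \
  c = ADJUST_UTF32_CODE_UNIT(*(eptr)++);

/* Hypothetical stand-in for a matcher: counts the code units in
   p[0..n-1] that equal `literal` after ADJUST_UTF32_CODE_UNIT. */
static size_t count_literal(const uint32_t *p, size_t n, uint32_t literal)
{
  const uint32_t *end = p + n;
  size_t hits = 0;
  while (p < end)
    {
    uint32_t c;
    GETCHARINC(c, p)
    if (c == literal) hits++;
    }
  return hits;
}
```

Built without the option, only a genuine 0x41 is counted in a subject such as { 0x41, 0x10000041, 0x42 }. Built with -DPCRE_MASK_UTF32_BEYOND_1FFFFF, the ill-formed 0x10000041 would be counted as well.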

Best wishes,

Tom

文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: [email protected]     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯




-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
