On Sep 18, 2012, at 3:31 PM, Christian Persch <[email protected]> wrote:

> Actually it would be more useful to look at it sooner; you can get the
> current code from gitorious at
> https://gitorious.org/~chpe/pcre/chpe-pcre/ (pcre32 branch). However
> note that the masking isn't actually working yet! :-)

OK, I checked out the code as follows:

git clone -b pcre32 git://gitorious.org/~chpe/pcre/chpe-pcre.git chpe-pcre

There are a few issues.

In pcre32_valid_utf32.c, the validity checking is too lenient in two respects. 
First of all, each code unit is masked with UTF32_MASK (0x1ffffful) before 
checking, so most invalid code units will be treated as though they were valid. 
You could change this line:

  c = *p & UTF32_MASK;

to

  c = *p;

Or else, make the masking conditional on something like 
PCRE_MASK_UTF32_BEYOND_1FFFFF, which should be false by default, for Unicode 
conformance.

Second, the test for noncharacters is incomplete. It should reject all 
noncharacters, namely, the range U+FDD0..U+FDEF, and any code unit whose last 
four digits are FFFE or FFFF. You could change this line:

    if (c == 0xfffeu)

to 

    if (c >= 0xfdd0 && (c <= 0xfdef || ((c & 0xffff) | 1) == 0xffff))

I suggest generalizing the meaning of the macro PCRE_UTF32_ERR2 from 
"Disallowed character 0xfffe" to "Noncharacter".

In pcre32_ord2utf32.c, there are similar issues. UTF32_MASK is used when it 
shouldn't be (by default); the checking for noncharacters is incomplete; and if 
a surrogate or out-of-range code unit is encountered, it's replaced by U+FFFE, 
which is itself a noncharacter. What is the rationale for that? To ensure that 
the output is valid UTF-32, the conventional value to use is U+FFFD REPLACEMENT 
CHARACTER.

In pcre_internal.h, there's a macro GETCHAR:
 
#define GETCHAR(c, eptr) \
  c = *eptr & UTF32_MASK;

I wouldn't use UTF32_MASK here at all; or else, make the masking conditional on 
something like PCRE_MASK_UTF32_BEYOND_1FFFFF, which should be false by default.

By the way, these macros could be made more robust by adding parentheses to 
avoid side-effects; and they seem to have superfluous semi-colons. On first 
glance, it doesn't look like the lack of parentheses would actually cause any 
bugs the way the macros are currently called, but I can't say for sure, not 
having studied all the calls carefully.

You said, "the masking isn't actually working yet". Do you mean that you're 
trying to use the masking to enable the feature where you store extra 
information in the upper eleven bits of each code unit, and it's failing when 
you have nonzero bits up there? Or do you mean that the code is failing even 
when you test it with valid UTF-32?

I'll be happy to do further studying and testing, but first I'd like to 
understand what the real issue is at this point. The masking feature might be 
more trouble than it's worth at this stage. Probably the first priority is to 
get the implementation right for valid UTF-32. I'm interested in optional 
support for code points beyond U+10FFFF, but if the code isn't working right 
for ordinary code points, we shouldn't put the cart before the horse.

Best wishes,

Tom

文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: [email protected]     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯




-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 

Reply via email to