[pcre-dev] [Bug 897] \w and others based on Unicode properties

Philip Hazel Wed, 16 Dec 2009 02:58:56 -0800

http://bugs.exim.org/show_bug.cgi?id=897

--- Comment #4 from Philip Hazel <[email protected]>  2009-12-16 10:58:29 ---
On Tue, 15 Dec 2009, Pavel Kostromitinov wrote:

> Here (attached) is my attempt to implement checking for \w as \p{L} - as a
> testcase, all the others will follow.
> It requires UTF8_USES_UCP to be set, along with SUPPORT_UTF8 and SUPPORT_UCP.
> 
> I would greatly appreciate if you could review the changes and correct me if I
> did something wrong, or missed something.

I think you have missed something. Here is the code of your first 
change, with my comments:

    case OP_WORDCHAR:
    if (eptr >= md->end_subject)
      {
      SCHECK_PARTIAL();
      RRETURN(MATCH_NOMATCH);
      }
    GETCHARINCTEST(c, eptr);
    ^^^^^^^^^^^^^^ 
That macro tests for UTF-8 mode, and loads either one byte or a whole 
UTF-8 character into the variable c. 

#ifdef UTF8_USES_UCP
    {
    const ucd_record *prop = GET_UCD(c);
    if (_pcre_ucp_gentype[prop->chartype] != ucp_L)
      RRETURN(MATCH_NOMATCH);
    }

However, your patch runs unconditionally, even when the UTF-8 flag is
not set at runtime. I am not sure that this is right. In non-UTF-8 mode
I would expect everything to behave as ASCII, for backwards
compatibility if for no other reason.

#else
    if (
#ifdef SUPPORT_UTF8
       c >= 256 ||
#endif
       (md->ctypes[c] & ctype_word) == 0
       )
      RRETURN(MATCH_NOMATCH);
#endif
    ecode++;
    break;

At the moment, it is true, the code does make use of GET_UCD() in 
non-UTF-8 mode, but only to process \P and \p. 

With your code, the name UTF8_USES_UCP is not correct, because it always 
uses UCP. Something like GENERICS_USE_UCP might be better (for "generic 
character types"). However, I think I would prefer to keep your name, 
and change the code so that the PCRE_UTF8 flag is needed to cause it to 
be used.
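
To illustrate what I mean by making the runtime flag decide, here is a
minimal standalone sketch. It is not PCRE code: is_unicode_letter() is a
hypothetical stand-in for the GET_UCD() lookup and knows only a few
ranges, and the utf8 argument stands in for the PCRE_UTF8 option. Like
the testcase patch, the UTF-8 branch checks only for \p{L}.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a Unicode property lookup. The real code would
   use GET_UCD(c) and _pcre_ucp_gentype[]; this knows only a few ranges. */
static bool is_unicode_letter(uint32_t c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
         (c >= 0x00C0 && c <= 0x00FF && c != 0x00D7 && c != 0x00F7) ||
         (c >= 0x0410 && c <= 0x044F);   /* Cyrillic capital A to small ya */
}

/* The traditional ASCII meaning of \w: letters, digits, underscore. */
static bool is_ascii_word(uint32_t c)
{
  return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
         (c >= 'a' && c <= 'z') || c == '_';
}

/* The runtime flag, not just the compile-time macro, selects the test. */
static bool is_word_char(uint32_t c, bool utf8)
{
  if (utf8) return is_unicode_letter(c);  /* as in the patch: \w as \p{L} */
  return is_ascii_word(c);                /* non-UTF-8: unchanged behaviour */
}
```

With this shape, the byte 0xE9 is a word character only when the UTF-8
flag is set, and non-UTF-8 callers see exactly what they see today.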

I see that you have not patched the code for OP_WORD_BOUNDARY, around
line 1633. That code is already split into UTF-8 and non-UTF-8 cases.
If you patch only the UTF-8 case, \b will not be consistent with your 
\w patch, for the reason I gave above: your \w change applies in both 
modes.
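
A rough standalone sketch of the consistency I am after: \b should ask
the same question as \w on each side of the current position. Again this
is not PCRE code. is_letter() is a hypothetical, deliberately tiny
stand-in for the property lookup, decode_at() assumes valid UTF-8, and
the utf8 argument stands in for the PCRE_UTF8 option.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny stand-in for a Unicode letter lookup. */
static bool is_letter(uint32_t c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
         (c >= 0x0410 && c <= 0x044F);   /* Cyrillic capital A to small ya */
}

static bool is_ascii_word(uint32_t c)
{
  return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
         (c >= 'a' && c <= 'z') || c == '_';
}

/* One predicate shared by \w and \b, selected by the runtime flag. */
static bool is_word(uint32_t c, bool utf8)
{
  return utf8 ? is_letter(c) : is_ascii_word(c);
}

/* Decode the character starting at p; assumes valid UTF-8 when utf8 is set. */
static uint32_t decode_at(const unsigned char *p, bool utf8)
{
  uint32_t c = *p;
  int extra;
  if (!utf8 || c < 0x80) return c;
  extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
  c &= 0x3Fu >> extra;                     /* keep the lead byte's data bits */
  while (extra-- > 0) c = (c << 6) | (*++p & 0x3Fu);
  return c;
}

/* True if exactly one of the characters around pos is a word character. */
static bool at_word_boundary(const unsigned char *s, size_t len, size_t pos,
                             bool utf8)
{
  bool prev = false, cur = false;
  if (pos > 0)
    {
    size_t i = pos - 1;
    if (utf8)                              /* back up to the lead byte */
      while (i > 0 && (s[i] & 0xC0) == 0x80) i--;
    prev = is_word(decode_at(s + i, utf8), utf8);
    }
  if (pos < len) cur = is_word(decode_at(s + pos, utf8), utf8);
  return prev != cur;
}
```

Because both sides go through is_word(), a change to what \w means in
UTF-8 mode carries over to \b automatically, and the non-UTF-8 path
stays as it is.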

> Also it seems pcre_study.c is to be corrected for this to work, but I
> just pass PCRE_NO_START_OPTIMIZE to pcre_exec for now.

pcre_dfa_exec.c will have to be changed too. I told you this would be a 
big job! :-)

Philip
