http://bugs.exim.org/show_bug.cgi?id=897

--- Comment #4 from Philip Hazel <[email protected]> 2009-12-16 10:58:29 ---
On Tue, 15 Dec 2009, Pavel Kostromitinov wrote:

> Here (attached) is my attempt to implement checking for \w as \p{L} - as a
> testcase, all the others will follow.
> It requires UTF8_USE_UCP to be set, along with SUPPORT_UTF8 and SUPPORT_UCP.
>
> I would greatly appreciate if you could review the changes and correct me if I
> did something wrong, or missed something.

I think you have missed something. Here is the code of your first change,
with my comments:

    case OP_WORDCHAR:
    if (eptr >= md->end_subject)
      {
      SCHECK_PARTIAL();
      RRETURN(MATCH_NOMATCH);
      }
    GETCHARINCTEST(c, eptr);
    ^^^^^^^^^^^^^^

That macro tests for UTF-8 mode, and loads either one byte or a whole
UTF-8 character into the variable c.

    #ifdef UTF8_USES_UCP
      {
      const ucd_record *prop = GET_UCD(c);
      if (_pcre_ucp_gentype[prop->chartype] != ucp_L)
        RRETURN(MATCH_NOMATCH);
      }

However, your patch runs unconditionally, even when the UTF-8 flag is
not set at runtime. I am not sure that this is right. In non-UTF-8 mode
I would expect everything to behave as ASCII, for backwards
compatibility if for no other reason.

    #else
    if (
    #ifdef SUPPORT_UTF8
       c >= 256 ||
    #endif
       (md->ctypes[c] & ctype_word) == 0
       )
      RRETURN(MATCH_NOMATCH);
    #endif
    ecode++;
    break;

At the moment, it is true, the code does make use of GET_UCD() in
non-UTF-8 mode, but only to process \P and \p. With your code, the name
UTF8_USES_UCP is not correct, because it always uses UCP. Something like
GENERICS_USE_UCP might be better (for "generic character types").
However, I think I would prefer to keep your name, and change the code
so that the PCRE_UTF8 flag is needed to cause it to be used.

I see that you have not patched the code for OP_WORD_BOUNDARY, around
line 1633. That code is already split into UTF-8 and non-UTF-8 cases. If
you just patched the UTF-8 case, the result will be different to your \w
patch, for the reason I gave above.
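To make the two points above concrete, here is a minimal standalone sketch. It is not PCRE code: toy_is_letter(), ascii_is_word(), word_char_matches() and at_word_boundary() are hypothetical names invented for this illustration, and the toy letter test stands in for a real Unicode property lookup. It assumes the behaviour suggested above, namely that the property-based test is used only when UTF-8 mode is selected at runtime, and that the boundary test goes through the same predicate as \w so that the two cannot disagree. As in the patch under review, the UTF-8 branch tests for letters only.

```c
#include <ctype.h>

/* Hypothetical stand-in for a real Unicode property lookup. This toy
   version only knows ASCII letters, most Latin-1 letters, and basic
   Cyrillic; a real implementation would consult a property table. */
static int toy_is_letter(unsigned int c)
{
  if (c < 128) return isalpha((int)c) != 0;
  if (c >= 0xC0 && c <= 0xFF) return c != 0xD7 && c != 0xF7;
  return c >= 0x0410 && c <= 0x044F;
}

/* ASCII-only notion of a word character: letters, digits, underscore.
   Anything at or above 128 is never a word character here. */
static int ascii_is_word(unsigned int c)
{
  return c < 128 && (c == '_' || isalnum((int)c) != 0);
}

/* The runtime flag decides which test applies. When utf8_mode is 0 the
   result is the old ASCII behaviour, whatever was compiled in. */
static int word_char_matches(unsigned int c, int utf8_mode)
{
  return utf8_mode ? toy_is_letter(c) : ascii_is_word(c);
}

/* A boundary exists when exactly one neighbour is a word character.
   Reusing word_char_matches() keeps this consistent with the \w test
   in both modes. */
static int at_word_boundary(unsigned int prev, unsigned int next,
                            int utf8_mode)
{
  return word_char_matches(prev, utf8_mode) !=
         word_char_matches(next, utf8_mode);
}
```

With this shape, a character such as U+00E9 counts as a word character only when utf8_mode is non-zero, and a boundary between U+00E9 and a space is reported only in that mode.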
> Also it seems pcre_study.c is to be corrected for this to work, but I
> just pass PCRE_NO_START_OPTIMIZE to pcre_exec for now.

pcre_dfa_exec.c will have to be changed too. I told you this would be a
big job! :-)

Philip
