In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/9c62c74d253b05d5e0ec6c62885030bfbe5ccda3?hp=e37d7e38fe022990bbf0ce90dc77f411ebeb158a>
- Log -----------------------------------------------------------------
commit 9c62c74d253b05d5e0ec6c62885030bfbe5ccda3
Merge: e37d7e3 7016d6e
Author: David Mitchell <[email protected]>
Date:   Wed Sep 26 10:13:03 2012 +0100

    [MERGE] make regex engine handle non-null-terminated strings

commit 7016d6ebb4afd4eb7b71b00f15b7515b5e45fee8
Author: David Mitchell <[email protected]>
Date:   Fri Sep 21 10:29:04 2012 +0100

    stop regex engine reading beyond end of string

    Historically the regex engine has assumed that any string passed to it
    will have a trailing null char. This isn't normally an issue in perl code,
    since perl strings *are* null terminated; but it could cause problems with
    strings returned by XS code, or with someone calling the regex engine
    directly from XS, with strend not pointing at a null char.

    The engine currently relies on there being a null char in the following
    ways.

    First, when at the end of string, the main loop of regmatch() still reads
    in the 'next' character (i.e. the character following the end of string)
    even if it doesn't make any use of it. This precludes using memory mapped
    files as strings for example, since the read off the end would SEGV.

    Second, the matching algorithm often required the trailing character to be
    \0 to work correctly: the test for 'EOF' was "if next char is null *and*
    locinput >= PL_regeol, then stop". So a random non-null trailing char
    could cause an overshoot.

    Thirdly, some match ops require the trailing char to be null to operate
    correctly; for example, \b applied at the end of the string only happens
    to work because the trailing char (\0) happens to match \W.

    Also, some utf8 ops will try to extract the code point at the end, which
    can result in multiple bytes past the end of string being read, and
    possible problems if they don't correspond to well-formed utf8.
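As an illustration only (this is not code from regexec.c, and the names old_at_end, bounded_at_end and tests_disagree are hypothetical), the second point above can be sketched in a few lines of C. The old-style test needs a \0 at the end position before it recognises the end of the string, whereas a pure pointer comparison does not and never reads the byte at the end position:

```c
#include <assert.h>
#include <stddef.h>

/* Old-style end test as described above: treat pos as the end only if
 * the char there is null *and* pos >= end. It reads *pos
 * unconditionally, i.e. possibly one byte past the string. */
static int old_at_end(const char *pos, const char *end)
{
    return *pos == '\0' && pos >= end;
}

/* Bounds-only end test: compares pointers, never dereferences pos. */
static int bounded_at_end(const char *pos, const char *end)
{
    return pos >= end;
}

/* Returns 1 if the two tests disagree at the end of the first len
 * bytes of buf. Because the old test reads buf[len], the caller must
 * supply a buffer of at least len + 1 bytes. */
static int tests_disagree(const char *buf, size_t len)
{
    const char *end = buf + len;
    assert(buf != NULL);
    return old_at_end(end, end) != bounded_at_end(end, end);
}
```

Calling tests_disagree("abx", 2) treats "ab" as the string with a non-null trailing 'x': the old test fails to see the end there while the bounds-only test does. With tests_disagree("ab", 2) the trailing byte is \0 and both tests agree.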
    The main fix is in S_regmatch, where the 'read next char' code has been
    updated to set nextchr to a special value, NEXTCHR_EOS, whenever the read
    would go past the end of the string.

    Lots of other random bits in the regex engine needed to be fixed up too.
    To track these down, I temporarily hacked regexec_flags() to make a copy
    of the string but without trailing \0, then ran all the t/re/*.t tests
    under valgrind to flush out all buffer overruns. So I think I've removed
    most of the bad code, but by no means all of it. The code within the
    various functions in regexec.c is far too complex to audit visually with
    any confidence.

M       MANIFEST
M       ext/XS-APItest/APItest.pm
M       ext/XS-APItest/APItest.xs
A       ext/XS-APItest/t/callregexec.t
M       regexec.c

commit 895cc420d0398ff184560679b40f5f2c0af72366
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:39:06 2012 +0100

    regmatch(): fix typo in TRIE commentary text

M       regexec.c

commit 3c0563b938225774f2298a18ae180520bc33a48c
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:33:08 2012 +0100

    regmatch() annotate ops and separate out branches

    Annotate each 'case OP:' in the main switch in regmatch() to show what
    regex pattern this implements. About half the ops had already been done.
    Also add a blank line between each 'case' statement for readability.

    (no code changes)

M       regexec.c

commit 3640db6ba49a54c99246b5b4b2b9a2840cfdaef3
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 16:19:10 2012 +0100

    regmatch(): do nextchr=*locinput at top of loop

    Currently each branch in the main regmatch() loop is responsible for
    re-initialising nextchr to UCHARAT(locinput) if locinput is modified.

    By adding
        nextchr = UCHARAT(locinput);
    to the head of the loop, we can remove most of the nextchr assignments
    in the individual branches. We lose slightly for the zero-width
    assertions like \b which will re-read the same nextchr, but this will
    make it easier to handle non-null-terminated strings.
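A minimal sketch of the "read at the head of the loop" idea, combined with the end-of-string sentinel that the later commit in this series introduces. This is not the regmatch() code: count_leading_digits and eol are hypothetical names chosen for illustration, and only NEXTCHR_EOS and its value of -10 are taken from the diff in this message.

```c
#include <assert.h>
#include <stddef.h>

#define NEXTCHR_EOS (-10)   /* sentinel: nextchr has fallen off the end */

/* Hypothetical mini-matcher: count how many ASCII digits the len-byte
 * string s starts with. s need not be null terminated. */
static size_t count_leading_digits(const char *s, size_t len)
{
    const char *locinput = s;
    const char *eol = s + len;   /* plays the role of PL_regeol */
    size_t count = 0;

    for (;;) {
        /* Single read at the head of the loop; never touches *eol. */
        int nextchr = (locinput < eol) ? (unsigned char)*locinput
                                       : NEXTCHR_EOS;
        assert(nextchr >= 0 || nextchr == NEXTCHR_EOS);

        if (nextchr == NEXTCHR_EOS)
            break;               /* reached the end of the string */
        if (nextchr < '0' || nextchr > '9')
            break;               /* no match at this position */

        /* The branch only advances locinput; nextchr is refreshed at
         * the top of the next iteration. */
        locinput++;
        count++;
    }
    return count;
}
```

For example, count_leading_digits("123", 2) looks only at "12" and so never reads the trailing '3', which mirrors what the new callregexec.t tests check by passing a length one char shorter than the buffer.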
M       regexec.c

commit bf798dc4f68faa2dc325a5c35f641f6a172a48bd
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 15:46:47 2012 +0100

    regmatch(): nextchar should always be positive

    Remove the one bit of code that tests for < 0, and put in a general
    assert.

M       regexec.c

commit 28b98f76c447cec8a7ac29d73752c2c930de819a
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 12:37:33 2012 +0100

    regmatch(): consolidate locinput++

    There are several places in the code that increment locinput by 1 char
    (which may or may not be 1 byte) then update nextchr. Consolidate these
    into a single code block with the others goto'ing it.

    This actually reduces the code more than it appears, since the CCC_TRY*
    macros expand into several branches, each of which repeats the increment
    code.

M       regexec.c

commit b9b31e9d0400a852436a6d750c074dc32b69492e
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 11:28:08 2012 +0100

    regmatch(): use nextchar where available

    In a couple of places the code was using *locinput, where nextchr
    already equalled *locinput

M       regexec.c
-----------------------------------------------------------------------

Summary of changes:
 MANIFEST                       |    1 +
 ext/XS-APItest/APItest.pm      |    2 +-
 ext/XS-APItest/APItest.xs      |   24 ++
 ext/XS-APItest/t/callregexec.t |   66 ++++++
 regexec.c                      |  482 ++++++++++++++++++++++-----------------
 5 files changed, 364 insertions(+), 211 deletions(-)
 create mode 100644 ext/XS-APItest/t/callregexec.t

diff --git a/MANIFEST b/MANIFEST
index a3e752b..a6884d0 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -3989,6 +3989,7 @@ ext/XS-APItest/t/blockhooks.t XS::APItest: tests for PL_blockhooks
 ext/XS-APItest/t/Block.pm Helper for ./blockhooks.t
 ext/XS-APItest/t/call_checker.t test call checker plugin API
 ext/XS-APItest/t/caller.t XS::APItest: tests for caller_cx
+ext/XS-APItest/t/callregexec.t XS::APItest: tests for CALLREGEXEC()
 ext/XS-APItest/t/call.t XS::APItest extension
 ext/XS-APItest/t/check_warnings.t test scope of "Too late for CHECK"
ext/XS-APItest/t/cleanup.t test stack behaviour on unwinding diff --git a/ext/XS-APItest/APItest.pm b/ext/XS-APItest/APItest.pm index 749af95..f33b80b 100644 --- a/ext/XS-APItest/APItest.pm +++ b/ext/XS-APItest/APItest.pm @@ -5,7 +5,7 @@ use strict; use warnings; use Carp; -our $VERSION = '0.43'; +our $VERSION = '0.44'; require XSLoader; diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs index 08694e6..357b033 100644 --- a/ext/XS-APItest/APItest.xs +++ b/ext/XS-APItest/APItest.xs @@ -3407,6 +3407,30 @@ CODE: OUTPUT: RETVAL + # provide access to CALLREGEXEC, except replace pointers within the + # string with offsets from the start of the string + +I32 +callregexec(SV *prog, STRLEN stringarg, STRLEN strend, I32 minend, SV *sv, U32 nosave) +CODE: + { + STRLEN len; + char *strbeg; + if (SvROK(prog)) + prog = SvRV(prog); + strbeg = SvPV_force(sv, len); + RETVAL = CALLREGEXEC((REGEXP *)prog, + strbeg + stringarg, + strbeg + strend, + strbeg, + minend, + sv, + NULL, /* data */ + nosave); + } +OUTPUT: + RETVAL + MODULE = XS::APItest PACKAGE = XS::APItest::AUTOLOADtest diff --git a/ext/XS-APItest/t/callregexec.t b/ext/XS-APItest/t/callregexec.t new file mode 100644 index 0000000..3111390 --- /dev/null +++ b/ext/XS-APItest/t/callregexec.t @@ -0,0 +1,66 @@ +#!perl + +# test CALLREGEXEC() +# (currently it just checks that it handles non-\0 terminated strings; +# full tests haven't been added yet) + +use warnings; +use strict; + +use XS::APItest; +*callregexec = *XS::APItest::callregexec; + +use Test::More tests => 50; + +# Test that the regex engine can handle strings without terminating \0 +# XXX This is by no means comprehensive; it doesn't test all ops, nor all +# code paths within those ops (especially not utf8). + + +# this sub takes a string that has an extraneous char at the end. 
+# First see if the string (less the last char) matches the regex; +# then see if that string (including the last char) matches when +# calling callregexec(), but with the length arg set to 1 char less than +# the length of the string. +# In theory the result should be the same for both matches, since +# they should both not 'see' the final char. + +sub try { + my ($str, $re, $exp, $desc) = @_; + + my $str1 = substr($str, 0, -1); + ok !!$exp == !!($str1 =~ $re), "$desc str =~ qr"; + + my $bytes = do { use bytes; length $str1 }; + ok !!$exp == !!callregexec($re, 0, $bytes, 0, $str, 0), + "$desc callregexec"; +} + + +{ + try "\nx", qr/\n^/m, 0, 'MBOL'; + try "ax", qr/a$/m, 1, 'MEOL'; + try "ax", qr/a$/s, 1, 'SEOL'; + try "abx", qr/^(ab|X)./s, 0, 'SANY'; + try "abx", qr/^(ab|X)\C/, 0, 'CANY'; + try "abx", qr/^(ab|X)./, 0, 'REG_ANY'; + try "abx", qr/^ab(c|d|e|x)/, 0, 'TRIE/TRIEC'; + try "abx", qr/^abx/, 0, 'EXACT'; + try "abx", qr/^ABX/i, 0, 'EXACTF'; + try "abx", qr/^ab\b/, 1, 'BOUND'; + try "ab-", qr/^ab\B/, 0, 'NBOUND'; + try "aas", qr/a[st]/, 0, 'ANYOF'; + try "aas", qr/a[s\xDF]/i, 0, 'ANYOFV'; + try "ab1", qr/ab\d/, 0, 'DIGIT'; + try "ab\n", qr/ab[[:ascii:]]/, 0, 'POSIX'; + try "aP\x{307}", qr/^a\X/, 1, 'CLUMP 1'; + try "aP\x{307}x", qr/^a\X/, 1, 'CLUMP 2'; + try "\x{100}\r\n", qr/^\x{100}\X/, 1, 'CLUMP 3'; + try "abb", qr/^a(b)\1/, 0, 'REF'; + try "ab\n", qr/^.+\R/, 0, 'LNBREAK'; + try "ab\n", qr/^.+\v/, 0, 'VERTWS'; + try "abx", qr/^.+\V/, 1, 'NVERTWS'; + try "ab\t", qr/^.+\h/, 0, 'HORIZWS'; + try "abx", qr/^.+\H/, 1, 'NHORIZWS'; + try "abx", qr/a.*x/, 0, 'CURLY'; +} diff --git a/regexec.c b/regexec.c index f207cda..989affa 100644 --- a/regexec.c +++ b/regexec.c @@ -121,6 +121,18 @@ #define HOP3(pos,off,lim) (PL_reg_match_utf8 ? 
reghop3((U8*)(pos), off, (U8*)(lim)) : (U8*)(pos + off)) #define HOP3c(pos,off,lim) ((char*)HOP3(pos,off,lim)) + +#define NEXTCHR_EOS -10 /* nextchr has fallen off the end */ +#define NEXTCHR_IS_EOS (nextchr < 0) + +#define SET_nextchr \ + nextchr = ((locinput < PL_regeol) ? UCHARAT(locinput) : NEXTCHR_EOS) + +#define SET_locinput(p) \ + locinput = (p); \ + SET_nextchr + + /* these are unrolled below in the CCC_TRY_XXX defined */ #define LOAD_UTF8_CHARCLASS(class,str) STMT_START { \ if (!CAT2(PL_utf8_,class)) { \ @@ -165,7 +177,7 @@ * fails, or advance to the next character */ #define _CCC_TRY_CODE(POS_OR_NEG, FUNC, UTF8_TEST, CLASS, STR) \ - if (locinput >= PL_regeol) { \ + if (NEXTCHR_IS_EOS) { \ sayNO; \ } \ if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ @@ -173,15 +185,11 @@ if (POS_OR_NEG (UTF8_TEST)) { \ sayNO; \ } \ - locinput += PL_utf8skip[nextchr]; \ - nextchr = UCHARAT(locinput); \ - break; \ } \ - if (POS_OR_NEG (FUNC(nextchr))) { \ - sayNO; \ + else if (POS_OR_NEG (FUNC(nextchr))) { \ + sayNO; \ } \ - nextchr = UCHARAT(++locinput); \ - break; + goto increment_locinput; /* Handle the non-locale cases for a character class and its complement. It * calls _CCC_TRY_CODE with a ! to complement the test for the character class. @@ -223,24 +231,17 @@ _CCC_TRY_CODE( PLACEHOLDER, LCFUNC, LCFUNC_utf8((U8*)locinput), \ CLASS, STR) \ case NAMEA: \ - if (locinput >= PL_regeol || ! FUNCA(nextchr)) { \ + if (NEXTCHR_IS_EOS || ! 
FUNCA(nextchr)) { \ sayNO; \ } \ /* Matched a utf8-invariant, so don't have to worry about utf8 */ \ - nextchr = UCHARAT(++locinput); \ + locinput++; \ break; \ case NNAMEA: \ - if (locinput >= PL_regeol || FUNCA(nextchr)) { \ + if (NEXTCHR_IS_EOS || FUNCA(nextchr)) { \ sayNO; \ } \ - if (utf8_target) { \ - locinput += PL_utf8skip[nextchr]; \ - nextchr = UCHARAT(locinput); \ - } \ - else { \ - nextchr = UCHARAT(++locinput); \ - } \ - break; \ + goto increment_locinput; \ /* Generate the non-locale cases */ \ _CCC_TRY_NONLOCALE(NAME, NNAME, FUNC, CLASS, STR) @@ -608,7 +609,21 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV *sv, char *strpos, goto fail; } - strbeg = (sv && SvPOK(sv)) ? strend - SvCUR(sv) : strpos; + /* XXX we need to pass strbeg as a separate arg: the following is + * guesswork and can be wrong... */ + if (sv && SvPOK(sv)) { + char * p = SvPVX(sv); + STRLEN cur = SvCUR(sv); + if (p <= strpos && strpos < p + cur) { + strbeg = p; + assert(p <= strend && strend <= p + cur); + } + else + strbeg = strend - cur; + } + else + strbeg = strpos; + PL_regeol = strend; if (utf8_target) { if (!prog->check_utf8 && prog->check_substr) @@ -1249,7 +1264,7 @@ STMT_START { \ #define REXEC_FBC_UTF8_SCAN(CoDe) \ STMT_START { \ - while (s + (uskip = UTF8SKIP(s)) <= strend) { \ + while (s < strend && s + (uskip = UTF8SKIP(s)) <= strend) { \ CoDe \ s += uskip; \ } \ @@ -1774,32 +1789,32 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, break; case LNBREAK: REXEC_FBC_CSCAN( - is_LNBREAK_utf8(s), - is_LNBREAK_latin1(s) + is_LNBREAK_utf8_safe(s, strend), + is_LNBREAK_latin1_safe(s, strend) ); break; case VERTWS: REXEC_FBC_CSCAN( - is_VERTWS_utf8(s), - is_VERTWS_latin1(s) + is_VERTWS_utf8_safe(s, strend), + is_VERTWS_latin1_safe(s, strend) ); break; case NVERTWS: REXEC_FBC_CSCAN( - !is_VERTWS_utf8(s), - !is_VERTWS_latin1(s) + !is_VERTWS_utf8_safe(s, strend), + !is_VERTWS_latin1_safe(s, strend) ); break; case HORIZWS: REXEC_FBC_CSCAN( - is_HORIZWS_utf8(s), - 
is_HORIZWS_latin1(s) + is_HORIZWS_utf8_safe(s, strend), + is_HORIZWS_latin1_safe(s, strend) ); break; case NHORIZWS: REXEC_FBC_CSCAN( - !is_HORIZWS_utf8(s), - !is_HORIZWS_latin1(s) + !is_HORIZWS_utf8_safe(s, strend), + !is_HORIZWS_latin1_safe(s, strend) ); break; case POSIXA: @@ -1934,16 +1949,24 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, } points[pointpos++ % maxlen]= uc; - REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, + if (foldlen || uc < (U8*)strend) { + REXEC_TRIE_READ_CHAR(trie_type, trie, + widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags); - DEBUG_TRIE_EXECUTE_r({ - dump_exec_pos( (char *)uc, c, strend, real_start, - s, utf8_target ); - PerlIO_printf(Perl_debug_log, - " Charid:%3u CP:%4"UVxf" ", - charid, uvc); - }); + DEBUG_TRIE_EXECUTE_r({ + dump_exec_pos( (char *)uc, c, strend, + real_start, s, utf8_target); + PerlIO_printf(Perl_debug_log, + " Charid:%3u CP:%4"UVxf" ", + charid, uvc); + }); + } + else { + len = 0; + charid = 0; + } + do { #ifdef DEBUGGING @@ -2391,7 +2414,11 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, register char *stre while (s <= last1) { if (regtry(&reginfo, &s)) goto got_it; - s += UTF8SKIP(s); + if (s >= last1) { + s++; /* to break out of outer loop */ + break; + } + s += UTF8SKIP(s); } } else { @@ -2693,7 +2720,6 @@ phooey: Safefree(prog->offs); prog->offs = swap; } - return 0; } @@ -3299,7 +3325,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) st = PL_regmatch_state = S_push_slab(aTHX); /* Note that nextchr is a byte even in UTF */ - nextchr = UCHARAT(locinput); + SET_nextchr; scan = prog; while (scan != NULL) { @@ -3324,31 +3350,36 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) reenter_switch: + SET_nextchr; + switch (state_num) { - case BOL: + case BOL: /* /^../ */ if (locinput == PL_bostr) { /* reginfo->till = reginfo->bol; */ break; } sayNO; - case MBOL: + + case MBOL: /* /^../m */ if (locinput == 
PL_bostr || - ((nextchr || locinput < PL_regeol) && locinput[-1] == '\n')) + (!NEXTCHR_IS_EOS && locinput[-1] == '\n')) { break; } sayNO; - case SBOL: + + case SBOL: /* /^../s */ if (locinput == PL_bostr) break; sayNO; - case GPOS: + + case GPOS: /* \G */ if (locinput == reginfo->ganch) break; sayNO; - case KEEPS: + case KEEPS: /* \K */ /* update the startpoint */ st->u.keeper.val = rex->offs[0].start; rex->offs[0].start = locinput - PL_bostr; @@ -3359,60 +3390,52 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) rex->offs[0].start = st->u.keeper.val; sayNO_SILENT; /*NOT-REACHED*/ - case EOL: + + case EOL: /* /..$/ */ goto seol; - case MEOL: - if ((nextchr || locinput < PL_regeol) && nextchr != '\n') + + case MEOL: /* /..$/m */ + if (!NEXTCHR_IS_EOS && nextchr != '\n') sayNO; break; - case SEOL: + + case SEOL: /* /..$/s */ seol: - if ((nextchr || locinput < PL_regeol) && nextchr != '\n') + if (!NEXTCHR_IS_EOS && nextchr != '\n') sayNO; if (PL_regeol - locinput > 1) sayNO; break; - case EOS: - if (PL_regeol != locinput) + + case EOS: /* \z */ + if (!NEXTCHR_IS_EOS) sayNO; break; - case SANY: - if (!nextchr && locinput >= PL_regeol) + + case SANY: /* /./s */ + if (NEXTCHR_IS_EOS) sayNO; - if (utf8_target) { - locinput += PL_utf8skip[nextchr]; - if (locinput > PL_regeol) - sayNO; - nextchr = UCHARAT(locinput); - } - else - nextchr = UCHARAT(++locinput); - break; - case CANY: - if (!nextchr && locinput >= PL_regeol) + goto increment_locinput; + + case CANY: /* \C */ + if (NEXTCHR_IS_EOS) sayNO; - nextchr = UCHARAT(++locinput); + locinput++; break; - case REG_ANY: - if ((!nextchr && locinput >= PL_regeol) || nextchr == '\n') + + case REG_ANY: /* /./ */ + if ((NEXTCHR_IS_EOS) || nextchr == '\n') sayNO; - if (utf8_target) { - locinput += PL_utf8skip[nextchr]; - if (locinput > PL_regeol) - sayNO; - nextchr = UCHARAT(locinput); - } - else - nextchr = UCHARAT(++locinput); - break; + goto increment_locinput; + #undef ST #define ST st->u.trie - case 
TRIEC: + case TRIEC: /* (ab|cd) with known charclass */ /* In this case the charclass data is available inline so we can fail fast without a lot of extra overhead. */ - if(!ANYOF_BITMAP_TEST(scan, *locinput)) { + if(!NEXTCHR_IS_EOS && !ANYOF_BITMAP_TEST(scan, nextchr)) { DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, "%*s %sfailed to match trie start class...%s\n", @@ -3422,7 +3445,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) assert(0); /* NOTREACHED */ } /* FALL THROUGH */ - case TRIE: + case TRIE: /* (ab|cd) */ /* the basic plan of execution of the trie is: * At the beginning, run though all the states, and * find the longest-matching word. Also remember the position @@ -3431,7 +3454,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) * ab|a|x|abcd|abc * when matched against the string "abcde", will generate * accept states for all words except 3, with the longest - * matching word being 4, and the shortest being 1 (with + * matching word being 4, and the shortest being 2 (with * the position being after char 1 of the string). * * Then for each matching word, in word order (i.e. 
1,2,4,5), @@ -3477,7 +3500,9 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate; - if (trie->bitmap && !TRIE_BITMAP_TEST(trie,*locinput) ) { + if ( trie->bitmap + && (NEXTCHR_IS_EOS || !TRIE_BITMAP_TEST(trie, nextchr))) + { if (trie->states[ state ].wordnum) { DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, @@ -3550,7 +3575,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) }); /* read a char and goto next state */ - if ( base ) { + if ( base && (foldlen || uc < (U8*)PL_regeol)) { I32 offset; REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, @@ -3741,13 +3766,12 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) }); locinput = (char*)uc; - nextchr = UCHARAT(locinput); continue; /* execute rest of RE */ assert(0); /* NOTREACHED */ } #undef ST - case EXACT: { + case EXACT: { /* /abc/ */ char *s = STRING(scan); ln = STR_LEN(scan); if (utf8_target != UTF_PATTERN) { @@ -3809,7 +3833,6 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) } } locinput = l; - nextchr = UCHARAT(locinput); break; } /* The target and the pattern have the same utf8ness. 
*/ @@ -3821,10 +3844,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) if (ln > 1 && memNE(s, locinput, ln)) sayNO; locinput += ln; - nextchr = UCHARAT(locinput); break; } - case EXACTFL: { + + case EXACTFL: { /* /abc/il */ re_fold_t folder; const U8 * fold_array; const char * s; @@ -3836,21 +3859,21 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) fold_utf8_flags = FOLDEQ_UTF8_LOCALE; goto do_exactf; - case EXACTFU_SS: - case EXACTFU_TRICKYFOLD: - case EXACTFU: + case EXACTFU_SS: /* /\x{df}/iu */ + case EXACTFU_TRICKYFOLD: /* /\x{390}/iu */ + case EXACTFU: /* /abc/iu */ folder = foldEQ_latin1; fold_array = PL_fold_latin1; fold_utf8_flags = (UTF_PATTERN) ? FOLDEQ_S1_ALREADY_FOLDED : 0; goto do_exactf; - case EXACTFA: + case EXACTFA: /* /abc/iaa */ folder = foldEQ_latin1; fold_array = PL_fold_latin1; fold_utf8_flags = FOLDEQ_UTF8_NOMIX_ASCII; goto do_exactf; - case EXACTF: + case EXACTF: /* /abc/i */ folder = foldEQ; fold_array = PL_fold; fold_utf8_flags = 0; @@ -3871,7 +3894,6 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) sayNO; } locinput = e; - nextchr = UCHARAT(locinput); break; } @@ -3886,23 +3908,22 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) if (ln > 1 && ! folder(s, locinput, ln)) sayNO; locinput += ln; - nextchr = UCHARAT(locinput); break; } /* XXX Could improve efficiency by separating these all out using a * macro or in-line function. At that point regcomp.c would no longer * have to set the FLAGS fields of these */ - case BOUNDL: - case NBOUNDL: + case BOUNDL: /* /\b/l */ + case NBOUNDL: /* /\B/l */ PL_reg_flags |= RF_tainted; /* FALL THROUGH */ - case BOUND: - case BOUNDU: - case BOUNDA: - case NBOUND: - case NBOUNDU: - case NBOUNDA: + case BOUND: /* /\b/ */ + case BOUNDU: /* /\b/u */ + case BOUNDA: /* /\b/a */ + case NBOUND: /* /\B/ */ + case NBOUNDU: /* /\B/u */ + case NBOUNDA: /* /\B/a */ /* was last char in word? 
*/ if (utf8_target && FLAGS(scan) != REGEX_ASCII_RESTRICTED_CHARSET @@ -3917,12 +3938,17 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) } if (FLAGS(scan) != REGEX_LOCALE_CHARSET) { ln = isALNUM_uni(ln); - LOAD_UTF8_CHARCLASS_ALNUM(); - n = swash_fetch(PL_utf8_alnum, (U8*)locinput, utf8_target); + if (NEXTCHR_IS_EOS) + n = 0; + else { + LOAD_UTF8_CHARCLASS_ALNUM(); + n = swash_fetch(PL_utf8_alnum, (U8*)locinput, + utf8_target); + } } else { ln = isALNUM_LC_uvchr(UNI_TO_NATIVE(ln)); - n = isALNUM_LC_utf8((U8*)locinput); + n = NEXTCHR_IS_EOS ? 0 : isALNUM_LC_utf8((U8*)locinput); } } else { @@ -3943,20 +3969,20 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) switch (FLAGS(scan)) { case REGEX_UNICODE_CHARSET: ln = isWORDCHAR_L1(ln); - n = isWORDCHAR_L1(nextchr); + n = NEXTCHR_IS_EOS ? 0 : isWORDCHAR_L1(nextchr); break; case REGEX_LOCALE_CHARSET: ln = isALNUM_LC(ln); - n = isALNUM_LC(nextchr); + n = NEXTCHR_IS_EOS ? 0 : isALNUM_LC(nextchr); break; case REGEX_DEPENDS_CHARSET: ln = isALNUM(ln); - n = isALNUM(nextchr); + n = NEXTCHR_IS_EOS ? 0 : isALNUM(nextchr); break; case REGEX_ASCII_RESTRICTED_CHARSET: case REGEX_ASCII_MORE_RESTRICTED_CHARSET: ln = isWORDCHAR_A(ln); - n = isWORDCHAR_A(nextchr); + n = NEXTCHR_IS_EOS ? 
0 : isWORDCHAR_A(nextchr); break; default: Perl_croak(aTHX_ "panic: Unexpected FLAGS %u in op %u", FLAGS(scan), OP(scan)); @@ -3968,31 +3994,28 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) if (((!ln) == (!n)) == (OP(scan) < NBOUND)) sayNO; break; - case ANYOFV: - case ANYOF: + + case ANYOFV: /* /[abx{df}]/i */ + case ANYOF: /* /[abc]/ */ + if (NEXTCHR_IS_EOS) + sayNO; if (utf8_target || state_num == ANYOFV) { STRLEN inclasslen = PL_regeol - locinput; - if (locinput >= PL_regeol) - sayNO; - if (!reginclass(rex, scan, (U8*)locinput, &inclasslen, utf8_target)) sayNO; locinput += inclasslen; - nextchr = UCHARAT(locinput); break; } else { - if (nextchr < 0) - nextchr = UCHARAT(locinput); - if (!nextchr && locinput >= PL_regeol) - sayNO; if (!REGINCLASS(rex, scan, (U8*)locinput)) sayNO; - nextchr = UCHARAT(++locinput); + locinput++; break; } break; - /* Special char classes - The defines start on line 129 or so */ + + /* Special char classes: \d, \w etc. + * The defines start on line 166 or so */ CCC_TRY_U(ALNUM, NALNUM, isWORDCHAR, ALNUML, NALNUML, isALNUM_LC, isALNUM_LC_utf8, ALNUMU, NALNUMU, isWORDCHAR_L1, @@ -4010,25 +4033,19 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) DIGITA, NDIGITA, isDIGIT_A, digit, "0"); - case POSIXA: - if (locinput >= PL_regeol || ! _generic_isCC_A(nextchr, FLAGS(scan))) { + case POSIXA: /* /[[:ascii:]]/ etc */ + if (NEXTCHR_IS_EOS || ! 
_generic_isCC_A(nextchr, FLAGS(scan))) { sayNO; } /* Matched a utf8-invariant, so don't have to worry about utf8 */ - nextchr = UCHARAT(++locinput); + locinput++; break; - case NPOSIXA: - if (locinput >= PL_regeol || _generic_isCC_A(nextchr, FLAGS(scan))) { + + case NPOSIXA: /* /[^[:ascii:]]/ etc */ + if (NEXTCHR_IS_EOS || _generic_isCC_A(nextchr, FLAGS(scan))) { sayNO; } - if (utf8_target) { - locinput += PL_utf8skip[nextchr]; - nextchr = UCHARAT(locinput); - } - else { - nextchr = UCHARAT(++locinput); - } - break; + goto increment_locinput; case CLUMP: /* Match \X: logical Unicode character. This is defined as a Unicode extended Grapheme Cluster */ @@ -4064,7 +4081,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) Prepend, that one will be a suitable Begin. */ - if (locinput >= PL_regeol) + if (NEXTCHR_IS_EOS) sayNO; if (! utf8_target) { @@ -4080,7 +4097,9 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) /* Utf8: See if is ( CR LF ); already know that locinput < * PL_regeol, so locinput+1 is in bounds */ - if (nextchr == '\r' && UCHARAT(locinput + 1) == '\n') { + if ( nextchr == '\r' && locinput+1 < PL_regeol + && UCHARAT(locinput + 1) == '\n') + { locinput += 2; } else { @@ -4213,10 +4232,9 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) exit_utf8: if (locinput > PL_regeol) sayNO; } - nextchr = UCHARAT(locinput); break; - case NREFFL: + case NREFFL: /* /\g{name}/il */ { /* The capture buffer cases. 
The ones beginning with N for the named buffers just convert to the equivalent numbered and pretend they were called as the corresponding numbered buffer @@ -4236,28 +4254,28 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) utf8_fold_flags = FOLDEQ_UTF8_LOCALE; goto do_nref; - case NREFFA: + case NREFFA: /* /\g{name}/iaa */ folder = foldEQ_latin1; fold_array = PL_fold_latin1; type = REFFA; utf8_fold_flags = FOLDEQ_UTF8_NOMIX_ASCII; goto do_nref; - case NREFFU: + case NREFFU: /* /\g{name}/iu */ folder = foldEQ_latin1; fold_array = PL_fold_latin1; type = REFFU; utf8_fold_flags = 0; goto do_nref; - case NREFF: + case NREFF: /* /\g{name}/i */ folder = foldEQ; fold_array = PL_fold; type = REFF; utf8_fold_flags = 0; goto do_nref; - case NREF: + case NREF: /* /\g{name}/ */ type = REF; folder = NULL; fold_array = NULL; @@ -4273,32 +4291,32 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) } goto do_nref_ref_common; - case REFFL: + case REFFL: /* /\1/il */ PL_reg_flags |= RF_tainted; folder = foldEQ_locale; fold_array = PL_fold_locale; utf8_fold_flags = FOLDEQ_UTF8_LOCALE; goto do_ref; - case REFFA: + case REFFA: /* /\1/iaa */ folder = foldEQ_latin1; fold_array = PL_fold_latin1; utf8_fold_flags = FOLDEQ_UTF8_NOMIX_ASCII; goto do_ref; - case REFFU: + case REFFU: /* /\1/iu */ folder = foldEQ_latin1; fold_array = PL_fold_latin1; utf8_fold_flags = 0; goto do_ref; - case REFF: + case REFF: /* /\1/i */ folder = foldEQ; fold_array = PL_fold; utf8_fold_flags = 0; goto do_ref; - case REF: + case REF: /* /\1/ */ folder = NULL; fold_array = NULL; utf8_fold_flags = 0; @@ -4332,12 +4350,12 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) sayNO; } locinput = limit; - nextchr = UCHARAT(locinput); break; } /* Not utf8: Inline the first character, for speed. 
*/ - if (UCHARAT(s) != nextchr && + if (!NEXTCHR_IS_EOS && + UCHARAT(s) != nextchr && (type == REF || UCHARAT(s) != fold_array[nextchr])) sayNO; @@ -4349,13 +4367,16 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) : ! folder(s, locinput, ln))) sayNO; locinput += ln; - nextchr = UCHARAT(locinput); break; } - case NOTHING: - case TAIL: + + case NOTHING: /* null op; e.g. the 'nothing' following + * the '*' in m{(a+|b)*}' */ + break; + case TAIL: /* placeholder while compiling (A|B|C) */ break; - case BACK: + + case BACK: /* ??? doesn't appear to be used ??? */ break; #undef ST @@ -4367,7 +4388,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) regexp_internal *rei; regnode *startpoint; - case GOSTART: + case GOSTART: /* (?R) */ case GOSUB: /* /(...(?1))/ /(...(?&foo))/ */ if (cur_eval && cur_eval->locinput==locinput) { if (cur_eval->u.eval.close_paren == (U32)ARG(scan)) @@ -4391,6 +4412,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) } goto eval_recurse_doit; assert(0); /* NOTREACHED */ + case EVAL: /* /(?{A})B/ /(??{A})B/ and /(?(?{A})X|Y)B/ */ if (cur_eval && cur_eval->locinput==locinput) { if ( ++nochange_depth > max_nochange_depth ) @@ -4719,7 +4741,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) sayNO_SILENT; #undef ST - case OPEN: + case OPEN: /* ( */ n = ARG(scan); /* which paren pair */ rex->offs[n].start_tmp = locinput - PL_bostr; if (n > PL_regsize) @@ -4748,7 +4770,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) (IV)rex->offs[n].end \ )) - case CLOSE: + case CLOSE: /* ) */ n = ARG(scan); /* which paren pair */ CLOSE_CAPTURE; /*if (n > PL_regsize) @@ -4760,7 +4782,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) goto fake_end; } break; - case ACCEPT: + + case ACCEPT: /* (*ACCEPT) */ if (ARG(scan)){ regnode *cursor; for (cursor=scan; @@ -4785,22 +4808,27 @@ S_regmatch(pTHX_ regmatch_info 
*reginfo, char *startpos, regnode *prog)
                 }
                 goto fake_end;
                 /*NOTREACHED*/
-        case GROUPP:
+
+        case GROUPP: /* (?(1)) */
             n = ARG(scan); /* which paren pair */
             sw = cBOOL(rex->lastparen >= n && rex->offs[n].end != -1);
             break;
-        case NGROUPP:
+
+        case NGROUPP: /* (?(<name>)) */
             /* reg_check_named_buff_matched returns 0 for no match */
             sw = cBOOL(0 < reg_check_named_buff_matched(rex,scan));
             break;
-        case INSUBP:
+
+        case INSUBP: /* (?(R)) */
             n = ARG(scan);
             sw = (cur_eval && (!n || cur_eval->u.eval.close_paren == n));
             break;
-        case DEFINEP:
+
+        case DEFINEP: /* (?(DEFINE)) */
             sw = 0;
             break;
-        case IFTHEN:
+
+        case IFTHEN: /* (?(cond)A|B) */
             PL_reg_leftiter = PL_reg_maxiter; /* Void cache */
             if (sw)
                 next = NEXTOPER(NEXTOPER(scan));
@@ -4810,7 +4838,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
                     next = NEXTOPER(NEXTOPER(next));
             }
             break;
-        case LOGICAL:
+
+        case LOGICAL: /* modifier for EVAL and IFMATCH */
             logical = scan->flags;
             break;
 
@@ -5171,11 +5200,13 @@ NULL
                 PUSH_STATE_GOTO(BRANCH_next, scan, locinput);
             }
             assert(0); /* NOTREACHED */
-        case CUTGROUP:
+
+        case CUTGROUP: /* /(*THEN)/ */
             sv_yes_mark = st->u.mark.mark_name = scan->flags ? NULL :
                 MUTABLE_SV(rexi->data->data[ ARG( scan ) ]);
             PUSH_STATE_GOTO(CUTGROUP_next, next, locinput);
             assert(0); /* NOTREACHED */
+
         case CUTGROUP_next_fail:
             do_cutgroup = 1;
             no_final = 1;
@@ -5183,9 +5214,11 @@ NULL
                 sv_commit = st->u.mark.mark_name;
             sayNO;
             assert(0); /* NOTREACHED */
+
         case BRANCH_next:
             sayYES;
             assert(0); /* NOTREACHED */
+
         case BRANCH_next_fail: /* that branch failed; try the next, if any */
             if (do_cutgroup) {
                 do_cutgroup = 0;
@@ -5208,7 +5241,7 @@ NULL
             continue; /* execute next BRANCH[J] op */
             assert(0); /* NOTREACHED */
 
-        case MINMOD:
+        case MINMOD: /* next op will be non-greedy, e.g. A*? */
             minmod = 1;
             break;
 
@@ -5336,7 +5369,8 @@ NULL
                     (int)(REPORT_CODE_OFF+(depth*2)),
                     "", (IV)ST.count)
                 );
-            if (ST.c1 != CHRTEST_VOID
+            if ( !NEXTCHR_IS_EOS
+                    && ST.c1 != CHRTEST_VOID
                     && nextchr != ST.c1
                     && nextchr != ST.c2)
             {
@@ -5390,8 +5424,7 @@ NULL
             if (ST.count == ARG1(ST.me) /* min */)
                 sayNO;
             ST.count--;
-            locinput = HOPc(locinput, -ST.alen);
-            nextchr = UCHARAT(locinput);
+            SET_locinput(HOPc(locinput, -ST.alen));
             goto curlym_do_B; /* try to match B */
 
 #undef ST
@@ -5419,12 +5452,14 @@ NULL
             ST.max = REG_INFTY;
             scan = NEXTOPER(scan);
             goto repeat;
+
         case PLUS: /* /A+B/ where A is width 1 */
             ST.paren = 0;
             ST.min = 1;
             ST.max = REG_INFTY;
             scan = NEXTOPER(scan);
             goto repeat;
+
         case CURLYN: /* /(A){m,n}B/ where A is width 1 */
             ST.paren = scan->flags; /* Which paren to set */
             ST.lastparen = rex->lastparen;
@@ -5440,6 +5475,7 @@ NULL
             }
             scan = regnext(NEXTOPER(scan) + NODE_STEP_REGNODE);
             goto repeat;
+
         case CURLY: /* /A{m,n}B/ where A is width 1 */
             ST.paren = 0;
             ST.min = ARG1(scan); /* min to match */
@@ -5519,8 +5555,7 @@ NULL
                 minmod = 0;
                 if (ST.min && regrepeat(rex, &li, ST.A, ST.min, depth) < ST.min)
                     sayNO;
-                locinput = li;
-                nextchr = UCHARAT(locinput);
+                SET_locinput(li);
                 ST.count = ST.min;
                 REGCP_SET(ST.cp);
                 if (ST.c1 == CHRTEST_VOID)
@@ -5557,8 +5592,7 @@ NULL
                 ST.count = regrepeat(rex, &li, ST.A, ST.max, depth);
                 if (ST.count < ST.min)
                     sayNO;
-                locinput = li;
-                nextchr = UCHARAT(locinput);
+                SET_locinput(li);
                 if ((ST.count > ST.min)
                     && (PL_regkind[OP(ST.B)] == EOL) && (OP(ST.B) != MEOL))
                 {
@@ -5698,18 +5732,22 @@ NULL
             }
             {
                 UV c = 0;
-                if (ST.c1 != CHRTEST_VOID)
+                if (ST.c1 != CHRTEST_VOID && locinput < PL_regeol)
                     c = utf8_target ?
                         utf8n_to_uvchr((U8*)locinput, UTF8_MAXBYTES, 0, uniflags) :
                         (UV) UCHARAT(locinput);
                 /* If it could work, try it. */
-                if (ST.c1 == CHRTEST_VOID || c == (UV)ST.c1 || c == (UV)ST.c2) {
+                if (ST.c1 == CHRTEST_VOID
+                    || (locinput < PL_regeol &&
+                        (c == (UV)ST.c1 || c == (UV)ST.c2)))
+                {
                     CURLY_SETPAREN(ST.paren, ST.count);
                     PUSH_STATE_GOTO(CURLY_B_max, ST.B, locinput);
                     assert(0); /* NOTREACHED */
                 }
             }
             /* FALL THROUGH */
+
         case CURLY_B_max_fail:
             /* failed to find B in a greedy match */
 
@@ -5725,7 +5763,7 @@ NULL
 
 #undef ST
 
-        case END:
+        case END: /* last op of main pattern */
             fake_end:
             if (cur_eval) {
                 /* we've just finished A in /(??{A})B/; now continue with B */
@@ -5840,7 +5878,6 @@ NULL
             if (OP(ST.me) != SUSPEND) {
                 /* restore old position except for (?>...) */
                 locinput = st->locinput;
-                nextchr = UCHARAT(locinput);
             }
             scan = ST.me + ARG(ST.me);
             if (scan == ST.me)
@@ -5849,28 +5886,33 @@ NULL
 
 #undef ST
 
-        case LONGJMP:
+        case LONGJMP: /* alternative with many branches compiles to
+                       * (BRANCHJ; EXACT ...; LONGJMP ) x N */
             next = scan + ARG(scan);
             if (next == scan)
                 next = NULL;
             break;
-        case COMMIT:
+
+        case COMMIT: /* (*COMMIT) */
             reginfo->cutpoint = PL_regeol;
             /* FALLTHROUGH */
-        case PRUNE:
+
+        case PRUNE: /* (*PRUNE) */
             if (!scan->flags)
                 sv_yes_mark = sv_commit = MUTABLE_SV(rexi->data->data[ ARG( scan ) ]);
             PUSH_STATE_GOTO(COMMIT_next, next, locinput);
             assert(0); /* NOTREACHED */
+
         case COMMIT_next_fail:
             no_final = 1;
             /* FALLTHROUGH */
-        case OPFAIL:
+
+        case OPFAIL: /* (*FAIL) */
             sayNO;
             assert(0); /* NOTREACHED */
 
 #define ST st->u.mark
-        case MARKPOINT:
+        case MARKPOINT: /* (*MARK:foo) */
             ST.prev_mark = mark_state;
             ST.mark_name = sv_commit = sv_yes_mark
                 = MUTABLE_SV(rexi->data->data[ ARG( scan ) ]);
@@ -5878,10 +5920,12 @@ NULL
             ST.mark_loc = locinput;
             PUSH_YES_STATE_GOTO(MARKPOINT_next, next, locinput);
             assert(0); /* NOTREACHED */
+
         case MARKPOINT_next:
             mark_state = ST.prev_mark;
             sayYES;
             assert(0); /* NOTREACHED */
+
         case MARKPOINT_next_fail:
             if (popmark && sv_eq(ST.mark_name,popmark))
             {
@@ -5902,7 +5946,8 @@ NULL
                 mark_state->u.mark.mark_name : NULL;
             sayNO;
             assert(0); /* NOTREACHED */
-        case SKIP:
+
+        case SKIP: /* (*SKIP) */
             if (scan->flags) {
                 /* (*SKIP) : if we fail we cut here*/
                 ST.mark_name = NULL;
@@ -5927,6 +5972,7 @@ NULL
             }
             /* Didn't find our (*MARK:NAME) so ignore this (*SKIP:NAME) */
             break;
+
         case SKIP_next_fail:
             if (ST.mark_name) {
                 /* (*CUT:NAME) - Set up to search for the name as we
@@ -5946,43 +5992,54 @@ NULL
             sayNO;
             assert(0); /* NOTREACHED */
 #undef ST
-        case LNBREAK:
-            if ((n=is_LNBREAK(locinput,utf8_target))) {
+
+        case LNBREAK: /* \R */
+            if ((n=is_LNBREAK_safe(locinput, PL_regeol, utf8_target))) {
                 locinput += n;
-                nextchr = UCHARAT(locinput);
             } else
                 sayNO;
             break;
 
 #define CASE_CLASS(nAmE) \
         case nAmE: \
-            if (locinput >= PL_regeol) \
+            if (NEXTCHR_IS_EOS) \
                 sayNO; \
             if ((n=is_##nAmE(locinput,utf8_target))) { \
                 locinput += n; \
-                nextchr = UCHARAT(locinput); \
             } else \
                 sayNO; \
             break; \
         case N##nAmE: \
-            if (locinput >= PL_regeol) \
+            if (NEXTCHR_IS_EOS) \
                 sayNO; \
             if ((n=is_##nAmE(locinput,utf8_target))) { \
                 sayNO; \
             } else { \
                 locinput += UTF8SKIP(locinput); \
-                nextchr = UCHARAT(locinput); \
             } \
             break
 
-        CASE_CLASS(VERTWS);
-        CASE_CLASS(HORIZWS);
+        CASE_CLASS(VERTWS); /* \v \V */
+        CASE_CLASS(HORIZWS); /* \h \H */
 #undef CASE_CLASS
 
         default:
             PerlIO_printf(Perl_error_log, "%"UVxf" %d\n",
                           PTR2UV(scan), OP(scan));
             Perl_croak(aTHX_ "regexp memory corruption");
+
+            /* this is a point to jump to in order to increment
+             * locinput by one character */
+        increment_locinput:
+            if (utf8_target) {
+                locinput += PL_utf8skip[nextchr];
+                /* locinput is allowed to go 1 char off the end, but not 2+ */
+                if (locinput > PL_regeol)
+                    sayNO;
+            }
+            else
+                locinput++;
+            break;
 
         } /* end switch */
 
@@ -6030,7 +6087,6 @@ NULL
             PL_regmatch_state = newst;
 
             locinput = pushinput;
-            nextchr = UCHARAT(locinput);
             st = newst;
             continue;
             assert(0); /* NOTREACHED */
@@ -6081,10 +6137,8 @@ yes:
             yes_state = st->u.yes.prev_yes_state;
             PL_regmatch_state = st;
 
-            if (no_final) {
+            if (no_final)
                 locinput= st->locinput;
-                nextchr = UCHARAT(locinput);
-            }
             state_num = st->resume_state + no_final;
             goto reenter_switch;
         }
@@ -6129,7 +6183,6 @@ no_silent:
         }
         PL_regmatch_state = st;
         locinput= st->locinput;
-        nextchr = UCHARAT(locinput);
 
         DEBUG_STATE_pp("pop");
         depth--;
@@ -6647,7 +6700,8 @@ S_regrepeat(pTHX_ const regexp *prog, char **startposp, const regnode *p, I32 ma
     case LNBREAK:
         if (utf8_target) {
             loceol = PL_regeol;
-            while (hardcount < max && scan < loceol && (c=is_LNBREAK_utf8(scan))) {
+            while (hardcount < max && scan < loceol &&
+                    (c=is_LNBREAK_utf8_safe(scan, loceol))) {
                 scan += c;
                 hardcount++;
             }
@@ -6657,7 +6711,7 @@ S_regrepeat(pTHX_ const regexp *prog, char **startposp, const regnode *p, I32 ma
               because we have a null terminated string, but we
               have to use hardcount in this situation
             */
-            while (scan < loceol && (c=is_LNBREAK_latin1(scan))) {
+            while (scan < loceol && (c=is_LNBREAK_latin1_safe(scan, loceol))) {
                 scan+=c;
                 hardcount++;
             }
@@ -6666,24 +6720,28 @@ S_regrepeat(pTHX_ const regexp *prog, char **startposp, const regnode *p, I32 ma
     case HORIZWS:
         if (utf8_target) {
             loceol = PL_regeol;
-            while (hardcount < max && scan < loceol && (c=is_HORIZWS_utf8(scan))) {
+            while (hardcount < max && scan < loceol &&
+                    (c=is_HORIZWS_utf8_safe(scan, loceol)))
+            {
                 scan += c;
                 hardcount++;
             }
         } else {
-            while (scan < loceol && is_HORIZWS_latin1(scan))
+            while (scan < loceol && is_HORIZWS_latin1_safe(scan, loceol))
                 scan++;
         }
         break;
     case NHORIZWS:
         if (utf8_target) {
             loceol = PL_regeol;
-            while (hardcount < max && scan < loceol && !is_HORIZWS_utf8(scan)) {
+            while (hardcount < max && scan < loceol &&
+                    !is_HORIZWS_utf8_safe(scan, loceol))
+            {
                 scan += UTF8SKIP(scan);
                 hardcount++;
             }
         } else {
-            while (scan < loceol && !is_HORIZWS_latin1(scan))
+            while (scan < loceol && !is_HORIZWS_latin1_safe(scan, loceol))
                 scan++;
 
         }
@@ -6691,12 +6749,14 @@ S_regrepeat(pTHX_ const regexp *prog, char **startposp, const regnode *p, I32 ma
     case VERTWS:
         if (utf8_target) {
             loceol = PL_regeol;
-            while (hardcount < max && scan < loceol && (c=is_VERTWS_utf8(scan))) {
+            while (hardcount < max && scan < loceol &&
+                    (c=is_VERTWS_utf8_safe(scan, loceol)))
+            {
                 scan += c;
                 hardcount++;
             }
         } else {
-            while (scan < loceol && is_VERTWS_latin1(scan))
+            while (scan < loceol && is_VERTWS_latin1_safe(scan, loceol))
                 scan++;
 
         }
@@ -6704,12 +6764,14 @@ S_regrepeat(pTHX_ const regexp *prog, char **startposp, const regnode *p, I32 ma
     case NVERTWS:
         if (utf8_target) {
             loceol = PL_regeol;
-            while (hardcount < max && scan < loceol && !is_VERTWS_utf8(scan)) {
+            while (hardcount < max && scan < loceol &&
+                    !is_VERTWS_utf8_safe(scan, loceol))
+            {
                 scan += UTF8SKIP(scan);
                 hardcount++;
             }
         } else {
-            while (scan < loceol && !is_VERTWS_latin1(scan))
+            while (scan < loceol && !is_VERTWS_latin1_safe(scan, loceol))
                 scan++;
 
         }

-- 
Perl5 Master Repository
