In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/61eef0f5eac901e462178b0556cbaacb988b8a58?hp=a0f6433d30e7c146a90edcb5eaaf811eaf5f20eb>
- Log ----------------------------------------------------------------- commit 61eef0f5eac901e462178b0556cbaacb988b8a58 Merge: a0f6433 1a4edc3 Author: David Mitchell <[email protected]> Date: Sun Mar 16 18:04:13 2014 +0000 [MERGE] more refactoring of re_intuit_start() This is the second merge commit of work I've done on re_intuit_start(). This one mostly concentrates on the block of code that looks for a start class. It does a lot of reorganisation of the ordering of the various nested if/elses, and by doing do manages to eliminate three labels and associated gotos. It also audits that block for UTF8 correctness, either fixing or documenting areas where byte rather than char arithmetic was being performed. At this point I've now done a basic audit of the whole function. It also swaps the start class and 'update BmUSEFUL' blocks. This seems more logical, and makes the code simpler. Finally, it cleans up and updates a lot of comments in the function generally. This merge has mainly been simplifying, cleaning up and fixing work. I still have a big list of stuff that *could* be done, including further optimisations. commit 1a4edc3cc873a3a3fc67edf3a036e5dba2ba3dc4 Author: David Mitchell <[email protected]> Date: Sun Mar 16 16:25:59 2014 +0000 re_intuit_start(): re-comment head of function The blurb above this function is a mess and contains lots of obsolete stuff. Mostly rewrite it. Also sprinkle a few extra comments elsewhere. M regexec.c commit 8fd347206d41b7722b4ca50e08384ab34013c08c Author: David Mitchell <[email protected]> Date: Sun Mar 16 15:41:55 2014 +0000 re_intuit_start(): indent rest of check block Most of the 'check' substring matching code is in its own scope; move the last few bits into that scope too. This is purely cosmetic. M regexec.c commit 66b7ec5c43f4662ee089ad40efd8dff20754fa06 Author: David Mitchell <[email protected]> Date: Sun Mar 16 15:40:00 2014 +0000 re_intuit_start(): add some general comments M regexec.c commit 70563e168c754cdb3be6df5b0000a29f506cdd2a Author: David Mitchell <[email protected]> Date: Sun Mar 16 12:17:18 2014 +0000 re_intuit_start(): update comments in BmUSEFUL blk M regexec.c commit dd170ff56e3f24f1c184d8cb9363235e8bccefda Author: David Mitchell <[email protected]> Date: Sun Mar 16 11:58:51 2014 +0000 re_intuit_start(): update comments in stclass code Improve the description at the start of this code block, and move some of the comments closer to where they apply (and explain why they're probably wrong) M regexec.c commit 557f47af18157e7faacd1b5f76b4144137e8f1cc Author: David Mitchell <[email protected]> Date: Sun Mar 16 11:36:43 2014 +0000 re_intuit_start(): improve main terminating cond At the end of the failure block in the stclass code, it decides whether rx_origin has been advanced too much for the match to ever succeed, and if so, quits. That test has a comment next to it about perhaps testing for this earlier. It also does a byte rather than chars calculation. This commit does the following. It adds a proper chars-based termination test near the beginning of the 'check' code block. This test is essentially free, since at that point we've just done the necessary HOPs. It also calculates start_point using end_point rather the strend as the limit, so avoiding a bit of unnecessary hopping when we're about to fail anyway. It introduces the new HOPMAYBE3 macro wrapper for this purpose. It keeps the old bytes-based test at the end in stclass, with a note that the simple bytes calculation errs on the side of re-entering the check block where the accurate chars-based test is now done. M regexec.c commit c75a398555a9e21701ddca2d18783ad15b6b1700 Author: David Mitchell <[email protected]> Date: Sat Mar 15 15:08:46 2014 +0000 re_intuit_start(): update stclass code comments Document instances where its safe to use bytes verses char length arithmetic, and generally improve some of the commentary in this block of code. M regexec.c commit 8f2ced4fed952b3c2d5fdb8b7ded8c0785c49d0b Author: David Mitchell <[email protected]> Date: Sat Mar 15 15:00:48 2014 +0000 re_intuit_start(): remove an obsolete assert. I added the assert a while ago while I was refactoring this code. It's pretty obvious now that at this point, rx_origin > strpos. (It wasn't back then, as strpos wasn't constant). M regexec.c commit 000dfd2d23e52925b97fdb70c3a45e996df1ff2e Author: David Mitchell <[email protected]> Date: Sat Mar 15 14:27:47 2014 +0000 re_intuit_start(): fix byte/char calculation err On failure to find stclass in float range, it currently does: rx_origin = check_at - start_shift; which should instead be rx_origin = HOP3c(check_at, - start_shift, strbeg); The second added test is an example of when this causes a false negative. (The first test is just the analogue without a UTF8 string, so it already passes). Since we actually already calculate this value earlier as part of determining endpos in the float case, assign the result of that calc to a local var, then just reuse it. M regexec.c M t/re/re_tests commit c43b55202cde0bd54e79e98bc90ceaecb135c151 Author: David Mitchell <[email protected]> Date: Mon Mar 10 17:53:48 2014 +0000 re_intuit_start(): eliminate checked_upto var This var is supposed to record the latest place that stclass had rejected, so if we try again, we don't start any earlier than that. However, an assert showed that nothing in the test suite ever left checked_upto > rx_origin on re-entry, so we could always use rx_origin directly. On failure, checked_upto is reset to HOPBACKc(endpos, start_shift) where endpoint is the highest char that find_byclass() was asked to search to. Now, I think that this is a logical error; it should have been HOPBACKc(endpos, cl_l) or similar; i.e. just hop back the number of chars equal to the length of the character class; using start_shift just makes checked_upto artificially too small. But in either case, consider the following more formal analysis (the arithmetic below is in terms of chars). Assume initially at least, that checked_upto <= rx_origin. The question is whether we can ever end up with checked_upto "overtaking" rx_origin. Where there is an anchored substring or ml_anch, we have: endpos = r_origin + cl_l; on failure, checked_upto = endpos - (start_shift or cl_l) since start_shift should be => cl_l, in either case we end up with checked_upto <= rx_origin. Where there is only a floating substring, we have: endpos = r_origin - start_shift + cl_l; on failure, checked_upto = endpos - (start_shift or cl_l) again, since start_shift should be => cl_l, in either case we end up with checked_upto <= rx_origin. Where there are no substrings, we return on failure; so the updating checked_upto is irrelevant. On success we return, so again the updating of checked_upto is irrelevant. M regexec.c commit 83f2232dc70fcf91e6341ae35e5eb6cdc7f2365e Author: David Mitchell <[email protected]> Date: Mon Mar 10 15:48:25 2014 +0000 re_intuit_start(): fix a comment about overlap The comment said that we don't allow the anchored and floating substrings to overlap. This is trivially wrong, e.g. /aa+/ has an anchored of "aa" at 0 and a float of "a at 1..inf Fix the comment to say what the assertion actually already tests for, i.e. that the float doesn't start before the anchored. M regexec.c commit be778b1af19359f88be2d70344ade5aae03b107d Author: David Mitchell <[email protected]> Date: Sat Mar 8 17:29:43 2014 +0000 re_intuit_start(): trivial reorg of stclass code make all the actions on success follow the actions on failure, rather than being partly before and partly after. No functional changes. M regexec.c commit 6b071d16f412026593626080b01e5398a48ed780 Author: David Mitchell <[email protected]> Date: Sat Mar 8 14:38:56 2014 +0000 re_intuit_start(): always initialise start_shift The start_shift variable is usually initialised to prog->check_offset_min, except for one code path where it is left at zero. It turns out that in that code path, the value isn't (currently) used, so is safe. However, that may change, so unconditionally initialise it. M regexec.c commit ffad1e6aac9969035a0dd84ca2ac1e4482364b5f Author: David Mitchell <[email protected]> Date: Tue Feb 18 10:13:54 2014 +0000 re_intuit_start(): swap BmUSEFUL + stclass blocks Currently the code that does BmUSEFUL()--, and which and frees the substring if the count reaches zero, comes before the stclass code. This is problematic, because by the time the stclass code is executed, the check substrings may have been freed. The stclass code does have some checks for this, but they are inadequate: I managed to produce code that segfaulted. It's also hard to test - you have to run against the same pattern 100 times to get the BmUSEFUL() count down to zero, then on the 101st attempt, do something that both triggers the decrement/free and fails the st class. By moving the BmUSEFUL()-- stuff to *after* the stclass code, it removes this issue completely. As to how it affects the BmUSEFUL() count: I think it makes it slightly better. Before, a substring could match at rx_origin offset 0, triggering BmUSEFUL()--, then the stclass could fail, triggering a new substring match at at a non-zero offset, and thus doing BmUSEFUL()++. The overall affect being neutral. Now, the overall effect is BmUSEFUL += 1, which better reflects the fact that the substring *did* help us find a non-zero offset. M regexec.c commit 97136c8a5affa3c46e244cc603af319eafb3938b Author: David Mitchell <[email protected]> Date: Mon Feb 17 20:46:15 2014 +0000 re_intuit_start(): de-duplicate condition Change if (have_anchored && check_ix == 1) { B; } if (!have_anchored) { A; } C; To: if (have_anchored) { if (check_ix == 1) { B; } } else { A; } C; This change should be functionally equivalent, but eliminates calculating the have_anchored condition twice. M regexec.c commit 3369914b87d2d563caa427611eb6e81820d8796a Author: David Mitchell <[email protected]> Date: Mon Feb 17 20:37:13 2014 +0000 re_intuit_start(): swap two blocks, delete label Change if (!have_anchored) { A; goto hop_and_restart; } if (check_ix == 1) { B; } hop_and_restart: C; To: if (have_anchored && check_ix == 1) { B; } if (!have_anchored) { A; } C; This change should be functionally equivalent, but eliminates a label. M regexec.c commit 9fed8d02fb387e26fe15437b6fa29f16095fba55 Author: David Mitchell <[email protected]> Date: Mon Feb 17 20:20:40 2014 +0000 re_intuit_start(): swap another if/else block Change { ... if (has_anchored) { A; goto restart; } B; goto hop_and_restart; } to { ... if (!has_anchored) { B; goto hop_and_restart; } A; goto restart; } Functionally the same, but will allow simpler code shortly M regexec.c commit 838b29afa5fb090e885342b04b29a9bd24b07981 Author: David Mitchell <[email protected]> Date: Mon Feb 17 20:12:31 2014 +0000 re_intuit_start(): remove redundant assertion assert(prog->substrs->check_ix) is within the block if (prog->substrs->check_ix == 1) so it's not needed any more. M regexec.c commit ab60c45aeb1dd179abc80fd3a27963e6518497a7 Author: David Mitchell <[email protected]> Date: Mon Feb 17 19:45:12 2014 +0000 re_intuit_start(): swap another if/else block Change { if (cond) goto x; A; goto y; } x: into { if (!cond) { A; goto y; } } x: Functionally equivalent, but eliminates one 'goto'. M regexec.c commit 40268e5b6860d2c4c03900f03b8c294af7b8c823 Author: David Mitchell <[email protected]> Date: Mon Feb 17 19:39:44 2014 +0000 re_intuit_start(): swap if/else blocks Change { if (cond) { A; goto x; } B; goto y; } into { if (!cond) { B; goto y; } A; goto x; } Should be functionally equivalent, but will allow for some further code factorisation shortly. M regexec.c commit f746537561e2b62be6215d1e69383606e59deef5 Author: David Mitchell <[email protected]> Date: Mon Feb 17 19:07:08 2014 +0000 re_intuit_start(): eliminate one label Currently the code looks like: if (rx_origin + start_shift >= check_at) goto retry_floating_check; .... retry_floating_check: rx_origin = check_at - start_shift; goto hop_and_restart; But that conditional only kicks in when the floating substring (which was the check substring) matched, so this condition always applies: rx_origin + start_shift <= check_at So the condition in the code will only ever be true when rx_origin + start_shift == check_at In this case, this assignment is a noop: rx_origin = check_at - start_shift; So skip it and jump directly to hop_and_restart. This eliminates the retry_floating_check: label. M regexec.c commit 4f6684feb3d78081996bc01d27f11b937a9c2277 Author: David Mitchell <[email protected]> Date: Mon Feb 17 18:51:48 2014 +0000 add test for a code path in intuit_start() Nothing in the test suite currently triggers *not* taking this branch: /* Have both, check_string is floating */ if (rx_origin + start_shift >= check_at) /* Contradicts floating=check */ So add something that does trigger it. M t/re/re_tests commit 7f376a02026c0a17e986a7322a0ea03b78ce76b3 Author: David Mitchell <[email protected]> Date: Mon Feb 10 19:56:08 2014 +0000 re_intuit_start(): eliminate debug-only var 'what' is only defined and assigned to in debugging builds. Just calculate the value on the fly instead when printing debugging output. Makes the code less messy. M regexec.c commit 5f9c657526f7ba9e5102768312d28624f912b731 Author: David Mitchell <[email protected]> Date: Mon Feb 10 15:37:30 2014 +0000 re_intuit_start(): eliminate t from stclass code The 't' variable now just contains a copy of rx_origin; so just use rx_origin directly, and eliminate t. M regexec.c commit c68d900fa69a98fc307d2b8a88927a1c4b609bc0 Author: David Mitchell <[email protected]> Date: Mon Feb 10 14:52:46 2014 +0000 re_intuit_start(): reduce use of s in stclass code s is mainly acting as a stand-in for rx_origin; so just use rx_origin directly. M regexec.c commit ee5cbb25cd8121857974ab2c66dc5acd2467b5da Author: David Mitchell <[email protected]> Date: Mon Feb 10 14:44:25 2014 +0000 re_intuit_start(): remove if(check) It this point in the stclass block, we have already confirmed that an anchored substring exists and that the check substring is anchored. So check cannot be null at this point. So don't test for it being null. M regexec.c commit 5149a019c7412ab294e1fd1bcfc139bcafc3b5ee Author: David Mitchell <[email protected]> Date: Mon Feb 10 13:28:25 2014 +0000 re_intuit_start(): use check_ix for efficiency M regexec.c commit fa3bb21d8d67c7d0bc07c697143264b3ec46f578 Author: David Mitchell <[email protected]> Date: Mon Feb 10 13:23:07 2014 +0000 re_intuit_start(): stclass: use rx_origin more In the stclass code block, we do s = rx_origin; ... stuff reading the value of s ... s = find_byclass(...); change that to ... stuff reading the value of rx_origin ... s = find_byclass(...); M regexec.c ----------------------------------------------------------------------- Summary of changes: regexec.c | 554 ++++++++++++++++++++++++++++++++-------------------------- t/re/re_tests | 5 + 2 files changed, 313 insertions(+), 246 deletions(-) diff --git a/regexec.c b/regexec.c index e385fe7..f606622 100644 --- a/regexec.c +++ b/regexec.c @@ -119,6 +119,7 @@ static const char* const non_utf8_target_but_utf8_required ? reghop3((U8*)pos, off, \ (U8*)(off >= 0 ? reginfo->strend : reginfo->strbeg)) \ : (U8*)(pos + off)) + #define HOPBACKc(pos, off) \ (char*)(reginfo->is_utf8_target \ ? reghopmaybe3((U8*)pos, -off, (U8*)(reginfo->strbeg)) \ @@ -129,6 +130,14 @@ static const char* const non_utf8_target_but_utf8_required #define HOP3(pos,off,lim) (reginfo->is_utf8_target ? reghop3((U8*)(pos), off, (U8*)(lim)) : (U8*)(pos + off)) #define HOP3c(pos,off,lim) ((char*)HOP3(pos,off,lim)) +/* lim must be +ve. Returns NULL on overshoot */ +#define HOPMAYBE3(pos,off,lim) \ + (reginfo->is_utf8_target \ + ? reghopmaybe3((U8*)pos, off, (U8*)(lim)) \ + : ((U8*)pos + off <= lim) \ + ? (U8*)pos + off \ + : NULL) + /* like HOP3, but limits the result to <= lim even for the non-utf8 case. * off must be >=0; args should be vars rather than expressions */ #define HOP3lim(pos,off,lim) (reginfo->is_utf8_target \ @@ -563,64 +572,70 @@ Perl_pregexec(pTHX_ REGEXP * const prog, char* stringarg, char *strend, } #endif -/* - * Need to implement the following flags for reg_anch: - * - * USE_INTUIT_NOML - Useful to call re_intuit_start() first - * USE_INTUIT_ML - * INTUIT_AUTORITATIVE_NOML - Can trust a positive answer - * INTUIT_AUTORITATIVE_ML - * INTUIT_ONCE_NOML - Intuit can match in one location only. - * INTUIT_ONCE_ML - * - * Another flag for this function: SECOND_TIME (so that float substrs - * with giant delta may be not rechecked). - */ - -/* If SCREAM, then SvPVX_const(sv) should be compatible with strpos and strend. - Otherwise, only SvCUR(sv) is used to get strbeg. */ -/* XXXX Some places assume that there is a fixed substring. - An update may be needed if optimizer marks as "INTUITable" - RExen without fixed substrings. Similarly, it is assumed that - lengths of all the strings are no more than minlen, thus they - cannot come from lookahead. - (Or minlen should take into account lookahead.) - NOTE: Some of this comment is not correct. minlen does now take account - of lookahead/behind. Further research is required. -- demerphq -*/ - -/* A failure to find a constant substring means that there is no need to make - an expensive call to REx engine, thus we celebrate a failure. Similarly, - finding a substring too deep into the string means that fewer calls to - regtry() should be needed. - - REx compiler's optimizer found 4 possible hints: - a) Anchored substring; - b) Fixed substring; - c) Whether we are anchored (beginning-of-line or \G); - d) First node (of those at offset 0) which may distinguish positions; - We use a)b)d) and multiline-part of c), and try to find a position in the - string which does not contradict any of them. - */ - -/* Most of decisions we do here should have been done at compile time. - The nodes of the REx which we used for the search should have been - deleted from the finite automaton. */ - -/* args: - * rx: the regex to match against - * sv: the SV being matched: only used for utf8 flag; the string - * itself is accessed via the pointers below. Note that on - * something like an overloaded SV, SvPOK(sv) may be false - * and the string pointers may point to something unrelated to - * the SV itself. - * strbeg: real beginning of string - * strpos: the point in the string at which to begin matching - * strend: pointer to the byte following the last char of the string - * flags currently unused; set to 0 - * data: currently unused; set to NULL +/* re_intuit_start(): + * + * Based on some optimiser hints, try to find the earliest position in the + * string where the regex could match. + * + * rx: the regex to match against + * sv: the SV being matched: only used for utf8 flag; the string + * itself is accessed via the pointers below. Note that on + * something like an overloaded SV, SvPOK(sv) may be false + * and the string pointers may point to something unrelated to + * the SV itself. + * strbeg: real beginning of string + * strpos: the point in the string at which to begin matching + * strend: pointer to the byte following the last char of the string + * flags currently unused; set to 0 + * data: currently unused; set to NULL + * + * The basic idea of re_intuit_start() is to use some known information + * about the pattern, namely: + * + * a) the longest known anchored substring (i.e. one that's at a + * constant offset from the beginning of the pattern; but not + * necessarily at a fixed offset from the beginning of the + * string); + * b) the longest floating substring (i.e. one that's not at a constant + * offset from the beginning of the pattern); + * c) Whether the pattern is anchored to the string; either + * an absolute anchor: /^../, or anchored to \n: /^.../m, + * or anchored to pos(): /\G/; + * d) A start class: a real or synthetic character class which + * represents which characters are legal at the start of the pattern; + * + * to either quickly reject the match, or to find the earliest position + * within the string at which the pattern might match, thus avoiding + * running the full NFA engine at those earlier locations, only to + * eventually fail and retry further along. + * + * Returns NULL if the pattern can't match, or returns the address within + * the string which is the earliest place the match could occur. + * + * The longest of the anchored and floating substrings is called 'check' + * and is checked first. The other is called 'other' and is checked + * second. The 'other' substring may not be present. For example, + * + * /(abc|xyz)ABC\d{0,3}DEFG/ + * + * will have + * + * check substr (float) = "DEFG", offset 6..9 chars + * other substr (anchored) = "ABC", offset 3..3 chars + * stclass = [ax] + * + * Be aware that during the course of this function, sometimes 'anchored' + * refers to a substring being anchored relative to the start of the + * pattern, and sometimes to the pattern itself being anchored relative to + * the string. For example: + * + * /\dabc/: "abc" is anchored to the pattern; + * /^\dabc/: "abc" is anchored to the pattern and the string; + * /\d+abc/: "abc" is anchored to neither the pattern nor the string; + * /^\d+abc/: "abc" is anchored to neither the pattern nor the string, + * but the pattern is anchored to the string. */ char * @@ -635,7 +650,7 @@ Perl_re_intuit_start(pTHX_ { dVAR; struct regexp *const prog = ReANY(rx); - SSize_t start_shift = 0; + SSize_t start_shift = prog->check_offset_min; /* Should be nonnegative! */ SSize_t end_shift = 0; /* current lowest pos in string where the regex can start matching */ @@ -646,7 +661,6 @@ Perl_re_intuit_start(pTHX_ bool ml_anch = 0; char *other_last = strpos;/* latest pos 'other' substr already checked to */ char *check_at = NULL; /* check substr found at this pos */ - char *checked_upto = NULL; /* how far into the string we have already checked using find_byclass*/ const I32 multiline = prog->extflags & RXf_PMf_MULTILINE; RXi_GET_DECL(prog,progi); regmatch_info reginfo_buf; /* create some info to pass to find_byclass */ @@ -674,7 +688,7 @@ Perl_re_intuit_start(pTHX_ assert(prog->substrs->data[2].max_offset >= 0); /* for now, assume that if both present, that the floating substring - * follows the anchored substring, and that they don't overlap. + * doesn't start before the anchored substring. * If you break this assumption (e.g. doing better optimisations * with lookahead/behind), then you'll need to audit the code in this * function carefully first @@ -684,7 +698,9 @@ Perl_re_intuit_start(pTHX_ && (prog->float_utf8 || prog->float_substr)) || (prog->float_min_offset >= prog->anchored_offset)); - /* CHR_DIST() would be more correct here but it makes things slow. */ + /* byte rather than char calculation for efficiency. It fails + * to quickly reject some cases that can't match, but will reject + * them later after doing full char arithmetic */ if (prog->minlen > strend - strpos) { DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, " String too short...\n")); @@ -759,7 +775,13 @@ Perl_re_intuit_start(pTHX_ /* in the presence of an anchor, the anchored (relative to the * start of the regex) substr must also be anchored relative - * to strpos. So quickly reject if substr isn't found there */ + * to strpos. So quickly reject if substr isn't found there. + * This works for \G too, because the caller will already have + * subtracted gofs from pos, and gofs is the offset from the + * \G to the start of the regex. For example, in /.abc\Gdef/, + * where substr="abcdef", pos()=3, gofs=4, offset_min=1: + * caller will have set strpos=pos()-4; we look for the substr + * at position pos()-4+1, which lines up with the "a" */ if (prog->check_offset_min == prog->check_offset_max && !(prog->intflags & PREGf_CANY_SEEN) @@ -805,7 +827,6 @@ Perl_re_intuit_start(pTHX_ } } - start_shift = prog->check_offset_min; /* okay to underestimate on CC */ end_shift = prog->check_end_shift; #ifdef DEBUGGING /* 7/99: reports of failure (with the older version) */ @@ -815,11 +836,32 @@ Perl_re_intuit_start(pTHX_ #endif restart: - /* Find a candidate regex origin in the region rx_origin..strend - * by looking for the "check" substring in that region, corrected by - * start/end_shift. - */ + /* This is the (re)entry point of the main loop in this function. + * The goal of this loop is to: + * 1) find the "check" substring in the region rx_origin..strend + * (adjusted by start_shift / end_shift). If not found, reject + * immediately. + * 2) If it exists, look for the "other" substr too if defined; for + * example, if the check substr maps to the anchored substr, then + * check the floating substr, and vice-versa. If not found, go + * back to (1) with rx_origin suitably incremented. + * 3) If we find an rx_origin position that doesn't contradict + * either of the substrings, then check the possible additional + * constraints on rx_origin of /^.../m or a known start class. + * If these fail, then depending on which constraints fail, jump + * back to here, or to various other re-entry points further along + * that skip some of the first steps. + * 4) If we pass all those tests, update the BmUSEFUL() count on the + * substring. If the start position was determined to be at the + * beginning of the string - so, not rejected, but not optimised, + * since we have to run regmatch from position 0 - decrement the + * BmUSEFUL() count. Otherwise increment it. + */ + + + /* first, look for the 'check' substring */ + { U8* start_point; U8* end_point; @@ -839,11 +881,16 @@ Perl_re_intuit_start(pTHX_ if (prog->intflags & PREGf_CANY_SEEN) { start_point= (U8*)(rx_origin + start_shift); end_point= (U8*)(strend - end_shift); + if (start_point > end_point) + goto fail_finish; } else { - start_point= HOP3(rx_origin, start_shift, strend); - end_point= HOP3(strend, -end_shift, strbeg); + end_point = HOP3(strend, -end_shift, strbeg); + start_point = HOPMAYBE3(rx_origin, start_shift, end_point); + if (!start_point) + goto fail_finish; } + /* if the regex is absolutely anchored to the start of the string, * then check_offset_max represents an upper bound on the string * where the substr could start */ @@ -868,50 +915,38 @@ Perl_re_intuit_start(pTHX_ check_at = fbm_instr( start_point, end_point, check, multiline ? FBMrf_MULTILINE : 0); - } - /* Update the count-of-usability, remove useless subpatterns, - unshift s. */ - - DEBUG_EXECUTE_r({ - RE_PV_QUOTED_DECL(quoted, utf8_target, PERL_DEBUG_PAD_ZERO(0), - SvPVX_const(check), RE_SV_DUMPLEN(check), 30); - PerlIO_printf(Perl_debug_log, " %s %s substr %s%s%s", - (check_at ? "Found" : "Did not find"), - (check == (utf8_target ? prog->anchored_utf8 : prog->anchored_substr) - ? "anchored" : "floating"), - quoted, - RE_SV_TAIL(check), - (check_at ? " at offset " : "...\n") ); - }); + /* Update the count-of-usability, remove useless subpatterns, + unshift s. */ - if (!check_at) - goto fail_finish; - /* Finish the diagnostic message */ - DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, "%ld...\n", (long)(check_at - strpos)) ); + DEBUG_EXECUTE_r({ + RE_PV_QUOTED_DECL(quoted, utf8_target, PERL_DEBUG_PAD_ZERO(0), + SvPVX_const(check), RE_SV_DUMPLEN(check), 30); + PerlIO_printf(Perl_debug_log, " %s %s substr %s%s%s", + (check_at ? "Found" : "Did not find"), + (check == (utf8_target ? prog->anchored_utf8 : prog->anchored_substr) + ? "anchored" : "floating"), + quoted, + RE_SV_TAIL(check), + (check_at ? " at offset " : "...\n") ); + }); - /* set rx_origin to the minimum position where the regex could start - * matching, given the constraint of the just-matched check substring. - * But don't set it lower than previously. - */ + if (!check_at) + goto fail_finish; + /* Finish the diagnostic message */ + DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, "%ld...\n", (long)(check_at - strpos)) ); - if (check_at - rx_origin > prog->check_offset_max) - rx_origin = HOP3c(check_at, -prog->check_offset_max, rx_origin); + /* set rx_origin to the minimum position where the regex could start + * matching, given the constraint of the just-matched check substring. + * But don't set it lower than previously. + */ + if (check_at - rx_origin > prog->check_offset_max) + rx_origin = HOP3c(check_at, -prog->check_offset_max, rx_origin); + } - /* XXX dmq: first branch is for positive lookbehind... - Our check string is offset from the beginning of the pattern. - So we need to do any stclass tests offset forward from that - point. I think. :-( - */ - - /* Got a candidate. Check MBOL anchoring, and the *other* substr. - Start with the other substr. - XXXX no SCREAM optimization yet - and a very coarse implementation - XXXX /ttx+/ results in anchored="ttx", floating="x". floating will - *always* match. Probably should be marked during compile... - Probably it is right to do no SCREAM here... - */ + + /* now look for the 'other' substring if defined */ if (utf8_target ? prog->substrs->data[other_ix].utf8_substr : prog->substrs->data[other_ix].substr) @@ -1090,7 +1125,7 @@ Perl_re_intuit_start(pTHX_ postprocess_substr_matches: - /* handle the extra constraint of /^/m */ + /* handle the extra constraint of /^.../m if present */ if (ml_anch && rx_origin != strbeg && rx_origin[-1] != '\n' /* May be due to an implicit anchor of m{.*foo} */ @@ -1171,6 +1206,160 @@ Perl_re_intuit_start(pTHX_ PL_colors[0], PL_colors[1])); } + success_at_start: + + + /* if we have a starting character class, then test that extra constraint. + * (trie stclasses are too expensive to use here, we are better off to + * leave it to regmatch itself) */ + + if (progi->regstclass && PL_regkind[OP(progi->regstclass)]!=TRIE) { + const U8* const str = (U8*)STRING(progi->regstclass); + + /* XXX this value could be pre-computed */ + const int cl_l = (PL_regkind[OP(progi->regstclass)] == EXACT + ? (reginfo->is_utf8_pat + ? utf8_distance(str + STR_LEN(progi->regstclass), str) + : STR_LEN(progi->regstclass)) + : 1); + char * endpos; + char *s; + /* latest pos that a matching float substr constrains rx start to */ + char *rx_max_float = NULL; + + /* if the current rx_origin is anchored, either by satisfying an + * anchored substring constraint, or a /^.../m constraint, then we + * can reject the current origin if the start class isn't found + * at the current position. If we have a float-only match, then + * rx_origin is constrained to a range; so look for the start class + * in that range. if neither, then look for the start class in the + * whole rest of the string */ + + /* XXX DAPM it's not clear what the minlen test is for, and why + * it's not used in the floating case. Nothing in the test suite + * causes minlen == 0 here. See <[email protected]>. + * Here are some old comments, which may or may not be correct: + * + * minlen == 0 is possible if regstclass is \b or \B, + * and the fixed substr is ''$. + * Since minlen is already taken into account, rx_origin+1 is + * before strend; accidentally, minlen >= 1 guaranties no false + * positives at rx_origin + 1 even for \b or \B. But (minlen? 1 : + * 0) below assumes that regstclass does not come from lookahead... + * If regstclass takes bytelength more than 1: If charlength==1, OK. + * This leaves EXACTF-ish only, which are dealt with in + * find_byclass(). + */ + + if (prog->anchored_substr || prog->anchored_utf8 || ml_anch) + endpos= HOP3c(rx_origin, (prog->minlen ? cl_l : 0), strend); + else if (prog->float_substr || prog->float_utf8) { + rx_max_float = HOP3c(check_at, -start_shift, strbeg); + endpos= HOP3c(rx_max_float, cl_l, strend); + } + else + endpos= strend; + + DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, + " looking for class: start_shift: %"IVdf" check_at: %"IVdf + " rx_origin: %"IVdf" endpos: %"IVdf"\n", + (IV)start_shift, (IV)(check_at - strbeg), + (IV)(rx_origin - strbeg), (IV)(endpos - strbeg))); + + s = find_byclass(prog, progi->regstclass, rx_origin, endpos, + reginfo); + if (!s) { + if (endpos == strend) { + DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, + " Could not match STCLASS...\n") ); + goto fail; + } + DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, + " This position contradicts STCLASS...\n") ); + if ((prog->intflags & PREGf_ANCH) && !ml_anch) + goto fail; + + /* Contradict one of substrings */ + if (prog->anchored_substr || prog->anchored_utf8) { + if (prog->substrs->check_ix == 1) { /* check is float */ + /* Have both, check_string is floating */ + assert(rx_origin + start_shift <= check_at); + if (rx_origin + start_shift != check_at) { + /* not at latest position float substr could match: + * Recheck anchored substring, but not floating. + * The condition above is in bytes rather than + * chars for efficiency. It's conservative, in + * that it errs on the side of doing 'goto + * do_other_substr', where a more accurate + * char-based calculation will be done */ + DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, + " Looking for anchored substr starting at offset %ld...\n", + (long)(other_last - strpos)) ); + goto do_other_substr; + } + } + } + else { + /* float-only */ + + if (ml_anch) { + /* In the presence of ml_anch, we might be able to + * find another \n without breaking the current float + * constraint. */ + + /* strictly speaking this should be HOP3c(..., 1, ...), + * but since we goto a block of code that's going to + * search for the next \n if any, its safe here */ + rx_origin++; + DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, + " Looking for /%s^%s/m starting at offset %ld...\n", + PL_colors[0], PL_colors[1], + (long)(rx_origin - strpos)) ); + goto postprocess_substr_matches; + } + + /* strictly speaking this can never be true; but might + * be if we ever allow intuit without substrings */ + if (!(utf8_target ? prog->float_utf8 : prog->float_substr)) + goto fail; + + rx_origin = rx_max_float; + } + + /* at this point, any matching substrings have been + * contradicted. Start again... */ + + rx_origin = HOP3c(rx_origin, 1, strend); + + /* uses bytes rather than char calculations for efficiency. + * It's conservative: it errs on the side of doing 'goto restart', + * where there is code that does a proper char-based test */ + if (rx_origin + start_shift + end_shift > strend) { + DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, + " Could not match STCLASS...\n") ); + goto fail; + } + DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, + " Looking for %s substr starting at offset %ld...\n", + (prog->substrs->check_ix ? "floating" : "anchored"), + (long)(rx_origin + start_shift - strpos)) ); + goto restart; + } + + /* Success !!! */ + + if (rx_origin != s) { + DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, + " By STCLASS: moving %ld --> %ld\n", + (long)(rx_origin - strpos), (long)(s - strpos)) + ); + } + else { + DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, + " Does not contradict STCLASS...\n"); + ); + } + } /* Decide whether using the substrings helped */ @@ -1182,12 +1371,11 @@ Perl_re_intuit_start(pTHX_ ++BmUSEFUL(utf8_target ? prog->check_utf8 : prog->check_substr); /* hooray/5 */ } else { - /* The found string does not prohibit matching at strpos, - - no optimization of calling REx engine can be performed, - unless it was an MBOL and we are not after MBOL, - or a future STCLASS check will fail this. */ - success_at_start: - if (!(prog->intflags & PREGf_NAUGHTY) /* XXXX If strpos moved? */ + /* The found rx_origin position does not prohibit matching at + * strpos, so calling intuit didn't gain us anything. Decrement + * the BmUSEFUL() count on the check substring, and if we reach + * zero, free it. */ + if (!(prog->intflags & PREGf_NAUGHTY) && (utf8_target ? ( prog->check_utf8 /* Could be deleted already */ && --BmUSEFUL(prog->check_utf8) < 0 @@ -1222,140 +1410,10 @@ Perl_re_intuit_start(pTHX_ } } - /* Last resort... */ - /* XXXX BmUSEFUL already changed, maybe multiple change is meaningful... */ - /* trie stclasses are too expensive to use here, we are better off to - leave it to regmatch itself */ - if (progi->regstclass && PL_regkind[OP(progi->regstclass)]!=TRIE) { - /* minlen == 0 is possible if regstclass is \b or \B, - and the fixed substr is ''$. - Since minlen is already taken into account, rx_origin+1 is before strend; - accidentally, minlen >= 1 guaranties no false positives at rx_origin + 1 - even for \b or \B. But (minlen? 1 : 0) below assumes that - regstclass does not come from lookahead... */ - /* If regstclass takes bytelength more than 1: If charlength==1, OK. - This leaves EXACTF-ish only, which are dealt with in find_byclass(). */ - const U8* const str = (U8*)STRING(progi->regstclass); - char *t; - - /* XXX this value could be pre-computed */ - const int cl_l = (PL_regkind[OP(progi->regstclass)] == EXACT - ? (reginfo->is_utf8_pat - ? utf8_distance(str + STR_LEN(progi->regstclass), str) - : STR_LEN(progi->regstclass)) - : 1); - char * endpos; - char *s = rx_origin; - if (prog->anchored_substr || prog->anchored_utf8 || ml_anch) - endpos= HOP3c(s, (prog->minlen ? cl_l : 0), strend); - else if (prog->float_substr || prog->float_utf8) - endpos= HOP3c(HOP3c(check_at, -start_shift, strbeg), cl_l, strend); - else - endpos= strend; - - if (checked_upto < s) - checked_upto = s; - DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, - " looking for class: start_shift: %"IVdf" check_at: %"IVdf - " s: %"IVdf" endpos: %"IVdf" checked_upto: %"IVdf"\n", - (IV)start_shift, (IV)(check_at - strbeg), - (IV)(s - strbeg), (IV)(endpos - strbeg), - (IV)(checked_upto- strbeg))); + DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, + "Intuit: %sSuccessfully guessed:%s match at offset %ld\n", + PL_colors[4], PL_colors[5], (long)(rx_origin - strpos)) ); - t = s; - s = find_byclass(prog, progi->regstclass, checked_upto, endpos, - reginfo); - if (s) { - checked_upto = s; - } else { -#ifdef DEBUGGING - const char *what = NULL; -#endif - if (endpos == strend) { - DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, - " Could not match STCLASS...\n") ); - goto fail; - } - DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, - " This position contradicts STCLASS...\n") ); - if ((prog->intflags & PREGf_ANCH) && !ml_anch) - goto fail; - checked_upto = HOPBACKc(endpos, start_shift); - DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, " start_shift: %"IVdf" check_at: %"IVdf" endpos: %"IVdf" checked_upto: %"IVdf"\n", - (IV)start_shift, (IV)(check_at - strbeg), (IV)(endpos - strbeg), (IV)(checked_upto- strbeg))); - /* Contradict one of substrings */ - if (prog->anchored_substr || prog->anchored_utf8) { - if ((utf8_target ? prog->anchored_utf8 : prog->anchored_substr) == check) { - DEBUG_EXECUTE_r( what = "anchored" ); - hop_and_restart: - s = HOP3c(t, 1, strend); - if (s + start_shift + end_shift > strend) { - /* XXXX Should be taken into account earlier? */ - DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, - " Could not match STCLASS...\n") ); - goto fail; - } - rx_origin = s; - if (!check) - goto giveup; - DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, - " Looking for %s substr starting at offset %ld...\n", - what, (long)(rx_origin + start_shift - strpos)) ); - goto restart; - } - /* Have both, check_string is floating */ - if (t + start_shift >= check_at) /* Contradicts floating=check */ - goto retry_floating_check; - /* Recheck anchored substring, but not floating... */ - if (!check) { - rx_origin = NULL; - goto giveup; - } - DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, - " Looking for anchored substr starting at offset %ld...\n", - (long)(other_last - strpos)) ); - assert(prog->substrs->check_ix); /* other is float */ - goto do_other_substr; - } - /* Another way we could have checked stclass at the - current position only: */ - if (ml_anch) { - s = rx_origin = t + 1; - if (!check) - goto giveup; - DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, - " Looking for /%s^%s/m starting at offset %ld...\n", - PL_colors[0], PL_colors[1], - (long)(rx_origin - strpos)) ); - /* XXX DAPM I don't yet know why this is true, but the code - * assumed it when it used to do goto try_at_offset */ - assert(rx_origin != strpos); - goto postprocess_substr_matches; - } - if (!(utf8_target ? prog->float_utf8 : prog->float_substr)) /* Could have been deleted */ - goto fail; - /* Check is floating substring. */ - retry_floating_check: - t = check_at - start_shift; - DEBUG_EXECUTE_r( what = "floating" ); - goto hop_and_restart; - } - if (t != s) { - DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, - " By STCLASS: moving %ld --> %ld\n", - (long)(t - strpos), (long)(s - strpos)) - ); - } - else { - DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, - " Does not contradict STCLASS...\n"); - ); - } - } - giveup: - DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, "Intuit: %s%s:%s match at offset %ld\n", - PL_colors[4], (check ? "Successfully guessed" : "Giving up"), - PL_colors[5], (long)(rx_origin - strpos)) ); return rx_origin; fail_finish: /* Substring not found */ @@ -1367,6 +1425,7 @@ Perl_re_intuit_start(pTHX_ return NULL; } + #define DECL_TRIE_TYPE(scan) \ const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold, \ trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold } \ @@ -7827,6 +7886,9 @@ S_reghop4(U8 *s, SSize_t off, const U8* llim, const U8* rlim) return s; } +/* like reghop3, but returns NULL on overrun, rather than returning last + * char pos */ + STATIC U8 * S_reghopmaybe3(U8* s, SSize_t off, const U8* lim) { diff --git a/t/re/re_tests b/t/re/re_tests index 44e94c7..7ab7dc3 100644 --- a/t/re/re_tests +++ b/t/re/re_tests @@ -1879,5 +1879,10 @@ A+(*PRUNE)BC(?{}) AAABC y $& AAABC /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\11/ abcdefghij\11 y $& abcdefghij\11 /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12/ abcdefghijk\12 y $& abcdefghijk\12 /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\13\14/ abcdefghijk\12\13\14 y $& abcdefghijk\12\13\14 + +\d<(.*?)> a<> n - - +[bcd].{2,3}aaaa XbXaaaaa y - - +[bcd].{2,3}aaaa Xb\x{100}aaaaa y - - + # Keep these lines at the end of the file # vim: softtabstop=0 noexpandtab -- Perl5 Master Repository
