https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947

--- Comment #9 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Tomalak Geret'kal from comment #0)
> Take the following:
> 
> #include <regex>
> int main()
> {
>   static const size_t BufSize = 100;
>   char buf[BufSize] = {};
>   auto begin = std::cbegin(buf), end = std::cend(buf);
> 
>   std::cmatch groups;
>   std::regex::flag_type flags = std::regex_constants::ECMAScript;
>   std::regex re("^what", flags);
>   std::regex_search(begin, end, groups, re);
> }
> 
> This attempts to match the pattern "^what" against a block of 100
> characters. The match is not expected to succeed (in this case, the input is
> simply 100 '\0's, but the problem exists for any non-matching input).
> 
> However, I expect the match to fail as soon as the first character of input
> is examined. By adjusting BufSize to increasingly large values, we observe
> that the execution time increases also, suggesting that the regex engine is
> examining the entire input even though the presence of the anchor "^"
> guarantees that a match will never be found. It only needed to examine the
> first character to know this. When BufSize reaches larger values like 100KB,
> this becomes quite problematic.

This works for the case above:

--- a/libstdc++-v3/include/bits/regex_executor.tcc
+++ b/libstdc++-v3/include/bits/regex_executor.tcc
@@ -46,6 +46,10 @@ _GLIBCXX_BEGIN_INLINE_ABI_NAMESPACE(_V2)
        return true;
       if (_M_flags & regex_constants::match_continuous)
        return false;
+      if (!(_M_flags & regex_constants::__multiline) && _M_nfa.size() > 1
+           && _M_nfa[0]._M_opcode() == _S_opcode_subexpr_begin
+           && _M_nfa[1]._M_opcode() == _S_opcode_line_begin_assertion)
+       return false;
       _M_flags |= regex_constants::match_prev_avail;
       while (_M_begin != _M_end)
        {

With this, the time for regex_search to return false no longer depends on the
size of the input.

However, this gives incorrect results for "^what|b" matching "abc" which should
match at the second position. The check probably needs to iterate over all the
states in that first subexpr and check there are no alternations in it.

Reply via email to