Re: [1003.1(2024)/Issue8 0001857]: Several problems with the new "lazy" regex quantifier.

Hans Åberg via austin-group-l at The Open Group Wed, 25 Sep 2024 02:23:33 -0700


> On 24 Sep 2024, at 23:15, Steffen Nurpmeso <[email protected]> wrote:
> 
> Hans Åberg wrote in
> <[email protected]>:
> |
> |> On 24 Sep 2024, at 00:39, Steffen Nurpmeso <[email protected]> wrote:
> |> 
> |> Hans Åberg wrote in
> |> <[email protected]>:
> |>|I made the most efficient regular expression algorithm possible that \
> |>|catches all called for sub-matches, described here:
> |>|https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472
> |> 
> |> You have not posted an URL to your implementation in this thread?
> |
> |It is not published yet. I wrote it back then, and is now seeing if \
> |I can integrate into a program I am writing.
> 
> Well it seems that thread caused some performance boost if
> i understood that correctly, so..


Given a string, for the match, I essentially extract an NFA labeled with just 
the characters of this string, and no other transitions. Then it can blow up 
exponentially on the string length, if there is sufficient ambiguity along the 
string. So in order to avoid that, I have an iterator allowing one to 
transverse all possible choices.

This makes a general choice method very complicated, which is one reason I did 
not develop it further at the time: I want to have some more general choice 
method on top of that.

Then backreferences dynamically change the NFA, but since all action matches, 
one might add them all, and then make an extraction after the parse of the 
string.

> |>|In short, backtracking is avoided by caches DFA states during the \
> |>|parsing, \
> |>|though all can be computed in advance, and transverse in a loop, like \
> |>|in Flex and Bison, with no stacking or function calls. DFA minimalizatio\
> |>|n \
> |>|proved to not work well with actions of subexpressions:
> |>|
> |>|Some sub-NFAs, corresponding to sub-regexes, are marked with labels, \
> |>|and when transitioning out of one or more, all labeled matches are \
> |>|computed using a partial RNFA, reverse NFA.
> |>|
> |>|Unicode UTF-8 and UTF-32 are treated by generating byte regular expressi\
> |>|ons.
> |>|
> |>|In the context of the discussion here, the task is to find a suitable \
> |>|selection of found matches. For example, backreferences would typically \
> |>|take the last computed value of a labeled match.
> |>|
> |>|My implementation is oriented towards use with a parser, such as Bison. 
> |> 
> |> I am too stupid for yacc/bison lex/flex.  I write parsers by hand.
> |
> |In the context of discussions here, features such as backreferences \
> |might be suitable as a replacement for a parser generator, like in \
> |the example of scanning HTML code given here:
> |https://www.regular-expressions.info/backref.html
> |
> |There I would prefer a parser generator that admits more advanced grammars \
> |than regular expressions, though I am not sure how well such works \
> |with HTML.
> |
> |Handwritten parsers might be preferred if the grammar is well known, \
> |such in a compiler for a well-known programming language.
> |
> |>|Tim Shen said he would make an implementation along these lines. So \
> |>|you might check with him for a reference implementation.
> |> 
> |> Oh!  It is not as if *i* would be writing a regular expression
> |> engine any time soon.  I still remember, however, being a little
> |> bit strained once a QT release came over with a ~30KB (iirc)
> |> regular expression implementation.  One always has an option to
> |> replace a self-written suboptimal thing with a really thought
> |> through optimized variant in the Open Source World!
> |> (Pointers to Russ Cox and Plan9 regex i have seen in the issue.)
> |
> |There is an interesting one here:
> |https://www.genivia.com/doc/reflex/html/
> |https://en.wikipedia.org/wiki/RE/flex
> 
> That is an interesting thing, thanks for this link.
> (Pretty academical read, like the above thread of yours.)

The POSIX function yylex return an int, representing the token value:

There is an interesting variation in the Bison C++ parser, where it is expected 
to return, in addition to a typed token value, also the semantic value, and a 
location.

There is an incompatibility between C and C++ streams: Even though C++ streams 
now, by default, even though it can be changed, are tied to the C stream, which 
in turn, on a POSIX system, is tied to a file descriptor, there is no compiler 
independent way to convert between C++ streams and file descriptors.

So if one has an yylex function that reads C streams in a C++ program that 
opens a C++ stream, there is no compiler independent way to pass it to this 
yylex function. (Though, in GCC, there is a facility for it, and in Clang, it 
is marked private, so it requires a hack to get around that.)

Then, Flex, which I currently use, is badly broken in certain respects: Its 
generated C++ lexer relies on a system installed header which must be accessed 
when compiling, unlike the C lexer then, or the parsers that Bison writes. 
There are several incompatible version of this header, so if a system has an 
older Flex version installed, the C++ program may not compile if compiled with 
later version, and in addition there are incompatible hacked versions, for 
example, the Apple Xcode one, which they say they want to have that way. One 
way is to hack the skeleton file of the C parser so it reads from C++ streams, 
but that part seems broken, too.

This led me to the program above, but it does not resolve the issue, and in 
addition, it relies on a C++ library it installs in the system, which may be 
incompatible if gotten from a package manager.

So in terms of the discussions here, there might be good with better POSIX 
lexing facilities so that they get universally installed, but it may not help 
if working with C++.

Re: [1003.1(2024)/Issue8 0001857]: Several problems with the new "lazy" regex quantifier.

Reply via email to