Re: Pure Regular Expression Engines and Literal Clusters
The problem is that most regex engines are not written to handle some
"interesting" features of canonical equivalence, such as discontiguous
matches. Suppose that X is canonically equivalent to the sequence AB.

- A query /X/ can match the separated A and B in the target string
  "AcB". So if I have code to do [replace /X/ in "AcB" by "pq"], how
  should it behave: "pqc", "pcq", or "cpq"? If the input was in NFD
  (for example), should the output be rearranged/decomposed so that it
  is NFD? And so on.

- A query /A/ can match *part* of the X in the target string "aXb". So
  if I have code to do [replace /A/ in "aXb" by "pq"], what should
  result: "apqBb"?

The syntax and APIs for regex engines are not built to handle these
features. It introduces enough complications in the code, syntax, and
semantics that no major implementation has seen fit to do it. We used
to have a section in the spec about this, but were convinced that it
was better off handled at a higher level.

Mark
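To see the baseline problem in running code: stock engines compare
code points, not canonical equivalence classes. A minimal Python
illustration, using only the standard library (not tied to any engine
discussed in this thread):

    import re
    import unicodedata

    # U+00C5 (Å) and "A" + U+030A COMBINING RING ABOVE are canonically
    # equivalent, but a code-point-based engine only matches the exact
    # form it was given.
    composed, decomposed = "\u00C5", "A\u030A"
    print(unicodedata.normalize("NFD", composed) == decomposed)       # True
    print(bool(re.search(composed, decomposed)))                      # False
    print(bool(re.search(composed,
                         unicodedata.normalize("NFC", decomposed))))  # True

Normalising both pattern and text papers over the simple cases, but it
does not answer the replacement questions above, where the match is
discontiguous.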
Re: Pure Regular Expression Engines and Literal Clusters
On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:

> On Sun, 13 Oct 2019 17:13:28 -0700
> Asmus Freytag via Unicode wrote:
>
>> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match. For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what
>>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>>
>> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
>> instead of formally handling NFD, you could extend the syntax to
>> handle "inherited" properties across combining sequences.
>>
>> Am I missing anything?
>
> Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so
> [:Lu:] should not match <U+004D, U+0302 COMBINING CIRCUMFLEX ACCENT>.

Why does it matter if it is precomposed? Why should it? (For anyone
other than a character coding maven.)

> Now, I could invent a string property \p{xLu} that meant
> (?:\p{Lu}\p{Mn}*).
>
> I don't entirely understand what you said; you may have missed the
> distinction between "[:Lu:] can then match" and "[:Lu:] will then
> match". I think only Greek letters expand to 4 characters in NFD.
>
> When I'm respecting canonical equivalence/working with traces, I
> want [:insc=vowel_dependent:][:insc=tone_mark:] to match both
> <U+0E39 THAI CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and
> its canonical equivalent <U+0E49, U+0E39>. The canonical closure of
> that sequence can be messy even within scripts. Some pairs commute;
> others don't, usually for good reasons.

Some models may be more natural for different scripts. Certainly, in
SEA or Indic scripts, most combining marks are not best modeled with
properties as "inherited". But for L/G/C etc. it would be a different
matter.

For general recommendations, such as UTS#18, it would be good to move
the state of the art so that the "primitives" are in line with the way
typical writing systems behave, so that people can write
"linguistically correct" regexes.

A./

> Regards,
>
> Richard.
Re: Pure Regular Expression Engines and Literal Clusters
On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode wrote:

> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>> should match. For example, with PCRE syntax, GNU grep Version 2.25
>> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
>> canonical equivalence, I want both to match [:Lu:], and that's what
>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?

Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
should not match <U+004D, U+0302 COMBINING CIRCUMFLEX ACCENT>.

Now, I could invent a string property \p{xLu} that meant
(?:\p{Lu}\p{Mn}*).

I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match". I think only Greek letters expand to 4 characters in NFD.

When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
equivalent <U+0E49, U+0E39>. The canonical closure of that sequence
can be messy even within scripts. Some pairs commute; others don't,
usually for good reasons.

Regards,

Richard.
Re: Pure Regular Expression Engines and Literal Clusters
On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:

> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match. For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what
> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
instead of formally handling NFD, you could extend the syntax to
handle "inherited" properties across combining sequences.

Am I missing anything?

A./
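Asmus's proposed rewrite is easy to try with the third-party Python
"regex" module, which, unlike the stdlib "re", supports \p{...}
properties. This is only a sketch of the idea, and it shows its weak
point: it also matches sequences that have no precomposed form, which
is exactly the objection Richard raises in his reply above.

    import regex  # third-party: pip install regex

    # Treat an uppercase letter plus any trailing nonspacing marks as
    # a single unit.
    text = "\u0100 A\u0304 M\u0302"  # precomposed Ā, A+macron, M+circumflex
    print(regex.findall(r"\p{Lu}\p{Mn}*", text))
    # All three match, including M+circumflex, which has no
    # precomposed form.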
Re: Pure Regular Expression Engines and Literal Clusters
On Mon, 14 Oct 2019 00:22:36 +0200
Hans Åberg via Unicode wrote:

>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode wrote:
>>
>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>> should match. For example, with PCRE syntax, GNU grep Version 2.25
>> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
>> canonical equivalence, I want both to match [:Lu:], and that's what
>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Hopefully some experts here can tune in, explaining exactly what
> regular expressions they have in mind.

The best indication lies at
https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
(2008), which is the last version before support for canonical
equivalence was dropped as a requirement. It's not entirely coherent,
as the authors don't seem to find an expression like
\p{L}\p{gcb=extend}* a natural thing to use, as the second factor is
mostly sequences of non-starters.

At that point, I would say they weren't expecting \p{Lu} not to match
<U+0041, U+0304>, as they were still expecting [ä] to match both "ä"
and "a\u0308". They hadn't given any thought to
[\p{L}&&\p{isNFD}]\p{gcb=extend}*, and were expecting normalisation
(even to NFC) to be a possible cure. They had begun to realise that
converting expressions to match all or none of a set of canonical
equivalents was hard; the issue of non-contiguous matches wasn't
mentioned.

When I say 'hard', I'm thinking of the problem that concatenation may
require dissolution of the two constituent expressions and involve the
temporary creation of 54-fold (if text is handled as NFD) or 2^54-fold
(no normalisation) sets of extra states. That's what's driven me to
write my own regular expression engine for traces.

Regards,

Richard.
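The 54-fold figure corresponds to the number of distinct non-zero
canonical combining classes, one per non-starter class. A
back-of-the-envelope check against the Unicode data shipped with
Python (not part of Richard's engine; the exact count depends on the
Unicode version):

    import sys
    import unicodedata

    # Collect every canonical combining class (ccc) assigned to some
    # code point; the non-zero classes are the non-starters.
    classes = {unicodedata.combining(chr(cp))
               for cp in range(sys.maxunicode + 1)}
    classes.discard(0)
    print(len(classes))  # roughly 54-55, depending on Unicode version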
Re: Pure Regular Expression Engines and Literal Clusters
> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode wrote:
>
> The point about these examples is that the estimate of one state per
> character becomes a severe underestimate. For example, after
> processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/
> can be in any of about 50 states. The number of possible states is
> not linear in the length of the expression. While a 'loop iteration'
> can keep the size of the compiled regex down, it doesn't prevent the
> proliferation of states - just add zeroes to my example.

Formally, only the expansions of such ranges are NFAs, and I haven't
seen anyone considering the complexity with them included. So to me,
it seems just a hack.

>> I made some C++ templates that translate Unicode code point
>> character classes into UTF-8/32 regular expressions. So anything
>> that can be reduced to actual regular expressions would work.
>
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match. For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what
> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Hopefully some experts here can tune in, explaining exactly what
regular expressions they have in mind.
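Hans's point is that {m,n} is syntactic sugar: the textbook NFA
constructions and their complexity bounds apply to the expanded form,
not to an engine that counts repetitions directly. A toy expansion in
Python (the helper below is hypothetical, purely to show the blow-up):

    # Expand /c{lo,hi}/ into an explicit alternation, the form that
    # the classical NFA constructions actually apply to.
    def expand(cls, lo, hi):
        return "(?:" + "|".join(cls * k for k in range(lo, hi + 1)) + ")"

    print(expand("[ab]", 0, 3))        # (?:|[ab]|[ab][ab]|[ab][ab][ab])
    print(len(expand("[ab]", 0, 20)))  # the pattern grows quadratically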
Re: Pure Regular Expression Engines and Literal Clusters
On Sun, 13 Oct 2019 22:14:10 +0200
Hans Åberg via Unicode wrote:

>> On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode wrote:
>>
>> Incidentally, at least some of the sizes and timings I gave seem to
>> be wrong even for strings. They won't work with numeric
>> quantifiers, as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.
>>
>> One gets lesser issues in quantifying complexity if one wants "Å"
>> to match \p{Lu} when working in NFD - potentially a different state
>> for each prefix of the capital letters. (It's also the case except
>> for UTF-32 if characters are treated as sequences of code units.)
>> Perhaps 'upper case letter that Unicode happens to have encoded as
>> a single character' isn't a concept that regular expressions need
>> to support concisely. What's needed is to have a set somewhere
>> between [\p{Lu}&&\p{isNFD}] and [\p{Lu}], though perhaps it should
>> be extended to include "ff" - there are English surnames like
>> "ffrench".

The point about these examples is that the estimate of one state per
character becomes a severe underestimate. For example, after
processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
be in any of about 50 states. The number of possible states is not
linear in the length of the expression. While a 'loop iteration' can
keep the size of the compiled regex down, it doesn't prevent the
proliferation of states - just add zeroes to my example.

> I made some C++ templates that translate Unicode code point
> character classes into UTF-8/32 regular expressions. So anything
> that can be reduced to actual regular expressions would work.

Besides invalidating complexity metrics, the issue was what \p{Lu}
should match. For example, with PCRE syntax, GNU grep Version 2.25
\p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
canonical equivalence, I want both to match [:Lu:], and that's what I
do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Regards,

Richard.
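Richard's state count is easy to probe with a rough, pattern-specific
sketch: represent a configuration as (factor, repetitions consumed),
feed in twenty a's, and count what is live. The exact number depends
on how the automaton is encoded; this compact encoding yields 34 live
configurations, while a Thompson-style construction, with more states
per position, gives a figure like the "about 50" above.

    # Counted-quantifier bounds and alphabets for the three factors of
    # /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.
    BOUNDS = [(0, 20, "ab"), (10, 20, "ac"), (0, 20, "ad")]

    def closure(configs):
        # Epsilon moves: once a factor has met its minimum count, the
        # automaton may also stand at the start of the next factor;
        # (3, 0) stands for "expecting the final e".
        out, stack = set(configs), list(configs)
        while stack:
            f, n = stack.pop()
            if f < 3 and n >= BOUNDS[f][0] and (f + 1, 0) not in out:
                out.add((f + 1, 0))
                stack.append((f + 1, 0))
        return out

    def step(configs, ch):
        nxt = set()
        for f, n in configs:
            if f < 3 and ch in BOUNDS[f][2] and n < BOUNDS[f][1]:
                nxt.add((f, n + 1))   # consume ch inside factor f
            if f == 3 and ch == "e":
                nxt.add((4, 0))       # accepting configuration
        return closure(nxt)

    states = closure({(0, 0)})
    for _ in range(20):
        states = step(states, "a")
    print(len(states))                # 34 under this encoding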
Re: Pure Regular Expression Engines and Literal Clusters
> On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode wrote:
>
> On Sun, 13 Oct 2019 15:29:04 +0200
> Hans Åberg via Unicode wrote:
>
>>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode wrote:
>>>
>>> I'm now beginning to wonder what you are claiming.
>>
>> I start with an NFA with no empty transitions and apply the subset
>> DFA construction dynamically for a given string, along with some
>> reverse NFA data that is enough to traverse backwards when a final
>> state arrives. The result is an NFA where every traversal is a
>> match of the string at that position.
>
> And then the speed comparison depends on how quickly one can extract
> the match information required from that data structure.

Yes. For example, one should match against the saved DFA in constant
time; if it is instead matched as dynamic sets, which is linear in the
set size, then one can get quadratic time complexity in the string
size.

Even though one can iterate through each match of the NFA in linear
time, it could have, say, two choices at each character position, each
leading to the next, which would give an exponential number of matches
relative to the string length. Normally one is not interested in all
matches; that is what the disambiguation rules are for.

> Incidentally, at least some of the sizes and timings I gave seem to
> be wrong even for strings. They won't work with numeric quantifiers,
> as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

For those, one normally implements a loop iteration. I did not do
that. I mentioned this method to Tim Shen on the libstdc++ list, so
perhaps he might have implemented something.

> One gets lesser issues in quantifying complexity if one wants "Å" to
> match \p{Lu} when working in NFD - potentially a different state for
> each prefix of the capital letters. (It's also the case except for
> UTF-32 if characters are treated as sequences of code units.)
> Perhaps 'upper case letter that Unicode happens to have encoded as a
> single character' isn't a concept that regular expressions need to
> support concisely. What's needed is to have a set somewhere between
> [\p{Lu}&&\p{isNFD}] and [\p{Lu}], though perhaps it should be
> extended to include "ff" - there are English surnames like
> "ffrench".

I made some C++ templates that translate Unicode code point character
classes into UTF-8/32 regular expressions. So anything that can be
reduced to actual regular expressions would work.
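Stripped of the reverse NFA data and the submatch extraction, the core
of what Hans describes is the classic simulation that pushes a set of
live states through the string in a single pass, linear in the string
length for a fixed automaton. A minimal Python sketch on a hard-coded,
epsilon-free NFA (an illustration of the general technique, not Hans's
implementation):

    def simulate(transitions, start, accepting, text):
        """transitions maps (state, char) to a set of successor states."""
        live = {start}
        trace = [live]            # per-position state sets: the kind of
        for ch in text:           # record a backward pass could traverse
            live = {t for s in live
                    for t in transitions.get((s, ch), ())}
            trace.append(live)
        return accepting in live, trace

    # /ab*a/ as an epsilon-free NFA: 0 -a-> 1, 1 -b-> 1, 1 -a-> 2
    nfa = {(0, "a"): {1}, (1, "b"): {1}, (1, "a"): {2}}
    matched, trace = simulate(nfa, 0, 2, "abba")
    print(matched)   # True
    print(trace)     # the saved state sets, one per input position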
Re: Pure Regular Expression Engines and Literal Clusters
On Sun, 13 Oct 2019 15:29:04 +0200
Hans Åberg via Unicode wrote:

>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode wrote:
>>
>> I'm now beginning to wonder what you are claiming.
>
> I start with an NFA with no empty transitions and apply the subset
> DFA construction dynamically for a given string, along with some
> reverse NFA data that is enough to traverse backwards when a final
> state arrives. The result is an NFA where every traversal is a match
> of the string at that position.

And then the speed comparison depends on how quickly one can extract
the match information required from that data structure.

Incidentally, at least some of the sizes and timings I gave seem to be
wrong even for strings. They won't work with numeric quantifiers, as
in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

One gets lesser issues in quantifying complexity if one wants "Å" to
match \p{Lu} when working in NFD - potentially a different state for
each prefix of the capital letters. (It's also the case except for
UTF-32 if characters are treated as sequences of code units.) Perhaps
'upper case letter that Unicode happens to have encoded as a single
character' isn't a concept that regular expressions need to support
concisely. What's needed is to have a set somewhere between
[\p{Lu}&&\p{isNFD}] and [\p{Lu}], though perhaps it should be extended
to include "ff" - there are English surnames like "ffrench".

Regards,

Richard.
Re: Pure Regular Expression Engines and Literal Clusters
> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode wrote:
>
>>> On Sat, 12 Oct 2019 21:36:45 +0200
>>> Hans Åberg via Unicode wrote:
>>>
>>>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode wrote:
>>>>>
>>>>> But remember that 'having longer first' is meaningless for a
>>>>> non-deterministic finite automaton that does a single pass
>>>>> through the string to be searched.
>>>>
>>>> It is possible to identify all submatches deterministically in
>>>> linear time without backtracking - I made an algorithm for that.
>
> I'm now beginning to wonder what you are claiming.

I start with an NFA with no empty transitions and apply the subset DFA
construction dynamically for a given string, along with some reverse
NFA data that is enough to traverse backwards when a final state
arrives. The result is an NFA where every traversal is a match of the
string at that position.
Re: Pure Regular Expression Engines and Literal Clusters
On Sun, 13 Oct 2019 10:04:34 +0200
Hans Åberg via Unicode wrote:

>> On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode wrote:
>>
>> On Sat, 12 Oct 2019 21:36:45 +0200
>> Hans Åberg via Unicode wrote:
>>
>>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode wrote:
>>>>
>>>>> But remember that 'having longer first' is meaningless for a
>>>>> non-deterministic finite automaton that does a single pass
>>>>> through the string to be searched.
>>>>
>>>> It is possible to identify all submatches deterministically in
>>>> linear time without backtracking - I made an algorithm for that.
>>
>> That's impressive, as the number of possible submatches for
>> a*(a*)a* is quadratic in the string length.
>
> That is probably after the possibilities in the matching graph have
> been expanded, which can even be exponential. As an analogy, think
> of a polynomial product: I compute the product, not the expansion.

I'm now beginning to wonder what you are claiming.

One thing one can do without backtracking is to determine which
capture groups capture something, and which combinations of capturing
or not occur. That's a straightforward extension of doing the overall
'recognition' in linear time - at least, linear in the length (n) of
the searched string. (I say straightforward, but it would mess up my
state naming algorithm.)

The time can also depend on the complexity of the regular expression,
which can be bounded by the length (m) of the expression if working
with mere strings, giving time O(mn) if one doesn't undertake the
worst-case O(2^m) task of converting the non-deterministic FSM to a
deterministic FSM.

Using m as a complexity measure for traces may be misleading, and I
think plain wrong; for moderate m, the complexity can easily go up as
fast as m^10, and I think higher powers are possible. Strings
exercising the higher complexities are linguistically implausible.

Regards,

Richard.
Re: Pure Regular Expression Engines and Literal Clusters
> On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode wrote:
>
> On Sat, 12 Oct 2019 21:36:45 +0200
> Hans Åberg via Unicode wrote:
>
>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode wrote:
>>>
>>> But remember that 'having longer first' is meaningless for a
>>> non-deterministic finite automaton that does a single pass through
>>> the string to be searched.
>>
>> It is possible to identify all submatches deterministically in
>> linear time without backtracking - I made an algorithm for that.
>
> That's impressive, as the number of possible submatches for a*(a*)a*
> is quadratic in the string length.

That is probably after the possibilities in the matching graph have
been expanded, which can even be exponential. As an analogy, think of
a polynomial product: I compute the product, not the expansion.
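Richard's quadratic count is easy to verify: on a string of n a's, the
capture group in /a*(a*)a*/ can take any span (i, j) with
0 <= i <= j <= n, which gives (n+1)(n+2)/2 possibilities. A quick
check in Python:

    # Enumerate the possible spans of the capture group in /a*(a*)a*/
    # on "a" * n, and compare with the closed form.
    def capture_spans(n):
        return sum(1 for i in range(n + 1) for j in range(i, n + 1))

    for n in (1, 2, 10, 100):
        assert capture_spans(n) == (n + 1) * (n + 2) // 2
        print(n, capture_spans(n))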