Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Mark Davis ☕️ via Unicode
The problem is that most regex engines are not written to handle some
"interesting" features of canonical equivalence, like discontinuity.
Suppose that X is canonically equivalent to AB.

   - A query /X/ can match the separated A and B in the target string
   "AbB". So if I have code to do [replace /X/ in "AbB" by "pq"], how
   should it behave? "pqb", "pbq", "bpq"? If the input was in NFD (for
   example), should the output be rearranged/decomposed so that it is
   NFD? And so on.
   - A query /A/ can match *part* of the X in the target string "aXb". So
   if I have code to do [replace /A/ in "aXb" by "pq"], what should
   result: "apqBb"?

The syntax and APIs of regex engines are not built to handle these
features. They introduce enough complications in the code, syntax, and
semantics that no major implementation has seen fit to support them. We
used to have a section in the spec about this, but were convinced that
it was better off handled at a higher level.
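[Editorial note: the discontinuity problem is easy to reproduce with an
off-the-shelf engine. A minimal Python sketch, using the concrete pair
ä ≡ a + U+0308 in place of the abstract X ≡ AB; Python's re compares
code points, not canonical equivalence classes:]

```python
import re
import unicodedata

nfc = "\u00e4"                               # "ä", precomposed
nfd = unicodedata.normalize("NFD", nfc)      # "a" + U+0308

# /ä/ fails to find its own canonical equivalent in NFD text:
print(re.search(nfc, nfd))                   # None

# Only after re-normalizing the target does the replacement behave, and
# the engine never has to answer the "pqb"/"pbq"/"bpq" question above:
print(re.sub(nfc, "pq", unicodedata.normalize("NFC", nfd)))  # pq
```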

Mark


On Sun, Oct 13, 2019 at 8:31 PM Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
>
> On Sun, 13 Oct 2019 17:13:28 -0700
> Asmus Freytag via Unicode   wrote:
>
>
> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?
>
> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
> should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
> CIRCUMFLEX ACCENT>.
>
> Why does it matter if it is precomposed? Why should it? (For anyone other
> than a character coding maven).
>
> Now, I could invent a string property \p{xLu} that meant
> (?:\p{Lu}\p{Mn}*).
>
> I don't entirely understand what you said; you may have missed the
> distinction between "[:Lu:] can then match" and "[:Lu:] will then
> match".  I think only Greek letters expand to 4 characters in NFD.
>
> When I'm respecting canonical equivalence/working with traces, I want
> [:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
> CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
> equivalent <U+0E49, U+0E39>.  The canonical closure of that sequence
> can be messy even within scripts.  Some pairs commute: others don't,
> usually for good reasons.
>
> Some models may be more natural for different scripts. Certainly, in SEA
> or Indic scripts, most combining marks are not best modeled with properties
> as "inherited". But for L/G/C etc. it would be a different matter.
>
> For general recommendations, such as UTS#18, it would be good to move the
> state of the art so that the "primitives" are in line with the way typical
> writing systems behave, so that people can write "linguistically correct"
> regexes.
>
> A./
>
>
> Regards,
>
> Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Asmus Freytag via Unicode

  
  
On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:

> On Sun, 13 Oct 2019 17:13:28 -0700
> Asmus Freytag via Unicode wrote:
>
>> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what I
>>> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>>
>> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
>> instead of formally handling NFD, you could extend the syntax to
>> handle "inherited" properties across combining sequences.
>>
>> Am I missing anything?
>
> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
> should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
> CIRCUMFLEX ACCENT>.

Why does it matter if it is precomposed? Why should it? (For anyone
other than a character coding maven.)

> Now, I could invent a string property \p{xLu} that meant
> (?:\p{Lu}\p{Mn}*).
>
> I don't entirely understand what you said; you may have missed the
> distinction between "[:Lu:] can then match" and "[:Lu:] will then
> match".  I think only Greek letters expand to 4 characters in NFD.
>
> When I'm respecting canonical equivalence/working with traces, I want
> [:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
> CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
> equivalent <U+0E49, U+0E39>.  The canonical closure of that sequence
> can be messy even within scripts.  Some pairs commute: others don't,
> usually for good reasons.

Some models may be more natural for different scripts. Certainly, in
SEA or Indic scripts, most combining marks are not best modeled with
properties as "inherited". But for L/G/C etc. it would be a different
matter.

For general recommendations, such as UTS #18, it would be good to move
the state of the art so that the "primitives" are in line with the way
typical writing systems behave, so that people can write
"linguistically correct" regexes.

A./

> Regards,
>
> Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode  wrote:

> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
> 
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
> 
> Am I missing anything?

Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
CIRCUMFLEX ACCENT>.  Now, I could invent a string property \p{xLu} that
meant (?:\p{Lu}\p{Mn}*).

I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match".  I think only Greek letters expand to 4 characters in NFD.

When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
equivalent <U+0E49, U+0E39>.  The canonical closure of that sequence
can be messy even within scripts.  Some pairs commute: others don't,
usually for good reasons.
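[Editorial note: the commuting pair above can be checked mechanically
with Python's unicodedata — a sketch, not part of the thread. U+0E39 and
U+0E49 carry distinct non-zero combining classes (103 and 107), so
canonical reordering treats the two orders as equivalent:]

```python
import unicodedata

sara_uu, mai_tho = "\u0e39", "\u0e49"    # THAI SARA UU, THAI MAI THO

# Distinct non-zero combining classes => the marks commute canonically.
print(unicodedata.combining(sara_uu), unicodedata.combining(mai_tho))  # 103 107

a = unicodedata.normalize("NFD", sara_uu + mai_tho)
b = unicodedata.normalize("NFD", mai_tho + sara_uu)
print(a == b)    # True: both orders normalize to <U+0E39, U+0E49>
```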

Regards,

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Asmus Freytag via Unicode

  
  
On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:

> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
instead of formally handling NFD, you could extend the syntax to handle
"inherited" properties across combining sequences.

Am I missing anything?

A./


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 00:22:36 +0200
Hans Åberg via Unicode  wrote:

> > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
> >  wrote:

>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
>> canonical equivalence, I want both to match [:Lu:], and that's what
>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
 
> Hopefully some experts here can tune in, explaining exactly what
> regular expressions they have in mind.

The best indication lies at
https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
(2008), which is the last version before support for canonical
equivalence was dropped as a requirement.

It's not entirely coherent, as the authors don't seem to find an
expression like

\p{L}\p{gcb=extend}*

a natural thing to use, as the second factor is mostly sequences of
non-starters.  At that point, I would say they weren't expecting
\p{Lu} to not match <U+0041, U+0304>, as they were still expecting [ä] to
match both "ä" and "a\u0308".

They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and
were expecting normalisation (even to NFC) to be a possible cure.  They
had begun to realise that converting expressions to match all or none
of a set of canonical equivalents was hard; the issue of non-contiguous
matches wasn't mentioned.

When I say 'hard', I'm thinking of the problem that concatenation may
require dissolution of the two constituent expressions and involve the
temporary creation of 54-fold (if text is handled as NFD) or 2^54-fold
(no normalisation) sets of extra states.  That's what's driven me to
write my own regular expression engine for traces.
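[Editorial note: the combinatorics behind that blow-up can be
illustrated with a brute-force enumeration of a canonical equivalence
class — a hypothetical helper, not Richard's engine:]

```python
import unicodedata
from itertools import permutations

def equivalents(s):
    """All reorderings of the combining marks of s that are canonically
    equivalent to s (brute force; only sensible for short sequences)."""
    base, marks = s[0], s[1:]
    nfd = unicodedata.normalize("NFD", s)
    return {base + "".join(p)
            for p in permutations(marks)
            if unicodedata.normalize("NFD", base + "".join(p)) == nfd}

# U+0323 (class 220, below) and U+0302 (class 230, above) commute, so a
# two-mark sequence already has two canonically equivalent spellings;
# each extra mark of a distinct class multiplies the count.
print(len(equivalents("a\u0323\u0302")))   # 2
```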

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode


> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode 
>  wrote:
> 
> The point about these examples is that the estimate of one state per
> character becomes a severe underestimate.  For example, after
> processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
> be in any of about 50 states.  The number of possible states is not
> linear in the length of the expression.  While a 'loop iteration' can
> keep the size of the compiled regex down, it doesn't prevent the
> proliferation of states - just add zeroes to my example. 

Formally, only the expansion of such ranges is an NFA, and I haven't
seen anyone consider the complexity with them included. So to me, it
seems just a hack.

>> I made some C++ templates that translate Unicode code point character
>> classes into UTF-8/32 regular expressions. So anything that can be
>> reduced to actual regular expressions would work. 
> 
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Hopefully some experts here can tune in, explaining exactly what regular 
expressions they have in mind.





Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 22:14:10 +0200
Hans Åberg via Unicode  wrote:

> > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode
> >  wrote:

> > Incidentally, at least some of the sizes and timings I gave seem to
> > be wrong even for strings.  They won't work with numeric
> > quantifiers, as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.  

> > One gets lesser issues in quantifying complexity if one wants "Å" to
> > match \p{Lu} when working in NFD - potentially a different state for
> > each prefix of the capital letters.  (It's also the case except for
> > UTF-32 if characters are treated as sequences of code units.)
> > Perhaps 'upper case letter that Unicode happens to have encoded as
> > a single character' isn't a concept that regular expressions need
> > to support concisely. What's needed is to have a set somewhere
> > between [\p{Lu}&\p{isNFD}] and [\p{Lu}], though perhaps it should be
> > extended to include "ff" - there are English surnames like
> > "ffrench".

The point about these examples is that the estimate of one state per
character becomes a severe underestimate.  For example, after
processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
be in any of about 50 states.  The number of possible states is not
linear in the length of the expression.  While a 'loop iteration' can
keep the size of the compiled regex down, it doesn't prevent the
proliferation of states - just add zeroes to my example. 
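[Editorial note: that proliferation is easy to observe with a toy
simulation. The sketch below uses an assumed state encoding — each NFA
state is a (segment, count) pair of the counter automaton for
/[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ — so the exact total differs from
Richard's figure, but the growth is the same phenomenon:]

```python
# Toy counter-automaton for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/:
# each state is (segment index, chars consumed in that segment).

def close(states, specs):
    """Epsilon-closure: a segment may be left once its minimum is met."""
    states = set(states)
    changed = True
    while changed:
        changed = False
        for seg, cnt in list(states):
            if seg < len(specs) and cnt >= specs[seg][1]:
                nxt = (seg + 1, 0)
                if nxt not in states:
                    states.add(nxt)
                    changed = True
    return states

def step(states, ch, specs):
    """Consume one character from every active state."""
    out = set()
    for seg, cnt in states:
        if seg < len(specs):
            chars, lo, hi = specs[seg]
            if ch in chars and cnt < hi:
                out.add((seg, cnt + 1))
    return close(out, specs)

specs = [("ab", 0, 20), ("ac", 10, 20), ("ad", 0, 20)]
states = close({(0, 0)}, specs)
for _ in range(20):
    states = step(states, "a", specs)

print(len(states))   # 34 distinct active states with this encoding
```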

> I made some C++ templates that translate Unicode code point character
> classes into UTF-8/32 regular expressions. So anything that can be
> reduced to actual regular expressions would work. 

Besides invalidating complexity metrics, the issue was what \p{Lu}
should match.  For example, with PCRE syntax, GNU grep Version 2.25
\p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
canonical equivalence, I want both to match [:Lu:], and that's what I
do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode


> On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode 
>  wrote:
> 
> On Sun, 13 Oct 2019 15:29:04 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
>>> I'm now beginning to wonder what you are claiming.  
> 
>> I start with an NFA with no empty transitions and apply the subset
>> DFA construction dynamically for a given string, along with some
>> reverse NFA data that is enough to traverse backwards when a final
>> state arrives. The result is an NFA where every traversal is a match
>> of the string at that position.
> 
> And then the speed comparison depends on how quickly one can extract
> the match information required from that data structure.

Yes. For example, one should match against the saved DFA in constant
time; if it is instead matched as dynamic sets, which is linear in the
set size, one can get quadratic time complexity in the string size.

Even though one can iterate through each match NFA in linear time, it
could have, say, two choices at each character position, each leading to
the next, which would give an exponential number of matches relative to
the string length.

Normally one is not interested in all matches; that is what the
disambiguation rules are for.

> Incidentally, at least some of the sizes and timings I gave seem to be
> wrong even for strings.  They won't work with numeric quantifiers, as
> in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

For those, one normally implements a loop iteration. I did not do that.
I mentioned this method to Tim Shen on the libstdc++ list, so perhaps he
might have implemented something.

> One gets lesser issues in quantifying complexity if one wants "Å" to
> match \p{Lu} when working in NFD - potentially a different state for
> each prefix of the capital letters.  (It's also the case except for
> UTF-32 if characters are treated as sequences of code units.)  Perhaps
> 'upper case letter that Unicode happens to have encoded as a single
> character' isn't a concept that regular expressions need to support
> concisely. What's needed is to have a set somewhere between
> [\p{Lu}&\p{isNFD}] and [\p{Lu}], though perhaps it should be extended
> to include "ff" - there are English surnames like "ffrench".

I made some C++ templates that translate Unicode code point character classes 
into UTF-8/32 regular expressions. So anything that can be reduced to actual 
regular expressions would work. 





Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 15:29:04 +0200
Hans Åberg via Unicode  wrote:

> > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
> > I'm now beginning to wonder what you are claiming.  

> I start with an NFA with no empty transitions and apply the subset
> DFA construction dynamically for a given string, along with some
> reverse NFA data that is enough to traverse backwards when a final
> state arrives. The result is an NFA where every traversal is a match
> of the string at that position.

And then the speed comparison depends on how quickly one can extract
the match information required from that data structure.

Incidentally, at least some of the sizes and timings I gave seem to be
wrong even for strings.  They won't work with numeric quantifiers, as
in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

One gets lesser issues in quantifying complexity if one wants "Å" to
match \p{Lu} when working in NFD - potentially a different state for
each prefix of the capital letters.  (It's also the case except for
UTF-32 if characters are treated as sequences of code units.)  Perhaps
'upper case letter that Unicode happens to have encoded as a single
character' isn't a concept that regular expressions need to support
concisely. What's needed is to have a set somewhere between
[\p{Lu}&\p{isNFD}] and [\p{Lu}], though perhaps it should be extended to
include "ff" - there are English surnames like "ffrench".

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode


> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode 
>  wrote:
> 
> On Sat, 12 Oct 2019 21:36:45 +0200
> Hans Åberg via Unicode wrote:
>
>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode wrote:
>>>
>>> But remember that 'having longer first' is meaningless for a
>>> non-deterministic finite automaton that does a single pass through
>>> the string to be searched.
>>
>> It is possible to identify all submatches deterministically in
>> linear time without backtracking — I made an algorithm for that.
>
> I'm now beginning to wonder what you are claiming.

I start with an NFA with no empty transitions and apply the subset DFA
construction dynamically for a given string, along with some reverse
NFA data that is enough to traverse backwards when a final state
arrives. The result is an NFA where every traversal is a match of the
string at that position.
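[Editorial note: Hans's description can be sketched as follows — a toy
illustration, not his implementation. Simulate an epsilon-free NFA left
to right, keep the subset of active states, and record for each step
which predecessor produced each state; walking those reverse edges back
from an accepting state enumerates the matching traversals:]

```python
# Epsilon-free NFA for (a|b)*ab (hypothetical example):
# state -> char -> set of successor states.
delta = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
}
accept = {2}
start = 0

def run(s):
    """Subset construction on the fly; record reverse edges per step."""
    active = {start}
    back = []                     # back[i]: state -> its predecessors
    for ch in s:
        edges, nxt = {}, set()
        for q in active:
            for r in delta.get(q, {}).get(ch, ()):
                nxt.add(r)
                edges.setdefault(r, set()).add(q)
        back.append(edges)
        active = nxt
    return active, back

def matches(active, back):
    """Walk the reverse edges from accepting states: every path found
    is one accepting traversal of the NFA over the whole string."""
    paths = {q: [[q]] for q in active & accept}
    for edges in reversed(back):
        prev = {}
        for q, tails in paths.items():
            for p in edges.get(q, ()):
                prev.setdefault(p, []).extend([p] + t for t in tails)
        paths = prev
    return paths.get(start, [])

active, back = run("aab")
print(matches(active, back))   # [[0, 0, 1, 2]]: the single traversal
```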




Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 10:04:34 +0200
Hans Åberg via Unicode  wrote:

> > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode
> >  wrote:
> > 
> > On Sat, 12 Oct 2019 21:36:45 +0200
> > Hans Åberg via Unicode  wrote:
> >   
> >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> >>>  wrote:
> >>> 
> >>> But remember that 'having longer first' is meaningless for a
> >>> non-deterministic finite automaton that does a single pass through
> >>> the string to be searched.
> >> 
> >> It is possible to identify all submatches deterministically in
> >> linear time without backtracking — I made an algorithm for
> >> that.  
> > 
> > That's impressive, as the number of possible submatches for
> > a*(a*)a* is quadratic in the string length.  
> 
> That is probably after the possibilities in the matching graph have
> been expanded, which can even be exponential. As an analogy, think of
> a polynomial product: I compute the product, not the expansion.

I'm now beginning to wonder what you are claiming. One thing one can
do without backtracking is to determine which capture groups capture
something, and which combinations of capturing or not occur.  That's
a straightforward extension of doing the overall 'recognition' in linear
time - at least, linear in length (n) of the searched string.  (I say
straightforward, but it would mess up my state naming algorithm.)  The
time can also depend on the complexity of the regular expression, which
can be bounded by the length (m) of the expression if working with mere
strings, giving time O(mn) if one doesn't undertake the worst case
O(2^m) task of converting the non-deterministic FSM to a deterministic
FSM.

Using m as a complexity measure for traces may be misleading, and I
think plain wrong; for moderate m, the complexity can easily go up as
fast as m^10, and I think higher powers are possible.  Strings
exercising the higher complexities are linguistically implausible.

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode


> On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode 
>  wrote:
> 
> On Sat, 12 Oct 2019 21:36:45 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
>>>  wrote:
>>> 
>>> But remember that 'having longer first' is meaningless for a
>>> non-deterministic finite automaton that does a single pass through
>>> the string to be searched.  
>> 
>> It is possible to identify all submatches deterministically in linear
>> time without backtracking — I made an algorithm for that.
> 
> That's impressive, as the number of possible submatches for a*(a*)a* is
> quadratic in the string length.

That is probably after the possibilities in the matching graph have been
expanded, which can even be exponential. As an analogy, think of a
polynomial product: I compute the product, not the expansion.
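[Editorial note: the quadratic submatch count Richard cited earlier in
the thread is easy to confirm by brute force — a sketch that forces each
candidate span with anchored counted patterns rather than interrogating
the engine's own disambiguation:]

```python
import re

# For "a" * n, every span (i, j) is a feasible capture for the inner
# group of a*(a*)a*, giving (n + 1)(n + 2)/2 distinct submatches.
n = 10
s = "a" * n
spans = set()
for i in range(n + 1):
    for j in range(i, n + 1):
        # Can a*(a*)a* match s with group 1 capturing exactly s[i:j]?
        m = re.fullmatch(f"a{{{i}}}(a{{{j - i}}})a{{{n - j}}}", s)
        if m:
            spans.add(m.span(1))

print(len(spans))   # 66 == (n + 1) * (n + 2) // 2, quadratic in n
```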