On Thu, 1 Feb 2018 08:03:31 +0100 Philippe Verdy via Unicode <unicode@unicode.org> wrote:
> 2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode
> <unicode@unicode.org>:
>> On Wed, 31 Jan 2018 19:45:56 +0100
>> Philippe Verdy via Unicode <unicode@unicode.org> wrote:
>>> 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode
>>> <unicode@unicode.org>:
>>>> For example, <U WITH DIAERESIS AND MACRON, COMBINING DOT BELOW>
>>>> contains, under canonical equivalence, the substring
>>>> <U WITH DIAERESIS, COMBINING DOT BELOW>.  Your regular
>>>> expressions would not detect this relationship.

>>> My regular expression WILL detect this: scanning the text implies
>>> first composing it to "full equivalent decomposition form"
>>> (without even reordering it, and possibly recomposing it to NFD)
>>> while reading it and buffering it in the forward direction (it
>>> just requires the decomposition pairs from the UCD, including
>>> those that are "excluded" from NFC/NFD).

>> To be consistent, to find <U WITH DIAERESIS, COMBINING DOT BELOW>
>> you would construct (i.e. Philippe Verdy would construct)
>>
>> <LATIN SMALL LETTER U>
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
>> ( <COMBINING DIAERESIS>
>>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
>>   <COMBINING DOT BELOW>
>> | <COMBINING DOT BELOW>
>>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
>>   <COMBINING DIAERESIS>
>> )
>>
>> However, <U WITH DIAERESIS AND MACRON, COMBINING DOT BELOW>
>> decomposes to <LATIN SMALL LETTER U, COMBINING DIAERESIS,
>> COMBINING MACRON, COMBINING DOT BELOW>.  It doesn't match your
>> regular expression, for between COMBINING DIAERESIS and COMBINING
>> DOT BELOW there is COMBINING MACRON, for which ccc = above!

> And my regexp contained all the necessary asterisks, so yes, it
> does not match, because the combining macron blocks the combining
> dot below and combining diaeresis from commuting, and so there's no
> canonical equivalence.

I'm not sure what you mean by 'commuting' in this case, but either
your statement or your deduction is wrong!  Although adjacent
characters with the same non-zero canonical combining class cannot be
interchanged, that does not stop the members of the pair commuting
with a neighbour of a different non-zero ccc whilst preserving
canonical equivalence.  Thus the string being searched is canonically
equivalent to <LATIN SMALL LETTER U, COMBINING DIAERESIS, COMBINING
DOT BELOW, COMBINING MACRON>.

In your scheme, assuming you are looking for the most compact match,
one should generate the regular expression:

<LATIN SMALL LETTER U>
[[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
( <COMBINING DIAERESIS>
  [[ [^[[:cc=0:]]] - [[:cc=below:]] ]]*
  <COMBINING DOT BELOW>
| <COMBINING DOT BELOW>
  [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]*
  <COMBINING DIAERESIS>
)

Have you got a program doing this and reporting to you, or did you
assemble the construction by hand?  Constructing regular expressions
is known to be tricky.

You cannot replace this by a more restrictive albeit wordier regex,
as you suggested on Sunday 28 January
(http://www.unicode.org/mail-arch/unicode-ml/y2018-m01/0145.html).
There is no upper bound on the length of matching expressions.
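For concreteness, here is a minimal sketch in Python (my own
illustration, using only the standard unicodedata module; it is not a
model of either of our matching schemes).  For this particular pair,
canonical reordering already makes the relationship visible as a
plain substring of the NFD forms:

    import unicodedata as ud

    haystack = "\u01D6\u0323"  # <U WITH DIAERESIS AND MACRON, DOT BELOW>
    needle   = "\u00FC\u0323"  # <U WITH DIAERESIS, DOT BELOW>

    # NFD reorders COMBINING DOT BELOW (ccc=220) before the ccc=230
    # marks, so the needle's normal form is a prefix of the haystack's.
    print(ascii(ud.normalize("NFD", haystack)))  # 'u\u0323\u0308\u0304'
    print(ascii(ud.normalize("NFD", needle)))    # 'u\u0323\u0308'
    print(ud.normalize("NFD", needle) in ud.normalize("NFD", haystack))
    # True

(Substring matching under canonical equivalence is not in general
reducible to an NFD substring test; this case merely exhibits the
equivalence you claim does not exist.)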
>>> The automaton is of course using the classic NFA used by regexp
>>> engines (and not the DFA, which explodes combinatorially in all
>>> regexps), but it is still fully deterministic (the current
>>> "state" in the automaton is not a single integer for the node
>>> number in the traversal graph, but a set of node numbers; all
>>> regexps have a finite number of nodes in the traversal graph,
>>> this number being proportional to the length of the regexp, so it
>>> does not need a lot of memory, and the size of the current
>>> "state" is also fully bounded, never larger than the length of
>>> the regexp).  Optimizing some contextual parts of the NFA to a
>>> DFA is possible (to speed up the matching process and reduce the
>>> maximum storage size of the "current state"), but only if it does
>>> not cause a growth of the total number of nodes in the traversal
>>> graph, or as long as this growth does not exceed some threshold
>>> (e.g. not more than 2 or 3 times the regexp size).

>> In your claim, what is the length of the regexp for searching for
>> ậ in a trace?  Is it 3, or is it about 14?  If the former, I am
>> very interested in how you do it.  If the latter, I would say you
>> already have a form of blow-up in the way you cater for canonical
>> equivalence.

> For searching â, the transformed regexp is just
>
> <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]*
> <COMBINING CIRCUMFLEX>
>
> The NFA traversal graph contains nodes at the locations marked
> below by apostrophes:
>
> ' <LATIN SMALL LETTER A>
> ' [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]
> ' *
> ' <COMBINING CIRCUMFLEX>
> '
>
> It has 5 nodes only (assuming that the regexp engine will compute
> lookup tables to build the character classes).

I asked about ‘ậ’, not ‘â’.  Anyway, you’ve effectively answered the
question.  You’re talking about regular expressions for strings, not
regular expressions for traces.

> When there's a quantifier (like "*" here) which is not "{1,1}"
> after a character or character class, it inserts a node after it.
> No node is inserted in the NFA traversal graph for non-capturing
> parentheses, but nodes may be inserted for capturing parentheses,
> and there are the two nodes representing the start and the end of
> the regexp (1 node is also inserted before the "^" or "$" context
> delimiters to match the start and end of input lines (they are
> character classes, excluded from the capture), or the start and
> end of the input text (for regexps using the "multiline" flag)).

Quantifier nodes like {2,4} probably break down for
non-deterministic expressions like (a(bc)?|b){2,4}.  The string "ab"
contains 2 iterations, but the longer string "abc" contains only one
iteration.
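Since I cannot run your engine, here is a minimal sketch with
Python's ordinary backtracking re module (my illustration, not your
NFA) of that non-monotonicity; the iteration count is not a function
of the match length, which is what a per-quantifier counter node
would need:

    import re

    pat = re.compile(r"(?:a(?:bc)?|b){2,4}")

    # "ab" is consumed as two iterations: "a", then "b".
    print(bool(pat.fullmatch("ab")))   # True

    # "abc" can only be consumed as the single iteration "abc";
    # splitting it as "a" + "b" strands the final "c".  One
    # iteration is below the minimum of 2, so the longer string
    # fails to match.
    print(bool(pat.fullmatch("abc")))  # False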
>> Even with the dirty trick of normalising the searched trace for
>> input (I wanted the NFA propagation to be defined by the trace - I
>> also didn't want to have to worry about the well-formedness of
>> DFAs or NFAs), I found that the number of states for a
>> concatenation of regular languages of traces was bounded above by
>> the product of the numbers of states.

> For this worst case, you don't generate a product of the two NFA
> traversal graphs; you can directly concatenate the graphs, the
> growth in size remaining proportional to the total length of the
> initial regexp, with a small bounded factor (this factor depends on
> how you represent each node: with character classes possibly
> including SOT and EOT, or without character classes, with separate
> nodes for each character or SOT or EOT).  It does not seem
> unreasonable to build these character classes and compute their
> unions/intersections where necessary across branches when
> factorizing them.  You just need to take care of nodes added for
> non-{1,1} quantifiers.

Concatenation doesn’t work with traces when the first expression can
end in non-starters and the second expression can begin with them.
This is a real issue when parsing sequences of characters for
grammatical correctness, which is why I got interested in traces.

A regular trace expression of the form [:ccc=1:][:ccc=2:]…[:ccc=n:]
seems to require 2^n states in your scheme.  As I effectively only
apply the regex to NFD input strings, I use fewer states.  However,
the efficiency of my scheme depends on the order of the commuting
factors - reverse order would require the 2^n states.

Richard.
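PS: Here is a minimal sketch in plain Python (my illustration, not a
model of either engine) of why the 2^n arises: to recognise a trace
consisting of one factor from each of the classes 1..n, in any
order, a matcher must remember exactly which classes it has already
seen, i.e. a subset of {1, ..., n}:

    def match_commuting_factors(n, marks):
        # The matcher's whole state is this subset of {1, ..., n}.
        remaining = set(range(1, n + 1))
        for ccc in marks:
            if ccc not in remaining:
                return False      # repeated or unexpected class
            remaining.remove(ccc)
        return not remaining      # accept iff all n classes seen

    print(match_commuting_factors(3, [2, 1, 3]))  # True: any order
    print(match_commuting_factors(3, [2, 2, 1]))  # False: 2 repeated

All 2^n subsets are reachable.  On NFD input the classes arrive in
ascending order, so only the n+1 "prefix" subsets occur - hence the
order dependence mentioned above.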