2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode < unicode@unicode.org>:
> On Mon, 29 Jan 2018 14:15:04 +0100 > > The case of u with diaeresis and macron is simpler: it has two > > combining characters of the same combining class and they don't > > commute, still the regexp to match it is something like: > > > > U [[:cc>0:]-[:cc=above:]]* <DIAERESIS> [[:cc>0:]-[:cc=above:]]* > > <MACRON> [[:cc>0:]-[:cc=above:]]* > > <U WITH DIAERESIS AND MACRON> was meant to be an example of a searched > string. For example, <U WITH DIARESIS AND MACRON, COMBINING DOT BELOW> > contains, under canonical equivalence, the substring <U WITH DIAERESIS, > COMBINING DOT BELOW>. Your regular expressions would not detect this > relationship. My regular expression WILL detect this: scanning the text implies first composing it to "full equivalent decomposition form" (without even reordering it, and possibly recompose it to NFD) while reading it and bufering it in forward direction (it just requires the decomposition pairs from the UCD, including those that are "excluded" from NFC/NFD). The regexp exgine will then only process the "fully decomposed" input text to find matches, using the regexp transformed as above (which has some initial "complex" setup to "fully decompose" the initial regexp), but only once when constructing it, but not while processing the input text which can be then done straightforward with its full decomposition easily performed on the fly without any additional buffering except the very small lookahead whose length is never longer than the longest "canonical" decompositions in the UCD, i.e. at most 2 code points of look ahead). The automata is of course using the classic NFA used by regexp engine (and not the DFA which explodes combinatorially in all regexps), but which is still fully deterministic (the current "state" in the automata is not a single integer for the node number in the traversal graph, but a set of node numbers, and all regexps have a finite number of nodes in the traversal graph, this number being proportional to the length of the regexp, it does not need lot of memory, and the size of the current "state" is also fully bounded, never larger than the length of the regexp). Optimizing some contextual parts of the NFA to DFA is possible (to speedup the matching process and reduce the maximum storage size of the "current state") but only if it does not cause a growth of the total number of nodes in the traversal graph, or as long as this growth of the total number does not exceed some threshold e.g. not more than 2 or 3 times the regexp size). In practice, most regexps never exceed several hundreds of characters (including meta-characters of the regexp syntax itself), and the maximum number of active nodes in the graph traversal rarely exceeds 2 or 3, so the "current state" is not several hundreds integers, but a handful of integers, and een if you optimize the NFA partly to DFA, you can double or triple the number of nodes to significantly speedup the engine (in order to reduce the number of node numbers to store in the "current state"). Some common examples of reduction of nodes in the traversal graph is to compute character classes, or the local expansion of "bounded non-empty repetitions" (like in the regexp /x{m,n}/ when m>=1 and n is small).