2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <unicode@unicode.org>:
> On Wed, 31 Jan 2018 19:45:56 +0100
> Philippe Verdy via Unicode <unicode@unicode.org> wrote:
>
> > 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode
> > <unicode@unicode.org>:
> >
> > > On Mon, 29 Jan 2018 14:15:04 +0100
> > > <U WITH DIAERESIS AND MACRON> was meant to be an example of a
> > > searched string. For example, <U WITH DIAERESIS AND MACRON,
> > > COMBINING DOT BELOW> contains, under canonical equivalence, the
> > > substring <U WITH DIAERESIS, COMBINING DOT BELOW>. Your regular
> > > expressions would not detect this relationship.
> >
> > My regular expression WILL detect this: scanning the text implies
> > first composing it to "full equivalent decomposition form" (without
> > even reordering it, and possibly recomposing it to NFD) while reading
> > it and buffering it in the forward direction (it just requires the
> > decomposition pairs from the UCD, including those that are "excluded"
> > from NFC/NFD).
>
> No. To find <LATIN SMALL LETTER A WITH CIRCUMFLEX AND COMBINING DOT
> BELOW>, you constructed, on "Sun, 28 Jan 2018 20:30:44 +0100":
>
> <LATIN SMALL LETTER A>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> ( <COMBINING CIRCUMFLEX>
>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   <COMBINING DOT BELOW>
> | <COMBINING DOT BELOW>
>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   <COMBINING CIRCUMFLEX>
>
> To be consistent, to find <U WITH DIAERESIS, COMBINING DOT BELOW> you
> would construct
>
> <LATIN SMALL LETTER U>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> ( <COMBINING DIAERESIS>
>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   <COMBINING DOT BELOW>
> | <COMBINING DOT BELOW>
>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   <COMBINING DIAERESIS>
> )
>
> (A final ')' got lost between brain and text; I have restored it.)
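As a concrete illustration, the quoted construction can be sketched in Python, building the "intervening marks" class (ccc neither 0, below, nor above) from the UCD via the standard unicodedata module. This is a minimal sketch under my own naming; precomputing the class by scanning the whole repertoire is just the simplest way to obtain it, not something prescribed in the thread.

```python
import re
import sys
import unicodedata

# The "intervening marks" class from the construction above: combining
# characters whose ccc is neither 0 (starter), 220 (below) nor 230 (above).
INTERVENING = "".join(
    re.escape(chr(cp))
    for cp in range(sys.maxunicode + 1)
    if unicodedata.combining(chr(cp)) not in (0, 220, 230)
)
CLASS = "[" + INTERVENING + "]*"

DIA, DOT = "\u0308", "\u0323"  # COMBINING DIAERESIS, COMBINING DOT BELOW
pattern = re.compile(
    "u" + CLASS + "(?:" + DIA + CLASS + DOT + "|" + DOT + CLASS + DIA + ")"
)

# NFD reorders the dot below (ccc=220) before the diaeresis (ccc=230),
# so the second branch of the alternation is the one that fires.
text = unicodedata.normalize("NFD", "u\u0308\u0323")
assert pattern.search(text) is not None
```

The two branches of the alternation are what make the pattern insensitive to the relative order of the below and above marks in the searched text.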
This was a minor omission: ONLY this final parenthesis was missing, as it was truncated from its last single line where it was the only character (I don't know why it was truncated there, but it is easy to restore). You did not correct anything else.

> However, <U WITH DIAERESIS AND MACRON, COMBINING DOT BELOW> decomposes
> to <LATIN SMALL LETTER U, COMBINING DIAERESIS, COMBINING MACRON,
> COMBINING DOT BELOW>. It doesn't match your regular expression, for
> between COMBINING DIAERESIS and COMBINING DOT BELOW there is COMBINING
> MACRON, for which ccc = above!

And my regexp contained all the necessary asterisks, so yes, it does not match, because the combining macron blocks the combining dot below and the combining diaeresis from commuting, and so there's no canonical equivalence. <U, DOT BELOW, MACRON, DIAERESIS> or <U WITH MACRON, DOT BELOW, DIAERESIS> cannot be matched in any case by searching for <U, DOT BELOW, DIAERESIS> using canonical equivalence rules. So this regexp is perfectly correct. No error at all (except the missing final parenthesis), and my argument remains valid.

> > The regexp engine will then only process the "fully decomposed" input
> > text to find matches, using the regexp transformed as above (which
> > has some initial "complex" setup to "fully decompose" the initial
> > regexp, but only once when constructing it, not while processing the
> > input text, which can then be done straightforwardly, with the full
> > decomposition easily performed on the fly without any additional
> > buffering except a very small lookahead whose length is never longer
> > than the longest "canonical" decompositions in the UCD, i.e. at most
> > 2 code points of lookahead).
>
> Nitpick: U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND
> YPOGEGRAMMENI decomposes to <U+03B1, U+0313, U+0301, U+0345>.
>
> Conversion to NFD on input only requires a small buffer for natural
> orthographies.
> I suspect the worst in natural language will come from either narrow
> IPA transcriptions or Classical Greek.

OK, the canonical decompositions may expand to more than 2 code points, because some canonical decomposition pairs may themselves contain decomposable pairs, but this is still bounded (4 here). The complete set of full decompositions from the UCD is well known; it fits in a reasonably small static table for each version of Unicode (and its size grows very slowly, only when newly encoded characters are decomposable, without breaking the stability rules about all existing combining characters). Even if this expands the input text to 4 times its length (in number of code points), it still requires only a very small lookahead buffer on the input.

Very few entries in this table decompose to more than 2 code points, and this only occurs for the oldest characters in Unicode, notably in Greek, because there's a **single** case of a combining character that has a canonical decomposition pair (U+0344 COMBINING GREEK DIALYTIKA TONOS, which comes from a combining character mapped for compatibility from a legacy non-Unicode charset). All the other pairs are a base character (ccc=0), possibly decomposable again only one more time (e.g. Vietnamese Latin letters), plus a single non-decomposable combining character with ccc>0, giving a full decomposition of at most 3 characters: these encoded characters carry multiple diacritics and are quite rare in the UCD except in the extended Latin blocks.
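The bound claimed here is easy to check with a short Python scan over the repertoire (a sketch; it reflects whatever Unicode version the running Python ships):

```python
import sys
import unicodedata

# U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI:
# its full canonical decomposition has 4 code points, the worst case.
nfd = unicodedata.normalize("NFD", "\u1f84")
assert [hex(ord(c)) for c in nfd] == ["0x3b1", "0x313", "0x301", "0x345"]

# Scan the whole repertoire to confirm the bound on the lookahead buffer:
# no single code point decomposes to more than 4 code points under NFD.
longest = max(
    len(unicodedata.normalize("NFD", chr(cp)))
    for cp in range(sys.maxunicode + 1)
    if not 0xD800 <= cp <= 0xDFFF  # skip lone surrogates
)
assert longest == 4
```

This also confirms that the per-code-point expansion factor for the lookahead is 4, not 2.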
> > The automaton is of course using the classic NFA used by regexp
> > engines (and not the DFA, which explodes combinatorially for some
> > regexps), but it is still fully deterministic (the current "state" in
> > the automaton is not a single integer for the node number in the
> > traversal graph, but a set of node numbers; all regexps have a finite
> > number of nodes in the traversal graph, this number being
> > proportional to the length of the regexp, so it does not need a lot
> > of memory, and the size of the current "state" is also fully bounded,
> > never larger than the length of the regexp). Optimizing some
> > contextual parts of the NFA to a DFA is possible (to speed up the
> > matching process and reduce the maximum storage size of the "current
> > state"), but only if it does not cause a growth of the total number
> > of nodes in the traversal graph, or as long as this growth does not
> > exceed some threshold (e.g. not more than 2 or 3 times the regexp
> > size).
>
> In your claim, what is the length of the regexp for searching for ậ in
> a trace? Is it 3, or is it about 14? If the former, I am very
> interested in how you do it. If the latter, I would say you already
> have a form of blow-up in the way you cater for canonical equivalence.

For searching â, the transformed regexp is just

<LATIN SMALL LETTER A>
[[ [^[[:cc=0:]]] - [[:cc=above:]] ]] *
<COMBINING CIRCUMFLEX>

The NFA traversal graph contains nodes at the locations marked below by apostrophes:

' <LATIN SMALL LETTER A> ' [[ [^[[:cc=0:]]] - [[:cc=above:]] ]] ' * ' <COMBINING CIRCUMFLEX> '

It has 5 nodes only (assuming that the regexp engine computes lookup tables to build the character classes). When there's a quantifier (like "*" here) which is not "{1,1}" after a character or character class, a node is inserted after it.
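The "set of node numbers" state can be sketched in a few lines of Python. The node numbering and the toy pattern /ab*c/ are mine, just to show the mechanism: each input character maps the current set of active nodes to the next set, so the traversal stays deterministic even though the automaton is an NFA.

```python
# Hand-built NFA for /ab*c/ (hypothetical node numbers):
#   node 0 --a--> node 1,  node 1 --b--> node 1 (the '*' loop),
#   node 1 --c--> node 2 (accepting).
EDGES = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, "c"): {2},
}
ACCEPT = {2}

def matches(text: str) -> bool:
    states = {0}                  # the "state" is a SET of node numbers
    for ch in text:
        states = set().union(*(EDGES.get((s, ch), set()) for s in states))
        if not states:            # no active node left: fail early
            return False
    return bool(states & ACCEPT)

assert matches("ac") and matches("abbbc")
assert not matches("ab") and not matches("abca")
```

The set of active nodes is bounded by the number of nodes in the graph, which matches the memory bound described above.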
No node is inserted in the NFA traversal graph for non-capturing parentheses, but nodes may be inserted for capturing parentheses, and there are two nodes representing the start and the end of the regexp (a node is also inserted before the "^" or "$" context delimiters to match the start and end of input lines (they are character classes, excluded from the capture), or the start and end of the input text, for regexps using the "multiline" flag).

> Even with the dirty trick of normalising the searched trace for input
> (I wanted the NFA propagation to be defined by the trace - I also didn't
> want to have to worry about the well-formedness of DFAs or NFAs), I
> found that the number of states for a concatenation of regular
> languages of traces was bounded above by the product of the number of
> states.

This worst case occurs when each regexp can match a zero-length input string (i.e. its final node in the traversal graph is a quantifier like "{0,m}" or "*" or "?" that applies to the whole regexp), or when the traversal graph is made of parallel branches starting from the same point, each ending with such a quantifier.

The traversal graph needs to resolve the parentheses and capturing groups to combine them into single quantifier nodes. This transform from the bounded unresolved graph does not cause it to grow in size (number of nodes); instead it compacts it: character classes on links going to the same target node can be factorized by computing their intersection, separating it out of each branch and merging it into a separate branch, then dropping the remaining branches whose remaining character class, without this intersection, is empty. This is simple to compute.
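For character classes represented as sets of code points, the factorization step just described reduces to a shared intersection plus per-branch remainders; a tiny Python sketch with made-up classes:

```python
# Two branch character classes, as sets of characters (hypothetical data).
branch_a = set("abcx")
branch_b = set("abdy")

shared = branch_a & branch_b   # the factorized, merged branch: {'a', 'b'}
rest_a = branch_a - shared     # remainder of the first branch: {'c', 'x'}
rest_b = branch_b - shared     # remainder of the second branch: {'d', 'y'}

# A branch whose remainder is empty is dropped entirely.
branches = [cls for cls in (shared, rest_a, rest_b) if cls]
assert shared == {"a", "b"} and len(branches) == 3
```

When rest_a or rest_b comes out empty, the corresponding original branch disappears, which is exactly the compaction claimed above.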
For this worst case, you don't generate a product of the two NFA traversal graphs: you can directly concatenate the graphs, the growth in size remaining proportional to the total length of the initial regexp, with a small bounded factor (this factor depends on how you represent each node: with character classes possibly including SOT and EOT, or without character classes, using separate nodes for each character, SOT, or EOT). It does not seem unreasonable to build these character classes and compute their unions/intersections where necessary across branches when factorizing them; you just need to take care of the nodes added for non-{1,1} quantifiers.

The regexp engine can also choose to expand {m,} or {m,n} quantifiers where m > 1 by concatenating at most m occurrences of the subgraph before it. It can do that at least once, so that /(a){2,}/ (parentheses without capturing groups here) is treated as if it were /(a)(a)*/, and if the subgraph for /(a)/ is small (e.g. not more than 64 nodes) it can perform this expansion more times.

For example, the NFA traversal graph for /(a|b){10,}/ is (assuming that a and b are orthogonal subgraphs and not single characters that can be combined by computing character classes):

   '<SOT>
    /  \
  'a    'b
    \  /
  '{10,}
     |
   '<EOT>

It has 5 nodes (including those for SOT and EOT). Expanding it one time gives:

     '<SOT>
     /    \
   'a      'b
   /  \    /  \
 'a   'b  'a   'b
   \  /    \  /
  '{9,}   '{9,}
      \    /
     '<EOT>

It is with this pre-expansion of quantifiers that you can see the graph grow. If you expand /A{m,n}/ one time to /AA{m-1,n-1}/, where the graph for /A{m,n}/ has (k+2) nodes (including SOT and EOT), then the new graph will have (2k+2) nodes if n>2, or only (2k) nodes if n=2.
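That an engine may expand a quantifier this way is easy to sanity-check with Python's re module (the patterns mirror the example above; this checks equality of the matched language on a few strings, not the internal graph sizes):

```python
import re

# Expanding /(?:a|b){10,}/ one step into /(?:a|b)(?:a|b){9,}/, as
# described above; both must accept exactly the same strings.
original = re.compile(r"(?:a|b){10,}\Z")
expanded = re.compile(r"(?:a|b)(?:a|b){9,}\Z")

for s in ["a" * 9, "a" * 10, "ab" * 6, "b" * 10 + "c"]:
    assert bool(original.match(s)) == bool(expanded.match(s))
```

The equivalence holds because X{m,} and XX{m-1,} denote the same language whenever m > 1.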
For example, /a{2}/ has 4 nodes (still marked by leading apostrophes here), so k=2:

  '<SOT>
     |
    'a
     |
  '{2,2}
     |
  '<EOT>

You can expand the '{2,2} quantifier one time; it gives a graph with 2k nodes (not 2k+2, because n=2 in this quantifier and the expansion finally drops the remaining {1,1} quantifier), i.e.

  '<SOT>
     |
    'a
     |
    'a
     |
  '<EOT>

In that case the expansion does not grow the graph size, because n=2 in the quantifier and the subgraph on which the quantifier loops is small enough (just a single character or character class!).

The regexp engine always has the option of precomputing this expansion... or not. The expansion does not lower the number of "active states" in the graph; it just allows faster traversal of the graph by avoiding passing through quantifier nodes (which need counters in their state and require an additional step). I suggest expanding {m,n} quantifiers no more than 1 to 4 times, and not doing it at all if the subgraph is large (more than 4 nodes). Above this, the performance gain is marginal given that your graph will grow dramatically in size, you'll reduce the locality of the processors' data caches for the graph itself, and you'll need to allocate more dynamic memory. This tuning is to be determined by profiling your actual implementation of the graph traversal in the matcher (when scanning the input), and the resources (time and storage space) needed to compile the regexp into this expanded graph.
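The /a{2}/ case, where the expansion fully drops the counter node, corresponds to plain unrolling; a sketch (assuming, as in the example, that the looped subgraph is a single character or character class):

```python
import re

def expand_fixed(atom: str, n: int) -> str:
    """Unroll atom{n} into n concatenated copies, dropping the counter
    node entirely, as in the /a{2}/ -> /aa/ example above.  `atom` is
    assumed to be a single character or character class."""
    return atom * n

assert expand_fixed("a", 2) == "aa"
for s in ["a", "aa", "aaa"]:
    assert bool(re.fullmatch("a{2}", s)) == bool(re.fullmatch(expand_fixed("a", 2), s))
```

For a multi-node atom the unrolling would duplicate the whole subgraph n times, which is where the size/speed trade-off discussed above comes in.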