2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <unicode@unicode.org>:
> On Wed, 31 Jan 2018 19:45:56 +0100
> Philippe Verdy via Unicode <unicode@unicode.org> wrote:
>
> > 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode
> > <unicode@unicode.org>:
> >
> > > On Mon, 29 Jan 2018 14:15:04 +0100
> > > <U WITH DIAERESIS AND MACRON> was meant to be an example of a
> > > searched string. For example, <U WITH DIAERESIS AND MACRON,
> > > COMBINING DOT BELOW> contains, under canonical equivalence, the
> > > substring <U WITH DIAERESIS, COMBINING DOT BELOW>. Your regular
> > > expressions would not detect this relationship.
> >
> > My regular expression WILL detect this: scanning the text implies
> > first composing it to "full equivalent decomposition form" (without
> > even reordering it, and possibly recomposing it to NFD) while reading
> > it and buffering it in the forward direction (it just requires the
> > decomposition pairs from the UCD, including those that are "excluded"
> > from NFC/NFD).
>
> No. To find <LATIN SMALL LETTER A WITH CIRCUMFLEX AND COMBINING DOT
> BELOW>, you constructed, on "Sun, 28 Jan 2018 20:30:44 +0100":
>
> <LATIN SMALL LETTER A>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> ( <COMBINING CIRCUMFLEX>
>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   <COMBINING DOT BELOW>
> | <COMBINING DOT BELOW>
>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   <COMBINING CIRCUMFLEX>
>
> To be consistent, to find <U WITH DIAERESIS, COMBINING DOT BELOW> you
> would construct
>
> <LATIN SMALL LETTER U>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> ( <COMBINING DIAERESIS>
>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   <COMBINING DOT BELOW>
> | <COMBINING DOT BELOW>
>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   <COMBINING DIAERESIS>
> )
>
> (A final ')' got lost between brain and text; I have restored it.)
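As a concrete illustration, the quoted construction can be sketched in Python, building the "intervening marks" class (ccc neither 0, below, nor above) from the UCD via the standard unicodedata module. This is a minimal sketch under my own naming; precomputing the class by scanning the whole repertoire is just the simplest way to obtain it, not something prescribed in the thread.

```python
import re
import sys
import unicodedata

# The "intervening marks" class from the construction above: combining
# characters whose ccc is neither 0 (starter), 220 (below) nor 230 (above).
INTERVENING = "".join(
    re.escape(chr(cp))
    for cp in range(sys.maxunicode + 1)
    if unicodedata.combining(chr(cp)) not in (0, 220, 230)
)
CLASS = "[" + INTERVENING + "]*"

DIA, DOT = "\u0308", "\u0323"  # COMBINING DIAERESIS, COMBINING DOT BELOW
pattern = re.compile(
    "u" + CLASS + "(?:" + DIA + CLASS + DOT + "|" + DOT + CLASS + DIA + ")"
)

# NFD reorders the dot below (ccc=220) before the diaeresis (ccc=230),
# so the second branch of the alternation is the one that fires.
text = unicodedata.normalize("NFD", "u\u0308\u0323")
assert pattern.search(text) is not None
```

The two branches of the alternation are what make the pattern insensitive to the relative order of the below and above marks in the searched text.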
This was a minor omission: ONLY this final parenthesis was missing, as it was truncated from its last single line where it was the only character (I don't know why it was truncated there, but it is easy to restore). You did not correct anything else.

> However, <U WITH DIAERESIS AND MACRON, COMBINING DOT BELOW> decomposes
> to <LATIN SMALL LETTER U, COMBINING DIAERESIS, COMBINING MACRON,
> COMBINING DOT BELOW>. It doesn't match your regular expression, for
> between COMBINING DIAERESIS and COMBINING DOT BELOW there is COMBINING
> MACRON, for which ccc = above!

And my regexp contained all the necessary asterisks, so yes, it does not match, because the combining macron blocks the combining dot below and the combining diaeresis from commuting, and so there's no canonical equivalence. <U, DOT BELOW, MACRON, DIAERESIS> or <U WITH MACRON, DOT BELOW, DIAERESIS> cannot be matched in any case by searching for <U, DOT BELOW, DIAERESIS> using canonical equivalence rules. So this regexp is perfectly correct. No error at all (except the missing final parenthesis), and my argument remains valid.

> > The regexp engine will then only process the "fully decomposed" input
> > text to find matches, using the regexp transformed as above (which
> > has some initial "complex" setup to "fully decompose" the initial
> > regexp, but only once when constructing it, not while processing the
> > input text, which can then be done straightforwardly, with the full
> > decomposition easily performed on the fly without any additional
> > buffering except a very small lookahead whose length is never longer
> > than the longest "canonical" decompositions in the UCD, i.e. at most
> > 2 code points of lookahead).
>
> Nitpick: U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND
> YPOGEGRAMMENI decomposes to <U+03B1, U+0313, U+0301, U+0345>.
>
> Conversion to NFD on input only requires a small buffer for natural
> orthographies.
> I suspect the worst in natural language will come from either narrow
> IPA transcriptions or Classical Greek.

OK, the canonical decompositions may expand to more than 2 code points, because some canonical decomposition pairs may themselves contain decomposable pairs, but this is still bounded (4 here). The complete set of full decompositions from the UCD is well known; it fits in a reasonably small static table for each version of Unicode (and its size grows very slowly, only when newly encoded characters are decomposable, without breaking the stability rules about all existing combining characters). Even if this expands the input text to 4 times its length (in number of code points), it still requires only a very small lookahead buffer on the input.

Very few entries in this table decompose to more than 2 code points, and this only occurs for the oldest characters in Unicode, notably in Greek, because there's a **single** case of a combining character that has a canonical decomposition pair (U+0344 COMBINING GREEK DIALYTIKA TONOS, which comes from a combining character mapped for compatibility from a legacy non-Unicode charset). All the other pairs are a base character (ccc=0), possibly decomposable again only one more time (e.g. Vietnamese Latin letters), plus a single non-decomposable combining character with ccc>0, giving a full decomposition of at most 3 characters: these encoded characters carry multiple diacritics and are quite rare in the UCD except in the extended Latin blocks.
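The bound claimed here is easy to check with a short Python scan over the repertoire (a sketch; it reflects whatever Unicode version the running Python ships):

```python
import sys
import unicodedata

# U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI:
# its full canonical decomposition has 4 code points, the worst case.
nfd = unicodedata.normalize("NFD", "\u1f84")
assert [hex(ord(c)) for c in nfd] == ["0x3b1", "0x313", "0x301", "0x345"]

# Scan the whole repertoire to confirm the bound on the lookahead buffer:
# no single code point decomposes to more than 4 code points under NFD.
longest = max(
    len(unicodedata.normalize("NFD", chr(cp)))
    for cp in range(sys.maxunicode + 1)
    if not 0xD800 <= cp <= 0xDFFF  # skip lone surrogates
)
assert longest == 4
```

This also confirms that the per-code-point expansion factor for the lookahead is 4, not 2.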
> > The automaton is of course using the classic NFA used by regexp
> > engines (and not the DFA, which explodes combinatorially for some
> > regexps), but it is still fully deterministic (the current "state" in
> > the automaton is not a single integer for the node number in the
> > traversal graph, but a set of node numbers; all regexps have a finite
> > number of nodes in the traversal graph, this number being
> > proportional to the length of the regexp, so it does not need a lot
> > of memory, and the size of the current "state" is also fully bounded,
> > never larger than the length of the regexp). Optimizing some
> > contextual parts of the NFA to a DFA is possible (to speed up the
> > matching process and reduce the maximum storage size of the "current
> > state"), but only if it does not cause a growth of the total number
> > of nodes in the traversal graph, or as long as this growth does not
> > exceed some threshold (e.g. not more than 2 or 3 times the regexp
> > size).
>
> In your claim, what is the length of the regexp for searching for ậ in
> a trace? Is it 3, or is it about 14? If the former, I am very
> interested in how you do it. If the latter, I would say you already
> have a form of blow-up in the way you cater for canonical equivalence.

For searching â, the transformed regexp is just

<LATIN SMALL LETTER A>
[[ [^[[:cc=0:]]] - [[:cc=above:]] ]] *
<COMBINING CIRCUMFLEX>

The NFA traversal graph contains nodes at the locations marked below by apostrophes:

' <LATIN SMALL LETTER A> ' [[ [^[[:cc=0:]]] - [[:cc=above:]] ]] ' * ' <COMBINING CIRCUMFLEX> '

It has 5 nodes only (assuming that the regexp engine computes lookup tables to build the character classes). When there's a quantifier (like "*" here) which is not "{1,1}" after a character or character class, a node is inserted after it.
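The "set of node numbers" state can be sketched in a few lines of Python. The node numbering and the toy pattern /ab*c/ are mine, just to show the mechanism: each input character maps the current set of active nodes to the next set, so the traversal stays deterministic even though the automaton is an NFA.

```python
# Hand-built NFA for /ab*c/ (hypothetical node numbers):
#   node 0 --a--> node 1,  node 1 --b--> node 1 (the '*' loop),
#   node 1 --c--> node 2 (accepting).
EDGES = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, "c"): {2},
}
ACCEPT = {2}

def matches(text: str) -> bool:
    states = {0}                  # the "state" is a SET of node numbers
    for ch in text:
        states = set().union(*(EDGES.get((s, ch), set()) for s in states))
        if not states:            # no active node left: fail early
            return False
    return bool(states & ACCEPT)

assert matches("ac") and matches("abbbc")
assert not matches("ab") and not matches("abca")
```

The set of active nodes is bounded by the number of nodes in the graph, which matches the memory bound described above.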
No node is inserted in the NFA traversal graph for non-capturing parentheses, but nodes may be inserted for capturing parentheses, and there are two nodes representing the start and the end of the regexp (a node is also inserted before the "^" or "$" context delimiters to match the start and end of input lines (they are character classes, excluded from the capture), or the start and end of the input text, for regexps using the "multiline" flag).

> Even with the dirty trick of normalising the searched trace for input
> (I wanted the NFA propagation to be defined by the trace - I also didn't
> want to have to worry about the well-formedness of DFAs or NFAs), I
> found that the number of states for a concatenation of regular
> languages of traces was bounded above by the product of the number of
> states.

This worst case occurs when each regexp can match a zero-length input string (i.e. its final node in the traversal graph is a quantifier like "{0,m}" or "*" or "?" that applies to the whole regexp), or when the traversal graph is made of parallel branches starting from the same point, each ending with such a quantifier.

The traversal graph needs to resolve the parentheses and capturing groups to combine them into single quantifier nodes. This transform from the bounded unresolved graph does not cause it to grow in size (number of nodes); instead it compacts it: character classes on links going to the same target node can be factorized by computing their intersection, separating it out of each branch and merging it into a separate branch, then dropping the remaining branches whose remaining character class, without this intersection, is empty. This is simple to compute.
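For character classes represented as sets of code points, the factorization step just described reduces to a shared intersection plus per-branch remainders; a tiny Python sketch with made-up classes:

```python
# Two branch character classes, as sets of characters (hypothetical data).
branch_a = set("abcx")
branch_b = set("abdy")

shared = branch_a & branch_b   # the factorized, merged branch: {'a', 'b'}
rest_a = branch_a - shared     # remainder of the first branch: {'c', 'x'}
rest_b = branch_b - shared     # remainder of the second branch: {'d', 'y'}

# A branch whose remainder is empty is dropped entirely.
branches = [cls for cls in (shared, rest_a, rest_b) if cls]
assert shared == {"a", "b"} and len(branches) == 3
```

When rest_a or rest_b comes out empty, the corresponding original branch disappears, which is exactly the compaction claimed above.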
For this worst case, you don't generate a product of the two NFA traversal graphs: you can directly concatenate the graphs, the growth in size remaining proportional to the total length of the initial regexp, with a small bounded factor (this factor depends on how you represent each node: with character classes possibly including SOT and EOT, or without character classes, using separate nodes for each character, SOT, or EOT). It does not seem unreasonable to build these character classes and compute their unions/intersections where necessary across branches when factorizing them; you just need to take care of the nodes added for non-{1,1} quantifiers.

The regexp engine can also choose to expand {m,} or {m,n} quantifiers where m > 1 by concatenating at most m occurrences of the subgraph before it. It can do that at least once, so that /(a){2,}/ (parentheses without capturing groups here) is treated as if it were /(a)(a)*/, and if the subgraph for /(a)/ is small (e.g. not more than 64 nodes) it can perform this expansion more times.

For example, the NFA traversal graph for /(a|b){10,}/ is (assuming that a and b are orthogonal subgraphs and not single characters that can be combined by computing character classes):

   '<SOT>
    /  \
  'a    'b
    \  /
  '{10,}
     |
   '<EOT>

It has 5 nodes (including those for SOT and EOT). Expanding it one time gives:

     '<SOT>
     /    \
   'a      'b
   /  \    /  \
 'a   'b  'a   'b
   \  /    \  /
  '{9,}   '{9,}
      \    /
     '<EOT>

It is with this pre-expansion of quantifiers that you can see the graph grow. If you expand /A{m,n}/ one time to /AA{m-1,n-1}/, where the graph for /A{m,n}/ has (k+2) nodes (including SOT and EOT), then the new graph will have (2k+2) nodes if n>2, or only (2k) nodes if n=2.
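That an engine may expand a quantifier this way is easy to sanity-check with Python's re module (the patterns mirror the example above; this checks equality of the matched language on a few strings, not the internal graph sizes):

```python
import re

# Expanding /(?:a|b){10,}/ one step into /(?:a|b)(?:a|b){9,}/, as
# described above; both must accept exactly the same strings.
original = re.compile(r"(?:a|b){10,}\Z")
expanded = re.compile(r"(?:a|b)(?:a|b){9,}\Z")

for s in ["a" * 9, "a" * 10, "ab" * 6, "b" * 10 + "c"]:
    assert bool(original.match(s)) == bool(expanded.match(s))
```

The equivalence holds because X{m,} and XX{m-1,} denote the same language whenever m > 1.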
For example, /a{2}/ has 4 nodes (still marked by leading apostrophes here), so k=2:

  '<SOT>
     |
    'a
     |
  '{2,2}
     |
  '<EOT>

You can expand the '{2,2} quantifier one time; it gives a graph with 2k nodes (not 2k+2, because n=2 in this quantifier and the expansion finally drops the remaining {1,1} quantifier), i.e.

  '<SOT>
     |
    'a
     |
    'a
     |
  '<EOT>

In that case the expansion does not grow the graph size, because n=2 in the quantifier and the subgraph on which the quantifier loops is small enough (just a single character or character class!).

The regexp engine always has the option of precomputing this expansion... or not. The expansion does not lower the number of "active states" in the graph; it just allows faster traversal of the graph by avoiding passing through quantifier nodes (which need counters in their state and require an additional step). I suggest expanding {m,n} quantifiers no more than 1 to 4 times, and not doing it at all if the subgraph is large (more than 4 nodes). Above this, the performance gain is marginal given that your graph will grow dramatically in size, you'll reduce the locality of the processors' data caches for the graph itself, and you'll need to allocate more dynamic memory. This tuning is to be determined by profiling your actual implementation of the graph traversal in the matcher (when scanning the input), and the resources (time and storage space) needed to compile the regexp into this expanded graph.
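The /a{2}/ case, where the expansion fully drops the counter node, corresponds to plain unrolling; a sketch (assuming, as in the example, that the looped subgraph is a single character or character class):

```python
import re

def expand_fixed(atom: str, n: int) -> str:
    """Unroll atom{n} into n concatenated copies, dropping the counter
    node entirely, as in the /a{2}/ -> /aa/ example above.  `atom` is
    assumed to be a single character or character class."""
    return atom * n

assert expand_fixed("a", 2) == "aa"
for s in ["a", "aa", "aaa"]:
    assert bool(re.fullmatch("a{2}", s)) == bool(re.fullmatch(expand_fixed("a", 2), s))
```

For a multi-node atom the unrolling would duplicate the whole subgraph n times, which is where the size/speed trade-off discussed above comes in.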