On Sat, 2021-06-12 at 15:38 -0400, Graydon wrote:
>  This test is meant to test only that no words have been lost or
> re-ordered; that the transformation is semantically correct is out of
> scope for it.

Some random witterings...

So, i'd probably consider
(1) make a sequence of words from document A

Now, if you really hate your CPU :) you could transform A.seq into a
regular expression,
  w0.*w1.*w2...
and match it against the extracted string value of A.
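
A minimal sketch of that regular-expression idea, in Python purely for
illustration. The function names (words, subsequence_regex,
words_preserved) are made up for this example, and it assumes that
"words" are simply whitespace-delimited tokens of the string value,
which may not be the tokenization you actually want:

```python
import re


def words(text):
    """Split a string value into whitespace-delimited words (an assumption)."""
    return text.split()


def subsequence_regex(seq):
    """Build the w0.*w1.*w2... pattern, escaping each word.

    The lookarounds keep a word from matching inside a longer token,
    e.g. 'cat' inside 'concatenate'.
    """
    parts = [r"(?<!\S)" + re.escape(w) + r"(?!\S)" for w in seq]
    # DOTALL lets '.' cross line breaks in the extracted text.
    return re.compile(".*".join(parts), re.DOTALL)


def words_preserved(source_text, target_text):
    """True if every word of source_text occurs, in order, in target_text."""
    pattern = subsequence_regex(words(source_text))
    return pattern.search(target_text) is not None
```

For anything beyond a small document the pattern gets very long and the
backtracking can get expensive, hence the remark about hating your CPU.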

Starting with ^.*?w0 might reduce the run-time in practice, but the
others all need arbitrary backtracking in case the transformation
introduced one or more words that occur at that point in the document,
so you have
  w0 w1 w2 w3 w1 w2 w3 w4
to match against
  w0 w1          w2 w3 w3

This could also be written with a recursive function and a helper; the
helper would find the longest match at the current position, and if
that's empty the function returns "nope" and you have to back-track.
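
One possible reading of that recursive formulation, again as a Python
sketch rather than anything definitive. The names longest_match and
in_order are hypothetical, both inputs are assumed to be lists of
words, and the memoization is my addition to keep the back-tracking
from exploding:

```python
from functools import lru_cache


def longest_match(expected, actual, i, j):
    """Helper: count consecutive equal words from expected[i] and actual[j]."""
    n = 0
    while (i + n < len(expected) and j + n < len(actual)
           and expected[i + n] == actual[j + n]):
        n += 1
    return n


def in_order(expected, actual):
    """True if the words of expected all occur, in order, within actual."""

    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(expected):
            return True          # every expected word was found
        if j == len(actual):
            return False         # "nope": ran out of words to match
        n = longest_match(expected, actual, i, j)
        # Try the longest run first, then back-track to shorter runs.
        for k in range(n, 0, -1):
            if go(i + k, j + k):
                return True
        # Empty match here: treat actual[j] as an introduced word, skip it.
        return go(i, j + 1)

    return go(0, 0)
```

Called as in_order("w0 w1 w2 w3 w3".split(), "w0 w1 w2 w3 w1 w2 w3
w4".split()), it takes the first four words as one run and then skips
the introduced words until it reaches the second w3. Being recursive,
it will hit Python's recursion limit on long documents, so take it as
an illustration of the shape of the algorithm only.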

Doug Lenat i think has written a book around parsing algorithms, as has
Anne Brüggemann-Klein; Michael Sperberg-McQueen gave a paper at
Balisage about applications to Schema Validation (or at Extreme
Markup). Anne's abstraction, whose name i can't remember (sorry), is
most promising since your problem can be recast as equivalent to
matching XML Schema grammars to input documents, with the unique
particle attribution restriction lifted; RelaxNG does this with a hedge
automaton and that's another approach.


-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org
