ENB: Technical docs for the new @shadow algorithm

Edward K. Ream Tue, 10 Feb 2015 08:43:31 -0800

This is an Engineering Notebook post.  Feel free to ignore.  However, it 
may be of general interest, in particular to anyone who uses difflib.


This post discusses the new @shadow algorithm in detail, with a simple, 
informal, proof of its correctness. This post will be pre-writing for the 
technical documentation of the new @shadow update algorithm.

===== Background

The new algorithm depends on four simple, guaranteed, properties of 
SequenceMatcher.opcodes.  Full docs here 
<https://docs.python.org/2/library/difflib.html#sequencematcher-examples>.

Fact 1:  The opcodes tell how to turn x.a (a list of lines) into x.b 
(another list of lines).

The new code uses the a and b terminology.  It's concise and easy to 
remember.

Fact 2: The opcode indices ai, aj, bi, bj, *never* change because neither a 
nor b changes.

Thus, the plain lines of the result can be built up by copying lines from 
x.b to x.results:

'replace'   x.results.extend(x.b[j1:j2])
'delete'     do nothing  (j1 == j2)
'insert'      x.results.extend(x.b[j1:j2])
'equal'      x.results.extend(x.b[j1:j2])

This is an easy pattern to remember ;-)

Fact 3: The opcodes *cover* both x.a and x.b, in order, without any gaps.

This is an explicit requirement of sm.get_opcode. The docs say that:

- The first tuple has ai==aj==bi==bj==0.
- Remaining tuples have ai == (aj from the preceding tuple) and bi == (bj 
from the previous tuple).

Fact 4: The x.sentinels array contains the sentinels that precede each line 
of x.a.

For any index i, x.sentinels[i] is the (possibly empty) list of sentinel 
lines that precede line a[i]. Computing x.sentinels from old_private_lines 
is easy.

Crucially, x.a and x.sentinels are *parallel arrays*.  That is, len(x.a) == 
len(x.sentinels), so indices into x.a are *also* indices into x.sentinels.

Envisioning the x.sentinels array was the Aha that collapsed the complexity 
of the algorithm.  Sentinels no longer affect indices so opcode indices are 
indices into the x.a, x.b and x.sentinels arrays.

===== Strategy & proof of correctness

With these facts in mind, the strategy for creating the results is simple.  
Given indices ai, aj, bi, bj from an opcode, the algorithm:

- Writes sentinels from x.sentinels[i], for i in range(ai,aj).
- Writes plain lines from b[i], for i in range(bi,bj).

This "just works" because the indices cover both a and b.

Sentinels are written exactly once (in order) because each sentinel appears 
in x.sentinels[i] for some i in range(len(x.a)).  Plain lines are written 
exactly once (in order) because each plain line appears in x.b[i] for some 
i in range(len(x.b)). This completes an informal proof of the correctness 
of the algorithm.

The leading and trailing sentinels lines are easy special cases.   This 
code, appearing before the main loop, ensures that leading lines are 
written first, and only once::

    x.put_sentinels(0)
    x.sentinels[0] = []

Similarly, this line, at the end of the main loop, writes trailing 
sentinels::

    x.results.extend(x.trailing_sentinels)

===== Summary

The algorithm depends on simple, guaranteed, properties of indices in 
SequenceMatcher opcode.

The algorithm steps through x.sentinels and x.b, extending x.results as it 
goes.

The algorithm gets all needed data directly from opcode indices into 
x.sentinels and x.b. Using opcode indices requires neither reader classes 
nor auxiliary indices.

The algorithm is simple enough to be understood at first reading. I'll 
remember its details for the rest of my life. 

Edward

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/leo-editor.
For more options, visit https://groups.google.com/d/optout.

ENB: Technical docs for the new @shadow algorithm

Reply via email to