This is an Engineering Notebook post. Feel free to ignore. However, it may be of general interest, in particular to anyone who uses difflib.
This post discusses the new @shadow algorithm in detail, with a simple, informal, proof of its correctness. This post will be pre-writing for the technical documentation of the new @shadow update algorithm. ===== Background The new algorithm depends on four simple, guaranteed, properties of SequenceMatcher.opcodes. Full docs here <https://docs.python.org/2/library/difflib.html#sequencematcher-examples>. Fact 1: The opcodes tell how to turn x.a (a list of lines) into x.b (another list of lines). The new code uses the a and b terminology. It's concise and easy to remember. Fact 2: The opcode indices ai, aj, bi, bj, *never* change because neither a nor b changes. Thus, the plain lines of the result can be built up by copying lines from x.b to x.results: 'replace' x.results.extend(x.b[j1:j2]) 'delete' do nothing (j1 == j2) 'insert' x.results.extend(x.b[j1:j2]) 'equal' x.results.extend(x.b[j1:j2]) This is an easy pattern to remember ;-) Fact 3: The opcodes *cover* both x.a and x.b, in order, without any gaps. This is an explicit requirement of sm.get_opcode. The docs say that: - The first tuple has ai==aj==bi==bj==0. - Remaining tuples have ai == (aj from the preceding tuple) and bi == (bj from the previous tuple). Fact 4: The x.sentinels array contains the sentinels that precede each line of x.a. For any index i, x.sentinels[i] is the (possibly empty) list of sentinel lines that precede line a[i]. Computing x.sentinels from old_private_lines is easy. Crucially, x.a and x.sentinels are *parallel arrays*. That is, len(x.a) == len(x.sentinels), so indices into x.a are *also* indices into x.sentinels. Envisioning the x.sentinels array was the Aha that collapsed the complexity of the algorithm. Sentinels no longer affect indices so opcode indices are indices into the x.a, x.b and x.sentinels arrays. ===== Strategy & proof of correctness With these facts in mind, the strategy for creating the results is simple. Given indices ai, aj, bi, bj from an opcode, the algorithm: - Writes sentinels from x.sentinels[i], for i in range(ai,aj). - Writes plain lines from b[i], for i in range(bi,bj). This "just works" because the indices cover both a and b. Sentinels are written exactly once (in order) because each sentinel appears in x.sentinels[i] for some i in range(len(x.a)). Plain lines are written exactly once (in order) because each plain line appears in x.b[i] for some i in range(len(x.b)). This completes an informal proof of the correctness of the algorithm. The leading and trailing sentinels lines are easy special cases. This code, appearing before the main loop, ensures that leading lines are written first, and only once:: x.put_sentinels(0) x.sentinels[0] = [] Similarly, this line, at the end of the main loop, writes trailing sentinels:: x.results.extend(x.trailing_sentinels) ===== Summary The algorithm depends on simple, guaranteed, properties of indices in SequenceMatcher opcode. The algorithm steps through x.sentinels and x.b, extending x.results as it goes. The algorithm gets all needed data directly from opcode indices into x.sentinels and x.b. Using opcode indices requires neither reader classes nor auxiliary indices. The algorithm is simple enough to be understood at first reading. I'll remember its details for the rest of my life. Edward -- You received this message because you are subscribed to the Google Groups "leo-editor" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/leo-editor. For more options, visit https://groups.google.com/d/optout.
