On Jul 8, 2:55 pm, "Edward K. Ream" <[EMAIL PROTECTED]> wrote:
> The trunk now contains the first incarnation of leoShadow.py. This
> contains revised/refactored code from the mod_shadow plugin.
The shadow branch contains the latest work.
Last night I spent about four hours studying the heart of the shadow
code, the method called propagate_changed_lines. I believe I now
understand this code in considerable detail, something that has
*never* been true for me in the past.
I plan to greatly improve the unit testing today, and it is my
intention that the results will provide detailed examples of all the
essential features of the code. That is, I soon hope to be able to
demonstrate visually exactly how and why the code works.
But I'm not there yet, so In this post I'll report that Aha's that
made the code comprehensible. These notes are mostly notes to
myself. Feel free to ignore them :-)
propagate_changed_lines computes new_private_lines (the changed lines
*with* sentinels) given two arguments:
- old_private_lines (containing sentinels) and
- new_public_lines (the changed lines without sentinels)
As an overview, propagate_changed_lines works as follows:
1. It reconstitutes old_public_lines (the original lines without
sentinels) by removing sentinels from old_private_lines.
2. It calls
difflib.SequenceMatcher(None,old_public_lines,new_public_lines) to
create a list of diffs of the old and new versions of the lines
**without** sentinels.
3. It (very cleverly) uses the list of diffs to drive what is, in
effect, a two-way merge of old_private_lines (with sentinels) and
new_public_lines (without sentinels). All sentinels and unchanged
regular lines come from old_private_lines. All inserted or replacement
lines come from new_public_lines.
There are a *lot* of details in the algorithm. Probably the turning
point, the point at which I truly started to understand the algorithm,
was the realization that the main loop involves only input from
old_private_lines and new_public_lines. Once I saw that, I could
begin to the following:
Most of the indices returned by SequenceMatcher.get_opcodes are *not*
used in the main loop. In the present code, these indices are called
old_i, old_j, new_i and new_j.
Indeed, the main loop mostly ignores these indices (I'm exaggerating a
bit) and instead mostly relies on implied pointers into the
old_private_lines and new_public_lines arrays. These "implied
pointers" are the position of the corresponding reader classes used to
access old_private_lines and new_public_lines.
This lead me, finally, to the realization that the main loop is, in
effect, a two-way merge of old_private_lines and new_public_lines
arrays. It's a clever, tricky, merge, but (in essence) a merge is
exactly what it is.
There is one crucial detail that you must understand to begin to make
sense of this merge: the mapping array.
propagate_changed_lines reconstitutes old_public_lines (the original
lines without sentinels) by removing sentinels from old_private_lines
as follows::
old_public_lines, mapping =
self.strip_sentinels_with_map(old_private_lines,marker)
As the name implies, this method not only strips sentinel lines, but
it returns a mapping. This mapping is an array of n+1 entries, where
n is len(old_public_lines). This mapping array is an essential part
of the algorithm. mapping[i] is the index into old_private_lines of
line i in old_public_lines. That is, for each i <
len(old_public_lines), line i in old_public_lines ==
old_private_lines[i]. In other words, mapping[i] tells where line i
of old_public_lines "came from".
We are now in a position to begin to understand the main loop in
detail. The main loop processes one SequenceMatcher opcode, of the
form tag,old_i,old_j,new_i,new_j. tag is the opcode kind. It is in
('insert','delete','equal','replace').
The following code starts the opcode loop:
# Ignore (delete) all unwritten lines of old_private_lines_rdr up
to index mapping[old_i].
# Because of this, nothing has to be explicitly deleted below.
self.copy_sentinels(old_private_lines_rdr,new_private_lines_wtr,marker,limit=mapping[old_i])
Understanding this single line is essential to understanding the rest
of the tag (opcode) handlers.
This line copies only sentinel lines from the current position of the
old_private_lines_rdr up to mapping[old_i]. As a side effect, the
current position of the old_private_lines_rdr is advanced to
mapping[old_i], so in effect all non-sentinel lines from the last
written line to limit are deleted from old_private_lines. That is,
those lines will never be copied to the output.
Now let's take a look at the rest of the main loop:
if tag == 'equal':
# Copy all lines (including sentinels) from the old private file
to the new private file.
while old_private_lines_rdr.index() <= mapping[old_j-1]:
line = old_private_lines_rdr.get()
new_private_lines_wtr.put(line)
# Ignore all new lines up to new_j: the same lines (with
sentinels) have just been written.
new_public_lines_rdr.sync(new_j)
Diff is telling us that old_public_lines[old_i:old_j] are the same as
new_public_lines[new_i:new_j]. However, we can't use either of these
arrays directly because we would lose sentinels lines. Instead we
copy all lines, including sentinels, from the current position in
old_private_lines_rdr to mapping[old_j-1] (I'm dubious about the -1:
unit testing is required). Once we have done that, we skip all lines
in new_public_lines up until line new_j.
Note that *only* index old_j is used in this code. The old_i, new_i
and new_j indices are not used at all. Similar remarks can be made
about the other cases to be discussed below. The realization that
most indices aren't used was an important Aha.
elif tag in ('insert','replace'):
# All unwritten lines from old_private_lines_rdr up to
mapping[old_i] have already been ignored.
# Copy lines from new_public_lines_rdr up to new_j.
while new_public_lines_rdr.index() < new_j:
line = new_public_lines_rdr.get()
new_private_lines_wtr.put(line)
Diff is telling us that new_public_lines[new_i:new_j] should replace
old_public_lines[old_i:old_j]. However, the call to copy_sentinels
has *already* skipped lines up to old_private_lines[mapping[old_i]],
and these correspond (via the mapping) to the lines up to
old_public_lines[old_i]. Thus, for both the 'insert' and 'replace'
opcodes, all we have to do is copy unwritten lines from
new_public_llines_rdr up to index new_j.
Note: we could assert that new_public_lines.index() == new_i at the
start of this case. This assertion, and similar assertions, are true
because SequenceMatcher returns opcodes in increasing order without
gaps.
elif tag=='delete':
# All unwritten lines from old_private_lines_rdr up to
mapping[old_i] have already been ignored.
# Leave new_public_lines_rdr unchanged.
pass
Diff is telling us delete old_public_lines[old_i:old_j]. The call to
copy_sentinels has deleted old_private_lines up to index mapping
[old_i], **not** up to mapping [old_j]. We could call::
self.copy_sentinels(old_private_lines_rdr,new_private_lines_wtr,marker,limit=mapping[old_j])
here, but there is no need to do so. The next time through the loop
will do the equivalent. Finally, after the loop completes, the
following code ensures that all trailing sentinels are written by
calling::
self.copy_sentinels(old_private_lines_rdr,new_private_lines_wtr,marker,limit=old_private_lines_rdr.size())
So that's it. The big picture is that the main loop is a two-way
merge of old_private_lines (with sentinels) with new_public_lines
(without sentinels). Unchanged lines come from old_private_lines;
changed lines come from new_public_lines. The result of the merge
forms new_private_lines.
Clearly, every line of new_public_line gets written to the output. In
the 'equal' case, however, those lines actually come from
old_private_lines. Assertions are needed to ensure that lines are
written correctly with respect to sentinels. I believe those
assertions apply, but thorough unit testing is an absolute must.
Edward
P.S. I believe I now have sufficient insight into the code to propose
some possible additions. In particular, when inserting code, we might
use leading whitespace to determine whether to insert lines at the
start of one node or the end of another. It may be desirable to look
ahead when inserting lines to see if the lines in old_private_lines
are sentinels that start a node. If so, we could move inserted lines
to the next node by copying those sentinels first. But I am getting
ahead of myself: I must completely demonstrate the correctness of the
basic algorithm first.
EKR
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"leo-editor" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at
http://groups.google.com/group/leo-editor?hl=en
-~----------~----~----~----~------~----~------~--~---