ENB: still a juicy project

Edward K. Ream Sat, 16 Nov 2019 06:24:11 -0800

And an all-consuming one. For the last week it has taken two or more hours 
just to process all the ideas I have awakened with. Today was no exception.

This Engineering Notebook discusses yesterday's work and the remaining
issues. I am writing this ENB to prime my mental pump.

*tl;dr:* There are two strategies for creating two-way links between tokens
and ast nodes. Neither will be easy. The goal is to find a *feasibly simple*
solution.

*Infrastructure*

Much of my work involves developing infrastructure: the test runner, diffs
and dumps. Making these clean and powerful pays off handsomely:

- Diffs keep reminded me yesterdays of the complexities. Their clarity
helps prime my mental pump.

- Dumps show various parts of the ground truth. Yesterday, brief_dump
reminded me that the eat method was indirectly responsible for the two-way
links. Naturally, not calling eat killed all those links. I knew then it
was time for a break, after 15+ hours of work.

*A bug in difflib.ndiff*

TOG.verify now produces a diff of the token list against a new results
list. This was less than straightforward because difflib.ndiff barfs on all
Leo files!

File "...\Anaconda3\lib\difflib.py", line 1017, in _fancy_replace
TypeError: can only concatenate str (not "tuple") to str

Perhaps ndiff trips over Leo sentinels. Googling reveals the same error in
other parts of Python's standard library. I'll report this bug when I can
isolate it to just a few lines. Happily, difflib.Differ().compare works
just fine.

*The juicy problem*

After yesterday's post about diffs I naively assumed that diff would make
everything easy. Alas, it does not.

Recall that the task is to sync (somehow!) the tokens produced by
tokenize.tokenize with the list of results produced by traversing the tree.
This can *never* be a trivial task, for two reasons:

1. Ast nodes provide no data about comment tokens and the spelling of
whitespace. *There will always be gaps in the results list.*

The tree traversal can never generate 'comment' and 'nl' tokens. *Diffs
will never be empty.*

*2.* Tokens know nothing (initially) about ast nodes.

So it seems there are only two ways to create the two-way links:

*Strategy 1*: Generate the links immediately, when traversing the parse
tree.

This is what the eat method attempted to do. Previous attempts where not
feasibly simple, but I haven't entirely given up yet.

*Strategy 2*: Generate the links after all nodes have been created, using
the diffs. There are three cases to consider:

Case A. The easy case: the token and result match.

The token's kind and value fields match the result, which is a tuple(kind,
val), created by tog.put.

tog.begin_visitor sets the tog.node ivar, so this ivar is correct at the
time the visitor calls tog.put. put could create tuples (kind, val,
ast_node), which means that all data could be made available to the diff
handler. With a bit more work, the diff handler can set the two-way links.

Case B: a token corresponds to no result.
Case C: a result corresponds to no token.

Let's lump these two together, because these often come in matching pairs!

This happens when the spelling of a 'string' token does not match the
spelling of 'string' result. In that case, the ground truth is the spelling
of the token. The result will contain the ast_node link. We use the
ast_node link from the result, and the spelling from the token.

Other mismatches are possible, because there *must *be gaps in the results.
These are the hardest cases of all, because no ast_node link is *directly*
available.

This is not a gotcha. This just means that it's not so clear what node
should be associated with the "extra" tokens. This creates an *association
problem.* A better term might be the *association policy.*

Happily, the diffs, tokens and results are real lists (not generators), so
the diff handler can look both ahead and behind in all the lists. Whether
that can be made feasibly simple is an open question.

We could imagine combining strategies 1 and 2, that won't solve either
problem! It would only be a check, a kind of double-entry accounting.

*Conflicting intuitions*

One of my math professors said that when working on a problem one can often
be guided by the problem's inherent difficulty. One should not look for
elementary solutions to deep problems, nor for deep solutions to elementary
problems.

In this case, however, I have conflicting intuitions about how hard the
underlying problem really is. The difflib docs
<https://docs.python.org/3/library/difflib.html>say:

"The Ratcliff-Obershelp algorithm is cubic time in the worst case and
quadratic time in the expected case. SequenceMatcher is quadratic time for
the worst case and has expected-case behavior dependent in a complicated
way on how many elements the sequences have in common; best case time is
linear."

This means that relying on diff will be slow. It will, however, be accurate
:-)

Otoh, eat knows a lot more more than diff about tokens and results!

*Summary*

The goal is to find a feasibly simple way of associating two-way links
between tokens and tree nodes. There are two competing strategies:

1. Create the links in tog.eat, as results are generated. This strategy is
worth further work. It promises linear time performance.

2. Create the links in a post pass, using diffs. This will be much slower.

Significant bookkeeping issues bedevil either strategy.

Edward

--
You received this message because you are subscribed to the Google Groups
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/leo-editor/f171f8c4-2d1d-4450-a586-95ec7593c21b%40googlegroups.com.

ENB: still a juicy project

Reply via email to