And an all-consuming one. For the last week it has taken two or more hours 
just to process all the ideas I have awakened with. Today was no exception.

This Engineering Notebook discusses yesterday's work and the remaining 
issues. I am writing this ENB to prime my mental pump.

*tl;dr:* There are two strategies for creating two-way links between tokens 
and ast nodes. Neither will be easy. The goal is to find a *feasibly simple* 
solution.


*Infrastructure*

Much of my work involves developing infrastructure: the test runner, diffs 
and dumps. Making these clean and powerful pays off handsomely:

- Diffs keep reminded me yesterdays of the complexities. Their clarity 
helps prime my mental pump.

- Dumps show various parts of the ground truth. Yesterday, brief_dump 
reminded me that the eat method was indirectly responsible for the two-way 
links.  Naturally, not calling eat killed all those links. I knew then it 
was time for a break, after 15+ hours of work.


*A bug in difflib.ndiff*

TOG.verify now produces a diff of the token list against a new results 
list. This was less than straightforward because difflib.ndiff barfs on all 
Leo files!

File "...\Anaconda3\lib\difflib.py", line 1017, in _fancy_replace
TypeError: can only concatenate str (not "tuple") to str

Perhaps ndiff trips over Leo sentinels. Googling reveals the same error in 
other parts of Python's standard library. I'll report this bug when I can 
isolate it to just a few lines. Happily, difflib.Differ().compare works 
just fine.


*The juicy problem*

After yesterday's post about diffs I naively assumed that diff would make 
everything easy. Alas, it does not.

Recall that the task is to sync (somehow!) the tokens produced by 
tokenize.tokenize with the list of results produced by traversing the tree. 
This can *never* be a trivial task, for two reasons:

1. Ast nodes provide no data about comment tokens and the spelling of 
whitespace. *There will always be gaps in the results list.*

The tree traversal can never generate 'comment' and 'nl' tokens. *Diffs 
will never be empty.*

*2.* Tokens know nothing (initially) about ast nodes.

So it seems there are only two ways to create the two-way links:

*Strategy 1*: Generate the links immediately, when traversing the parse 
tree.

This is what the eat method attempted to do. Previous attempts where not 
feasibly simple, but I haven't entirely given up yet.

*Strategy 2*: Generate the links after all nodes have been created, using 
the diffs. There are three cases to consider:

Case A. The easy case: the token and result match.

The token's kind and value fields match the result, which is a tuple(kind, 
val), created by tog.put.

tog.begin_visitor sets the tog.node ivar, so this ivar is correct at the 
time the visitor calls tog.put. put could create tuples (kind, val, 
ast_node), which means that all data could be made available to the diff 
handler. With a bit more work, the diff handler can set the two-way links.

Case B: a token corresponds to no result.
Case C: a result corresponds to no token.

Let's lump these two together, because these often come in matching pairs!

This happens when the spelling of a 'string' token does not match the 
spelling of 'string' result. In that case, the ground truth is the spelling 
of the token. The result will contain the ast_node link. We use the 
ast_node link from the result, and the spelling from the token.

Other mismatches are possible, because there *must *be gaps in the results. 
These are the hardest cases of all, because no ast_node link is *directly* 
available.

This is not a gotcha. This just means that it's not so clear what node 
should be associated with the "extra" tokens. This creates an *association 
problem.* A better term might be the *association policy.*

Happily, the diffs, tokens and results are real lists (not generators), so 
the diff handler can look both ahead and behind in all the lists. Whether 
that can be made feasibly simple is an open question.

We could imagine combining strategies 1 and 2, that won't solve either 
problem!  It would only be a check, a kind of double-entry accounting.

*Conflicting intuitions*

One of my math professors said that when working on a problem one can often 
be guided by the problem's inherent difficulty. One should not look for 
elementary solutions to deep problems, nor for deep solutions to elementary 
problems.

In this case, however, I have conflicting intuitions about how hard the 
underlying problem really is. The difflib docs 
<https://docs.python.org/3/library/difflib.html>say:

"The Ratcliff-Obershelp algorithm is cubic time in the worst case and 
quadratic time in the expected case. SequenceMatcher is quadratic time for 
the worst case and has expected-case behavior dependent in a complicated 
way on how many elements the sequences have in common; best case time is 
linear."

This means that relying on diff will be slow. It will, however, be accurate 
:-)

Otoh, eat knows a lot more more than diff about tokens and results!

*Summary*

The goal is to find a feasibly simple way of associating two-way links 
between tokens and tree nodes. There are two competing strategies:

1. Create the links in tog.eat, as results are generated. This strategy is 
worth further work. It promises linear time performance.

2. Create the links in a post pass, using diffs. This will be much slower.

Significant bookkeeping issues bedevil either strategy.

Edward

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/leo-editor/f171f8c4-2d1d-4450-a586-95ec7593c21b%40googlegroups.com.

Reply via email to