And an all-consuming one. For the last week it has taken two or more hours just to process all the ideas I have awakened with. Today was no exception.
This Engineering Notebook discusses yesterday's work and the remaining issues. I am writing this ENB to prime my mental pump. *tl;dr:* There are two strategies for creating two-way links between tokens and ast nodes. Neither will be easy. The goal is to find a *feasibly simple* solution. *Infrastructure* Much of my work involves developing infrastructure: the test runner, diffs and dumps. Making these clean and powerful pays off handsomely: - Diffs keep reminded me yesterdays of the complexities. Their clarity helps prime my mental pump. - Dumps show various parts of the ground truth. Yesterday, brief_dump reminded me that the eat method was indirectly responsible for the two-way links. Naturally, not calling eat killed all those links. I knew then it was time for a break, after 15+ hours of work. *A bug in difflib.ndiff* TOG.verify now produces a diff of the token list against a new results list. This was less than straightforward because difflib.ndiff barfs on all Leo files! File "...\Anaconda3\lib\difflib.py", line 1017, in _fancy_replace TypeError: can only concatenate str (not "tuple") to str Perhaps ndiff trips over Leo sentinels. Googling reveals the same error in other parts of Python's standard library. I'll report this bug when I can isolate it to just a few lines. Happily, difflib.Differ().compare works just fine. *The juicy problem* After yesterday's post about diffs I naively assumed that diff would make everything easy. Alas, it does not. Recall that the task is to sync (somehow!) the tokens produced by tokenize.tokenize with the list of results produced by traversing the tree. This can *never* be a trivial task, for two reasons: 1. Ast nodes provide no data about comment tokens and the spelling of whitespace. *There will always be gaps in the results list.* The tree traversal can never generate 'comment' and 'nl' tokens. *Diffs will never be empty.* *2.* Tokens know nothing (initially) about ast nodes. So it seems there are only two ways to create the two-way links: *Strategy 1*: Generate the links immediately, when traversing the parse tree. This is what the eat method attempted to do. Previous attempts where not feasibly simple, but I haven't entirely given up yet. *Strategy 2*: Generate the links after all nodes have been created, using the diffs. There are three cases to consider: Case A. The easy case: the token and result match. The token's kind and value fields match the result, which is a tuple(kind, val), created by tog.put. tog.begin_visitor sets the tog.node ivar, so this ivar is correct at the time the visitor calls tog.put. put could create tuples (kind, val, ast_node), which means that all data could be made available to the diff handler. With a bit more work, the diff handler can set the two-way links. Case B: a token corresponds to no result. Case C: a result corresponds to no token. Let's lump these two together, because these often come in matching pairs! This happens when the spelling of a 'string' token does not match the spelling of 'string' result. In that case, the ground truth is the spelling of the token. The result will contain the ast_node link. We use the ast_node link from the result, and the spelling from the token. Other mismatches are possible, because there *must *be gaps in the results. These are the hardest cases of all, because no ast_node link is *directly* available. This is not a gotcha. This just means that it's not so clear what node should be associated with the "extra" tokens. This creates an *association problem.* A better term might be the *association policy.* Happily, the diffs, tokens and results are real lists (not generators), so the diff handler can look both ahead and behind in all the lists. Whether that can be made feasibly simple is an open question. We could imagine combining strategies 1 and 2, that won't solve either problem! It would only be a check, a kind of double-entry accounting. *Conflicting intuitions* One of my math professors said that when working on a problem one can often be guided by the problem's inherent difficulty. One should not look for elementary solutions to deep problems, nor for deep solutions to elementary problems. In this case, however, I have conflicting intuitions about how hard the underlying problem really is. The difflib docs <https://docs.python.org/3/library/difflib.html>say: "The Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear." This means that relying on diff will be slow. It will, however, be accurate :-) Otoh, eat knows a lot more more than diff about tokens and results! *Summary* The goal is to find a feasibly simple way of associating two-way links between tokens and tree nodes. There are two competing strategies: 1. Create the links in tog.eat, as results are generated. This strategy is worth further work. It promises linear time performance. 2. Create the links in a post pass, using diffs. This will be much slower. Significant bookkeeping issues bedevil either strategy. Edward -- You received this message because you are subscribed to the Google Groups "leo-editor" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/leo-editor/f171f8c4-2d1d-4450-a586-95ec7593c21b%40googlegroups.com.
