For several days I've been thinking that the crucial phase of the
"unification" project might be completed in a few hours. Still true :-)
After yesterday's work I know *exactly* what complications remain.
This Engineering Notebook post will discuss the complications in detail and
offer a road map towards solving them. This post will be pre-writing for
the most important part of the Theory of Operation.
*tl;dr:* Clearly, these complications can be solved: the token list and the
parse tree contain all the required data. Alas, no trivial solution can
possibly exist. An elegant solution might still be possible.
*Overview of the code*
The driver (main line) is a non-recursive traversal of the parse tree that
visits each *visitor *in *token order*.
All visitors are generators. Each visitors contains one or more lines of
one of the following forms:
yield from self.gen(tree_node)
yield from self.gen_token(token_kind, token_value)
Several abbreviations for gen_token exist for simplicity. The gen method
"delegates" the traversal to constituent tree nodes. The *sync_token*
method is the heart of the entire project. It does two things:
1. It verifies that the entire algorithm calls sync_token in the order in
which they appear in the token list.
2. It creates two-way links between self.node (the present parse tree node)
and token passed to put_token.
sync_token raises *AssignLinksError *if the token presented to sync_token
does not match the *present token*, namely tog.token_list [tog.px]. tog.px
is the *token index*. This index monotonically increases. Only sync_token
ever updates this index.
Important: the entire "product" of the TOG classes are the two-way links
between tokens and tree nodes.
*Complications, part 1: ambiguous trees*
Naively, one might think that only the token index is needed. Alas, that's
not true, because the parse tree can be *ambiguous*. Two *different* inputs
(and token lists) can generate *exactly the same* parse tree.
In particular, there is, in principle, *no way* that the do_If visitor can
whether it should call gen_name('if') or gen_name('elif') by looking only
at the parse tree. So do_If needs help from the token list.
*A second* index, *tog.if_index*, tracks all 'if', 'elif', and 'else'
tokens in the token list. The *peek_if* helper returns the current 'if',
'elif', or 'else' token. The *advance_if* helper moves to the next 'if',
'elif' or 'else' token in the token list.
Note: The do_IfExp visitor handles ternary operators. This visitor must
also call advance_if, because ternary operators must "consume" one "if"
token and one "else" token.
*Complications, part deux: joined and formatted strings*
The second set of complications became fully visible only late yesterday.
Solving them will involve a new index, *tog.string index*, and a new helper,*
tog.advance_str*. Alas, these do not suffice, as I discovered yesterday,
because...
1. A FormattedValue node represents a *single* token (a formatted string)
that corresponds to an *arbitrarily large* subtree of parse nodes.
2. A JoinedStr node is a *single* tree node that corresponds to *arbitrarily
many* tokens.
3. Parse trees can contain nested FormattedValue and JoinedStr nodes. This
is the acid test.
For example, here is a (failing!) unit test that illustrates the
difficulties:
k.k1(
f"{'A' if self.a else ''}"
f"{'B' if self.b else ''}"
": "
)
k.k2()
The token list (not fully linked, because AssignLinksError intervenes):
Tokens...
0 encoding '' line: 0 level: 0
1 name 'k' line: 1 level: 0 3232 Name
children: 0 parent: 8344 Attribute
2 op '.' line: 1 level: 0 8344 Attribute
children: 1 parent: 2128 Call
3 name 'k1' line: 1 level: 0 8344 Attribute
children: 1 parent: 2128 Call
4 op '(' line: 1 level: 0
5 nl '\n' line: 1 level: 0
6 ws 4 line: 2 level: 0
7 string 'f"{\'A\' if... line: 2 level: 0 3008 FormattedValue
children: 1 parent: 3400 JoinedStr
8 nl '\n' line: 2 level: 0
9 ws 4 line: 3 level: 0
10 string 'f"{\'B\' if... line: 3 level: 0 3008 FormattedValue
children: 1 parent: 3400 JoinedStr
11 nl '\n' line: 3 level: 0
12 ws 4 line: 4 level: 0
13 string '": "' line: 4 level: 0 9016 FormattedValue
children: 1 parent: 3400 JoinedStr
14 nl '\n' line: 4 level: 0
15 op ')' line: 5 level: 0
16 newline '\n' line: 5 level: 0 9016 FormattedValue
children: 1 parent: 3400 JoinedStr
17 name 'k' line: 6 level: 0 9016 FormattedValue
children: 1 parent: 3400 JoinedStr
18 op '.' line: 6 level: 0 9016 FormattedValue
children: 1 parent: 3400 JoinedStr
19 name 'k2' line: 6 level: 0
20 op '(' line: 6 level: 0
21 op ')' line: 6 level: 0
22 newline '\n' line: 6 level: 0
23 endmarker '' line: 7 level: 0
And here is the parse tree, again not fully linked:
parent lines node
tokens
====== ===== ====
======
0 Module
0 Module 1 Expr
1 Expr 2 Call
2 Call 1 3 Attribute op.
name(k1)
3 Attribute 1 4 Name: id='k' name(
k)
2 Call 5 JoinedStr: values=FormattedVal...
5 JoinedStr 2..3 6 FormattedValue
string(f"{'A' if self.a ...) string(f"{'B' if self.b ...)
6 FormattedValue 7 IfExp
7 IfExp 8 Str: s='A'
7 IfExp 9 Attribute
9 Attribute 10 Name: id='self'
7 IfExp 11 Str: s=''
5 JoinedStr 4..6 12 FormattedValue
string(": ") newline(5:2) name(k)
op.
12 FormattedValue 13 IfExp
13 IfExp 14 Str: s='B'
13 IfExp 15 Attribute
15 Attribute 16 Name: id='self'
13 IfExp 17 Str: s=''
0 Module 18 Expr
18 Expr 19 Call
19 Call 20 Attribute
20 Attribute 21 Name: id='k'
*Summary*
The heart of the puzzle is a many-to-many relationship between tokens and
tree nodes:
- f-strings (FormattedValue's) can be the root of an arbitrarily large
parse tree.
- concatenated strings (JoinedStr's) can correspond to arbitrarily many
tokens.
We can say the following about a possible solution:
1. The solution is possible, because all necessary data are present.
2. The solution will be non-trivial. Special case hacks have no chance of
working. An elegant solution may be possible. If so, comments must explain
why it works.
3. Any solution must involve cooperation between sync_token and the do_Str,
do_FormattedValue and do_JoinedStr visitors, *including especially *the
advance_if and advance_string helpers.
The challenge will be to create a *sound *solution that is also as simple
as possible. It's a big ask. It's worth any amount of work.
Edward
Writing started: 5:15 am; finished 7:07 am.
--
You received this message because you are subscribed to the Google Groups
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/leo-editor/8cbf7e97-3693-43c7-81aa-7ad43fbcc446%40googlegroups.com.