For several days I've been thinking that the crucial phase of the 
"unification" project might be completed in a few hours. Still true :-) 
After yesterday's work I know *exactly* what complications remain.

This Engineering Notebook post will discuss the complications in detail and 
offer a road map towards solving them. This post will be pre-writing for 
the most important part of the Theory of Operation.

*tl;dr:* Clearly, these complications can be solved: the token list and the 
parse tree contain all the required data. Alas, no trivial solution can 
possibly exist.  An elegant solution might still be possible.


*Overview of the code*

The driver (main line) is a non-recursive traversal of the parse tree that 
visits each *visitor *in *token order*.

All visitors are generators. Each visitors contains one or more lines of 
one of the following forms:

yield from self.gen(tree_node)
yield from self.gen_token(token_kind, token_value)

Several abbreviations for gen_token exist for simplicity. The gen method 
"delegates" the traversal to constituent tree nodes. The *sync_token* 
method is the heart of the entire project. It does two things:

1. It verifies that the entire algorithm calls sync_token in the order in 
which they appear in the token list.
2. It creates two-way links between self.node (the present parse tree node) 
and token passed to put_token.

sync_token raises *AssignLinksError *if the token presented to sync_token 
does not match the *present token*, namely tog.token_list [tog.px]. tog.px 
is the *token index*. This index monotonically increases. Only sync_token 
ever updates this index.

Important: the entire "product" of the TOG classes are the two-way links 
between tokens and tree nodes.


*Complications, part 1: ambiguous trees*

Naively, one might think that only the token index is needed. Alas, that's 
not true, because the parse tree can be *ambiguous*. Two *different* inputs 
(and token lists) can generate *exactly the same* parse tree.

In particular, there is, in principle, *no way* that the do_If visitor can 
whether it should call gen_name('if') or gen_name('elif') by looking only 
at the parse tree. So do_If needs help from the token list.

*A second* index, *tog.if_index*, tracks all 'if', 'elif', and 'else' 
tokens in the token list. The *peek_if* helper returns the current  'if', 
'elif', or 'else' token. The *advance_if* helper moves to the next 'if', 
'elif' or 'else' token in the token list.

Note: The do_IfExp visitor handles ternary operators. This visitor must 
also call advance_if, because ternary operators must "consume" one "if" 
token and one "else" token.

*Complications, part deux: joined and formatted strings*

The second set of complications became fully visible only late yesterday. 
Solving them will involve a new index, *tog.string index*, and a new helper,* 
tog.advance_str*. Alas, these do not suffice, as I discovered yesterday, 
because...

1. A FormattedValue node represents a *single* token (a formatted string) 
that corresponds to an *arbitrarily large* subtree of parse nodes.

2. A JoinedStr node is a *single* tree node that corresponds to *arbitrarily 
many* tokens.

3. Parse trees can contain nested FormattedValue and JoinedStr nodes. This 
is the acid test.

For example, here is a (failing!) unit test that illustrates the 
difficulties:

k.k1(
    f"{'A' if self.a else ''}"
    f"{'B' if self.b else ''}"
    ": "
)
k.k2()

The token list (not fully linked, because AssignLinksError intervenes):

Tokens...

  0    encoding ''              line: 0  level: 0
  1        name 'k'             line: 1  level: 0  3232 Name             
children: 0 parent: 8344 Attribute
  2          op '.'             line: 1  level: 0  8344 Attribute       
 children: 1 parent: 2128 Call
  3        name 'k1'            line: 1  level: 0  8344 Attribute       
 children: 1 parent: 2128 Call
  4          op '('             line: 1  level: 0
  5          nl '\n'            line: 1  level: 0
  6          ws 4               line: 2  level: 0
  7      string 'f"{\'A\' if... line: 2  level: 0  3008 FormattedValue   
children: 1 parent: 3400 JoinedStr
  8          nl '\n'            line: 2  level: 0
  9          ws 4               line: 3  level: 0
 10      string 'f"{\'B\' if... line: 3  level: 0  3008 FormattedValue   
children: 1 parent: 3400 JoinedStr
 11          nl '\n'            line: 3  level: 0
 12          ws 4               line: 4  level: 0
 13      string '": "'          line: 4  level: 0  9016 FormattedValue   
children: 1 parent: 3400 JoinedStr
 14          nl '\n'            line: 4  level: 0
 15          op ')'             line: 5  level: 0
 16     newline '\n'            line: 5  level: 0  9016 FormattedValue   
children: 1 parent: 3400 JoinedStr
 17        name 'k'             line: 6  level: 0  9016 FormattedValue   
children: 1 parent: 3400 JoinedStr
 18          op '.'             line: 6  level: 0  9016 FormattedValue   
children: 1 parent: 3400 JoinedStr
 19        name 'k2'            line: 6  level: 0
 20          op '('             line: 6  level: 0
 21          op ')'             line: 6  level: 0
 22     newline '\n'            line: 6  level: 0
 23   endmarker ''              line: 7  level: 0

And here is the parse tree, again not fully linked:

parent           lines    node                                         
tokens
======           =====    ====                                         
======
                          0   Module
0   Module                1     Expr
1   Expr                  2       Call
2   Call         1        3         Attribute                          op. 
name(k1)
3   Attribute    1        4           Name: id='k'                     name(
k)
2   Call                  5         JoinedStr: values=FormattedVal...
5   JoinedStr    2..3     6           FormattedValue                   
string(f"{'A' if self.a ...) string(f"{'B' if self.b ...)

6   FormattedValue          7             IfExp
7   IfExp                 8               Str: s='A'
7   IfExp                 9               Attribute
9   Attribute             10                Name: id='self'
7   IfExp                 11              Str: s=''
5   JoinedStr    4..6     12          FormattedValue                   
string(": ") newline(5:2) name(k)
                                                                       op.
12  FormattedValue          13            IfExp
13  IfExp                 14              Str: s='B'
13  IfExp                 15              Attribute
15  Attribute             16                Name: id='self'
13  IfExp                 17              Str: s=''
0   Module                18    Expr
18  Expr                  19      Call
19  Call                  20        Attribute
20  Attribute             21          Name: id='k'

*Summary*

The heart of the puzzle is a many-to-many relationship between tokens and 
tree nodes:

- f-strings (FormattedValue's) can be the root of an arbitrarily large 
parse tree.
- concatenated strings (JoinedStr's) can correspond to arbitrarily many 
tokens.

We can say the following about a possible solution:

1. The solution is possible, because all necessary data are present. 

2. The solution will be non-trivial. Special case hacks have no chance of 
working. An elegant solution may be possible. If so, comments must explain 
why it works.

3. Any solution must involve cooperation between sync_token and the do_Str, 
do_FormattedValue and do_JoinedStr visitors, *including especially *the 
advance_if and advance_string helpers.

The challenge will be to create a *sound *solution that is also as simple 
as possible. It's a big ask. It's worth any amount of work.

Edward

Writing started: 5:15 am; finished 7:07 am.

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/leo-editor/8cbf7e97-3693-43c7-81aa-7ad43fbcc446%40googlegroups.com.

Reply via email to