The next phase of the project is to complete the code that splits long 
lines and joins short lines. I want this code to be as simple as possible. 
The crucial split/join "snippets" should advertise the virtues of the TOG 
class.

Just as with the code that handles slices, I have only a vague idea of what 
the final split/join code will look like. This ENB notebook post attempts 
to clarify issues relating to the split/join logic. As always, feel free to 
ignore it.

*Background*

At present, the code that splits lines is *entirely* token based. This 
*usually* works well enough, but the token-based code relies on an open 
parenthesis (token) already being present in the statement. If this open 
paren exists, the long line may safely be split anywhere between tokens. 
Most long lines involve function call statements (ast.Call nodes), and such 
statements do indeed contain the needed open paren. Alas, other Python 
statements, including returns and assignments, may not already contain one. 
The split code must know where to insert the required pair of parens.
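
As a concrete illustration (a sketch of the idea, not leoAst's actual 
code), a purely token-based scan can collect the legal split points once an 
open paren is present. The function name and details here are mine:

```python
# Hypothetical sketch: find the column offsets at which a long call
# line could safely be split, assuming an open paren already exists.
import io
import tokenize

def split_candidates(line):
    """Return end-columns of tokens at or after the first '(' in line."""
    if not line.endswith('\n'):
        line += '\n'  # tokenize wants a complete logical line.
    tokens = tokenize.generate_tokens(io.StringIO(line).readline)
    cols, seen_paren = [], False
    for tok in tokens:
        if tok.type == tokenize.OP and tok.string == '(':
            seen_paren = True
        # Any boundary between tokens after the '(' is a candidate.
        if seen_paren and tok.type not in (tokenize.NEWLINE, tokenize.ENDMARKER):
            cols.append(tok.end[1])
    return cols

print(split_candidates("print(a, b, c)"))
```

Without a parse tree, nothing in this scan can tell a return or an 
assignment where a new pair of parens would have to go.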

In short, my working assumption is that access to the parse tree is 
essential, or at least very helpful, in the split logic. Ditto for the join 
logic.


*Gaining access to the parse tree*

o.colon could get the relevant parse tree from self.token.node, because *colons 
are significant tokens*. Job done.

How to access the parse tree for long lines? Using the newline token seems 
reasonable, because newline tokens are also significant. However, the 
one-line code snippets used by the split/join logic don't contain *any* 
newlines.

*Problems assigning newline tokens*

More generally, the last newline of a code snippet is assigned to the 
ast.Module node. At the very least, this must change. Or must it? And if 
so, how?

We could ignore (temporarily) the problems with assigning tokens to nodes. 
For example, we could "trigger" the split/join logic in the o.name token 
handler. "name" tokens are significant, so self.token.node will be the 
parse tree for the name. For function calls, we would have to look up the 
tree to determine whether the name is a function name. Doable, but not 
pretty.
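
To see why it isn't pretty: deciding whether a Name node is a function name 
requires a parent link that ast does not provide. A sketch (the parent map 
is my own scaffolding, not part of the TOG class):

```python
# Hypothetical sketch: decide whether a Name node is the function
# being called, by looking "up the tree".  ast nodes have no parent
# links, so we must build a parent map first.
import ast

def is_function_name(tree, name_node):
    parents = {
        child: parent
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
    }
    parent = parents.get(name_node)
    return isinstance(parent, ast.Call) and parent.func is name_node

tree = ast.parse("f(x)")
# ast.walk yields the Name for 'f' (the function) before the Name for 'x'.
f_node, x_node = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
```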

"return", "if", "while" etc are keywords, so the parse tree is usable as 
is. Assignments would require a trigger on "=" tokens, that is, op tokens 
whose value is "=".

So this approach is clunky. It spreads the split/join logic over too many 
nodes. It seems more reasonable to trigger the split/join logic on the 
'newline' token, or the 'endmarker' token for the special case that the 
file/snippet ends without a newline. Or maybe we can just force a trailing 
newline for all files/snippets.
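
Forcing the trailing newline would be trivial. A one-line sketch (my 
naming, not leoAst's):

```python
def ensure_trailing_newline(contents):
    """Guarantee that a file/snippet ends with a newline, so the
    split/join logic can always trigger on a 'newline' token."""
    return contents if contents.endswith('\n') else contents + '\n'
```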

*Extrinsically significant tokens*

At present, tokens are classified as either significant or insignificant. 
That is, "significance" is an *intrinsic* property of each token. This is 
foolish, and limiting.

Indeed, the ast.Call and ast.Tuple visitors already call tog.gen_token for 
parentheses tokens. In such contexts, parens should be considered 
significant, and the eventual call to tog.sync_token should synchronize on 
those tokens. This would ensure that the parens are assigned to the proper 
node! Alas, sync_token doesn't do that. At present, it just stupidly 
returns, assigning the parens (later) to the next "officially" significant 
token. As a result, parens are not assigned properly for calls and tuples.

We could go further, and have various visitors generate comma tokens, but I 
doubt that would ever be useful.
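
One way to make significance extrinsic (a sketch only; the class and method 
names here are mine, not leoAst's) is to let visitors push token values 
that should be synchronized on in the current context:

```python
# Hypothetical sketch of context-dependent ("extrinsic") significance:
# visitors push token values that sync_token should match instead of
# skipping, so parens in calls and tuples get assigned to the proper node.

class SyncSketch:
    INTRINSIC = {'name', 'number', 'string', 'newline'}

    def __init__(self, tokens):
        self.tokens = tokens   # list of (kind, value) pairs
        self.index = 0
        self.extra = []        # stack of context-significant values

    def push_significant(self, value):
        self.extra.append(value)

    def is_significant(self, kind, value):
        return kind in self.INTRINSIC or value in self.extra

    def sync_token(self, kind, value):
        """Skip insignificant tokens, then match and assign (kind, value)."""
        while self.index < len(self.tokens):
            k, v = self.tokens[self.index]
            self.index += 1
            if self.is_significant(k, v):
                assert (k, v) == (kind, value), ((k, v), (kind, value))
                return self.index - 1   # index of the token just assigned
        raise ValueError('token stream exhausted')

# An ast.Call visitor would push '(' and ')' before syncing on them:
sync = SyncSketch([('name', 'f'), ('op', '('), ('name', 'x'), ('op', ')')])
sync.push_significant('(')
sync.push_significant(')')
```

With the parens pushed, sync_token matches them in order instead of 
deferring them to the next intrinsically significant token.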

*Summary*

Whatever happens, the code should properly assign paren tokens in calls and 
tuples. Ditto for newline tokens that end many statement lines. Only 
tog.sync_token will need to change, but that will be surprisingly tricky. 
Details omitted.

I'll investigate using the parse tree as a guide to splitting and joining 
lines only after parens and newline tokens are more reasonably assigned to 
ast nodes.

Edward

P. S. There is another complication: statements may become "long" via 
Python's backslash-newline convention. The black tool itself takes the 
extreme view that backslash-newlines should always be eliminated. But this 
would be wrong, wrong, wrong in Leo, because Leo nodes cannot represent 
underindented triple-quoted strings. For example, all of the unit tests in 
leoAst.py for multi-line test code contain this pattern:

    # use r""" if lines contain backslashes.
    contents = """\
line 1
line 2
"""

Depending on the outline level of the node in which this code resides, line 
1, line 2 etc. will initially contain *unseen leading whitespace*. The 
test-running code removes such leading whitespace. Anyway, Leo depends on 
the backslash-newline convention.
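
For what it's worth, the same cleanup can be done outside Leo with 
textwrap.dedent, which removes exactly this kind of common leading 
whitespace (a sketch, not Leo's actual test-running code):

```python
# Sketch: the backslash after the opening triple quote suppresses a
# leading blank line; dedent then strips the common leading whitespace
# that outline nesting would otherwise leave in line 1, line 2, etc.
import textwrap

contents = """\
    line 1
    line 2
    """
print(textwrap.dedent(contents))
```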

Even outside Leo the coding pattern shown above seems perfectly reasonable 
for unit-tests. Why prohibit it?

EKR

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/leo-editor/4078c167-1649-4a3f-9497-f2ef0db854c1%40googlegroups.com.
