I started this Engineering Notebook post to clarify my thinking. It has 
shown me the way forward.

Part of it has turned into a Theory of Operation for Leo's token-based 
code. In the Divio documentation framework 
<https://www.divio.com/blog/documentation/>, this is a Discussion.  That 
is, it is oriented towards understanding.

This post explores picky issues related to python tokens. Feel free to skip 
any or all of it.

*Background*

Imo, python devs are biased in favor of parse trees in programs that 
manipulate text.  The "real" black and fstringify tools would be 
significantly simpler, clearer, and faster if they used python's tokenize 
module instead of python's ast module.  Leo's own black and fstringify 
commands prove my contention to *my* satisfaction.

I would like to "go public" with my opinion.  This opinion will be 
controversial, so I want to make the strongest possible case. I need to 
*prove* that handling tokens can be done simply and correctly in *all* 
cases.  This is a big ask, because python's tokens are complicated.  See 
the Lexical Analysis 
<https://docs.python.org/3/reference/lexical_analysis.html> section of the 
Python Language Reference 
<https://docs.python.org/3/reference/index.html>.

The beautify2 branch is intended to provide the required proof.

*Strategy*

The beautify2 branch bases all token handling on the *untokenize* function 
in python's tokenize module.

Given a stream of tokens (5-tuples) produced from code by 
tokenize.generate_tokens, untokenize *reproduces* code, the original source 
code. This *round tripping* property of untokenize is the basis for the 
required proof.
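
For example, here is a minimal round trip (io.StringIO stands in for a 
real file):

import io
import tokenize

code = (
    'def spam(a, b):\n'
    '    return a + b  # a comment\n'
)
tokens = tokenize.generate_tokens(io.StringIO(code).readline)
result = tokenize.untokenize(tokens)
assert result == code  # untokenize recreates the source exactly.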

Recreating the source code *within* each token is straightforward. The 
hard part is recreating the *between-token whitespace*. That is exactly 
what tokenize.untokenize is guaranteed to do!

So the strategy is simple.  All commands will create their input tokens 
using the logic of tokenize.untokenize. This will guarantee that token 
handling is *sound*, that is, that the list of input tokens will contain 
*exactly* the correct token data.

*Classes*

The beautify2 branch defines several classes that use tokens. Each class 
does the following:

1. Creates a list (not an iterator) of *input tokens*.  Using real lists 
allows lookahead, which plain iterators do not support.

2. Defines one or more input token *handlers*. Handlers produce zero or 
more *output tokens*.
 
A straightforward concatenation of all output tokens produces the result of 
each command.
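
For instance, assuming each output token exposes its text in a value 
attribute (the names here are illustrative, not the actual API):

def tokens_to_string(output_tokens):
    # The result of a command is just the concatenation of the
    # output tokens' values; no other post-processing is needed.
    return ''.join(z.value for z in output_tokens)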

Here are the actual classes:

- class *BeautifierToken*:
  Input and output tokens are instances of this class.

- class *NullTokenBeautifier*:
  The base class for actual commands. This class is the natural place to 
test round-tripping. 

- class *FStringifyTokens*(NullTokenBeautifier):
  Implements Leo's token-based fstringify commands. It defines a handler 
only for string input tokens.

- class *PythonTokenBeautifier*(NullTokenBeautifier):
  Implements Leo's token-based beautify commands. It defines handlers for 
all input tokens.

*Tokenizing and token hooks*

NullTokenBeautifier.*make_input_tokens* creates a list of input tokens 
from a sequence of 5-tuples produced by tokenize.generate_tokens.  There is 
no need for subclasses to override make_input_tokens, because...

make_input_tokens is *exactly* the same as tokenize.untokenize, except 
that it calls *token hooks* in various places.  These hooks allow 
subclasses to modify the tokens returned by make_input_tokens. The *null 
hooks* (in NullTokenBeautifier) make make_input_tokens work *exactly* the 
same as tokenize.untokenize.
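
Here is a minimal, hypothetical sketch of the scheme. It mirrors the 
whitespace logic of tokenize.untokenize in simplified form; the names and 
details of the real beautify2 code will differ:

import io
import tokenize

class NullTokenBeautifier:

    def ws_hook(self, ws):
        # Null hook: carry between-token whitespace as a pseudo "ws" token.
        if ws:
            self.results.append(('ws', ws))

    def token_hook(self, kind, value):
        # Null hook: pass the token through unchanged.
        self.results.append((tokenize.tok_name[kind].lower(), value))

    def make_input_tokens(self, code):
        """Create a list of input tokens, calling hooks along the way."""
        self.results = []
        prev_row, prev_col = 1, 0
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            kind, value, (srow, scol), (erow, ecol), line = tok
            if kind == tokenize.INDENT:
                # Simplification: carry indentation as whitespace.
                self.ws_hook(value)
                prev_row, prev_col = erow, ecol
                continue
            if kind in (tokenize.DEDENT, tokenize.ENDMARKER):
                prev_row, prev_col = erow, ecol
                continue
            # Recreate the between-token whitespace, as untokenize's
            # add_whitespace helper does. (The bug discussed below lives
            # in the backslash-newline line.)
            ws = '\\\n' * (srow - prev_row)
            ws += ' ' * (scol - (0 if ws else prev_col))
            self.ws_hook(ws)
            self.token_hook(kind, value)
            prev_row, prev_col = erow, ecol
            if kind in (tokenize.NEWLINE, tokenize.NL):
                prev_row, prev_col = erow + 1, 0
        return self.results

A quick check that the null hooks round trip:

x = NullTokenBeautifier()
input_tokens = x.make_input_tokens('if a:\n    b = 2  # comment\n')
assert ''.join(value for (kind, value) in input_tokens) == (
    'if a:\n    b = 2  # comment\n')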

This scheme is the simplest thing that could possibly work. Subclasses may 
adjust the input tokens to make token handling easier:

1. The null hooks create pseudo "ws" tokens (in the proper places!) that 
carry the between-token whitespace. Nothing more needs to be done!

2. Extra "ws" tokens would complicate token-related parsing in the 
FStringifyTokens and PythonTokenBeautifier. Instead, the token hooks in 
these two classes "piggyback" between-token whitespace on already-created 
tokens. It's quite simple. See the token_hook methods of these two classes. 
Note that these two hooks are similar, but not exactly the same.
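
Continuing the hypothetical sketch above, a piggybacking subclass might 
look like this (an illustration only, not the real code):

class PiggybackBeautifier(NullTokenBeautifier):

    def make_input_tokens(self, code):
        self.pending_ws = ''
        return super().make_input_tokens(code)

    def ws_hook(self, ws):
        # Save the whitespace for the next real token.
        self.pending_ws += ws

    def token_hook(self, kind, value):
        # Each input token carries its preceding whitespace, so no
        # separate "ws" tokens appear in the token list.
        name = tokenize.tok_name[kind].lower()
        self.results.append((name, value, self.pending_ws))
        self.pending_ws = ''

The round trip then becomes ''.join(ws + value for (kind, value, ws) in 
input_tokens).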

Alas, there is an itsy bitsy problem...

*A bug in untokenize*

Alas, tokenize.untokenize does *not* properly "round trip" this valid 
python program:

print \
    ("abc")

The result is:

print\
    ("abc")

The whitespace before the backslash is not preserved.
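
A quick reproduction (this snippet is mine, not part of the beautify2 
code):

import io
import tokenize

code = 'print \\\n    ("abc")\n'
tokens = tokenize.generate_tokens(io.StringIO(code).readline)
result = tokenize.untokenize(tokens)
print(result == code)  # False: the space before the backslash is lost.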

*Does the bug matter?*

I have given this question considerable thought.  It's a case of theory vs 
practice.

In *practice*, this bug *doesn't* matter:

1. The odds of a programmer writing the code above are small.  Crucially, 
backslash-newlines within strings are always handled correctly.

2. Even if the buggy case did happen, Leo's beautify and fstringify 
commands would carry on without incident.

3. It's highly unlikely that anyone would complain about the diffs.

4. The bug could even be called a feature :-)

In *theory*, this bug is much more troubling.  I want to argue publicly 
that:

1. Basing token-based tools on tokenize.untokenize is absolutely sound.  
Alas, it is not.

2. tokenize.untokenize is easy to understand. Alas, it is not.

untokenize's helper, tokenize.Untokenizer.add_whitespace, is a faux-clever 
hack. After hours of study and tracing, I see no obvious way to fix the 
bug.

*Summary*

This post contains a high-level theory of operation for flexible 
token-based classes.

The code in the beautify2 branch has much to recommend it:

- It is demonstrably as sound as tokenize.untokenize, a big advance over 
previous code.
- It could easily serve as the basis for a public exhortation to base 
text-based tools on tokens, not parse trees.

For now, I'll ignore the bug, except to file a python bug report and ask 
for guidance about fixing it.

Edward
