I am creating this post as a courtesy to anyone interested in python's tokenize 
module.

**tl;dr:** Various posts, linked below, discuss a much better replacement for 
untokenize.  Do with it as you will.

This code is very unlikely to be buggy, but *please* let me know if you find 
problems with it.

**About the new untokenize**

This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/VPqtB9lTEAAJ
announces a replacement for the untokenize function in tokenize.py: 
https://github.com/python/cpython/blob/3.8/Lib/tokenize.py

To summarize this post:

I have "discovered" a spectacular replacement for Untokenizer.untokenize in 
python's tokenize library module:

- The wretched, buggy, and impossible-to-fix add_whitespace method is gone.
- The new code has no significant 'if' statements, and knows almost nothing 
about tokens!

As I see it, the only possible failure modes involve the zero-length line 0.  
See the above post for a full discussion.
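
Purely for flavor, here is a minimal sketch of the central idea as I understand 
it from that post.  The names are mine, and the sketch assumes the original 
source is available (see the linked post for the real code): convert each 
token's (row, col) coordinates to absolute offsets, then join the tokens with 
the verbatim gaps between them, so nothing is ever guessed.

```python
import io
import tokenize

def untokenize_sketch(contents):
    """Sketch only: reconstruct `contents` from its own tokens.

    Token (row, col) coordinates become absolute offsets into
    `contents`, and the text *between* tokens is copied verbatim,
    so no whitespace is ever synthesized.
    """
    # offsets[i] is the absolute offset of the start of line i + 1.
    offsets = [0]
    for line in contents.splitlines(keepends=True):
        offsets.append(offsets[-1] + len(line))
    results, prev_end = [], 0
    # generate_tokens (string mode) emits no ENCODING token at row 0,
    # sidestepping the "line 0" corner mentioned above.
    for tok in tokenize.generate_tokens(io.StringIO(contents).readline):
        start = offsets[tok.start[0] - 1] + tok.start[1]
        end = offsets[tok.end[0] - 1] + tok.end[1]
        results.append(contents[prev_end:start])  # the gap, verbatim
        results.append(contents[start:end])       # the token, verbatim
        prev_end = end
    return ''.join(results)

# Round-trip: the output is the input, character for character.
source = "def f(a,  b):\n    return a+b  # trailing comment\n"
assert untokenize_sketch(source) == source
```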

**Testing**

This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/5X8IDzpgEAAJ 
discusses testing issues.
Imo, the new code should easily pass all existing unit tests.

The new code also passes a new unit test for Python issue 38663: 
https://bugs.python.org/issue38663,
a test the existing code fails, even in "compatibility mode" (2-tuples).
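
For readers who haven't followed the links, the test has roughly this shape.  
This is only an illustration of the pattern, not the actual test from the 
issue; the program that trips the existing code is given there.

```python
import io
import tokenize
import unittest

class TestRoundTrip(unittest.TestCase):
    """Illustration only, not the actual test for issue 38663:
    given full five-tuples, untokenize should reproduce the input
    source character for character."""

    def check(self, source):
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
        self.assertEqual(tokenize.untokenize(tokens), source)

    def test_round_trip(self):
        self.check("x = 1  +  2  # comment\n")

if __name__ == '__main__':
    unittest.main()
```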

Imo, the way is now clear for proper unit testing of python's Untokenizer class.

In particular, it is, imo, time to remove compatibility mode.  This hack has 
masked serious issues with untokenize:
https://bugs.python.org/issue?%40columns=id%2Cactivity%2Ctitle%2Ccreator%2Cassignee%2Cstatus%2Ctype&%40sort=-activity&%40filter=status&%40action=searchid&ignore=file%3Acontent&%40search_text=untokenize&submit=search&status=-1%2C1%2C2%2C3
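
For anyone who hasn't met compatibility mode: when untokenize is given 
2-tuples, all position information is gone, so it must synthesize whitespace.  
A quick illustration (the exact synthesized spacing may vary between Python 
versions):

```python
import io
import tokenize

source = "x = 1  +  2\n"
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
# Compatibility mode: pass only (type, string) 2-tuples; positions dropped.
result = tokenize.untokenize((t.type, t.string) for t in tokens)
print(repr(result))
# The result re-tokenizes to an equivalent stream, but the spacing is
# synthesized -- something like 'x =1 +2 \n', not the original text.
```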

**Summary**

The new untokenize is the way it is written in The Book.

I have done the heavy lifting on issue 38663. Python devs are free to do with 
it as they like.

Your choice will not affect me or Leo in any way. The new code will soon become 
the foundation of Leo's token-oriented commands.

Edward

P.S. I would imagine that tokenize.untokenize is pretty much off most devs' 
radar :-)

This Engineering Notebook post: 
https://groups.google.com/d/msg/leo-editor/aivhFnXW85Q/b2a8GHvEDwAJ
discusses (in way too much detail :-) why untokenize is important to me.

To summarize that post:

Imo, python devs are biased in favor of parse trees in programs involving text 
manipulations.  I assert that the "real" black and fstringify tools would be 
significantly simpler, clearer and faster if they used python's tokenize module 
instead of python's ast module. Leo's own "beautify" and "fstringify" commands 
prove my assertion to my own satisfaction.
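
As one small illustration of why I prefer tokens for this kind of work (a 
sketch in my own names, not Leo's actual code): a purely token-level edit 
preserves every comment and every run of whitespace for free, something an 
ast-based rewrite must struggle to recover.

```python
import io
import tokenize

def rename_sketch(source, old, new):
    """Sketch only: naively rename every NAME token equal to `old`.
    Gaps between tokens are copied verbatim from `source`, so every
    comment and every space survives the edit untouched."""
    # offsets[i] is the absolute offset of the start of line i + 1.
    offsets = [0]
    for line in source.splitlines(keepends=True):
        offsets.append(offsets[-1] + len(line))
    out, prev_end = [], 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        start = offsets[tok.start[0] - 1] + tok.start[1]
        end = offsets[tok.end[0] - 1] + tok.end[1]
        out.append(source[prev_end:start])  # the gap, verbatim
        text = source[start:end]
        out.append(new if tok.type == tokenize.NAME and text == old else text)
        prev_end = end
    return ''.join(out)

before = "total  = n1 +  n2  # keep my spacing!\n"
after = rename_sketch(before, "total", "subtotal")
assert after == "subtotal  = n1 +  n2  # keep my spacing!\n"
```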

This opinion will be controversial, so I want to make the strongest possible 
case. I need to prove that handling tokens can be done simply and correctly in 
all cases. This is a big ask, because python's tokens are complicated.  See the 
Lexical Analysis section of the Python Language Reference.

The new untokenize furnishes the required proof, and does so elegantly.

EKR