Terry J. Reedy <tjre...@udel.edu> added the comment:

Since these posts were more or less copied to the pydev list, I am copying my 
response from that list here.
---

> **tl;dr:** Various posts, linked below, discuss a much better replacement for 
> untokenize.
If that were true, I would be interested.  But as explained below, I don't 
believe it.  Even if I did, https://bugs.python.org/issue38663 gives no 
evidence that you have signed the PSF contributor agreement.  In any case, it 
has no PR.  We only use code that is actually contributed on the issue or in a 
PR under that agreement.

To continue, the first two lines of tokenize.untokenize() are
    ut = Untokenizer()
    out = ut.untokenize(iterable)

Your leoBeautify.Untokenize class appears to be completely unsuited as a 
replacement for tokenize.Untokenizer, as the APIs of the class and method are 
incompatible with the above.

1. tokenize.Untokenizer takes no argument. leoBeautify.Untokenize() requires 
a 'contents' argument, a (unicode) string, that is otherwise undocumented.  At 
first glance, it appears that 'contents' needs to be something like the desired 
output.  (I could read the code where you call Untokenize to improve my guess, 
but not now.)  Since our existing tests do not pass 'contents', they should all 
fail.

2. tokenize.Untokenizer.untokenize(iterable) requires an iterable that returns 
"sequences with at least two elements, the token type and the token string."
https://docs.python.org/3/library/tokenize.html#tokenize.untokenize
One can generate Python code from a sequence of such pairs with a guarantee 
that the resulting code will tokenize back into the same sequence.
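
A minimal sketch of that guarantee in action (my illustration, using only 
documented tokenize APIs):

    import io
    import tokenize

    code = "x = 1 + 2\nprint(x)\n"
    tokens = list(tokenize.generate_tokens(io.StringIO(code).readline))

    # Keep only (type, string) pairs; with positions absent,
    # untokenize runs in compatibility mode.
    pairs = [(tok.type, tok.string) for tok in tokens]
    result = tokenize.untokenize(pairs)

    # The spacing of 'result' may differ from 'code', but
    # retokenizing it reproduces the same (type, string) sequence.
    retokens = tokenize.generate_tokens(io.StringIO(result).readline)
    assert [(tok.type, tok.string) for tok in retokens] == pairs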

The doc continues "Any additional sequence elements are ignored."
The intent is that a tool can tokenize a file, modify the token stream 
(thereby possibly invalidating the begin, end, and line elements of the 
original tokens), and generate a modified file.
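
For instance, one might rename an identifier and then regenerate the file 
from (type, string) pairs, since the edit makes the recorded positions 
untrustworthy (a sketch, not code from this issue):

    import io
    import tokenize

    code = "spam = 1\nprint(spam)\n"

    # Rename 'spam' to 'ham'; the new name has a different length,
    # so the original start/end positions are no longer valid and
    # we hand untokenize only (type, string) pairs.
    pairs = [
        (tok.type,
         "ham" if tok.type == tokenize.NAME and tok.string == "spam"
         else tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(code).readline)
    ]
    print(tokenize.untokenize(pairs))  # valid Python; spacing may differ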

[Note that the end index (4th element), when present, is not ignored but is 
used to improve whitespace insertion.  I believe that this should be 
documented.  What if the end index is no longer valid?  Should we also use the 
start index?]

leoBeautify.Untokenize.untokenize() requires an iterable of 5-tuples.  It makes 
use of both the start and end elements, as well as the mysterious required 
'contents' string.

> I have "discovered" a spectacular replacement for Untokenizer.untokenize in 
> python's tokenize library module:
To pass 'code == untokenize(tokenize(code))' (ignoring API details), there is 
an even more spectacular replacement: rebuild the code from the 'line' 
elements, as sketched below.  But while the above is an essential test, it is 
a toy example with respect to applications.  The challenge is to create a 
correct and valid file from less information, possibly with only token type 
and string.  (The latter is 'compatibility mode'.)
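
For concreteness, here is roughly what such a 'line'-based rebuild could look 
like (my sketch; 'rebuild_from_lines' is a name I made up, not proposed code):

    import io
    import tokenize

    def rebuild_from_lines(code):
        # Reassemble the source from the .line element of each token.
        # .line holds every physical line the token spans, so skip
        # rows that an earlier token has already contributed.
        out = []
        covered = 0  # last physical row already emitted
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.end[0] <= covered or not tok.line:
                continue
            lines = tok.line.splitlines(keepends=True)
            first_row = tok.end[0] - len(lines) + 1
            out.append("".join(lines[max(0, covered - first_row + 1):]))
            covered = tok.end[0]
        return "".join(out)

    code = "def f():\n    return 'hi'\n"
    assert rebuild_from_lines(code) == code  # trivially round-trips

Passing this test says little; the interesting cases begin once the stream 
has been modified.
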
> In particular, it is, imo, time to remove compatibility mode.
And break all usage that requires it?  Before doing much more with tokenize, I 
would want to better understand how it is actually used.

> Imo, python devs are biased in favor of parse trees in programs involving 
> text manipulations.  [snip]
So why have 46 of us contributed to this one module?  This sort of polemic is a 
net negative here.  We are multiple individuals with differing opinions.

----------
nosy: +terry.reedy
stage:  -> test needed
versions: +Python 3.9 -Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38663>
_______________________________________