Terry J. Reedy <tjre...@udel.edu> added the comment:

Zac, thank you for Hypothesmith.  I am thinking about how I might use it to 
test certain IDLE functions.  But your 2nd example, as posted, does not 
compile, even with 3.8.  Typo?

Thank you also for the two failure examples.  I worked on untokenize in 2013 
and am not surprised that there are still bugs.  The test assert matches the 
doc claim that the output of untokenize "is guaranteed to tokenize back to 
match the input" as far as token type and string.  To get output that could 
be used to fix the bugs, I converted your test to unittest (and ran it with 
3.10).

from io import StringIO as SIO
import tokenize
import unittest

class RoundtripTest(unittest.TestCase):
    def test_examples(self):
        examples = ("#", "\n\\\n", "#\n\x0cpass#\n",)
        for code in examples:
            with self.subTest(code=code):
                tokens = list(tokenize.generate_tokens(SIO(code).readline))
                print(tokens)
                # untokenize may change whitespace relative to the source
                outstring = tokenize.untokenize(tokens)
                print(outstring)
                output = tokenize.generate_tokens(SIO(outstring).readline)
                self.assertEqual([(t.type, t.string) for t in tokens],
                                 [(t.type, t.string) for t in output])

unittest.main()

"#" compiles: untokenize calls add_whitespace, which failed on line 173 with
 ValueError: start (1,1) precedes previous end (2,0)
tokens = [
TokenInfo(type=60 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
TokenInfo(type=61 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

The doc for NL, a tokenize-only token, says "Token value used to indicate a 
non-terminating newline. The NEWLINE token indicates the end of a logical line 
of Python code; NL tokens are generated when a logical line of code is 
continued over multiple physical lines."  The NL token seems to be a mistake 
here.

Calling add_whitespace also seems like a mistake.  In any case, raising on a 
valid token stream is obviously bad.
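
For reference, here is a minimal sketch of the failing guard, paraphrased 
from Untokenizer.add_whitespace in Lib/tokenize.py (details may vary by 
version).  untokenize advances prev_row past the current line after every 
NEWLINE or NL token, which is why the NEWLINE starting at (1, 1) appears to 
"precede" the previous end (2, 0):

def add_whitespace_guard(prev_row, prev_col, start):
    # Paraphrase of the position check in Untokenizer.add_whitespace.
    row, col = start
    if row < prev_row or row == prev_row and col < prev_col:
        raise ValueError("start ({},{}) precedes previous end ({},{})"
                         .format(row, col, prev_row, prev_col))

add_whitespace_guard(2, 0, (1, 1))  # raises, as in the "#" example above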


"\n\\\n" does not compile in 3.8 or 3.10.
>>> compile("\n\\\n", '', 'exec')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "", line 2
    \
     ^
SyntaxError: unexpected EOF while parsing

generate_tokens calls _tokenize, which failed on line 521 of tokenize.py, as 
it should, with
tokenize.TokenError: ('EOF in multi-line statement', (3, 0))
Nothing to fix here.
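
For completeness, the TokenError can be reproduced directly, without the 
unittest harness:

from io import StringIO
import tokenize

try:
    list(tokenize.generate_tokens(StringIO("\n\\\n").readline))
except tokenize.TokenError as e:
    print(e)  # ('EOF in multi-line statement', (3, 0))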


"#\n\x0cpass#\n": outstring is '#\n pass#\n', which fails compile with 
IndentationError.
[(60, '#'),(61, '\n'),         (1, 'pass'),(60, '#'),(4, '\n'),        (0, '')] 
!= 
[(60, '#'),(61, '\n'),(5, ' '),(1, 'pass'),(60, '#'),(4, '\n'),(6, ''),(0, '')]

test_tokenize tests the roundtrip with various modules and test strings, but 
perhaps none with formfeed.  I think the bug is tokenizing 'pass' as starting 
in column 1 instead of column 0.
TokenInfo(type=1 (NAME), string='pass', start=(2, 1), end=(2, 5), line='\x0cpass#\n')

Formfeed = '\f' = '\x0c' is legal here precisely because it is non-space 
'whitespace' that does not advance the column counter.  In _tokenize, \f 
resets the indentation column counter to 0.  Otherwise, there would be an 
IndentationError, as there is with the outstring.  But the token's string 
position ignores the indentation counter.  Either the string position must be 
adjusted, so that \f is replaced with nothing, or a token for \f must be 
emitted so that it is not replaced with a space.
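
The relevant measuring loop in _tokenize looks roughly like this (a 
paraphrase from Lib/tokenize.py; details may differ slightly by version).  
Note that \f resets the column but leaves no trace in any emitted token:

line, tabsize = '\x0cpass#\n', 8   # the failing physical line
column = pos = 0
while pos < len(line):             # measure leading indentation
    if line[pos] == ' ':
        column += 1
    elif line[pos] == '\t':
        column = (column // tabsize + 1) * tabsize
    elif line[pos] == '\f':
        column = 0                 # formfeed: reset, do not advance
    else:
        break
    pos += 1
print(column, pos)  # 0 1 -- indent column is 0, but 'pass' starts at position 1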

Tokens 5 and 6 are INDENT and DEDENT, so the latter will go away with the 
former.

What is a bit silly about untokenize is that it ignores the physical line in 
5-tuples even when it is present.  That is a separate issue, along with the 
dreadful API.
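
To illustrate the two input modes (as I understand the current API): full 
5-tuples take the position-based path through add_whitespace, while 2-tuples 
take the 'compatibility' path, which ignores positions and guesses spacing:

from io import StringIO
import tokenize

toks = list(tokenize.generate_tokens(StringIO("x = 1\n").readline))
print(repr(tokenize.untokenize(toks)))  # full mode: positions preserved
print(repr(tokenize.untokenize((t.type, t.string) for t in toks)))
# compat mode: whitespace may differ from the source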


A note could be added to
https://docs.python.org/3/reference/lexical_analysis.html#whitespace-between-tokens
when _tokenize is patched.
---

BPO is aimed at facilitating patches.  Other discussions are best done 
elsewhere.  But I have a quick question and comment.

Can hypothesis be integrated as-is with unittest?  Does it work to decorate 
test_xyz and get a sensible report of multiple failures?  Is there now, or 
might there be in the future, an iterator interface, so one could write "for 
testcase in testcases: with subTest(...): ..."?
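
To make the question concrete, here is the shape I have in mind.  The @given 
form is, as far as I know, how hypothesis already attaches to unittest 
methods; the iterator form in the comment is purely hypothetical:

import unittest
from hypothesis import given, strategies as st

class TestRoundtrip(unittest.TestCase):
    @given(st.text())  # or a source-code strategy such as hypothesmith's
    def test_roundtrip(self, code):
        ...  # tokenize/untokenize assertions as in the test above

# Hypothetical iterator interface ('testcases' does not exist today):
# for code in testcases(st.text()):
#     with self.subTest(code=code):
#         ...

unittest.main()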

About properties: your repository 
https://github.com/Zac-HD/stdlib-property-tests pointed me to metamorphic 
testing, https://www.hillelwayne.com/post/metamorphic-testing/, which led me 
to the PDF version of "Metamorphic Testing: A Review of Challenges and 
Opportunities".  Most properties I have seen correspond to metamorphic 
relations.  'Metamorphic' is broader than may be immediately obvious.  I 
would like to discuss this more on a better channel.

----------
stage:  -> test needed
versions: +Python 3.10 -Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38953>
_______________________________________