Gareth Rees added the comment: Yury, let me see if I can move this issue forward. I clearly haven't done a good job of explaining these problems, how they are related, and why it makes sense to solve them together, so let me have a go now.
1. tokenize.untokenize() raises AssertionError if you pass it a sequence of tokens output from tokenize.tokenize(). This was my original problem report, and it's still not fixed in Python 3.4:

    Python 3.4.0b3 (default, Jan 27 2014, 02:26:41)
    [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tokenize, io
    >>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline))
    >>> tokenize.untokenize(t)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 317, in untokenize
        out = ut.untokenize(iterable)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 246, in untokenize
        self.add_whitespace(start)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 232, in add_whitespace
        assert row <= self.prev_row
    AssertionError

This defeats any attempt to use the sequence

    input code -> tokenize -> transform -> untokenize -> output code

to transform Python code. But this ought to be the main use case for the untokenize function! That's how I came across the problem in the first place, when I was starting to write Minipy <https://github.com/gareth-rees/minipy>.

2. Fixing problem #1 is easy (just swap <= for >=), but it raises the question: why wasn't this mistake caught by test_tokenize? There's a test function roundtrip() whose docstring says:

    Test roundtrip for `untokenize`. `f` is an open file or a string.
    The source code in f is tokenized, converted back to source code
    via tokenize.untokenize(), and tokenized again from the latter.
    The test fails if the second tokenization doesn't match the first.

If roundtrip() can't detect the error, then how can I be sure I have fixed it? Clearly it's necessary to fix the test case first and establish that it provokes the assertion.

So why doesn't roundtrip() detect the error? Well, it turns out that tokenize.untokenize() has two modes of operation, and roundtrip() only tests one of them. The documentation for tokenize.untokenize() is rather cryptic; all it says is:

    Each element returned by the [input] iterable must be a token
    sequence with at least two elements, a token number and token
    value. If only two tokens are passed, the resulting output is
    poor.

By reverse-engineering the implementation, I found that it has two modes of operation. In the first mode (which I have called "compatibility" mode after the method Untokenizer.compat() that implements it) you pass it tokens in the form of 2-element tuples (type, text); these must have exactly 2 elements. In the second mode (which I have called "full" mode based on the description "full input" in the docstring) you pass it tokens in the form of tuples with 5 elements (type, text, start, end, line); these are compatible with the named tuples returned from tokenize.tokenize(). (See the sketch at the end of this point for the two modes side by side.) The "full" mode has the buggy assertion, but test_tokenize.roundtrip() only tests the "compatibility" mode. So I must (i) fix roundtrip() so that it tests both modes; and (ii) improve the documentation for tokenize.untokenize() so that programmers have some chance of figuring this out in future!
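To make the two modes concrete, here is a minimal sketch (my own illustration, not code from the patch; the "full" call only succeeds once the assertion is fixed):

    import io, tokenize

    source = b"spam = 1\nif spam:\n    eggs  =  2\n"
    tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

    # "Full" mode: 5-tuples (type, text, start, end, line), as produced
    # by tokenize.tokenize(). Whitespace is reconstructed from the
    # start/end positions, so the output should match the input.
    full = tokenize.untokenize(tokens)

    # "Compatibility" mode: bare (type, text) pairs. untokenize() has to
    # guess the spacing, so the output is only guaranteed to tokenize to
    # the same sequence of tokens, not to match the input layout.
    compat = tokenize.untokenize(tok[:2] for tok in tokens)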
3. As soon as I make roundtrip() test both modes it provokes the assertion failure. Good, so I can fix the assertion. Problem #1 solved. But now there are test failures in "full" mode:

    $ ./python.exe -m test test_tokenize
    [1/1] test_tokenize
    **********************************************************************
    File "/Users/gdr/hg.python.org/cpython/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
    Failed example:
        for testfile in testfiles:
            if not roundtrip(open(testfile, 'rb')):
                print("Roundtrip failed for file %s" % testfile)
                break
        else:
            True
    Expected:
        True
    Got:
        Roundtrip failed for file /Users/gdr/hg.python.org/cpython/Lib/test/test_platform.py
    **********************************************************************
    1 items had failures:
       1 of 73 in test.test_tokenize.__test__.doctests
    ***Test Failed*** 1 failures.
    test test_tokenize failed -- 1 of 78 doctests failed
    1 test failed:
        test_tokenize

Examination of the failed tokenization shows that if the source contains a backslash-newline, then in "full" mode Untokenizer.add_whitespace() fails to leave any space between the token before the backslash-newline and the token after it. The trouble is that backslash-newlines don't result in a token from tokenize(). But in Untokenizer.add_whitespace() I can deduce that there must have been a backslash-newline whenever I see the first token on a new line (other than a DEDENT or ENDMARKER) and the previous token was not a NEWLINE or NL.

4. OK, with problems 1-3 fixed, shouldn't I now actually test the documented round-trip property in full? The property that roundtrip() currently tests is

    tokenize(untokenize(tokenize(code))) == tokenize(code)

which is all that is guaranteed in "compatibility" mode. But the docstring for tokenize.untokenize() says:

    Round-trip invariant for full input:
        Untokenized source will match input source exactly

So in "full" mode it seems that the following is guaranteed:

    untokenize(tokenize(code)) == code

But if I add a test for this stronger round-trip property, it immediately fails. The docstring is not telling the truth!

5. Why doesn't the "full" round-trip property hold? It seems that there are (at least) two separate problems:

i. tokenize() doesn't output a token for a backslash-newline, so although untokenize() can deduce that there must have been one, it can't know what whitespace there was before it. (The snippet in the postscript below demonstrates this.)

ii. In compatibility mode, the first token is always discarded. If the tokens came from tokenize.tokenize(), then this is the ENCODING token, so the encoding is never applied. (Benjamin Peterson had previously discovered this problem but just suppressed the failing test case instead of reporting it: <http://hg.python.org/cpython/rev/c13abed5d764>)

Looking through this sequence of woes again, it's clear that I can save problems 4 and 5 for another issue. But I am confident that problems 1-3 must be fixed together. So although Meador Inge said above that "the current patch fixes way too many issues", I hope that if he reads the argument above he will see why I did what I did: this is the minimum sensible change I know how to make.

I have prepared a revised patch that applies to the current head. Please review it: I'll do my best to fix all the problems you find.

----------
Added file: http://bugs.python.org/file33919/Issue12691.patch
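P.S. Here is a quick way to see problem 5.i for yourself (my own snippet, not part of the patch; it assumes an untokenize() with problems 1-3 fixed, so that "full" mode runs to completion):

    import io, tokenize

    source = b"x = 1 + \\\n    2\n"
    tokens = tokenize.tokenize(io.BytesIO(source).readline)
    result = tokenize.untokenize(tokens)

    # untokenize() can deduce the backslash-newline from the token
    # positions, but it cannot know that a space preceded the backslash,
    # so that space is lost and the strong round-trip property fails.
    print(result == source)  # False: result lacks the space before "\"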