This Engineering Notebook post will be referenced in an upcoming 
announcement to the python-dev 
<https://mail.python.org/archives/list/[email protected]/> list.

*Executive summary*

I have "discovered" a spectacular replacement for Untokenizer.untokenize in 
python's tokenize library module. The wretched, buggy, and 
impossible-to-fix add_whitespace method is gone. The new code has no 
significant 'if' statements, and knows almost nothing about tokens!  This 
is the way untokenize is written in The Book.

The new code should put an end to a long series of issues 
<https://bugs.python.org/issue?%40columns=id%2Cactivity%2Ctitle%2Ccreator%2Cassignee%2Cstatus%2Ctype&%40sort=-activity&%40filter=status&%40action=searchid&ignore=file%3Acontent&%40search_text=untokenize&submit=search&status=-1%2C1%2C2%2C3> 
against untokenize code in python's tokenize 
<https://github.com/python/cpython/blob/master/Lib/tokenize.py> library 
module.  Some closed issues were blunders arising from dumbing-down the 
TestRoundtrip.check_roundtrip method in test_tokenize.py 
<https://github.com/python/cpython/blob/master/Lib/test/test_tokenize.py>.  
The docstring says, in part:

When untokenize bugs are fixed, untokenize with 5-tuples should
reproduce code that does not contain a backslash continuation
following spaces.  A proper test should test this.

Imo, the way is now clear for proper unit testing of python's Untokenize 
class.

The new code passes all of python's related unit tests using a proper 
(rigorous) version of check_roundtrip. The new code also passes a new unit 
test for python issue 38663 <https://bugs.python.org/issue38663>, which 
fails with python's library code, even with the fudged version of 
check_roundtrip. 
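
A rigorous roundtrip check, in the spirit of what I mean by "proper", is 
just the following sketch (my own throw-away code, not the stdlib's test; 
check_roundtrip_exact is a name I made up):

import io
import tokenize

def check_roundtrip_exact(code):
    # Tokenize, untokenize with full 5-tuples, and demand *exact* equality.
    tokens = list(tokenize.tokenize(io.BytesIO(code.encode('utf-8')).readline))
    result = tokenize.untokenize(tokens)
    if isinstance(result, bytes):
        # untokenize returns bytes when an ENCODING token is present.
        result = result.decode('utf-8')
    assert result == code, (result, code)

The stdlib's check_roundtrip deliberately weakens this test for some 
inputs; the new code is intended to satisfy the exact version.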

This is a long post.  I recommend skimming it, unless you are a python dev 
interested in understanding why the new code works.

This post also tells how I "discovered" the new code.  It is mainly of 
historical interest.

*Background*

This post 
<https://groups.google.com/d/msg/leo-editor/aivhFnXW85Q/b2a8GHvEDwAJ> 
discusses in detail why tokenize.untokenize is important to me. To 
summarize, a simple, correct untokenizer would form the foundation of 
token-based tools.

I have written token-based versions of python code beautifiers.  The 
present work started because I wanted to write a token-based version of fstringify, 
which at present doesn't do anything on much of Leo's code base. Based on 
deep study of both black and fstringify, it is my strong opinion that 
python devs underestimate the difficulties of using ast's (parse trees) and 
overestimate the difficulties of using tokens.

At present, Leo's NullTokenBeautifier class (in leoBeautify.py 
<https://github.com/leo-editor/leo-editor/blob/beautify2/leo/core/leoBeautify.py> 
in the beautify2 branch 
<https://github.com/leo-editor/leo-editor/tree/beautify2>) uses a *lightly 
modified* version of the original untokenize as the basis of 
NullTokenBeautifier.make_input_tokens. It is far from fun, elegant, simple 
or clear :-)  I'll soon rewrite make_input_tokens using the new untokenize 
code.

*First principles*

1. Code overrides documentation.

Neither the Lexical Analysis section of the Python Language Reference 
<https://docs.python.org/3/reference/lexical_analysis.html> nor the docs 
for the tokenize module <https://docs.python.org/3/library/tokenize.html> 
were useful.

2. Don't believe code comments.

The module-level docstring for tokenize.py says: "tokenize(readline)...is 
designed to match the working of the Python tokenizer exactly, except that 
it produces COMMENT tokens for comments and gives type OP for all 
operators." To my knowledge, this assertion is nowhere justified, much less 
proven.  Let's hope I am missing something.

So tokenize.py is the ground truth, not tokenizer.c 
<https://github.com/python/cpython/blob/3.6/Parser/tokenizer.c>, and 
certainly not any document.


*Breakthrough: understanding row numbers and physical lines*

The breakthroughs came from reading tokenize.py, and in particular the 
_tokenize function.

I was trying to figure out just what the heck row numbers are.  Are they 
1-based?  To what, exactly, do they refer? Neither the docs nor the 
docstrings are of any help at all.  Yeah, I know they are indices.  
Carefully documenting that fact in the module's docstring is unhelpful :-)

Imagine my surprise when I discovered that the lnum var is set only once, 
at the start of the main loop.  This means that nothing fancy is going on.  
*Row numbers are simply indices into code.splitlines(True)!*

In other words: *the elements of code.splitlines(True) are the so-called 
physical lines mentioned in the Language Reference.*
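
Here is a tiny sanity check of that claim (my own throw-away snippet, not 
from the library): for every token that has a physical line, start[0] - 1 
indexes that line in code.splitlines(True).

import io
import tokenize

code = "a = 1\nb = 2\n"
lines = code.splitlines(True)
for tok in tokenize.tokenize(io.BytesIO(code.encode('utf-8')).readline):
    row = tok.start[0]
    # Row 0 (the ENCODING token) and the ENDMARKER have no physical line.
    if 0 < row <= len(lines):
        assert tok.line == lines[row - 1]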

*Collapsing untokenize*

Armed with this new understanding, I wrote a dump_range function.  This was 
supposed to recover token text from 5-tuples. It is hardly elegant, but 
unlike add_whitespace it actually has a chance of working:

def dump_range(contents, start, end):
    """Return the text of contents spanned by the start/end (row, col) 2-tuples."""
    lines = contents.splitlines(True)
    result = []
    s_row, s_col = start
    e_row, e_col = end
    if s_row == e_row == 0:
        # Position (0, 0) has no source text (e.g. the ENCODING token).
        return ''
    if s_row > len(lines):
        # The token starts beyond the last physical line.
        return ''
    col1 = s_col
    row = s_row
    if s_row == e_row:
        # The common case: the token lies within a single physical line.
        line = lines[row-1]
        return line[col1:e_col]
    # More than one line.
    while row <= e_row:
        line = lines[row-1]
        col2 = e_col if row == e_row else len(line)
        part = line[col1:col2]
        result.append(part)
        col1 = 0
        row += 1
    return ''.join(result)
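
For instance (my own quick check, using the positions tokenize assigns to 
this source):

contents = "a = 1\nb = 2\n"
# The NUMBER token '2' has start (2, 4) and end (2, 5).
assert dump_range(contents, (2, 4), (2, 5)) == '2'
# A span crossing a line boundary is recovered as well.
assert dump_range(contents, (1, 4), (2, 1)) == '1\nb'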

This code may be buggy, but that's moot because...

*Breakthrough: scanning is never needed*

At some point I realized that the code above is needlessly complex.  If row 
numbers are indices into contents.splitlines, then we can convert 
row/column numbers directly to character offsets!  All we need is an array 
of the offsets of the start of each row in contents.splitlines.

Traces had shown me that row zero is a special case, and that the first 
"real" token might have a row number of 1, not zero.  We can handle that 
without effort by pretending that line zero exists but has zero length.  
This resolves the confusion about indexing: rows zero and one both map to 
character offset zero.

So the code to compute the offsets is:

# Create the physical lines.
self.lines = self.contents.splitlines(True)
# Create the list of character offsets of the start of each physical line.
last_offset, self.offsets = 0, [0]
for line in self.lines:
    last_offset += len(line)
    self.offsets.append(last_offset)
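
For instance (my own worked example, not part of the class):

contents = "a = 1\nb = 2\n"
lines = contents.splitlines(True)        # ['a = 1\n', 'b = 2\n']
last_offset, offsets = 0, [0]
for line in lines:
    last_offset += len(line)
    offsets.append(last_offset)
assert offsets == [0, 6, 12]
# A token starting at row 2, col 4 begins at character offset 10, i.e. '2'.
assert offsets[2 - 1] + 4 == 10 and contents[10] == '2'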

Given this offset array, it is *trivial* to discover the actual text of a 
token, and any between-token whitespace:

# Unpack..
tok_type, val, start, end, line = token
s_row, s_col = start
e_row, e_col = end
kind = token_module.tok_name[tok_type].lower()
# Calculate the token's start/end offsets: character offsets into contents.
s_offset = self.offsets[max(0, s_row-1)] + s_col
e_offset = self.offsets[max(0, e_row-1)] + e_col
# Add any preceding between-token whitespace.
ws = self.contents[self.prev_offset:s_offset]
if ws:
    self.results.append(ws)
# Add the token, if it contributes any real text.
tok_s = self.contents[s_offset:e_offset]
if tok_s:
    self.results.append(tok_s)
# Update the ending offset.
self.prev_offset = e_offset
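
For example (my own tiny illustration): with contents = "x  =  1\n", the 
NAME token 'x' spans offsets 0..1 and the OP token '=' spans offsets 3..4, 
so the whitespace appended before '=' is contents[1:3], exactly the two 
spaces that add_whitespace had to reconstruct by hand.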

The Post Script shows the complete code, giving the context for the snippet 
above.

*Summary*

The new untokenize code is elegant, fast, sound, and easy to understand.

The code knows nothing about tokens themselves, only about token indices :-)

The way is now clear for proper unit testing of python's Untokenize class.

Edward

P. S. Here is Leo's present Untokenize class, in leoBeautify.py 
<https://github.com/leo-editor/leo-editor/blob/beautify2/leo/core/leoBeautify.py>:

class Untokenize:
    
    def __init__(self, contents, trace=False):
        self.contents = contents # A unicode string.
        self.trace = trace
    
    def untokenize(self, tokens):

        # Create the physical lines.
        self.lines = self.contents.splitlines(True)
        # Create the list of character offsets of the start of each physical line.
        last_offset, self.offsets = 0, [0]
        for line in self.lines:
            last_offset += len(line)
            self.offsets.append(last_offset)
        # Trace lines & offsets.
        self.show_header()
        # Handle each token, appending tokens and between-token whitespace to results.
        self.prev_offset, self.results = -1, []
        for token in tokens:
            self.do_token(token)
        # Print results when tracing.
        self.show_results()
        # Return the concatenated results.
        return ''.join(self.results)

    def do_token(self, token):
        """Handle the given token, including between-token whitespace"""

        def show_tuple(aTuple):
            s = f"{aTuple[0]}..{aTuple[1]}"
            return f"{s:8}"

        # Unpack..
        tok_type, val, start, end, line = token
        s_row, s_col = start
        e_row, e_col = end
        kind = token_module.tok_name[tok_type].lower()
        # Calculate the token's start/end offsets: character offsets into contents.
        s_offset = self.offsets[max(0, s_row-1)] + s_col
        e_offset = self.offsets[max(0, e_row-1)] + e_col
        # Add any preceding between-token whitespace.
        ws = self.contents[self.prev_offset:s_offset]
        if ws:
            self.results.append(ws)
            if self.trace:
                print(
                    f"{'ws':>10} {ws!r:20} "
                    f"{show_tuple((self.prev_offset, s_offset)):>26} "
                    f"{ws!r}")
        # Add the token, if it contributes any real text.
        tok_s = self.contents[s_offset:e_offset]
        if tok_s:
            self.results.append(tok_s)
        if self.trace:
            print(
                f"{kind:>10} {val!r:20} "
                f"{show_tuple(start)} {show_tuple(end)} {show_tuple((s_offset, e_offset))} "
                f"{tok_s!r:15} {line!r}")
        # Update the ending offset.
        self.prev_offset = e_offset

Typical driver code (from within Leo) would be something like:

import io
import tokenize
import imp
import leo.core.leoBeautify as leoBeautify
imp.reload(leoBeautify)

contents = r'''print ( 'aa \
bb')
print('xx \
yy')
'''
tokens = tokenize.tokenize(io.BytesIO(contents.encode('utf-8')).readline)
results = leoBeautify.Untokenize(contents, trace=True).untokenize(tokens)
if results != contents:
    print('FAIL')

EKR
