I've attached the wrong version of the script. Here is the correct one.

On Friday, December 3, 2021 at 4:09:17 PM UTC+1 vitalije wrote:

>
>> Clearly, you think differently than I do, and that's a very good thing 
>> :-) I'd like to get a feel for exactly how the code works, by "seeing it in 
>> action", so to speak.  I'll then attempt a theory of operation for my own 
>> benefit.
>>
>>
> I think that is a good idea. I would like to explain a little more how my 
> code works to make that easier.
>
> The function find_node_borders uses the Python tokenizer to search for 
> suitable node starting lines (code lines starting with a token.NAME whose 
> string value is either def or class). For each such line it appends a list 
> [start_row, start_col, None, line_txt] to the list of possible results. The 
> third element, which is None at this point, will later be set to the 
> end line of the definition. The end of a definition can be detected in 
> the generated tokens in two different cases. The first case is a 
> non-comment Python statement following the last line. In that case, 
> tokenize will emit a token.DEDENT token. However, unindented comment 
> lines do not trigger this DEDENT token, so we also need to check for 
> COMMENT tokens whose starting column is less than that of the body of the 
> current definition. To know the current indentation, we also need to look 
> for token.INDENT tokens.
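>
> You can watch this behavior directly in Python 3.8+ (where token.COMMENT 
> exists, which the attached script also assumes). The snippet below is only 
> an illustration: note that the unindented comment arrives as a COMMENT 
> token at column 0, while the DEDENT is deferred to the next real statement:
>
> import io
> import token
> import tokenize
>
> SNIPPET = (
>     "def f():\n"
>     "    pass\n"
>     "# unindented comment\n"
>     "x = 1\n"
> )
> for tok in tokenize.generate_tokens(io.StringIO(SNIPPET).readline):
>     if tok[0] in (token.INDENT, token.DEDENT, token.COMMENT, token.NAME):
>         print(token.tok_name[tok[0]], tok[2], repr(tok[1]))
>
> # Output:
> # NAME (1, 0) 'def'
> # NAME (1, 4) 'f'
> # INDENT (2, 0) '    '
> # NAME (2, 4) 'pass'
> # COMMENT (3, 0) '# unindented comment'
> # DEDENT (4, 0) ''
> # NAME (4, 0) 'x'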
>
> So, the only tokens that we are interested in are (a minimal sketch of the 
> scanning loop follows the list):
>
>    1. token.INDENT, for keeping track of lastindent.
>    2. token.DEDENT, and token.COMMENT whose starting column is less than 
>    lastindent. Whenever we encounter this case, we need to close all open 
>    definitions whose level of indentation is greater than the starting 
>    column of the found token (DEDENT or COMMENT).
>    3. token.NAME with a string value in ('def', 'class'). Whenever we 
>    encounter this case, we add another open definition both to the list of 
>    results and to the list of open definitions at this level of 
>    indentation.
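>
> Here is a minimal, runnable sketch of that scanning loop (before the 
> column-zero simplification described below). The bookkeeping here is my 
> own, for illustration; the attached script differs in detail:
>
> import io
> import token
> import tokenize
>
> def scan(txt):
>     '''Return [start_row, end_row, header] for every def/class in txt.'''
>     lastindent = 0
>     open_defs, results = [], []
>     for tok in tokenize.generate_tokens(io.StringIO(txt).readline):
>         row, col = tok[2]
>         if tok[0] == token.INDENT:  # 1. track the current indentation
>             lastindent = len(tok[1])
>         elif tok[0] == token.DEDENT or (
>                 tok[0] == token.COMMENT and col < lastindent):
>             # 2. close every definition starting at or beyond this column
>             while open_defs and open_defs[-1][3] >= col:
>                 open_defs.pop()[1] = row
>         elif tok[0] == token.NAME and tok[1] in ('def', 'class'):
>             d = [row, None, tok[-1].strip(), col]  # 3. open a definition
>             open_defs.append(d)
>             results.append(d)
>     return [d[:3] for d in results]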
>
> After passing through all tokens we have now closed all open definitions, 
> and our resulting list contains each definition in the file with its 
> starting (row, col) position and also with the line number where it ends. 
> Finally, we filter the resulting list to exclude all definitions at a 
> deeper level and keep only those which start at zero indentation. As I 
> write this, I've just realized that I can substantially simplify the script 
> by ignoring those deeper definitions in the previous phase.
>
> Yes, I have further simplified and sped up the code by ignoring all tokens 
> that do not start at column zero. Instead of keeping a list of open 
> definitions, we now have just one open definition (the last one).
>
> Finally, we can start emitting the nodes. We start with the [1, 1, ''] 
> node. This is the block of lines that comes before the first definition. 
> While walking through the list of definitions, we check each time whether 
> the end of the last node in the list is less than the start line of this 
> node. In that case we have to insert a '...some declarations' block which 
> starts where the last node ends and ends where this node starts. At the end 
> we add [end_of_last_node, None, ''] to our final list of nodes. These are 
> the lines of the root node which come after 'at-others'.
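>
> A worked example of this emission step (the row numbers are made up for 
> illustration; end rows are exclusive):
>
> defs = [[4, 10, 'def spam():'], [14, 20, 'class Eggs:']]
>
> nodes = [[1, 1, '']]  # the block before the first definition
> for a, b, h in defs:
>     b1 = nodes[-1][1]
>     if a > b1:  # a gap: comments or declarations between definitions
>         nodes.append([b1, a, '...some declarations'])
>     nodes.append([a, b, h])
> nodes.append([nodes[-1][1], None, ''])  # root lines after at-others
>
> print(nodes)
> # [[1, 1, ''], [1, 4, '...some declarations'], [4, 10, 'def spam():'],
> #  [10, 14, '...some declarations'], [14, 20, 'class Eggs:'],
> #  [20, None, '']]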
>
> There are two functions that use the result of find_node_borders: 
> split_root and split_class. The first one deals with the root node and 
> generates a child of the root node for each node in the list. It also 
> sets the body of the root node. The other one, split_class, is a very 
> similar function. It takes the body of the class definition without the 
> actual class line, cuts as much white space from the beginning of each line 
> as possible, and uses this dedented text to calculate the borders of its 
> methods. Then it does almost the same with the resulting list of nodes as 
> split_root does, except that it adds indentation to at-others and also 
> restores the first line (the class line).
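>
> The dedent step of split_class in isolation, with made-up input (the real 
> function gets its lines from the node body):
>
> body_lines = [
>     "    def method(self):\n",
>     "        return 1\n",
> ]
> ind = min(len(x) - len(x.lstrip()) for x in body_lines if x.strip())
> dedented = [x[ind:] if len(x) > ind else x for x in body_lines]
> print(''.join(dedented), end='')
> # def method(self):
> #     return 1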
>
> Attached to this message is the simplified version of the script.
>
>

import io
import token
import tokenize

c.frame.log.clearTab('Log')
def split_root(root):
    '''
    Parses the text of the body and separates all
    top-level function and class definitions
    into separate nodes, which are all direct children of
    the root.

    In a second phase, this function can be called on
    each of the children with more than a certain threshold
    number of lines.
    '''
    def find_node_borders(txt):
        '''
        Returns a list of [startrow, endrow, headline] items
        for the direct children of the node.
        '''
        inp = io.StringIO(txt)
        tokens = list(tokenize.generate_tokens(inp.readline))
        res = []
        lastindent = 0
        open_definition = None
        for tok in tokens:
            row, col = tok[2]
            if col > 0: continue  # only column-zero tokens matter
            if tok[0] == token.INDENT:
                lastindent = len(tok[1])
                continue
            # A definition ends either at a DEDENT or at a column-zero
            # comment while we are still inside an indented body.
            case_1 = tok[0] == token.COMMENT and lastindent > 0
            case_2 = tok[0] == token.DEDENT
            if open_definition and (case_1 or case_2):
                open_definition[1] = row
            elif tok[0] == token.NAME and tok[1] in ('def', 'class'):
                open_definition = [row, None, tok[-1].strip()]
                res.append(open_definition)
        nodes = [[1, 1, '']]
        for a, b, x in res:
            b1 = nodes[-1][1]
            if a > b1:
                # there are some comments or declarations in between
                # these two nodes
                nodes.append([b1, a, '...some declarations'])
            nodes.append([a, b, make_headline(x)])
        nodes.append([nodes[-1][1], None, ''])
        return nodes
    def make_headline(line):
        # 'class Foo(Base):' -> 'Foo(Base)'; 'def bar(x):' -> 'bar'
        line = line.strip()
        if line.startswith('class '):
            return line[5:].strip()[:-1]
        else:
            return line[4:].partition('(')[0].strip()
    def rename(p):
        # Rename a '...some declarations' node whose body turns out to be
        # only a docstring or only comments.
        toks = [x for x in tokenize.generate_tokens(io.StringIO(p.b).readline)
                if x[0] not in (token.NEWLINE, token.NL, token.ENDMARKER)]
        if all(x[0] == token.STRING for x in toks):
            p.h = '__doc__'
        elif all(x[0] == token.COMMENT for x in toks):
            p.h = '...comments'

    def split_class(p):
        # Split a class node the same way, working on its dedented body.
        lines = p.b.splitlines(True)
        if len(lines) < 20: return  # too small to be worth splitting
        lws = [len(x) - len(x.lstrip()) for x in lines[1:] if x and not x.isspace()]
        ind = min(lws)
        def indent(x):
            return ' '*ind + x
        nlines = [x[ind:] if len(x) > ind else x for x in lines[1:]]
        txt = ''.join(nlines)
        nodes = find_node_borders(txt)
        a, b, h = nodes[0]
        def body(a, b):
            return ''.join(nlines[a-1:b and (b-1)])
        b1 = ''.join(lines[a:b]) + indent('@others\n')
        a, b, h = nodes.pop()
        b2 = ''.join(indent(x) for x in nlines[a-1:])
        p.b = f'{lines[0]}{b1}{b2}'
        for a, b, h in nodes[1:]:
            child = p.insertAsLastChild()
            child.h = h
            child.b = body(a, b)
            if h == '...some declarations': rename(child)
    root.deleteAllChildren()
    txt = root.b
    lines = txt.splitlines(True)
    def body(a, b):
        return ''.join(lines[a-1:b and (b-1)])
    nodes = find_node_borders(txt)
    a, b, h = nodes[0]
    root.b = f'{body(a, b)}@others\n{body(nodes[-1][0], None)}'
    for a, b, h in nodes[1:-1]:
        child = root.insertAsLastChild()
        child.h = h
        child.b = body(a, b)
        if child.b.startswith('class ') and (b - a) > 20:
            split_class(child)
        if h == '...some declarations': rename(child)
def import_py_file(p, fname):
    with open(fname, 'r') as inp:
        p.b = inp.read()
        split_root(p)
def import_one_level(fname):
    '''
    This function demonstrates the usage of the split_root function.
    It loads the given Python file into the test node and checks
    whether the import is perfect or not.
    '''
    with open(fname, 'r') as inp:
        txt = inp.read()
    root = ensure_root(p, 'py import test node')
    root.b = txt
    split_root(root)
    txt2 = g.getScript(c, root, useSentinels=False)
    if txt != txt2:
        g.es('different')
    else:
        g.es('same')
def ensure_root(p, name):
    '''
    This is just a utility for testing the script.
    If there is no node in the outline with the given name,
    this function will add a node after the current position
    and set its headline to the given name.
    '''
    ps = c.find_h(name)
    if not ps:
        p1 = p.insertAfter()
        p1.h = name
        return p1
    else:
        return ps[0]
# You can choose whatever module you like here,
# but for testing purposes we're
# going to import and parse difflib.py
# from the standard library.
import difflib as module
import_one_level(module.__file__)
c.redraw()
root = ensure_root(p, 'py import test node')
