ENB: Instant history of a code checking breakthrough

Edward K. Ream Sun, 07 Jun 2015 05:03:08 -0700

There has been an amazing breakthrough in code analysis.  Not sure where it 
will lead, or how much more work I shall do on this project.


This is an Engineering notebook post.  You can safely ignore it.

This is an "instant" history of work completed in the last few days.  I am 
writing this now, before the details blurred. I'm pretty sure I have 
remembered the general timing of events correctly.

On Wednesday, June3, I posted a "farewell" message on the static type 
checking group.  But, that just got me thinking about code analysis again 
:-) Sigh.  This post 
<https://groups.google.com/d/msg/python-static-type-checking/reP0Ok0a12U/rcFi8jzHKBEJ>
 
discusses some intuitions.

On Thursday, June 4, I reviewed my old code analysis code.  I know it was 
Thursday because I was reviewed all of its problems while driving to a swim 
lesson.  In short, the code was clumsy and inflexible.

Sometime later, perhaps that night, I started revising one of my old tests. 
I stumbled across two major simplifications:

1. I revised the testing framework so that there would only ever be one 
"test".

For backup, a copy of the testing framework, @@button show-data, is in the 
attic (leoNotes.txt).  Originally, all the code was in a Test class defined 
as a child of the @button node.  The code soon moved into the new 
leoCheck.py file.

2. As a prototype, I decided to use regular expressions to "parse" the 
source code.

I had become aware of how fast regular expressions can be by studying 
Python's tokenizer code in Lib/tokenize.py.  That's probably what had me 
choose re's for the prototype.

Using regex's turned out to be a *stupendous *bit of good luck! My focus 
shifted from all the technical details of traversing parse trees to what I 
actually wanted to *see*, namely the relationships between *definitions *of 
functions/methods and all the *calls to* and *returns from* those routines.

The prototype code created globals dictionaries of classes, defs, calls to 
defs and return statements.  The show_results method created reports, 
organized by def name.

The result was amazing!  In just a few hours work I had a better picture of 
Leo's code than I had been able to produce in months of work with the old 
stc project.  Here is just one example:

printGc
    2 definitions...
        leoGlobals.py: def printGc(tag=None):
           leoTest.py: def printGc(message=None):

    1 call...
        leoTest.py::runGc: printGc(message=message)

    2 returns...
        leoGlobals.py: return None
           leoTest.py: return delta

The present code doesn't always find all calls to a function, partly 
because calls to F can have a prefix, like g.F.

Calls to library functions do not show up in such reports.  Instead, they 
are listed separately.  For example, here is the entry for the currentItem 
Qt method:

currentItem 4 calls...
          LeoQtTree::getCurrentItem: w.currentItem()
      LeoQListWidget::get_selection: self.currentItem()
    LeoQListWidget::select_callback: self.currentItem()
       LeoQListWidget::tab_callback: self.currentItem()

And here is the decode function from the standard library:

decode 4 calls...
    ZimImportController::parseZimIndex: urllib.unquote(result[2]).decode(
'utf-8')
              leoGlobals.py::toUnicode: s.decode(encoding,'strict')
              leoGlobals.py::toUnicode: s.decode(encoding,'replace')
    ZimImportController::parseZimIndex: result[1].decode('utf-8')


By Friday afternoon I was in a state of shock. I kept shaking my head, 
telling Rebecca how I couldn't believe how lucky I was to have stumbled 
upon using regex's.  I kept saying that I could *never *have created these 
reports had I been focused on parse trees.

*Getting real*

The simple regex's I used do not handle strings or comments properly.  My 
first thought was that perhaps the tokenizing beautifier could "parse" the 
code easily. Friday evening I played around with this approach, but I 
quickly realized it was going nowhere.

Here is the checking log for 72658e8, Friday, June 05, 2015 10:24:06 PM:

QQQ
Added an alternative tokenizing implementation, but...Wow: the prototype 
using simple line-oriented regex's and string find is MUCH simpler and 
faster than using tokenizing code.
QQQ

Early (1 a.m.) Saturday morning, thoughts about translating the regex-based 
prototype to an parse-tree based code did not allow me to sleep.  *Using 
the prototype as a guide path*, it took about 6 hours to create the present 
code.

Yes, in theory, I could have written the Ast traversers without the 
prototype, but in practice I would *never *have thought to do things this 
way.  I have found a cool new pattern by pure luck.

Here is the Ast Visitor for FunctionDef nodes in the ShowDataTraverser 
class:

# FunctionDef(identifier name, arguments args, stmt* body, expr* 
decorator_list)

def do_FunctionDef(self, node):
    # Format.
    args = self.formatter.format(node.args) if node.args else ''
    s = 'def %s(%s):' % (node.name, args)
    if self.trace: g.trace(s)
    # Enter the new context.
    context_tuple = self.fn, 'def', s
    self.context_stack.append(context_tuple)
    # Update controller data.
    def_tuple = self.context_stack[: -1], s
    aList = self.controller.defs_d.get(node.name, [])
    aList.append(def_tuple)
    self.controller.defs_d[node.name] = aList
    # Visit.
    for z in node.decorator_list:
        self.visit(z)
    self.visit(node.args)
    for z in node.body:
        self.visit(z)
    # Leave the context.
    self.context_stack.pop()

This is the simplest possible code! Instead of dealing with Ast nodes, this 
code using only strings.  s describes the function. The context stack 
contains strings describing enclosing def and class nodes.

Instead of injecting data into Ast nodes, the code creates entries in 
global dictionaries in the "controller" object, in this case, 
self.controller.defs_d.

*Speed*

The ShowDataTraverser class is a subclass of the AstFullTraverser class in 
leoAst.py.  All visitors must explicitly visit subnodes.  As a result, the 
visit method collapses:

def visit(self, node):
    '''
    Visit a *single* ast node.
    Visitors are responsible for visiting children!
    '''
    method = getattr(self, 'do_' + node.__class__.__name__)
    method(node)

Clearly, this is the fastest possible way to traverse parse trees.

Traversing parse trees is significantly faster than using tokens, and is 
almost as fast as using improper regex's:

Parse trees: 3.8 sec to generate all reports.
Tokens: 3.3 sec *just *for tokenizing.
Regex: 2.2 sec to generate all reports.

Here are the figures for the latest run over all of Leo's core files:

files: 69 lines: 86,312 chars: 3,534,400 classes: 210
defs: 3,994 calls: 6,682 undefined calls: 2,090 returns: 2,179
run done:  3.8 sec.

The statistic for undefined calls is probably inaccurate.

*Further work*

The present project has already demonstrated, for the first time, that my 
intuitions about code are reasonable.

The reports gather all data about a routine in one place.  In most cases, 
the names of *def *args correspond in an obvious way to the names of *call 
*args. 
Imo, this what allows us programmers to have confidence in our code.  We 
certainly do *not *do type analysis!

That is, the names of functions and ivars often indicates their types 
types.  Yes, the names can be misleading, which is why pylint is useful ;-)

Using *only *strings can create problems.  One idea is to use the Ast nodes 
(or their id's) as dictionary keys.  Easy: do_FunctionDef will just set 
node_tuple to def_tuple = node, self.context_stack[: -1], s

It would be possible to add information about assignments to vars and ivars 
merely by adding Ast visitors for assignments.  That way might lead to full 
type checking.  Not sure I'm going to do that.

There are a number of bugs in the reports.  I'll look into them. 
Furthermore, it's tempting to reorganize dictionaries in new ways.

*Summary*

It is almost infinitely easier to deal with Python dictionaries than to 
deal with Ast trees!

The prototype created a new pattern.  Node visitors have no smarts.  They 
simply create entries in global dictionaries.  Later code can *analyze the 
dictionaries*.

Enough for now.  I think I've described the big picture well enough.

Edward

P. S. The prototype code is no longer in leoCheck.py.  It's in the attic, 
in the node:

    A great prototype: scan (inaccurate) & helpers

EKR

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/leo-editor.
For more options, visit https://groups.google.com/d/optout.

ENB: Instant history of a code checking breakthrough

Reply via email to