There has been an amazing breakthrough in code analysis. Not sure where it will lead, or how much more work I shall do on this project.
This is an Engineering notebook post. You can safely ignore it. This is an "instant" history of work completed in the last few days. I am writing this now, before the details blurred. I'm pretty sure I have remembered the general timing of events correctly. On Wednesday, June3, I posted a "farewell" message on the static type checking group. But, that just got me thinking about code analysis again :-) Sigh. This post <https://groups.google.com/d/msg/python-static-type-checking/reP0Ok0a12U/rcFi8jzHKBEJ> discusses some intuitions. On Thursday, June 4, I reviewed my old code analysis code. I know it was Thursday because I was reviewed all of its problems while driving to a swim lesson. In short, the code was clumsy and inflexible. Sometime later, perhaps that night, I started revising one of my old tests. I stumbled across two major simplifications: 1. I revised the testing framework so that there would only ever be one "test". For backup, a copy of the testing framework, @@button show-data, is in the attic (leoNotes.txt). Originally, all the code was in a Test class defined as a child of the @button node. The code soon moved into the new leoCheck.py file. 2. As a prototype, I decided to use regular expressions to "parse" the source code. I had become aware of how fast regular expressions can be by studying Python's tokenizer code in Lib/tokenize.py. That's probably what had me choose re's for the prototype. Using regex's turned out to be a *stupendous *bit of good luck! My focus shifted from all the technical details of traversing parse trees to what I actually wanted to *see*, namely the relationships between *definitions *of functions/methods and all the *calls to* and *returns from* those routines. The prototype code created globals dictionaries of classes, defs, calls to defs and return statements. The show_results method created reports, organized by def name. The result was amazing! In just a few hours work I had a better picture of Leo's code than I had been able to produce in months of work with the old stc project. Here is just one example: printGc 2 definitions... leoGlobals.py: def printGc(tag=None): leoTest.py: def printGc(message=None): 1 call... leoTest.py::runGc: printGc(message=message) 2 returns... leoGlobals.py: return None leoTest.py: return delta The present code doesn't always find all calls to a function, partly because calls to F can have a prefix, like g.F. Calls to library functions do not show up in such reports. Instead, they are listed separately. For example, here is the entry for the currentItem Qt method: currentItem 4 calls... LeoQtTree::getCurrentItem: w.currentItem() LeoQListWidget::get_selection: self.currentItem() LeoQListWidget::select_callback: self.currentItem() LeoQListWidget::tab_callback: self.currentItem() And here is the decode function from the standard library: decode 4 calls... ZimImportController::parseZimIndex: urllib.unquote(result[2]).decode( 'utf-8') leoGlobals.py::toUnicode: s.decode(encoding,'strict') leoGlobals.py::toUnicode: s.decode(encoding,'replace') ZimImportController::parseZimIndex: result[1].decode('utf-8') By Friday afternoon I was in a state of shock. I kept shaking my head, telling Rebecca how I couldn't believe how lucky I was to have stumbled upon using regex's. I kept saying that I could *never *have created these reports had I been focused on parse trees. *Getting real* The simple regex's I used do not handle strings or comments properly. My first thought was that perhaps the tokenizing beautifier could "parse" the code easily. Friday evening I played around with this approach, but I quickly realized it was going nowhere. Here is the checking log for 72658e8, Friday, June 05, 2015 10:24:06 PM: QQQ Added an alternative tokenizing implementation, but...Wow: the prototype using simple line-oriented regex's and string find is MUCH simpler and faster than using tokenizing code. QQQ Early (1 a.m.) Saturday morning, thoughts about translating the regex-based prototype to an parse-tree based code did not allow me to sleep. *Using the prototype as a guide path*, it took about 6 hours to create the present code. Yes, in theory, I could have written the Ast traversers without the prototype, but in practice I would *never *have thought to do things this way. I have found a cool new pattern by pure luck. Here is the Ast Visitor for FunctionDef nodes in the ShowDataTraverser class: # FunctionDef(identifier name, arguments args, stmt* body, expr* decorator_list) def do_FunctionDef(self, node): # Format. args = self.formatter.format(node.args) if node.args else '' s = 'def %s(%s):' % (node.name, args) if self.trace: g.trace(s) # Enter the new context. context_tuple = self.fn, 'def', s self.context_stack.append(context_tuple) # Update controller data. def_tuple = self.context_stack[: -1], s aList = self.controller.defs_d.get(node.name, []) aList.append(def_tuple) self.controller.defs_d[node.name] = aList # Visit. for z in node.decorator_list: self.visit(z) self.visit(node.args) for z in node.body: self.visit(z) # Leave the context. self.context_stack.pop() This is the simplest possible code! Instead of dealing with Ast nodes, this code using only strings. s describes the function. The context stack contains strings describing enclosing def and class nodes. Instead of injecting data into Ast nodes, the code creates entries in global dictionaries in the "controller" object, in this case, self.controller.defs_d. *Speed* The ShowDataTraverser class is a subclass of the AstFullTraverser class in leoAst.py. All visitors must explicitly visit subnodes. As a result, the visit method collapses: def visit(self, node): ''' Visit a *single* ast node. Visitors are responsible for visiting children! ''' method = getattr(self, 'do_' + node.__class__.__name__) method(node) Clearly, this is the fastest possible way to traverse parse trees. Traversing parse trees is significantly faster than using tokens, and is almost as fast as using improper regex's: Parse trees: 3.8 sec to generate all reports. Tokens: 3.3 sec *just *for tokenizing. Regex: 2.2 sec to generate all reports. Here are the figures for the latest run over all of Leo's core files: files: 69 lines: 86,312 chars: 3,534,400 classes: 210 defs: 3,994 calls: 6,682 undefined calls: 2,090 returns: 2,179 run done: 3.8 sec. The statistic for undefined calls is probably inaccurate. *Further work* The present project has already demonstrated, for the first time, that my intuitions about code are reasonable. The reports gather all data about a routine in one place. In most cases, the names of *def *args correspond in an obvious way to the names of *call *args. Imo, this what allows us programmers to have confidence in our code. We certainly do *not *do type analysis! That is, the names of functions and ivars often indicates their types types. Yes, the names can be misleading, which is why pylint is useful ;-) Using *only *strings can create problems. One idea is to use the Ast nodes (or their id's) as dictionary keys. Easy: do_FunctionDef will just set node_tuple to def_tuple = node, self.context_stack[: -1], s It would be possible to add information about assignments to vars and ivars merely by adding Ast visitors for assignments. That way might lead to full type checking. Not sure I'm going to do that. There are a number of bugs in the reports. I'll look into them. Furthermore, it's tempting to reorganize dictionaries in new ways. *Summary* It is almost infinitely easier to deal with Python dictionaries than to deal with Ast trees! The prototype created a new pattern. Node visitors have no smarts. They simply create entries in global dictionaries. Later code can *analyze the dictionaries*. Enough for now. I think I've described the big picture well enough. Edward P. S. The prototype code is no longer in leoCheck.py. It's in the attic, in the node: A great prototype: scan (inaccurate) & helpers EKR -- You received this message because you are subscribed to the Google Groups "leo-editor" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/leo-editor. For more options, visit https://groups.google.com/d/optout.
