In another thread, http://groups.google.com/group/leo-editor/browse_thread/thread/55469ccb8c6dbb89 there has been discussion about AST trees as they might relate to my lint-oriented research.
I'd like to say a few words about the research here. Sure, it belong in the new-lint group, but there are many more eyes here. Please stop reading now if you are not interested in speculative, OT, posts. Topic 1: Parsing To state my conclusion first: this is not a big deal. Using Python's AST module works well enough, though I've played around with other schemes. QQQQQ >>> srcml seems like something more relevant to your "pylint" research. >> >> In what way? > > It's about annotating source code with semantic meaning. You can > handle srcML like AST tree I guess. > > Not that it would make things easier than just dealing with AST. QQQQQ The essence of the problem is that there really is no one "right" and "natural" representation of (Python) programs that is right for all purposes, or even completely right for just *one* purpose. After quite a bit of searching, I finally stumbled upon this excellent discussion of writing 2to3 fixers: http://python3porting.com/fixers.html For me, this paper was worth careful study. Reading between the lines a bit, one can glimpse the essential difficulties. For some purposes, one wants to ignore whitespace; for other purposes, one wants to mostly ignore program (parse) structure. You will see this tension when you study the examples in detail. Topic 2: Data structures Again, to state my conclusions first, nothing difficult needs to be done, although there are room for data-related simplifications. Indeed, the present code creates a symbol table for every context of the program. The code is a straightforward traversal of the AST trees given by the ast module. Contexts includes modules, classes, functions, methods, lambda and with statements, but only modules, classes and functions have non-local interest, so we can mostly ignore lambda and with statements. Each context contains symbol table entries for the vars/ivars (including classes, functions) defined in the context. It is straightforward to add other information to contexts: for example a list of assignment statements contained in the context. *Important*: creating these structures takes 7-8 seconds in total for the 32 files in Leo's core (including all the files of the qt gui). Thus, there is no need to cache data from one run to the next, a huge simplification. Topic 3: Type determination The essence of the "lint" problem is simply to determine the type (or set of types) for each var in every context. The present pylint program proves that this can be done for a large class of programs. The problem is merely that pylint is a) slow and b) difficult to extend to other kinds of analysis. There are known (very slow) algorithms for determining types in general. The problem with such algorithms is, in essence, that vars may take on more than one type. I call this the **splitting problem**. Every split var non-linearly increases the computation (and re-computation) required. This is why general methods fail: they are too slow to be useful. General methods can fail for other reasons, namely that it can be difficult if not impossible to determine whether all types of an object have, in fact, been computed. I'll address this kind of failure next. Topic 4: Cooperation between programmer and tools The problem with general lint-like approaches is that they provide no way for the programmer to assert what is supposed to be known about the program. This makes the job of the lint tool much harder than it needs to be. Judging by comments made about lint-like tools, one would imaging that programmers (that is, program designers) really are guessing about the types of their objects. That simply can not be true. If it were true it would be impossible to modify a large program like Leo safely. Instead, programmers mostly *do* know the types of objects. We see this throughout Leo's source code in the form of naming conventions: c, f, p, v, d, etc. and in the common kinds of chains: c.frame.body.bodyCtrl.widget. Thus, the designer's task is to *reduce splitting* by asserting the types of objects; the tool's task is to check those assertions. It's really that simple. Failed assertions are useful. They can indicate bugs (like simple misspellings), but they can also indicate where code is unnecessarily complex. For example, there are several areas in the qt gui code where I myself have difficulty understanding the types of vars. Let us be perfectly clear. This is not a benign feature of the Python language-- it is instead a defect in the design of the code. Nothing is gained by this uncertainty: the confusion makes it *more* difficult to extend and generalize the code, not less. In short, I am not interested in general-purpose type computation algorithms. I am interested in a partnership between me an my tools. This partnership can "cut the Gordian Knot" of type splitting by simply declaring that certain vars must have certain types. The tool check this assertion (a kind of induction) and reports where it fails. Topic 5: global analysis There is a tension between local and non-local analysis. Theoretically, any var in any module, class, method and function may be created, set or destroyed anywhere in the program, via assignments of the form: a.b.c.d = x where x is an arbitrary expression. This is assignment to **d's** context, not a's. Naturally, it depends on the type(s) of a in the *present* context, the type of b in a's context, etc. This kind of assignment is the reason why type computations can be so expensive: the computation is iterative, and when each new type assigned to a var splits the computation into more computations. Practically, however, we expect such assignments to be rare: most assignments are, and should be, local assignments. Furthermore, non- local assignments are obvious candidates for type assertions. Lint can check these assertions easily (by induction), and they reduce splitting by fiat. Edward -- You received this message because you are subscribed to the Google Groups "leo-editor" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/leo-editor?hl=en.
