OT: Some thoughts about lint-like research

Edward K. Ream Fri, 05 Aug 2011 08:18:20 -0700

In another thread, 
http://groups.google.com/group/leo-editor/browse_thread/thread/55469ccb8c6dbb89
there has been discussion about AST trees as they might relate to my
lint-oriented research.


I'd like to say a few words about the research here.  Sure, it belong
in the new-lint group, but there are many more eyes here.

Please stop reading now if you are not interested in speculative, OT,
posts.

Topic 1:  Parsing

To state my conclusion first: this is not a big deal.  Using Python's
AST module works well enough, though I've played around with other
schemes.

QQQQQ
>>> srcml seems like something more relevant to your "pylint" research.
>>
>> In what way?
>
> It's about annotating source code with semantic meaning. You can
> handle srcML like AST tree I guess.
>
> Not that it would make things easier than just dealing with AST.
QQQQQ

The essence of the problem is that there really is no one "right" and
"natural" representation of (Python) programs that is right for all
purposes, or even completely right for just *one* purpose.

After quite a bit of searching, I finally stumbled upon this excellent
discussion of writing 2to3 fixers:

http://python3porting.com/fixers.html

For me, this paper was worth careful study.

Reading between the lines a bit, one can glimpse the essential
difficulties.  For some purposes, one wants to ignore whitespace; for
other purposes, one wants to mostly ignore program (parse) structure.
You will see this tension when you study the examples in detail.

Topic 2: Data structures

Again, to state my conclusions first, nothing difficult needs to be
done, although there are room for data-related simplifications.

Indeed, the present code creates a symbol table for every context of
the program.  The code is a straightforward traversal of the AST trees
given by the ast module.  Contexts includes modules, classes,
functions, methods, lambda and with statements, but only modules,
classes and functions have non-local interest, so we can mostly ignore
lambda and with statements.

Each context contains symbol table entries for the vars/ivars
(including classes, functions) defined in the context.  It is
straightforward to add other information to contexts: for example a
list of assignment statements contained in the context.

*Important*: creating these structures takes 7-8 seconds in total for
the 32 files in Leo's core (including all the files of the qt gui).
Thus, there is no need to cache data from one run to the next, a huge
simplification.

Topic 3: Type determination

The essence of the "lint" problem is simply to determine the type (or
set of types) for each var in every context.

The present pylint program proves that this can be done for a large
class of programs.  The problem is merely that pylint is a) slow and
b) difficult to extend to other kinds of analysis.

There are known (very slow) algorithms for determining types in
general.  The problem with such algorithms is, in essence, that vars
may take on more than one type.  I call this the **splitting
problem**.   Every split var non-linearly increases the computation
(and re-computation) required.  This is why general methods fail: they
are too slow to be useful.

General methods can fail for other reasons, namely that it can be
difficult if not impossible to determine whether all types of an
object have, in fact, been computed.   I'll address this kind of
failure next.

Topic 4: Cooperation between programmer and tools

The problem with general lint-like approaches is that they provide no
way for the programmer to assert what is supposed to be known about
the program.  This makes the job of the lint tool much harder than it
needs to be.

Judging by comments made about lint-like tools, one would imaging that
programmers (that is, program designers) really are guessing about the
types of their objects.  That simply can not be true.  If it were true
it would be impossible to modify a large program like Leo safely.

Instead, programmers mostly *do* know the types of objects.  We see
this throughout Leo's source code in the form of naming conventions:
c, f, p, v, d, etc. and in the common kinds of chains:
c.frame.body.bodyCtrl.widget.

Thus, the designer's task is to *reduce splitting* by asserting the
types of objects;  the tool's task is to check those assertions.  It's
really that simple.

Failed assertions are useful.  They can indicate bugs (like simple
misspellings), but they can also indicate where code is unnecessarily
complex.

For example,  there are several areas in the qt gui code where I
myself have difficulty understanding the types of vars.  Let us be
perfectly clear.  This is not a benign feature of the Python language--
it is instead a defect in the design of the code.  Nothing is gained
by this uncertainty: the confusion makes it *more* difficult to extend
and generalize the code, not less.

In short, I am not interested in general-purpose type computation
algorithms.  I am interested in a partnership between me an my tools.
This partnership can "cut the Gordian Knot" of type splitting by
simply declaring that certain vars must have certain types.  The tool
check this assertion (a kind of induction) and reports where it fails.

Topic 5: global analysis

There is a tension between local and non-local analysis.
Theoretically, any var in any module, class, method and function may
be created, set or destroyed anywhere in the program, via assignments
of the form:

    a.b.c.d = x

where x is an arbitrary expression.  This is assignment to **d's**
context, not a's.  Naturally, it depends on the type(s) of a in the
*present* context, the type of b in a's context, etc.  This kind of
assignment is the reason why type computations can be so expensive:
the computation is iterative, and when each new type assigned to a var
splits the computation into more computations.

Practically, however, we expect such assignments to be rare: most
assignments are, and should be, local assignments.  Furthermore, non-
local assignments are obvious candidates for type assertions.  Lint
can check these assertions easily (by induction), and they reduce
splitting by fiat.

Edward

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/leo-editor?hl=en.

OT: Some thoughts about lint-like research

Reply via email to