Re: New direction, new energy

Edward K. Ream Sun, 18 Sep 2011 10:23:20 -0700

On Sep 16, 8:10 am, "Edward K. Ream" <[email protected]> wrote:
> I haven't been this excited about programming and design for a
> long time.  To paraphrase Hokusai, I am an old man, mad about
> programming.


Warning: this post consists mostly of nerdy details.  Feel free to
skip.

The energy continues.  If anything, it is increasing.  Perhaps the
increase has come from confronting doubts head on.

I had planned to present the doubts as a dialog between "me" and what
I think of as the "fool" (court jester).  However, I'd rather not do
that now. Instead, I'd like to present the results of the inner
dialog.  BTW, engaging the fool has quieted him.  And yes, it's
definitely a he.

c-to-python
---------------

Two days ago, I rewrote c2py as the c-to-python command.  I wanted to
do this for several reasons:

1.  As an aid to studying C programs.  C programs suck, at least
visually.  c-to-python, integrated into Leo, allows me to ignore the
cruft.  c-to-python may be useful later in modified form, but for now
it suffices as it is.

2. To revisit the old code, and see how my programming practice has
changed.

This was a ton of fun.  I made the following changes:

A. Converted the code to a class.  Much better packaging.

B. Replaced listToString(aList) by ''.join(aList) and eliminated
stringToList completely by converting a few arguments from None to ''.
This "little" change produced the following Aha...

C. [Important]  The big question I had was whether the code could have
been improved by using strings instead of lists.  The short answer is
"no": using lists will greatly reduce the load on the gc: only one
list actually gets created, and the actual changes to the list are
surprisingly few.

However, an excellent clarification appeared.  Some utilities never
modify their "aList" argument.  To highlight this fact, I introduced
the convention that s denotes any (immutable) *sequence*, rather than
a *string* as is usual in Leo's core.  With this convention in place,
the code looks like typical Leo code.  For example:

    def skip_past_line (self,s,i):

        while i < len(s) and s[i] != '\n':
            i += 1
        if i < len(s) and s[i] == '\n':
            i += 1
        return i

The fact that s is "really" a list is irrelevant.

This little change completely clarified the situation.  The code uses
"aList" when it might change aList, and uses "s" otherwise.  Imo, this
clarification is worth a lot of effort.
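To make the convention concrete, here is a small sketch.  It is not
code from c2py: skip_past_line is copied from above as a plain
function, and replace_tabs is a hypothetical helper invented only for
illustration.  The first takes "s" because it only reads; the second
takes "aList" because it changes the list in place.

```python
def skip_past_line(s, i):
    """Return the index just past the line containing s[i].

    s is any sequence of characters (a str or a list); it is never modified.
    """
    while i < len(s) and s[i] != '\n':
        i += 1
    if i < len(s) and s[i] == '\n':
        i += 1
    return i


def replace_tabs(aList):
    """Hypothetical helper: replace each tab in aList by a space, in place."""
    for i, ch in enumerate(aList):
        if ch == '\t':
            aList[i] = ' '


text = 'a\tb\nc'
aList = list(text)        # The one list that gets created.
replace_tabs(aList)       # Mutates aList; no new list or string is built.
# The read-only utility gives the same answer for the str and the list.
same = skip_past_line(text, 0) == skip_past_line(aList, 0)
result = ''.join(aList)   # Convert back to a string once, at the end.
```

The only conversions are list(text) at the start and ''.join(aList)
at the end; everything in between works on the same list.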

An important strategy
-----------------------------

Converting the old c2py script to a class was tedious work.
Afterwards, I resolved never to do any more tedious editing, revising
or formatting that could be done instead by a script.  Two such
scripts come immediately to mind: convert-functions-to-class and
convert-to-pep8 names.
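
As a rough idea of what the second script might involve, here is a
minimal sketch.  The function names to_pep8_name and convert_names are
my own inventions for illustration, not existing Leo commands, and the
regex-based renaming is an assumption about one simple way to do it.

```python
import re


def to_pep8_name(name):
    """Convert a camelCase identifier to lower_case_with_underscores."""
    # Put an underscore between a lowercase letter or digit and the
    # capital letter that follows it, then lowercase everything.
    s = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name)
    return s.lower()


def convert_names(source, names):
    """Rename each identifier in names wherever it appears as a whole word."""
    for name in names:
        pattern = r'\b%s\b' % re.escape(name)
        source = re.sub(pattern, to_pep8_name(name), source)
    return source


new_source = convert_names('s = listToString(aList)', ['listToString'])
```

This sketch renames blindly, including inside strings and comments.  A
real command would presumably want to work on tokens instead.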

To generalize a bit, part of the new energy comes from seeing new ways
to use scripts.

indent-c-code
------------------

After the initial work on c-to-python was complete, I realized that it
would not work on improperly-indented C code.  Such code abounds in
swig.

We are getting to an interesting question.  When is it reasonable to
rewrite "perfectly good" existing C code?  The answer is,
"surprisingly often, but not always."

It would make no sense at all to rewrite code that is a) "frozen" and
b) usable as is.  However, there are many cases where one or both do
not apply.

For example, the GNU project has indent.c, a program with a gazillion
options that will beautify (and reindent) C programs any way one could
possibly want.  However, it is not usable as a Leo command.  Last
night I created the beautify-c command in an hour or two.

First, however, I studied indent.c.  The main conclusions: a) the main
loop is insanely complex, b) beautify-c need only support *my*
preferences, provided it can also serve as the front end for
c-to-python, and c) tokenizing the input before dealing with it
probably makes sense.

So I turned my attention to a tokenizer for C programs.  I've done
this perhaps dozens of times.  However, last night I found the way
that it is written in "the book".  Here it is:

    def tokenize (self,s):

        '''Tokenize comments, strings, identifiers, whitespace and
        operators.'''

        i,n,result = 0,len(s),[]
        while i < n:
            # Loop invariant: j > i at end and s[i:j] is the new token.
            j = i
            ch = s[i]
            if g.match(s,i,'//'):
                j = g.skip_line(s,i)
            elif g.match(s,i,'/*'):
                j = self.skip_block_comment(s,i)
            elif g.match(s,i,'-->'):
                j = i + 3
            elif ch in "'\"":
                j = g.skip_string(s,i)
            elif ch.isalpha() or ch == '_':
                j = g.skip_c_id(s,i)
            elif ch in ' \t':
                j = g.skip_ws(s,i)
            elif ch in '@\n': # Always separate tokens.
                j += 1
            else:
                j += 1
                # Accumulate everything else.
                while (
                    j < n and
                    not s[j].isspace() and
                    not s[j].isalpha() and
                    # Stop at the start of a string, an identifier or
                    # a single-character token.
                    s[j] not in '"\'_@' and
                    not g.match(s,j,'//') and
                    not g.match(s,j,'/*') and
                    not g.match(s,j,'-->')
                ):
                    j += 1

            assert j > i
            result.append(''.join(s[i:j]))
            i = j # Advance.

        return result

Let's forget the "else" clause: it's a hack to accumulate operators.
But the pattern is supremely elegant.  I'll never forget it:

    def tokenize (self,s):

        '''Tokenize comments, strings, identifiers, whitespace and
        operators.'''

        i,n,result = 0,len(s),[]
        while i < n:
            # Loop invariant: j > i at end and s[i:j] is the new token.
            j = i
            ch = s[i]
            if <ch starts token a>:
                j = <skip token a>
            elif <ch starts token b>:
                j = <skip token b>
            ... and so on
            else:
                j += 1 # Create a one-character token.
            assert j > i
            result.append(''.join(s[i:j]))
            i = j # Advance.

        return result

This looks so simple.  Why didn't I ever see it before?  The reason is
that the tests:

    if <ch starts token whatever>

must be *complete* tests.  That is, it is *wrong* to try the usually-
natural optimization:

    if ch == '/':
        <generate a single-line comment, multi-line comment or C operator>

When I went down that path, all sorts of complications ensued and the
code got very ugly in a big hurry.

So the price of elegance and simplicity is the *repeated* tests for
'/' at the start of two cases and in the else clause.  But these extra
tests are completely meaningless as far as speed goes!
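
For anyone who wants to try the pattern outside Leo, here is a
self-contained sketch.  It is not the beautify-c tokenizer: it takes a
str rather than a list, skip_string is my own stand-in for Leo's g
helper, and the '-->' case and the operator-accumulating "else" clause
are left out.  Note that '/' is tested twice as part of a *complete*
test, and a lone '/' simply falls through to the one-character case.

```python
def skip_string(s, i):
    """Return the index just past the string or char literal starting at s[i]."""
    delim = s[i]
    j = i + 1
    while j < len(s):
        if s[j] == '\\':
            j += 2              # Skip the backslash and the escaped character.
        elif s[j] == delim:
            return j + 1
        else:
            j += 1
    return len(s)               # Unterminated literal: take the rest.


def tokenize(s):
    """Split C-like text into comments, strings, identifiers, whitespace
    and one-character tokens."""
    i, n, result = 0, len(s), []
    while i < n:
        # Loop invariant: j > i at end and s[i:j] is the new token.
        j = i
        ch = s[i]
        if s.startswith('//', i):       # Complete test: line comment.
            k = s.find('\n', i)
            j = n if k == -1 else k     # The newline is a separate token.
        elif s.startswith('/*', i):     # Complete test: block comment.
            k = s.find('*/', i + 2)
            j = n if k == -1 else k + 2
        elif ch in '"\'':
            j = skip_string(s, i)
        elif ch.isalpha() or ch == '_':
            j = i + 1
            while j < n and (s[j].isalnum() or s[j] == '_'):
                j += 1
        elif ch in ' \t':
            j = i + 1
            while j < n and s[j] in ' \t':
                j += 1
        else:
            j += 1                      # Create a one-character token.
        assert j > i
        result.append(s[i:j])
        i = j                           # Advance.
    return result


tokens = tokenize('a = b / c; // x/y\n')
```

Because every character lands in exactly one token, ''.join(tokens)
reproduces the input, which makes a handy sanity check.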

Ok, I now had a truly elegant C tokenizer.  It was then fairly
straightforward to do the actual indent code.
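
The indent code itself isn't shown in this post, so the following is
only a guess at the general shape, assuming indentation is driven
purely by brace depth.  The name indent_tokens and its tab argument
are hypothetical, and the input is assumed to be a list of tokens in
which whitespace, newlines, strings and comments are each a single
token.

```python
def indent_tokens(tokens, tab='    '):
    """Rebuild source text from tokens, indenting each line by brace depth."""
    result, level, at_line_start = [], 0, True
    for tok in tokens:
        is_ws = tok != '' and tok.strip(' \t') == ''
        if at_line_start and is_ws:
            continue                    # Discard the old indentation.
        if tok == '}':
            level = max(0, level - 1)   # Dedent before emitting the brace.
        if at_line_start and tok != '\n':
            result.append(tab * level)  # Emit the new indentation.
        result.append(tok)
        if tok == '{':
            level += 1
        at_line_start = (tok == '\n')
    return ''.join(result)


badly_indented = ['if', ' ', '(', 'x', ')', ' ', '{', '\n',
                  'y', ';', '\n',
                  '  ', '}', '\n']
pretty = indent_tokens(badly_indented)
```

Working on tokens pays off here: a brace inside a string or comment is
part of a larger token, so it never equals '{' or '}' and never
disturbs the depth count.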

That's enough for now.  I can hear the fool saying, "so what, why
should anyone care?"  I'll discuss the bigger picture in the next
reply.

Edward
