ISSUE: Leading whitespace at start of expression reading
in I-expressions

I propose a new specific interpretation for leading whitespace in an
indented I-expression, which I'll call the "most consistent" format.
This differs from my previous proposal; I think this one's better.
Below is an explanation of the problem, and my proposed resolution.

Thoughts?  After fiddling with the alternatives, I'm getting very worried that
it'd be easy to type in text that would APPEAR to mean one thing, but would
ACTUALLY mean something else.  That's definitely something to avoid.
My "most consistent" proposal completely avoids that, without being quite
as strict as Python's "thou shalt start at the left edge".


First, the problem. What should be done if the start of an expression
(I'll call that start-expr) begins with whitespace that
is NOT followed by comments or NL or EOF? E.g.:
start-expr -> hspace+ (not eol...)

An example should make it clear. Imagine you read this (three lines,
all indented to the same level at the TOPMOST level):
   x
   y
   z

One interpretation is that there should be 3 different results: x, y, and z.
But consider how this would be read.
You'd read in the indentation before x, and note
that as the "topmost" indentation.  Then you'd read in the indentation
before y, notice that it was the same as x's, and stop just before reading
the "y" and return with just "x".
But wait - if you did that, when you read "y" you would think that there
was NO indentation (the previous read consumed it), and thus z
would be further indented... returning (y z).  Ooops, that can't be right.
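
To make the failure concrete, here is a minimal Python sketch of a stateless
reader that behaves as described above. It is an illustration only, not any
real I-expression reader: CharStream, read_indent, read_symbol, and naive_read
are hypothetical names, each line is assumed to hold a single symbol, and only
one level of nesting is handled.

```python
class CharStream:
    """A character source with one-character peek and no way to unget more."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def next(self):
        ch = self.peek()
        if ch:
            self.pos += 1
        return ch


def read_indent(s):
    """Consume leading spaces and return how many were read."""
    n = 0
    while s.peek() == " ":
        s.next()
        n += 1
    return n


def read_symbol(s):
    """Read the rest of the line as one symbol and consume the newline."""
    chars = []
    while s.peek() not in ("", "\n"):
        chars.append(s.next())
    if s.peek() == "\n":
        s.next()
    return "".join(chars)


def naive_read(s):
    """Read one expression, keeping no state between calls."""
    top = read_indent(s)          # indentation of this expression's first line
    if s.peek() == "":
        return None               # end of input
    head = read_symbol(s)
    children = []
    while s.peek() != "":
        indent = read_indent(s)   # consumed here, and it cannot be put back
        if indent <= top:
            break                 # next expression starts, but its indent is gone
        children.append(read_symbol(s))
    return [head] + children if children else head


stream = CharStream("   x\n   y\n   z\n")
first = naive_read(stream)    # "x"
second = naive_read(stream)   # ["y", "z"], i.e. (y z), not the intended "y"
```

The second call starts right at "y" with no indentation left to see, so it
treats z as a child of y.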

Since essentially the dawn of Lisp in the 1950s
there has been a "read" function that reads an S-expression
from the input and returns it.
This is an extremely stable function interface, and one not easily changed
in fundamental ways.
In particular, no user of "read" expects it to ALSO return some
state - such as the indentation that was read the LAST time read
was called - and certainly they aren't going to provide that information
back to "read" anyway.
Not only is this difficult to change for backwards-compatibility reasons,
it's not clear you should - simple interfaces are a good idea, if you can
get them, and adding such "indentation state" as a required parameter would
certainly complicate the interface.

In theory, you could "unget" all the indentation characters, so that the
next read would work correctly.
But the support for this is rare; for example,
Scheme doesn't even HAVE a standard unget character function, and
the Common Lisp standard only supports one character unget (not enough!).

You could store "hidden state" inside the read function.
Problem is, character-reading is not the exclusive domain of the read function;
many other functions read characters, and they are unlikely to look at this
hidden state.
These functions tend to be low-level functions and in some implementations
are difficult to override.
What's more, you would have to store hidden state for each possible input
source, and this can become insane in the many implementations that
support ports of non-files (such as from strings).
"Hidden state" could allow for all this, but the hideous complications of
IMPLEMENTING hidden state suggest that it'd be better to spec
something that does NOT require hidden state.




Possible solutions:

1. Simplest approach: Forbid it.  It's an error if an expression doesn't
start at the left edge.  Python does this.  You could argue that the spec requires
this, since there's no production that accepts an initial INDENT.
The example above would then be illegal.

But this is not very flexible; #2 appears to be a better option.

2. Most consistent: Allow indentation on the initial line (and consider that
the indentation for that expression), as long as all later lines have
a further indentation OR are on the left edge (including a blank line
ending in EOL or EOF, or a comment at the left edge).
This at least LETS you indent each expression if you like,
with NO risk of misinterpretation of later lines.
Typical use: if you want to indent everything, separate with blank lines.
Then you initialize the indent preprocessor, and have it do this:
start-expr -> INDENT expr DEDENT

The xyz example above would be illegal, and thus rejected.
However, if you inserted blank lines between x, y, and z, you'd be okay.

This is the most consistent and most flexible, and has no risk of
misinterpretation, so I propose this one.
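
As a rough illustration of the rule in #2, the following Python sketch groups
input lines into top-level expressions and rejects any line that is neither
indented further than the expression's first line nor at the left edge. It is
a sketch under stated assumptions, not a reader implementation:
split_top_level is a hypothetical name, indentation is counted in spaces only,
";" is assumed to start a comment, and the lines are grouped rather than
parsed.

```python
def split_top_level(text):
    """Group lines into top-level expressions under the "most consistent" rule.

    Raises ValueError for a line that could be misread.
    """
    groups = []      # finished expressions, each a list of source lines
    current = None   # lines of the expression being read, or None
    base = 0         # indentation of the current expression's first line

    for number, line in enumerate(text.splitlines(), start=1):
        body = line.lstrip(" ")
        indent = len(line) - len(body)
        # A blank line, or a comment at the left edge, ends the expression.
        ends_expr = body == "" or (indent == 0 and body.startswith(";"))

        if ends_expr:
            if current is not None:
                groups.append(current)
            current = None
        elif current is not None and indent > base:
            current.append(line)           # further indented: same expression
        elif current is not None and indent > 0:
            raise ValueError(
                f"line {number}: indentation {indent} is neither deeper "
                f"than {base} nor at the left edge"
            )
        else:
            # First line of an expression (possibly indented), or a line
            # at the left edge that starts the next expression.
            if current is not None:
                groups.append(current)
            current, base = [line], indent

    if current is not None:
        groups.append(current)
    return groups
```

With this sketch, the xyz example raises ValueError at its second line, while
the same three lines separated by blank lines come back as three separate
groups.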

3. Original implementation ignores hspace:
start-expr -> hspace+ start-expr
  $2

But when this is given the xyz example above, it will misleadingly
produce (x y z).  That kind of surprise seems undesirable, esp. given
that there is alternative #2.

4. Instead, could disable indent processing on initial hspace, to maximize
backwards-compatibility and simplify some command line use:
start-expr -> hspace+ s-expr
  $2

This would read in the "xyz" example as you would expect.  It would
also read in old text like this as it was originally intended:
  (define x 5) (define y 6)

However, other formats will be misinterpreted, e.g.,
  fact
    5
will be understood as:
  fact
  5
and not as (fact 5).

This is risky; on printouts, it might not at ALL be obvious when
expressions are indented like this - resulting in hard-to-debug code
and hidden defects.
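
The effect of #4 can be approximated with a small Python sketch that reads
every datum as a plain S-expression and ignores indentation entirely, which
is roughly what repeated reads would do if each one took the
"hspace+ s-expr" branch. This is an assumption-laden model, not the actual
reader: read_all_sexprs is a hypothetical name, atoms are kept as strings,
and the input is assumed to have balanced parentheses.

```python
import re

# A token is a single parenthesis or a run of non-space, non-paren characters.
TOKEN = re.compile(r"[()]|[^\s()]+")


def read_all_sexprs(text):
    """Read every datum in text as a plain S-expression; indentation is ignored."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(parse())
            pos += 1  # skip the closing ")"
            return items
        return tok

    result = []
    while pos < len(tokens):
        result.append(parse())
    return result


old_style = read_all_sexprs("  (define x 5) (define y 6)\n")
# [["define", "x", "5"], ["define", "y", "6"]]

indented = read_all_sexprs("  fact\n    5\n")
# ["fact", "5"]: two separate data, not (fact 5)
```

The old-style line reads as intended, but the indented "fact 5" silently
becomes two expressions, which is the risk described above.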


In general, I think it's much wiser to reject text that might
be very easily misinterpreted by the reader.  So I suggest #2.



--- David A. Wheeler 
