On Sat, Jan 26, 2013 at 4:43 AM, David A. Wheeler <dwhee...@dwheeler.com> wrote:
> By the way, there's no need to implement a separate tokenizer process to 
> implement this in Scheme.
>
> I intend to implement the Scheme version in a way that quietly does it "at 
> the same time", making the actual implementation really easy.  In fact, 
> although it's not obvious, the BNF is specifically "rigged" to be especially 
> easy to implement using recursive descent.  INDENT and DEDENT can be 
> determined by just comparing the indent strings. The head/rest productions 
> actually end up looking a lot like a tokenizer, but without needing a 
> separate tokenizer.  I had to use "modes" in ANTLR because the parser cannot 
> communicate back to the lexer; in recursive descent that's a non-problem.

Actually, I think a tokenizer process will allow us to use a simple
parser combinator library, which means we don't need to worry about
the calling protocols of individual productions on the Scheme side:
everything uses the same parser calling protocol, which of course can
be based on Monads (^^);

So using a separate tokenizer is clearer IMO.

The only drawback is that we now need to use SAME.

A précis: the tokenizer is not a separate pass, but is implemented
as a stateful procedure that consumes exactly one token from the
input stream per call.  This allows laziness, which lets us leave as many
characters as possible on the port at any one time.  The ANTLR
architecture of having the tokenizer call a stateful parser procedure
is also possible, but I think it's easier to have the (more complex)
parser call the tokenizer than the reverse.
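
To make the intended shape concrete, here is a minimal sketch (the
names, like make-toy-tokenizer, are placeholders, not the real thing):
a procedure closed over the port; each call consumes exactly the
characters of one token and returns it, so everything else stays on
the port:

  ; Sketch only: a stateful "next token" procedure closed over PORT.
  ; Each call consumes exactly one token's worth of characters.
  (define (make-toy-tokenizer port)
    (lambda ()
      (let ((c (peek-char port)))
        (cond ((eof-object? c) (read-char port))   ; EOF token
              ((char=? c #\newline)
               (read-char port) 'SAME)             ; placeholder token
              (else (read port))))))               ; a datum

  ; The parser just calls it whenever it wants the next token:
  ; (define next-token (make-toy-tokenizer (current-input-port)))
  ; (next-token) etc.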

And I think we should also formalize the tokenizer, since behavior
like "comment-only lines are skipped" is NOT explicitly shown in the
BNF.

Formalizing the tokenizer also allows us to strip away the hspaces in
the t-expression parsing spec.

Here's a first cut:

<p>
The overall sweet-reader specification is split into
three components:
</p>

<ol>
<li>
<def>n-expression reader</def> -
a standard Scheme parser with some extensions.
</li>
<li>
<def>tokenizer</def> -
convert a stream of characters into
a stream of tokens and tagged datums.
</li>
<li>
<def>t-expression reader</def> -
an overall parser that accepts
the tokenizer's generated stream
</li>
</ol>

<h2>N-expression Reader</h2>

<p>
The n-expression reader is
an ordinary Scheme reader,
with the following capabilities:
</p>

<ol>
<li>
It can return a parsed datum.
</li>
<li>
It can signal that it has read a non-datum,
such as a <code>#;</code> datum comment
or a special <code>#!fold-case</code> reader directive.
</li>
<li>
It can signal a parsing error.
</li>
</ol>

<p>
The second capability above is important;
typical Scheme readers can only either
return a parsed datum
or signal a parsing error.
However, the tokenizer needs
to be able to detect
datum comments and multiline comments.
</p>
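
<p>
Purely as an illustration of the interface
(the concrete representation is an implementation detail),
the reader could return a tagged result;
the stand-in below, built on an ordinary Scheme reader,
can only produce the datum case:
</p>

<pre>
; Hypothetical interface (names are placeholders):
;   (cons 'datum x)   - a complete datum x was read
;   'comment          - a non-datum (e.g. a #; comment) was consumed
; A parse error is signalled through the usual error mechanism.
; Trivial stand-in over the ordinary reader (datum case only):
(define (read-n-expr port)
  (let ((x (read port)))
    (if (eof-object? x)
        x
        (cons 'datum x))))
</pre>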

<p>
In addition, the n-expression reader
must support the following extensions
to the Scheme reader:
</p>

<ol>
<li>
SRFI-105 Curly-infix notation -
the n-expression reader shall always
have this notation enabled.
</li>
<li>
<code>datum( ...)</code> =>
<code>(datum ...)</code> -
There must not be a space between
"<var>datum</var>"
and the opening parenthesis.
The "<var>datum</var>" can be
any valid datum -
symbol, number, string, list, etc.
In particular, <code>f()()</code>
<em>MUST</em> parse to
<code>((f))</code>.
</li>
<li>
<code>datum[ ... ]</code> =>
<code>($bracket-apply$ datum ...)</code> -
Again, no space between "<var>datum</var>"
and the opening bracket.
</li>
<li>
<code>datum{ ... }</code> =>
<code>(datum {...})</code>
</li>
</ol>
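
<p>
For concreteness, a few sample reads
under the extensions above:
</p>

<pre>
f(x y)     reads as  (f x y)
f()()      reads as  ((f))
vec[i]     reads as  ($bracket-apply$ vec i)
f{a + b}   reads as  (f (+ a b))
</pre>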

<p>
Essentially,
this component is just
a reader for n-expressions as described
in SRFI-105,
with the added capability
of signalling that it has parsed a comment
that is not started by "<code>;</code>".

<h2>Tokenizer</h2>

<p>
The tokenizer preprocesses the input character stream
in order to find special symbols
and insert tokens to signal indentation.
Its output is a stream of tokens.
Tokens are either one of the tokens
listed below,
or a datum.
</p>

<ul>
<li>INITIAL_INDENT_NO_BANG</li>
<li>INITIAL_INDENT_WITH_BANG</li>
<li>INDENT</li>
<li>SAME</li>
<li>DEDENT</li>
<li>BADDENT</li>
<li>QUOTE_SPACE</li>
<li>QUASIQUOTE_SPACE</li>
<li>UNQUOTE_SPACE</li>
<li>UNQUOTE_SPLICING_SPACE</li>
<li>SYNTAX_SPACE</li>
<li>QUASISYNTAX_SPACE</li>
<li>UNSYNTAX_SPACE</li>
<li>UNSYNTAX_SPLICING_SPACE</li>
<li>SCOMMENT</li>
<li>GROUP_SPLICE</li>
<li>SUBLIST</li>
<li>RESTART_BEGIN</li>
<li>RESTART_END</li>
<li>VT</li>
<li>FF</li>
<li>EOF</li>
</ul>

<p>
(In the reference implementation,
the above tokens are implemented
as symbols,
except for EOF,
which is implemented
using Scheme's end-of-file object.
Datums are implemented
as two-item lists
of the form
<code>(n-expr <em>&lt;datum&gt;</em>)</code>.)
</p>
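
<p>
For example, a fragment of the token stream
might contain values like these
(consistent with the representation just described;
the datum shown is arbitrary):
</p>

<pre>
'INDENT                    ; structural tokens are plain symbols
'SAME
(eof-object)               ; EOF (using the R7RS constructor here)
(list 'n-expr '(f x y))    ; a datum, tagged as an n-expr
</pre>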

<p>
The tokenizer has two main overarching states:
</p>

<ol>
<li>
<def>initial indent state</def> -
the starting state.
In this state, the tokenizer skips over
completely empty lines
and completely <code>;</code>-comment lines.
It enters the other state
as soon as it finds
a token to generate.
The tokens
INITIAL_INDENT_WITH_BANG
and
INITIAL_INDENT_NO_BANG
can only be emitted in this state.
</li>
<li>
<def>indent processing state</def> -
In this state, the tokenizer skips over
completely <code>;</code>-comment lines.
It enters the other state when it encounters
a completely empty line
(it enters that state before generating a token).
</li>
</ol>

<p>
A <def>completely empty line</def>
is a line composed of indentation whitespace only,
followed by an end-of-line terminator.
A <def>completely <code>;</code>-comment line</def>
is a line composed of indentation whitespace only,
followed by a <code>;</code>,
followed by any number of non-end-of-line characters,
followed by an end-of-line terminator.
Crucially, the definition of
<def>indentation whitespace</def>
depends on the state:
if in the <em>initial indent state</em>,
indentation whitespace is either the space or tab character.
If in the <em>indent processing state</em>,
indentation whitespace is either the space, tab, or "<code>!</code>" character.
</p>

<p>
An <def>end-of-line terminator</def>
is a newline, carriage return, or
the Windows carriage return-linefeed sequence,
or, if Unicode is supported,
a next line, line separator, or paragraph separator character.
</p>
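
<p>
A sketch of these character classes
(Unicode cases omitted, and the CRLF sequence handled
as two characters; names are illustrative, not normative):
</p>

<pre>
(define (end-of-line-char? c)
  (memv c '(#\newline #\return)))

(define (indentation-whitespace? c in-initial-indent-state?)
  (or (char=? c #\space)
      (char=? c #\tab)
      (and (not in-initial-indent-state?)
           (char=? c #\!))))
</pre>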

<p>
In addition to the above main states,
the tokenizer must keep track of the following:
</p>

<ul>
<li>
Whether it is at the start of a line or not -
initially, "start-of-line".
</li>
<li>
A stack of indentation strings -
initially empty.
</li>
</ul>
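
<p>
One possible way to package this mutable state,
together with the main state (mode),
is a record; this is only a sketch using SRFI-9,
and the reference implementation may represent it differently:
</p>

<pre>
(define-record-type tokenizer-state
  (make-tokenizer-state mode start-of-line? indent-stack)
  tokenizer-state?
  (mode           state-mode           set-state-mode!)
  (start-of-line? state-start-of-line? set-state-start-of-line!)
  (indent-stack   state-indent-stack   set-state-indent-stack!))

; Initial values per the description above:
; mode is 'initial-indent, we are at start-of-line,
; and the indent stack is empty.
(define (fresh-tokenizer-state)
  (make-tokenizer-state 'initial-indent #t '()))
</pre>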

<p>
The tokenizer proceeds as follows:
</p>

<ol>
<li>
Starting in the <em>initial indent state</em>,
it skips over completely empty lines and
completely <code>;</code>-comment lines.
<ul>
<li>
If it finds at least one indentation character
that is not followed by a end-of-line terminator
or a "<code>;</code>" character,
it has found an initial indent.
<ul>
<li>
If it finds a "<code>!</code>" character after
the indent,
emit INITIAL_INDENT_WITH_BANG
and end processing.
</li>
<li>
Otherwise, emit INITIAL_INDENT_NO_BANG
and end processing.
</li>
</ul>
</li>
<li>
If it does not find any initial indent,
push the empty string "" on the indent stack,
set <em>not-start-of-line</em>,
and enter the <em>indent processing state</em>.
</li>
</ul>
</li>
<li>
In the <em>indent processing state</em>,
skip over completely <code>;</code>-comment lines.
<ul>
<li>
If you find a completely empty line,
check the indent-stack's top.
If the indent-stack's top is the empty string "",
pop it off, emit SAME, and enter <em>initial indent state</em>.
Otherwise, pop off items
until you find the empty string "";
for every non-empty item, emit DEDENT,
and then enter <em>initial indent state</em>
after you pop off the empty string "".
</li>
<li>
If you find the end-of-file,
check the indent-stack's top.
<ul>
<li>
If the indent-stack's top is the empty string "",
pop it off and emit SAME,
then check if the indent-stack is empty.
<ul>
<li>
If the indent-stack is empty,
emit EOF and end processing.
</li>
<li>
Otherwise, emit RESTART_END,
then check the indent-stack's top again.
</li>
</ul>
</li>
<li>
While the indent-stack's top is not "",
pop it off and emit DEDENT.
When the indent-stack's top is "",
pop it off
and check if the indent-stack is empty
using the above procedure.
</li>
</ul>
</li>
<li>
If start-of-line,
take any number (0 or more) of indentation characters
and put them in a string called the current-indentation.
Sample (but don't pop!) the top of indent-stack
into a string called the top-indentation.
Set not-start-of-line.
Then:
<ul>
<li>
If the current-indentation is not indent-compatible
with top-indentation,
emit BADDENT and end processing.
Two indentations are <def>indent-compatible</def>
if the shorter indentation is an initial substring
of the longer indentation,
or both are equal.
The empty string is indent-compatible with all strings
(a small sketch of this check appears after this procedure).
</li>
<li>
If the current-indentation is longer than
the top-indentation,
push current-indentation on the indent-stack
and emit INDENT.
</li>
<li>
If the current-indentation is equal to
the top-indentation,
emit SAME.
</li>
<li>
While the current-indentation is shorter
than the top-indentation,
pop off the top indentation from the stack,
and emit DEDENT.
Repeat until the current-indentation is
equal to the top-indentation.
If the current-indentation is longer than
the top-indentation after popping,
emit BADDENT and end processing.
</li>
</ul>
</li>
<li>
Check for abbreviations followed by
at least one whitespace
(space, tab, newline, carriage-return,
and Unicode whitespaces if applicable)
or <code>;</code>-comments.
Consume the abbreviation
and all horizontal whitespace (space, tab)
and <code>;</code>-comments (not including end-of-line terminator).
The abbreviations, and the token to emit, are:
<ul>
<li><code>'</code> -> QUOTE_SPACE</li>
<li><code>`</code> -> QUASIQUOTE_SPACE</li>
<li><code>,</code> -> UNQUOTE_SPACE</li>
<li><code>,@</code> -> UNQUOTE_SPLICING_SPACE</li>
<li><code>#'</code> -> SYNTAX_SPACE</li>
<li><code>#`</code> -> QUASISYNTAX_SPACE</li>
<li><code>#,</code> -> UNSYNTAX_SPACE</li>
<li><code>#,@</code> -> UNSYNTAX_SPLICING_SPACE</li>
</ul>
</li>
<li>
If an abbreviation is not followed by whitespace,
consume the abbreviation,
then invoke the n-expression reader
until it yields a datum,
then emit that datum wrapped
in a list with the traditional meaning
for the abbreviation.
</li>
<li>
Consume all horizontal whitespace
and <code>;</code>-comments (without the end-of-line terminator).
</li>
<li>
If you find an FF or VT character,
consume it
and emit the corresponding token.
</li>
<li>
If you find an end-of-line terminator,
set start-of-line.
</li>
<li>
If you find a <code>{</code>, <code>[</code>, or <code>(</code>,
invoke the n-expression reader once.
If it yields a datum, emit that datum;
if it signals a comment, emit SCOMMENT.
</li>
<li>
Invoke the n-expression reader once.
<ul>
<li>If it signals a comment, emit SCOMMENT.</li>
<li>If it yields a <code>\\</code> symbol,
consume all horizontal whitespace (space, tab),
then emit GROUP_SPLICE.</li>
<li>If it yields a <code>$</code> symbol,
consume all horizontal whitespace (space, tab),
emit SUBLIST.</li>
<li>If it yields a <code>&lt;*</code> symbol,
consume all horizontal whitespace (space, tab),
emit RESTART_BEGIN,
and enter <em>initial indent state</em>.</li>
<li>If it yields a <code>*&gt;</code> symbol,
check the indent-stack's top.
<ul>
<li>
If the indent-stack's top is the empty string "",
pop it off,
emit SAME, and emit RESTART_END.
</li>
<li>
Otherwise,
pop it off until you find an empty "".
For every non-empty item popped off,
emit DEDENT.
After finding the "" and popping it off,
emit RESTART_END.
</li>
</ul>
</li>
<li>
Otherwise, emit the yielded datum.
</li>
</ul>
</li>
</ul>
</li>
</ol>
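
<p>
A sketch of the indent comparison used above
(names are illustrative; this is just the check and the DEDENT
computation, not the full token-emitting loop):
</p>

<pre>
; Is A a prefix of B?
(define (string-prefix? a b)
  (and (>= (string-length b) (string-length a))
       (string=? a (substring b 0 (string-length a)))))

; Indent-compatible: one is a prefix of the other (this covers
; equality, and the empty string is compatible with everything).
(define (indent-compatible? a b)
  (or (string-prefix? a b)
      (string-prefix? b a)))

; Tokens to emit when dedenting from STACK (a list of indent strings,
; top first) down to CURRENT: one DEDENT per popped entry, BADDENT
; if no entry matches CURRENT exactly.
(define (dedent-tokens current stack)
  (cond ((null? stack) (list 'BADDENT))
        ((string=? current (car stack)) '())
        ((> (string-length (car stack)) (string-length current))
         (cons 'DEDENT (dedent-tokens current (cdr stack))))
        (else (list 'BADDENT))))
</pre>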

---

Separating out the indentation processing like this allows us to describe it formally.

Also, it lets us get away with a much simpler BNF for t-expr (but we
can't include the BNF for n-expr, except as part of n-expr reader).

1.  No more hspace
2.  No more comment_eol and EOL.
3.  The "comment-only lines are ignored" behavior is specified.

--

In addition, conceptually it allows us to use a more consistent parser
implementation for the Scheme side: for instance, all productions can
use some kind of "token stream" representation that has a peek-token
and read-token operation, and a consistent return protocol.  This is
consistent enough that we can use parser combinators, with all the
benefits that implies (consistency reduces bugs, combinators on valid
parsers yield valid parsers, etc.).
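
Something like this, as a sketch of the interface only (not the final
code):

  ;; Wrap any "next token" thunk in a one-token-lookahead stream
  ;; exposing peek-token / read-token, which is all the parser
  ;; combinators need to see.
  (define (make-token-stream next-token)
    (let ((lookahead #f) (filled? #f))
      (define (peek-token)
        (if (not filled?)
            (begin (set! lookahead (next-token))
                   (set! filled? #t)))
        lookahead)
      (define (read-token)
        (let ((tok (peek-token)))
          (set! filled? #f)
          tok))
      (lambda (msg)                ; tiny message-passing dispatcher
        (case msg
          ((peek) (peek-token))
          ((read) (read-token))))))

  ;; Usage: given TOK, any thunk returning the next token,
  ;; (define ts (make-token-stream tok))
  ;; (ts 'peek) inspects the next token, (ts 'read) consumes it.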


I'll see if I can hack together an implementation soon.

Sincerely,
AmkG
