Re: [Readable-discuss] Proposal: a concrete "pre-processor" [Implementation detail]

Alan Manuel Gloria Sat, 26 Jan 2013 16:14:11 -0800

On Sun, Jan 27, 2013 at 12:16 AM, David A. Wheeler
<dwhee...@dwheeler.com> wrote:
>> everything uses the same
>> parser calling protocol, which of course can be based on Monads (^^);
>
> Hmm, I'm concerned that discussing Monads will send 1/4 of our audience
> running to the hills :-).


Monads: keeping readable Lisp obscure since the late 90's!

>
>
>> So using a separate tokenizer is clearer IMO.
>>
>> The only drawback is that we now need to use SAME.
>
> I don't think that's true.  The ANTLR implementation simply
> consumes same-indents and doesn't generate any tokens.
> As long as a tokenizer can just consume a character
> sequence, generate nothing, and then consume more characters
> to finally *get* a token, it seems to me it should be fine.
>
> Of course, I've been wrong before; if I *am* wrong, I'm curious as to why.

Mostly, it has to do with the reader having to consume only the data
it needs, and no more.

Consider this sequence:

define foo 1
define bar 2

Without SAME, that tokenizes to the sequence (n-expr define) (n-expr
foo) (n-expr 1) EOL (n-expr define) (n-expr bar) (n-expr 2) EOL EOF

The problem is that we need the parser to differentiate between that and this:

define foo 1
  meow

.. which has the sequence (n-expr define) (n-expr foo) (n-expr 1) EOL
INDENT (n-expr meow) EOL DEDENT EOF

After head consumes the EOL we look for either an INDENT (to enter
rest processing) or... nothing at all, in which case head is returned
verbatim.

But in the first case, the token after EOL is (n-expr define), which
has already been removed from the port by the time we check for INDENT.
The parser fails to find an INDENT and would somehow have to return the
whole (n-expr define) to the port, which can't be done because it is
several characters long.

If instead we remove the EOL token and put in a SAME token, we can
define it-expr as:

it-expr : head SAME | head INDENT rest DEDENT ...

And our streams in the two cases are:

(n-expr define) (n-expr foo) (n-expr 1) SAME ...
(n-expr define) (n-expr foo) (n-expr 1) INDENT ...

And we would be able to differentiate the two immediately, without
having to return an already-consumed token to the port.
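
To make the one-token decision concrete, here is a minimal sketch in
Python (not Scheme, purely to keep it short).  Assumptions that are
mine and not part of the spec: tokens are either ("n-expr", text)
tuples or the strings "SAME", "INDENT", "DEDENT", "EOF"; the two
streams above are completed by hand; and DEDENT and EOF are accepted as
line terminators to stand in for the alternatives elided from the rule.
The token source is a plain iterator, so nothing can be pushed back.

```python
def parse_it_expr(tokens):
    """Parse one it-expr from an iterator; return (datum, terminator)."""
    head = []
    tok = next(tokens)
    while isinstance(tok, tuple):       # ("n-expr", text)
        head.append(tok[1])
        tok = next(tokens)
    if tok == "INDENT":                 # head INDENT rest DEDENT
        end = None
        while end != "DEDENT":
            child, end = parse_it_expr(tokens)
            head.append(child)
        tok = next(tokens)              # terminator of this it-expr
    # Otherwise tok is SAME (or DEDENT/EOF): head is returned as-is.
    datum = head[0] if len(head) == 1 else head
    return datum, tok


def read_all(tokens):
    """Read it-exprs until the terminator is EOF."""
    tokens = iter(tokens)
    data, end = [], None
    while end != "EOF":
        datum, end = parse_it_expr(tokens)
        data.append(datum)
    return data


def n(text):
    return ("n-expr", text)


same_case = [n("define"), n("foo"), n("1"), "SAME",
             n("define"), n("bar"), n("2"), "EOF"]
indent_case = [n("define"), n("foo"), n("1"), "INDENT",
               n("meow"), "DEDENT", "EOF"]

print(read_all(same_case))    # [['define', 'foo', '1'], ['define', 'bar', '2']]
print(read_all(indent_case))  # [['define', 'foo', '1', 'meow']]
```

The parser never looks past the token that ends the head, so it never
needs to un-read anything.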

>
>> A precis: the tokenizer is not a separate pass, but rather implemented
>> as a stateful procedure that will consume exactly one token on the
>> input stream.  This allows laziness, which allows us to leave as many
>> characters as possible on the port at any one time.  The ANTLR
>> architecture of having the tokenizer call a stateful parser procedure
>> is also possible, but I think it's easier to have the (more complex)
>> parser call the tokenizer than the reverse.
>
> (Nitpick: ANTLR calls a stateful lexer, not a parser.)
>
> Maybe.  Hard to know without comparing.

Hmm, the code structure looks like the lexer calls "emit()", so it
looks as if the parser is CPS'ed inside a stateful emit() function.
Compare to bison where the lexer returns a token type and token value,
so that the lexer is the stateful function.  Or are you using a
separate thread for the Java ANTLR lexer, with emit() being a channel
to the parser thread that calls a lexer function that just fetches
from the channel?

>
>> And I think we should also formalize the tokenizer, since behavior
>> like "comment-only lines are skipped" is NOT explicitly shown in the BNF.
>
> Okay!!   This is certainly sensible.  What worries me is that this is
> the kind of thing that was easy to fully describe in English, yet can be
> tricky to correctly formalize.  I wasn't trying to be snarky about my comment
> "look at the ANTLR process"; it turned out to take several tries before
> I got a clean and at-least-appears-to-be-correct implementation.
>
> Granted, the inability of the parser to influence the lexer in ANTLR made it
> a little more work; a different approach avoids that issue completely.
> E.G., a traditional recursive descent parser doesn't have that limitation at 
> all.
>
>> Formalizing the tokenizer also allows us to strip away the hspace's in
>> the t-expression parsing spec.
>
> I'm very leery of removing the hspace's from the parsing spec.
> SRFI-49 did that; the resulting BNF was certainly simpler, but it
> made it *much* more difficult to *correctly* implement the spec.
> If that all moves into the tokenizer, I'm concerned that
> it may not be obvious where it happens,
> especially for people implementing it using traditional
> recursive descent parsing approaches.
> I want people to be able to implement code that is "obviously correct";
> if the spec is rigged so that implementation is mostly 1-to-1 it'll
> be easier to accept.

hspace is significant in these cases:

1.  Indentation
2.  After abbreviation sequences ' ` , ,@ and their Scheme-only syntax variants.
3.  Must specifically be ignored after GROUP_SPLICE in order to handle
a top-level "foo bar \\ nitz kuu" sequence correctly.
4.  Must specifically be ignored after RESTART_BEGIN in order to
handle "let <* x v \\ y v2 *>" style.

These cases are explicitly handled by the tokenizer spec.  I've
generalized the third and fourth cases so that horizontal whitespace
is consumed after all special symbols.

(the tokenizer spec needs to be clearer BTW)
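
A rough sketch of that generalized rule, in Python for brevity.  The
helper names (peek, skip_hspace, read_special) and the token name
RESTART_END for "*>" are assumptions made for illustration, and the
sketch does not check that a delimiter follows the symbol.  The port is
an io.StringIO standing in for a real input port.

```python
import io

HSPACE = " \t"
# Spellings come from the examples in this thread; the name for "*>"
# is assumed.
SPECIAL = {"\\\\": "GROUP_SPLICE", "<*": "RESTART_BEGIN", "*>": "RESTART_END"}


def peek(port, count=1):
    """Look at the next `count` characters without consuming them."""
    pos = port.tell()
    text = port.read(count)
    port.seek(pos)
    return text


def skip_hspace(port):
    """Consume spaces and tabs, stopping at anything else or at EOF."""
    while True:
        ch = peek(port)
        if ch == "" or ch not in HSPACE:
            return
        port.read(1)


def read_special(port):
    """If a special symbol starts here, consume it and any hspace after
    it, and return its token name; otherwise consume nothing."""
    name = SPECIAL.get(peek(port, 2))
    if name is None:
        return None
    port.read(2)
    skip_hspace(port)
    return name


port = io.StringIO("\\\\   nitz kuu")
print(read_special(port))   # GROUP_SPLICE
print(port.read())          # nitz kuu
```

Leading whitespace that does not follow a special symbol is left on
the port, so indentation is still visible to whatever handles it.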

>
>> The overall sweet-reader specification is split into
>> three components:...
>
> Gotta run, family commitments, I'll take a real look later.
>
> In the end, though, I suspect having several implementation trials
> is a good thing.  If nothing else, it'll prove that the specification is
> easy-enough to implement several ways.

The tokenizer can also be specced (and implemented!!) as a
recursive-descent tokenizer that calls an (emit x) function; this
gives significant clarity, since we don't have to mention an
indent-stack - the Scheme stack serves that purpose.  In the
implementation, we just use call/cc in the emit function to suspend
execution.  I don't want to spec it that way since the readable
project wants implementability across multiple Lisps, and most Lisps
don't have call/cc.  Not even all "Scheme" implementations have a
call/cc, and many have an inefficient implementation (old Guile
versions for example).  But that style can be done, and be equivalent
to the current specifications.
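
For a feel of that style, here is a sketch in Python, where a generator
plays the role that call/cc would play in Scheme: each "yield" is the
emit, and it suspends the tokenizer until the parser asks for the next
token.  Everything here is an illustrative assumption and not the spec:
the names tokenize and _block, splitting on whitespace as a stand-in
for n-expr reading, space-only indentation, and reading the whole text
up front instead of pulling characters from a port.

```python
def _indent_of(line):
    return len(line) - len(line.lstrip(" "))


def _block(lines, pos, indent):
    """Emit tokens for the lines at exactly `indent`, recursing for
    deeper lines.  Returns the index of the first line not consumed."""
    while pos < len(lines) and _indent_of(lines[pos]) == indent:
        for word in lines[pos].split():
            yield ("n-expr", word)
        pos += 1
        if pos < len(lines) and _indent_of(lines[pos]) > indent:
            yield "INDENT"
            # The recursion depth is the indent stack.
            pos = yield from _block(lines, pos, _indent_of(lines[pos]))
            yield "DEDENT"
        if pos < len(lines) and _indent_of(lines[pos]) == indent:
            yield "SAME"
    return pos


def tokenize(text):
    """Lazily yield tokens for `text`, ending with EOF."""
    lines = [line for line in text.splitlines() if line.strip()]
    pos = yield from _block(lines, 0, 0)
    if pos < len(lines):
        raise ValueError("inconsistent indentation at line %d" % (pos + 1))
    yield "EOF"


for token in tokenize("define foo 1\n  meow\n"):
    print(token)
```

For that input the tokens come out as define, foo, 1, INDENT, meow,
DEDENT, EOF, one per request, with no explicit indent-stack anywhere.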

