On Sat, Jan 26, 2013 at 4:43 AM, David A. Wheeler <dwhee...@dwheeler.com> wrote:
> By the way, there's no need to implement a separate tokenizer process to
> implement this in Scheme.
>
> I intend to implement the Scheme version in a way that quietly does it "at
> the same time", making the actual implementation really easy. In fact,
> although it's not obvious, the BNF is specifically "rigged" to be especially
> easy to implement using recursive descent. INDENT and DEDENT can be
> determined by just comparing the indent strings. The head/rest productions
> actually end up looking a lot like a tokenizer, but without needing a
> separate tokenizer. I had to use "modes" in ANTLR because the parser cannot
> communicate back to the lexer; in recursive descent that's a non-problem.
Actually, I think a tokenizer process will allow us to use a simple parser combinator library, which means we don't need to worry about the calling protocols of productions in Scheme: everything uses the same parser calling protocol, which of course can be based on Monads (^^). So using a separate tokenizer is clearer IMO. The only drawback is that we now need to use SAME.

A precis: the tokenizer is not a separate pass, but rather is implemented as a stateful procedure that consumes exactly one token's worth of characters from the input stream. This allows laziness, which lets us leave as many characters as possible on the port at any one time. The ANTLR architecture of having the tokenizer call a stateful parser procedure is also possible, but I think it's easier to have the (more complex) parser call the tokenizer than the reverse.

And I think we should also formalize the tokenizer, since behavior like "comment-only lines are skipped" is NOT explicitly shown in the BNF. Formalizing the tokenizer also allows us to strip away the hspace nonterminals in the t-expression parsing spec.

Here's a first cut:

<p>
The overall sweet-reader specification is split into three components:
</p>
<ol>
<li> <def>n-expression reader</def> - a standard Scheme parser with some extensions. </li>
<li> <def>tokenizer</def> - converts a stream of characters into a stream of tokens and tagged datums. </li>
<li> <def>t-expression reader</def> - an overall parser that accepts the stream generated by the tokenizer. </li>
</ol>

<h2>N-expression Reader</h2>
<p>
The n-expression reader is an ordinary Scheme reader, with the following capabilities:
</p>
<ol>
<li> It can return a parsed datum. </li>
<li> It can signal that it read a non-datum, such as a <code>#;</code> datum comment or a special reader directive like <code>#!fold-case</code>. </li>
<li> It can signal a parsing error. </li>
</ol>
<p>
The second capability above is important; typical Scheme readers can only either return a parsed datum or signal a parsing error. The tokenizer, however, needs to be able to detect datum and multiline comments.
</p>
<p>
In addition, the n-expression reader must support the following extensions to the Scheme reader:
</p>
<ol>
<li> SRFI-105 curly-infix notation - the n-expression reader shall always have this notation enabled. </li>
<li> <code>datum( ... )</code> => <code>(datum ...)</code> - there must not be a space between "<var>datum</var>" and the opening parenthesis. The "<var>datum</var>" can be any valid datum - symbol, number, string, list, etc. In particular, <code>f()()</code> <em>MUST</em> parse to <code>((f))</code>. </li>
<li> <code>datum[ ... ]</code> => <code>($bracket-apply$ datum ...)</code> - again, no space between "<var>datum</var>" and the opening bracket. </li>
<li> <code>datum{ ... }</code> => <code>(datum {...})</code> - again, no space between "<var>datum</var>" and the opening brace. </li>
</ol>
<p>
Essentially, this component is just a reader for n-expressions as described in SRFI-105, with the added capability of signaling that it parsed a comment not started by "<code>;</code>".
</p>
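(Aside, not part of the spec: to make the second capability concrete, here is a rough Scheme sketch of the calling convention I have in mind. read-n-expr, underlying-read, comment-object?, and emit are placeholder names, and the (n-expr <datum>) tagging matches what the tokenizer below expects.)

    ;; Sketch only: one possible return protocol for the n-expression reader.
    ;;   (n-expr <datum>)  - an ordinary datum was read
    ;;   (scomment)        - a #; datum comment, #| ... |# block comment, or
    ;;                       #!directive was read and should be skipped
    ;; A genuine syntax error is signaled as an ordinary read error.
    (define (read-n-expr port)
      (let ((result (underlying-read port)))    ; SRFI-105-extended read (assumed)
        (if (comment-object? result)            ; assumed predicate
            (list 'scomment)
            (list 'n-expr result))))

    ;; Expected results, given the extensions above:
    ;;   "{a + b}"   -> (n-expr (+ a b))
    ;;   "f(x y)"    -> (n-expr (f x y))
    ;;   "f()()"     -> (n-expr ((f)))
    ;;   "g[x]"      -> (n-expr ($bracket-apply$ g x))
    ;;   "#;(x) 5"   -> (scomment), then (n-expr 5) on the next call

    ;; The tokenizer can then dispatch on the tag without using exceptions:
    (define (handle-n-expr tagged emit)         ; emit = token sink (assumed)
      (if (eq? (car tagged) 'scomment)
          (emit 'SCOMMENT)
          (emit tagged)))                       ; datums stay tagged as (n-expr <datum>)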
<h2>Tokenizer</h2>
<p>
The tokenizer preprocesses the input character stream in order to find special symbols and to insert tokens that signal indentation. Its output is a stream of tokens. Tokens are either one of the tokens listed below, or a datum.
</p>
<ul>
<li>INITIAL_INDENT_NO_BANG</li>
<li>INITIAL_INDENT_WITH_BANG</li>
<li>INDENT</li>
<li>SAME</li>
<li>DEDENT</li>
<li>BADDENT</li>
<li>QUOTE_SPACE</li>
<li>QUASIQUOTE_SPACE</li>
<li>UNQUOTE_SPACE</li>
<li>UNQUOTE_SPLICING_SPACE</li>
<li>SYNTAX_SPACE</li>
<li>QUASISYNTAX_SPACE</li>
<li>UNSYNTAX_SPACE</li>
<li>UNSYNTAX_SPLICING_SPACE</li>
<li>SCOMMENT</li>
<li>GROUP_SPLICE</li>
<li>SUBLIST</li>
<li>RESTART_BEGIN</li>
<li>RESTART_END</li>
<li>VT</li>
<li>FF</li>
<li>EOF</li>
</ul>
<p>
(In the reference implementation, the above tokens are implemented as symbols, except for EOF, which is implemented using Scheme's end-of-file object. Datums are implemented as two-item lists of the form <code>(n-expr <em><datum></em>)</code>.)
</p>
<p>
The tokenizer has two main overarching states:
</p>
<ol>
<li> <def>initial indent state</def> - the starting state. In this state, the tokenizer skips over completely empty lines and completely <code>;</code>-comment lines. It enters the other state as soon as it finds a token to generate. The tokens INITIAL_INDENT_WITH_BANG and INITIAL_INDENT_NO_BANG can only be emitted in this state. </li>
<li> <def>indent processing state</def> - in this state, the tokenizer skips over completely <code>;</code>-comment lines. It enters the other state when it encounters a completely empty line (it enters that state before generating a token). </li>
</ol>
<p>
A <def>completely empty line</def> is a line composed of indentation whitespace only, followed by an end-of-line terminator. A <def>completely <code>;</code>-comment line</def> is a line composed of indentation whitespace only, followed by a <code>;</code>, followed by any number of non-end-of-line characters, followed by an end-of-line terminator. Crucially, the definition of <def>indentation whitespace</def> depends on the state: in the <em>initial indent state</em>, indentation whitespace is either the space or the tab character; in the <em>indent processing state</em>, indentation whitespace is the space, tab, or "<code>!</code>" character.
</p>
<p>
An <def>end-of-line terminator</def> is a newline, a carriage return, or the Windows carriage return-linefeed sequence, or, if Unicode is supported, a next line, line separator, or paragraph separator character.
</p>
<p>
In addition to the above main states, the tokenizer must keep track of the following:
</p>
<ul>
<li> Whether it is at the start of a line or not - initially, "start-of-line". </li>
<li> A stack of indentation strings - initially empty. </li>
</ul>
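(Aside, not part of the spec: a minimal sketch of this bookkeeping in Scheme, assuming R7RS define-record-type; the record and procedure names are placeholders.)

    ;; Sketch only: the state the tokenizer carries between calls.
    (define-record-type tokenizer-state
      (make-tokenizer-state mode start-of-line? indent-stack)
      tokenizer-state?
      (mode           state-mode           set-state-mode!)  ; 'initial-indent or 'indent-processing
      (start-of-line? state-start-of-line? set-state-start-of-line!)
      (indent-stack   state-indent-stack   set-state-indent-stack!))

    ;; Initially: initial indent state, at start of line, empty indent stack.
    (define (fresh-tokenizer-state)
      (make-tokenizer-state 'initial-indent #t '()))

    ;; Indentation whitespace depends on the state: "!" only counts in the
    ;; indent processing state.
    (define (indentation-char? state c)
      (or (char=? c #\space)
          (char=? c #\tab)
          (and (eq? (state-mode state) 'indent-processing)
               (char=? c #\!))))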
<p>
The tokenizer proceeds as follows:
</p>
<ol>
<li> Starting in the <em>initial indent state</em>, it skips over completely empty lines and completely <code>;</code>-comment lines.
  <ul>
  <li> If it finds at least one indentation character that is not followed by an end-of-line terminator or a "<code>;</code>" character, it has found an initial indent.
    <ul>
    <li> If it finds a "<code>!</code>" character after the indent, emit INITIAL_INDENT_WITH_BANG and end processing. </li>
    <li> Otherwise, emit INITIAL_INDENT_NO_BANG and end processing. </li>
    </ul>
  </li>
  <li> If it does not find any initial indent, push an empty string "" on the indent stack, set <em>not-start-of-line</em>, and enter the <em>indent processing state</em>. </li>
  </ul>
</li>
<li> In the <em>indent processing state</em>, skip over completely <code>;</code>-comment lines.
  <ul>
  <li> If you find a completely empty line, check the indent-stack's top. If the indent-stack's top is the empty string "", pop it off, emit SAME, and enter the <em>initial indent state</em>. Otherwise, pop off items until you find the empty string ""; for every non-empty item popped, emit DEDENT, and enter the <em>initial indent state</em> after you pop off the empty string "". </li>
  <li> If you find the end-of-file, check the indent-stack's top.
    <ul>
    <li> If the indent-stack's top is the empty string "", pop it off and emit SAME, then check whether the indent-stack is empty.
      <ul>
      <li> If the indent-stack is empty, emit EOF and end processing. </li>
      <li> Otherwise, emit RESTART_END, then check the indent-stack's top again. </li>
      </ul>
    </li>
    <li> While the indent-stack's top is not "", pop it off and emit DEDENT. When the indent-stack's top is "", pop it off and check whether the indent-stack is empty using the above procedure. </li>
    </ul>
  </li>
  <li> If start-of-line, take any number (0 or more) of indentation characters and put them in a string called the current-indentation. Sample (but don't pop!) the top of the indent-stack into a string called the top-indentation. Set not-start-of-line. Then (a Scheme sketch of this comparison follows the spec):
    <ul>
    <li> If the current-indentation is not indent-compatible with the top-indentation, emit BADDENT and end processing. Two indentations are <def>indent-compatible</def> if the shorter indentation is an initial substring of the longer indentation, or both are equal. The empty string is indent-compatible with all strings. </li>
    <li> If the current-indentation is longer than the top-indentation, push the current-indentation on the indent-stack and emit INDENT. </li>
    <li> If the current-indentation is equal to the top-indentation, emit SAME. </li>
    <li> While the current-indentation is shorter than the top-indentation, pop the top indentation off the stack and emit DEDENT. Repeat until the current-indentation is equal to the top-indentation. If the current-indentation is longer than the top-indentation after popping, emit BADDENT and end processing. </li>
    </ul>
  </li>
  <li> Check for abbreviations followed by at least one whitespace character (space, tab, newline, carriage return, and Unicode whitespace if applicable) or a <code>;</code>-comment. Consume the abbreviation and all horizontal whitespace (space, tab) and <code>;</code>-comments (not including the end-of-line terminator). The abbreviations, and the tokens to emit, are:
    <ul>
    <li><code>'</code> -> QUOTE_SPACE</li>
    <li><code>`</code> -> QUASIQUOTE_SPACE</li>
    <li><code>,</code> -> UNQUOTE_SPACE</li>
    <li><code>,@</code> -> UNQUOTE_SPLICING_SPACE</li>
    <li><code>#'</code> -> SYNTAX_SPACE</li>
    <li><code>#`</code> -> QUASISYNTAX_SPACE</li>
    <li><code>#,</code> -> UNSYNTAX_SPACE</li>
    <li><code>#,@</code> -> UNSYNTAX_SPLICING_SPACE</li>
    </ul>
  </li>
  <li> If an abbreviation is not followed by whitespace, consume the abbreviation, then invoke the n-expression reader until it yields a datum, then emit that datum wrapped in a list with the traditional meaning of the abbreviation. </li>
  <li> Consume all horizontal whitespace and <code>;</code>-comments (without the end-of-line terminator). </li>
  <li> If you find an FF or VT character, consume it and emit the corresponding token. </li>
  <li> If you find an end-of-line terminator, consume it and set start-of-line. </li>
  <li> If you find a <code>{</code>, <code>[</code>, or <code>(</code>, invoke the n-expression reader once. If it yields a datum, emit that datum; if it signals a comment, emit SCOMMENT. </li>
  <li> Invoke the n-expression reader once.
    <ul>
    <li> If it signals a comment, emit SCOMMENT. </li>
    <li> If it yields a <code>\\</code> symbol, consume all horizontal whitespace (space, tab), then emit GROUP_SPLICE. </li>
    <li> If it yields a <code>$</code> symbol, consume all horizontal whitespace (space, tab), then emit SUBLIST. </li>
    <li> If it yields a <code><*</code> symbol, consume all horizontal whitespace (space, tab), emit RESTART_BEGIN, and enter the <em>initial indent state</em>. </li>
    <li> If it yields a <code>*></code> symbol, check the indent-stack's top.
      <ul>
      <li> If the indent-stack's top is the empty string "", pop it off, emit SAME, and emit RESTART_END. </li>
      <li> Otherwise, pop off items until you find an empty string "". For every non-empty item popped off, emit DEDENT. After finding the "" and popping it off, emit RESTART_END. </li>
      </ul>
    </li>
    <li> Otherwise, emit the yielded datum. </li>
    </ul>
  </li>
  </ul>
</li>
</ol>
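(As promised in the start-of-line step above, here is a rough Scheme sketch of the indent comparison, mainly to show that comparing the indent strings really is all there is to it. The procedure names are placeholders, and the sketch returns the tokens to emit plus the updated stack rather than emitting them directly.)

    ;; Sketch only.  indent-compatible? follows the definition in the spec:
    ;; the shorter string must be an initial substring of the longer one.
    (define (indent-compatible? a b)
      (let ((la (string-length a)) (lb (string-length b)))
        (if (<= la lb)
            (string=? a (substring b 0 la))
            (string=? b (substring a 0 lb)))))

    ;; Compare the current indentation against the indent stack.
    ;; Returns two values: the list of tokens to emit, and the new stack.
    ;; Assumes the bottom of the stack is the "" pushed in the initial indent state.
    (define (indent-tokens current stack)
      (let ((top (car stack)))
        (cond
          ((not (indent-compatible? current top))
           (values '(BADDENT) stack))
          ((> (string-length current) (string-length top))
           (values '(INDENT) (cons current stack)))
          ((string=? current top)
           (values '(SAME) stack))
          (else
           ;; current is shorter: pop and emit DEDENT until we find a match,
           ;; or BADDENT if we land between two recorded levels.
           (let loop ((stack stack) (tokens '()))
             (let ((top (car stack)))
               (cond
                 ((string=? current top)
                  (values (reverse tokens) stack))
                 ((< (string-length current) (string-length top))
                  (loop (cdr stack) (cons 'DEDENT tokens)))
                 (else
                  (values (reverse (cons 'BADDENT tokens)) stack)))))))))

For example, with the stack ("  " "") (two spaces on top of the bottom ""), an incoming "    " yields (INDENT), "  " yields (SAME), "" yields (DEDENT), and a tab-based indent yields (BADDENT).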
---

Separating out the indentation processor as above allows us to describe it formally. Also, it lets us get away with a much simpler BNF for the t-expression reader (but we can't include the BNF for n-expressions, except as part of the n-expression reader):

1. No more hspace.
2. No more comment_eol and EOL.
3. The "comment-only lines are ignored" behavior is actually specified.

In addition, it conceptually allows us to use a more consistent parser implementation for the Scheme side: for instance, all productions can use some kind of "token stream" representation that has peek-token and read-token operations, and a consistent return protocol. That consistency lets us use parser combinators, with all the benefits that implies (the consistency itself reduces bugs, combinators applied to valid parsers yield valid parsers, etc.).

I'll see if I can hack together an implementation soon.

Sincerely,
AmkG