ANTLR / lexer (was: Backtick Hickup)

Allan Odgaard Mon, 03 Sep 2007 00:04:32 -0700

On Aug 27, 2007, at 11:02 PM, Eric Astor wrote:

[...]
Well - has anyone else looked into ANTLR 3.0 at all? The LL(*) grammar language it uses (an EBNF) allows for full backtracking support, and unspecified lookahead as far as necessary. It's fairly well-optimized, as I understand it, taking advantage of some of the packrat-parsing ideas to save handling a single text section repeatedly...

I am playing a bit with writing a Markdown parser now, since I have been involuntarily cut off from my regular project.


The main challenge is really the lexer. There are two problems here:

1. If we generate a token for all special characters, we end up having to deal with a lot of tokens in the parser (grammar). I don’t like this, so I am making the lexer slightly context aware. I don’t think this is really a problem: it is no different from having the lexer switch to another state when seeing e.g. string literals in a language where string literals themselves have a mini-grammar (escape codes), and thus benefit from their own lexer.

2. Block-environments do not have an end-marker per se; instead, each line participating in the environment is prefixed with something.
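The string-literal analogy in point 1 can be sketched as a lexer with two modes. This is only an illustration in Python rather than ANTLR; the mode names, token names, and patterns are all hypothetical and not taken from the parser being discussed.

```python
import re

# Hypothetical rules per lexer mode: (token name, pattern, mode to switch to).
MODES = {
    "default": [
        ("STRING_START", re.compile(r'"'), "string"),
        ("WS", re.compile(r"\s+"), None),
        ("WORD", re.compile(r'[^\s"]+'), None),
    ],
    "string": [
        ("ESCAPE", re.compile(r"\\."), None),        # the string's mini-grammar
        ("STRING_END", re.compile(r'"'), "default"),
        ("CHARS", re.compile(r'[^"\\]+'), None),
    ],
}

def lex_with_modes(text):
    """Return (kind, value) tokens, switching rule sets inside string literals."""
    mode, pos, tokens = "default", 0, []
    while pos < len(text):
        for kind, pattern, next_mode in MODES[mode]:
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                if next_mode:
                    mode = next_mode
                break
        else:
            # No rule of the current mode matched at this offset.
            raise ValueError(f"unexpected character at offset {pos}")
    return tokens
```

The parser then sees one ESCAPE token per escape code instead of separate backslash and character tokens, which is the kind of reduction point 1 is after.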

My solution for #2 so far is to make the LF token special in the way that it will encompass the leading prefix-stuff (from the next line). So effectively when the lexer sees ‘ > ’ then it outputs a QUOTE_START token and adds ‘ > ’ to a global stack (read by the rule for the LF token). When the LF token matches ‘\n’ it goes through this stack, and if there is a pattern which does not match (from the stack) it pops the stack until (and incl.) the current one, and outputs a «token»_STOP for each, and then outputs the LF token.
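A minimal sketch of the stack handling described above, written in Python rather than as ANTLR lexer rules. The regex for the quote prefix, the TEXT token, and the closing of open blocks at end of input are assumptions made for the illustration; only QUOTE_START, the «token»_STOP naming, and the LF behaviour come from the description.

```python
import re

# Hypothetical opener table: token name -> line-prefix pattern.
# The exact regex for the quote prefix is an assumption.
OPENERS = {"QUOTE": re.compile(r" ?> ?")}

def lex_blocks(text):
    """Return (kind, value) tokens; block prefixes are consumed by the LF rule."""
    tokens = []
    stack = []  # open block environments, outermost first: (name, pattern)
    for i, line in enumerate(text.split("\n")):
        pos = 0
        if i > 0:
            # LF rule: try to consume each open block's prefix on the next line.
            for depth, (name, pattern) in enumerate(stack):
                m = pattern.match(line, pos)
                if m is None:
                    # Pop the failing entry and everything nested inside it.
                    while len(stack) > depth:
                        tokens.append((stack.pop()[0] + "_STOP", ""))
                    break
                pos = m.end()
            tokens.append(("LF", "\n"))
        # Ordinary lexing: new block openers first, then the rest as text.
        opened = True
        while opened:
            opened = False
            for name, pattern in OPENERS.items():
                m = pattern.match(line, pos)
                if m:
                    tokens.append((name + "_START", m.group()))
                    stack.append((name, pattern))
                    pos = m.end()
                    opened = True
                    break
        if pos < len(line):
            tokens.append(("TEXT", line[pos:]))
    # Assumption: blocks still open at end of input are closed there.
    while stack:
        tokens.append((stack.pop()[0] + "_STOP", ""))
    return tokens
```

For the input "> a\n> b\nc" this yields QUOTE_START, TEXT, LF, TEXT, QUOTE_STOP, LF, TEXT: the second ‘> ’ is swallowed by the LF handling, and QUOTE_STOP is emitted just before the LF that precedes the unquoted line.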

This approach means that generally we detect “end of block-level construct” one LF after it actually did end, so e.g.:


    * This is a list item

    A paragraph below it.

Becomes:

    <ul><li>This is a list item
    </li></ul>
    <p>A paragraph below it.</p>

This is because the empty line (included in the list item) could actually have been part of the list item; we do not know that before we see the paragraph. In general though this shouldn’t matter (except for stuff in <pre>) so I am not sure it is worth addressing -- though a simple pattern-based re-ordering of tokens could fix it, or maybe I can address it in the parser (grammar).
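One possible reading of the pattern-based re-ordering, sketched in Python: every «token»_STOP is moved in front of the LF tokens that immediately precede it, so the block closes where it actually ended. The function name and the (kind, value) token shape are hypothetical, and the rule is deliberately naive: it would also hoist a STOP out of a <pre> block whose trailing newline is meant to be content.

```python
def hoist_stops(tokens):
    """Move each *_STOP token before any LF tokens directly preceding it."""
    out = []
    for kind, value in tokens:
        if kind.endswith("_STOP"):
            # Walk back over the trailing run of LF tokens.
            i = len(out)
            while i > 0 and out[i - 1][0] == "LF":
                i -= 1
            out.insert(i, (kind, value))
        else:
            out.append((kind, value))
    return out
```

Applied to LIST_START, TEXT, LF, LIST_STOP, LF, TEXT this gives LIST_START, TEXT, LIST_STOP, LF, LF, TEXT, so the closing tags would land right after the list item text rather than on the following line.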


_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss
