On Aug 27, 2007, at 11:02 PM, Eric Astor wrote:

[...]
Well - has anyone else looked into ANTLR 3.0 at all? The LL(*) grammar language it uses (an EBNF) allows for full backtracking support, and unspecified lookahead as far as necessary. It's fairly well-optimized, as I understand it, taking advantage of some of the packrat-parsing ideas to save handling a single text section repeatedly...

I am playing a bit with writing a Markdown parser now, since I have been involuntarily cut off from my regular project.

The main challenge is really the lexer. There are two problems here:

1. If we generate a token for every special character, the parser (grammar) ends up having to deal with a lot of tokens. I don’t like this, so I am making the lexer slightly context-aware. I don’t think this is really a problem: it is no different from having the lexer switch to another state when it sees, e.g., a string literal in a language where string literals have their own mini-grammar (escape codes) and thus benefit from their own lexer.

2. Block environments have no end marker per se; instead, each line participating in the environment is prefixed with something.
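To make point 1 concrete, here is a minimal Python sketch of a lexer that switches state, assuming inline code spans as the context in which special characters stop being special. The function name `lex_inline` and the token names (`TEXT`, `EMPH`, `CODE_START`, `CODE_STOP`) are made up for illustration; this is not the ANTLR lexer described in this post.

```python
import re

# Runs of ordinary characters in each state (hypothetical token set).
PLAIN_TEXT = re.compile(r"[^`*_]+")   # outside a code span
PLAIN_CODE = re.compile(r"[^`]+")     # inside a code span

def lex_inline(text):
    """Return (kind, value) tokens; '*' and '_' are only special outside code."""
    tokens = []
    state = "text"
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "`":
            # A backtick toggles between the two lexer states.
            if state == "text":
                tokens.append(("CODE_START", ch))
                state = "code"
            else:
                tokens.append(("CODE_STOP", ch))
                state = "text"
            pos += 1
        elif state == "text" and ch in "*_":
            tokens.append(("EMPH", ch))
            pos += 1
        else:
            # Swallow a whole run of ordinary characters as one TEXT token.
            pattern = PLAIN_TEXT if state == "text" else PLAIN_CODE
            m = pattern.match(text, pos)
            tokens.append(("TEXT", m.group()))
            pos = m.end()
    return tokens
```

With this, the `*` inside a code span arrives in the parser as part of a plain TEXT token, so the grammar never has to decide whether it means emphasis.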

My solution for #2 so far is to make the LF token special, in that it also encompasses the leading prefix (from the next line). So effectively, when the lexer sees ‘ > ’ it outputs a QUOTE_START token and pushes ‘ > ’ onto a global stack (read by the rule for the LF token). When the LF rule matches ‘\n’ it walks this stack, and if a pattern from the stack does not match, it pops the stack down to and including that pattern, outputs a «token»_STOP for each popped entry, and then outputs the LF token.
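The stack handling can be sketched in Python as follows. This is only an approximation under stated assumptions: it handles block quotes alone, works line by line rather than as a real lexer rule, and the names `lex_blocks` and `QUOTE` are hypothetical. The LF token's value includes whatever prefix it consumed from the next line.

```python
import re

QUOTE = re.compile(r" ?> ?")  # assumed prefix pattern for a block quote

def lex_blocks(text):
    """Return (kind, value) tokens; LF swallows the prefixes of open blocks."""
    tokens = []
    stack = []  # (name, pattern) for each open block, outermost first
    for i, line in enumerate(text.split("\n")):
        pos = 0
        if i > 0:
            # LF rule: match each open block's prefix against the next line.
            keep = 0
            for _name, pattern in stack:
                m = pattern.match(line, pos)
                if not m:
                    break
                pos = m.end()
                keep += 1
            # Pop down to and including the first pattern that failed.
            while len(stack) > keep:
                name, _pattern = stack.pop()
                tokens.append((name + "_STOP", ""))
            tokens.append(("LF", "\n" + line[:pos]))
        # Any further '>' on this line opens a new (possibly nested) quote.
        m = QUOTE.match(line, pos)
        while m:
            tokens.append(("QUOTE_START", m.group()))
            stack.append(("QUOTE", QUOTE))
            pos = m.end()
            m = QUOTE.match(line, pos)
        if pos < len(line):
            tokens.append(("TEXT", line[pos:]))
    # End of input closes whatever is still open.
    while stack:
        name, _pattern = stack.pop()
        tokens.append((name + "_STOP", ""))
    return tokens
```

For the input `"> a\n> b\nc"` the QUOTE_STOP token is emitted just before the second LF, i.e. only once the lexer has looked at the line that no longer carries the prefix.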

This approach means that we generally detect the end of a block-level construct one LF after it actually ended, so e.g.:

    * This is a list item

    A paragraph below it.

Becomes:

    <ul><li>This is a list item
    </li></ul>
    <p>A paragraph below it.</p>

This is because the empty line (included in the list item) could actually have been part of the list item; we do not know until we see the paragraph. In general, though, this shouldn’t matter (except for stuff in <pre>), so I am not sure it is worth addressing, though a simple pattern-based re-ordering of tokens could fix it, or maybe I can address it in the parser (grammar).
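One possible shape for such a re-ordering pass, as a rough Python sketch: hold back LF tokens until the next token is known, and let any «token»_STOP jump ahead of them. The function name `hoist_stops` and the token names in the usage note are assumptions, and the pass deliberately ignores the <pre> case mentioned above.

```python
def hoist_stops(tokens):
    """Move *_STOP tokens in front of the LF tokens that directly precede them."""
    out = []
    pending = []  # LF tokens held back until we know what follows them
    for tok in tokens:
        kind = tok[0]
        if kind == "LF":
            pending.append(tok)
        elif kind.endswith("_STOP"):
            # A closing token goes before any held-back LFs.
            out.append(tok)
        else:
            # Anything else confirms the LFs belong where they were.
            out.extend(pending)
            pending.clear()
            out.append(tok)
    out.extend(pending)
    return out
```

Applied to a stream like UL_START, LI_START, TEXT, LF, LI_STOP, UL_STOP, LF, TEXT, this yields UL_START, LI_START, TEXT, LI_STOP, UL_STOP, LF, LF, TEXT, so the list is closed before the blank line rather than after it.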

_______________________________________________
Markdown-Discuss mailing list
http://six.pairlist.net/mailman/listinfo/markdown-discuss
