I forked the topic, since this is an (interesting) topic of its own, not really related to the interpretation of code-spans.

On Aug 13, 2007, at 10:20 AM, Michel Fortin wrote:

[...] I know most Markdown parsers do not follow conventional parser wisdom, but IMO this is also the interpretation that suits an incremental tokenizer/parser best compared to your interpretation [...]
[...]
There is a lot of look-aheads in Markdown:

Are you talking about the spec or implementations? I believe when it comes to the implementation it would be more correct to say a lot of “iteratively performing search and replacement on the entire document”.

emphasis won't be applied if asterisks or underscores can't be matched in pairs; links won't be links if there's no suitable parenthesis after the closing bracket, Setext-style headers need the line of hyphens or equal signs following its content, the parsing mode for list items depends on whether or not it contains a blank line, etc.

All but the style thing is limited (fixed size) look-ahead. This is not a problem. But those “look to the end of the document each time you see X” is a huge problem (for performance, and performance for Markdown is bad) -- if there was interest in addressing this, it could be done, sure we would have to mildly tweak a few rules, but those rules are anyway not written down today, they are just de facto rules based on the current implementation, this is why I jumped in at this back-tick thing, because as I see it, we have a very unconventional parser (in markdown.pl and ported by you to PHP) and we let the language be defined by how this parser ends up dealing with edge-cases (like pairing two back-ticks with three back-ticks). But often setting that way of dealing with things as the standard, is just counter-productive to ever getting a “real” parser for Markdown.

There's no way to do a truly incremental parsing of Markdown... well, you could in a way, but you'd have to mutate many parts of the output document while parsing

I strongly disagree. TextMate does a very good job at syntax highlighting Markdown, and it is based 100% on an incremental parser -- in v2.0 there will be some new parser features which will allow for it to deal with 99% of all Markdown out there. Where it has problems is really in the edge-cases, but that is partly because these are undefined, and partly because when they come up, e.g. like this back-tick thing, they “get defined” in a bad way.

(and there you have my motivation for going into this thread)

(like HTML parsers do in browsers),

Say what?

or to delay the output of ambigus parts until the end of the document; all this surely defeats the purpose of an incremental parser.

I think you misunderstand my use of the term. By incremental I mean that it scans the input document byte-by-byte (and creates tokens, from which a parse-tree is built), never going back to already scanned bytes. So this gives it a somewhat linear time complexity with a low constant factor.

I believe I have already mentioned it, but for reference, markdown.pl takes almost 40s to convert the TextMate manual into HTML where TM 2’s parser, which parses the manual *exactly* the same uses less than a quarter of a second.

I believe the ruby parser (maku?) is also based on doing an incremental scan, I have not played with it yet, but I would think it also shows much better performance.

That said, another reason why I am focused on an incremental parser is because then we get closer to a formal grammar -- i.e. if we see token A we switch to state X, etc. rather than now where nested constructs are only made possible by dubious md5-transformations on subsets of the document.

The worst "look-ahead" (or most complex "mutations") would be for reference-style links which can have their definitions absolutely anywhere in the document. Interestingly, that's probably one of the most appreciated features of Markdown.

You may have misunderstood the incremental as giving incremental output -- that is not the case.

So basically you parse the document token-for-token. When you see e.g. [foo][bar] you create the link, but just keep the bar as its value -- when you see [bar]: http://domain.tld then you insert http:// domain.tld in the symbol table for the bar symbol.

When it gets time to write out the document (i.e. the full input has been parsed / parse tree built) you just look-up the links in the symbol table when they are written out.

I.e. this is really no problem.



_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss

Reply via email to