Incremental parser (was: Backtick Hickup)

Allan Odgaard Mon, 13 Aug 2007 18:57:28 -0700

I forked the topic, since this is an (interesting) topic of its own, not really related to the interpretation of code-spans.


On Aug 13, 2007, at 10:20 AM, Michel Fortin wrote:

[...] I know most Markdown parsers do not follow conventional parser wisdom, but IMO this is also the interpretation that suits an incremental tokenizer/parser best compared to your interpretation [...]
[...]
There is a lot of look-aheads in Markdown:

Are you talking about the spec or implementations? I believe when it comes to the implementation it would be more correct to say a lot of “iteratively performing search and replacement on the entire document”.

emphasis won't be applied if asterisks or underscores can't be matched in pairs; links won't be links if there's no suitable parenthesis after the closing bracket, Setext-style headers need the line of hyphens or equal signs following its content, the parsing mode for list items depends on whether or not it contains a blank line, etc.

All but the style thing is limited (fixed size) look-ahead. This is not a problem. But those “look to the end of the document each time you see X” cases are a huge problem (for performance, and performance for Markdown is bad) -- if there was interest in addressing this, it could be done. Sure, we would have to mildly tweak a few rules, but those rules are not written down today anyway; they are just de facto rules based on the current implementation. This is why I jumped in at this back-tick thing: as I see it, we have a very unconventional parser (in markdown.pl, and ported by you to PHP) and we let the language be defined by how this parser ends up dealing with edge-cases (like pairing two back-ticks with three back-ticks). But setting that way of dealing with things as the standard is often just counter-productive to ever getting a “real” parser for Markdown.

There's no way to do a truly incremental parsing of Markdown... well, you could in a way, but you'd have to mutate many parts of the output document while parsing

I strongly disagree. TextMate does a very good job at syntax highlighting Markdown, and it is based 100% on an incremental parser -- in v2.0 there will be some new parser features which will allow for it to deal with 99% of all Markdown out there. Where it has problems is really in the edge-cases, but that is partly because these are undefined, and partly because when they come up, e.g. like this back-tick thing, they “get defined” in a bad way.


(and there you have my motivation for going into this thread)

(like HTML parsers do in browsers),


Say what?

or to delay the output of ambiguous parts until the end of the document; all this surely defeats the purpose of an incremental parser.

I think you misunderstand my use of the term. By incremental I mean that it scans the input document byte-by-byte (and creates tokens, from which a parse-tree is built), never going back to already scanned bytes. So this gives it a somewhat linear time complexity with a low constant factor.
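
As a rough illustration of such a scan (a minimal sketch, not TextMate's actual parser), the following Python generator walks the input once and emits tokens without ever revisiting a character. The function name `tokenize` and the token kinds `TICKS`, `TEXT` and `NEWLINE` are hypothetical, chosen only for this example, and only back-tick runs, newlines and plain text are recognised.

```python
def tokenize(text):
    """Yield (kind, value) tokens in a single left-to-right pass.

    The index i only ever moves forward, so the running time is
    linear in the length of the input.
    """
    i, n = 0, len(text)
    while i < n:
        start = i
        if text[i] == "`":
            # A run of back-ticks becomes one token; its length matters later.
            while i < n and text[i] == "`":
                i += 1
            yield ("TICKS", text[start:i])
        elif text[i] == "\n":
            i += 1
            yield ("NEWLINE", "\n")
        else:
            # Everything up to the next back-tick or newline is plain text.
            while i < n and text[i] not in "`\n":
                i += 1
            yield ("TEXT", text[start:i])
```

Because every character is consumed by exactly one token, the cost per character is constant, which is where the low constant factor comes from.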

I believe I have already mentioned it, but for reference, markdown.pl takes almost 40s to convert the TextMate manual into HTML, whereas TM 2’s parser, which parses the manual *exactly* the same, uses less than a quarter of a second.

I believe the Ruby parser (maku?) is also based on doing an incremental scan. I have not played with it yet, but I would think it also shows much better performance.

That said, another reason why I am focused on an incremental parser is because then we get closer to a formal grammar -- i.e. if we see token A we switch to state X, etc. rather than now where nested constructs are only made possible by dubious md5-transformations on subsets of the document.
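
To make the "token A switches to state X" idea concrete, here is a minimal sketch of a two-state machine for code spans. Everything in it is an assumption made for illustration: the name `parse_spans`, the `(kind, value)` token shape, and in particular the rule that a code span closes only on a back-tick run of the same length as its opener and that an unterminated opener falls back to literal text. That rule is one possible definition, not something the current implementations agree on.

```python
def parse_spans(tokens):
    """Group (kind, value) tokens into ("text", s) and ("code", s) nodes."""
    nodes = []
    state = "TEXT"   # current state of the machine
    opener = ""      # the back-tick run that opened the current code span
    buffer = []      # raw content collected while in the CODE state

    for kind, value in tokens:
        if state == "TEXT":
            if kind == "TICKS":
                # Token A seen: switch to state X.
                state, opener, buffer = "CODE", value, []
            else:
                nodes.append(("text", value))
        else:  # state == "CODE"
            if kind == "TICKS" and value == opener:
                # A matching run closes the span; switch back.
                nodes.append(("code", "".join(buffer)))
                state = "TEXT"
            else:
                # Anything else, including shorter or longer runs, is content.
                buffer.append(value)

    if state == "CODE":
        # Assumption: an unterminated opener is emitted as literal text.
        nodes.append(("text", opener + "".join(buffer)))
    return nodes
```

Under these assumed rules, a two-back-tick opener followed by a three-back-tick run does not close; the longer run is simply treated as content of the still-open span.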

The worst "look-ahead" (or most complex "mutations") would be for reference-style links which can have their definitions absolutely anywhere in the document. Interestingly, that's probably one of the most appreciated features of Markdown.

You may have misunderstood the incremental as giving incremental output -- that is not the case.

So basically you parse the document token-for-token. When you see e.g. [foo][bar] you create the link, but just keep the bar as its value -- when you see [bar]: http://domain.tld then you insert http://domain.tld in the symbol table for the bar symbol.

When it gets time to write out the document (i.e. the full input has been parsed / parse tree built) you just look up the links in the symbol table when they are written out.
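
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the names `parse` and `render`, the tuple-based tree, and the per-line regular expressions (used for brevity in place of a real tokenizer) are all hypothetical, labels are lower-cased on the assumption that they are matched case-insensitively, and no HTML escaping is done.

```python
import re

# Hypothetical, simplified patterns: "[label]: url" and "[text][label]".
REF_DEF = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$")
REF_LINK = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")

def parse(lines):
    """One forward pass: build the tree and fill the symbol table."""
    symbols = {}   # label -> URL
    tree = []      # ("text", s) or ("link", text, label, raw_source)
    for line in lines:
        definition = REF_DEF.match(line)
        if definition:
            symbols[definition.group(1).lower()] = definition.group(2)
            continue
        pos = 0
        for link in REF_LINK.finditer(line):
            if link.start() > pos:
                tree.append(("text", line[pos:link.start()]))
            # An empty second bracket pair reuses the link text as the label.
            label = (link.group(2) or link.group(1)).lower()
            tree.append(("link", link.group(1), label, link.group(0)))
            pos = link.end()
        if pos < len(line):
            tree.append(("text", line[pos:]))
        tree.append(("text", "\n"))
    return tree, symbols

def render(tree, symbols):
    """Write out the tree, resolving each link label only now."""
    out = []
    for node in tree:
        if node[0] == "text":
            out.append(node[1])
        else:
            _, text, label, raw = node
            url = symbols.get(label)
            # Assumption: an undefined label leaves the source text untouched.
            out.append(raw if url is None else f'<a href="{url}">{text}</a>')
    return "".join(out)
```

For example, parsing the lines `See [foo][bar] here.` and `[bar]: http://domain.tld` and then rendering gives a link to http://domain.tld, even though the definition appears after its use. The parse tree is never mutated; the lookup is simply deferred until output time.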


I.e. this is really no problem.



_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss
