I forked the topic, since this is an (interesting) topic of its own,
not really related to the interpretation of code-spans.
On Aug 13, 2007, at 10:20 AM, Michel Fortin wrote:
[...] I know most Markdown parsers do not follow conventional
parser wisdom, but IMO this is also the interpretation that suits
an incremental tokenizer/parser best compared to your
interpretation [...]
[...]
There are a lot of look-aheads in Markdown:
Are you talking about the spec or implementations? I believe when it
comes to the implementation it would be more correct to say a lot of
“iteratively performing search and replacement on the entire document”.
emphasis won't be applied if asterisks or underscores can't be
matched in pairs; links won't be links if there's no suitable
parenthesis after the closing bracket, Setext-style headers need
the line of hyphens or equal signs following its content, the
parsing mode for list items depends on whether or not it contains a
blank line, etc.
All but the style thing is limited (fixed size) look-ahead. This is
not a problem. But the “look to the end of the document each time
you see X” cases are a huge problem (for performance, and Markdown's
performance is bad). If there was interest in addressing this, it
could be done. Sure, we would have to mildly tweak a few rules, but
those rules are not written down today anyway; they are just de facto
rules based on the current implementation. This is why I jumped in at
this back-tick thing: as I see it, we have a very unconventional
parser (in markdown.pl and ported by you to PHP) and we let the
language be defined by how this parser ends up dealing with
edge-cases (like pairing two back-ticks with three back-ticks).
But setting that way of dealing with things as the standard is often
just counter-productive to ever getting a “real” parser for Markdown.
There's no way to do a truly incremental parsing of Markdown...
well, you could in a way, but you'd have to mutate many parts of
the output document while parsing
I strongly disagree. TextMate does a very good job at syntax
highlighting Markdown, and it is based 100% on an incremental parser
-- in v2.0 there will be some new parser features which will allow
for it to deal with 99% of all Markdown out there. Where it has
problems is really in the edge-cases, but that is partly because
these are undefined, and partly because when they come up, e.g. like
this back-tick thing, they “get defined” in a bad way.
(and there you have my motivation for going into this thread)
(like HTML parsers do in browsers),
Say what?
or to delay the output of ambiguous parts until the end of the
document; all this surely defeats the purpose of an incremental
parser.
I think you misunderstand my use of the term. By incremental I mean
that it scans the input document byte-by-byte (and creates tokens,
from which a parse-tree is built), never going back to already
scanned bytes. So this gives it a somewhat linear time complexity
with a low constant factor.
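
To make the term concrete, here is a minimal sketch of such a forward-only
scan, in Python. It is not TextMate's parser; the function name and the
token kinds (TICKS, STAR, TEXT) are made up for illustration. The point is
only that the index never moves backwards, so every input character is
visited once.

```python
def tokenize(text):
    """Split text into (kind, value) tokens in a single forward pass.

    Hypothetical token kinds: 'TICKS' (a run of back-ticks),
    'STAR' (one asterisk) and 'TEXT' (everything else).
    """
    tokens = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == "`":
            # Collect the whole run of back-ticks as one token.
            start = i
            while i < n and text[i] == "`":
                i += 1
            tokens.append(("TICKS", text[start:i]))
        elif ch == "*":
            tokens.append(("STAR", ch))
            i += 1
        else:
            # Plain text runs until the next special character.
            start = i
            while i < n and text[i] not in "`*":
                i += 1
            tokens.append(("TEXT", text[start:i]))
    return tokens


if __name__ == "__main__":
    for token in tokenize("a `b` *c*"):
        print(token)
```

Since i only ever increases, the work is proportional to the length of
the input, which is where the roughly linear running time comes from.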
I believe I have already mentioned it, but for reference, markdown.pl
takes almost 40s to convert the TextMate manual into HTML, whereas TM
2’s parser, which parses the manual *exactly* the same, uses less than
a quarter of a second.
I believe the Ruby parser (Maruku?) is also based on doing an
incremental scan, I have not played with it yet, but I would think it
also shows much better performance.
That said, another reason why I am focused on an incremental parser
is because then we get closer to a formal grammar -- i.e. if we see
token A we switch to state X, etc., rather than now, where nested
constructs are only made possible by dubious MD5 transformations on
subsets of the document.
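
As a rough illustration of "see token A, switch to state X", here is a
small Python sketch of a two-state machine for code spans. Everything in
it is an assumption made for the example: the token kinds, the function
name, the rule that a span is closed only by a back-tick run equal to the
opening run, and the choice to emit an unclosed opener literally. None of
that is meant as the settled Markdown rule.

```python
def render_code_spans(tokens):
    """Render code spans from (kind, value) tokens with a two-state machine.

    Assumed token kinds: 'TICKS' (a run of back-ticks) and 'TEXT'.
    States: 'TEXT' (outside a code span) and 'CODE' (inside one).
    """
    out = []
    state = "TEXT"
    opener = ""      # the back-tick run that opened the current span
    buffered = []    # content collected while in the CODE state
    for kind, value in tokens:
        if state == "TEXT":
            if kind == "TICKS":
                # Token A seen: switch to state CODE.
                state, opener, buffered = "CODE", value, []
            else:
                out.append(value)
        else:
            if kind == "TICKS" and value == opener:
                # Matching run (an assumption): close the span.
                out.append("<code>" + "".join(buffered) + "</code>")
                state = "TEXT"
            else:
                buffered.append(value)
    if state == "CODE":
        # Never closed (an assumption): write the opener back literally.
        out.append(opener + "".join(buffered))
    return "".join(out)


if __name__ == "__main__":
    tokens = [("TEXT", "use "), ("TICKS", "``"), ("TEXT", "a "),
              ("TICKS", "`"), ("TEXT", " b"), ("TICKS", "``"), ("TEXT", "!")]
    print(render_code_spans(tokens))
```

The input tokens are consumed strictly in order; the only thing held
back is the content of a span that is still open. The sketch does no
HTML escaping.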
The worst "look-ahead" (or most complex "mutations") would be for
reference-style links which can have their definitions absolutely
anywhere in the document. Interestingly, that's probably one of the
most appreciated features of Markdown.
You may have misunderstood “incremental” as meaning incremental
output -- that is not the case.
So basically you parse the document token-for-token. When you see
e.g. [foo][bar] you create the link, but just keep the bar as its
value -- when you see [bar]: http://domain.tld then you insert
http://domain.tld in the symbol table for the bar symbol.
When it gets time to write out the document (i.e. the full input has
been parsed / parse tree built) you just look-up the links in the
symbol table when they are written out.
I.e. this is really no problem.
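
A minimal Python sketch of that symbol-table approach follows. The
function names, the node tuples, the two regular expressions (used here
as a stand-in for a real tokenizer), the lower-casing of names, and the
literal fallback for an undefined reference are all assumptions made for
the example.

```python
import re

# Simplified patterns, assumed for this sketch only.
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$")
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def parse(lines):
    """One pass over the input: build nodes and fill the symbol table."""
    symbols = {}
    nodes = []
    for line in lines:
        m = DEFINITION.match(line)
        if m:
            # [bar]: http://domain.tld -> remember the URL for "bar".
            symbols[m.group(1).lower()] = m.group(2)
            continue
        pos = 0
        for ref in REFERENCE.finditer(line):
            if ref.start() > pos:
                nodes.append(("text", line[pos:ref.start()]))
            # Keep only the reference name; the URL may not be known yet.
            nodes.append(("link", ref.group(1), ref.group(2)))
            pos = ref.end()
        nodes.append(("text", line[pos:] + "\n"))
    return nodes, symbols


def render(nodes, symbols):
    """Write the document out, resolving links via the symbol table."""
    out = []
    for node in nodes:
        if node[0] == "text":
            out.append(node[1])
        else:
            _, label, name = node
            url = symbols.get(name.lower())
            if url is None:
                # Undefined reference (an assumption): emit it literally.
                out.append("[%s][%s]" % (label, name))
            else:
                out.append('<a href="%s">%s</a>' % (url, label))
    return "".join(out)


if __name__ == "__main__":
    nodes, symbols = parse(["See [foo][bar] here.",
                            "[bar]: http://domain.tld"])
    print(render(nodes, symbols), end="")
```

The definition can come before or after the link that uses it, since
nothing is resolved until the whole input has been parsed.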