Hello,

As you might already know, I'm the primary author of libsoldout and of
its integration into fossil to perform Markdown-to-HTML conversion.

If you have followed recent news, you might have heard of
CommonMark[1], an attempt to unify most implementations and extensions
of Markdown by providing an unambiguous specification. It's an
honorable goal, so it makes sense to try to converge existing
implementations towards the new standard.

Unfortunately, the architecture of the parser makes it extremely
difficult to implement CommonMark, probably even more difficult than
writing a new parser from scratch. In the rest of this e-mail I will
detail why I think so, in case some of the brilliant minds here find a
mistake in my reasoning and a way to implement CommonMark easily in
fossil.

If I'm not wrong, this raises the question of changing the markdown
engine integrated in fossil, or of purposefully forsaking CommonMark
support (which might make sense if its adoption ends up not as wide as
its authors hope). Fortunately, there is no rush to make such a
decision: as a community we can reasonably wait and see how CommonMark
adoption pans out.

[1]: http://commonmark.org/



The heart of the architecture is built around an online parser: the
input is treated as a potentially infinite stream of characters, and
each component of the parser either consumes input characters or hands
over control to another component, with control transfers arranged so
that there is no loop without input character consumption.
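
To give a rough idea of what that looks like in practice, here is a
sketch of that control flow, with names and details of my own
invention rather than actual libsoldout code: a single forward pass
over the input, a dispatch table of "active" characters, and the
invariant that every handler consumes at least one character before
returning.

    #include <stdio.h>
    #include <stddef.h>

    typedef size_t (*char_trigger)(const char *data, size_t size);

    static char_trigger triggers[256]; /* handler per active character */

    static void emit(const char *data, size_t size)
    {
        fwrite(data, 1, size, stdout);
    }

    /* Example handler: a backslash escape consumes one or two chars. */
    static size_t char_escape(const char *data, size_t size)
    {
        if (size < 2) {
            emit(data, 1);
            return 1;
        }
        emit(data + 1, 1);
        return 2;
    }

    static void parse_inline(const char *data, size_t size)
    {
        size_t i = 0, mark = 0;
        while (i < size) {
            unsigned char c = (unsigned char)data[i];
            if (!triggers[c]) {          /* ordinary character */
                i++;
                continue;
            }
            emit(data + mark, i - mark); /* flush pending plain text */
            i += triggers[c](data + i, size - i); /* always advances */
            mark = i;
        }
        emit(data + mark, i - mark);
    }

    int main(void)
    {
        const char input[] = "a \\* b";
        triggers['\\'] = char_escape;
        parse_inline(input, sizeof input - 1);
        putchar('\n');
        return 0;
    }

Note that characters are emitted as soon as they are consumed, so once
a handler has made its decision there is nothing left to revisit.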

The main advantage of such an architecture is how easy it is to prove
that it actually terminates, and to prove upper bounds on memory
usage. When components are loosely coupled, which is the case here, it
also makes debugging much easier.

The main drawback is that no backtracking is possible without
cheating, and only very limited look-ahead is possible without
severely tightening the coupling between components.

Moreover, when designing the parser, I enforced very loose coupling
between components by requiring that every language element can be
individually added to or removed from the parser. The reason is that
complete Markdown is extremely powerful, especially because of its raw
HTML input features. That's too powerful for untrusted input, like
blog comments or wiki pages. So "unsafe" features have to be optional.
But there are different levels of "unsafety": for example, one might
want to forbid titles in blog comments, to prevent untrusted users
from messing with the page layout. Or one might want to forbid all
links for more-untrusted users while allowing them for
not-so-untrusted users. So it seemed better to engineer the parser
around making it possible to allow or forbid any combination of
features.
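
As an illustration of that design (with made-up names and simplified
signatures, not the exact libsoldout interface), one can think of a
renderer structure holding one callback per language element, where a
null callback removes the element from the grammar altogether:

    #include <stdio.h>
    #include <stddef.h>

    /* One callback per language element; a null callback means the
     * element is simply not parsed for that parser instance. */
    struct renderer {
        void (*emphasis)(const char *text, size_t size); /* span level */
        void (*codespan)(const char *text, size_t size);
        void (*header)(const char *text, size_t size, int level);
        void (*raw_html)(const char *text, size_t size);
        /* ... one member per language element ... */
    };

    static void html_emphasis(const char *text, size_t size)
    {
        printf("<em>%.*s</em>", (int)size, text);
    }

    static void html_codespan(const char *text, size_t size)
    {
        printf("<code>%.*s</code>", (int)size, text);
    }

    /* Profile for untrusted blog comments: emphasis and code spans
     * only; titles and raw HTML are not even parsed. */
    static const struct renderer comment_renderer = {
        html_emphasis,
        html_codespan,
        NULL,  /* no titles, comments cannot disturb the page layout */
        NULL   /* no raw HTML, too powerful for untrusted input */
    };

A wiki page for trusted users would use another instance of the same
structure with more callbacks filled in; the parser itself never
assumes that any given feature is present.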

So the online-parser loop variant (no loop without consuming input)
means that any active character must have its semantics decided
immediately, and the loose coupling means that other language elements
cannot interfere in that decision.

On the other hand, CommonMark seems to have certain ideas about parser
architecture leaking into the specification. For example, the notion
of precedence is directly at odds with the design described in the
previous paragraphs.

Consider for example the following ambiguous Markdown code, which is
example 239 of the current CommonMark specification:
*foo`*`

When the leading star is encountered, my parser has to scan for the
closing star, and it does so without considering the backtick, since
code spans might very well have been disabled. So my parser processes
it as an emphasis that happens to contain a backtick.
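
Concretely, and continuing the sketch from earlier (still grossly
simplified compared to the real emphasis handling), the handler for
'*' scans forward for the closing star without paying any attention to
backticks:

    /* Hypothetical handler plugged into the loop sketched above:
     * look for the closing '*', ignoring backticks, since code
     * spans may not even be part of the grammar. */
    static size_t char_emphasis(const char *data, size_t size)
    {
        size_t end;
        for (end = 1; end < size && data[end] != '*'; end++)
            ;
        if (end >= size) {       /* no closing star: literal '*' */
            emit(data, 1);
            return 1;
        }
        printf("<em>");
        emit(data + 1, end - 1); /* body, backticks included */
        printf("</em>");
        return end + 1;          /* consume through the closing star */
    }

Registered as triggers['*'] = char_emphasis, this is roughly where the
emphasis-around-a-backtick reading comes from.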

Meanwhile, CommonMark prescribes that code spans have higher
precedence than emphasis, so the example should be parsed as a code
span that happens to contain a star.
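
To make the difference concrete, the two readings of that input come
out roughly as follows (my rendering, modulo details of paragraph
wrapping):

    my parser:   <p><em>foo`</em>`</p>
    CommonMark:  <p>*foo<code>*</code></p>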

As you can imagine, this isn't an isolated example, otherwise working
around it or cheating would have been easy. Most of the span-level
examples and rules actually involve the more general principle of
"leaf" span elements taking precedence over "container" span elements.
(Which again is fine by itself, I have nothing against it; it is just
poorly compatible with my existing design.)

The precedence of fenced code blocks over reference declarations
raises a similar problem, although to a lesser extent.
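
For instance (my own illustration, not an example taken from the
specification), consider:

  ```
  [foo]: /url
  ```

  [foo]

CommonMark requires the fence to win: the declaration is literal code
text, and the trailing [foo] stays plain text. A parser that decides
reference declarations without knowing whether fenced code blocks are
enabled (which, again, they might not be) would instead register the
declaration and turn [foo] into a link.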

I admit I haven't yet looked deeply into the subtleties of block-level
language elements, but even if everything went as well as possible in
that area, the parser would still look ridiculous on the test suite
without a tremendous amount of work.


I will do my best to answer any questions or comments, but because of
various issues, I might need up to a few days to post answers.


Thanks for your attention,
Natacha
