Hello, as you might already know, I'm the primary author of libsoldout and of its integration into fossil to perform Markdown-to-HTML conversion.
If you followed recent news, you might have heard of CommonMark[1], which is an attempt to unify most implementations and extensions of Markdown by providing a non-ambiguous specification. It's an honorable goal, so it makes sense to try to converge existing implementations towards the new standard. Unfortunately, the architecture of the parser makes it extremely difficult to implement CommonMark, probably even more difficult than writing a new parser from scratch. In the rest of this e-mail I will detail why I think so, in case some of the brilliant minds here find a mistake in my reasoning and a way to implement CommonMark easily in fossil. In case I'm not wrong, it raises the question of changing the markdown engine integrated in fossil, or purposefully forsaking CommonMark support (which might make sense if its adoption ends up not as wide as its authors hope). Fortunately, there is no rush to make such a decision; as a community we can reasonably wait and see how CommonMark adoption pans out.

[1]: http://commonmark.org/

The heart of the architecture is an online parser: the input is treated as an infinite stream of characters, and each component of the parser either consumes input characters or hands over control to another component, with control transfers arranged so that there is no loop without input consumption. The main advantage of such an architecture is how easy it is to prove that it actually terminates, and to prove upper bounds on memory usage. When components are loosely coupled, which is the case here, it also makes debugging much easier. The main drawback is that no backtracking is possible without cheating, and only very limited look-ahead is available without severely tightening the coupling between components. Moreover, when designing the parser, I enforced very loose coupling between components by requiring that every language element can be individually added to or removed from the parser.
The reason for that is that full Markdown is extremely powerful, especially because of its raw HTML input features. That is too powerful for untrusted input, like blog comments or wiki pages, so "unsafe" features have to be optional. But there are different levels of "unsafety": for example, one might want to forbid titles in blog comments, to prevent untrusted users from messing with the page layout; or one might want to forbid all links for more-untrusted users while allowing them for not-so-untrusted users. So it seemed better to engineer the parser around making it possible to allow or forbid any combination of features.

The online-parser loop design thus means that any active character must have its semantics decided immediately, and the loose coupling means that other language elements cannot interfere in that decision. On the other hand, CommonMark seems to have certain ideas about parser architecture leaking into the specification. For example, the notion of precedence is directly at odds with the design described in the previous paragraph. Consider the following ambiguous Markdown code, which is example 239 of the current CommonMark specification:

    *foo`*`

When the leading star is encountered, my parser has to scan for the closing star, and it must do so without considering the backtick, since code spans might very well have been disabled. So my parser processes the input as an emphasis that happens to contain a backtick. Meanwhile, CommonMark prescribes code spans as having a higher precedence than emphasis, so the example should be parsed as a code span that happens to contain a star.

As you can imagine, this is not an isolated example, otherwise working around it or cheating would have been easy. Most of the span-level examples and specifications actually involve the more general rule of "leaf" span elements taking precedence over "container" span elements. (Which again is fine by itself; I have nothing against it, it is just poorly compatible with my existing design.)
The precedence of fenced code blocks over reference declarations raises a similar problem, although to a smaller extent. I admit I haven't yet looked deeply into the subtleties of block-level language elements, but even if everything went as well as possible in that area, the parser would still look ridiculous against the test suite without tremendous work.

I will do my best to answer any questions or comments, but because of various issues, I might need up to a few days to post answers.

Thanks for your attention,
Natacha
_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users